I have no idea on HMM, but I know H&M well:)
Intro
- Probability graph model The probability graph model (denoted as PGM) is a kind of probability model which uses a graph to represent the relationship between variables. generally, use one node to represent one or a group of random variables. the edge between nodes means the probability between variables.
- directed acyclic graph - Bayes network
- undirected graph - Markov Network
- Sequence labeling problem In the named entity recognition (denoted as NER), mostly, the more complex the label system is, the higher accuracy it has, and the longer training time it needs. Note: in this figure, I take Chinese as an example
- IO: when the label is I (inter), this means the token is an entity, otherwise is O (other), which means is not an entity; in this case, the label set contains 4 types of labels (I-PER, I-DATE, I-LOC and O).
- BIO: in order to differentiate two close entities (like dad and mom), we reckon the first words in the entity as B (begin); in this case, the label set contains 7 types of labels (3*2+1);
- BIOES: E (end), S (single, means the single word entity) ; in this case, the label set contains 13 types of labels (3*4+1).
Mathematical definition of Hidden Markov model
If we hope to use HMM in the sequence labeling problem, we need to take the input (words in sequence) as observation and output as hidden states. Here a "BIO" entity label, is an unobservation hidden state, and HMM is the process to describe the observation state (readable text) given those hidden states (entity labels)
Three basic problems of HMM
First, let's define HMM using the mathematical formulation below. \[ \begin{aligned} &\text { model:}(\pi, \mathrm{A}, \mathrm{B})\\ &\text { observation sequence}: \mathrm{O}=\left(\mathrm{O}_{1}, \mathrm{O}_{2}, \ldots, \mathrm{O}_{T}\right)\\ &\text { state sequence}: \mathrm{I}=\left(i_{1}, i_{2}, \ldots, i_{T}\right)\\ &\text { where }\pi\text { means initial hidden state probability;}\\ &\text { A means transition matrix; B means emission matrix}\\ \end{aligned} \]
Three problems:
Probability calculating problem: given the model and observation sequence, calculate the probability of observation given the model.
Learning problem: given observation sequence, estimate the model's parameter. According to whether the training data contains the corresponding state sequence, it divides into:
- Supervised learning: Maximum likelihood estimation
- Unsupervised learning: EM algorithm
Prediction problem: decoding problem. Given the model parameter and observation sequence, find the most likely corresponding state sequence.
To solve the NER task, we focus on the learning problem of HMM.
Using HMM to solve the problem of sequence labeling
With the training data contains the corresponding state sequence, we do the maximum likelihood estimation for supervised learning to solve the NER task. Below is how we estimate.
Maximum likelihood estimation for supervised learning
\[ \begin{aligned} 1. \text {Estimate initial hidden state probability:}\\ \hat{\pi}_{q_{i}}=\frac{\operatorname{count}\left(q_{i}^{1}\right)}{\operatorname{count}\left(o_{1}\right)} ,\sum_{i} \pi_{q_{i}}=1\\ \end{aligned} \]
\[ \begin{aligned} 2. \text {Estimate transition probability:} \\ \hat{A}_{i j}=P\left(i_{t+1}=q_{j} \mid i_{t}=q_{i}\right)=\frac{\operatorname{count}\left(q_{i} q_{j}\right)}{\operatorname{count}\left(q_{i}\right)} \quad \sum_{j} A_{i j}=\sum_{j} P\left(i_{t+1}=q_{j} \mid i_{t}=q_{i}\right)=1\\ \end{aligned} \]
\[ \begin{aligned} 3. \text {Estimate emission probability:}\\ \hat{B}_{j k}=P\left(o_{t}=v_{k} \mid i_{t}=q_{j}\right)=\frac{\operatorname{count}\left(q_{j} v_{k}\right)}{\operatorname{count}\left(q_{j}\right)} \quad \sum_{k} B_{j k}=\sum_{k} P\left(o_{t}=v_{k} \mid i_{t}=q_{j}\right)=1 \end{aligned} \]
With the among estimations, we can estimate the sequence labeling using the Figure below.
Mathematical definition of Linear Chain Conditional Random Field
CRF used on sequence labeling problem is a discriminative model which uses input sequence to predict output sequence. We have the definition below. \[ \begin{aligned} \text {Suppose }X=(X_{1}, \ldots, X_{n}), Y=(Y_{1}, \ldots, Y_{n}),\\ \text {then the definition of linear chain conditional random field is} \\ P(Y i \mid X, Y_{1}, \ldots, Y i-1, Y i+1, \ldots, Y n)=P(Y i \mid X, Y i-1, Y i+1), i=1, \ldots, n \ \end{aligned} \] Note: 1. When i = 1 or n, the probability only consider for one side;
- Yn is only related with Yn-1 and Yn+1.
Then, we use the characteristic function to get the probability of a state sequence y = y1y2... given the observation sequence x = x1x2... \[ \begin{array} \qquad \begin{aligned} P(Y=y \mid x ; \theta) &=\frac{1}{Z(x)} \exp \left(\sum_{k} \lambda_{k} \sum_{i} t_{k}\left(y_{i-1}, y_{i}, x, i\right)+\sum_{l} \mu_{l} \sum_{i} s_{l}\left(y_{i}, x, i\right)\right) \\ Z(x)=& \sum_{y} \exp \left(\sum_{k} \lambda_{k} \sum_{i} t_{k}\left(y_{i-1}, y_{i}, x, i\right)+\sum_{l} \mu_{l} \sum_{i} s_{l}\left(y_{i}, x, i\right)\right) \end{aligned} \end{array} \]
Z(x) is the normalization factor, which refers to the sum of all possible values of y. We can also write the among formulation as below.
\[ \begin{array} \qquad \begin{aligned} \operatorname{score}(l \mid s))=\sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_{j} f_{j}\left(s, i, l_{i}^{\prime}, l_{i-1}^{\prime}\right) \\ p(l \mid s)=\frac{\exp [\operatorname{score}(l \mid s)]}{\sum_{l^{\prime}} \exp \left[\operatorname{score}\left(l^{\prime} \mid s\right)\right]}=\frac{\exp \left[\sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_{j} f_{j}\left(s, i, l_{i}, l_{i-1}\right)\right]}{\sum_{l^{\prime}} \exp \left[\sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_{j} f_{j}\left(s, i, l_{i}^{\prime}, l_{i-1}^{\prime}\right)\right]} \end{aligned} \end{array} \] Now, we have transfer the features into probabilities.
To understand easily, CRF is the sequence version of the logistic regression.
Difference between HMM and CRF
HMM | CRF | |
---|---|---|
Hypothesis | 1. Yn is related to Yn-1; 2. observation independent hypothesis: Xi-1 is only related to current hidden state (Yi-1) | 1. Yn is linked with Yn-1 and Yn+1, but unrelated with other Yi; 2.Yn is related to all Xi, and all Xi can depend Yn |
model | generative model | discriminative model |
scope | - | CRF is more powerful than HMM in modeling |
features | HMM is necessarily local in nature | CRF can define a broader set of features. CRF can use more global features. |
weight value | the probability value of HMM must satisfy certain constraints | CRF can have any weight value |
Through some constraints condition, HMM can be used to do sequence labeling in the CRF style.
Evaluation
Before deep learning, CRF performed well on sequence labeling; even, it is still good on some data sets.
The performance of CRF in this task will also be better than Bi-LSTM (without CRF). Because CRF's prediction of the state sequence depends not only on the global x, but also on each other, while the Bi-LSTM prediction of the state sequence does not depend on other labels, and there is a loss from the global x to the current x. As for CRF, the loss from the long-term x to the current hidden state can be solved by a suitable characteristic function.
So, if we combine CRF and Bi-LSTM, then the hidden state has the relation between close label. What's more, LSTM (deep learning has good feature extraction ability).
References
[1] 李航. 统计学习方法[M]. Qing hua da xue chu ban she, 2012.