Sparse Higher Order Conditional Random Fields for improved sequence labeling

Xian Qian qianxian@fudan.edu.cn
Xiaoqian Jiang xqjiang@mit.edu
Qi Zhang qi zhang@fudan.edu.cn
Xuanjing Huang xjhuang@fudan.edu.cn
Lide Wu ldwu@fudan.edu.cn

School of Computer Science, Fudan University, Shanghai, 200433, P.R.China
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

In real sequence labeling tasks, the statistics of many higher order features are insufficient due to training data sparseness, and very few of them are useful. We describe Sparse Higher Order Conditional Random Fields (SHO-CRFs), which are able to handle local features and sparse higher order features together using a novel tractable exact inference algorithm. Our main insight is that states and transitions with the same potential functions can be grouped together, and inference is performed on the grouped states and transitions. Though the complexity is not polynomial, SHO-CRFs are still efficient in practice because of feature sparseness. Experimental results on optical character recognition and Chinese organization name recognition show that with the same higher order feature set, SHO-CRFs significantly outperform previous approaches.

1. Introduction

In sequence labeling tasks, structured learning models owe a great part of their success to their ability to use local structured information; examples include Conditional Random Fields (CRFs) (Lafferty et al., 2001), the Averaged Perceptron (Collins, 2002a), and Max Margin Markov Networks (Taskar et al., 2003). However, these approaches are inefficient at covering long distance features due to high computational complexity.

Recent approaches attempting to capture non-local features can be divided into four classes. The first class employs approximate inference algorithms such as Loopy Belief Propagation and Gibbs sampling. Despite their simplicity, approximate inference techniques are not guaranteed to converge to a good approximation. The second class uses a reranking framework such as (Collins, 2002b). These approaches typically generate the N best candidate predictions, then adopt a post-processing model to rerank these candidates using non-local features. The main drawback of these methods is that the effectiveness of the post-processing model is restricted by the number of candidates. The third class chooses a semi-Markov chain as the graphical model, e.g., Semi-Markov CRFs (Sarawagi & Cohen, 2004). Though the inference is exact and efficient, it can only deal with segment-based higher order features. The last class formulates the labeling task as a linear programming (LP) problem with some relaxations (Roth & tau Yih, 2005), so that higher order features can be represented as linear constraints. For many higher order features, such inference is still approximate.

Different from the approaches mentioned above, we want to handle local and non-local features together while keeping the inference exact and tractable under some reasonable assumptions on the non-local features. Our motivation is that in real applications, the statistics of higher order features are insufficient due to training data sparseness, and most of them may be useless. For example, in the optical character recognition (OCR) task, many higher order transitions, such as "aaaa" and "ijklmn", are meaningless; only very few of them, such as "tion" and "ment", are helpful. In this sense, higher order features are sparse in terms of their contribution. We propose Sparse Higher Order Conditional Random Fields (SHO-CRFs), which can handle local and sparse higher order features together using a novel tractable exact inference algorithm.
Though in the worst case the complexity is still not polynomial, SHO-CRFs are quite efficient in practice due to feature sparseness. Experiments on optical character recognition (OCR) and Chinese organization name recognition tasks demonstrate our technique: SHO-CRFs significantly outperform conventional CRFs with the help of sparse higher order features, and outperform the candidate reranking approach with the same higher order features.

The paper is structured as follows: in section 2, we give the definition of configurations; in section 3, we describe our new inference algorithm using the configuration graph; in section 4, we analyze the complexity of SHO-CRFs; experimental results are shown in section 5; we conclude the work in section 6.

2. Features and configurations

For probabilistic graphical models, the task of sequence labeling is to learn a conditional distribution p(y|x), where x = x_1 ... x_l is the observed node sequence to be labeled and y = y_1 y_2 ... y_l is the label sequence. Each component y_t is assumed to range over a finite label alphabet. For example, in the named entity recognition (NER) task, we want to extract person names. Here x might be a sequence of words, and y might be a sequence in {B, I, O}^{|x|}, where y_t = B indicates "word x_t is the beginning of a person name", y_t = I indicates "word x_t is inside a person name, but not the beginning", and y_t = O indicates other cases.

CRFs are undirected graphical models depicted by the Markov network distribution

p(y|x) ∝ ∏_{t=1}^{l} Φ(x, y, t)

where Φ(x, y, t) is a real-valued potential function at position t, which has the form

Φ(x, y, t) = exp( ∑_i w_i f_i(x, y_{s:t}) )

Here w_i is the parameter to be estimated from the training data, and f_i is a real-valued feature function, or feature for short, which is given and fixed. In this paper, for simplicity, we will focus on the case of binary features; however, our results extend easily to the general real-valued case. We call a binary feature fired if its value is true. y_{s:t} = y_s y_{s+1} ... y_t is the label subsequence affected by f_i, and t − s is the order of f_i.

Formally, each feature can be factorized into two parts:

f(x, y_{s:t}) = b(x, t) I_Z(y_{s:t})

Both parts are binary functions: b(x, t) indicates whether the observation satisfies certain characteristics at position t, Z is a set of label subsequences specified by f, and I_Z(y_{s:t}) indicates whether y_{s:t} ∈ Z. For example, we define a feature which is fired if the word subsequence lying between "professor" and "said" is recognized as a person name. Consider the sentence x = "Professor Abdul Rahman Haj Yihye said ...". We have

f_1(x, y_{2:5}) = b(x, 5) I_{{BIII}}(y_{2:5})

where

b(x, 5) = 1 if the 1st word is "professor" and the next word is "said"
        = 0 otherwise

Another example is the feature used in skip chain CRFs, which is fired if a pair of identical capitalized words have similar labels. Suppose x = "Speaker John Smith ... Professor Smith will ...", where "Smith" appears at positions 3 and 100. Let U = {B, I, O} denote the full label set, U_{4:99} = U × ··· × U, and Z = {B, I} × U_{4:99} × {B, I}. We have

f_2(x, y_{3:100}) = b(x, 100) I_Z(y_{3:100})

where

b(x, 100) = 1 if the 3rd and 100th words are the same and capitalized
          = 0 otherwise

Such a feature is fired only if both occurrences of "Smith" are labeled as part of a person name.

For a fired feature f at position t, if its corresponding Z has the form Z_{s:t} = Z_s × Z_{s+1} × ··· × Z_t, where Z_i ⊆ U, s ≤ i ≤ t, such a Z is called the configuration of f. Both examples mentioned above are configurations. However, Z = {BI, IB}, for example, is not a configuration; in this case, we treat it as the union of two configurations.

The potential function of a configuration is defined as

Ψ(Z_{s:t}) = exp( w f(x, y_{s:t}) ), for any y_{s:t} ∈ Z_{s:t}

where y_{p:q} ∈ Z_{s:t} (p ≤ s ≤ t ≤ q) indicates that the subsequence y_{s:t} of y_{p:q} satisfies y_{s:t} ∈ Z_{s:t}.
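To make the factorization and the notion of a configuration concrete, the following Python sketch is a minimal illustration of ours, not code from the paper; the names `Configuration`, `b_professor_said`, and `feature_value` are our own. It encodes a configuration as a product of per-position label subsets and evaluates the first example feature f_1 = b(x, 5) I_{{BIII}}(y_{2:5}).

```python
# Minimal sketch (not from the paper): a higher order feature factored as
# f(x, y_{s:t}) = b(x, t) * I_Z(y_{s:t}), where Z is a configuration, i.e.
# a product of per-position label subsets Z_s x ... x Z_t.

U = {"B", "I", "O"}  # full label alphabet of the running NER example


class Configuration:
    """A configuration Z_{s:t} = Z_s x Z_{s+1} x ... x Z_t with each Z_i a subset of U."""

    def __init__(self, s, label_sets):
        self.s = s                      # start position (1-based, as in the paper)
        self.label_sets = label_sets    # list of per-position label subsets

    @property
    def t(self):
        return self.s + len(self.label_sets) - 1

    def indicator(self, y):
        """I_Z(y_{s:t}): does the label subsequence y_{s:t} lie in the product set?"""
        sub = y[self.s - 1: self.t]     # y is the full label sequence
        return all(lab in Z_i for lab, Z_i in zip(sub, self.label_sets))


def b_professor_said(x, t):
    """Observation test b(x, t) for the first example: fires at t = 5 when the
    name candidate x_2..x_5 is wrapped by 'professor' ... 'said'."""
    return t == 5 and len(x) > 5 and x[0].lower() == "professor" and x[5].lower() == "said"


def feature_value(x, y, t, b, Z):
    """f(x, y_{s:t}) = b(x, t) * I_Z(y_{s:t})."""
    return int(b(x, t) and Z.indicator(y))


if __name__ == "__main__":
    x = ["Professor", "Abdul", "Rahman", "Haj", "Yihye", "said"]
    Z1 = Configuration(2, [{"B"}, {"I"}, {"I"}, {"I"}])       # Z = {BIII}
    y_good = ["O", "B", "I", "I", "I", "O"]
    y_bad = ["O", "B", "I", "O", "I", "O"]
    print(feature_value(x, y_good, 5, b_professor_said, Z1))  # 1: feature fires
    print(feature_value(x, y_bad, 5, b_professor_said, Z1))   # 0: subsequence not in Z
```

The same representation covers the skip chain feature f_2: its configuration simply uses the full label set U at positions 4..99 and {B, I} at positions 3 and 100.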
3. Inference

3.1. Task

We describe our inference algorithm for training; it can be applied to decoding with a slight modification. In the training stage, CRFs learn w = [w_1, ..., w_M]^T from training data X = {(x^(1), ỹ^(1)), (x^(2), ỹ^(2)), ...}:

min_w O(w) = − ∑_i log p(ỹ^(i) | x^(i))

where ỹ^(i) is the gold standard label sequence of x^(i) and M is the feature number. The goal of inference is to calculate O(w) and ∂O/∂w. It is not difficult to obtain that

∂O/∂w_j = ∑_i ∑_t [ p(Z_{j,t} | x^(i)) − f_j(x^(i), ỹ^(i)_{s:t}) ]

so we use the following equation for inference:

p(Z_{r:t}) = (1/z(x)) ∑_{y_{r:t} ∈ Z_{r:t}} α(x, y_{r:t}) β(x, y_{s:t})          (2)

The conventional forward backward algorithm considers all y_{r:t} ∈ U_{r:t} to calculate z(x) and p(Z), hence the complexity is exponential in the number of labels. However, if we could split U_{s:t} into several parts so that all elements in one part share a common β value, the complexity would be reduced.

Given a set of label subsequences A_{s:t} = {y_{s:t}} and a new range from p to q, the extension is defined as

E_{p:q}(A_{s:t}) ≡ A_{p:q} = U_{p:s−1} × A_{s:t} × U_{t+1:q} ⊆ U_{p:q}

The extension operation will be frequently used in the rest of the paper; we assume that such extension has been done for all configurations.

3.3. State partition

To derive a common β, we consider a simple case: U_{u:v} and one configuration B_{p:q} with q > v are given. If p > v, B_{p:q} and U_{u:v} are disjoint, then all y_{u:v} ∈ U_{u:v} share a common β:

β(y_{u:v}) = Ψ(B) |B_{v+1:l}| + |B̄_{v+1:l}|

where B_{v+1:l} = E_{v+1:l}(B_{p:q}). Hence no split is needed. If p ≤ v, we have

β(y_{u:v}) = Ψ(B) |B_{v+1:l}| + |B̄_{v+1:l}|   if y_{u:v} ∈ B_{u:v}
           = |U_{v+1:l}|                       otherwise

where B_{u:v} = E_{u:v}(B_{p:q}), as shown in Figure 3. Hence, we obtain the partition {B_{u:v}, B̄_{u:v}}: all members in one part share a common β.

Figure 3. Split U_{u:v} so that all members in one part share a common β.

Generally, consider the U_{s:t} mentioned in the previous subsection. Let S = {Z^i_{p:q}} denote the set of configurations at t+1, t+2, ... that overlap U_{s:t}. We extend each of them from Z^i_{p:q} to Z^i_{s:t}, then derive a partition of U_{s:t} so that all y_{s:t} in one part share a common β:

P(S) = { A^k_{s:t} | A^k_{s:t} ≠ ∅, k = 1, ..., 2^{|S|} }          (3)

where A^k_{s:t} = ∩_{i=1}^{|S|} A^{ki}_{s:t}, with A^{ki}_{s:t} = Z^i_{s:t} or Z̄^i_{s:t}. This partition is called the state partition of U_{s:t}, denoted Π_{s:t}; each part A^k_{s:t} ∈ Π_{s:t} is called a grouped state. The common β value of A^k_{s:t} is denoted by β(A^k_{s:t}). P(S) is called the derived partition of the configuration set S.
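To make the state partition concrete, here is a small Python sketch of ours, not the paper's procedure (which avoids enumerating U_{s:t}); the helper names `extend`, `satisfies`, and `state_partition` are illustrative assumptions. It builds the partition of eq. (3) by brute force, grouping every y_{s:t} ∈ U_{s:t} by the subset of overlapping configurations it satisfies; with only a few sparse configurations, U_{s:t} collapses into a handful of grouped states, each of which would share a single β value.

```python
# Illustrative sketch only (our code, not the paper's algorithm): compute the
# state partition of eq. (3) by brute-force enumeration of U_{s:t}, grouping
# label subsequences by which (extended) configurations they satisfy.
# SHO-CRFs obtain these grouped states without such enumeration.
from itertools import product

U = ("B", "I", "O")  # label alphabet of the running example


def extend(label_sets, cfg_start, s, t):
    """E_{s:t}: pad a configuration (per-position label subsets starting at
    cfg_start) with the full label set so that it covers positions s..t."""
    cfg_end = cfg_start + len(label_sets) - 1
    padded = []
    for pos in range(s, t + 1):
        if cfg_start <= pos <= cfg_end:
            padded.append(label_sets[pos - cfg_start])
        else:
            padded.append(set(U))
    return padded


def satisfies(y_st, padded_sets):
    """Membership test y_{s:t} in Z_{s:t} for a product-form configuration."""
    return all(lab in Z_i for lab, Z_i in zip(y_st, padded_sets))


def state_partition(s, t, configs):
    """Group every y_{s:t} in U_{s:t} by the subset of configurations it
    satisfies.  Each non-empty group is a grouped state A^k_{s:t}."""
    extended = [extend(Z, cs, s, t) for cs, Z in configs]
    groups = {}
    for y in product(U, repeat=t - s + 1):
        signature = tuple(satisfies(y, Z) for Z in extended)
        groups.setdefault(signature, []).append(y)
    return groups


if __name__ == "__main__":
    # two sparse configurations overlapping positions 2..4:
    # Z^1 = {B} x {I} x {I} starting at 2, and Z^2 = {I} x {I} starting at 3
    configs = [(2, [{"B"}, {"I"}, {"I"}]), (3, [{"I"}, {"I"}])]
    for signature, members in state_partition(2, 4, configs).items():
        print(signature, len(members))  # each group shares one beta value
```

Running the demo yields three non-empty grouped states out of the 2^{|S|} = 4 possible signatures (the combination "satisfies Z^1 but not Z^2" is empty), versus |U_{2:4}| = 27 individual subsequences; this collapse is what the transition partition and the β recursion in the next subsection exploit.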
3.4. Transition partition

The next problem is to calculate the β value. First, we show two propositions.

Proposition 1. For any A_{s:t} ∈ Π_{s:t} and any y ∈ U, all y_{s:t+1} ∈ A_{s:t} × {y} share a common potential function Φ(x, y_{s:t+1}, t+1) at t+1.

Proof. Let S = {Z^i_{r:t+1}} denote the set of configurations at t+1, and let T = {Y^i_{s:t}}, where Y^i_{s:t} = E_{s:t}(Z^i_{r:t+1}). For its derived partition P(T), each part can be represented by

C_{s:t} = ( ∩_{i ∈ I} Y^i_{s:t} ) ∩ ( ∩_{j ∈ T−I} Ȳ^j_{s:t} )

so

C_{s:t} × {y} = ( ∩_{i ∈ I} Y^i_{s:t} × {y} ) ∩ ( ∩_{j ∈ T−I} Ȳ^j_{s:t} × {y} )

Hence, for any y_{s:t+1} ∈ C_{s:t} × {y}, its potential function is Φ(x, y_{s:t+1}, t+1) = ∏_{i ∈ I, y ∈ Z^i_{t+1}} Ψ(Z^i). Since Π_{s:t} is a refinement of P(T), Proposition 1 holds.

Proposition 2. For any A_{s:t} ∈ Π_{s:t} and B_{r:t+1} ∈ Π_{r:t+1}, there exists a Y_{t+1} ⊆ U such that A_{s:t} × Y_{t+1} ⊆ U_{s:r−1} × B_{r:t+1} and A_{s:t} × Ȳ_{t+1} ∩ (U_{s:r−1} × B_{r:t+1}) = ∅.

The proof is shown in Appendix A, which also gives such a Y_{t+1}. An intuitive illustration is shown in the left part of Figure 4.

Figure 4. Left: Proposition 2. Right: 3 grouped transitions between grouped states A_{s:t} and B_{r:t+1}.

All y_{s:t+1} ∈ A_{s:t} × Y_{t+1} share a common β. If we consider all B^i_{r:t+1} ∈ Π_{r:t+1}, then the set {A_{s:t} × Y_i | A_{s:t} × Y_i ⊆ U_{s:r−1} × B^i_{r:t+1}, B^i_{r:t+1} ∈ Π_{r:t+1}} is a partition of A_{s:t} × U, and the union of the partitions of A^i_{s:t} × U over A^i_{s:t} ∈ Π_{s:t} is a partition of U_{s:t+1}. According to Proposition 1, all y_{s:t+1} ∈ A_{s:t} × {y} share a common potential function Φ, so we derive a partition A_{s:t} × Y_i = ∪_j A_{s:t} × V_j such that all y_{s:t+1} in one part A_{s:t} × V_j share common Φ and β values (the common potential is denoted by Φ(A_{s:t} × V_j)). The union of such partitions over all {A_{s:t} × Y_i} is a partition of U_{s:t+1}, which is called the transition partition. Each member A_{s:t} × V_j is called a grouped transition. An example is shown in the right part of Figure 4.

We could compute β(A_{s:t}) recursively by enumerating all its linked grouped transitions {A_{s:t} × V_{t+1}}:

β(A_{s:t}) = ∑_{A_{s:t} × V_{t+1}} β(B_{V_{t+1}}) |V_{t+1}| Φ(A_{s:t} × V_{t+1})

where B_{V_{t+1}} is the grouped state B_{r:t+1} that satisfies A_{s:t} × V_{t+1} ⊆ U_{s:r−1} × B_{r:t+1}. Similarly, α(A_{s:t}) = ∑_{y_{s:t} ∈ A_{s:t}} α(y_{s:t}), which can be calculated recursively:

α(B_{r:t+1}) = ∑_{A_{s:t} × V_{t+1} ⊆ U_{s:r−1} × B_{r:t+1}} α(A_{s:t}) |V_{t+1}| Φ(A_{s:t} × V_{t+1})

and z(x) = ∑_{A_{s:l} ∈ Π_{s:l}} α(A_{s:l}).

3.5. The extended forward backward algorithm

We could build a trellis in which grouped states are represented by nodes, and edges between nodes denote the grouped transition sets between two grouped states. If a transition set is empty, the corresponding edge does not exist. For instance, consider the first feature in section 2, and suppose that state features f_i(y_t, x, t) and first order transition features g_i(y_{t−1}, y_t, x, t) are additionally used; the trellis is shown in Figure 5.

4. Complexity analysis

In SHO-CRFs training, we have to derive the state partitions and transition partitions of each sequence, and a trellis like Figure 5 is built. Given extended configuration sets A_t = {E_{s:t}(Z^i_{s_i:t_i})}_{s_i ≤ t}