Sparse Higher Order Conditional Random Fields for improved sequence labeling

Xian Qian qianxian@fudan.edu.cn
Xiaoqian Jiang xqjiang@mit.edu
Qi Zhang qi zhang@fudan.edu.cn
Xuanjing Huang xjhuang@fudan.edu.cn
Lide Wu ldwu@fudan.edu.cn

School of Computer Science, Fudan University, Shanghai, 200433, P.R.China
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

In real sequence labeling tasks, the statistics of many higher order features are insufficient due to training data sparseness, and very few of them are useful. We describe Sparse Higher Order Conditional Random Fields (SHO-CRFs), which are able to handle local features and sparse higher order features together using a novel tractable exact inference algorithm. Our main insight is that states and transitions with the same potential functions can be grouped together, and inference is performed on the grouped states and transitions. Though the complexity is not polynomial, SHO-CRFs are still efficient in practice because of feature sparseness. Experimental results on optical character recognition and Chinese organization name recognition show that with the same higher order feature set, SHO-CRFs significantly outperform previous approaches.

1. Introduction

In sequence labeling tasks, structured learning models owe a great part of their success to their ability to use local structured information; examples include Conditional Random Fields (CRFs) (Lafferty et al., 2001), the Averaged Perceptron (Collins, 2002a), and Max Margin Markov Networks (Taskar et al., 2003). However, these approaches are inefficient at covering long distance features due to high computational complexity.

Recent approaches attempting to capture non-local features can be divided into four classes. The first class employs approximate inference algorithms such as Loopy Belief Propagation and Gibbs sampling. Despite their simplicity, approximate inference techniques are not guaranteed to converge to a good approximation. The second class uses a reranking framework such as (Collins, 2002b). These approaches typically generate the N best candidate predictions, then adopt a post-processing model to rerank these candidates using non-local features. The main drawback of these methods is that the effectiveness of the post-processing model is restricted by the number of candidates. The third class chooses a semi-Markov chain as the graphical model, e.g., Semi-Markov CRFs (Sarawagi & Cohen, 2004). Though the inference is exact and efficient, it can only deal with segment-based higher order features. The last class formulates the labeling task as a linear programming (LP) problem with some relaxations (Roth & tau Yih, 2005), so that higher order features can be represented as linear constraints. For many higher order features, such inference is still approximate.

Different from the approaches mentioned above, we want to handle local and non-local features together while keeping the inference exact and tractable under some reasonable assumptions on the non-local features. Our motivation is that in real applications, the statistics of higher order features are insufficient due to training data sparseness, and most of them may be useless. For example, in the optical character recognition (OCR) task, many higher order transitions, such as "aaaa" and "ijklmn", are meaningless; only very few of them, such as "tion" and "ment", are helpful. In this sense, higher order features are sparse in terms of their contribution. We propose Sparse Higher Order Conditional Random Fields (SHO-CRFs), which can handle local and sparse higher order features together using a novel tractable exact inference algorithm.
Though in the worst case the complexity is still not polynomial, SHO-CRFs are quite efficient in practice due to feature sparseness. Experiments on optical character recognition (OCR) and Chinese organization name recognition tasks demonstrate our technique: SHO-CRFs significantly outperform conventional CRFs with the help of sparse higher order features, and outperform the candidate reranking approach with the same higher order features.

The paper is structured as follows: in section 2, we give the definition of configurations; in section 3, we describe our new inference algorithm using the configuration graph; in section 4, we analyze the complexity of SHO-CRFs; experimental results are shown in section 5; we conclude the work in section 6.

2. Features and configurations

For probabilistic graphical models, the task of sequence labeling is to learn a conditional distribution p(y|x), where x = x_1 ... x_l is the observed node sequence to be labeled and y = y_1 y_2 ... y_l is the label sequence. Each component y_t is assumed to range over a finite label alphabet. For example, in the named entity recognition (NER) task, we want to extract person names. Here x might be a sequence of words, and y might be a sequence in {B, I, O}^{|x|}, where y_t = B indicates "word x_t is the beginning of a person name", y_t = I indicates "word x_t is inside a person name, but not the beginning", and y_t = O indicates other cases.

CRFs are undirected graphical models depicted by the Markov network distribution

p(y|x) ∝ ∏_{t=1}^{l} Φ(x, y, t)

where Φ(x, y, t) is a real-valued potential function at position t, which has the form

Φ(x, y, t) = exp( ∑_i w_i f_i(x, y_{s:t}) )

Here w_i is the parameter to be estimated from the training data, and f_i is a real-valued feature function, or feature for short, which is given and fixed. In this paper, for simplicity, we will focus on the case of binary features; however, our results extend easily to the general real-valued case. We call a binary feature fired if its value is true. y_{s:t} = y_s y_{s+1} ... y_t is the label subsequence affected by f_i, and t − s is the order of f_i.

Formally, each feature can be factorized into two parts:

f(x, y_{s:t}) = b(x, t) I_Z(y_{s:t})

Both parts are binary functions: b(x, t) indicates whether the observation satisfies certain characteristics at position t, Z is a set of label subsequences specified by f, and I_Z(y_{s:t}) indicates whether y_{s:t} ∈ Z. For example, we define a feature which is fired if the word subsequence lying between "professor" and "said" is recognized as a person name. Consider the sentence x = "Professor Abdul Rahman Haj Yihye said ...". We have

f_1(x, y_{2:5}) = b(x, 5) I_{{BIII}}(y_{2:5})

where

b(x, 5) = 1 if the 1st word is "professor" and the next word is "said"
        = 0 otherwise

Another example is the feature used in skip chain CRFs, which is fired if a pair of identical capitalized words have similar labels. Suppose x = "Speaker John Smith ... Professor Smith will ...", where "Smith" appears at positions 3 and 100. Let U = {B, I, O} denote the full label set, U_{4:99} = U × ··· × U, and Z = {B, I} × U_{4:99} × {B, I}. We have

f_2(x, y_{3:100}) = b(x, 100) I_Z(y_{3:100})

where

b(x, 100) = 1 if the 3rd and 100th words are the same and capitalized
          = 0 otherwise

Such a feature is fired only if both occurrences of "Smith" are labeled as part of a person name.

For a fired feature f at position t, if its corresponding Z has the form Z_{s:t} = Z_s × Z_{s+1} × ··· × Z_t, where Z_i ⊆ U, s ≤ i ≤ t, such a Z is called the configuration of f. Both examples mentioned above are configurations. However, Z = {BI, IB}, for example, is not a configuration; in this case, we treat it as the union of two configurations.

The potential function of a configuration is defined as

Ψ(Z_{s:t}) = exp( w f(x, y_{s:t}) ), for any y_{s:t} ∈ Z_{s:t}

where y_{p:q} ∈ Z_{s:t} (p ≤ s ≤ t ≤ q) indicates that the subsequence y_{s:t} of y_{p:q} satisfies y_{s:t} ∈ Z_{s:t}.
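To make the factorization and the notion of a configuration concrete, the following Python sketch is a minimal illustration of ours, not code from the paper; the names `Configuration`, `b_professor_said`, and `feature_value` are our own. It encodes a configuration as a product of per-position label subsets and evaluates the first example feature f_1 = b(x, 5) I_{{BIII}}(y_{2:5}).

```python
# Minimal sketch (not from the paper): a higher order feature factored as
# f(x, y_{s:t}) = b(x, t) * I_Z(y_{s:t}), where Z is a configuration, i.e.
# a product of per-position label subsets Z_s x ... x Z_t.

U = {"B", "I", "O"}  # full label alphabet of the running NER example


class Configuration:
    """A configuration Z_{s:t} = Z_s x Z_{s+1} x ... x Z_t with each Z_i a subset of U."""

    def __init__(self, s, label_sets):
        self.s = s                      # start position (1-based, as in the paper)
        self.label_sets = label_sets    # list of per-position label subsets

    @property
    def t(self):
        return self.s + len(self.label_sets) - 1

    def indicator(self, y):
        """I_Z(y_{s:t}): does the label subsequence y_{s:t} lie in the product set?"""
        sub = y[self.s - 1: self.t]     # y is the full label sequence
        return all(lab in Z_i for lab, Z_i in zip(sub, self.label_sets))


def b_professor_said(x, t):
    """Observation test b(x, t) for the first example: fires at t = 5 when the
    name candidate x_2..x_5 is wrapped by 'professor' ... 'said'."""
    return t == 5 and len(x) > 5 and x[0].lower() == "professor" and x[5].lower() == "said"


def feature_value(x, y, t, b, Z):
    """f(x, y_{s:t}) = b(x, t) * I_Z(y_{s:t})."""
    return int(b(x, t) and Z.indicator(y))


if __name__ == "__main__":
    x = ["Professor", "Abdul", "Rahman", "Haj", "Yihye", "said"]
    Z1 = Configuration(2, [{"B"}, {"I"}, {"I"}, {"I"}])       # Z = {BIII}
    y_good = ["O", "B", "I", "I", "I", "O"]
    y_bad = ["O", "B", "I", "O", "I", "O"]
    print(feature_value(x, y_good, 5, b_professor_said, Z1))  # 1: feature fires
    print(feature_value(x, y_bad, 5, b_professor_said, Z1))   # 0: subsequence not in Z
```

The same representation covers the skip chain feature f_2: its configuration simply uses the full label set U at positions 4..99 and {B, I} at positions 3 and 100.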
3. Inference

3.1. Task

We describe our inference algorithm for training; it can be applied to decoding with a slight modification. In the training stage, CRFs learn w = [w_1, ..., w_M]^T from training data X = {(x^(1), ỹ^(1)), (x^(2), ỹ^(2)), ...}:

min_w O(w) = − ∑_i log p(ỹ^(i) | x^(i))

where ỹ^(i) is the gold standard label sequence of x^(i) and M is the feature number. The goal of inference is to calculate O(w) and ∂O/∂w. It is not difficult to obtain that

∂O/∂w_j = ∑_i ∑_t [ p(Z_{j,t} | x^(i)) − f_j(x^(i), ỹ^(i)_{s:t}) ]

so we use the following equation for inference:

p(Z_{r:t}) = (1/z(x)) ∑_{y_{r:t} ∈ Z_{r:t}} α(x, y_{r:t}) β(x, y_{s:t})          (2)

The conventional forward backward algorithm considers all y_{r:t} ∈ U_{r:t} to calculate z(x) and p(Z), hence the complexity is exponential in the number of labels. However, if we could split U_{s:t} into several parts so that all elements in one part share a common β value, the complexity would be reduced.

Given a set of label subsequences A_{s:t} = {y_{s:t}} and a new range from p to q, the extension is defined as

E_{p:q}(A_{s:t}) ≡ A_{p:q} = U_{p:s−1} × A_{s:t} × U_{t+1:q} ⊆ U_{p:q}

The extension operation will be frequently used in the rest of the paper; we assume that such extension has been done for all configurations.

3.3. State partition

To derive a common β, we consider a simple case: U_{u:v} and one configuration B_{p:q} with q > v are given. If p > v, B_{p:q} and U_{u:v} are disjoint, then all y_{u:v} ∈ U_{u:v} share a common β:

β(y_{u:v}) = Ψ(B) |B_{v+1:l}| + |B̄_{v+1:l}|

where B_{v+1:l} = E_{v+1:l}(B_{p:q}). Hence no split is needed. If p ≤ v, we have

β(y_{u:v}) = Ψ(B) |B_{v+1:l}| + |B̄_{v+1:l}|   if y_{u:v} ∈ B_{u:v}
           = |U_{v+1:l}|                       otherwise

where B_{u:v} = E_{u:v}(B_{p:q}), as shown in Figure 3. Hence, we obtain the partition {B_{u:v}, B̄_{u:v}}: all members in one part share a common β.

Figure 3. Split U_{u:v} so that all members in one part share a common β.

Generally, consider the U_{s:t} mentioned in the previous subsection. Let S = {Z^i_{p:q}} denote the set of configurations at t+1, t+2, ... that overlap U_{s:t}. We extend each of them from Z^i_{p:q} to Z^i_{s:t}, then derive a partition of U_{s:t} so that all y_{s:t} in one part share a common β:

P(S) = { A^k_{s:t} | A^k_{s:t} ≠ ∅, k = 1, ..., 2^{|S|} }          (3)

where A^k_{s:t} = ∩_{i=1}^{|S|} A^{ki}_{s:t}, with A^{ki}_{s:t} = Z^i_{s:t} or Z̄^i_{s:t}. This partition is called the state partition of U_{s:t}, denoted Π_{s:t}; each part A^k_{s:t} ∈ Π_{s:t} is called a grouped state. The common β value of A^k_{s:t} is denoted by β(A^k_{s:t}). P(S) is called the derived partition of the configuration set S.
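To make the state partition concrete, here is a small Python sketch of ours, not the paper's procedure (which avoids enumerating U_{s:t}); the helper names `extend`, `satisfies`, and `state_partition` are illustrative assumptions. It builds the partition of eq. (3) by brute force, grouping every y_{s:t} ∈ U_{s:t} by the subset of overlapping configurations it satisfies; with only a few sparse configurations, U_{s:t} collapses into a handful of grouped states, each of which would share a single β value.

```python
# Illustrative sketch only (our code, not the paper's algorithm): compute the
# state partition of eq. (3) by brute-force enumeration of U_{s:t}, grouping
# label subsequences by which (extended) configurations they satisfy.
# SHO-CRFs obtain these grouped states without such enumeration.
from itertools import product

U = ("B", "I", "O")  # label alphabet of the running example


def extend(label_sets, cfg_start, s, t):
    """E_{s:t}: pad a configuration (per-position label subsets starting at
    cfg_start) with the full label set so that it covers positions s..t."""
    cfg_end = cfg_start + len(label_sets) - 1
    padded = []
    for pos in range(s, t + 1):
        if cfg_start <= pos <= cfg_end:
            padded.append(label_sets[pos - cfg_start])
        else:
            padded.append(set(U))
    return padded


def satisfies(y_st, padded_sets):
    """Membership test y_{s:t} in Z_{s:t} for a product-form configuration."""
    return all(lab in Z_i for lab, Z_i in zip(y_st, padded_sets))


def state_partition(s, t, configs):
    """Group every y_{s:t} in U_{s:t} by the subset of configurations it
    satisfies.  Each non-empty group is a grouped state A^k_{s:t}."""
    extended = [extend(Z, cs, s, t) for cs, Z in configs]
    groups = {}
    for y in product(U, repeat=t - s + 1):
        signature = tuple(satisfies(y, Z) for Z in extended)
        groups.setdefault(signature, []).append(y)
    return groups


if __name__ == "__main__":
    # two sparse configurations overlapping positions 2..4:
    # Z^1 = {B} x {I} x {I} starting at 2, and Z^2 = {I} x {I} starting at 3
    configs = [(2, [{"B"}, {"I"}, {"I"}]), (3, [{"I"}, {"I"}])]
    for signature, members in state_partition(2, 4, configs).items():
        print(signature, len(members))  # each group shares one beta value
```

Running the demo yields three non-empty grouped states out of the 2^{|S|} = 4 possible signatures (the combination "satisfies Z^1 but not Z^2" is empty), versus |U_{2:4}| = 27 individual subsequences; this collapse is what the transition partition and the β recursion in the next subsection exploit.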
3.4. Transition partition

The next problem is to calculate the β value. First, we show two propositions.

Proposition 1. For any A_{s:t} ∈ Π_{s:t} and any y ∈ U, all y_{s:t+1} ∈ A_{s:t} × {y} share a common potential function Φ(x, y_{s:t+1}, t+1) at t+1.

Proof. Let S = {Z^i_{r:t+1}} denote the set of configurations at t+1, and let T = {Y^i_{s:t}}, where Y^i_{s:t} = E_{s:t}(Z^i_{r:t+1}). For its derived partition P(T), each part can be represented by

C_{s:t} = ( ∩_{i ∈ I} Y^i_{s:t} ) ∩ ( ∩_{j ∈ T−I} Ȳ^j_{s:t} )

so

C_{s:t} × {y} = ( ∩_{i ∈ I} Y^i_{s:t} × {y} ) ∩ ( ∩_{j ∈ T−I} Ȳ^j_{s:t} × {y} )

Hence, for any y_{s:t+1} ∈ C_{s:t} × {y}, its potential function is Φ(x, y_{s:t+1}, t+1) = ∏_{i ∈ I, y ∈ Z^i_{t+1}} Ψ(Z^i). Since Π_{s:t} is a refinement of P(T), Proposition 1 holds.

Proposition 2. For any A_{s:t} ∈ Π_{s:t} and B_{r:t+1} ∈ Π_{r:t+1}, there exists a Y_{t+1} ⊆ U such that A_{s:t} × Y_{t+1} ⊆ U_{s:r−1} × B_{r:t+1} and A_{s:t} × Ȳ_{t+1} ∩ (U_{s:r−1} × B_{r:t+1}) = ∅.

The proof is shown in Appendix A, which also gives such a Y_{t+1}. An intuitive illustration is shown in the left part of Figure 4.

Figure 4. Left: Proposition 2. Right: 3 grouped transitions between grouped states A_{s:t} and B_{r:t+1}.

All y_{s:t+1} ∈ A_{s:t} × Y_{t+1} share a common β. If we consider all B^i_{r:t+1} ∈ Π_{r:t+1}, then the set {A_{s:t} × Y_i | A_{s:t} × Y_i ⊆ U_{s:r−1} × B^i_{r:t+1}, B^i_{r:t+1} ∈ Π_{r:t+1}} is a partition of A_{s:t} × U, and the union of the partitions of A^i_{s:t} × U over A^i_{s:t} ∈ Π_{s:t} is a partition of U_{s:t+1}. According to Proposition 1, all y_{s:t+1} ∈ A_{s:t} × {y} share a common potential function Φ, so we derive a partition A_{s:t} × Y_i = ∪_j A_{s:t} × V_j such that all y_{s:t+1} in one part A_{s:t} × V_j share common Φ and β values (the common potential is denoted by Φ(A_{s:t} × V_j)). The union of such partitions over all {A_{s:t} × Y_i} is a partition of U_{s:t+1}, which is called the transition partition. Each member A_{s:t} × V_j is called a grouped transition. An example is shown in the right part of Figure 4.

We could compute β(A_{s:t}) recursively by enumerating all its linked grouped transitions {A_{s:t} × V_{t+1}}:

β(A_{s:t}) = ∑_{A_{s:t} × V_{t+1}} β(B_{V_{t+1}}) |V_{t+1}| Φ(A_{s:t} × V_{t+1})

where B_{V_{t+1}} is the grouped state B_{r:t+1} that satisfies A_{s:t} × V_{t+1} ⊆ U_{s:r−1} × B_{r:t+1}. Similarly, α(A_{s:t}) = ∑_{y_{s:t} ∈ A_{s:t}} α(y_{s:t}), which can be calculated recursively:

α(B_{r:t+1}) = ∑_{A_{s:t} × V_{t+1} ⊆ U_{s:r−1} × B_{r:t+1}} α(A_{s:t}) |V_{t+1}| Φ(A_{s:t} × V_{t+1})

and z(x) = ∑_{A_{s:l} ∈ Π_{s:l}} α(A_{s:l}).

3.5. The extended forward backward algorithm

We could build a trellis in which grouped states are represented by nodes, and edges between nodes denote the grouped transition sets between two grouped states. If a transition set is empty, the corresponding edge does not exist. For instance, consider the first feature in section 2, and suppose that state features f_i(y_t, x, t) and first order transition features g_i(y_{t−1}, y_t, x, t) are additionally used; the trellis is shown in Figure 5.

4. Complexity analysis

In SHO-CRFs training, we have to derive the state partitions and transition partitions of each sequence, and a trellis like Figure 5 is built. Given extended configuration sets A_t = {E_{s:t}(Z^i_{s_i:t_i})}_{s_i ≤ t}