Exact Decoding for Jointly Labeling and Chunking Sequences Nobuyuki Shimizu Department of Computer Science State University of New York at Albany Albany, NY 12222, USA nobuyuki@shimizu.name Andrew Haas Department of Computer Science State University of New York at Albany Albany, NY 12222 USA haas@cs.albany.edu Abstract There are two decoding algorithms essential to the area of natural language processing. One is the Viterbi algorithm for linear-chain models, such as HMMs or CRFs. The other is the CKY algorithm for probabilistic context free grammars. However, tasks such as noun phrase chunking and relation extraction seem to fall between the two, neither of them being the best fit. Ideally we would like to model entities and relations, with two layers of labels. We present a tractable algorithm for exact inference over two layers of labels and chunks with time complexity O(n2 ), and provide empirical results comparing our model with linear-chain models. is shallow parsing. We may want to model part-ofspeech tags and noun/verb chunks at the same time, since performing simultaneous labeling may result in increased joint accuracy by sharing information between the two layers of labels. To apply the Viterbi decoder to such tasks, we need two models, one for each layer. We must feed the output of one layer to the next layer. In such an approach, errors in earlier processing nearly always accumulate and produce erroneous results at the end. If we use CKY, we usually end up flattening the output tree to obtain the desired output. This seems like a round-about way of modeling two layers. There are previous attempts at modeling two layer labeling. Dynamic Conditional Random Fields (DCRFs) by (McCallum et al, 2003; Sutton et al, 2004) is one such attempt, however, exact inference is in general intractable for these models and the authors were forced to settle for approximate inference. Our contribution is a novel model for two layer labeling, for which exact decoding is tractable. Our experiments show that our use of label-chunk structures results in significantly better performance over cascaded CRFs, and that the model is a promising alternative to DCRFs. The paper is organaized a follows: In Section 2 and 3, we describe the model and present the decoding algorithm. Section 4 describes the learning methods applicable to our model and the baseline models. In Section 5 and 6, we describe the experiments and the results. 1 Introduction The Viterbi algorithm and the CKY algorithms are two decoding algorithms essential to the area of natural language processing. The former models a linear chain of labels such as part of speech tags, and the latter models a parse tree. Both are used to extract the best prediction from the model (Manning and Schutze, 1999). However, some tasks seem to fall between the two, having more than one layer but flatter than the trees created by parsers. For example, in relation extraction, we have entities in one layer and relations between entities as another layer. Another task 763 Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 763­770, Sydney, July 2006. c 2006 Association for Computational Linguistics Token U .K . base rates are at their highest level in eight years . P OS JADJ NOUN NOUN VER B OTHER OTHER JADJ NOUN OTHER OTHER NOUN OTHER NP B I I O O B I I O B I O Table 1: Example with POS and NP tags 2 Model for Joint Labeling and Chunking Consider the task of finding noun chunks. The noun chunk extends from the beginning of a noun phrase to the head noun, excluding postmodifiers (which are difficult to attach correctly). Table 1 shows a sentence labeled with POS tags and segmented into noun chunks. B marks the first word of a noun chunk, I the other words in a noun chunk, and O the words that are not in a noun chunk. Note that we collapsed the 45 different POS labels into 5 labels, following (McCallum et al, 2003). All different types of adjectives are labeled as JADJ. Each word carries two tags. Given the first layer, our aim is to present a model that can predict the second and third layers of tags at the same time. n Assume we have n training samples, {(xi , y i )}i=1 , where xi is a sequence of input tokens and y i is a label-chunk structure for xi . In this example, the first column contains the tokens xi and the second and third columns together represent the label-chunk structures y i . We will present an efficient exact decoding for this structure. The label-chunk structure, shown in Table 2, is a representation of the two layers of tags. The tuples in Table 2 are called parts. If the token at index r carries a POS tag P and a chunk tag C , the first layer includes part C, P, r . This part is called a node. If the tokens at index r - 1 and r are in the same chunk, and C is the label of that chunk, the first layer also includes part C, P 0, P, r - 1, r (where P 0 and P are the POS tags of the tokens at r - 1 and r Token U. K . b ase r ates a re a t t heir h ighest l evel i n e ight y ears . First Layer (POS) I, JADJ , 0 I , JADJ , NOUN , 0, 1 I, NOUN , 1 I , NOUN , NOUN , 1, 2 I, NOUN , 2 O, VERB, 3 O , VERB, OTHER, 3, 4 O, OTHER, 4 I, OTHER, 5 I , OTHER, JADJ , 5, 6 I, JADJ , 6 I , JADJ , NOUN , 6, 7 I, NOUN , 7 O, OTHER, 8 I, OTHER, 9 I , OTHER, NOUN , 9, 10 I, NOUN , 10 O, OTHER, 11 T Second Layer (NP) I, 0, 2 I , O, 2, 3 O, 3, 4 O , I , 4, 5 I, 5, 7 I , O, 7, 8 O, 8, 8 O , I , 8, 9 I, 9, 10 I , O, 10, 11 O, 11, 11 able 2: Example Parts respectively). This part is called a transition. If a chunk tagged C extends from the token at q to the token at r inclusive, the second layer includes part C, q , r . This part is a chunk node. And if the token at q - 1 is the last token in a chunk tagged C 0, while the token at q is the first token of a chunk tagged C , the second layer includes part C0, C, q - 1, q . This part is a chunk transition. In this paper we use the common method of factoring the score of the label-chunk structure as the sum of the scores of all the parts. Each part in a label-chunk structure can be lexicalized, and gives rise to several features. For each feature, we have a corresponding weight. If we sum up the weights for these features, we have the score for the part, and if we sum up the scores of the parts, we have the score for the label-chunk structure. Suppose we would like to score a pair (xi , y i ) in the training set, and it happens to be the one shown in Table 2. To begin, let's say we would like to find the features for the part I, NOUN , 7 of POS node type (1st Layer). This is the NOUN tag on the seventh token "level" in Table 2. By default, the POS node type generates the following binary feature. · Is there a token labeled with "NOUN" in a chunk labeled with "I"? 764 Now, to have more features, we can lexicalize POS node type. Suppose we use xr to lexicalize POS node C, P, r , then we have the following binary feature, as it is I, NOUN , 7 and xi = "level". 7 · Is there a token "level" labeled with "NOUN" in a chunk labeled with "I"? We can also use xr-1 and xr to lexicalize the parts of POS node type. · Is there a token "level" labeled with "NOUN" in a chunk labeled with "I" that's preceded by "highest"? This way, we have a complete specification of the feature set given the part type, lexicalization for each part type and the training set. Let us define f a boolean feature vector function such that each dimension of f (xi , y i ) contains 1 if the pair (xi , y i ) has the feature, 0 otherwise. Now define a realvalued weight vector w with the same dimension as f . To represent the score of the pair (xi , y i ), we write s(xi , y i ) = w f (xi , y i ) We could also have w f (xi , {p}) where p just a single part, in which case we just write s(p). Assuming an appropriate feature representation as well as a weight vector w, we would like to find the highest scoring label-chunk structure y = ar gmaxy (w f (x, y )) given an input sentence x. In the upcoming section, we present a decoding algorithm for the label-chunk structures, and later we give a method for learning the weight vector used in the decoding. that has the ending chunk tag C 0. Now checking the chunk transition from C 0 to C at the index q is simple, and we can record the score of this chunk to chunk table[r ][C ], so that the next chunk starting at r can use this information. In short, we are executing two Viterbi algorithms on the first and second layer at the same time. One extends [q , r - 1] to [q , r ], considering the node indexed by r (first layer). The other extends [0, q ] to [0, r ], considering the node indexed by [q , r ] (second layer). The dynamic programming table for the first layer is kept in the label table (r - 1 and P 0 are used in the Viterbi algorithm for this layer) and that for the second layer in the chunk table (q and C 0 used). The algorithm returns the best score of the label-chunk structure. To recover the structure, we simply need to maintain back pointers to the items that gave rise to the each item in the dynamic programming table. This is just like maintaining back pointers in the Viterbi algorithm for sequences, or the CKY algorithm for parsing. The pseudo-code shows that the run-time complexity of the decoding algorithm is O(n2 ) unlike that of CFG parsing, O(n3 ). Thus the algorithm performs better on long sentences. On the other hand, the constant is c2 p2 where c is the number of chunk tags and p is the number of POS tags. 4 4.1 Learning Voted Perceptron 3 Decoding The decoding algorithm is shown in Figure 1. The idea is to use two tables for dynamic programming: label table and chunk table. Suppose we are examining the current position r , and would like to consider extending the chunk [q , r - 1] to [q , r ]. We need to know the chunk tag C for [q , r - 1] and the last POS tag P 0 at index r - 1. The array entry label table[q ][r - 1] keeps track of this information. Then we examine how the current chunk is connected with the previous chunk. The array entry chunk table[q ][C 0] keeps track of the score of the best label-chunk structure from 0 up to the index q In the CKY and Viterbi decoders, we use the forward-backward or inside-outside algorithm to find the marginal probabilities. Since we don't yet have the inference algorithm to find the marginal probabilities of the parts of a label-chunk structure, we use an online learning algorithm to train the model. Despite this restriction, the voted perceptron is known for its performance (Sha and Pereira, 2003). The voted perceptron we use is the adaptation of (Freund and Schapire, 1999) to the structured setting. Algorithm 4.1 shows the pseudo code for the training, and the function update(wk , xi , y i , y ) returns wk - f (xi , y ) + f (xi , y i ) . Given a training set {(xi y i )}n 1 and the epoch i= number T, Algorithm 4.1 will return a list of 765 Algorithm 3.1: D E C O D E (the scoring function s(p)) scor e := 0; for q := index star t to index end for length := 1 to index end - q r := q + length; for each Chunk Tag C for each Chunk Tag C 0 for each POS Tag P for each POS Tag P 0 scor e := 0; if (length > 1) #Add the score of the transition from r-2 to r-1. (1st Layer, P OS) scor e := scor e + s( C, P 0, P, r - 2, r - 1 ) + label table[q ][r - 1][C ][P 0]; #Add the score of the node at r-1. (1st Layer, POS) scor e := scor e + s( C, P, r - 1 ); if (scor e >= label table[q ][r ][C ][P ]) label table[q ][r ][C ][P ] := scor e; #Add the score of the chunk node at [q,r-1]. (2nd Layer, NP) scor e := scor e + s( C, q , r - 1 ); if (index star t < q ) #Add the score of the chunk transition from q-1 to q. (2nd Layer, NP) scor e := scor e + s( C0, C, q - 1, q ) + chunk table[q ][C 0]; if (scor e >= chunk table[r ][C ]) chunk table[r ][C ] := scor e; end for end for end for end for end for end for scor e := 0; for each C in chunk tags if (chunk table[index end][C ] >= scor e) scor e := chunk table[index end][C ]; last sy mbol := C ; end for return (scor e) Note: Since the scoring function s(p) is defined as w f (xi , {p}), the input sequence xi and the weight vector w are also the inputs to the algorithm. Figure 1: Decoding Algorithm 766 weighted perceptrons {(w1 , c1 ), ..(wk , ck )}. The final model V uses the weight vector w= (Collins, 2002). Algorithm 4.1: T R A I N(T , {(xi , y i )}n 1 ) i= k := 0; w1 := 0; c1 := 0; for t := 1 to T for i := 1 to n y := ar g maxy (wk f (y , xi )) i i f (y = y ) ck := ck + 1; else wk+1 := update(wk , xi , y i , y ); ck+1 := 1; k := k + 1; ck := ck + 1; end for end for return ({(w1 , c1 ), ..(wk , ck )}) After taking the Lagrange dual formation, we have max - 0 k j =1 (cj wj ) 1 2 i ,y i (y )(f (xi , y i ) - f (xi , y )) 2 + i ,y i (y )li (y ) Tn and w= such that y i (y ) = C i ,y i (y )(f (xi , y i ) - f (xi , y )) (1) This quadratic program can be optimized by bicoordinate descent, known as Sequential Minimum Optimization. Given an example i and two labelchunk structures y and y , d= li (y ) - li (y ) - (s(xi , y ) - s(xi , y )) (2) fi (y ) - fi (y ) 2 = max(-i (y ), min(d, i (y )) The updated values are : i (y ) := i (y ) + and i (y ) := i (y ) - . Using the equation (1), any increase in can be translated to w. For a naive SMO, this update is executed for each training sample i, for all pairs of possible parses y and y for xi . See (Taskar and Klein, 2005; Zhang, 2001; Jaakkola et al, 2000). Here is where we differ from (Taskar et al, 2004). We choose y to be the correct parse y i , and y to be the best runner-up. After setting the initial weights using y i , we also set i (y i ) = 1 and i (y ) = 0. Although these alphas are not correct, as optimization nears the end, the margin is wider; i (y i ) and i (y ) gets closer to 1 and 0 respectively. Given this approximation, we can compute . Then, the function update(wk , xi , y i , y ) will return wk - f (xi , y ) + f (xi , y i ) and we have reduced the SMO to the perceptron weight update. 4.2.2 Margin Infused Relaxed Algorithm We can think of maximizing the margin in terms of extending the Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer, 2003; Crammer et al, 2003) to learning with structured outputs. (McDonald et al, 2005) presents this approach for dependency parsing. In particuler, Single-best MIRA (McDonald et al, 2005) uses only the single margin constraint for the runner up y with the highest score. The resulting online update would be wk+1 with the following Algorithm 4.2: U P D AT E 1(wk , x i , yi, y) return (wk - f (xi , y ) + f (xi , y i )) Algorithm 4.3: U P D AT E 2(wk , x i i i , yi, y) i s(x ,y = max(0, min( li (y )f-(y i )-f )+s(x ,y ) , 1)); i i (y ) 2 i return (wk - f (x , y ) + f (xi , y i )) 4.2 Max Margin 4.2.1 Sequential Minimum Optimization A max margin method minimizes the regularized empirical risk function with the hard (penalized) margin min w 1 w 2 2 - i (s(xi , y i ) - max(s(xi , y ) - li (y ))) y li finds the loss for y with respect to y i , and it is assumed that the function is decomposable just as y is decomposable to the parts. This equation is equivalent to i minw 1 w 2 + C 2 i , y i ) + s(xi , y ) - l (y ) i, y , s(x i i i 767 condition: min wk+1 - wk such that s(xi , y i ) - s(xi , y ) li (y ) where y = ar gmaxy s(xi , y ). Incidentally, the equation (2) for d above when i (y i ) = 1 and i (y ) = 0 solves this minimization problem as well, and the weight update is the same as the SMO case. 4.2.3 Conditional Random Fields Instead of minimizing the regularized empirical risk function with the hard (penalized) margin, conditional random fields try to minimize the same with the negative log loss: min w Original all different types of nouns all different types of verbs all different types of adjectives all different types of adverbs the remaining POS labels Collapsed NOUN VE R B JADJ RBP OTHER Table 3: Rules for collapsing POS tags Token U. K . base rates are at their highest level in eight years . P OS JJ NN NNS VB P IN P RP $ JJS NN IN CD NNS . Collapsed JADJ NOUN NOUN VE R B OTHER OTHER JADJ NOUN OTHER OTHER NOUN OTHER Chunk B - NP I - NP I - NP B - VP B -P P B - NP I - NP I - NP B -P P B - NP I - NP O NP B I I O O B I I O B I O 1 w 2 2 - i (s(xi , y i ) - log( y s(xi , y ))) Usually, CRFs use marginal probabilities of parts to do the optimization. Since we have not yet come up with the algorithm to compute marginals for a label-chunk structure, the training methods for CRFs is not applicable to our purpose. However, on sequence labeling tasks CRFs have shown very good performance (Lafferty et al, 2001; Sha and Pereira, 2003), and we will use them for the baseline comparison. Table 4: Example with POS and NP labels, before and after collapsing the labels. We present two experiments: one comparing our label-chunk model with a cascaded linear-chain model and a simple linear-chain model, and one comparing different learning algorithms. The cascaded linear-chain model uses one linearchain model to predict POS tags, and another linearchain model to predict NP labels, using the POS tags predicted by the first model as a feature. More specifically, we trained a POS-tagger using the training set D. We then used the learned model and replaced the POS labels of the test set with the labels predicted by the learned model. The linearchain NP chunker was again trained on D and evaluated on this new test set with POS supplied by the earlier processing. Note that the new test set has exactly the same word tokens and noun chunks as the original test set. 5.2 5.2.1 Systems POS Tagger and NP Chunker 5 Experiments 5.1 Task: Base Noun Phrase Chunking The data for the training and evaluation comes from the CoNLL 2000 shared task (Tjong Kim Sang and Buchholz, 2000), which is a portion of the Wall Street Journal. We consider each sentence to be a training instance xi , with single words as tokens. The shared task data have a standard training set of 8936 sentences and a test set of 2012 sentences. For the training, we used the first 447 sentences from the standard training set, and our evaluation was done on the standard test set of the 2012 sentences. Let us define the set D to be the first 447 samples from the standard training set . There are 45 different POS labels, and the three NP labels: begin-phrase, inside-phrase, and other. (Ramshaw and Marcus, 1995) To reduce the inference time, following (McCallum et al, 2003), we collapsed the 45 different POS labels contained in the original data. The rules for collapsing the POS labels are listed in the Table 3. There are three versions of POS taggers and NP chunkers: CRF, VP, MMVP. For CRF, L-BFGS, a quasi-Newton optimization method was used for the training, and the implementation we used is CRF++ (Kudo, 2005). VP uses voted perceptron, and MMVP uses max margin update for the voted perceptron. For the voted perceptron, we used aver- 768 if xq matches [A-Z][a-z]+ [ A- Z ] [ A- Z ] + [A-Z]+[a-z]+[A-Z]+[a-z] .*[0-9].* then tq is CAPITAL C AP ONE C AP AL L C A P MI X N U MB E R Table 5: Rules to create tq for each token xq First Layer (POS) x Node C, P, r r -1 xr xr +1 tr Second Layer (NP) x Node C, q , r q Trans. C, P 0, P, r - 1, r x r -1 xr POS tagging CR F VP MMV P NP chunking CR F VP MMV P Both POS & NP CR F + CR F VP + VP MMV P + MMV P VP Joint MMVP Joint P OS 91.56% 90.55% 90.02% P OS given given given P OS above above above 88.42% 88.69% NP N/ A N/ A N/ A NP 94.44% 94.28% 94.17% NP 90.16% 89.21% 88.95% 90.60% 90.84% F1 N/ A N/ A N/ A F1 87.52% 86.96% 86.79% F1 79.08% 76.26% 75.28% 79.69% 80.34% Table 7: Performance Trans. C0, C, q - 1, q x q -1 xq x q -1 xr xr +1 Table 6: Lexicalized Features for Joint Models aging of the weights suggested by (Collins, 2002). The features are exactly the same for all three systems. 5.2.2 Cascaded Models For each CRF, VP, MMVP, the output of a POS tagger was used as a feature for the NP chunker. The feeds always consist of a POS tagger and NP chunker of the same kind, thus we have CRF+CRF, VP+VP, and MMVP+MMVP. 5.2.3 Joint Models Since CRF requires the computation of marginals for each part, we were not able to use the learning method. VP and MMVP were used to train the labelchunk structures with the features explained in the following section. 5.3 Features updated. The intention is to have more weights on the unlexicalized features, so that when lexical feature is not found, unlexicalized features could provide useful information and avoid overfitting, much as back-off probabilities do. 6 Result We evaluated the performance of the systems using three measures: POS accuracy, NP accuracy, and F1 measure on NP. These figures show how errors accumulate as the systems are chained together. For the statistical significance testing, we have used pairsamples t test, and for the joint labeling and chunking task, everything was found to be statistically significant except for CRF + CRF vs VP Joint. One can see that the systems with joint labeling and chunking models perform much better than the cascaded models. Surprisingly, the perceptron update motivated by the max margin principle performed significantly worse than the simple perceptron update for linear-chain models but performed better on joint labeling and chunking. Although joint labeling and chunking model takes longer time per sample because of the time complexity of decoding, the number of iteration needed to achieve the best result is very low compared to other systems. The CPU time required to run 10 iterations of MMVP is 112 minutes. First, as a preprocessing step, for each word token xq , feature tq was created with the rule in Table 5, and included in the input files. This feature is included in x along with the word tokens. The feature tells us whether the token is capitalized, and whether digits occur in the token. No outside resources such as a list of names or a gazetteer were used. Table 6 shows the lexicalized features for the joint labeling and chunking. For the first iteration of training, the weights for the lexicalized features were not 7 Conclusion We have presented the decoding algorithm for labelchunk structure and showed its effectiveness in finding two layers of information, POS tags and NP chunks. This algorithm has a place between the 769 POS tagging VP M M VP CR F NP chunking VP M M VP CR F Both POS & NP VP M M VP Iterations 30 40 126 Iterations 70 50 101 Iterations 10 10 T. Kudo 2005. CRF++: Yet Another CRF toolkit. Available at http://chasen.org/~taku/software/CRF++/ J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. of the 18th International Conference on Machine Learning (ICML) F. Peng and A. McCallum. 2004. Accurate Information Extraction from Research Papers using Conditional Random Fields. In Proc. of the Human Language Technology Conf. (HLT) F. Sha and F. Pereira. 2003. Shallow parsing with conditional random fields. In Proc. of the Human Language Technology Conf. (HLT) C. Manning and H. Schutze. 1999. Foundations of Statistical Natural Language Processing MIT Press. A. McCallum, K. Rohanimanesh and C. Sutton. 2003. Dynamic Conditional Random Fields for Jointly Labeling Multiple Sequences. In Proc. of Workshop on Syntax, Semantics, Statistics. (NIPS) R. McDonald, K. Crammer, and F. Pereira. 2005. Online large-margin training of dependency parsers. In Proc. of the 43rd Annual Meeting of the ACL L. Ramshaw and M. Marcus. 1995. Text chunking using transformation-based learning. In Proc. of Third Workshop on Very Large Corpora. ACL C. Sutton, K. Rohanimanesh and A. McCallum. 2004. Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data. In Proc. of the 21st International Conference on Machine Learning (ICML) B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning 2004. Max Margin Parsing. In Proc. of Empirical Methods in Natural Language Processing (EMNLP) B. Taskar and D. Klein. 2005. Max-Margin Methods for NLP: Estimation, Structure, and Applications Available at http://www.cs.berkeley.edu/~taskar/pubs/maxmargin-acl05-tutorial.pdf E. F. Tjong Kim Sang and S. Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Proc. of the 4th Conf. on Computational Natural Language Learning (CoNLL) T. Zhang. 2001. Regularized winnow methods. In Advances in Neural Information Processing Systems 13 Table 8: Iterations needed for the result Viterbi algorithm for linear-chain models and the CKY algorithm for parsing, and the time complexity is O(n2 ). The use of our label-chunk structure significantly boosted the performance over cascaded CRFs despite the online learning algorithms used to train the system, and shows itself as a promising alternative to cascaded models, and possibly dynamic conditional random fields for modeling two layers of tags. Further work includes applying the algorithm to relation extraction, and devising an effective algorithm to find the marginal probabilities of parts. References M. Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. of Empirical Methods in Natural Language Processing (EMNLP) K. Crammer and Y. Singer. 2003. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research K. Crammer, O. Dekel, S. Shalev-Shwartz, and Y. Singer. 2003. Online passive aggressive algorithms. In Advances in Neural Information Processing Systems 15 K. Crammer, R. McDonald, and F. Pereira. 2004. New large margin algorithms for structured prediction. In Learning with Structured Outputs Workshop (NIPS) Y. Freund and R. Schapire 1999. Large Margin Classification using the Perceptron Algorithm. In Machine Learning, 37(3):277-296. T.S. Jaakkola, M. Diekhans, and D. Haussler. 2000. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology 770