Semi-Supervised Conditional Random Fields for Improved Sequence Segmentation and Labeling

Feng Jiao (University of Waterloo); Shaojun Wang, Chi-Hoon Lee, Russell Greiner, Dale Schuurmans (University of Alberta)

Abstract

We present a new semi-supervised training procedure for conditional random fields (CRFs) that can be used to train sequence segmentors and labelers from a combination of labeled and unlabeled training data. Our approach is based on extending the minimum entropy regularization framework to the structured prediction case, yielding a training objective that combines unlabeled conditional entropy with labeled conditional likelihood. Although the training objective is no longer concave, it can still be used to improve an initial model (e.g. obtained from supervised training) by iterative ascent. We apply our new training algorithm to the problem of identifying gene and protein mentions in biological texts, and show that incorporating unlabeled data improves the performance of the supervised CRF in this case.

1 Introduction

Semi-supervised learning is often touted as one of the most natural forms of training for language processing tasks, since unlabeled data is so plentiful whereas labeled data is usually quite limited or expensive to obtain. The attractiveness of semi-supervised learning for language tasks is further heightened by the fact that the models learned are large and complex, and generally even thousands of labeled examples can only sparsely cover the parameter space. Moreover, in complex structured prediction tasks, such as parsing or sequence modeling (part-of-speech tagging, word segmentation, named entity recognition, and so on), it is considerably more difficult to obtain labeled training data than for classification tasks (such as document classification), since hand-labeling individual words and word boundaries is much harder than assigning text-level class labels.

Many approaches have been proposed for semi-supervised learning in the past, including: generative models (Castelli and Cover 1996; Cohen and Cozman 2006; Nigam et al. 2000), self-learning (Celeux and Govaert 1992; Yarowsky 1995), co-training (Blum and Mitchell 1998), information-theoretic regularization (Corduneanu and Jaakkola 2006; Grandvalet and Bengio 2004), and graph-based transductive methods (Zhou et al. 2004; Zhou et al. 2005; Zhu et al. 2003). Unfortunately, these techniques have been developed primarily for single class label classification problems, or class label classification with a structured input (Zhou et al. 2004; Zhou et al. 2005; Zhu et al. 2003). Although still highly desirable, semi-supervised learning for structured classification problems like sequence segmentation and labeling has not been as widely studied as the other semi-supervised settings mentioned above, with the sole exception of generative models.

With generative models, it is natural to include unlabeled data using an expectation-maximization approach (Nigam et al. 2000). However, generative models generally do not achieve the same accuracy as discriminatively trained models, and therefore it is preferable to focus on discriminative approaches. Unfortunately, it is far from obvious how unlabeled training data can be naturally incorporated into a discriminative training criterion. For example, unlabeled data simply cancels from the objective if one attempts to use a traditional conditional likelihood criterion. Nevertheless, recent progress has been made on incorporating unlabeled data in discriminative training procedures. For example, dependencies can be introduced between the labels of nearby instances and thereby have an effect on training (Zhu et al. 2003; Li and McCallum 2005; Altun et al. 2005). These models are trained to encourage nearby data points to have the same class label, and they can obtain impressive accuracy using a very small amount of labeled data. However, since they model pairwise similarities among data points, most of these approaches require joint inference over the whole data set at test time, which is not practical for large data sets.

In this paper, we propose a new semi-supervised training method for conditional random fields (CRFs) that incorporates both labeled and unlabeled sequence data to estimate a discriminative structured predictor. CRFs are a flexible and powerful model for structured predictors based on undirected graphical models that have been globally conditioned on a set of input covariates (Lafferty et al. 2001). CRFs have proved to be particularly useful for sequence segmentation and labeling tasks, since, as conditional models of the labels given inputs, they relax the independence assumptions made by traditional generative models like hidden Markov models. As such, CRFs provide additional flexibility for using arbitrary overlapping features of the input sequence to define a structured conditional model over the output sequence, while maintaining two advantages: first, efficient dynamic programming can be used for inference in both classification and training, and second, the training objective is concave in the model parameters, which permits global optimization.

To obtain a new semi-supervised training algorithm for CRFs, we extend the minimum entropy regularization framework of Grandvalet and Bengio (2004) to structured predictors. The resulting objective combines the likelihood of the CRF on labeled training data with its conditional entropy on unlabeled training data. Unfortunately, the maximization objective is no longer concave, but we can still use it to effectively improve an initial supervised model.

To develop an effective training procedure, we first show how the derivative of the new objective can be computed from the covariance matrix of the features on the unlabeled data (combined with the labeled conditional likelihood). This relationship facilitates the development of an efficient dynamic programming algorithm for computing the gradient, and thereby allows us to perform efficient iterative ascent for training. We apply our new training technique to the problem of sequence labeling and segmentation, and demonstrate it specifically on the problem of identifying gene and protein mentions in biological texts. Our results show the advantage of semi-supervised learning over the standard supervised algorithm.

2 Semi-supervised CRF training

In what follows, we use the same notation as (Lafferty et al. 2001). Let $X$ be a random variable over data sequences to be labeled, and $Y$ be a random variable over corresponding label sequences. All components $Y_i$ of $Y$ are assumed to range over a finite label alphabet $\mathcal{Y}$. For example, $X$ might range over sentences and $Y$ over part-of-speech taggings of those sentences; hence $\mathcal{Y}$ would be the set of possible part-of-speech tags in this case.

Assume we have a set of labeled examples, $D^l = \{(x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)})\}$, and unlabeled examples, $D^u = \{x^{(N+1)}, \dots, x^{(M)}\}$. We would like to build a CRF model

$$p_\theta(y \mid x) = \frac{1}{Z_\theta(x)} \exp\big(\langle \theta, f(x, y) \rangle\big)$$

over sequential input and output data $(x, y)$, where $\theta = (\theta_1, \dots, \theta_K)^\top$ is a parameter vector, $f(x, y) = (f_1(x, y), \dots, f_K(x, y))^\top$ is a vector of feature functions, and $Z_\theta(x) = \sum_{y} \exp\big(\langle \theta, f(x, y) \rangle\big)$ is the normalization constant. Our goal is to learn such a model from the combined set of labeled and unlabeled examples, $D^l \cup D^u$.

The standard supervised CRF training procedure is based upon maximizing the log conditional likelihood of the labeled examples in $D^l$,

$$CL(\theta) = \sum_{i=1}^{N} \log p_\theta(y^{(i)} \mid x^{(i)}) - U(\theta) \qquad (1)$$

where $U(\theta)$ is any standard regularizer on $\theta$, e.g. a squared $\ell_2$-norm penalty. Regularization can be used to limit over-fitting on rare features and avoid degeneracy in the case of correlated features. Obviously, (1) ignores the unlabeled examples in $D^u$.

To make full use of the available training data, we propose a semi-supervised learning algorithm that exploits a form of entropy regularization on the unlabeled data. Specifically, for a semi-supervised CRF, we propose to maximize the following objective

$$RL(\theta) = \sum_{i=1}^{N} \log p_\theta(y^{(i)} \mid x^{(i)}) - U(\theta) + \gamma \sum_{j=N+1}^{M} \sum_{y} p_\theta(y \mid x^{(j)}) \log p_\theta(y \mid x^{(j)}) \qquad (2)$$

where the first two terms are the penalized log conditional likelihood of the labeled data under the CRF, (1), and the final term is the negative conditional entropy of the CRF on the unlabeled data. Here, $\gamma$ is a tradeoff parameter that controls the influence of the unlabeled data.
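To make the training criterion concrete, the following is a minimal sketch (not the authors' implementation) that evaluates an objective of the form (2) by brute-force enumeration over all label sequences of a toy linear-chain CRF. The label alphabet, feature map, example data and the squared-penalty regularizer are illustrative assumptions; a real implementation would compute the likelihood and entropy terms with dynamic programming rather than enumeration.

```python
import itertools
import numpy as np

LABELS = [0, 1]                          # toy label alphabet, e.g. {O, B}
VOCAB = 3                                # toy vocabulary size

def features(x, y):
    """Toy feature map f(x, y): emission and transition indicator counts."""
    f = np.zeros(VOCAB * len(LABELS) + len(LABELS) ** 2)
    for t, (xt, yt) in enumerate(zip(x, y)):
        f[xt * len(LABELS) + yt] += 1.0                                   # emission feature
        if t > 0:
            f[VOCAB * len(LABELS) + y[t - 1] * len(LABELS) + yt] += 1.0   # transition feature
    return f

def log_p(theta, x, y):
    """log p_theta(y | x), computing Z_theta(x) by enumeration (toy sizes only)."""
    scores = {yy: float(theta @ features(x, list(yy)))
              for yy in itertools.product(LABELS, repeat=len(x))}
    log_Z = np.logaddexp.reduce(list(scores.values()))
    return scores[tuple(y)] - log_Z

def objective(theta, labeled, unlabeled, gamma, sigma2=10.0):
    """Eq. (2): labeled log-likelihood - U(theta) + gamma * (negative entropy on unlabeled data)."""
    obj = sum(log_p(theta, x, y) for x, y in labeled)
    obj -= float(theta @ theta) / (2.0 * sigma2)          # U(theta): assumed squared penalty
    for x in unlabeled:
        for yy in itertools.product(LABELS, repeat=len(x)):
            lp = log_p(theta, x, list(yy))
            obj += gamma * np.exp(lp) * lp                # sum_y p log p
    return obj

labeled = [([0, 1, 2], [0, 1, 1])]                        # one labeled toy sequence
unlabeled = [[2, 1, 0]]                                   # one unlabeled toy sequence
theta = np.zeros(VOCAB * len(LABELS) + len(LABELS) ** 2)
print(objective(theta, labeled, unlabeled, gamma=0.5))
```

Setting gamma to 0 recovers the supervised objective (1), which is a convenient sanity check for a sketch of this kind.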
This approach resembles that taken by (Grandvalet and Bengio 2004) for single variable classification, but here applied to structured CRF training. The motivation is that minimizing conditional entropy over unlabeled data encourages the algorithm to find putative labelings for the unlabeled data that are mutually reinforcing with the supervised labels; that is, greater certainty on the putative labelings coincides with greater conditional likelihood on the supervised labels, and vice versa. For a single classification variable this criterion has been shown to effectively partition unlabeled data into clusters (Grandvalet and Bengio 2004; Roberts et al. 2000).

To motivate the approach in more detail, consider the overlap between the conditional distribution $p_\theta(y \mid x)$ over a label sequence $y$ and the empirical distribution $\tilde{p}(x)$ of $x$ on the unlabeled data $D^u$. The overlap can be measured by the Kullback-Leibler divergence $D\big(p_\theta(y \mid x)\tilde{p}(x) \,\|\, \tilde{p}(x)\big)$. It is well known that Kullback-Leibler divergence (Cover and Thomas 1991) is positive and increases as the overlap between the two distributions decreases. In other words, maximizing Kullback-Leibler divergence implies that the overlap between the two distributions is minimized. The total overlap over all possible label sequences can be defined as

$$\sum_{y} D\big(p_\theta(y \mid x)\tilde{p}(x) \,\|\, \tilde{p}(x)\big)
= \sum_{y} \sum_{x} p_\theta(y \mid x)\,\tilde{p}(x) \log \frac{p_\theta(y \mid x)\tilde{p}(x)}{\tilde{p}(x)}
= \sum_{x} \tilde{p}(x) \sum_{y} p_\theta(y \mid x) \log p_\theta(y \mid x)$$

which is the expected negative conditional entropy on the unlabeled data, and which motivates the negative entropy term in (2).

The combined training objective (2) exploits unlabeled data to improve the CRF model, as we will see. However, one drawback with this approach is that the entropy regularization term is not concave. To see why, note that the entropy regularizer can be seen as a composition, $\Lambda(\theta) = R(q(\theta))$, where $R(q) = \sum_{y} q_{y} \log q_{y}$ and $q_{y}(\theta) = p_\theta(y \mid x^{(j)})$. For scalar $\theta$, the second derivative of a composition $\Lambda = R \circ q$ is given by (Boyd and Vandenberghe 2004)

$$\Lambda''(\theta) = q'(\theta)^\top \nabla^2 R(q(\theta))\, q'(\theta) + \nabla R(q(\theta))^\top q''(\theta).$$

A composition of this form is guaranteed to be concave only under additional conditions, for example when the outer function is concave and nondecreasing and the inner function is concave (Boyd and Vandenberghe 2004), and these conditions do not hold here. Hence $\Lambda$ is not necessarily concave, and so in general there are local maxima in (2).

3 An efficient training procedure

As (2) is not concave, many of the standard global maximization techniques do not apply. However, one can still use unlabeled data to improve a supervised CRF via iterative ascent. To derive an efficient iterative ascent procedure, we need to compute the gradient of (2) with respect to the parameters $\theta$. Taking the derivative of the objective function (2) with respect to $\theta$ yields (see Appendix A for the derivation)

$$\nabla RL(\theta) = \sum_{i=1}^{N} \Big( f(x^{(i)}, y^{(i)}) - \sum_{y} p_\theta(y \mid x^{(i)})\, f(x^{(i)}, y) \Big) - \nabla U(\theta) + \gamma \sum_{j=N+1}^{M} \mathrm{cov}_{p_\theta(y \mid x^{(j)})}\big(f(x^{(j)}, y)\big)\, \theta \qquad (3)$$

The first three items on the right hand side are just the standard gradient of the CRF objective (1) (Lafferty et al. 2001), and the final item is the gradient of the entropy regularizer, the derivation of which is given in Appendix A. Here, $\mathrm{cov}_{p_\theta(y \mid x^{(j)})}\big(f(x^{(j)}, y)\big)$ is the conditional covariance matrix of the features, given the sample sequence $x^{(j)}$. In particular, the $(k, l)$th element of this matrix is given by

$$\mathrm{cov}\big(f_k, f_l\big) = E_{p_\theta(y \mid x^{(j)})}\big[ f_k(x^{(j)}, y)\, f_l(x^{(j)}, y) \big] - E_{p_\theta(y \mid x^{(j)})}\big[ f_k(x^{(j)}, y) \big]\, E_{p_\theta(y \mid x^{(j)})}\big[ f_l(x^{(j)}, y) \big] \qquad (4)$$

To efficiently calculate the gradient, we need to be able to efficiently compute the expectations with respect to $y$ in (3) and (4). However, this can pose a challenge in general, because there are exponentially many values for $y$. Techniques for computing the linear feature expectations in (3) are already well known if $y$ is sufficiently structured (e.g. $y$ forms a Markov chain) (Lafferty et al. 2001). However, we now have to develop efficient techniques for computing the quadratic feature expectations in (4).
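Before turning to the dynamic program, the following sketch shows, on a deliberately tiny and hypothetical example, what the last term of (3) requires for a single unlabeled sequence: the feature covariance of (4) multiplied by $\theta$. It also anticipates the observation in Section 4 below that this product can be formed without ever materializing the full covariance matrix. The enumeration of labelings is assumed to come from a toy problem; none of this code is from the paper.

```python
import numpy as np

def entropy_gradient_term(labelings, theta):
    """Last term of (3) for one unlabeled sequence x (without the gamma factor):
    cov_{p_theta(y|x)}[f(x,y)] @ theta, with the covariance built explicitly as in (4).
    `labelings` is a list of (p_theta(y|x), f(x,y)) pairs over all label sequences y,
    obtained e.g. by brute-force enumeration on a toy problem."""
    probs = np.array([p for p, _ in labelings])
    F = np.array([f for _, f in labelings])                 # one feature row per labeling
    E_f = probs @ F                                         # E[f]
    cov = (F * probs[:, None]).T @ F - np.outer(E_f, E_f)   # element (k, l) is Eq. (4)
    return cov @ theta

def entropy_gradient_term_matrix_free(labelings, theta):
    """Same vector, formed as E[f <f, theta>] - E[f] <E[f], theta>, so the K x K
    covariance matrix is never materialized (cf. Section 4)."""
    probs = np.array([p for p, _ in labelings])
    F = np.array([f for _, f in labelings])
    E_f = probs @ F
    return probs @ (F * (F @ theta)[:, None]) - E_f * (E_f @ theta)

# Tiny hypothetical example: three labelings of some x with 2-dimensional features.
labelings = [(0.5, [1.0, 0.0]), (0.3, [0.0, 1.0]), (0.2, [1.0, 1.0])]
theta = np.array([0.4, -0.2])
print(entropy_gradient_term(labelings, theta))
print(entropy_gradient_term_matrix_free(labelings, theta))  # identical vectors
```

The brute-force enumeration is of course exponential in the sequence length, which is exactly why the expectations in (4) need the dynamic programming treatment developed next.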
For the quadratic feature expectations, first note that the diagonal terms, $E_{p_\theta(y \mid x^{(j)})}\big[f_k(x^{(j)}, y)^2\big]$, are straightforward: since each feature is an indicator, we have $f_k(x^{(j)}, y)^2 = f_k(x^{(j)}, y)$, and therefore the diagonal terms in the conditional covariance are just linear feature expectations, which can be computed as before.

For the off-diagonal terms, $E_{p_\theta(y \mid x^{(j)})}\big[f_k(x^{(j)}, y)\, f_l(x^{(j)}, y)\big]$ with $k \neq l$, however, we need to develop a new algorithm. Fortunately, for structured label sequences $y$, one can devise an efficient algorithm for calculating the quadratic expectations based on nested dynamic programming. To illustrate the idea, we assume that the dependencies of $y$, conditioned on $x$, form a Markov chain. Define one feature for each state pair $(y', y)$, and one feature for each state-observation pair $(y, x)$, which we express with the indicator functions

$$f_{\langle y', y\rangle}(\langle u, v\rangle, y|_{\langle u,v\rangle}, x) = \delta(y_u, y')\,\delta(y_v, y)
\qquad\text{and}\qquad
g_{\langle y, x\rangle}(v, y|_v, x) = \delta(y_v, y)\,\delta(x_v, x)$$

respectively. Following (Lafferty et al. 2001), we also add special start and stop states, $Y_0 = \texttt{start}$ and $Y_{n+1} = \texttt{stop}$. The conditional probability of a label sequence can now be expressed concisely in matrix form. For each position $i$ in the observation sequence $x$, define the $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix random variable $M_i(x) = [M_i(y', y \mid x)]$ by

$$M_i(y', y \mid x) = \exp\big(\Lambda_i(y', y \mid x)\big)$$

where $\Lambda_i(y', y \mid x)$ is the sum of the $\theta$-weighted features that are active on the edge $e_i$ with labels $(y_{i-1}, y_i) = (y', y)$ and on the vertex $v_i$ with label $y_i = y$. Here $e_i$ is the edge with labels $(y_{i-1}, y_i)$ and $v_i$ is the vertex with label $y_i$. With these matrices,

$$p_\theta(y \mid x) = \frac{1}{Z_\theta(x)} \prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x), \qquad
Z_\theta(x) = \big[M_1(x)\, M_2(x) \cdots M_{n+1}(x)\big]_{\texttt{start},\,\texttt{stop}}.$$

For each index $i = 0, \dots, n+1$, define the forward vectors $\alpha_i(x)$ with base case

$$\alpha_0(y \mid x) = \begin{cases} 1 & \text{if } y = \texttt{start} \\ 0 & \text{otherwise} \end{cases}$$

and recurrence $\alpha_i(x) = \alpha_{i-1}(x)\, M_i(x)$. Similarly, the backward vectors $\beta_i(x)$ are given by the base case

$$\beta_{n+1}(y \mid x) = \begin{cases} 1 & \text{if } y = \texttt{stop} \\ 0 & \text{otherwise} \end{cases}$$

and recurrence $\beta_i(x) = M_{i+1}(x)\, \beta_{i+1}(x)$. With these definitions, the expectation of the product of each pair of feature functions, $f_k$ and $f_l$, can be computed by nested dynamic programming. First define the summary matrices

$$M_{a+1:b-1}(x) = \prod_{i=a+1}^{b-1} M_i(x),$$

which accumulate the transition scores between two positions $a < b$. Then the quadratic feature expectations can be computed by a recursion that, for every ordered pair of positions at which the two features can fire, combines the forward value up to the earlier position, the matrix entries at the two positions where the features apply, the summary matrix bridging the positions in between, and the backward value beyond the later position, all normalized by $Z_\theta(x)$. The two double sums in each expectation correspond to the two cases depending on which feature occurs first.
The computation of these expectations can be organized in a trellis, as illustrated in Figure 1. Once we obtain the gradient of the objective function (2), we use limited-memory L-BFGS, a quasi-Newton optimization algorithm (McCallum 2002; Nocedal and Wright 2000), to find a local maximum, with the initial value set to the optimal solution of the supervised CRF on the labeled data.

Figure 1: Trellis for computing the expectation of the product of a pair of feature functions, for the case in which the first feature occurs before the second; this case contributes one of the two double sums.

4 Time and space complexity

The time and space complexity of the semi-supervised CRF training procedure is greater than that of standard supervised CRF training, but nevertheless remains a small degree polynomial in the size of the training data. The relevant quantities are the sizes of the labeled and unlabeled training sets, the lengths of the labeled, unlabeled and test sequences, the number of states, and the number of training iterations. The time required to classify a test sequence is the same regardless of the training method, since the same Viterbi decoder is applied to the learned model in either case. For training, the supervised CRF requires the standard forward-backward computations over the labeled sequences, whereas semi-supervised CRF training additionally has to compute the quadratic feature expectations on the unlabeled sequences; the extra nested loop over pairs of sequence positions introduces an additional factor proportional to the sequence length in the running time. However, the space requirements of the two training methods are the same. That is, even though the covariance matrix in (3) is quadratic in the number of features, there is never any need to store the entire matrix in memory. Rather, since we only need to compute the product of the covariance with $\theta$, the calculation can be performed iteratively without using extra space beyond that already required by supervised CRF training.

5 Identifying gene and protein mentions

We have developed our new semi-supervised training procedure to address the problem of information extraction from biomedical text, which has received significant attention in the past few years. We have specifically focused on the problem of identifying explicit mentions of gene and protein names (McDonald and Pereira 2005). Recently, McDonald and Pereira (2005) have obtained interesting results on this problem by using a standard supervised CRF approach. However, our contention is that stronger results could be obtained in this domain by exploiting a large corpus of unannotated biomedical text to improve the quality of the predictions, which we now show.

Given a biomedical text, the task of identifying gene mentions can be interpreted as a tagging task, where each word in the text is labeled with a tag that indicates whether it is the beginning of a gene mention (B), the continuation of a gene mention (I), or outside of any gene mention (O). To compare the performance of different taggers learned by different mechanisms, one can measure the precision, recall and F-measure, given by

precision = # correct predictions / # predicted gene mentions
recall = # correct predictions / # true gene mentions
F-measure = (2 x precision x recall) / (precision + recall)
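As an illustration of the mention-level scoring just defined, the following sketch computes precision, recall and F-measure from BIO tag sequences, counting a predicted gene mention as correct only if its boundaries exactly match a true mention. This is a generic exact-match scorer written for this description; the precise evaluation protocol used by McDonald and Pereira (2005) may differ in details.

```python
def mention_spans(tags):
    """Extract (start, end_exclusive) gene-mention spans from one BIO tag sequence."""
    spans, start = [], None
    for i, t in enumerate(tags):
        if t == "B":
            if start is not None:
                spans.append((start, i))
            start = i
        elif t == "I":
            if start is None:              # ill-formed I without B: treat as a new mention
                start = i
        else:                              # "O"
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return set(spans)

def prf(gold_sentences, pred_sentences):
    """Mention-level precision, recall and F-measure over a corpus of BIO-tagged sentences."""
    n_correct = n_pred = n_gold = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = mention_spans(gold), mention_spans(pred)
        n_correct += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

gold = [["O", "B", "I", "O", "B"], ["O", "O", "B", "I", "I"]]
pred = [["O", "B", "I", "O", "O"], ["O", "B", "I", "I", "I"]]
print(prf(gold, pred))   # one of three gold mentions matched: precision 0.5, recall ~0.33
```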
In our evaluation, we compared the proposed semi-supervised learning approach to the state of the art supervised CRF of McDonald and Pereira (2005), and also to self-training (Celeux and Govaert 1992; Yarowsky 1995), using the same feature set as (McDonald and Pereira 2005). The CRF training procedures, supervised and semi-supervised, were run with the same regularization function $U(\theta)$ used in (McDonald and Pereira 2005).

First we evaluated the performance of the semi-supervised CRF in detail, by varying the ratio between the amount of labeled and unlabeled data, and also varying the tradeoff parameter $\gamma$. We chose a labeled training set A consisting of 5448 words, and considered alternative unlabeled training sets B (5210 words), C (10,208 words) and D (25,145 words), consisting of the same, 2 times and 5 times as many sentences as A respectively. All of these sets were disjoint and selected randomly from the full corpus (the smaller corpus used in (McDonald et al. 2005)), which consists of 184,903 words in total. To determine sensitivity to the parameter $\gamma$ we examined a range of discrete values between 0.1 and 20.

In our first experiment, we trained the CRF models using labeled set A and unlabeled sets B, C and D respectively, and then tested the performance on the sets B, C and D respectively. The results of our evaluation are shown in Table 1. The performance of the supervised CRF algorithm, trained only on the labeled set A, is given in the first row of Table 1 (corresponding to $\gamma = 0$). By comparison, the results obtained by the semi-supervised CRFs on the held-out sets B, C and D are given in Table 1 for increasing values of $\gamma$. The results of this experiment demonstrate quite clearly that in most cases the semi-supervised CRF obtains higher precision, recall and F-measure than the fully supervised CRF, yielding a 20% improvement in the best case.

In our second experiment, we again trained the CRF models using labeled set A and unlabeled sets B, C and D respectively, with increasing values of $\gamma$, but tested the performance on the held-out set E, which is the full corpus minus the labeled set A and the unlabeled sets B, C and D. The results of our evaluation are shown in Table 2 and Figure 2. The supervised baseline in Figure 2 (the curve labeled CRF) is the result of the supervised CRF algorithm, trained only on the labeled set A. In particular, using the supervised CRF model, the system predicted 3334 out of 7472 gene mentions, of which 2435 were correct, resulting in a precision of 0.73, recall of 0.33 and F-measure of 0.45. The other curves are those of the semi-supervised CRFs. The results of this experiment demonstrate quite clearly that the semi-supervised CRFs simultaneously increase both the number of predicted gene mentions and the number of correct predictions; thus the precision remains almost the same as the supervised CRF, while the recall increases significantly.

Both experiments, as illustrated in Figure 2 and Tables 1 and 2, show that clearly better results are obtained by incorporating additional unlabeled training data, even when evaluating on disjoint testing data (Figure 2). The performance of the semi-supervised CRF is not overly sensitive to the tradeoff parameter $\gamma$, except that $\gamma$ cannot be set too large.

5.1 Comparison to self-training

For completeness, we also compared our results to the self-learning algorithm, which has commonly been referred to as bootstrapping in natural language processing and was originally popularized by the work of Yarowsky in word sense disambiguation (Abney 2004; Yarowsky 1995). In fact, similar ideas have been developed in pattern recognition under the name of the decision-directed algorithm (Duda and Hart 1973), and can also be traced back to the 1970s in the EM literature (Celeux and Govaert 1992). The basic algorithm works as follows (a generic sketch of the loop is given at the end of this subsection):

1. Given $D^l$ and $D^u$, begin with a seed set of labeled examples chosen from $D^l$.
2. Repeat:
   (a) Train the supervised CRF on the current set of labeled examples, obtaining parameters $\theta$.
   (b) For each sequence $x^{(j)}$ in $D^u$, find its most likely labeling under the current model via Viterbi decoding or another inference algorithm, and add the resulting pair to the set of labeled examples (replacing any previous label for $x^{(j)}$ if present).
   (c) If the inferred labels did not change for any sequence in $D^u$, stop; otherwise iterate.

We implemented this self-training approach and tried it in our experiments. Unfortunately, we were not able to obtain any improvements over the standard supervised CRF with self-learning, using the labeled set A and the unlabeled sets B, C and D. The semi-supervised CRF remains the best of the approaches we have tried on this problem.
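The self-training loop above can be written generically as follows. The functions train_crf and viterbi_decode are hypothetical stand-ins for a CRF trainer and decoder (they are not a specific library API), and the termination test implements step (c).

```python
def self_train(labeled_seed, unlabeled_xs, train_crf, viterbi_decode, max_iters=50):
    """Generic self-training (decision-directed) loop from Section 5.1.
    `train_crf(pairs)` is assumed to return a trained model and
    `viterbi_decode(model, x)` its most likely label sequence for x;
    both are hypothetical callbacks supplied by the caller."""
    pseudo_labels = {}                                    # index into unlabeled_xs -> labels
    model = None
    for _ in range(max_iters):
        pairs = list(labeled_seed) + [(unlabeled_xs[j], y) for j, y in pseudo_labels.items()]
        model = train_crf(pairs)                          # step (a)
        new_labels = {j: viterbi_decode(model, x)         # step (b): relabel all of D^u
                      for j, x in enumerate(unlabeled_xs)}
        if new_labels == pseudo_labels:                   # step (c): labels unchanged, stop
            break
        pseudo_labels = new_labels
    return model
```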
Figure 2: Performance of the supervised and semi-supervised CRFs on Test Set E: number of correct predictions (TP) as a function of gamma (0.1 to 20). The curves labeled set B, set C and set D refer to the unlabeled training set used by the semi-supervised algorithm; the CRF curve is the supervised baseline.

Table 1: Performance of the semi-supervised CRFs obtained on the held-out sets B, C and D.

 gamma   Test Set B (trained on A and B)   Test Set C (trained on A and C)   Test Set D (trained on A and D)
         Precision  Recall  F-Measure      Precision  Recall  F-Measure      Precision  Recall  F-Measure
 0       0.80       0.36    0.50           0.77       0.29    0.43           0.74       0.30    0.43
 0.1     0.82       0.40    0.54           0.79       0.32    0.46           0.74       0.31    0.44
 0.5     0.82       0.40    0.54           0.79       0.33    0.46           0.74       0.31    0.44
 1       0.82       0.40    0.54           0.77       0.34    0.47           0.73       0.33    0.45
 5       0.84       0.45    0.59           0.78       0.38    0.51           0.72       0.36    0.48
 10      0.78       0.46    0.58           0.66       0.38    0.48           0.66       0.38    0.47

Table 2: Performance of the semi-supervised CRFs on Test Set E, trained by using unlabeled sets B, C and D.

 gamma   Trained on A and B         Trained on A and C         Trained on A and D
         # predicted  # correct     # predicted  # correct     # predicted  # correct
 0.1     3345         2446          3376         2470          3366         2466
 0.5     3413         2489          3450         2510          3376         2469
 1       3446         2503          3588         2580          3607         2590
 5       4089         2878          4206         2947          4165         2888
 10      4450         2799          4762         2827          4778         2845

6 Conclusions and further directions

We have presented a new semi-supervised training algorithm for CRFs, based on extending minimum conditional entropy regularization to the structured prediction case. Our approach is motivated by the information-theoretic argument (Grandvalet and Bengio 2004; Roberts et al. 2000) that unlabeled examples can provide the most benefit when classes have small overlap. An iterative ascent optimization procedure was developed for this new criterion, which exploits a nested dynamic programming approach to efficiently compute the covariance matrix of the features. We applied our new approach to the problem of identifying gene name occurrences in biological text, exploiting the availability of auxiliary unlabeled data to improve the performance of the state of the art supervised CRF approach in this domain. Our semi-supervised CRF approach shares all of the benefits of standard CRF training, including the ability to exploit arbitrary features of the inputs, while obtaining improved accuracy through the use of unlabeled data.
The main drawback is that training time is increased because of the extra nested loop needed to calculate feature covariances. Nevertheless, the algorithm is sufficiently efficient to be trained on unlabeled data sets that yield a notable improvement in classification accuracy over standard supervised training. To further accelerate the training of our semi-supervised CRFs, we may apply the stochastic gradient optimization method with adaptive gain adjustment proposed by Vishwanathan et al. (2006).

Acknowledgments

Research supported by Genome Alberta, Genome Canada, and the Alberta Ingenuity Centre for Machine Learning.

References

S. Abney. (2004). Understanding the Yarowsky algorithm. Computational Linguistics, 30(3):365-395.

Y. Altun, D. McAllester and M. Belkin. (2005). Maximum margin semi-supervised learning for structured variables. Advances in Neural Information Processing Systems 18.

A. Blum and T. Mitchell. (1998). Combining labeled and unlabeled data with co-training. Proceedings of the Workshop on Computational Learning Theory, 92-100.

S. Boyd and L. Vandenberghe. (2004). Convex Optimization. Cambridge University Press.

V. Castelli and T. Cover. (1996). The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans. on Information Theory, 42(6):2102-2117.

G. Celeux and G. Govaert. (1992). A classification EM algorithm for clustering and two stochastic versions. Computational Statistics and Data Analysis, 14:315-332.

I. Cohen and F. Cozman. (2006). Risks of semi-supervised learning. Semi-Supervised Learning, O. Chapelle, B. Schölkopf and A. Zien (Editors), 55-70, MIT Press.

A. Corduneanu and T. Jaakkola. (2006). Data dependent regularization. Semi-Supervised Learning, O. Chapelle, B. Schölkopf and A. Zien (Editors), 163-182, MIT Press.

T. Cover and J. Thomas. (1991). Elements of Information Theory. John Wiley & Sons.

R. Duda and P. Hart. (1973). Pattern Classification and Scene Analysis. John Wiley & Sons.

Y. Grandvalet and Y. Bengio. (2004). Semi-supervised learning by entropy minimization. Advances in Neural Information Processing Systems, 17:529-536.

J. Lafferty, A. McCallum and F. Pereira. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning, 282-289.

W. Li and A. McCallum. (2005). Semi-supervised sequence modeling with syntactic topic models. Proceedings of the Twentieth National Conference on Artificial Intelligence, 813-818.

A. McCallum. (2002). MALLET: A machine learning for language toolkit. [http://mallet.cs.umass.edu]

R. McDonald, K. Lerman and Y. Jin. (2005). Conditional random field biomedical entity tagger. [http://www.seas.upenn.edu/ sryantm/software/BioTagger/]

R. McDonald and F. Pereira. (2005). Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics, 6(Suppl 1):S6.

K. Nigam, A. McCallum, S. Thrun and T. Mitchell. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):135-167.

J. Nocedal and S. Wright. (2000). Numerical Optimization. Springer.

S. Roberts, R. Everson and I. Rezek. (2000). Maximum certainty data partitioning. Pattern Recognition, 33(5):833-839.

S. Vishwanathan, N. Schraudolph, M. Schmidt and K. Murphy. (2006). Accelerated training of conditional random fields with stochastic meta-descent. Proceedings of the 23rd International Conference on Machine Learning.
D. Yarowsky. (1995). Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 189-196.

D. Zhou, O. Bousquet, T. Navin Lal, J. Weston and B. Schölkopf. (2004). Learning with local and global consistency. Advances in Neural Information Processing Systems, 16:321-328.

D. Zhou, J. Huang and B. Schölkopf. (2005). Learning from labeled and unlabeled data on a directed graph. Proceedings of the 22nd International Conference on Machine Learning, 1041-1048.

X. Zhu, Z. Ghahramani and J. Lafferty. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. Proceedings of the 20th International Conference on Machine Learning, 912-919.

A Deriving the gradient of the entropy

We wish to show that, for each unlabeled example $x^{(j)}$,

$$\frac{\partial}{\partial\theta} \Big( \sum_{y} p_\theta(y \mid x^{(j)}) \log p_\theta(y \mid x^{(j)}) \Big) = \mathrm{cov}_{p_\theta(y \mid x^{(j)})}\big(f(x^{(j)}, y)\big)\,\theta. \qquad (5)$$

First, note that some simple calculation yields

$$\log p_\theta(y \mid x^{(j)}) = \langle \theta, f(x^{(j)}, y)\rangle - \log Z_\theta(x^{(j)})$$

and

$$\frac{\partial}{\partial\theta_k}\, p_\theta(y \mid x^{(j)}) = p_\theta(y \mid x^{(j)}) \Big( f_k(x^{(j)}, y) - \sum_{y'} p_\theta(y' \mid x^{(j)})\, f_k(x^{(j)}, y') \Big).$$

Therefore, since $\sum_{y} \frac{\partial}{\partial\theta_k} p_\theta(y \mid x^{(j)}) = 0$, the terms involving $\log Z_\theta(x^{(j)})$ and the constant contribution of the product rule vanish, and

$$\frac{\partial}{\partial\theta_k} \sum_{y} p_\theta(y \mid x^{(j)}) \log p_\theta(y \mid x^{(j)})
= \sum_{y} \frac{\partial p_\theta(y \mid x^{(j)})}{\partial\theta_k}\, \langle \theta, f(x^{(j)}, y)\rangle
= \sum_{l} \theta_l \Big( E\big[f_k(x^{(j)}, y)\, f_l(x^{(j)}, y)\big] - E\big[f_k(x^{(j)}, y)\big]\, E\big[f_l(x^{(j)}, y)\big] \Big)$$

where the expectations are taken with respect to $p_\theta(y \mid x^{(j)})$. In the vector form, this can be written as $\mathrm{cov}_{p_\theta(y \mid x^{(j)})}\big(f(x^{(j)}, y)\big)\,\theta$, which is exactly the covariance term appearing in (3).
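The identity (5) is easy to verify numerically. The following self-contained sketch builds an arbitrary toy exponential-family distribution $p_\theta(y \mid x) \propto \exp\langle\theta, f(x, y)\rangle$ over a handful of labelings and compares a finite-difference derivative of $\sum_y p_\theta(y \mid x) \log p_\theta(y \mid x)$ with $\mathrm{cov}(f)\,\theta$; the feature vectors are random illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.normal(size=(6, 4))            # feature vectors f(x, y) for 6 toy labelings y
theta = rng.normal(size=4)

def probs(theta):
    s = F @ theta
    s -= s.max()                       # numerical stability
    p = np.exp(s)
    return p / p.sum()                 # p_theta(y | x) for an exponential-family model

def neg_entropy(theta):
    p = probs(theta)
    return float(p @ np.log(p))        # sum_y p log p

p = probs(theta)
E_f = p @ F
cov = (F * p[:, None]).T @ F - np.outer(E_f, E_f)
analytic = cov @ theta                 # the right hand side of (5)

eps = 1e-6
numeric = np.array([(neg_entropy(theta + eps * e) - neg_entropy(theta - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
print(np.max(np.abs(numeric - analytic)))   # should be tiny, on the order of 1e-8 or less
```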