Structure Compilation: Trading Structure for Features

Percy Liang (pliang@cs.berkeley.edu)
Computer Science Division, University of California, Berkeley, CA, USA

Hal Daumé III (me@hal3.name)
School of Computing, University of Utah, Salt Lake City, UT, USA

Dan Klein (klein@cs.berkeley.edu)
Computer Science Division, University of California, Berkeley, CA, USA

Abstract

Structured models often achieve excellent performance but can be slow at test time. We investigate structure compilation, where we replace structure with features, which are often computationally simpler but unfortunately statistically more complex. We analyze this tradeoff theoretically and empirically on three natural language processing tasks. We also introduce a simple method to transfer predictive power from structure to features via unlabeled data, while incurring a minimal statistical penalty.

1. Introduction

Structured models have proven to be quite effective for tasks which require the prediction of complex outputs with many interdependencies, e.g., sequences, segmentations, trees, etc. For example, conditional random fields (CRFs) can be used to predict tag sequences where there are strong dependencies between adjacent tags (Lafferty et al., 2001). In part-of-speech tagging, for instance, a CRF can easily model the fact that adjectives tend to precede nouns in English. However, the excellent performance of structured models comes at a computational cost: inference in loopy graphs requires approximate inference, and even for sequences, there is a quadratic dependence on the number of tags. In this paper, we ask a bold question: do we really need structure?

Consider replacing the edges in a CRF with additional contextual features, i.e., having an independent logistic regression (ILR) at each position. A fundamental question is whether there is a gap between the expressive power of the ILR and that of the CRF. Punyakanok et al. (2005) investigated this question for margin-based models and concluded that structure was not needed when the independent problems were "easy." (In their case, the independent models are not endowed with extra features, but coherence of the predictions is enforced at test time.) They characterized difficulty in terms of classifier separability, which is a very rigid notion. In Section 3.1, we provide an information-theoretic analysis, decomposing the gap between the ILR and CRF into three terms, each one representing a shortcoming of the ILR. The impact of each is investigated empirically.

Even if the ILR were made as expressive as the CRF by adding additional features, an important remaining question is whether the ILR could generalize as well as the CRF given limited labeled data. Indeed, the ILR overfits more easily, and we provide generalization bounds in Section 3.2 to quantify this effect.

At this point, it seems as though we are forced to make a tradeoff between the computational simplicity of the ILR and the statistical simplicity of the CRF. However, we propose structure compilation as a way to have the best of both worlds. Our strategy is to label a plethora of unlabeled examples using the CRF and then train the ILR on these automatically labeled examples. If we label enough examples, the ILR will be less likely to overfit. Although training now takes longer, it is only a one-time cost, whereas prediction at test time should be made as fast as possible.

Many authors have used unlabeled data to transfer the predictive power of one model to another, for example, from high-accuracy neural networks to more interpretable decision trees (Craven, 1996), or from high-accuracy ensembles to faster and more compact neural networks (Bucila et al., 2006). Our focus is on structured classification tasks, specifically on studying the tradeoff between structure and features. We ran experiments on three tasks: part-of-speech tagging (POS), named-entity recognition (NER), and constituency parsing (Figure 1).

[Figure 1. Examples of inputs and their outputs for the three tasks we experimented on. The input (what's seen at test time) is italicized. POS: "The European Commission agreed" is tagged DT NNP NNP VBD. NER: the same sentence is tagged O B-ORG I-ORG O. Parsing: a tree over "The man ate a tasty sandwich".]

2. From Structure to Features

In this section, we will walk through the process of replacing structure with features, using empirical results on POS and NER as running examples. For POS, we used the Wall Street Journal (WSJ) portion of the Penn Treebank, with sections 0-21 as the training set (38K sentences) and sections 22-24 as the test set (5.5K sentences). For NER, we used the English data from the 2003 CoNLL Shared Task, consisting of 14.8K training sentences and 3.5K test sentences (set A). Table 1 summarizes the notation we will use.

2.1. Conditional Random Fields (CRFs)

In structured classification, our goal is to learn to predict an output y in Y (e.g., a tag sequence, a segmentation, or a parse tree) given an input x in X (e.g., a sentence). In this paper, we consider conditional exponential family models, which have the following form:

  p_\theta(y \mid x) = \exp\{\phi(x, y) \cdot \theta - A(\theta; x)\},   (1)

where \phi(x, y) are the sufficient statistics (features), \theta \in R^d are the parameters, and A(\theta; x) = \log \sum_y \exp\{\phi(x, y) \cdot \theta\} is the log-partition function.

One important type of conditional exponential family is a conditional random field (CRF) defined on a graph G = (V, E). In this case, the output y = \{y_i\}_{i \in V} is a collection of labels, one for each node i \in V, with y_i \in \{1, \dots, K\}. The features include functions over both nodes and edges:

  \phi(x, y) = \sum_{i \in V} f(y_i, x, i) + \sum_{(i,j) \in E} g(y_i, y_j).

In this work, we use a generic set of features for both POS and NER. The components of the node features f(y_i, x, i) are all indicator functions of the form I[y_i = a, s(x_{i+o}) = b], where a ranges over tags, s(.) ranges over functions on words, b ranges over values of s(.), and -L <= o <= L is an offset within a radius-L window of the current position i (we used L = 0 for POS, L = 1 for NER). We used 10 standard NLP functions s(.), which return the word, prefixes and suffixes (up to length 3) of the word, word signatures (e.g., McPherson maps to AaAaaaaaa and AaAa), and whether the word is capitalized. The components of the edge features g(y_i, y_j) are of the form I[y_i = a, y_j = b]. Let f1 denote this base feature set.

Training. Suppose we are given n labeled examples (x^(1), y^(1)), ..., (x^(n), y^(n)), which for the purposes of our theoretical analysis are assumed to be drawn i.i.d. from some unknown true distribution p^*. We train the CRF using standard maximum likelihood:

  \max_\theta \; E_{p_l(x,y)} \log p(y \mid x; \theta),

where p_l(x, y) = (1/n) \sum_{i=1}^n I[x = x^{(i)}, y = y^{(i)}] denotes the empirical distribution of the labeled data. (In our experiments, we ran stochastic gradient for 50 iterations with a 1/(iteration + 3) step size.) Later, we will consider other training regimes, so we need to establish some new notation. Let p_c = Tr(crf, f1, p_l) denote the CRF trained with the base feature set f1 on the labeled data.

CRFs achieve state-of-the-art performance on POS and NER. Using just our generic feature set, we obtain 96.9% tagging accuracy on POS (training with 30K examples) and 85.3% F1 on NER (training with 10K examples). However, the performance does come at a computational cost, since inference scales quadratically with the number of tags K. This cost only increases with more complex models.

2.2. Independent Logistic Regression (ILR)

Let us try something drastic: remove the edges from the CRF to get an independent logistic regression (ILR), where now \phi(x, y) = \sum_{i \in V} f(y_i, x, i). For an ILR trained on the labeled data with our base feature set (denoted formally as Tr(ilr, f1, p_l)), inference can be done independently for each node.
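Concretely, a per-position classifier of this kind can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature functions S_FUNCTIONS are a small hypothetical subset of the ten described above, and the weight vector is assumed to have been trained elsewhere.

import math

# Hypothetical base feature functions s(.) applied to a word (a subset of the
# 10 functions described in Section 2.1).
S_FUNCTIONS = {
    "word":   lambda w: w,
    "pre3":   lambda w: w[:3],
    "suf3":   lambda w: w[-3:],
    "is_cap": lambda w: str(w[0].isupper()),
}

def node_features(sent, i, radius=1):
    """Indicator features I[s(x_{i+o}) = b] for offsets |o| <= radius
    (radius = 0 for POS, 1 for NER in the paper's base feature set f1)."""
    feats = []
    for o in range(-radius, radius + 1):
        j = i + o
        if 0 <= j < len(sent):
            for name, s in S_FUNCTIONS.items():
                feats.append(f"{o}:{name}={s(sent[j])}")
    return feats

def ilr_predict(sent, i, weights, tags, radius=1):
    """Independent logistic regression: a softmax over tags at position i.
    `weights` maps (tag, feature) -> float and is assumed to be pre-trained."""
    feats = node_features(sent, i, radius)
    scores = {t: sum(weights.get((t, f), 0.0) for f in feats) for t in tags}
    z = math.log(sum(math.exp(s) for s in scores.values()))
    return {t: math.exp(s - z) for t, s in scores.items()}

# Usage: each position is tagged independently, with no edge potentials.
weights = {}                       # stand-in for learned parameters
tags = ["DT", "NNP", "VBD", "O"]   # illustrative tag set
sent = "The European Commission agreed".split()
for i in range(len(sent)):
    dist = ilr_predict(sent, i, weights, tags)
    print(sent[i], max(dist, key=dist.get))

Because there are no edge potentials, each position is scored in isolation, which is exactly what makes test-time prediction cheap.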
For POS, the ILR takes only 0.4 ms to process one sentence, whereas the CRF takes 2.7 ms, a speedup of 5.8x, not counting the time for precomputing features (if we include the time for computing features, the speedup drops to 2.3x). Unfortunately, the accuracy on POS drops from 96.9% to 93.7%. For NER, F1 drops from 85.3% to 81.4%.

2.3. Adding New Features

Without edges, the ILR has less expressiveness than the CRF. We can compensate for this loss by expanding our base feature set. We will use f2 to denote the expanded feature set. We use a simple recipe to automatically construct f2 from f1, but in general, we could engineer the features more carefully for better performance (see the parsing experiments in Section 4, for example). Essentially, our recipe is to allow the ILR at node i to use the base features f1 applied to the nodes in a local window around i. For the chain CRF, this amounts to simply increasing the window size from L to L + r (Section 2.1), where we call r the expansion radius. For POS, we used an expansion radius of r = 1; for NER, r = 2.

By training with these new features (Tr(ilr, f2, p_l)), we get 96.8% on POS (compared to the 96.9% of the CRF), taking 0.8 ms per example (compared to 2.7 ms for the CRF). In this case, we have successfully traded structure for features with a negligible loss in performance and a 3.4x speedup. For NER, the story is quite different: adding features actually makes F1 drop from 81.1% to 78.8%.

We believe there are two reasons for the poor performance on NER. First, since NER is a segmentation problem, the structure plays a more integral role and thus cannot be easily replaced with features. In other words, the approximation error of the ILR with respect to the CRF is higher for NER than for POS. Section 3.1 provides a more formal treatment of this matter. Second, adding more features increases the risk of overfitting. In other words, the estimation error is larger when there are more features. Section 3.2 analyzes this error theoretically.
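The feature-expansion recipe above amounts to widening the window in which the base templates are applied. The following is one illustrative reading of that recipe, reusing the hypothetical node_features helper from the earlier sketch; it is not the authors' exact feature code.

def expanded_features(sent, i, base_radius=1, expansion_radius=2):
    """f2: apply the f1 templates within a window of radius L + r around i.
    For the chain models in the paper, L = 0 (POS) or 1 (NER), and the
    expansion radius r is 1 for POS and 2 for NER."""
    return node_features(sent, i, radius=base_radius + expansion_radius)

# The ILR with f2 sees a wider slice of the input at each position, which is
# how it tries to make up for the missing edge potentials.
sent = "The European Commission agreed".split()
print(len(node_features(sent, 1)), "base features vs",
      len(expanded_features(sent, 1)), "expanded features at position 1")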
2.4. Using Unlabeled Data

There seems to be a tradeoff between approximation error and estimation error: more features can provide a better substitute for structure (decreasing the approximation error), but at the risk of overfitting the data (increasing the estimation error). The algorithmic contribution of this paper is using unlabeled data to reduce the estimation error of the ILR via structure compilation.

Suppose we have m unlabeled examples x^(1), ..., x^(m) (which we assume are also generated from p^*); let p_u(x) denote the corresponding empirical distribution. We can use the CRF p_c(y | x) (which has been trained on p_l) to label this unlabeled data. Let p^u_c(x, y) = p_u(x) p_c(y | x) denote this new plentiful source of labeled data. Instead of training the ILR on the limited amount of originally labeled data p_l, we instead train it on our automatically labeled data p^u_c. (Strictly speaking, training on p^u_c would involve creating many examples weighted by p_c(y | x) for each original x. But since the posteriors are sharply peaked, using argmax_y p_c(y | x) was faster and an adequate approximation for our applications.)

Table 1. Notation used in this paper. In general, superscripts denote marginal distributions over x and subscripts denote conditional distributions over y given x.

  f1                                        base feature set
  f2                                        expanded feature set
  p^*(x, y)                                 true data distribution
  p_l(x, y)                                 original labeled examples (few)
  p_c(y | x)      = Tr(crf, f1, p_l)        trained CRF
  \bar p_c(y | x) = Tr(crf, f1, p^*)        limiting CRF
  p_u(x)                                    unlabeled examples (many)
  p^u_c(x, y)     = p_u(x) p_c(y | x)       unlabeled data labeled with the CRF
  p^*_c(x, y)     = p^*(x) p_c(y | x)       true inputs labeled with the CRF
  p_i(y | x)      = Tr(ilr, f2, p^u_c)      trained ILR
  \bar p_i(y | x) = Tr(ilr, f2, p^*_c)      limiting ILR

Structure compilation is most useful when there are few original labeled examples. Figure 2 shows the performance of an ILR obtained using structure compilation when the CRF is trained on only 2K labeled examples. We see that using more unlabeled data reduces the performance gap between the CRF and ILR as the estimation error is reduced. For POS, the gap is closed entirely, whereas for NER, there is a remaining approximation error.

[Figure 2. crf(f1) is the CRF trained using the base feature set on 2K examples. ilr(f1) is the ILR trained the same way; performance suffers. However, by using the expanded feature set and training on m examples (New York Times articles from Gigaword) which are labeled with the CRF, ilr(f2) can recover all of the performance for POS and more than half of the performance for NER. Panel (a): POS tag accuracy vs. m (thousands); panel (b): NER labeled F1 vs. m (thousands).]
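Putting Sections 2.1-2.4 together, structure compilation itself is a short pipeline. The sketch below is schematic: train_crf, crf_decode, and train_ilr are hypothetical stand-ins for a CRF trainer, a Viterbi-style decoder, and a per-position logistic regression trainer, not a particular library's API. It uses the argmax labeling shortcut described above.

def structure_compile(labeled, unlabeled, train_crf, crf_decode, train_ilr,
                      include_labeled=False):
    """Compile a CRF's predictive power into an independent classifier.

    labeled:    list of (sentence, tag_sequence) pairs (few)
    unlabeled:  list of sentences (many)
    train_crf:  fits a CRF with base features f1 on labeled data
    crf_decode: returns argmax_y p_c(y | x) for a sentence
    train_ilr:  fits per-position logistic regressions with expanded features f2
    """
    # 1. Train the structured model p_c on the small labeled set.
    crf = train_crf(labeled)

    # 2. Label a large pool of unlabeled sentences with the CRF. The posteriors
    #    are sharply peaked, so the argmax stands in for weighting by p_c(y | x).
    auto_labeled = [(x, crf_decode(crf, x)) for x in unlabeled]

    # 3. Train the fast independent model on the automatically labeled data.
    #    (In the parsing experiments of Section 4 the original labeled examples
    #    are added back in; for POS/NER the automatic labels alone are used.)
    train_data = auto_labeled + (labeled if include_labeled else [])
    return train_ilr(train_data)

Training now takes longer because of steps 2 and 3, but that cost is paid once, offline; test-time prediction is still just the ILR.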
3. Analysis

In this section, we try to get a better understanding of when structure compilation would be effective. For the theoretical analysis, we will measure performance in terms of log-loss (negative cross-entropy):

  \ell(p_1, p_2) \;\overset{\text{def}}{=}\; E_{p_1(x,y)} [-\log p_2(y \mid x)],   (2)

where p_1(x, y) is the data-generating distribution and p_2(y | x) is the model under evaluation. We will also use conditional KL-divergence to quantify approximation error:

  \varepsilon(p_1, p_2) \;\overset{\text{def}}{=}\; \ell(p_1, p_2) - \ell(p_1, p_1).   (3)

We are interested in \ell(p^*, p_i), the loss of the final compiled ILR. As we will later show in (7), this loss can be expressed in terms of two parts: (1) \ell(p^*, p_c), the loss of the CRF; and (2) \varepsilon(p^*_c, p_i), the penalty due to structure compilation. The CRF loss can be decomposed into approximation and estimation errors as follows, using telescoping sums (refer to Table 1 for notation):

  \ell(p^*, p_c) = \underbrace{\ell(p^*, p_c) - \ell(p_l, p_c)}_{\text{crf-est-err}}
                 + \underbrace{\ell(p_l, p_c) - \ell(p_l, \bar p_c)}_{\le 0}
                 + \underbrace{\ell(p_l, \bar p_c) - \ell(p^*, \bar p_c)}_{\text{crf-est-err}}
                 + \underbrace{\ell(p^*, \bar p_c)}_{\text{crf-apx-err}}.   (4)

The second term on the RHS is non-positive because p_c is chosen to minimize log-loss on p_l. The first and third terms are the estimation errors resulting from using p_l instead of p^*; these will be uniformly bounded in Section 3.2. Finally, the last term is an approximation error reflecting the modeling limitations of the CRF. The structure compilation penalty can be decomposed analogously:

  \varepsilon(p^*_c, p_i) = \underbrace{\ell(p^*_c, p_i) - \ell(p^u_c, p_i)}_{\text{ilr-est-err}}
                          + \underbrace{\ell(p^u_c, p_i) - \ell(p^u_c, \bar p_i)}_{\le 0}
                          + \underbrace{\ell(p^u_c, \bar p_i) - \ell(p^*_c, \bar p_i)}_{\text{ilr-est-err}}
                          + \underbrace{\varepsilon(p^*_c, \bar p_i)}_{\text{ilr-apx-err}}.   (5)

We would now like to combine \ell(p^*, p_c) and \varepsilon(p^*_c, p_i) to get a handle on \ell(p^*, p_i), but unfortunately, KL-divergence does not satisfy a triangle inequality. We can, however, derive the following approximate triangle inequality, where we pay an extra multiplicative factor (see Appendix A.1 for the proof):

Theorem 1. Consider a conditional exponential family P with features \phi. Let \Theta be a compact subset of parameters, and define P_\Theta = \{p_\theta : \theta \in \Theta\}. For any p_1, p_2 \in P_\Theta and any distribution p_0 such that \bar p_0 = \arg\max_{p \in P_\Theta} E_{p^*(x) p_0(y|x)} \log p(y \mid x) \in P_\Theta,

  E_{p^*(x)} KL(p_0(y|x) \,\|\, p_2(y|x)) \le \alpha \left[ E_{p^*(x)} KL(p_0(y|x) \,\|\, p_1(y|x)) + E_{p^*(x)} KL(p_1(y|x) \,\|\, p_2(y|x)) \right],   (6)

where \alpha = \dfrac{2 \sup_{\theta \in \Theta} \lambda_{\max}(E\, \mathrm{var}_\theta(\phi \mid x))}{\inf_{\theta \in \Theta} \lambda_{\min}(E\, \mathrm{var}_\theta(\phi \mid x))}. Here, \lambda_{\max}(\Sigma) and \lambda_{\min}(\Sigma) are the largest and smallest nonzero eigenvalues of \Sigma, respectively.

Theorem 1 generalizes Lemma 3 of Crammer et al. (2007) to conditional distributions and to the case where p_0 is not necessarily in an exponential family. Let us apply Theorem 1 with p_0 = p^*, p_1 = p_c, p_2 = p_i. We then add the conditional entropy E\,H(p^*(y|x)) to the LHS of the resulting inequality and \alpha\, E\,H(p^*(y|x)) to the RHS (note that \alpha \ge 1), thus obtaining a bound on the total loss of the final compiled ILR:

  \ell(p^*, p_i) \;\le\; \alpha \left( \ell(p^*, p_c) + \varepsilon(p^*_c, p_i) \right)
                 \;\le\; \alpha \left[ (\text{crf-apx-err} + \text{ilr-apx-err}) + 2\,(\text{crf-est-err} + \text{ilr-est-err}) \right].   (7)

In the remaining sections, we analyze the various pieces of this bound.
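To make the step from Theorem 1 to (7) explicit, the chaining can be spelled out as follows; this is simply a restatement in the notation of Table 1, using \varepsilon(p^*, q) = \ell(p^*, q) - E\,H(p^*(y \mid x)) and \alpha \ge 1.

\begin{align*}
\ell(p^*, p_i) &= \varepsilon(p^*, p_i) + E\,H(p^*(y \mid x)) \\
  &\le \alpha\left[\varepsilon(p^*, p_c) + \varepsilon(p^*_c, p_i)\right] + E\,H(p^*(y \mid x))
  && \text{(Theorem 1 with } p_0 = p^*,\ p_1 = p_c,\ p_2 = p_i\text{)} \\
  &\le \alpha\left[\varepsilon(p^*, p_c) + E\,H(p^*(y \mid x)) + \varepsilon(p^*_c, p_i)\right]
  && (\alpha \ge 1) \\
  &= \alpha\left[\ell(p^*, p_c) + \varepsilon(p^*_c, p_i)\right]
  && \text{(definition (3))} \\
  &\le \alpha\left[(\text{crf-apx-err} + \text{ilr-apx-err}) + 2(\text{crf-est-err} + \text{ilr-est-err})\right]
  && \text{(decompositions (4) and (5))}.
\end{align*}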
3.1. Approximation Error

We start by analyzing ilr-apx-err = \varepsilon(p^*_c, \bar p_i), which measures how well the ILR can approximate the CRF. Specifically, we show that \varepsilon(p^*_c, \bar p_i) decomposes into three terms, each reflecting a limitation of the ILR: (1) I_c, the inability to produce a coherent output; (2) I_n, the inability to express nonlinearities; and (3) I_g, the inability to use information about the input outside a local window. The following theorem formalizes these concepts (see Appendix A.2 for the proof):

Theorem 2 (Decomposition of approximation error). \varepsilon(p^*_c, \bar p_i) = I_c + I_n + I_g, where

  I_c = E_{p^*(x)} KL\Big( p_c(y \mid x) \,\Big\|\, \prod_{i \in V} p_c(y_i \mid x) \Big),
  I_n = E_{p^*(x)} \sum_{i \in V} KL\big( p_c(y_i \mid x) \,\big\|\, p_a(y_i \mid x) \big),
  I_g = E_{p^*(x)} \sum_{i \in V} KL\big( p_a(y_i \mid x) \,\big\|\, \bar p_i(y_i \mid x) \big),

with p_a = Tr(ilr, f_\infty, p^*_c), where the node features f_\infty are constructed from f1 with an expansion radius of \infty (so the entire input sequence x is used).

Coherence. One advantage of structured models is their ability to predict the output jointly. This could be especially important for NER, where the output tag sequence actually codes for a segmentation of the input. I_c measures the information lost by using the independent marginals of the CRF rather than the joint. (On the other hand, if we evaluate predictions using Hamming distance, it could actually be better to use the marginals; in that case, coherence is irrelevant.) For a chain CRF defined on y = (y_1, ..., y_\ell), one can check that I_c is the sum of mutual information terms along the edges: I_c = E \sum_{i=1}^{\ell-1} I(y_i; y_{i+1} \mid x). We computed I_c empirically: for POS, I_c = 0.003 and for NER, I_c = 0.009 (normalized by sequence length). Also, when we predict using the CRF marginals, the performance drops from 76.3% to 76.0% for NER but stays at 95.0% for POS. From these results, we conclude that coherence is not a big concern for our applications, although it is slightly more serious for NER, as we would expect.

Nonlinearities. Although we think of CRFs as linear models, their marginal predictions actually behave nonlinearly. I_n captures the importance of this nonlinearity by comparing p_c(y_i | x) and p_a(y_i | x). Both depend on x through the same sufficient statistics f1(., x, .), but p_a acts on these sufficient statistics in a linear way, whereas p_c allows the information about x to propagate in a nonlinear way through the other hidden labels y_{-i} = {y_j : j != i}, in a manner roughly similar to that of a neural network. However, one difference is that the parameters of p_c(y_i | x) are learned with y_{-i} fixed at training time; they are not arbitrary hidden units in service of y_i. A neural network therefore offers more expressive power but could be more difficult to learn.

We would like to measure the effect of nonlinearity empirically, but p_a has too many parameters to learn effectively. Thus, instead of comparing p_a and p_c, we compare \bar p_i and a truncated CRF p_tc, which we train as follows (truncated CRFs are closely related to piecewise-trained CRFs (Sutton & McCallum, 2005)): for each labeled example (x, y), which is a sequence of length \ell, we create \ell new examples (x_{i-L-r..i+L+r}, y_{i-r..i+r}) for i = 1, ..., \ell, where r is the expansion radius (Section 2.3). Then we train a CRF with features f1 on these new examples to get p_tc. To label node i, we use p_tc(y_i | x_{i-L-r..i+L+r}), marginalizing out y_{i-r..i-1} and y_{i+1..i+r}. By this setup, both p_tc(y_i | x) and \bar p_i(y_i | x) depend on x through the same features.

Table 2 compares the NER performance of p_tc and the compiled ILR. As we can see, the truncated CRF significantly outperforms the ILR, demonstrating the power of nonlinearities.

Table 2. NER F1 (2K originally labeled examples, 200K automatically labeled examples for structure compilation), showing the importance of nonlinearity. Both the ILR and the truncated CRF depend on the input x in the same way, but only the latter permits nonlinearities.

  Expansion radius r    1       2       3       infinity
  compiled ILR          0.725   0.727   0.721   --
  truncated CRF         0.748   0.760   0.762   0.760

Global information. I_g compares p_a and \bar p_i, both of which are independent linear classifiers. The difference is that p_a uses all features of x, while \bar p_i uses only features of x in a local window. Instead of comparing p_a and \bar p_i, we compare their nonlinear counterparts p_c and p_tc. From Table 2, we can see that the truncated CRF with just an expansion radius of 2 has the same performance as the original CRF (expansion radius infinity). Therefore, we suspect that the features on x outside a local window have little impact on performance, and that most of the approximation error is due to lacking nonlinearities.
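Before turning to estimation error, note that for chain models the coherence term I_c above can be estimated directly from CRF marginals. The sketch below is one way to do this, assuming access to pairwise edge marginals p_c(y_i, y_{i+1} | x) (e.g., from forward-backward); the marginal arrays are assumed inputs rather than the output of any particular CRF library.

import numpy as np

def edge_mutual_information(pair_marginal):
    """I(y_i; y_{i+1} | x) from a K x K joint marginal over one edge."""
    p = np.clip(pair_marginal, 1e-12, 1.0)
    p /= p.sum()
    pi_ = p.sum(axis=1, keepdims=True)   # marginal of y_i
    pj_ = p.sum(axis=0, keepdims=True)   # marginal of y_{i+1}
    return float((p * np.log(p / (pi_ * pj_))).sum())

def coherence_term(pair_marginals_per_sentence):
    """I_c, normalized by sequence length and averaged over sentences.
    pair_marginals_per_sentence: for each sentence, a list of K x K arrays,
    one per edge of the chain."""
    total, positions = 0.0, 0
    for edges in pair_marginals_per_sentence:
        total += sum(edge_mutual_information(m) for m in edges)
        positions += len(edges) + 1        # number of tokens in the sentence
    return total / max(positions, 1)

# Usage with a toy 2-tag edge marginal: independent tags give I close to 0.
toy_edge = np.array([[0.25, 0.25], [0.25, 0.25]])
print(coherence_term([[toy_edge]]))        # ~0.0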
3.2. Estimation Error

In this section, we quantify the estimation errors for the CRF and ILR. First, we establish a general result about the estimation error of log-loss for exponential families. Our strategy is to uniformly bound the difference between empirical and expected log-losses across all parameter values of the exponential family. Our proof uses covering numbers and is based on Collins (2001), which derived an analogous result for a 0-1 margin loss.

Assume our parameters and features are bounded: \Theta = \{\theta : \|\theta\|_2 \le B\} and R = \sup_{x,y} \|\phi(x, y)\|_2. The following theorem relates the difference between empirical and expected log-loss to the number of examples n (see Appendix A.3 for the proof):

Theorem 3. For \epsilon > 0, with probability at least 1 - \delta, for |\ell(p^*, p_\theta) - \ell(p_l, p_\theta)| \le \epsilon to hold for all \theta \in \Theta, it suffices to use n = O(B^4 R^4 \log^2 |Y|\, \epsilon^{-4} \log(1/\delta)) examples, where we have suppressed logarithmic factors.

The above result gives uniform convergence of \ell(\cdot, \cdot), which allows us to bound crf-est-err. In order to bound ilr-est-err, we need uniform convergence of \varepsilon(\cdot, \cdot) in (5). This requires the convergence of one additional term: |\ell(p^u_c, p_c) - \ell(p^*_c, p_c)| \to 0 (p_c is non-random in this context). The asymptotics for n in Theorem 3 therefore remain unchanged.

We now apply Theorem 3 to the CRF and ILR with the features described in Section 2.1. For both models, \log |Y| = |V| \log K. Where they differ is on the norms of the parameters and features (B and R). Let d_1 be the total number of features in the base feature set and d_2 the total number in the expanded feature set. Let c_k be the number of nonzero entries in f_k(y_i, x, i). Natural language processing is typified by sparse binary feature vectors, so d_k \gg c_k. For the CRF, \|\phi(x, y)\|_2^2 is bounded by c_1 |V| + |E|, so R^2 \le c_1 |V| + |E|; the ILR has no edge potentials, so R^2 \le c_2 |V|. In general, R is small for NLP applications. On the other hand, B scales as \sqrt{d_1} for the CRF and \sqrt{d_2} for the ILR if the magnitudes of the individual parameters are comparable. Since d_2 is significantly larger than d_1, the generalization bound for the ILR is much worse than that for the CRF. While comparing upper bounds is inconclusive, showing that one upper bound is larger than another via the same methodology is weak evidence that the actual quantities obey a similar inequality.

4. Parsing Experiments

We now apply structure compilation to parsing. In this case, our structured model is a log-linear parser (Petrov & Klein, 2008), which we would like to replace with independent logistic regressions. For simplicity, we consider unlabeled binary trees; we could always use another independent classifier to predict node labels. Standard parsing algorithms require O(|G| \ell^3) time to parse a sentence with \ell words, where |G| is the size of the grammar. The ILR-based parser we will describe only requires O(\ell^3) time. An extension to the labeled case would require O(\ell^3 + \ell K) time, where K is the number of labels. This is a significant gain, since the grammars used in real-world parsers are quite large (|G| \gg \ell, K).

4.1. Independent Model

An example parse tree is shown in Figure 1 (recall that we do not predict the labels). For each span (i, j) with 1 \le i < j \le \ell, the independent model predicts whether a node in the parse tree dominates the span. For example, of the 14 non-trivial spans in the sentence in Figure 1, the positively labeled spans are (1, 2), (5, 6), (4, 6), and (3, 6). To parse a sentence at test time, we first use the independent model to assign each span a probability and then use dynamic programming to find the most likely tree.

The independent model uses the following features evaluated (a) within the span, (b) at the boundary of the span, and (c) within a window of 3 words on either side of the span: word identity, parts-of-speech, prefixes/suffixes (length 1-3), and case patterns. Additional features include 3- and 4-grams of the words and POS tags that occur within the span; the entire POS sequence; the entire word sequence; the 3-character suffix sequence; the case sequence within the span; the length of the span; the position of the span relative to the entire sentence; the number of verbs, conjunctions, and punctuation marks within the span; and whether the span centers on a conjunction and has symmetric POS tags to the left and right.
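The test-time decoding step just described can be sketched with a CKY-style dynamic program. The exact objective is not spelled out above, so the sketch makes one natural assumption: it finds the binary bracketing that maximizes the sum of log span scores produced by the independent model, where span_logprob is a hypothetical scoring function standing in for the trained ILR.

import math
from functools import lru_cache

def best_binary_tree(n, span_logprob):
    """Return (score, tree) for the highest-scoring binarization of n words.

    span_logprob(i, j): log-probability that some tree node dominates
    words i..j-1 (0-indexed, exclusive end), as output by the span classifier.
    Runs in O(n^3) time, independent of any grammar size.
    """
    @lru_cache(maxsize=None)
    def best(i, j):
        if j - i == 1:                       # single word: a leaf
            return 0.0, (i, j)
        best_score, best_tree = -math.inf, None
        for k in range(i + 1, j):            # choose a split point
            ls, lt = best(i, k)
            rs, rt = best(k, j)
            score = span_logprob(i, j) + ls + rs
            if score > best_score:
                best_score, best_tree = score, ((i, j), lt, rt)
        return best_score, best_tree

    return best(0, n)

# Usage with a toy scorer that mildly prefers short spans.
score, tree = best_binary_tree(4, lambda i, j: -0.1 * (j - i))
print(score, tree)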
4.2. Results

In all of our experiments, we evaluate according to the F1 score on unlabeled, binarized trees (using right-binarization). This scoring metric is slightly nonstandard, but equally difficult: a parser that achieves a labeled F1 (the usual metric) of 89.96% on the Treebank test data (section 23) achieves 90.29% under our metric; on 10% of the data, the two metrics are 82.84% and 85.16%, respectively.

To test our independent parser, we trained a structured parser on 10% of the WSJ portion of the Penn Treebank (4K sentences). We then used the structured parser to parse 160K unlabeled sentences (these sentences are from the North American National Corpus, selected by test-set relativization), which were then used, along with the original 4K sentences, to train the independent model. Figure 3(a) shows the F1 scores of the various models. When only 4K is used, the independent parser achieves a score of 79.18%, whereas the structured parser gets 85.16%. With more automatically labeled examples, the performance of the independent parser approaches that of the structured parser (84.97% with 164K sentences). On the other hand, if we train the structured parser on 40K sentences, the independent parser has a much harder time catching up, improving from 85.42% (40K sentences) to just 87.57% (360K sentences), compared to the structured parser's 90.78% (Figure 3(b)). Since parsing is a much more complex task than POS or NER, we believe that richer features would be needed to reduce this gap.

[Figure 3. A comparison of the structured and independent parsers (unlabeled F1) when the structured parser is trained on 4K (a) and 40K (b) sentences. m is the number of examples (original + automatically labeled) used to train the independent parser.]
5. Conclusion

The importance of deploying fast classifiers at test time motivated our investigation into the feasibility of replacing structure with features. We presented a method to compile structure into features and conducted theoretical and empirical analyses of the estimation and approximation errors involved in structure compilation. We hope that a better understanding of the role structure plays will lead to more computationally efficient methods that can still reap the benefits of structure.

References

Bucila, C., Caruana, R., & Niculescu-Mizil, A. (2006). Model compression. International Conference on Knowledge Discovery and Data Mining (KDD).

Collins, M. (2001). Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. International Workshop on Parsing Technologies.

Crammer, K., Kearns, M., & Wortman, J. (2007). Learning from multiple sources. Advances in Neural Information Processing Systems (NIPS).

Craven, M. W. (1996). Extracting comprehensible models from trained neural networks. Doctoral dissertation, University of Wisconsin at Madison.

Csiszár, I., & Shields, P. (2004). Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1, 417-528.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling data. International Conference on Machine Learning (ICML).

Petrov, S., & Klein, D. (2008). Discriminative log-linear grammars with latent variables. Advances in Neural Information Processing Systems (NIPS).

Pollard, D. (1984). Convergence of stochastic processes. Springer-Verlag.

Punyakanok, V., Roth, D., Yih, W., & Zimak, D. (2005). Learning and inference over constrained output. International Joint Conference on Artificial Intelligence (IJCAI).

Sutton, C., & McCallum, A. (2005). Piecewise training of undirected models. Uncertainty in Artificial Intelligence (UAI).

Zhang, T. (2002). Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2, 527-550.

A. Proofs

Lemma 1 gives conditions under which a conditional KL-divergence can be decomposed exactly; Lemma 2 specializes it to the exponential family. These lemmas are variants of standard results from information geometry (see Csiszár and Shields (2004)). We will use them in the proofs of Theorems 1 and 2.

Lemma 1 (Conditional Pythagorean identity). Let d(p, p') = E_{p^*(x) p(y|x)} \log p'(y \mid x) be the negative conditional cross-entropy. For any three conditional distributions p_0, p_1, p_2, if d(p_0, p_1) = d(p_1, p_1) and d(p_0, p_2) = d(p_1, p_2), then

  E_{p^*(x)} KL(p_0(y|x) \,\|\, p_2(y|x)) = E_{p^*(x)} KL(p_0(y|x) \,\|\, p_1(y|x)) + E_{p^*(x)} KL(p_1(y|x) \,\|\, p_2(y|x)).   (8)

Proof. Use the fact that E_{p^*(x)} KL(p \,\|\, p') = d(p, p) - d(p, p') and perform algebra.

Lemma 2 (Conditional Pythagorean identity for exponential families). Let P be a conditional exponential family. If p_1 = \arg\max_{p \in P} E_{p^*(x) p_0(y|x)} \log p(y \mid x) and p_2 \in P, then (8) holds for p_0, p_1, p_2.

Proof. Since p_1 is the maximum likelihood solution, \mu \overset{\text{def}}{=} E_{p^*(x) p_0(y|x)} \phi(x, y) = E_{p^*(x) p_1(y|x)} \phi(x, y), where \phi are the features of P. Then for p \in P with parameters \theta, d(p_0, p) = \mu \cdot \theta - E\,A(\theta; x) = d(p_1, p) (follows from (1)). Plug in p = p_1, p_2 and apply Lemma 1.

A.1. Proof of Theorem 1

Proof. The first part of the proof is similar to that of Lemma 3 of Crammer et al. (2007). Denote k(p, p') = E_{p^*(x)} KL(p(y|x) \,\|\, p'(y|x)) and B(\theta) = E_{p^*(x)} A(\theta; x). The key is to note that for p, p' \in P_\Theta, the conditional KL-divergence is the residual term in the first-order approximation of B:

  k(p, p') = \nabla B(\theta) \cdot (\theta - \theta') - (B(\theta) - B(\theta')),

where \theta, \theta' are the parameters of p, p'. Thus, we can use Taylor's theorem to get

  k(p, p') = \tfrac{1}{2} \|\theta - \theta'\|^2_{V_{p,p'}}, where V_{p,p'} = \nabla^2 B(\tilde\theta) = E_{p^*(x)} \mathrm{var}_{\tilde\theta}(\phi \mid x) for some \tilde\theta between \theta and \theta'.

Let \theta_0, \theta_1, \theta_2 be the parameters of \bar p_0, p_1, p_2, and write V_{0,2} = V_{\bar p_0, p_2}, V_{0,1} = V_{\bar p_0, p_1}, V_{1,2} = V_{p_1, p_2}. One can check that \|\theta_0 - \theta_2\|^2_{V_{0,2}} \le 2[\|\theta_0 - \theta_1\|^2_{V_{0,2}} + \|\theta_1 - \theta_2\|^2_{V_{0,2}}]. By the definition of \alpha, 2 V_{0,2} \preceq \alpha V_{0,1} and 2 V_{0,2} \preceq \alpha V_{1,2} (it suffices to consider nonzero eigenvalues because zero eigenvalues correspond to non-identifiable directions, which are the same for all parameters \theta \in \Theta), so \|\theta_0 - \theta_2\|^2_{V_{0,2}} \le \alpha[\|\theta_0 - \theta_1\|^2_{V_{0,1}} + \|\theta_1 - \theta_2\|^2_{V_{1,2}}]. Rewritten another way: k(\bar p_0, p_2) \le \alpha [k(\bar p_0, p_1) + k(p_1, p_2)]. Applying Lemma 2 twice, to (p_0, \bar p_0, p_1) and to (p_0, \bar p_0, p_2), yields k(p_0, p_2) - k(p_0, \bar p_0) \le \alpha [k(p_0, p_1) - k(p_0, \bar p_0) + k(p_1, p_2)]. Adding k(p_0, \bar p_0) to both sides and noting that \alpha \ge 1 (so that (\alpha - 1) k(p_0, \bar p_0) \ge 0) yields the theorem.

A.2. Proof of Theorem 2

Proof. Define p_{mc}(y \mid x) = \prod_{i \in V} p_c(y_i \mid x). Check that d(p_c(y \mid x), \prod_{i \in V} p(y_i \mid x)) = d(\prod_{i \in V} p_c(y_i \mid x), \prod_{i \in V} p(y_i \mid x)) for any p; in particular, this holds for p_{mc} and \bar p_i. Thus we can apply Lemma 1 with p_c, p_{mc}, \bar p_i to get \varepsilon(p^*_c, \bar p_i) = I_c + E_{p^*(x)} KL(p_{mc}(y \mid x) \,\|\, \bar p_i(y \mid x)). Since f_\infty is a superset of f2, both \bar p_i and p_a are members of the f_\infty-exponential family, with p_a being the maximum likelihood solution. Apply Lemma 2 with p_{mc}, p_a, \bar p_i to get E_{p^*(x)} KL(p_{mc}(y \mid x) \,\|\, \bar p_i(y \mid x)) = I_n + I_g.

A.3. Proof of Theorem 3

We use covering numbers to bound the complexity of the class of log-losses. Our proof is inspired by Collins (2001), who works with a 0-1 margin-based loss. Define the loss class:

  M \overset{\text{def}}{=} \{(x, y) \mapsto -\log p_\theta(y \mid x) : \theta \in \Theta\}.   (9)

We first show that the elements of M are bounded:

Lemma 3. For each f \in M and (x, y), 0 \le f(x, y) \le L, where L = B R (1 + \log |Y|).

Proof. The lower bound holds since probabilities are bounded above by 1. For the upper bound, consider the absolute values of the two terms in (1) separately. For the linear term, |\phi(x, y) \cdot \theta| \le B R by the Cauchy-Schwartz inequality. The log-partition function can be bounded by B R \log |Y| by applying the linear term result to the exponent. Add the two bounds.

Theorem 1 of Zhang (2002) (originally due to Pollard (1984)) applied to M gives

  P\Big( \sup_{\theta \in \Theta} |\ell(p^*, p_\theta) - \ell(p_l, p_\theta)| > \epsilon \Big) \le 8\, N_1(M, \epsilon/8, n) \exp\Big( \frac{-n \epsilon^2}{128 L^2} \Big),   (10)

where N_p(F, \epsilon, n), the covering number of a function class F, is the supremum over all points z_1, ..., z_n of the size of the smallest cover \{g_1, ..., g_k\} such that for all f \in F, there exists a g_j in the cover with \big( \frac{1}{n} \sum_{i=1}^n |f(z_i) - g_j(z_i)|^p \big)^{1/p} \le \epsilon.

We now upper bound N_\infty(M, \epsilon/8, n), adapting the method used in Collins (2001). First define the set of linear functions:

  L \overset{\text{def}}{=} \{ v \mapsto v \cdot \theta : \theta \in \Theta \}.   (11)

Theorem 4 of Zhang (2002) (with p = q = 2) allows us to bound the complexity of this class:

  \log_2 N_\infty(L, \epsilon, n) \le 36 (B R / \epsilon)^2 \log_2\big( 2 \lceil 4 B R / \epsilon + 2 \rceil n + 1 \big).   (12)

We now relate the covering numbers of L and M:

Lemma 4. N_\infty(M, \epsilon, n) \le N_\infty(L, \epsilon/2, n |Y|).

Proof. Let S = \{(x^{(1)}, y^{(1)}), ..., (x^{(n)}, y^{(n)})\}. We will construct a covering of M (with respect to S) by reducing to the problem of finding a covering of L. Let v_{iy} = \phi(x^{(i)}, y) and V = \{v_{iy} : i = 1, ..., n, y \in Y\}. By (12), we can cover L with respect to V with a set C_L. Consider the corresponding set C_M \subset M (note the natural 3-way bijections between \Theta, L, and M). To prove the lemma, it suffices to show that C_M is a covering of M. Fix some g \in M, which is associated with some f \in L and \theta \in \Theta. There exists an \tilde f \in C_L (corresponding to a \tilde\theta and a \tilde g \in C_M) such that |f(x, y') - \tilde f(x, y')| = |\phi(x, y') \cdot \theta - \phi(x, y') \cdot \tilde\theta| \le \epsilon/2 for all (x, y) \in S and y' \in Y. We now argue that \tilde g is close to g. For each (x, y) \in S,

  g(x, y) = -\log p_\theta(y \mid x)
          = -\phi(x, y) \cdot \theta + \log \sum_{y' \in Y} e^{\phi(x, y') \cdot \theta}
          \le -(\phi(x, y) \cdot \tilde\theta - \epsilon/2) + \log \sum_{y' \in Y} e^{\phi(x, y') \cdot \tilde\theta + \epsilon/2}
          = \tilde g(x, y) + \epsilon.

Similarly, g(x, y) \ge \tilde g(x, y) - \epsilon. Therefore, C_M is an \epsilon-cover of M.

Substituting (12) into Lemma 4 gives an expression for N_\infty(M, \epsilon, n); substituting that into (10), using the fact that N_1 \le N_\infty, and solving for n yields the theorem.