Discriminative Log-Linear Grammars with Latent Variables

Slav Petrov and Dan Klein
Computer Science Department, EECS Division
University of California at Berkeley, Berkeley, CA 94720
{petrov, klein}@cs.berkeley.edu

Abstract

We demonstrate that log-linear grammars with latent variables can be practically trained using discriminative methods. Central to efficient discriminative training is a hierarchical pruning procedure which allows feature expectations to be efficiently approximated in a gradient-based procedure. We compare L1 and L2 regularization and show that L1 regularization is superior, requiring fewer iterations to converge and yielding sparser solutions. On full-scale treebank parsing experiments, the discriminative latent models outperform both the comparable generative latent models and the discriminative non-latent baselines.

1 Introduction

In recent years, latent annotation of PCFGs has been shown to perform as well as or better than standard lexicalized methods for treebank parsing [1, 2]. In the latent annotation scenario, we imagine that the observed treebank is a coarse trace of a finer, unobserved grammar. For example, the single treebank category NP (noun phrase) may be better modeled by several finer categories representing subject NPs, object NPs, and so on. At the same time, discriminative methods have consistently provided advantages over their generative counterparts, including fewer restrictions on features and greater accuracy [3, 4, 5]. In this work, we therefore investigate discriminative learning of latent PCFGs, hoping to gain the best of both lines of work.

Discriminative methods for parsing are not new. However, most discriminative methods, at least those which globally trade off feature weights, require repeated parsing of the training set, which is generally impractical. Previous work on end-to-end discriminative parsing has therefore resorted to "toy setups," considering only sentences of length ≤ 15 [6, 7, 8] or extremely small corpora [9]. To get the benefits of discriminative methods, it has therefore become common practice to extract n-best candidate lists from a generative parser and then use a discriminative component to rerank this list. In such an approach, repeated parsing of the training set can be avoided because the discriminative component only needs to select the best tree from a fixed candidate list. While most state-of-the-art parsing systems apply this hybrid approach [10, 11, 12], it has the limitation that the candidate list often does not contain the correct parse tree. For example, 41% of the correct parses were not in the candidate pool of 30-best parses in [10].

In this paper we present a hierarchical pruning procedure that exploits the structure of the model and allows feature expectations to be efficiently approximated, making discriminative training of full-scale grammars practical. We present a gradient-based procedure for training a discriminative grammar on the entire WSJ section of the Penn Treebank (roughly 40,000 sentences containing 1 million words). We then compare L1 and L2 regularization and show that L1 regularization is superior, requiring fewer iterations to converge and yielding sparser solutions. Independent of the regularization, discriminative grammars significantly outperform their generative counterparts in our experiments.
Figure 1: (a) The original tree. (b) The (binarized) X-bar tree. (c) The annotated tree.

2 Grammars with latent annotations

Context-free grammars (CFGs) underlie most high-performance parsers in one way or another [13, 12, 14]. However, a CFG which simply takes the empirical productions and probabilities off of a treebank does not perform well. This naive grammar is a poor one because its context-freedom assumptions are too strong in some places and too weak in others. Therefore, a variety of techniques have been developed to both enrich and generalize the naive grammar. Recently, an automatic state-splitting approach was shown to produce state-of-the-art performance [2, 14]. We extend this line of work by investigating discriminative estimation techniques for automatically refined grammars.

We consider grammars that are automatically derived from a raw treebank. Our experiments are based on a completely unsplit X-bar grammar, obtained directly from the Penn Treebank by the binarization procedure shown in Figure 1. For each local tree rooted at an evaluation category X, we introduce a cascade of new nodes labeled X so that each has two children, in a right-branching fashion. Each node is then refined with a latent variable, splitting each observed category into k unobserved subcategories. We refer to trees over unsplit categories as parse trees and trees over split categories as derivations.

Our log-linear grammars are parametrized by a vector \theta which is indexed by productions X \to \gamma. The conditional probability of a derivation tree t given a sentence w can be written as:

P(t|w) = \frac{1}{Z(\theta, w)} \prod_{X \to \gamma \in t} e^{\theta_{X \to \gamma}} = \frac{1}{Z(\theta, w)} e^{\theta^\top f(t)}    (1)

where Z(\theta, w) is the partition function and f(t) is a vector indicating how many times each production occurs in the derivation t. The inside/outside algorithm [15] gives us an efficient way of summing over an exponential number of derivations. Given a sentence w spanning the words w_1, w_2, \ldots, w_n = w_{1:n}, the inside and outside scores of a (split) category A spanning (i, j) are computed by summing over all possible children B and C spanning (i, k) and (k, j) respectively:

S_{IN}(A, i, j) = \sum_{B,C} \sum_{k} e^{\theta_{A \to BC}} \times S_{IN}(B, i, k) \times S_{IN}(C, k, j)
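To make the grammar construction above concrete, the following is a minimal sketch of the right-branching binarization of a single local tree and of k-way latent splitting, assuming a simple (label, children) tuple representation for trees. The function names and the reuse of the parent label for the intermediate nodes are illustrative choices, not the authors' code.

```python
def binarize(label, children):
    """Right-branching binarization of one local tree: keep the first child and
    push the remaining children under a cascade of new nodes, as in Figure 1(b)."""
    if len(children) <= 2:
        return (label, children)
    first, rest = children[0], children[1:]
    return (label, [first, binarize(label, rest)])

def split_categories(categories, k):
    """Split each observed category into k latent subcategories,
    e.g. NP -> NP-0, ..., NP-(k-1), as in Figure 1(c)."""
    return {cat: ["%s-%d" % (cat, i) for i in range(k)] for cat in categories}

# Example: binarize("NP", [dt_tree, jj_tree, nn_tree])
#   -> ("NP", [dt_tree, ("NP", [jj_tree, nn_tree])])
```

Applying this transformation to every local tree of a treebank tree yields the binarized trees of Figure 1(b); refining every node with one of the k subcategories yields a derivation in the sense used above.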
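Equation (1) and the inside recursion can likewise be sketched directly: each split production A -> B C carries a weight theta, its multiplicative score is e^theta, and the partition function Z(theta, w) falls out as the total inside score of the root category over the whole sentence. This is an illustrative sketch under simplifying assumptions (a purely binary grammar plus a toy lexicon of weighted (tag, word) productions), not the authors' implementation.

```python
import math
from collections import defaultdict

def inside_scores(words, lexicon, theta):
    """words:   list of tokens.
    lexicon: dict mapping lexical productions (split_tag, word) -> weight.
    theta:   dict mapping split binary productions (A, B, C) -> weight.
    Returns S[(A, i, j)], the summed score of all derivations of A over words[i:j]."""
    n = len(words)
    S = defaultdict(float)
    # Length-1 spans are scored by the lexical production weights.
    for i, w in enumerate(words):
        for (tag, word), weight in lexicon.items():
            if word == w:
                S[(tag, i, i + 1)] += math.exp(weight)
    # Longer spans: sum over split points k and binary productions A -> B C,
    # multiplying e^theta for the production with the children's inside scores.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (A, B, C), weight in theta.items():
                    left, right = S[(B, i, k)], S[(C, k, j)]
                    if left > 0.0 and right > 0.0:
                        S[(A, i, j)] += math.exp(weight) * left * right
    return S

# The partition function Z(theta, w) of Eq. (1) is then the inside score of the
# root span, e.g. the sum of S[(root_sub, 0, len(words))] over root subcategories.
```

In practice one would carry out this recursion in log space or with scaling to avoid numerical underflow, and, as noted in the introduction, pruning is what makes the computation tractable for full-scale split grammars.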