Exploring the Potential of Intractable Parsers

Mark Hopkins, Dept. of Computational Linguistics, Saarland University, Saarbrücken, Germany, mhopkins@coli.uni-sb.de
Jonas Kuhn, Dept. of Computational Linguistics, Saarland University, Saarbrücken, Germany, jonask@coli.uni-sb.de

Abstract

We revisit the idea of history-based parsing, and present a history-based parsing framework that strives to be simple, general, and flexible. We also provide a decoder for this probability model that is linear-space, optimal, and anytime. A parser based on this framework, when evaluated on Section 23 of the Penn Treebank, compares favorably with other state-of-the-art approaches, in terms of both accuracy and speed.

Figure 1: Example parse tree: (NP (NP (DT that) (NN money)) (PP (IN in) (NP (DT the) (NN market)))).

1 Introduction

Much of the current research into probabilistic parsing is founded on probabilistic context-free grammars (PCFGs) (Collins, 1996; Charniak, 1997; Collins, 1999; Charniak, 2000; Charniak, 2001; Klein and Manning, 2003). For instance, consider the parse tree in Figure 1. One way to decompose this parse tree is to view it as a sequence of applications of CFG rules. For this particular tree, we could view it as the application of rule "NP → NP PP," followed by rule "NP → DT NN," followed by rule "DT → that," and so forth. Hence instead of analyzing P(tree), we deal with the more modular:

P(NP → NP PP, NP → DT NN, DT → that, NN → money, PP → IN NP, IN → in, NP → DT NN, DT → the, NN → market)

Obviously this joint distribution is just as difficult to assess and compute with as P(tree). However there exist cubic-time dynamic programming algorithms to find the most likely parse if we assume that all CFG rule applications are marginally independent of one another. The problem, of course, with this simplification is that although it is computationally attractive, it is usually too strong an independence assumption. To mitigate this loss of context, without sacrificing algorithmic tractability, researchers typically annotate the nodes of the parse tree with contextual information. A simple example is the annotation of nodes with their parent labels (Johnson, 1998). The choice of which annotations to use is one of the main features that distinguish parsers based on this approach. Generally, this approach has proven quite effective in producing English phrase-structure grammar parsers that perform well on the Penn Treebank.

One drawback of this approach is its inflexibility. Because we are adding probabilistic context by changing the data itself, we make our data increasingly sparse as we add features. Thus we are constrained from adding too many features, because at some point we will not have enough data to sustain them. We must strike a delicate balance between how much context we want to include versus how much we dare to partition our data set.

The major alternative to PCFG-based approaches are so-called history-based parsers (Black et al., 1993). These parsers differ from PCFG parsers in that they incorporate context by using a more complex probability model, rather than by modifying the data itself. The tradeoff to using a more powerful probabilistic model is that one can no longer employ dynamic programming to find the most probable parse. Thus one trades assurances of polynomial running time for greater modeling flexibility.
There are two canonical parsers that fall into this category: the decision-tree parser of (Magerman, 1995) and the maximum-entropy parser of (Ratnaparkhi, 1997). Both showed decent results on parsing the Penn Treebank, but in the decade since these papers were published, history-based parsers have been largely ignored by the research community in favor of PCFG-based approaches. There are several reasons why this may be. First is naturally the matter of time efficiency. Magerman reports decent parsing times, but for the purposes of efficiency, must restrict his results to sentences of length 40 or less. Furthermore, his two-phase stack decoder is a bit complicated and is acknowledged to require too much memory to handle certain sentences. Ratnaparkhi is vague about the running time performance of his parser, stating that it is "observed linear-time," but in any event provides only a heuristic, not a complete algorithm.

Next is the matter of flexibility. The main advantage of abandoning PCFGs is the opportunity to have a more flexible and adaptable probabilistic parsing model. Unfortunately, both Magerman's and Ratnaparkhi's models are rather specific and complicated. Ratnaparkhi's, for instance, consists of the interleaved sequence of four different types of tree construction operations. Furthermore, both are inextricably tied to the learning procedure that they employ (decision trees for Magerman, maximum entropy for Ratnaparkhi).

In this work, our goal is to revisit history-based parsers, and provide a general-purpose framework that is (a) simple, (b) fast, (c) space-efficient and (d) easily adaptable to new domains. As a method of evaluation, we use this framework with a very simple set of features to see how well it performs (both in terms of accuracy and running time) on the Penn Treebank. The overarching goal is to develop a history-based hierarchical labeling framework that is viable not only for parsing, but for other application areas that currently rely on dynamic programming, like phrase-based machine translation.

2 Preliminaries

For the following discussion, it will be useful to establish some terminology and notational conventions. Typically we will represent variables with capital letters (e.g. X, Y) and sets of variables with bold-faced capital letters (e.g. X, Y). The domain of a variable X will be denoted dom(X), and typically we will use the lower-case correspondent (in this case, x) to denote a value in the domain of X. A partial assignment (or simply assignment) of a set X of variables is a function w that maps a subset W of the variables of X to values in their respective domains. We define dom(w) = W. When W = X, we say that w is a full assignment of X. The trivial assignment of X makes no variable assignments. Let w(X) denote the value that partial assignment w assigns to variable X. For a value x ∈ dom(X), let w[X = x] denote the assignment identical to w except that w[X = x](X) = x. For a set Y of variables, let w|Y denote the restriction of partial assignment w to the variables in dom(w) ∩ Y.
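The notation above maps directly onto a small data structure. The sketch below is our own illustration (the class and method names are not from the paper): an immutable partial assignment supporting dom(w), w(X), w[X = x], and the restriction w|Y.

    # A minimal realization of the partial-assignment notation of Section 2.
    # Variable names are arbitrary hashable objects; values likewise.

    class Assignment:
        """A partial assignment w: a map from a subset of variables to values."""

        def __init__(self, mapping=None):
            self._map = dict(mapping or {})

        def dom(self):
            """dom(w): the set of variables that w assigns."""
            return set(self._map)

        def __call__(self, var):
            """w(X): the value assigned to X (X must be in dom(w))."""
            return self._map[var]

        def extend(self, var, value):
            """w[X = x]: a new assignment identical to w except that X maps to x."""
            new_map = dict(self._map)
            new_map[var] = value
            return Assignment(new_map)

        def restrict(self, variables):
            """w|Y: the restriction of w to the variables in dom(w) ∩ Y."""
            return Assignment({v: x for v, x in self._map.items() if v in variables})


    # The trivial assignment makes no variable assignments at all.
    trivial = Assignment()
    w = trivial.extend("X", 1).extend("Y", True)
    assert w.dom() == {"X", "Y"} and w("X") == 1
    assert w.restrict({"Y", "Z"}).dom() == {"Y"}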
3 The Generative Model

The goal of this section is to develop a probabilistic process that generates labeled trees in a manner considerably different from PCFGs. We will use the tree in Figure 2 to motivate our model. In this example, nodes of the tree are labeled with either an A or a B. We can represent this tree using two charts. One chart labels each span with a boolean value, such that a span is labeled true iff it is a constituent in the tree. The other chart labels each span with a label from our labeling scheme (A or B) or with the value null (to represent that the span is unlabeled). We show these charts in Figure 3. Notice that we may want to have more than one labeling scheme. For instance, in the parse tree of Figure 1, there are three different types of labels: word labels, preterminal labels, and nonterminal labels. Thus we would use four 5x5 charts instead of two 3x3 charts to represent that tree.

Figure 2: Example labeled tree. The root, labeled A, spans (1, 3); its children are a node labeled B spanning (1, 2) and a leaf labeled B at (3, 3); the node spanning (1, 2) has leaf children labeled A at (1, 1) and B at (2, 2).

Figure 3: Chart representation of the example tree: the left chart tells us which spans are tree constituents, and the right chart tells us the labels of the spans (null means unlabeled).

    constituency chart: (1,1) true, (2,2) true, (3,3) true, (1,2) true, (2,3) false, (1,3) true
    label chart:        (1,1) A,    (2,2) B,    (3,3) B,    (1,2) B,    (2,3) null,  (1,3) A

We will pause here and generalize these concepts. Define a labeling scheme as a set of symbols including a special symbol null (this will designate that a given span is unlabeled). For instance, we can define L1 = {null, A, B} to be a labeling scheme for the example tree.
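For concreteness, the two charts of Figure 3 can be written down directly as dictionaries keyed by spans; the encoding below is our own illustration of the example tree, with Python's None standing in for the null label.

    # The tree of Figure 2 as its two-chart representation (cf. Figure 3).
    # Spans (i, j) with 1 <= i <= j <= 3; S_chart says whether a span is a
    # constituent, L_chart gives its label from L1 = {None, "A", "B"}.

    n = 3
    S_chart = {
        (1, 1): True, (2, 2): True, (3, 3): True,
        (1, 2): True, (2, 3): False,
        (1, 3): True,
    }
    L_chart = {
        (1, 1): "A", (2, 2): "B", (3, 3): "B",
        (1, 2): "B", (2, 3): None,
        (1, 3): "A",
    }

    # Sanity check: in this example the unlabeled spans are exactly the
    # non-constituents.
    for span, is_constituent in S_chart.items():
        assert is_constituent == (L_chart[span] is not None)

    constituents = sorted(s for s, v in S_chart.items() if v)
    print(constituents)   # [(1, 1), (1, 2), (1, 3), (2, 2), (3, 3)]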
Let L = {L1, L2, ..., Lm} be a set of labeling schemes. Define a model variable of L as a symbol of the form S_ij or L^k_ij, for positive integers i, j, k, such that i ≤ j and k ≤ m. Model variables of the form S_ij indicate whether span (i, j) is a tree constituent, hence the domain of S_ij is {true, false}. Such variables correspond to entries in the left chart of Figure 3. Model variables of the form L^k_ij indicate which label from scheme L^k is assigned to span (i, j), hence the domain of model variable L^k_ij is L^k. Such variables correspond to entries in the right chart of Figure 3. Here we have only one labeling scheme. Let V_L be the (countably infinite) set of model variables of L. Usually we are interested in trees over a given sentence of finite length n. Let V_L^n denote the finite subset of V_L that includes precisely the model variables of the form S_ij or L^k_ij, where j ≤ n.

Basically then, our model consists of two types of decisions: (1) whether a span should be labeled, and (2) if so, what label(s) the span should have. Let us proceed with our example. To generate the tree of Figure 2, the first decision we need to make is how many leaves it will have (or equivalently, how large our tables will be). We assume that we have a probability distribution P_N over the set of positive integers. For our example tree, we draw the value 3, with probability P_N(3). Now that we know our tree will have three leaves, we can decide which spans will be constituents and what labels they will have. In other words, we assign values to the variables in V_L^3. First we need to choose the order in which we will make these assignments. For our example, we will assign model variables in the following order: S_11, L^1_11, S_22, L^1_22, S_33, L^1_33, S_12, L^1_12, S_23, L^1_23, S_13, L^1_13. A detailed look at this assignment process should help clarify the details of the model.

Assigning S_11: The first model variable in our order is S_11. In other words, we need to decide whether the span (1, 1) should be a constituent. We could let this decision be probabilistically determined, but recall that we are trying to generate a well-formed tree, thus the leaves and the root should always be considered constituents. To handle situations when we would like to make deterministic variable assignments, we supply an auxiliary function A that tells us (given a model variable X and the history of decisions made so far) whether X should be automatically determined, and if so, what value it should be assigned. In our running example, we ask A whether S_11 should be automatically determined, given the previous assignments made (so far only the value chosen for n, which was 3). The so-called auto-assignment function A responds (since S_11 is a leaf span) that S_11 should be automatically assigned the value true, making span (1, 1) a constituent.

Assigning L^1_11: Next we want to assign a label to the first leaf of our tree. There is no compelling reason to deterministically assign this label. Therefore, the auto-assignment function A declines to assign a value to L^1_11, and we proceed to assign its value probabilistically. For this task, we would like a probability distribution over the labels of labeling scheme L1 = {null, A, B}, conditioned on the decision history so far. The difficulty is that it is clearly impractical to learn conditional distributions over every conceivable history of variable assignments. So first we distill the important features from an assignment history. For instance, one such feature (though possibly not a good one) could be whether an odd or an even number of nodes have so far been labeled with an A. Our conditional probability distribution is conditioned on the values of these features, instead of the entire assignment history. Consider specifically model variable L^1_11. We compute its features (an even number of nodes, namely zero, have so far been labeled with an A), and then we use these feature values to access the relevant probability distribution over {null, A, B}. Drawing from this conditional distribution, we probabilistically assign the value A to variable L^1_11.

Assigning S_22, L^1_22, S_33, L^1_33: We proceed in this way to assign values to S_22, L^1_22, S_33, L^1_33 (the S-variables deterministically, and the L^1-variables probabilistically).

Assigning S_12: Next comes model variable S_12. Here, there is no reason to deterministically dictate whether span (1, 2) is a constituent or not. Both should be considered options. Hence we treat this situation the same as for the L^1 variables. First we extract the relevant features from the assignment history. We then use these features to access the correct probability distribution over the domain of S_12 (namely {true, false}). Drawing from this conditional distribution, we probabilistically assign the value true to S_12, making span (1, 2) a constituent in our tree.

Assigning L^1_12: We proceed to probabilistically assign the value B to L^1_12, in the same manner as we did with the other L^1 model variables.

Assigning S_23: Now we must determine whether span (2, 3) is a constituent. We could again probabilistically assign a value to S_23 as we did for S_12, but this could result in a hierarchical structure in which both spans (1, 2) and (2, 3) are constituents, which is not a tree. For trees, we cannot allow two model variables S_ij and S_kl to both be assigned true if they properly overlap, i.e. their spans overlap and one is not a subspan of the other. Fortunately we have already established auto-assignment function A, and so we simply need to ensure that it automatically assigns the value false to model variable S_kl if a properly overlapping model variable S_ij has previously been assigned the value true.
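The properly-overlapping condition used in this deterministic step is easy to state in code; the helper below is our own rendering of the definition just given.

    def properly_overlap(span1, span2):
        """Two spans properly overlap if they overlap and neither contains the other."""
        (i, j), (k, l) = span1, span2
        overlap = not (j < k or l < i)                        # they share at least one position
        nested = (i <= k and l <= j) or (k <= i and j <= l)   # one is a subspan of the other
        return overlap and not nested

    # In the running example, once S_12 = true, the auto-assignment function must
    # set S_23 = false, because (1, 2) and (2, 3) properly overlap:
    assert properly_overlap((1, 2), (2, 3))
    assert not properly_overlap((1, 2), (1, 3))   # (1, 2) is nested inside (1, 3)
    assert not properly_overlap((1, 1), (2, 2))   # disjoint spans do not overlap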
Assigning L^1_23, S_13, L^1_13: In this manner, we can complete our variable assignments: L^1_23 is automatically determined (since span (2, 3) is not a constituent, it should not get a label), as is S_13 (to ensure a rooted tree), while the label of the root is probabilistically assigned.

We can summarize this generative process as a general modeling tool. Define a hierarchical labeling process (HLP) as a 5-tuple ⟨L, <, A, F, P⟩ where:

· L = {L1, L2, ..., Lm} is a finite set of labeling schemes.

· < is a model order, defined as a total ordering of the model variables V_L such that for all i, j, k: S_ij < L^k_ij (i.e. we decide whether a span is a constituent before attempting to label it).

· A is an auto-assignment function. Specifically A takes three arguments: a model variable Y of V_L, a partial assignment x of V_L, and an integer n. The function A maps this 3-tuple to false if the variable Y should not be automatically assigned a value based on the current history, or to the pair ⟨true, y⟩, where y is the value in the domain of Y that should be automatically assigned to Y.

· F = {F^S, F^1, F^2, ..., F^m} is a set of feature functions. Specifically, F^k (resp., F^S) takes four arguments: a partial assignment x of V_L, and integers i, j, n such that 1 ≤ i ≤ j ≤ n. It maps this 4-tuple to a full assignment f^k (resp., f^S) of some finite set F_k (resp., F_S) of feature variables.

· P = {P_N, P_S, P_1, P_2, ..., P_m} is a set of probability distributions. P_N is a marginal probability distribution over the set of positive integers, whereas {P_S, P_1, P_2, ..., P_m} are conditional probability distributions. Specifically, P_k (respectively, P_S) is a function that takes as its argument a full assignment f^k (resp., f^S) of feature set F_k (resp., F_S). It maps this to a probability distribution over dom(L^k) (resp., {true, false}).

HLPGEN(HLP H = ⟨L, <, A, F, P⟩):
1. Choose a positive integer n from distribution P_N. Let x be the trivial assignment of V_L.
2. In the order defined by <, compute step 3 for each model variable Y of V_L^n.
3. If A(Y, x, n) = ⟨true, y⟩ for some y in the domain of model variable Y, then let x = x[Y = y]. Otherwise assign a value to Y from its domain:
   (a) If Y = S_ij, then let x = x[S_ij = s_ij], where s_ij is a value drawn from distribution P_S(s | F^S(x, i, j, n)).
   (b) If Y = L^k_ij, then let x = x[L^k_ij = l^k_ij], where l^k_ij is a value drawn from distribution P_k(l^k | F^k(x, i, j, n)).
4. Return ⟨n, x⟩.

Figure 4: Pseudocode for the generative process.

A(variable Y, assignment x, int n):
1. If Y = S_ij, and there exists a properly overlapping model variable S_kl such that x(S_kl) = true, then return ⟨true, false⟩.
2. If Y = S_ii or Y = S_1n, then return ⟨true, true⟩.
3. If Y = L^k_ij, and x(S_ij) = false, then return ⟨true, null⟩.
4. Else return false.

Figure 5: An example auto-assignment function.

An HLP probabilistically generates an assignment of its model variables using the generative process shown in Figure 4. Taking an HLP H = ⟨L, <, A, F, P⟩ as input, HLPGEN outputs an integer n and an H-labeling x of length n, defined as a full assignment of V_L^n. Given the auto-assignment function in Figure 5, every H-labeling generated by HLPGEN can be viewed as a labeled tree using the interpretation: span (i, j) is a constituent iff S_ij = true; span (i, j) has label l^k ∈ dom(L^k) iff L^k_ij = l^k.
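Read operationally, Figures 4 and 5 amount to a short program. The sketch below is our own toy rendering for a single labeling scheme L1 = {null, A, B}: the conditional distributions are placeholders that ignore the assignment history, and the choice of n is passed in rather than drawn from P_N.

    import random

    # A toy rendering of HLPGEN (Figure 4) with the example auto-assignment
    # function of Figure 5, for one labeling scheme L1 = {None, "A", "B"}
    # (None plays the role of null).

    L1 = [None, "A", "B"]

    def spans(n):
        # Span order of the running example: shorter spans first, then left to
        # right; each span's S variable is handled before its L variable.
        return sorted(((i, j) for i in range(1, n + 1) for j in range(i, n + 1)),
                      key=lambda s: (s[1] - s[0], s[0]))

    def properly_overlap(a, b):
        (i, j), (k, l) = a, b
        disjoint = j < k or l < i
        nested = (i <= k and l <= j) or (k <= i and j <= l)
        return not disjoint and not nested

    def auto_assign(var, x, n):
        # Returns ("auto", value) if the variable is deterministically assigned,
        # or None if it should be assigned probabilistically (cf. Figure 5).
        kind, (i, j) = var
        if kind == "S":
            if any(x.get(("S", s)) is True and properly_overlap((i, j), s)
                   for s in spans(n)):
                return ("auto", False)
            if i == j or (i, j) == (1, n):
                return ("auto", True)
        if kind == "L" and x.get(("S", (i, j))) is False:
            return ("auto", None)          # non-constituents get the null label
        return None

    def P_S(x, i, j, n):                   # placeholder distribution over {True, False}
        return {True: 0.5, False: 0.5}

    def P_1(x, i, j, n):                   # placeholder distribution over L1
        return {None: 0.2, "A": 0.4, "B": 0.4}

    def draw(dist):
        values = list(dist)
        return random.choices(values, weights=[dist[v] for v in values])[0]

    def hlpgen(n):
        # Generate an H-labeling of length n (n itself would be drawn from P_N).
        x = {}
        for span in spans(n):
            for var, dist in ((("S", span), P_S), (("L", span), P_1)):
                decision = auto_assign(var, x, n)
                if decision is not None:
                    x[var] = decision[1]
                else:
                    x[var] = draw(dist(x, span[0], span[1], n))
        return x

    random.seed(1)
    print(hlpgen(3))

Because the auto-assignment function forces leaves and the root to true, forces false on properly overlapping spans, and forces the null label on non-constituents, every run of this toy yields a chart that can be read back as a tree, as described above.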
4 Learning

The generative story from the previous section allows us to express the probability of a labeled tree as P(n, x), where x is an H-labeling of length n. For model variable X, define V_L^<(X) as the subset of V_L appearing before X in model order <. With the help of this terminology, we can decompose P(n, x) into the following product:

    P(n, x) = P_N(n) · ∏_{S_ij ∈ Y} P_S(x(S_ij) | f^S_ij) · ∏_{L^k_ij ∈ Y} P_k(x(L^k_ij) | f^k_ij)

where f^S_ij = F^S(x|_{V_L^<(S_ij)}, i, j, n), f^k_ij = F^k(x|_{V_L^<(L^k_ij)}, i, j, n), and Y is the subset of V_L^n that was not automatically assigned by HLPGEN.

Hence the distributions we need to learn are the probability distributions P_S(s_ij | f_S) and P_k(l^k_ij | f_k). This is fairly straightforward. Given a data bank consisting of labeled trees (such as the Penn Treebank), we simply convert each tree into its H-labeling and use the probabilistically determined variable assignments to compile our training instances. In this way, we compile m + 1 sets of training instances that we can use to induce P_S and the P_k distributions. The choice of which learning technique to use is up to the personal preference of the user. The only requirement is that it must return a conditional probability distribution, and not a hard classification. Techniques that allow this include relative frequency, maximum entropy models, and decision trees. For our experiments, we used maximum entropy learning. Specifics are deferred to Section 6.

5 Decoding

Usually in parsing, we are interested in computing the most likely tree given a specific sentence. In our framework, this generalizes to computing argmax_x P(x | n, w), where w is a subassignment of an H-labeling x of length n. In natural language parsing, w could specify the constituency and word labels of the leaf-level spans. This would be equivalent to asking: given a sentence, what is its most likely parse? Let W = dom(w) and suppose that we choose a model order < such that for every pair of model variables W ∈ W, X ∈ V_L \ W, either W < X or W is always auto-assigned. Then P(x | n, w) can be expressed as:

    ∏_{S_ij ∈ Y\W} P_S(x(S_ij) | f^S_ij) · ∏_{L^k_ij ∈ Y\W} P_k(x(L^k_ij) | f^k_ij)
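Both products are straightforward to evaluate once the conditional models are in hand. The sketch below is our own illustration: it scores an H-labeling by summing log-conditionals over the variables that were neither auto-assigned nor fixed by w, with a stand-in uniform model and an empty feature extractor in place of the learned maximum entropy models.

    import math

    def features(x, i, j, n):
        # Placeholder for F^S / F^k: the real parser distills the assignment
        # history into feature values here.
        return ()

    class UniformModel:
        # Stand-in conditional model: ignores the features and returns a fixed
        # probability for every value in a domain of the given size.
        def __init__(self, domain_size):
            self.p = 1.0 / domain_size
        def prob(self, value, feats):
            return self.p

    def log_score(x, n, free_vars, models):
        # log P(x | n, w): sum of log-conditionals over Y \ W, i.e. the model
        # variables that were neither auto-assigned nor specified by w.
        total = 0.0
        for var in free_vars:
            kind, (i, j) = var                 # e.g. ("S", (1, 2)) or ("L1", (1, 2))
            p = models[kind].prob(x[var], features(x, i, j, n))
            total += math.log(p)
        return total

    # The H-labeling of Figures 2 and 3, with the leaf labels playing the role of w.
    x = {("S", (1, 1)): True,  ("S", (2, 2)): True,  ("S", (3, 3)): True,
         ("S", (1, 2)): True,  ("S", (2, 3)): False, ("S", (1, 3)): True,
         ("L1", (1, 1)): "A",  ("L1", (2, 2)): "B",  ("L1", (3, 3)): "B",
         ("L1", (1, 2)): "B",  ("L1", (2, 3)): None, ("L1", (1, 3)): "A"}
    free_vars = [("S", (1, 2)), ("L1", (1, 2)), ("L1", (1, 3))]
    models = {"S": UniformModel(2), "L1": UniformModel(3)}
    print(log_score(x, 3, free_vars, models))   # log(1/2) + 2 * log(1/3)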
For the PCFG parsing model, we can find argmax_tree P(tree | sentence) using a cubic-time dynamic programming-based algorithm. By adopting a more flexible probabilistic model, we sacrifice polynomial-time guarantees. The central question driving this paper is whether we can jettison these guarantees and still obtain good performance in practice. For the decoding of the probabilistic model of the previous section, we choose a depth-first branch-and-bound approach, specifically because of two advantages. First, this approach takes linear space. Second, it is anytime, i.e. it finds a (typically good) solution early and improves this solution as the search progresses. Thus if one does not wish to spend the time to run the search to completion (and ensure optimality), one can easily use this algorithm as a heuristic by halting prematurely and taking the best solution found thus far.

The search space is simple to define. Given an HLP H, the search algorithm simply makes assignments to the model variables (depth-first) in the order defined by <. This search space can clearly grow to be quite large, however in practice the search speed is improved drastically by using branch-and-bound backtracking. Namely, at any choice point in the search space, we first choose the least cost child to expand (i.e. we make the most probable assignment). In this way, we quickly obtain a greedy solution (in linear time). After that point, we can continue to keep track of the best solution we have found so far, and if at any point we reach an internal node of our search tree with partial cost greater than the total cost of our best solution, we can discard this node and discontinue exploration of that subtree. This technique can result in a significant aggregate savings of computation time, depending on the nature of the cost function.

Figure 6 shows the pseudocode for the depth-first branch-and-bound decoder. For an HLP H = ⟨L, <, A, F, P⟩, a positive integer n, and a partial assignment w of V_L^n, the call HLPDECODE(H, n, w) returns the H-labeling x of length n such that P(x | n, w) is maximized.

HLPDECODE(HLP H, int n, assignment w):
1. Initialize stack S with the pair ⟨x∅, 1⟩, where x∅ is the trivial assignment of V_L. Let x_best = x∅; let p_best = 0. Until stack S is empty, repeat steps 2 to 4.
2. Pop the topmost pair ⟨x, p⟩ from stack S.
3. If p > p_best and x is an H-labeling of length n, then: let x_best = x; let p_best = p.
4. If p > p_best and x is not yet an H-labeling of length n, then:
   (a) Let Y be the earliest variable in V_L^n (according to model order <) unassigned by x.
   (b) If Y ∈ dom(w), then push pair ⟨x[Y = w(Y)], p⟩ onto stack S.
   (c) Else if A(Y, x, n) = ⟨true, y⟩ for some value y ∈ dom(Y), then push pair ⟨x[Y = y], p⟩ onto stack S.
   (d) Otherwise for every value y ∈ dom(Y), push pair ⟨x[Y = y], p · q(y)⟩ onto stack S in ascending order of the value of q(y), where q(y) = P_S(y | F^S(x, i, j, n)) if Y = S_ij, and q(y) = P_k(y | F^k(x, i, j, n)) if Y = L^k_ij.
5. Return x_best.

Figure 6: Pseudocode for the decoder.
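For concreteness, the following is our own compact rendering of Figure 6 in Python. The model order, the observed assignment w, the auto-assignment function, and the conditional distributions are supplied by the caller; nothing here is specific to the features or models used in the experiments below.

    def hlp_decode(order, observed, auto_assign, q):
        # Depth-first branch-and-bound over partial assignments (cf. Figure 6).
        # `order` is the model order < as a list of variables; `observed` plays
        # the role of w; `auto_assign(y, x)` returns ("auto", value) or None;
        # `q(y, x)` returns the conditional distribution over dom(y) given the
        # assignment history x.
        best_x, best_p = None, 0.0
        stack = [({}, 1.0)]                        # the trivial assignment, probability 1
        while stack:
            x, p = stack.pop()
            if p <= best_p:                        # bound: cannot beat the incumbent
                continue
            if len(x) == len(order):               # x is a complete H-labeling
                best_x, best_p = x, p
                continue
            y = order[len(x)]                      # earliest variable unassigned by x
            if y in observed:                      # fixed by the input assignment w
                stack.append(({**x, y: observed[y]}, p))
                continue
            decision = auto_assign(y, x)
            if decision is not None:               # deterministically assigned by A
                stack.append(({**x, y: decision[1]}, p))
                continue
            # Push children in ascending order of probability, so the LIFO stack
            # expands the most probable value first (greedy descent, then backtracking).
            for value, prob in sorted(q(y, x).items(), key=lambda kv: kv[1]):
                stack.append(({**x, y: value}, p * prob))
        return best_x, best_p

The stack holds at most one batch of untried alternatives per level of the current path, so space stays linear in the number of model variables; interrupting the loop early simply leaves the best complete labeling found so far in best_x, which is the anytime behavior described above.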
6 Experiments

We employed a familiar experimental set-up. For training, we used sections 2-21 of the WSJ section of the Penn Treebank. As a development set, we used the first 20 files of section 22, and then saved section 23 for testing the final model. One unconventional preprocessing step was taken. Namely, for the entire treebank, we compressed all unary chains into a single node, labeled with the label of the node furthest from the root. We did so in order to simplify our experiments, since the framework outlined in this paper allows only one label per labeling scheme per span. Thus by avoiding unary chains, we avoid the need for many labeling schemes or more complicated compound labels (labels like "NP-NN"). Since our goal here was not to create a parsing tool but rather to explore the viability of this approach, this seemed a fair concession. It should be noted that it is indeed possible to create a fully general parser using our framework (for instance, by using the above idea of compound labels for unary chains).

The main difficulty with this compromise is that it renders the familiar metrics of labeled precision and labeled recall incomparable with previous work (i.e. the LP of a set of candidate parses with respect to the unmodified test set differs from the LP with respect to the preprocessed test set). This would be a major problem, were it not for the existence of other metrics which measure only the quality of a parser's recursive decomposition of a sentence. Fortunately, such metrics do exist, thus we used cross-bracketing statistics as the basic measure of quality for our parser. The cross-bracketing score of a set of candidate parses with respect to the unmodified test set is identical to the cross-bracketing score with respect to the preprocessed test set, hence our preprocessing causes no comparability problems as viewed by this metric.

For our parsing model, we used an HLP H = ⟨L, <, A, F, P⟩ with the following parameters. L consisted of three labeling schemes: the set L^wd of word labels, the set L^pt of preterminal labels, and the set L^nt of nonterminal labels. The order < of the model variables was the unique order such that for all suitable integers i, j, k, l: (1) S_ij < L^wd_ij < L^pt_ij < L^nt_ij, (2) L^nt_ij < S_kl iff span (i, j) is strictly shorter than span (k, l), or they have the same length and integer i is less than integer k. For auto-assignment function A, we essentially used the function in Figure 5, modified so that it automatically assigned null to model variables L^wd_ij and L^pt_ij for i ≠ j (i.e. no preterminal or word tagging of internal nodes), and to model variables L^nt_ii (i.e. no nonterminal tagging of leaves, rendered unnecessary by our preprocessing step). Rather than incorporate part-of-speech tagging into the search process, we opted to pretag the sentences of our development and test sets with an off-the-shelf tagger, namely the Brill tagger (Brill, 1994). Thus the object of our computation was HLPDECODE(H, n, w), where n was the length of the sentence, and partial assignment w specified the word and preterminal labels of the leaves. Given this partial assignment, the job of HLPDECODE was to find the most probable assignment of model variables S_ij and L^nt_ij for 1 ≤ i < j ≤ n.

The two probability models, P_S and P_nt, were trained in the manner described in Section 4. Two decisions needed to be made: which features to use and which learning technique to employ. As for the learning technique, we used maximum entropy models, specifically the implementation called MegaM provided by Hal Daumé (Daumé III, 2004). For P_S, we needed features that would be relevant to deciding whether a given span (i, j) should be considered a constituent. The basic building blocks we used are depicted in Figure 7.

Figure 7: Basic feature templates used to determine constituency and labeling of span (i, j); k is an arbitrary integer. The templates are: word(i+k) = w, preterminal(i+k) = p, label(i+k) = l, category(i+k) = c, signature(i, i+k) = s, word(j+k) = w, preterminal(j+k) = p, label(j+k) = l, and category(j+k) = c.

A few words of explanation are in order. By label(k), we mean the highest nonterminal label so far assigned that covers word k, or if such a label does not yet exist, then the preterminal label of k (recall that our model order was bottom-up). By category(k), we mean the category of the preterminal label of word k (given a coarser, hand-made categorization of preterminal labels that grouped all noun tags into one category, all verb tags into another, etc.). By signature(k, m), where k ≤ m, we mean the sequence ⟨label(k), label(k+1), ..., label(m)⟩, from which all consecutive sequences of identical labels are compressed into a single label. For instance, ⟨IN, NP, NP, VP, VP⟩ would become ⟨IN, NP, VP⟩. Ad-hoc conjunctions of these basic binary features were used as features for our probability model P_S.
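The signature building block is essentially run-length compression over the current label sequence; the helper below is our own sketch of it.

    from itertools import groupby

    def signature(labels):
        # Compress consecutive runs of identical labels into a single label,
        # e.g. ["IN", "NP", "NP", "VP", "VP"] becomes ["IN", "NP", "VP"].
        return [label for label, _ in groupby(labels)]

    assert signature(["IN", "NP", "NP", "VP", "VP"]) == ["IN", "NP", "VP"]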
In total, approximately 800,000 such conjunctions were used. For P_nt, we needed features that would be relevant to deciding which nonterminal label to give to a given constituent span. For this somewhat simpler task, we used a subset of the basic features used for P_S, shown in bold in Figure 7. Ad-hoc conjunctions of these boldface binary features were used as features for our probability model P_nt. In total, approximately 100,000 such conjunctions were used.

As mentioned earlier, we used cross-bracketing statistics as our basis of comparison. These results are shown in Figure 8. CB denotes the average cross-bracketing, i.e. the overall percentage of candidate constituents that properly overlap with a constituent in the gold parse. 0CB denotes the percentage of sentences in the test set that exhibit no cross-bracketing.

Figure 8: Cross-bracketing results for Section 23 of the Penn Treebank.

                              length <= 40        length <= 100
                              CB      0CB         CB      0CB
    Magerman (1995)           1.26    56.6
    Collins (1996)            1.14    59.9
    Klein/Manning (2003)      1.10    60.3        1.31    57.2
    this paper                1.09    58.2        1.25    55.2
    Charniak (1997)           1.00    62.1
    Collins (1999)            0.90    67.1

With a simple feature set, we manage to obtain performance comparable to the unlexicalized PCFG parser of (Klein and Manning, 2003) on the set of sentences of length 40 or less. On the subset of Section 23 consisting of sentences of length 100 or less, our parser slightly outperforms their results in terms of average cross-bracketing. Interestingly, our parser has a lower percentage of sentences exhibiting no cross-bracketing. To reconcile this result with the superior overall cross-bracketing score, it would appear that when our parser does make bracketing errors, the errors tend to be less severe.
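As a reference point for how such figures can be computed, the sketch below is our own illustration of a per-sentence crossing-bracket count based on the properly-overlapping-span test defined in Section 3; it is not the evaluation script behind Figure 8.

    def properly_overlap(a, b):
        (i, j), (k, l) = a, b
        disjoint = j < k or l < i
        nested = (i <= k and l <= j) or (k <= i and j <= l)
        return not disjoint and not nested

    def crossing_brackets(candidate_spans, gold_spans):
        # Number of candidate constituents that properly overlap some gold constituent.
        return sum(any(properly_overlap(c, g) for g in gold_spans)
                   for c in candidate_spans)

    # 0CB is the percentage of sentences for which this count is zero; CB
    # aggregates the per-sentence crossings over the whole test set.
    gold = {(1, 2), (3, 5), (4, 5), (1, 5)}
    candidate = {(1, 3), (4, 5), (1, 5)}
    print(crossing_brackets(candidate, gold))   # 1: only (1, 3) crosses a gold span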
The surprise was how quickly the parser performed. Despite its exponential worst-case time bounds, the search space turned out to be quite conducive to depth-first branch-and-bound pruning. Using an unoptimized Java implementation on a 4x Opteron 848 with 16GB of RAM, the parser required (on average) less than 0.26 seconds per sentence to optimally parse the subset of Section 23 comprised of sentences of 40 words or less. It required an average of 0.48 seconds per sentence to optimally parse the sentences of 100 words or less (an average of less than 3.5 seconds per sentence for those sentences of length 41-100). As noted earlier, the parser requires space linear in the size of the sentence.

7 Discussion

This project began with a question: can we develop a history-based parsing framework that is simple, general, and effective? We sought to provide a versatile probabilistic framework that would be free from the constraints that dynamic programming places on PCFG-based approaches. The work presented in this paper gives favorable evidence that more flexible (and worst-case intractable) probabilistic approaches can indeed perform well in practice, both in terms of running time and parsing quality.

We can extend this research in multiple directions. First, the set of features we selected were chosen with simplicity in mind, to see how well a simple and unadorned set of features would work, given our probabilistic model. A next step would be a more carefully considered feature set. For instance, although lexical information was used, it was employed in only a most basic sense. There was no attempt to use head information, which has been so successful in PCFG parsing methods. Another parameter to experiment with is the model order, i.e. the order in which the model variables are assigned. In this work, we explored only one specific order (the left-to-right, leaves-to-head assignment), but in principle there are many other feasible orders. For instance, one could try a top-down approach, or a bottom-up approach in which internal nodes are assigned immediately after all of their descendants' values have been determined.

Throughout this paper, we strove to present the model in a very general manner. There is no reason why this framework cannot be tried in other application areas that rely on dynamic programming techniques to perform hierarchical labeling, such as phrase-based machine translation. Applying this framework to such application areas, as well as developing a general-purpose parser based on HLPs, are the subject of our continuing work.

References

Ezra Black, Fred Jelinek, John Lafferty, David M. Magerman, Robert Mercer, and Salim Roukos. 1993. Towards history-based grammars: using richer models for probabilistic parsing. In Proc. ACL.

Eric Brill. 1994. Some advances in rule-based part of speech tagging. In Proc. AAAI.

Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In Proc. AAAI.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proc. NAACL.

Eugene Charniak. 2001. Immediate-head parsing for language models. In Proc. ACL.

Michael Collins. 1996. A new statistical parser based on bigram lexical dependencies. In Proc. ACL.

Michael Collins. 1999. Head-driven statistical models for natural language parsing. Ph.D. thesis, University of Pennsylvania.

Hal Daumé III. 2004. Notes on CG and LM-BFGS optimization of logistic regression. Paper available at http://www.isi.edu/~hdaume/docs/daume04cgbfgs.ps, implementation available at http://www.isi.edu/~hdaume/megam/, August.

Mark Johnson. 1998. PCFG models of linguistic tree representations. Computational Linguistics, 24:613-632.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proc. ACL.

David M. Magerman. 1995. Statistical decision-tree models for parsing. In Proc. ACL.

Adwait Ratnaparkhi. 1997. A linear observed time statistical parser based on maximum entropy models. In Proc. EMNLP.