Positive Results for Parsing with a Bounded Stack using a Model-Based Right-Corner Transform

William Schuler
Dept. of Computer Science and Engineering, Minneapolis, MN
schuler@cs.umn.edu

Abstract

Statistical parsing models have recently been proposed that employ a bounded stack in time-series (left-to-right) recognition, using a right-corner transform defined over training trees to minimize stack use (Schuler et al., 2008). Corpus results have shown that a vast majority of naturally-occurring sentences can be parsed in this way using a very small stack bound of three to four elements. This suggests that the standard cubic-time CKY chart-parsing algorithm, which implicitly assumes an unbounded stack, may be wasting probability mass on trees whose complexity is beyond human recognition or generation capacity. This paper first describes a version of the right-corner transform that is defined over entire probabilistic grammars (cast as infinite sets of generable trees), in order to ensure a fair comparison between bounded-stack and unbounded PCFG parsing using a common underlying model; then it presents experimental results that show a bounded-stack right-corner parser using a transformed version of a grammar significantly outperforms an unbounded-stack CKY parser using the original grammar.

1 Introduction

Statistical parsing models have recently been proposed that employ a bounded stack in time-series (left-to-right) recognition, in order to directly and tractably incorporate incremental phenomena such as (co-)reference or disfluency into parsing decisions (Schuler et al., 2008; Miller and Schuler, 2008). These models make use of a right-corner tree transform, based on the left-corner transform described by Johnson (1998), and are supported by corpus results suggesting that most sentences (in English, at least) can be parsed using a very small stack bound of three to four elements (Schuler et al., 2008). This raises an interesting question: if most sentences can be recognized with only three or four elements of stack memory, is the standard cubic-time CKY chart-parsing algorithm, which implicitly assumes an unbounded stack, wasting probability mass on trees whose complexity is beyond human recognition or generation capacity?

This paper presents parsing accuracy results using transformed and untransformed versions of a corpus-trained probabilistic context-free grammar suggesting that this is indeed the case. Experimental results show a bounded-memory time-series parser using a transformed version of a grammar significantly outperforms an unbounded-stack CKY parser using the original grammar. Unlike the tree-based transforms described previously, the model-based transform described in this paper does not introduce additional context from corpus data beyond that contained in the original probabilistic grammar, making it possible to present a fair comparison between bounded- and unbounded-stack versions of the same model. Since this transform takes a probabilistic grammar as input, it can also easily accommodate horizontal and vertical Markovisation (annotating grammar symbols with parent and sibling categories) as described by Collins (1997) and subsequently.
The remainder of this paper is organized as follows: Section 2 describes related approaches to parsing with stack bounds; Section 3 describes an existing bounded-stack parsing framework using a right-corner transform defined over individual trees; Section 4 describes a redefinition of this transform to apply to entire probabilistic grammars, cast as infinite sets of generable trees; and Section 5 describes an evaluation of this transform on the Wall Street Journal corpus of the Penn Treebank showing improved results for a transformed bounded-stack version of a probabilistic grammar over the original unbounded grammar.

2 Related Work

The model examined here is formally similar to Combinatory Categorial Grammar (CCG) (Steedman, 2000). But the CCG account is a competence model as well as a performance model, in that it seeks to unify category representations used in processing with learned generalizations about argument structure; whereas the model described in this paper is exclusively a performance model, allowing generalizations about lexical argument structures to be learned in some other representation, then combined with probabilistic information about parsing strategies to yield a set of derived incomplete constituents. As a result, the model described in this paper has a freer hand to satisfy strict working memory bounds, which may not permit some of the alternative composition operations proposed in the CCG account, thought to be associated with available prosody and quantifier scope analyses.[1]

[1] The lack of support for some of these available scope analyses may not necessarily be problematic for the present model. The complexity of interpreting nested raised quantifiers may place them beyond the capability of human interactive incremental interpretation, but not beyond the capability of post-hoc interpretation (understood after the listener has had time to think about it).

Other models (Abney and Johnson, 1991; Gibson, 1991) seek to explain human processing difficulties as a result of memory capacity limits in parsing ordinary phrase structure trees. The Abney-Johnson and Gibson models adopt a left-corner parsing strategy, of which the right-corner transform described in this paper is a variant, in order to minimize memory usage. But the transform-based model described in this paper exploits a conception of chunking (Miller, 1956) -- in this case, grouping recognized words into stacked-up incomplete constituents -- to operate within much stricter estimates of human short-term memory bounds (Cowan, 2001) than assumed by Abney and Johnson.

Several existing incremental systems are organized around a left-corner parsing strategy (Roark, 2001; Henderson, 2004). But these systems generally keep large numbers of constituents open for modifier attachment in each hypothesis. This allows modifiers to be attached as right children of any such open constituent. But if any number of open constituents are allowed, then either the assumption that stored elements have fixed syntactic (and semantic) structure will be violated, or the assumption that syntax operates within a bounded memory store will be violated, both of which are psycholinguistically attractive as simplifying assumptions.
The HHMM model examined in this paper upholds both the fixed-element and bounded-memory assumptions by hypothesizing fixed reductions of right child constituents into incomplete parents in the same memory element, to make room for new constituents that may be introduced at a later time. These in-element reductions are defined naturally on phrase structure trees as the result of aligning right-corner transformed constituent structures to sequences of random variables in a factored time-series model.

3 Background

The recognition model examined in this paper is a factored time-series model, based on a Hierarchic Hidden Markov Model (Murphy and Paskin, 2001), which probabilistically estimates the contents of a memory store of three to four partially-completed constituents over time. Probabilities for expansions, transitions and reductions in this model can be defined over trees in a training corpus, transformed and mapped to the random variables in an HHMM (Schuler et al., 2008). In Section 4 these probabilities will be computed directly from a probabilistic context-free grammar, in order to evaluate the contribution of stack bounds without introducing additional corpus context into the model.

3.1 A Bounded-Stack Model

HHMMs are factored HMMs which mimic a bounded-memory pushdown automaton (PDA), supporting simple push and pop operations on a bounded stack-like memory store.

HMMs characterize speech or text as a sequence of hidden states $q_t$ (in this case, stacked-up syntactic categories) and observed states $o_t$ (in this case, words) at corresponding time steps $t$. A most likely sequence of hidden states $\hat q_{1..T}$ can then be hypothesized given any sequence of observed states $o_{1..T}$:

$$\hat q_{1..T} = \operatorname*{argmax}_{q_{1..T}} P(q_{1..T} \mid o_{1..T}) \quad (1)$$

$$= \operatorname*{argmax}_{q_{1..T}} P(q_{1..T}) \cdot P(o_{1..T} \mid q_{1..T}) \quad (2)$$

$$\stackrel{\text{def}}{=} \operatorname*{argmax}_{q_{1..T}} \prod_{t=1}^{T} P_{\Theta_A}(q_t \mid q_{t-1}) \cdot P_{\Theta_B}(o_t \mid q_t) \quad (3)$$

using Bayes' Law (Equation 2) and Markov independence assumptions (Equation 3) to define a full $P(q_{1..T} \mid o_{1..T})$ probability as the product of a Transition Model ($\Theta_A$) prior probability $P(q_{1..T}) \stackrel{\text{def}}{=} \prod_t P_{\Theta_A}(q_t \mid q_{t-1})$ and an Observation Model ($\Theta_B$) likelihood probability $P(o_{1..T} \mid q_{1..T}) \stackrel{\text{def}}{=} \prod_t P_{\Theta_B}(o_t \mid q_t)$.

Transition probabilities $P_{\Theta_A}(q_t \mid q_{t-1})$ over complex hidden states $q_t$ can be modeled using synchronized levels of stacked-up component HMMs in an HHMM. HHMM transition probabilities are calculated in two phases: a reduce phase (resulting in an intermediate, marginalized state $f_t$), in which component HMMs may terminate; and a shift phase (resulting in a modeled state $q_t$), in which unterminated HMMs transition, and terminated HMMs are re-initialized from their parent HMMs. Variables over intermediate $f_t$ and modeled $q_t$ states are factored into sequences of depth-specific variables -- one for each of $D$ levels in the HHMM hierarchy:

$$f_t \stackrel{\text{def}}{=} \langle f^1_t \ldots f^D_t \rangle \quad (4)$$

$$q_t \stackrel{\text{def}}{=} \langle q^1_t \ldots q^D_t \rangle \quad (5)$$

[Figure 1: Graphical representation of a Hierarchic Hidden Markov Model. Circles denote random variables, and edges denote conditional dependencies. Shaded circles are observations.]

Transition probabilities are then calculated as a product of transition probabilities at each level, using level-specific reduce $\Theta_{R,d}$ and shift $\Theta_{S,d}$ models:

$$P_{\Theta_A}(q_t \mid q_{t-1}) \stackrel{\text{def}}{=} \sum_{f_t} P(f_t \mid q_{t-1}) \cdot P(q_t \mid f_t\, q_{t-1}) \quad (6)$$

$$\stackrel{\text{def}}{=} \sum_{f^{1..D}_t} \prod_{d=1}^{D} P_{\Theta_{R,d}}(f^d_t \mid f^{d+1}_t\, q^d_{t-1}\, q^{d-1}_{t-1}) \cdot P_{\Theta_{S,d}}(q^d_t \mid f^{d+1}_t\, f^d_t\, q^d_{t-1}\, q^{d-1}_t) \quad (7)$$

with $f^{D+1}_t$ and $q^0_t$ defined as constants. In Viterbi decoding, the sums are replaced with argmax operators. This decoding process preserves ambiguity by maintaining competing analyses of the entire memory store. A graphical representation of an HHMM with three levels is shown in Figure 1.
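To make the two-phase factorization in Equations 6-7 concrete, the following sketch computes a single transition probability by marginalizing over the reduce variables. The dictionary-based parameter containers (theta_R, theta_S), the ROOT and COMPLETE placeholder constants, and the state encoding are illustrative assumptions of this sketch, not details taken from the paper; f_domain is the set of complete categories plus the two non-reduction flags.

```python
from itertools import product

def hhmm_transition_prob(q_prev, q_next, theta_R, theta_S, f_domain, D):
    """P(q_t | q_{t-1}) per Equations 6-7: marginalize the reduce variables
    f^1..f^D, multiplying depth-specific reduce and shift factors.

    q_prev, q_next are length-D sequences of per-depth states; theta_R[d] and
    theta_S[d] map conditioning tuples to {outcome: probability} dicts."""
    total = 0.0
    for f in product(f_domain, repeat=D):            # joint assignment of f^1..f^D
        prob = 1.0
        for d in range(1, D + 1):
            f_below = f[d] if d < D else 'COMPLETE'  # f^{D+1} is a constant in G
            q_up_prev = q_prev[d - 2] if d > 1 else 'ROOT'   # q^{d-1}_{t-1}
            q_up_next = q_next[d - 2] if d > 1 else 'ROOT'   # q^{d-1}_t
            # reduce phase: P_R,d(f^d | f^{d+1}, q^d_{t-1}, q^{d-1}_{t-1})
            prob *= theta_R[d].get((f_below, q_prev[d - 1], q_up_prev), {}) \
                              .get(f[d - 1], 0.0)
            # shift phase: P_S,d(q^d_t | f^{d+1}, f^d, q^d_{t-1}, q^{d-1}_t)
            prob *= theta_S[d].get((f_below, f[d - 1], q_prev[d - 1], q_up_next), {}) \
                              .get(q_next[d - 1], 0.0)
            if prob == 0.0:
                break
        total += prob
    return total
```

In practice the sum over joint reduce assignments would be organized level by level rather than enumerated exhaustively; the exhaustive loop here only mirrors the defining equations.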
Shift and reduce probabilities can then be defined in terms of finitely recursive Finite State Automata (FSAs) with probability distributions over transition, recursive expansion, and final-state status of states at each hierarchy level. In the version of HHMMs used in this paper, each intermediate variable is a reduction or non-reduction state $f^d_t \in G \cup \{1, 0\}$ (indicating, respectively, a complete reduced constituent of some grammatical category from domain $G$, or a failure to reduce due to an 'active' transition being performed, or a failure to reduce due to an 'awaited' transition being performed, as defined in Section 4.3); and each modeled variable is a syntactic state $q^d_t \in G \times G$ (describing an incomplete constituent consisting of an active grammatical category from domain $G$ and an awaited grammatical category from domain $G$).

An intermediate variable $f^d_t$ at depth $d$ may indicate reduction or non-reduction according to $\Theta_{F\text{-}Rd,d}$ if there is a reduction at the depth level immediately below $d$, but must indicate non-reduction (0) with probability 1 if there was no reduction below:[2]

$$P_{\Theta_{R,d}}(f^d_t \mid f^{d+1}_t\, q^d_{t-1}\, q^{d-1}_{t-1}) \stackrel{\text{def}}{=}
\begin{cases}
[\,f^d_t = 0\,] & \text{if } f^{d+1}_t \notin G\\[4pt]
P_{\Theta_{F\text{-}Rd,d}}(f^d_t \mid q^d_{t-1}, q^{d-1}_{t-1}) & \text{if } f^{d+1}_t \in G
\end{cases} \quad (8)$$

Here $[\cdot]$ is an indicator function: $[\phi] = 1$ if $\phi$ is true, and 0 otherwise.

[2] where $f^{D+1}_t \in G$ and $q^0_t = \text{ROOT}$.

Shift probabilities over the modeled variable $q^d_t$ at each level are defined using level-specific transition $\Theta_{Q\text{-}Tr,d}$ and expansion $\Theta_{Q\text{-}Ex,d}$ models:

$$P_{\Theta_{S,d}}(q^d_t \mid f^{d+1}_t\, f^d_t\, q^d_{t-1}\, q^{d-1}_t) \stackrel{\text{def}}{=}
\begin{cases}
[\,q^d_t = q^d_{t-1}\,] & \text{if } f^{d+1}_t \notin G,\ f^d_t \notin G\\[4pt]
P_{\Theta_{Q\text{-}Tr,d}}(q^d_t \mid f^{d+1}_t\, f^d_t\, q^d_{t-1}\, q^{d-1}_t) & \text{if } f^{d+1}_t \in G,\ f^d_t \notin G\\[4pt]
P_{\Theta_{Q\text{-}Ex,d}}(q^d_t \mid q^{d-1}_t) & \text{if } f^{d+1}_t \in G,\ f^d_t \in G
\end{cases} \quad (9)$$

where $f^{D+1}_t \in G$ and $q^0_t = \text{ROOT}$. This model is conditioned on reduce variables at and immediately below the current FSA level. If there is no reduction immediately below the current level (the first case above), it deterministically copies the current FSA state forward to the next time step. If there is a reduction immediately below the current level but no reduction at the current level (the second case above), it transitions the FSA state at the current level, according to the distribution $\Theta_{Q\text{-}Tr,d}$. And if there is a reduction at the current level (the third case above), it re-initializes this state given the state at the level above, according to the distribution $\Theta_{Q\text{-}Ex,d}$. The overall effect is that higher-level FSAs are allowed to transition only when lower-level FSAs terminate. An HHMM therefore behaves like a probabilistic implementation of a pushdown automaton (or shift-reduce parser) with a finite stack, where the maximum stack depth is equal to the number of levels in the HHMM hierarchy.
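The case structure of Equations 8-9 can be read directly as a small dispatch routine. In this sketch, theta_FR, theta_Tr, and theta_Ex are assumed to be per-depth dictionaries of conditional distributions and G the set of complete categories; these names and containers are illustrative, not the paper's implementation.

```python
def reduce_prob(theta_FR, G, d, f_d, f_below, q_prev_d, q_prev_up):
    """P_R,d of Equation 8: forced non-reduction unless the level below reduced."""
    if f_below not in G:                       # no reduction below: f^d must be 0
        return 1.0 if f_d == 0 else 0.0
    return theta_FR[d].get((q_prev_d, q_prev_up), {}).get(f_d, 0.0)

def shift_prob(theta_Tr, theta_Ex, G, d, q_d, f_below, f_d, q_prev_d, q_up):
    """P_S,d of Equation 9: copy, in-element transition, or cross-element expansion."""
    if f_below not in G and f_d not in G:      # nothing completed below: copy forward
        return 1.0 if q_d == q_prev_d else 0.0
    if f_below in G and f_d not in G:          # reduction below only: in-element transition
        return theta_Tr[d].get((f_below, f_d, q_prev_d, q_up), {}).get(q_d, 0.0)
    if f_below in G and f_d in G:              # reduction at this level: expand from above
        return theta_Ex[d].get(q_up, {}).get(q_d, 0.0)
    return 0.0    # f^d in G without a reduction below: ruled out by Equation 8
```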
3.2 Tree-Based Transforms

The right-corner transform used in this paper is simply the left-right dual of a left-corner transform (Johnson, 1998). It transforms all right-branching sequences in a phrase structure tree into left-branching sequences of symbols of the form $A_\eta/A_{\eta\cdot\mu}$, denoting an incomplete instance of an 'active' category $A_\eta$ lacking an instance of an 'awaited' category $A_{\eta\cdot\mu}$ yet to come.[3] These incomplete constituent categories have the same form and much of the same meaning as non-constituent categories in a Combinatory Categorial Grammar (Steedman, 2000).

[3] Here $\eta$ and $\mu$ are node addresses in a binary-branching tree, defined as paths of left (0) or right (1) branches from the root.

Rewrite rules for the right-corner transform are shown below, in terms of the source production and the transformed production at each step:[4]

[4] These rules can be applied recursively from the bottom up on a source tree, synchronously associating subtree structures matched to variables $\alpha$, $\beta$, and $\gamma$ on the left side of each rule with transformed representations of these subtree structures on the right.

· Beginning case: the top of a right-expanding sequence in an ordinary phrase structure tree is mapped to the bottom of a left-expanding sequence in a right-corner transformed tree:

$$A_\eta \to A_{\eta\cdot0}\ A_{\eta\cdot1} \;\;\Rightarrow\;\; A_\eta/A_{\eta\cdot1} \to A_{\eta\cdot0} \quad (10)$$

This case of the right-corner transform may be considered a constrained version of CCG type raising.

· Middle case: each subsequent branch in a right-expanding sequence of an ordinary phrase structure tree is mapped to a branch in a left-expanding sequence of the transformed tree:

$$A_{\eta\cdot\mu} \to A_{\eta\cdot\mu\cdot0}\ A_{\eta\cdot\mu\cdot1} \;\;\Rightarrow\;\; A_\eta/A_{\eta\cdot\mu\cdot1} \to A_\eta/A_{\eta\cdot\mu}\ \ A_{\eta\cdot\mu\cdot0} \quad (11)$$

This case of the right-corner transform may be considered a constrained version of CCG forward function composition.

· Ending case: the bottom of a right-expanding sequence in an ordinary phrase structure tree is mapped to the top of a left-expanding sequence in a right-corner transformed tree:

$$A_{\eta\cdot\mu} \to a_{\eta\cdot\mu} \;\;\Rightarrow\;\; A_\eta \to A_\eta/A_{\eta\cdot\mu}\ \ A_{\eta\cdot\mu} \quad (12)$$

This case of the right-corner transform may be considered a constrained version of CCG forward function application.

The completeness of the above transform rules can be demonstrated by the fact that they cover all possible subtree configurations (with the exception of bare terminals, which are simply copied). The soundness of the above transform rules can be demonstrated by the fact that each rule transforms a right-branching subtree into a left-branching subtree labeled with an incomplete constituent. An example of a right-corner transformed tree is shown in Figure 2(b). An important property of this transform is that it is reversible. Rewrite rules for reversing a right-corner transform are simply the converse of those shown above.

[Figure 2: Trees resulting from (a) a sample phrase structure tree for the sentence "Strong demand for New York City's general obligation bonds propped up the municipal market," and (b) a right-corner transform of this tree. Sequences of left children are recognized from the bottom up through in-element transitions in a Hierarchic Hidden Markov Model. Right children are recognized by expanding to additional stack elements.]

Sequences of left children in the resulting mostly-left-branching trees are recognized from the bottom up, through transitions at the same stack element. Right children, which are much less frequent in the resulting trees, are recognized through cross-element expansions in a bounded-stack recognizer.
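The following is a minimal sketch of the tree-level idea behind these rewrite rules: it flattens each right-branching spine into a left-branching chain of slash categories, recursing into left children so that they are transformed in the same way. It assumes a binarized input tree, represents categories as plain strings, and omits the node-address bookkeeping of the full transform; the Tree class and function names are illustrative, not taken from the paper.

```python
class Tree:
    """Minimal phrase-structure tree: a label and a list of child Trees."""
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)
    def __repr__(self):
        return self.label if not self.children else \
            '(%s %s)' % (self.label, ' '.join(map(repr, self.children)))

def right_corner(tree):
    """Rewrite the right spine below `tree` as a left-branching chain of
    incomplete A/B categories (beginning, middle, and ending cases)."""
    if len(tree.children) == 0:                       # bare terminal: copied
        return tree
    if len(tree.children) == 1:                       # unary / preterminal: recurse
        return Tree(tree.label, [right_corner(tree.children[0])])
    spine_labels, left_parts, node = [], [], tree
    while len(node.children) == 2:                    # walk down the right spine
        spine_labels.append(node.label)
        left_parts.append(right_corner(node.children[0]))
        node = node.children[1]
    spine_labels.append(node.label)                   # category of the spine bottom
    active = spine_labels[0]
    # beginning case: bottom of the new left-expanding sequence
    acc = Tree('%s/%s' % (active, spine_labels[1]), [left_parts[0]])
    # middle cases: one left-branching step per further spine element
    for awaited, left in zip(spine_labels[2:], left_parts[1:]):
        acc = Tree('%s/%s' % (active, awaited), [acc, left])
    # ending case: the spine bottom completes the active category
    return Tree(active, [acc, right_corner(node)])
```

Applied to a spine such as S -> NP VP, VP -> VBD NP, NP -> DT NN, this produces S dominating S/NN, S/NP, and S/VP in a left-branching chain, roughly mirroring the shape of the transformed tree in Figure 2(b).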
4 Model-Based Transforms

In order to compare bounded- and unbounded-stack versions of the same model, the formulation of the right-corner and bounded-stack transforms introduced in this paper does not map trees to trees, but rather maps probability models to probability models. This eliminates complications in comparing models with different numbers of dependent variables -- and thus different numbers of free parameters -- because the model which ordinarily has more free parameters (the HHMM, in this case) is derived from the model that has fewer (the PCFG). Since they are derived from a simpler underlying model, the additional parameters of the HHMM are not free.

Mapping probability models from one format to another can be thought of as mapping the infinite sets of trees that are defined by these models from one format to another. Probabilities in the transformed model are therefore defined by calculating probabilities for the relevant substructures in the source model, then marginalizing out the values of nodes in these structures that do not appear in the desired expression in the target model. A bounded-stack HHMM $\Theta_Q, \Theta_F$ can therefore be derived from an unbounded PCFG $\Theta_G$ by:

1. organizing the rules in the source PCFG model $\Theta_G$ into direction-specific versions (distinguishing rules for expanding left and right children, which occur respectively as active and awaited constituent categories in incomplete constituent labels);

2. enforcing depth limits on these direction-specific rules; and

3. mapping these probabilities to HHMM random variable positions at the appropriate depth.

4.1 Direction-specific rules

An inspection of the tree-based right-corner transform rewrites defined in Section 3.2 will show two things: first, that constituents occurring as left children in an original tree (with addresses ending in '0') always become active constituents (occurring before the slash, or without a slash) in incomplete constituent categories, and constituents occurring as right children in an original tree (with addresses ending in '1') always become awaited constituents (occurring after the slash); and second, that left children expand locally downward in the transformed tree (so each $A_{\eta\cdot0}/...$ locally dominates $A_{\eta\cdot0\cdot0}/...$), whereas right children expand locally upward (so each $.../A_{\eta\cdot1}$ is locally dominated by $.../A_{\eta\cdot1\cdot1}$). This means that rules from the original grammar -- if distinguished into rules applying only to left and right children (active and awaited constituents) -- can still be locally modeled following a right-corner transform. A transformed tree can be generated in this way by expanding downward along the active constituents in a transformed tree, then turning around and expanding upward to fill in the awaited constituents, then turning around again to generate the active constituents at the next depth level, and so on.

4.2 Depth bounds

The locality of the original grammar rules in a right-corner transformed tree allows memory limits on incomplete constituents to be applied directly as depth bounds in the zig-zag generation traversal defined above. These depth limits correspond directly to the depth levels in an HHMM. In the experiments described in Section 5, direction-specific and depth-specific versions of the original grammar rules are implemented in an ordinary CKY-style dynamic-programming parser, and can therefore simply be cut off at a particular depth level with no renormalization.
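As a concrete illustration of how these depth bounds track the zig-zag traversal, the following boolean check decides whether a (binarized) subtree can be recognized within D memory elements: left children are recognized in the same element, while the left child of a right child opens a new element one level deeper. The tuple-based tree encoding and function names are assumptions of this sketch; the probabilities defined in the next paragraphs generalize the same recursion from a yes/no check to a probability of fitting.

```python
# A tree is encoded here as a (label, children) pair, with children a list of
# trees; nodes with fewer than two children (terminals, preterminals) need no
# further memory elements.

def fits_as_left(tree, d, D):
    """Can a constituent generated as a left child at depth d be recognized
    within D memory elements?"""
    _label, children = tree
    if d > D:                                  # no memory element available
        return False
    if len(children) < 2:
        return True
    left, right = children
    return fits_as_left(left, d, D) and fits_as_right(right, d, D)

def fits_as_right(tree, d, D):
    """Right children are recognized in the same element at depth d, but each
    of their left children opens a new element at depth d + 1."""
    _label, children = tree
    if len(children) < 2:
        return True
    left, right = children
    return fits_as_left(left, d + 1, D) and fits_as_right(right, d, D)
```

For example, fits_as_left(tree, 1, 4) asks, roughly, whether a sentence-level tree can be recognized with the four-element store used in Section 5.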
But in an HHMM, this cutoff will result in label-bias effects, in which expanded constituents may have no valid reduction, forcing the system to define distributions for composing constituents that are not compatible. For example, if a constituent is expanded at depth $D$, and that constituent has no expansions that can be completely processed within depth $D$, it will not be able to reduce, and will remain incompatible with the incomplete constituent above it. Probabilities for depth-bounded rules must therefore be renormalized to the domain of allowable trees that can be generated within $D$ depth levels, in order to guarantee consistent probabilities for HHMM recognition.

This is done by determining the (depth- and direction-specific) probability $P_{\Theta_{B\text{-}L,d}}(1 \mid A_{\eta\cdot0})$ or $P_{\Theta_{B\text{-}R,d}}(1 \mid A_{\eta\cdot1})$ that a tree generated at each depth $d$ and rooted by a left or right child will fit within depth $D$. These probabilities are then estimated using an approximate inference algorithm, similar to that used in value iteration (Bellman, 1957), which estimates probabilities of infinite trees by exploiting the fact that increasingly longer trees contribute exponentially decreasing probability mass (since each non-terminal expansion must avoid generating a terminal with some probability at each step from the top down), so a sum over probabilities of trees with increasing length $k$ is guaranteed to converge. The algorithm calculates probabilities of trees with increasing length $k$ until convergence, or to some arbitrary limit $K$:

$$P_{\Theta_{B\text{-}L,d,k}}(1 \mid A_{\eta\cdot0}) \stackrel{\text{def}}{=} \sum_{A_{\eta\cdot0\cdot0},\,A_{\eta\cdot0\cdot1}} P_{\Theta_G}(A_{\eta\cdot0} \to A_{\eta\cdot0\cdot0}\ A_{\eta\cdot0\cdot1}) \cdot P_{\Theta_{B\text{-}L,d,k-1}}(1 \mid A_{\eta\cdot0\cdot0}) \cdot P_{\Theta_{B\text{-}R,d,k-1}}(1 \mid A_{\eta\cdot0\cdot1}) \quad (13)$$

$$P_{\Theta_{B\text{-}R,d,k}}(1 \mid A_{\eta\cdot1}) \stackrel{\text{def}}{=} \sum_{A_{\eta\cdot1\cdot0},\,A_{\eta\cdot1\cdot1}} P_{\Theta_G}(A_{\eta\cdot1} \to A_{\eta\cdot1\cdot0}\ A_{\eta\cdot1\cdot1}) \cdot P_{\Theta_{B\text{-}L,d+1,k-1}}(1 \mid A_{\eta\cdot1\cdot0}) \cdot P_{\Theta_{B\text{-}R,d,k-1}}(1 \mid A_{\eta\cdot1\cdot1}) \quad (14)$$

Normalized probability distributions for depth-bounded expansions $\Theta_{G\text{-}L,d}$ and $\Theta_{G\text{-}R,d}$ can now be calculated using converged $\Theta_{B\text{-}L,d}$ and $\Theta_{B\text{-}R,d}$ estimates:

$$P_{\Theta_{G\text{-}L,d}}(A_{\eta\cdot0} \to A_{\eta\cdot0\cdot0}\ A_{\eta\cdot0\cdot1}) \stackrel{\text{def}}{=} P_{\Theta_G}(A_{\eta\cdot0} \to A_{\eta\cdot0\cdot0}\ A_{\eta\cdot0\cdot1}) \cdot P_{\Theta_{B\text{-}L,d}}(1 \mid A_{\eta\cdot0\cdot0}) \cdot P_{\Theta_{B\text{-}R,d}}(1 \mid A_{\eta\cdot0\cdot1}) \quad (15)$$

$$P_{\Theta_{G\text{-}R,d}}(A_{\eta\cdot1} \to A_{\eta\cdot1\cdot0}\ A_{\eta\cdot1\cdot1}) \stackrel{\text{def}}{=} P_{\Theta_G}(A_{\eta\cdot1} \to A_{\eta\cdot1\cdot0}\ A_{\eta\cdot1\cdot1}) \cdot P_{\Theta_{B\text{-}L,d+1}}(1 \mid A_{\eta\cdot1\cdot0}) \cdot P_{\Theta_{B\text{-}R,d}}(1 \mid A_{\eta\cdot1\cdot1}) \quad (16)$$

4.3 HHMM probabilities

Converting PCFGs to HHMMs requires the calculation of expected frequencies $\mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_\eta \Rightarrow^{*} A_{\eta\cdot\mu}\,\ldots)$ of generating symbols $A_{\eta\cdot\mu}$ in the left-progeny of a nonterminal symbol $A_\eta$ (in other words, of $A_{\eta\cdot\mu}$ being a left child of $A_\eta$, or a left child of a left child of $A_\eta$, etc.). This is done by summing over subtrees of increasing length $k$ using the same approximate inference technique described in Section 4.2, which guarantees convergence since each subtree of increasing length contributes exponentially decreasing probability mass to the sum:

$$\mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_\eta \Rightarrow^{*} A_{\eta\cdot\mu}\,\ldots) = \sum_{k=0}^{\infty} \mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_\eta \Rightarrow^{k} A_{\eta\cdot\mu}\,\ldots) \quad (17)$$

where:

$$\mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_\eta \Rightarrow^{k} A_{\eta\cdot0^{k}}\,\ldots) = \sum_{A_{\eta\cdot0^{k-1}},\,A_{\eta\cdot0^{k-1}\cdot1}} \mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_\eta \Rightarrow^{k-1} A_{\eta\cdot0^{k-1}}\,\ldots) \cdot P_{\Theta_{G\text{-}L,d}}(A_{\eta\cdot0^{k-1}} \to A_{\eta\cdot0^{k}}\ A_{\eta\cdot0^{k-1}\cdot1}) \quad (18)$$

and $\mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_\eta \Rightarrow^{0} A_{\eta'}\,\ldots) = [A_\eta = A_{\eta'}]$.
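The renormalization and left-progeny quantities above are all fixed points of the same kind of recursion, so they can be estimated with a few lines of iterative, value-iteration-style code. The sketch below estimates the depth-fitting probabilities of Equations 13-14; the dictionary layout (pcfg[a] mapping child pairs to probabilities), the strict treatment of depth D+1, the convergence tolerance, and the iteration limit K are assumptions of this sketch rather than details from the paper.

```python
def estimate_depth_bounds(pcfg, D, K=50, tol=1e-9):
    """Estimate P_B-L,d(1|a) and P_B-R,d(1|a): the probability that a
    constituent of category `a`, generated as a left (right) child at depth d,
    can be recognized within D memory elements (Equations 13-14).

    pcfg[a] maps (b, c) pairs to P(a -> b c); categories with no binary
    expansion (preterminals) are taken to fit with probability 1."""
    cats = set(pcfg) | {c for rhs in pcfg.values() for bc in rhs for c in bc}
    base = {a: (0.0 if pcfg.get(a) else 1.0) for a in cats}
    bl = {(d, a): base[a] for d in range(1, D + 2) for a in cats}
    br = {(d, a): base[a] for d in range(1, D + 1) for a in cats}
    for a in cats:
        bl[(D + 1, a)] = 0.0                   # no memory element beyond depth D
    for _ in range(K):                         # trees of increasing length k
        delta, nbl, nbr = 0.0, dict(bl), dict(br)
        for d in range(1, D + 1):
            for a in cats:
                if not pcfg.get(a):
                    continue                   # preterminal: stays at 1.0
                nbl[(d, a)] = sum(p * bl[(d, b)] * br[(d, c)]
                                  for (b, c), p in pcfg[a].items())      # Eq. 13
                nbr[(d, a)] = sum(p * bl[(d + 1, b)] * br[(d, c)]
                                  for (b, c), p in pcfg[a].items())      # Eq. 14
                delta = max(delta, abs(nbl[(d, a)] - bl[(d, a)]),
                            abs(nbr[(d, a)] - br[(d, a)]))
        bl, br = nbl, nbr
        if delta < tol:
            break
    return bl, br
```

The expected left-progeny frequencies of Equations 17-18 can be accumulated with the same pattern, summing products of the depth-bounded left-child rules over increasing spine lengths k.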
A complete HHMM can now be defined using depth-bounded right-corner PCFG probabilities. HHMM probabilities will be defined over syntactic states consisting of incomplete constituent categories $A_\eta/A_{\eta\cdot\mu}$.

Expansions depend only on the incomplete constituent category $../A_\eta$ (for any active category '..') at $q^{d-1}_t$:

$$P_{\Theta_{Q\text{-}Ex,d}}(a_{\eta\cdot0\cdot\mu} \mid ../A_\eta) \stackrel{\text{def}}{=}
\frac{\displaystyle\sum_{A_{\eta\cdot0},A_{\eta\cdot1}} P_{\Theta_{G\text{-}R,d-1}}(A_\eta \to A_{\eta\cdot0}\ A_{\eta\cdot1}) \cdot \mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_{\eta\cdot0} \Rightarrow^{*} a_{\eta\cdot0\cdot\mu}\,\ldots)}
{\displaystyle\sum_{A_{\eta\cdot0},A_{\eta\cdot1},a_{\eta\cdot0\cdot\mu}} P_{\Theta_{G\text{-}R,d-1}}(A_\eta \to A_{\eta\cdot0}\ A_{\eta\cdot1}) \cdot \mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_{\eta\cdot0} \Rightarrow^{*} a_{\eta\cdot0\cdot\mu}\,\ldots)} \quad (19)$$

Transitions depend on whether an 'active' or 'awaited' transition was performed at the current level. If an active transition was performed (where $f^d_t = 1$), the transition depends only on the incomplete constituent category $A_{\eta\cdot0\cdot\mu\cdot0}/..$ (for any awaited category '..') at $q^d_{t-1}$, and the incomplete constituent category $../A_\eta$ (for any active category '..') at $q^{d-1}_{t-1}$:

$$P_{\Theta_{Q\text{-}Tr,d}}(A_{\eta\cdot0\cdot\mu}/A_{\eta\cdot0\cdot\mu\cdot1} \mid 1,\ A_{\eta\cdot0\cdot\mu\cdot0}/..,\ ../A_\eta) =
\frac{\displaystyle\sum_{A_{\eta\cdot0},A_{\eta\cdot1}} P_{\Theta_{G\text{-}R,d-1}}(A_\eta \to A_{\eta\cdot0}\ A_{\eta\cdot1}) \cdot \frac{\mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_{\eta\cdot0} \Rightarrow^{*} A_{\eta\cdot0\cdot\mu}\,\ldots)}{\mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_{\eta\cdot0} \Rightarrow^{*} A_{\eta\cdot0\cdot\mu\cdot0}\,\ldots) - \mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_{\eta\cdot0} \Rightarrow^{0} A_{\eta\cdot0\cdot\mu\cdot0}\,\ldots)} \cdot P_{\Theta_{G\text{-}L,d}}(A_{\eta\cdot0\cdot\mu} \to A_{\eta\cdot0\cdot\mu\cdot0}\ A_{\eta\cdot0\cdot\mu\cdot1})}
{\displaystyle\sum_{A_{\eta\cdot0},A_{\eta\cdot1},A_{\eta\cdot0\cdot\mu},A_{\eta\cdot0\cdot\mu\cdot1}} P_{\Theta_{G\text{-}R,d-1}}(A_\eta \to A_{\eta\cdot0}\ A_{\eta\cdot1}) \cdot \frac{\mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_{\eta\cdot0} \Rightarrow^{*} A_{\eta\cdot0\cdot\mu}\,\ldots)}{\mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_{\eta\cdot0} \Rightarrow^{*} A_{\eta\cdot0\cdot\mu\cdot0}\,\ldots) - \mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_{\eta\cdot0} \Rightarrow^{0} A_{\eta\cdot0\cdot\mu\cdot0}\,\ldots)} \cdot P_{\Theta_{G\text{-}L,d}}(A_{\eta\cdot0\cdot\mu} \to A_{\eta\cdot0\cdot\mu\cdot0}\ A_{\eta\cdot0\cdot\mu\cdot1})} \quad (20)$$

If an awaited transition was performed (where $f^d_t = 0$), the transition depends only on the complete constituent category $A_{\eta\cdot\mu\cdot0}$ at $f^{d+1}_t$, and the incomplete constituent category $A_\eta/A_{\eta\cdot\mu}$ at $q^d_{t-1}$:

$$P_{\Theta_{Q\text{-}Tr,d}}(A_\eta/A_{\eta\cdot\mu\cdot1} \mid 0,\ A_{\eta\cdot\mu\cdot0},\ A_\eta/A_{\eta\cdot\mu}) =
\frac{P_{\Theta_{G\text{-}R,d}}(A_{\eta\cdot\mu} \to A_{\eta\cdot\mu\cdot0}\ A_{\eta\cdot\mu\cdot1})}
{\displaystyle\sum_{A_{\eta\cdot\mu\cdot1}} P_{\Theta_{G\text{-}R,d}}(A_{\eta\cdot\mu} \to A_{\eta\cdot\mu\cdot0}\ A_{\eta\cdot\mu\cdot1})} \quad (21)$$

Reduce probabilities depend on the complete constituent category at $f^{d+1}_t$, the incomplete constituent category $A_{\eta\cdot0\cdot\mu}/..$ (for any awaited category '..') at $q^d_{t-1}$, and the incomplete constituent category $../A_\eta$ (for any active category '..') at $q^{d-1}_{t-1}$. If the complete constituent category at $f^{d+1}_t$ does not match the awaited category of $q^d_{t-1}$, the probability is $[f^d_t = 0]$. If the complete constituent category at $f^{d+1}_t$ does match the awaited category of $q^d_{t-1}$:

$$P_{\Theta_{F\text{-}Rd,d}}(1 \mid A_{\eta\cdot0\cdot\mu}/..,\ ../A_\eta) =
\frac{\displaystyle\sum_{A_{\eta\cdot0},A_{\eta\cdot1}} P_{\Theta_{G\text{-}R,d-1}}(A_\eta \to A_{\eta\cdot0}\ A_{\eta\cdot1}) \cdot \big(\mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_{\eta\cdot0} \Rightarrow^{*} A_{\eta\cdot0\cdot\mu}\,\ldots) - \mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_{\eta\cdot0} \Rightarrow^{0} A_{\eta\cdot0\cdot\mu}\,\ldots)\big)}
{\displaystyle\sum_{A_{\eta\cdot0},A_{\eta\cdot1}} P_{\Theta_{G\text{-}R,d-1}}(A_\eta \to A_{\eta\cdot0}\ A_{\eta\cdot1}) \cdot \mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_{\eta\cdot0} \Rightarrow^{*} A_{\eta\cdot0\cdot\mu}\,\ldots)} \quad (22)$$

and:

$$P_{\Theta_{F\text{-}Rd,d}}(A_{\eta\cdot0\cdot\mu} \mid A_{\eta\cdot0\cdot\mu}/..,\ ../A_\eta) =
\frac{\displaystyle\sum_{A_{\eta\cdot0},A_{\eta\cdot1}} P_{\Theta_{G\text{-}R,d-1}}(A_\eta \to A_{\eta\cdot0}\ A_{\eta\cdot1}) \cdot \mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_{\eta\cdot0} \Rightarrow^{0} A_{\eta\cdot0\cdot\mu}\,\ldots)}
{\displaystyle\sum_{A_{\eta\cdot0},A_{\eta\cdot1}} P_{\Theta_{G\text{-}R,d-1}}(A_\eta \to A_{\eta\cdot0}\ A_{\eta\cdot1}) \cdot \mathcal{F}_{\Theta_{G\text{-}L*,d}}(A_{\eta\cdot0} \Rightarrow^{*} A_{\eta\cdot0\cdot\mu}\,\ldots)} \quad (23)$$

The correctness of the above distributions can be demonstrated by the fact that all terms other than $\Theta_{G\text{-}L,d}$ and $\Theta_{G\text{-}R,d}$ probabilities will cancel out in any sequence of transitions between an expansion and a reduction, leaving only those terms that would appear as factors in an ordinary PCFG parse.[5]

[5] It is important to note, however, that these probabilities are not necessarily incrementally balanced, so this correctness only applies to parsing with an infinite beam.

5 Results

A PCFG model was extracted from sections 2-21 of the Wall Street Journal Treebank. In order to keep the transform process manageable, punctuation was removed from the corpus, and rules occurring less frequently than 10 times in the corpus were deleted from the PCFG. The right-corner and bounded-stack transforms described in the previous section were then applied to the PCFG. The original and bounded PCFG models were evaluated in a CKY recognizer on sections 22-24 of the Treebank, with results shown in Table 1.[6] Results were significant only for sentences longer than 40 words. On these sentences, the bounded PCFG model achieves about a .15% reduction of error over the original PCFG (p < .1 using a one-tailed pairwise t-test). This suggests that on long sentences the probability mass wasted due to parsing with an unbounded stack is substantial enough to impact parsing accuracy.

  model                   F (sect 22-24, len > 40)
  unbounded PCFG          66.03
  bounded PCFG (D=4)      66.08

Table 1: Results of CKY parsing using bounded and unbounded PCFG.

[6] A CKY recognizer was used in both cases in order to avoid introducing errors due to model approximation or beam limits necessary for incremental processing with large grammars.

6 Conclusion

Previous work has explored bounded-stack parsing using a right-corner transform defined on trees to minimize stack usage. HHMM parsers trained on applications of this tree-based transform to training corpora have shown improvements over ordinary PCFG models, but this may have been attributable to the richer dependencies of the HHMM. This paper has presented an approximate inference algorithm for transforming entire PCFGs, rather than individual trees, into equivalent right-corner bounded-stack HHMMs. Moreover, a comparison with an untransformed PCFG model suggests that the probability mass wasted due to parsing with an unbounded stack is substantial enough to impact parsing accuracy.

Acknowledgments

This research was supported by NSF CAREER award 0447685 and by NASA under award NNX08AC36A. The views expressed are not necessarily endorsed by the sponsors.

References

Steven P. Abney and Mark Johnson. 1991. Memory requirements and local ambiguities of parsing strategies. Journal of Psycholinguistic Research, 20(3):233-250.

Richard Bellman. 1957. Dynamic Programming. Princeton University Press, Princeton, NJ.

Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL '97).

Nelson Cowan. 2001. The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24:87-185.

Edward Gibson. 1991. A computational theory of human linguistic processing: Memory limitations and processing breakdown. Ph.D. thesis, Carnegie Mellon University.

James Henderson. 2004. Lookahead in deterministic left-corner parsing. In Proc. Workshop on Incremental Parsing: Bringing Engineering and Cognition Together, Barcelona, Spain.

Mark Johnson. 1998. Finite state approximation of constraint-based grammars using left-corner grammar transforms. In Proceedings of COLING/ACL, pages 619-623.

Tim Miller and William Schuler. 2008. A syntactic time-series model for parsing fluent and disfluent speech. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING '08).

George A. Miller. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63:81-97.

Kevin P. Murphy and Mark A. Paskin. 2001. Linear time inference in hierarchical HMMs. In Proc. NIPS, pages 833-840.

Brian Roark. 2001. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2):249-276.

William Schuler, Samir AbdelRahman, Tim Miller, and Lane Schwartz. 2008. Toward a psycholinguistically-motivated model of language. In Proceedings of COLING, Manchester, UK, August.

Mark Steedman. 2000. The Syntactic Process. MIT Press/Bradford Books, Cambridge, MA.