Clavius: Bi-Directional Parsing for Generic Multimodal Interaction Frank Rudzicz Centre for Intelligent Machines McGill University ´ Montreal, Canada frudzi@cim.mcgill.ca Abstract We introduce a new multi-threaded parsing algorithm on unification grammars designed specifically for multimodal interaction and noisy environments. By lifting some traditional constraints, namely those related to the ordering of constituents, we overcome several difficulties of other systems in this domain. We also present several criteria used in this model to constrain the search process using dynamically loadable scoring functions. Some early analyses of our implementation are discussed. C L AV I U S provides a flexible, and trainable new bi-directional parsing algorithm on multidimensional input spaces, and produces modalityindependent semantic interpretation with a low computational cost. 1 Introduction Figure 1: The target immersive environment. 1.1 Graphical Models and Unification Since the seminal work of Bolt (Bolt, 1980), the methods applied to multimodal interaction (MMI) have diverged towards unreconcilable approaches retrofitted to models not specifically amenable to the problem. For example, the representational differences between neural networks, decision trees, and finite-state machines (Johnston and Bangalore, 2000) have limited the adoption of the results using these models, and the typical reliance on the use of whole unimodal sentences defeats one of the main advantages of MMI - the ability to constrain the search using cross-modal information as early as possible. C L AV I U S is the result of an effort to combine sensing technologies for several modality types, speech and video-tracked gestures chief among them, within the immersive virtual environment (Boussemart, 2004) shown in Figure 1. Its purpose is to comprehend multimodal phrases such as "put this here .", for pointing gestures , in either command-based or dialogue interaction. 85 Unification grammars on typed directed acyclic graphs have been explored previously in MMI, but typically extend existing mechanisms not designed for multi-dimensional input. For example, both (Holzapfel et al., 2004) and (Johnston, 1998) essentially adapt Earley's chart parser by representing edges as sets of references to terminal input elements - unifying these as new edges are added to the agenda. In practice this has led to systems that analyze every possible subset of the input resulting in a combinatorial explosion that balloons further when considering the complexities of cross-sentential phenomena such as anaphora, and the effects of noise and uncertainty on speech and gesture tracking. We will later show the extent to which C L AV I U S reduces the size of the search space. Proceedings of the COLING/ACL 2006 Student Research Workshop, pages 85­90, Sydney, July 2006. c 2006 Association for Computational Linguistics Directed graphs conveniently represent both syntactic and semantic structure, and all partial parses in C L AV I U S , including terminallevel input, are represented graphically. Few restrictions apply, except that arcs labelled C AT and T I M E must exist to represent the grammar category and time spanned by the parse, respectively1 . Similarly, all grammar rules, i : LH S - RH S1 RH S2 ... RH Sr , are graphical structures, as exemplified in Figure 2. 2002), we must allow for all possible permutations hf multi-dimensional input - as in "put o this ere." vs. "put this here .", for example. We therefore take the unconvential approach of placing no mandatory ordering constraints on constituents, hence the rule abc : A - B C parses the input " C B". We show how we can easily maintain regular temporal ordering in §3.5. 1.2.3 Partial Qualification Whereas existing bi-directional chart parsers maintain fully-qualified edges by incrementally adding adjacent input words to the agenda, C L AV I U S has the ability to construct parses that instantiate only a subset of their constituents, so abc also parses the input "B", for example. Repercussions are discussed in §3.4 and §4. 2 Figure 2: 1 : O B J E C T R E F E R E N C E - N P click {where(N P :: f1 ) = (click :: f1 )}, with N P expanded by 2 : N P - D T N N. 1.2 Multimodal Bi-Directional Parsing Our parsing strategy combines bottom-up and top-down approaches, but differs from other approaches to bi-directional chart parsing (Rocio, 1998) in several key respects, discussed below. 1.2.1 Asynchronous Collaborating Threads A defining characteristic of our approach is that edges are selected asynchronously by two concurrent processing threads, rather than serially in a two-stage process. In this way, we can distribute processing across multiple machines, or dynamically alter the priorities given to each thread. Generally, this allows for a more dynamic process where no thread can dominate the other. In typical bi-directional chart parsing the top-down component is only activated when the bottom-up component has no more legal expansions (Ageno, 2000). 1.2.2 Unordered Constituents Alhough evidence suggests that deictic gestures overlap or follow corresponding spoken pronomials 85-93% of the time (Kettebekov et al, Usually this timespan corresponds to the real-time occurrence of a speech or gestural event, but the actual semantics are left to the application designer 1 The Algorithm C L AV I U S expands parses according to a best-first process where newly expanded edges are ordered according to trainable criteria of multimodal language, as discussed in §3. Figure 3 shows a component breakdown of C L AV I U S 's software architecture. The sections that follow explain the flow of information through this system from sensory input to semantic interpretation. Figure 3: Simplified information flow between fundamental software components. 2.1 Lexica and Preprocessing Each unique input modality is asynchronously monitored by one of T T R AC K E R S, each sending an n-best list of lexical hypotheses to C L AV I U S for any activity as soon as it is detected. For example, a gesture tracker (see Figure 4a) parametrizes the gestures preparation, stroke/point, and retraction (McNeill, 1992), with values reflecting spatial positions and velocities of arm motion, whereas 86 our speech tracker parametrises words with partof-speech tags, and prior probabilities (see Figure 4b). Although preprocessing is reduced to the identification of lexical tokens, this is more involved than simple lexicon lookup due to the modelling of complex signals. with the all-paths bottom-up strategy in GEMINI (Dowding et al, 1993) that finds all admissable edges of the grammar. Algorithm 1: Simplified Generalisation Data: Subspace [G] , grammar while data remains in [G] do g := highest scoring graph in [G] foreach rule i s.t. Cat (g ) RHS(i ) do i := Unify (i , [· RHS · g ]) if i then Apply Score (i ) to i Insert i into [G] Move g into [S Act] Figure 4: Gestural (a) and spoken (b) `words'. 2.2 Data Structures All T R AC K E R S write their hypotheses directly to the first of three S U B S PAC E S that partition all partial parses in the search space. The first is the G E N E R A L I S E R's subspace, [G] , which is monitored by the G E N E R A L I S E R thread the first part of the parser. All new parses are first written to [G] before being moved to the S P E C I FI E R's active and inactive subspaces, [S Act] , and [S I nact] , respectively. Subspaces are optimised for common operations by organising parses by their scores and grammatical categories into depth-balanced search trees having the heap property. The best partial parse in each subspace can therefore be found in O(1) amortised time. 2.4 2.3 Generalisation The G E N E R A L I S E R monitors the best partial parse, g , in [G] , and creates new parses i for all grammar rules i having C AT E G O RY(g ) on the right-hand side. Effectively, these new parses are instantiations of the relevant i , with one constituent unified to g . This provides the impetus towards sentence-level parses, as simplified in Algorithm 1 and exemplified in Figure 5. Naturally, if rule i has more than one constituent (c > 1) of type C AT E G O RY(g ), then c new parses are created, each with one of these being instantiated. Since the G E N E R A L I S E R is activated as soon as input is added to [G] , the process is interactive (Tomita, 1985), and therefore incorporates the associated benefits of efficiency. This is contrasted 87 Figure 5: Example of G E N E R A L I S AT I O N. Specification The S P E C I FI E R thread provides the impetus towards complete coverage of the input, as simplified in Algorithm 2 (see Figure 6). It combines parses in its subspaces that have the same top-level grammar expansion but different instantiated constituents. The resulting parse merges the semantics of the two original graphs only if unification succeeds, providing a hard constraint against the combination of incongruous information. The result, , of specification must be written to [G] , otherwise could never appear on the RHS of another partial parse. We show how associated vulnerabilities are overcome in §3.2 and §3.4. Specification is commutative and will always provide more information than its constituent graphs if it does not fail, unlike the `overlay' method of S M A RT KO M (Alexandersson and Becker, 2001), which basically provides a subsumption mechanism over background knowledge. Algorithm 2: Simplified Specification Data: Subspaces [S Act] and [S I nact] while data remains in [S Act] do s := highest scoring graph in [S Act] j := highest scoring graph in [S I nact] s.t. Cat (j ) = Cat (s ) while j do i := Unify (s , j ) if i then Apply Score (i ) to i Insert i into [G] j := next highest scoring graph from [S I nact] s.t. Cat (j ) = Cat (s ) ; // Optionally stop after I iterations, for some I though only the latter can modify it. 3 Applying Domain-Centric Knowledge Upon being created, all partial parses are assigned a score approximating its likelihood of being part of an accepted multimodal sentence. The score i|S | i i (), of partial parse , S C O R E() = =0 is a weighted linear combination of independent scoring modules (K N OW L E D G E S O U R C E S). Each module presents a score function i : [0..1] according to a unique criterion of multimodal language, weighted by i , also on [0..1] . Some modules provide `hard constraints` that can outright forbid unification, returning i = - in those cases. A subset of the criteria we have explored are outlined below. 3.1 Temporal Alignment (1 ) Move s into [S I nact] By modelling the timespans of parses as Gaussians, where µ and are determined by the midpoint and 1 the distance between the two 2 endpoints, respectively - we can promote parses whose constituents are closely related in time with the symmetric Kullback-Leibler divergence, 2 2 2 2 (1 -2 )2 +((µ1 -µ2 )(1 +2 ))2 DK L (1 , 2 ) = . 22 4 1 2 Therefore, 1 promotes more locally-structured parses, and co-occuring multimodal utterances. 3.2 Ancestry Constraint (2 ) A consequence of accepting n-best lexical hypotheses for each word is that we risk unifying parses that include two competing hypotheses. For example, if our speech T R AC K E R produces hypotheses "horse" and "house" for ambiguous input, then 2 explicitly prohibits the parse "the horse and the house" with flags on lexical content. Figure 6: Example of S P E C I FI C AT I O N. 2.5 Cognition The C O G N I T I O N thread monitors the best sentence-level hypothesis, B , in [S I nact] , and terminates the search process once B has remained unchallenged by new competing parses for some period of time. Once found, C O G N I T I O N communicates B to the A P P L I C AT I O N. Both C O G N I T I O N and the A P P L I C AT I O N read state information from the MySQL W O R L D database, as discussed in §3.5, 88 3.3 Probabilistic Grammars (3 ) We emphasise more common grammatical constructions by augmenting each grammar rule with an associated probability, P (i ), and assigning 3 () = P (RU L E()) · 3 (c ) where RU L E is the c =constituent of top-level expansion of . Probabilities are trainable by maximum likelihood estimation on annotated data. Within the context of C L AV I U S , 3 promotes the processing of new input words and shallower parse trees. 3.4 Information Content (4 ), Coverage (5 ) The 4 module partially orders parses by preferring those that maximise the joint entropy between the semantic variables of its constituent parses. Furthermore, we use a shifted sigmoid 2 - 1, to promote parses 5 () = - 2 N U M W O R D S I N() 1+e 5 that maximise the number of `words' in a parse. These two modules together are vital in choosing fully specified sentences. 3.5 Functional Constraints (6 ) Each grammar rule i can include constraint functions f : [0,1] parametrised by values in instantiated graphs. For example, the function T F O L L OW S (1 , 2 ) returns 1 if constituent 2 follows 1 in time, and - otherwise, thus maintaining ordering constraints. Functions are dynamically loaded and executed during scoring. Since functions are embedded directly within parse graphs, their return values can be directly incorporated into those parses, allowing us to utilise data in the W O R L D. For example, the function O B J E C TAT(x, y , &o) determines if an object exists at point (x, y ), as determined by a pointing gesture, and writes the type of this object, o, to the graph, which can later further constrain the search. Figure 7 shows sentence-level precision achieved for each i on each of the four tasks, where precision is defined as the proportion of correctly executed sentences. These are compared against the CMU Sphinx-4 speech recogniser using the unimodal projection of the multimodal grammar. Here, conjunctive phrases such as "Put a sphere here and colour it yellow" are classified according to their first clause. Presently, correlating the coverage and probabilistic grammar constraints with higher weights ( > 30%) appears to provide the best results. Creation and colouring tasks appeared to suffer most due to missing or misunderstood head-noun modifiers (ie., object colour). In these examples, C L AV I U S ranged from a -51.7% to a 62.5% relative error reduction rate over all tasks. Config 1 2 3 1 0.4 0.2 0.1 2 0.0 0.0 0.0 () 3 0.3 0.1 0.3 4 0.1 0.3 0.3 5 0.1 0.2 0.15 6 0.1 0.2 0.15 Table 1: Three weight configurations. 4 Early Results We have constructed a simple blocks-world experiment where a user can move, colour, create, and delete geometric objects using speech and pointing gestures with 74 grammar rules, 25 grammatical categories, and a 43-word vocabulary. Ten users were recorded interacting with this system, for a combined total of 2.5 hours of speech and gesture data, and 2304 multimodal utterances. Our randomised data collection mechanism was designed to equitably explore the four command types. Test subjects were given no indication as to the types of phrases we expected - but were rather shown a collection of objects and were asked to replicate it, given the four basic types of actions. Several aspects of the parser have been tested at this stage and are summarised below. 4.1 Accuracy Table 1 shows three hand-tuned configurations of the module weights i , with 2 = 0.0, since 2 provides a `hard constraint' (§3.2). 89 Figure 7: Precision across the test tasks. 4.2 Work Expenditure To test whether the best-first approach compensates for C L AV I U S ' looser constraints (§1.2), a simple bottom-up multichart parser (§1.1) was constructed and the average number of edges it produces on sentences of varying length was measured. Figure 8 compares this against the average number of edges produced by C L AV I U S on the same data. In particular, although C L AV I U S generally finds the parse it will accept relatively quickly (`C L AV I U S - found'), the C O G N I T I O N module will delay its acceptance (`C L AV I U S - accepted') for a time. Further tuning will hopefully reduce this `waiting period'. References Ageno, A., Rodriguez, H. 2000 Extending Bidirectional Chart Parsing with a Stochastic Model, in Proc. of TSD 2000, Brno, Czech Republic. Alexandersson, J. and Becker, T. 2001 Overlay as the Basic Operation for Discourse Processing in a Multimodal Dialogue System in Proc. of the 2nd IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Seattle, WA. Bolt, R.A. 1980 "Put-that-there": Voice and gesture at the graphics interface in Proc. of SIGGRAPH 80 Number of edges expanded, given ACM Press, New York, NY. Boussemart, Y., Rioux, F., Rudzicz, F., Wozniewski, M., Cooperstock, J. 2004 A Framework for 3D Visualisation and Manipulation in an Immersive Space using an Untethered Bimanual Gestural Interface in Proc. of VRST 2004 ACM Press, Hong Kong. Dowding, J. et al. 1993 Gemini: A Natural Language System For Spoken-Language Understanding in Meeting of the ACL, ACL, Morristown, NJ. Holzapfel, H., Nickel, K., Stiefelhagen, R. 2004 Implementation and evaluation of a constraintbased multimodal fusion system for speech and 3D pointing gestures, in ICMI '04: Proc. of the 6th intl. conference on Multimodal interfaces, ACM Press, New York, NY. Johnston, M. 1998 Unification-based multimodal parsing, in Proc. of the 36th annual meeting of the ACL, ACL, Morristown, NJ. Johnston, M., Bangalore, S. 2000 Finite-state multimodal parsing and understanding in Proc. of the 18th conference on Computational linguistics ACL, Morristown, NJ. Kettebekov, S., et al. 2002 Prosody Based Coanalysis of Deictic Gestures and Speech in Weather Narration Broadcast, in Workshop on Multimodal Resources and Multimodal System Evaluation. (LREC 2002), Las Palmas, Spain. McNeill, D. 1992 Hand and mind: What gestures reveal about thought University of Chicago Press and CSLI Publications, Chicago, IL. Rocio, V., Lopes, J.G. 1998 Partial Parsing, Deduction and Tabling in TAPD 98 Tomita, M. 1985 An Efficient Context-Free Parsing Algorithm for Natural Languages, in Proc. Ninth Intl. Joint Conf. on Artificial Intelligence, Los Angeles, CA. Figure 8: sentence length. 5 Remarks C L AV I U S consistently ignores over 92% of dysfluencies (eg. "uh") and significant noise events in tracking, apparently as a result of the partial qualifications discussed in §1.2.3, which is especially relevant in noisy environments. Early unquantified observation also suggests that a result of unordered constituents is that parses incorporating lead words - head nouns, command verbs and pointing gestures in particular - are emphasised and form sentence-level parses early, and are later `filled in' with function words. 5.1 Ongoing Work There are at least four avenues open to exploration in the near future. First, applying the parser to directed two-party dialogue will explore contextsensitivity and a more complex grammar. Second, the architecture lends itself to further parallelism - specifically by permitting P > 1 concurrent processing units to dynamically decide whether to employ the G E N E R A L I S E R or S P E C I FI E R, based on the sizes of shared active subspaces. We are also currently working on scoring modules that incorporate language modelling (with discriminative training), and prosody-based co-analysis. Finally, we have already begun work on automatic methods to train scoring parameters, including the distribution of i , and modulespecific training. 6 Acknowledgements Funding has been provided by la bourse de ´´ maitrisse of the fonds quebecois de la recherche sur la nature et les technologies. 90