WWW 2007 / Track: XML and Web Data Session: Querying and Transforming XML Visibly Pushdown Automata for Streaming XML Viraj Kumar P. Madhusudan Mahesh Viswanathan University of Illinois at Urbana-Champaign Urbana, IL, USA {kumar, madhu, vmahesh}@cs.uiuc.edu ABSTRACT We prop ose the study of visibly pushdown automata (Vpa) for processing XML documents. Vpas are pushdown automata where the input determines the stack op eration, and XML documents are naturally visibly pushdown with the Vpa pushing onto the stack on op en-tags and p opping the stack on close-tags. In this pap er we demonstrate the p ower and ease visibly pushdown automata give in the design of streaming algorithms for XML documents. We study the problems of type-checking streaming XML documents against SDTD schemas, and the problem of typing tags in a streaming XML document according to an SDTD schema. For the latter problem, we consider b oth pre-order typing and p ost-order typing of a document, which dynamically determines typ es at op en-tags and close-tags resp ectively as soon as they are met. We also generalize the problems of pre-order and p ost-order typing to prefix querying. We show that a deterministic Vpa yields an algorithm to the problem of answering in one pass the set of all answers to any query that has the prop erty that a node satisfying the query is determined solely by the prefix leading to the node. All the streaming algorithms we develop in this pap er are based on the construction of deterministic Vpas, and hence, for any fixed problem, the algorithms process each element of the input in constant time, and use space O(d), where d is the depth of the document. 1. INTRODUCTION The eXtensible Markup Language (XML) has b ecome the standard for data exchange, particularly on the web [26, 1]. An XML document is a linear representation of hierarchically structured data. The latter is b est viewed as a tree, with the edges of the tree encoding the hierarchy. Hierarchically structured data can b e encoded into linear structures in different ways; XML is the p opular way, where the tree is represented using op en-tags and close-tags, and the paranthesis structure defined by the tags encodes the hierarchical information. Naturally, the study of XML has concentrated on the tree it represents; for example, document typ es are represented using the parse-trees generated by sp ecialized DTDs [22], XML query languages have modalities like "parent" and "child" that refer to the tree edges, and tree automata (over unranked trees) have b een used to model and solve decision problems for XML [25, 20, 9, 11, 16]. An XML document is a linear word structure, and though formalisms based on the tree representation of the document are useful, they are awkward when designing algorithms for processing XML. While one could take an XML document, convert it into a tree, and then apply algorithms for the tree, there are several settings where such changes of representation is not feasible. One class of applications for which building the tree representation of a document is infeasible is in the processing of streams of large XML documents. Streaming algorithms process input data as and when they arrive (for example, streaming stock market data), and have to process them with as little time and space as p ossible. Building trees representing the document would make little sense. Not surprisingly, streaming algorithms prop osed for XML do not build the tree [7, 13], and in fact theoretical algorithms for streaming XML are often based on pushdown automata (not tree automata) [13, 24, 11]. However, there are no standard automata models for streaming XML; pushdown automata are not app ealing as they do not define a robust tractable class of languages: they are not closed under complement, inclusion is undecidable, etc. The main thesis of this pap er is that visibly pushdown automata (Vpa [4]) are the right model for processing streaming XML. A Vpa is a pushdown automaton whose stack op erations are determined by the input letter it reads. An XML document is b est seen as a nested word: a linear structure (word) with a nesting relation formed by associating op en-tags with their matching close-tags. A visibly pushdown automaton can reconstruct the nesting relation by pushing onto the stack on op en-tags and p opping from it on Categories and Subject Descriptors H.2.1 [Information Systems]: Logical Design--schema and subschema ; H.2.3 [Information Systems]: Lanuages-- data description languages (DDL), query languages ; F.1.1 [Theory of Computation]: Models of Computation--automata General Terms Algorithms, Theory Keywords XML, streaming algorithms, schema, typing, query, pushdown automata Supp orted by NSF CCF 04-29639 Supp orted by NSF CCF 04-48178 Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 8­12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005. 1053 WWW 2007 / Track: XML and Web Data close-tags. Visibly pushdown automata, unlike pushdown automata, define a robust class of languages. The class is closed under all b oolean op erations, admits decidable procedures for problems such as inclusion and emptiness, and we can show that it is precisely as p owerful as regular tree languages accepting the tree representation of the data [4]. The most imp ortant feature is that these automata are determinizable, which will help us in building streaming algorithms. Furthermore, recent work on visibly pushdown languages have exp osed several interesting results including a congruence based characterization and minimization results that use a modular notion of these automata. These results affirm the view that this class is very similar in tractability and robustness to that of regular word languages. In this pap er, we argue that visibly pushdown automata are an apt model for XML by studying algorithms for streaming XML documents. We demonstrate the use of VPAs in solving two main problems: (a) type-checking streaming XML documents against SDTDs (sp ecialized data-typ e definitions) [22], and of assigning typ es to tags while streaming the document, and (b) in querying prefix -based queries on XML streams, where the problem is to anwer the set of all p ositions satisfying queries that are determined by the prefix leading to a p osition. An SDTD (which extends the notion of DTDs) defines a class of XML documents using an extended context-free grammar. The first result in this pap er is a visibly pushdown automaton model tailored for XML: we define a notion of modular Vpa called XVPA (X for XML!) that exactly corresp onds to schema defined by SDTDs. This gives a natural machine characterization of SDTDs. By applying results in the literature on visibly pushdown automata to XVPAs, in particular (a) determinability of VPAs [4], (b) the expressive p ower of deterministic single-entry modular VPAs [3], and (c) congruence-based minimization results for complete multi-entry modular VPAs [14], we derive (minimal) streaming machines. We first study the problem of type-checking streaming XML documents against SDTDs. Using the fact that Vpas can b e determinized to get single-entry modular Vpas (Sevpas), we show that these automata typ e-check streaming XML documents against SDTD-defined typ es. Further, by combining the set of single-entry modules into a single module with multiple entries, Sevpas can b e interpreted as complete multi-entry modular Vpas, for which we have recently obtained a minimization result [14]. Unlike the pushdown machines constructed in [24], we obtain provably minimal recognizers that optimize the numb er of states in two ways: they combine strucural conditions shared b oth b etween sp ecializations of the same tag, and across different tags. We then turn to the problem of typing a document. For a document that conforms to an SDTD, the typ es associated with a tag are the sp ecializations that admit a parsing of the document with resp ect to the SDTD. Typing a document facilitates querying (for example, a query asking for all tags that have a particular sp ecialization can b e answered using such a typing). We study the problem of assigning typ es to tags while reading a streaming XML document. A tag is said to b e pre-order typ ed if its typ e is unique and can b e assigned as soon as meeting its op en-tag; similarly, a tag is said to b e p ost-order typ ed if its unique typ e is determined at the close-tag. An SDTD is said to b e pre-order typ ed if al l its tags are pre-order typ ed in al l documents Session: Querying and Transforming XML that accord with the SDTD. Pre-order typ ed SDTDs corresp ond to restrained comp etition grammars and have b een syntactically and semantically characterized [18, 17]. We show automata-theoretic characterizations of pre- and p ostorder typ ed schemas 1 . The characterization of pre-order typ ed schemas is b eautifully simple in terms of XVPAs: a schema is pre-order typ ed if and only if it can b e accepted by a deterministic XVPA. This is a very natural and intuitive characterization­ when reading an op en-tag, the XVPA must b e able to determine the typ e of the tag and hence call the appropriate module deterministically. We then turn to dynamical ly typing streaming documents. Even though a schema may not b e pre-order typ ed, some or all of its tags may b e pre-order typ ed in a particular document. We show that for any SDTD, we can build a deterministic VPA that reads streaming XML documents, and dynamically pre-order typ es the op en-tags whenever p ossible. More precisely, if in the current document read, the typ e of a tag is determined by the prefix of the document read thus far, then the automaton will determine its typ e when reading the op en-tag. We also prove a similar result for partially p ost-order typing documents at close-tags for streaming documents. We also show that checking if a tag is pre-order typ ed in an SDTD (across all documents) is decidable in p olynomial time. Finally, we generalize the results on pre-order and p ostorder typing to prefix querying. We consider monadic queries expressed in a sub-logic of monadic second order logic (MSO), which we call Pre-MSO. Queries here are MSO formulas with one free variable, where all quantified p ositions are required to occur b efore the p osition denoted by the free variable. We show that for such queries, determining if a p osition in a document satisfies a query dep ends solely on the tags seen b efore the p osition. Furthermore, for any such query we can build a deterministic VPA that can answer such queries while processing an XML document in a streaming fashion. Since the problems of dynamically typing (pre-order and p ost-order) can b e expressed in our logic Pre-MSO, our observations on pre-order and p ost-order typing can b e seen as a consequence of these results. However, our direct constructions for the pre-order and p ost-order typing are more efficient in terms of the resources used by the resulting streaming algorithm. There has b een a previous characterization of queries that can b e answered by a 1-pass streaming algorithm (see [19]) in terms of constraint systems and pushdown forest automata; our results in terms of Pre-MSO and VPAs can b e seen a reformulation of the Neumann-Seidl result in terms of logic and pushdown machines on words. In summary, VPAs emerge as a simple apt model for processing XML, particularly in the design of streaming algorithms. The rich set of results obtained for visibly pushdown languages, including determinization and minimization of modular machines, find immediate use in designing streaming algorithms for XML. We have already several other results which we have not rep orted here (including results on typability, improved complexity of static typ e-checking, etc.) that we have proved using the VPA model; these and proofs of the theorems in this pap er can b e found in the technical rep ort [15]. Related work in automata theory: Viewing XML doc1 By "schemas", we mean a collection of documents defined by an SDTD. We do not mean the language "XML Schema". 1054 WWW 2007 / Track: XML and Web Data uments as trees, visibly pushdown automata corresp ond to automata that process trees by reading them on an infix traversal, using a stack to push whenever they go down a left branch, and p opping it when they return to process the right branch. The notion of visibly pushdown automata have b een used implicitly in processing XML streams in the literature (in [24, 9, 19, 10] and as XPush machines in [11]). Note that there is only one copy of the automaton processing the tree (in contrast to tree automata). An alternate model defined in the literature is tree-walking automata [2, 21], which essentially has only one copy of the automaton but can walk up and down the input tree. However these automata do not have access to a stack and are strictly weaker than tree automata in expressive p ower [6]. In contrast, visibly pushdown automata capture the entire class of regular tree languages [4]. Session: Querying and Transforming XML to words over (M M ) in the obvious way: µ(m) = µ(m) and µ(m.w) = µ(m).µ(w). Definition 1 (SDTD). An SDTD over (, M , µ) is a tuple (d, m0 ), where d is a mapping from M to regular expressions over M , and m0 is the start symb ol. The language of an SDTD (d, m0 ) is defined as Ld (m0 ). The size of an SDTD (d, m0 ) is defined as the sum of the lengths of the regular expressions defined by d. Example 1. Let = {movie , vhs , dvd , title , lang , subtitle }, and let the specialized alphabet be M = {Movie , VHS , DVD , Title , Lang , Unisub , Multsub } where µ : M is defined so that Unisub and Multsub are specializations of subtitle , and capitalized names are specializations of their uncapitalized counterparts. Final ly, let d be defined as fol lows: d(Movie ) d(VHS ) d(DVD ) d(Title ) d(Unisub ) d(Multsub ) d(Lang ) = = = = = = = VHS + DVD Title .Unisub Title .(Unisub + Multsub ) Lang Lang .Lang + 2. PRELIMINARIES Let b e a fixed finite alphab et of "op en tags", and let = {c | c } b e the corresp onding alphab et of "close tags". Let = ( ). XML documents will b e treated as words over the input alphab et . A wel l-matched word is any word generated by the grammar: W cW c, W W W , W , where we have a rule W cW c for every c . The grammars and automata we consider in this pap er will always accept only languages of well-matched words. The set of all well-matched words over will b e denoted by WM ( ). A word u is said to have matched closing tags if there is some w WM ( ) such that u is a prefix of w. The set of all words with matched returns (closing tags) will b e denoted by MR( ). Document-type definitions (DTDs) define restrictions on the structure of documents. These definitions hence describ e languages over , that corresp ond to the documents that accord with the document typ e. Since DTDs are not p owerful enough, we use in this pap er an extended version of DTDs called specialized document-type definitions (SDTDs) [22]. We give pushdown automata based descriptions of these languages, which are a sub class of context-free languages. An SDTD is essentially a context-free grammar; however, each non-terminal of the SDTD is associated with a tag, and the idea is that every non-terminal implicitly generates the op en-tag (c) and close-tag (c) whenever the non-terminal is expanded. Non-terminals are called "sp ecializations" in the XML context, and we will also call them "modules" in this pap er. Let M b e a finite set of specializations or modules and let µ : M b e a surjective mapping. Elements of µ-1 (c) corresp ond to sp ecializations of symb ol c, and µ is the mapping b etween the sp ecialized alphab et M and the unsp ecialized alphab et . SDTDs will b e sp ecified using extended context-free grammars. An extended context-free grammar d is a set of rules that map each m M to a regular expression over M . We now define its semantics. Let us denote M = {m | m M }. For every m M , we define Lspl (m) (M M ) as the d smallest sets such that Then (d, Movie ) is an SDTD over (, M , µ) accepting words like: movie.dvd.title.title .subtitle.lang.lang .lang.lang .subtitle .dvd .movie Type-checking and typing XML documents The two problems we consider in this pap er are typ e-checking streaming XML documents, and pre- and p ost-order typing of XML documents. We define these notions now. The typ e-checking problem is to check, given an SDTD, whether an input document b elongs to the language of the SDTD. In the streaming context, we assume that the document is presented as a word, and the typ e-checking must b e accomplished reading the word only once, left to right, and using as little time and space as p ossible. Given an SDTD, and a document that accords to the typ e defined by the SDTD, a tag c in the document can get different typ es dep ending on how the document parses. More precisely, if w b elongs to the language of an SDTD (d, m0 ), and w[i] = c , then we say that (w, i) has typ e m (where m is a sp ecialization), if there exists x Lspl (m0 ) d such that µ(x) = w and x[i] = m. In general, (w, i) can have many typ es with resp ect to an SDTD. An occurrence of an op en (or close) tag in a document is prefix typ ed (with resp ect to a schema) if the tag's typ e is determined by the prefix of the document till that p oint. Pre-order typing refers to op en tags b eing prefix typ ed while p ost-order typing refers to close tags b eing prefix typ ed. Lspl (m) {m.x.m | x Rd (w), w d(m)} d s where: (1) Rd () = {}, and (2) Rd (m.w) = Ldpl (m).Rd (w). The language over unsp ecialized symb ols that m defines is then Ld (m) = {µ(w) | w Lspl (m)}, where µ is extended d Definition 2. For an open or close tag a , we say a is prefix typ ed at position i in w WM ( ), if w = uav , |ua| = i, and if there is a unique m µ-1 (a) such that for al l v , whenever uav Ld (m0 ) and (uav , |ua|) has type m , then m = m. In this case, we say that a has prefix-type m at position i in w. A tag c is pre-order typ ed (post-order typed) in an SDTD (d, m0 ) if for every document w Ld (m0 ) and every 1055 WWW 2007 / Track: XML and Web Data position i |w| such that w[i] = c (resp. w[i] = c), c (resp. c) is prefix-typed at position i in w. An SDTD (d, m0 ) is pre-order typ ed (post-order typed) if every tag is pre-order typed (resp. post-order typed). Said in other words, a p osition i is prefix-typ ed in w if a streaming algorithm that reads w can determine the (unique) typ e of (w, i) at p osition i. We study three problems related to typing: · Automata characterization: which kind of Vpas capture SDTDs that are pre-order and p ost-order typ ed. · Dynamic pre-order typing: Can we build an automaton that streams input and determines the typ e of op en-tags (or close-tags) as soon as it meets them, provided the typ e has b een determined by the prefix read till that p oint. · Given an SDTD and a tag c, can we effectively decide if in al l documents conforming to the SDTD, the typ e of every occurrence of the op en-tag (or close-tag) corresp onding to c is prefix-typ ed? Session: Querying and Transforming XML We extend the definition of - A to words over in the natural manner. The language L(A) accepted by Vpa A is w the set of words w such that (q0 , ) - A (q , ) for F some q Q . A language L is called a visibly pushdown language (Vpl) if there some Vpa A such that L = L(A). We now review some of the basic prop erties of visibly pushdown automata and their languages which we will app eal to later in the pap er. To b egin with, Vpls define a robust class of languages closed under b oolean op erations. a Proposition 1. [4] If L1 and L2 are Vpls over a common alphabet (, ), then L1 L2 , L1 L2 and L1 are also Vpls with respect to (, ). In addition, Vpls have a logical characterization using the monadic second order theory over words augmented with a binary matching predicate , denoted MSO . Proposition 2. [4] A language L over (, ) is a Vpl iff there is an MSO sentence over (, ) that defines L. Unlike pushdown automata, Vpas can b e determinized without any loss in expressive p ower. We state this formally: Proposition 3. [4] For any n-state Vpa M over (, ), 2 there is a deterministic Vpa M over (, ) with O(2n ) n2 states and with stack alphabet of size O(2 · ||) such that L(M ) = L(M ). Finally, the inclusion problem (which is undecidable for pushdown automata) is decidable for Vpas. Proposition 4. [4] Given Vpas A1 and A2 , the inclusion problem L(A1 ) L(A2 ) is decidable in Exptime. Furthermore, if A2 is deterministic, then the problem is decidable in Ptime. Visibly pushdown automata The languages defined by SDTDs do not encompass all contextfree languages. Viewing an SDTD as a normal grammar, a rule in the SDTD of the form ma mb .mc , where µ(ma ) = a, µ(mb ) = b and µ(mc ) = c, translates to a rule ma a.mb .mc .a. In other words, each non-terminal is "guarded" by a tag that occurs b efore and after the expansion of the non-terminal. The usual translation of this grammar into a pushdown automaton will result in a machine that pushes at the op en-tags and p ops at the close-tags. Visibly pushdown automata are precisely these kind of restricted machines [4]. Since Vpas were first motivated in the program analysis context, the symb ols on which the automaton pushes and p ops are called cal ls and returns, instead of op en-tags and closetags. Definition 3 (VPA). A visibly pushdown automaton ( Vpa) over (, ) is a tuple A = (Q, q0 , QF , , ), where Q is a finite set of states, q0 Q is the initial state, QF Q is the set of final states, is a finite stack alphabet, and = call ret is the transition relation, where: · call ((Q × ) × (Q × )); · ret ((Q × × ) × Q). We denote a transition (q , c, q , ) call as q - q , - and a transition (q , c, , q ) ret as q - q . A transition - q - q is a push-transition, where the automaton read- ing c changes state from q to q , pushing onto the stack. - Similarly, a transition q - q is a p op-transition, where on reading c with on the top of the stack, the automaton p ops off the stack and changes state from q to q . A configuration of a Vpa A is a pair (q , s) Q × ( .), where is a sp ecial b ottom-of-stack symb ol ( ). If / a a , we say that (q1 , s1 ) - A (q2 , s2 ) if and only if one of the following conditions are true: c/ c/ c/ c/ Modular visibly pushdown automata We now introduce modular Vpas. Given an SDTD (or in fact, any context-free grammar), one can associate a machine (module) with every non-terminal, which essentially checks whether the word it reads b elongs to a derivation of the non-terminal. While doing so, such a module may need to expand other non-terminals, and can do so by "calling" the modules corresp onding to these non-terminals. This intuition will lead us to capturing SDTDs precisely using a modular notion of Vpas. We have studied modular Vpas in an earlier pap er [3], where we were motivated by their natural use in program modeling: the various modules corresp ond to procedures of a program which can call each other. We first define a notion of a modular Vpa, and then define restricted versions of it called Xvpa (XML Vpa) and Sevpa (single-entry Vpa). The former will turn out to b e the exact machine analog of SDTDs, while the latter has b een studied earlier by us, and will help in proving various constructions in this pap er. Modular Vpas have states partitioned into modules, demand that the symb ol pushed onto the stack is always the current state, and ensure that if a module calls another, then up on return the control returns to a state of the calling module. Definition 4 (Modular VPA (µ-VPA) ; SEVPA). Fix , M and µ : M . A modular Vpa (or µ-Vpa) over (, M , µ) is a tuple A = ({(Qm , em , m )}mM , m0 , F ), where for each m M , · a = c , s2 = .s1 and (q1 , c, q2 , ) · a = c , s1 = .s2 and (q1 , c, , q2 ) call ret , or . 1056 WWW 2007 / Track: XML and Web Data · Qm is a finite set of states of module m 2 3 Session: Querying and Transforming XML Intuitively, the language of well-matched words that will take the automaton from pn to qn is the language defined by the module m with final states Xm (pn , qn ). Definition 6 (XVPA). An Xvpa over (, M , µ) is a tuple A = ({(Qm , em , Xm , m )}mM , m0 , F ) such that ({(Qm , em , m )}mM , m0 , F ) is a µ-Vpa over (, M , µ), and the fol lowing single-exit property holds: for every m M , Xm is the unique exit of module m An Xvpa is intuitively a µ-Vpa where each module defines only one language, no matter which state it is called from. Recall that we require µ-Vpas to b e deterministic on returns; hence single-exit ensures that if we call a module m from state pn , then exiting from m we end in a unique state qn no matter which state in m we exit from. Example 2. An Xvpa that defines the same language as the SDTD in Example 1 is given in Figure 1, and for example, {xmultlang } is the unique exit for module "multlang ". · em is a distinguished entry state of module m ca re · m = m ll mt , where ca m ll {qm - - en | n µ-1 (c)} - c/qm c / pn re mt {qm - - qn | n µ-1 (c)} and is determinis- - - tic, i.e. qn = qn whenever qm - - qn and qm - - qn c / pn c / pn · m0 M is a distinguished start module · F Qm0 is the set of final states. A single-entry Vpa ( Sevpa) over (, M , µ) is a µ-Vpa such that the mapping µ : M is a bijection. Note that we have defined µ-Vpas such that the return transitions are always deterministic; this is for technical conveca nience. A µ-Vpa A is said to b e deterministic if m ll is deterministic as well, i.e. if (qm , c, en , qm ), (qm , c, en , qm ) m , then en = en . An Sevpa is a µ-Vpa that has exactly one module for each tag. Semantics. The semantics of a µ-Vpa is defined by its corresp onding Vpa. Let A = ({(Qm , em , m )}mM , m0 , F ) b e a µ-Vpa over (, M , µ). Then A = (Q, q0 , {qf }, , ) is the Vpa over (, ) where Q = {q0 , qf } ( mM Qm ), = Q and Equivalence of SDTD and XVPA We now show that Xvpas are an appropriate automaton model for SDTDs, by demonstrating their equivalence. More sp ecifically, we will show that for every SDTD there exists an equivalent Xvpa where the modules of the Xvpa exactly corresp ond to the sp ecializations in the SDTD, and vice versa. Intuitively, given an SDTD, we can construct an XVPA in which each module corresp onds to a sp ecialization in the SDTD, and the module checks whether a word b elongs to the language of the sp ecialization. Theorem 1. For every SDTD (d, m0 ) over (, M , µ), there is an Xvpa A over (, M , µ) such that L(A) = Ld (m0 ). Furthermore, for every Xvpa A over (, M , µ), there is an SDTD (d, m0 ) over (, M , µ) such that L(d, m0 ) = L(A). Proof. Let (d, m0 ) b e an SDTD over (, M , µ). For every m M , let Dm b e the deterministic (but p erhaps incomplete ) finite automaton (DFA) obtained from the minimized automaton for the regular expression d(m) after discarding all "dead" states, i.e. states from which no final states can b e reached. Sp ecifically, let Dm = (Qm , em , QF , m ), where m · Qm is a finite set of states, em Qm is the initial state · QF Qm is the set of final states m · m is a (partial) function from Qm × M to Qm · for every q Qm , x M such that m (q , x) QF m (no dead states) Let A = ({(Qm , em , Xm , m )}mM , m0 , F ) b e the XVPA, where Ë =( m --- m ) {q0 - - - em0 } {q - - - qf | q F } --- µ(m0 )/q0 µ(m0 )/q0 M We define L(A), the language accepted by µ-Vpa A, as L(A ). Note that a µ-Vpa always accepts well-matched words which are of the form µ(m0 )wµ(m0 ). A µ-Vpa module m intuitively accepts many well-matched word languages. For example, a single module can distinguish two languages L1 and L2 , and use return edges ("exits") to convey the difference to the calling module. A sp ecialization (non-terminal) of an SDTD, however, is more restricted: it can b e utilized to capture only one language. We will define Xvpas as essentially automata that have this restriction, and the concept of exit b elow will b e useful. Definition 5 (Exit). Let A = ({(Qm , em , m )}mM , m0 , F ) be a µ-Vpa over (, M , µ). A non-empty set Xm Qm is a (pn , qn )-exit for module m in A if qm . qm - - qn iff - qm Xm . In other words, a non-empty set Xm is a (pn , qn )exit if it is the exact set of states of module m from which popping pn from the stack leads to state qn . A non-empty set Xm is an exit of module m in A if there exist states pn , qn such that Xm is a (pn , qn )-exit. Remark 1. Note that for every pn , qn , m, there is at most one (pn , qn )-exit for module m in a µ-Vpa A. If such an exit exists, we denote it by Xm (pn , qn ). Unless otherwise sp ecified, m, n, etc. will range over mod ules, and pm , qm , qm , etc. will range over states in Qm . 3 Modular Vpas can b e defined in a more general way where every module can have multiple entries e1 , . . . ek . In this m m pap er, we will largely focus on modular Vpas that have only a single entry p er module. 2 c / pn · Xm = QF for every m M , F = QF 0 m m · for every m M , m = m m , where: call ret m = {(qm , µ(n), en , qm ) | m (qm , n) = pm } ret F m = {(q , µ(m), qn , pn ) | n (qn , m) = pn and q Qm } Note that m is deterministic, since n is a (partial) funcret tion. Also, by definition of m , Xm is the unique exit of module m. Hence, A is indeed an Xvpa. ret call 1057 WWW 2007 / Track: XML and Web Data Session: Querying and Transforming XML emovie lang /emovie lang /emovie lang /emovie lang /emovie eunilang xunilang xmovie title /emultlang xmultlang lang /xmultlang emultlang lang /xmultlang title /eunilang etitle title /eunilang ealtlang title /emultlang Figure 1: XVPA for the given SDTD Further, for every well-matched word w (M M ) , it is easy to show by induction on the length of w, that µ(w) mwm Lspl (m) if and only if (em , ) - - A (q , ) for - d some q Xm . Since Xm0 = F , it follows that L(A) = Ld (m0 ). Conversely, let A = ({(Qm , em , Xm , m )}mM , m0 , F ) b e an Xvpa. Define an SDTD (d, m0 ) where, for every m M , d(m) is the regular expression corresp onding to the DFA F Dm = (Qm , em , Qm , m ), where QF = Xm and m is defined m as follows: for every qm Qm and n M , if (q , µ(n), qm , pm ) m for some q Xn , then m (qm , n) = pm . Note that this is well-defined b ecause A has the single-exit prop erty. Once again, for every well-matched word w (M M ) , it is easy to show by induction on the length of w, that mwm Lspl (m) if and only if (em , ) - - A (q , ) for - d some q Xm . Hence, L(A) = Ld (m0 ). Remark 2. We wil l use the translation from SDTDs to XVPAs in the sequel. For this purpose we wil l assume that the XVPA is trimmed: for every state q in any module m, we wil l assume that there is a wel l-matched word w that leads from q to some state in Xm . If this wasn't the case, we can remove q and the transitions incident on q without changing the language of the XVPA. Now observe that in such an XVPA, for every run on uc MR ( ) there exists a v such that the run can be extended on v such that it is accepting (showing ucv L(A)). This property wil l simplify many of our constructions. µ(w) are DFAs over , then an Exptime upp er b ound for deciding whether L(A1 ) WM ( ) L(A2 ) WM ( ) was obtained by a reduction to the inclusion problem for tree automata. It was left as an op en problem whether this b ound could b e improved. By Prop osition 4, however, we immediately have a Ptime upp er b ound for this problem, even when A1 and A2 are deterministic Vpas: Theorem 2. If A1 and A2 are DFAs (or even deterministic Vpas) over (, ), then checking if L(A1 ) WM ( ) L(A2 ) WM ( ) is decidable in Ptime. 3. TYPE-CHECKING In this section, we present several results (some of which are new) p ertaining to SDTDs, that follow immediately from existing results for Vpas. We b egin with results in the streaming context, where one is required to determine whether or not an XML document b elongs to the language defined by an SDTD. Streaming Deterministic automata working on strings are the most natural model for processing streams of data. Since Vpls always have deterministic acceptors [4], Vpas are a convenient abstraction for studying streaming problems. In this section we look at some known results for Vpas and examine their consequences for typ e checking XML documents. The following Prop osition shows that deterministic Sevpas capture the class of SDTDs. Proposition 5. [3] For every SDTD (d, m0 ) over (, M , µ), there is a deterministic Sevpa A over (, , id) 4 such that L(A) = L(d, m0 ). The ab ove prop osition is true since an SDTD is always captured by an Xvpa, and results in [3] show that any language accepted by a Vpa can b e accepted by a deterministic Sevpa. Hence, deterministic Sevpas give a streaming algorithm to recognize SDTDs. The streaming algorithm simply simulates the deterministic Sevpa on the input and checks if it 4 In Section 4, we will prove that deterministic Xvpas corresp ond exactly to pre-order typ ed SDTDs. Hence, by Theorem 1, we have: Remark 3. If the regular expressions in SDTD1 and SDTD2 are given as deterministic finite automata (DFA), and SDTD2 is pre-order typed, then the sub-typing problem SDTD1 SDTD2 is decidable in Ptime. Note that a similar result was obtained in [17], for the case when both SDTD1 and SDTD2 are pre-order typ ed. The sub-typing problem was also studied in [24], in the context of streaming algorithms for validating XML documents using a finite amount of memory, under the assumption that the XML document is well-matched. If A1 and A2 id denotes the identity function 1058 WWW 2007 / Track: XML and Web Data is accepting. Note that the algorithm will use only constant space in addition to a stack of symb ols over a fixed alphab et, where the stack height is b ounded by the depth of the nesting of tags in the document (which is typically not very large when compared to the size of the document). Such an algorithm was already observed in [24] using pushdown automata; in fact the pushdown automaton they construct is essentially an Sevpa! By combining the set of single-entry modules into a common module, any such Sevpa can b e reinterpreted as a complete multi-entry µ-Vpa. We can now app eal to the following Prop osition to prove that there is a minimal streaming automaton recognizing a given SDTD. Proposition 6. [14] For every complete multi-entry µVpa A, there is a unique (up to isomorphism) minimumstate deterministic multi-entry µ-Vpa A such that L(A ) = L(A). Hence, the minimization construction outlined in [14] gives a way to minimize the space used by a streaming typ echecking algorithm that is based on pushdown automata. Thus, unlike the result in [24], we are able to construct a provably minimal streaming recognizer for an SDTD. Session: Querying and Transforming XML 4.2 Dynamic pre-order typing We now study the problem of dynamic pre-order typing. Recall that this problem asks for an automaton that streams input and determines the typ e of op en-tags as soon as it meets them, provided the typ e has b een determined by the prefix read till that p oint. Consider the XVPA A = ({(Qm , em , Xm , m )}mM , m0 , F ) corresp onding to (d, m0 ) as defined in Theorem 1. This XVPA is deterministic on return transitions but, in general, is non-deterministic on call symb ols. We construct a deterministic automaton A that simulates the b ehaviour of A on all applicable sp ecializations of call symb ols. After reading prefix u, if there is only one applicable sp ecialization m of call symb ol c, then A outputs m as the uniquely determined prefix-typ e of c at p osition |uc|. For every c , let Qc = {q Qm | µ(m) = c}. Let A = (Q, q0 , Q, ) b e a VPA without final states, where Q = {P Qc | P = , c }, q0 = {em0 } and is defined as follows: - P - P , where P = {em | µ(m) = c and p P. p - A em } = - P - - P , where - P = {p | p P, p P . p - - A p } = - Further, A outputs m on transition P - P if and only - if P = {em } for some m µ-1 (c). We claim that for every op en tag c that has prefix-typ e m at p osition |uc| in |ucv |, the Vpa A outputs m after reading the input uc. First observe that by construction of A and by Remark 2, if uc q0 - A P and |P | > 1, then there are two distinct m, m µ-1 (c) and strings v , v such that ucv , ucv Ld (m0 ) and (ucv , |uc|) has typ e m whereas (ucv , |uc|) has typ e m . Hence, c is not prefix-typ ed at |uc| and the automaton A does not output anything on reading c after input u. However, if P = {em }, then by construction of A, for every v such that ucv Ld (m0 ), (ucv , |uc|) has typ e m. Therefore, c has prefix-typ e m at p osition |uc| in ucv , and A in fact outputs m on reading uc. c/P c/p c/P c/P c/p 4. PRE-ORDER TYPING In this section, we study the class of XML schema that are pre-order typ ed, give an automata theoretic characterization of them, develop a streaming algorithm that determines the typ es of op en-tags whenever p ossible, and also show that checking whether a tag can b e pre-order typ ed is solvable in p olynomial time. 4.1 Automata Theoretic Characterization Pre-order typing has a very simple characterization in terms of the structure of Xvpa defining the XML schema. An SDTD is pre-order typ ed exactly when its Xvpa translation is deterministic: Lemma 1. An SDTD (d, m0 ) is pre-order typed if and only if there is a deterministic XVPA A over the same modules such that L(A) = Ld (m0 ). Proof. Let (d, m0 ) over (, M , µ) b e a pre-order typ ed SDTD and let A = ({(Qm , em , Xm , m )}mM , m0 , F ) b e the corresp onding Xvpa as defined in Theorem 1. Recall that by re definition of Xvpa mt is deterministic for each m. Consider some state q of A and let u b e a string such that A has some computation on u that ends up in state q . Now since (d, m0 ) is pre-order typ ed, there is a unique sp ecialization m associated with c in the string uc. Thus, the state q must ca have only the transition to em on the symb ol c. Hence, m ll is deterministic for every module m. Conversely, supp ose A = ({(Qm , em , Xm , m )}mM , m0 , F ) is a deterministic Xvpa and let (d, m0 ) b e the SDTD corresp onding to it (as defined in Theorem 1). Consider a string u, and let q b e the state reached by A on the string u. Since A is deterministic, for any c there is at most one m such that q has a transition to em on symb ol c. Thus, m is the unique sp ecialization associated with c in the string uc. Hence, (d, m0 ) is pre-order typ ed. Theorem 3. For any SDTD, we can effectively construct an algorithm that dynamical ly pre-order types the tags in a streaming XML document. Further, this algorithm uses only space O(s.d), where s is the size of the SDTD and d is the depth of the document. 4.3 Checking partial pre-order typing of schema We now consider the following problem: Given an SDTD (d, m0 ), which op en tags are pre-order typ ed in every document defined by (d, m0 )? In the procedure for converting an SDTD into an XVPA defined in Theorem 1, we chose a deterministic finite state automaton Dm corresp onding to every regular expression d(m). This results in an automaton that may b e exp onential in the size of the SDTD (d, m0 ). Instead, for every m, let Dm b e a non-deterministic finite state automaton (one that is linear in the size of d(m) can b e constructed efficiently). We therefore obtain a non-deterministic VPA A = (Q, q0 , F, Q, ) such that L(A) = Ld (m0 ), and further, |Q| = O(n), where n is the size of (d, m0 ). To efficiently determine whether an op en tag c is pre-order typ ed or not, we search for a witness to the fact that c is not pre-order typ ed at some p osition. Such a witness consists of words 1059 WWW 2007 / Track: XML and Web Data ucv, ucv Ld (m0 ) such that (ucv , |uc|) has typ e m whereas (ucv , |uc|) has typ e m for some m = m . Searching for such witnesses reduces to p erforming reachability in a VPA A without final states defined as follows: A = (Q , q0 , Q × Q, ) where Q = Q × Q × ( {}), q0 = (q0 , q0 , ) and is defined as follows: T1 : (q , q , ) - - - A (p, p , ) - - and q - A p - --- T2 : (q , q , ) - - - A (p, p , ) and q - - A p - c/q c/q2 c/(q1 ,q2 ) c/q c / (q , q ) Session: Querying and Transforming XML 5.2 Checking partial post-order typing of schema We now consider the problem: Given an SDTD (d, m0 ), which op en tags are p ost-order typ ed in every document defined by (d, m0 )? Once again we can express this as a reachability problem on an appropriately defined Vpa A . Our construction is identical to the construction defined in Section 4.3 except that the condition T3 in the defintion of the transition function is replaced with: --- T3 : (q , q , ) - - - A (p, p , c) if q - - A p, q - - A p - - and q Qm , q Qm with m = m Similar to Lemma 2, we can show that a tag c is p ost-order typ ed if and only if no state (p, p , c) is reachable. Hence: Theorem 5. Given an SDTD (of size n), the problem of checking whether c is post-order typed in the SDTD is solvable in time polynomial in n. c/q1 c/q2 c/(q1 ,q2 ) if q - A p - if q - - A p - if q - A p, - c/q c/q1 c/q T3 : (q , q , ) - - - A (p, p , c) - - q - A p and p = p - c / (q , q ) Lemma 2. For every open tag c , c is pre-order typed in (d, m0 ) if and only if no state (p, p , c) is reachable from the initial state in A . Intuitively, a state of the form (p, p , c) is reachable in A whenever there are states q , q reachable on some common input u in A, and there are distinct sp ecializations m, m of c such that module m (resp. m ) can b e called from state q (resp. q ). Then by Remark 2, there are strings v , v such that ucv , ucv witness the fact that c not pre-order typable at p osition |uc|. Since A has size O(n2 ) and reachability in a Vpa can b e determined in cubic time, we have: Theorem 4. Given an SDTD (of size n), the problem of checking whether c is pre-order typed in the SDTD is solvable in time polynomial in n. 6. PREFIX QUERYING The problem of pre-order (resp ectively p ost-order) typing can b e viewed abstractly as answering which op en-tags (resp ectively close-tags) satisfy the prop erty of b eing uniquely parsed as a particular sp ecialized tag in an SDTD, provided the prop erty of whether a tag has this prop erty is determinable by reading the prefix of the document up till the op en-tag (resp ectively close-tag). In this section, we generalize this idea, by defining a generic class of queries (formalized using monadic second-order logic) that have the prop erty that a p osition in a document satisfying the query is solely determined by the prefix of the document till that p oint. This class of queries is indep endent of the document-typ e definitions used to describ e input documents, and pre-order and p ost-order typing are simply sp ecial instances of this class of queries. The monadic second-order logic structures that we consider will b e over the nested structure [5] defined by a document. More precisely, given any document w , let us lab eled structure ([1, |w|], view the word as a linear , {Qa }a , ), where there are |w| elements corresp onding to each letter in w, is the linear order on this set of p ositions, each Qa is a unary predicate that is true on exactly the p ositions lab eled a (i.e. Qa = {i | w[i] = a), and is the matching binary relation that associates each op en-tag with the corresp onding close-tag (i.e. (x, y ) holds iff w[x] , w[y ] , and w[x] . . . w[y ] is well-matched). Monadic second-order logic can b e now defined as the canonical logic over this structure with interpreted relations , Qa 's and . Since we are interested in prop erties that are determined by the prefix in a document, we define a restricted version of MSO, called Pre-MSO, which allows only quantification over p ositions that occur b efore the query p osition. Formally, fix a first-order variable x. Then the set of all Pre-MSO(x) formulas (x) (with x b eing the only free variable) is defined as: 5. POST-ORDER TYPING In this section, we give an automata theoretic characterization of the class of XML schema that can b e p ostorder typ ed. Given a p ost-order typ ed SDTD, consider the Xvpa corresp onding to it. After reading a word u, if the Xvpa meets an op en-tag c, then c need not b e prefix-typ ed at p osition |uc|. Hence, the Xvpa may call several modules m1 , . . . , mk . However, since the SDTD is p ost-order typ ed, the typ e of c will get determined at the closing tag c, i.e. at the time of exit from these modules. Hence it is clear that the languages accepted by these modules must b e disjoint. More formally, we say that an Xvpa A has disjoint cal ls if for all states q in A and op en tags c , if c c q - A em and q - A em for distinct m, m µ-1 (c), then Ld (m) Ld (m ) = . We therefore have: Lemma 3. An SDTD (d, m0 ) is post-order typed if and only if there is an Xvpa A with disjoint cal ls over the same modules such that L(A) = Ld (m0 ). 5.1 Dynamic post-order typing We now turn to the problem of dynamical ly p ost-order typing a streaming XML document. Our construction of the deterministic Vpa A = (Q, q0 , Q, ) is as in Section 4.2 with the only modification that A outputs m on a return transition P - - P if and only if P Qm for some - m µ-1 (c). Using a similar argument as b efore, we can show that on any input, the ab ove Vpa outputs the typ e of every closing tag c at p osition |uc| whenever this can b e uniquely determined by reading the prefix u. Hence an algorithm analogous to that rep orted in Theorem dynpre follows. c/P ::= y X | Qi (y ) | y y | (y , y ) | | ¬ | z (z x ) | X () where y , y , z F V , z = x, X S V . 1060 WWW 2007 / Track: XML and Web Data The logic Pre-MSO is a restriction that forces any query written in the logic to dep end only on the prefix of the word till the particular p oint of query. Formally, for any word w , let Answer s(w, ) b e the set of of all p ositions i, 1 i w, such that (x) holds in w when x is interpreted to b e i. Session: Querying and Transforming XML Theorem 7. For every query captured by a marking VPA, there is an equivalent Pre-MSO formula (x) that defines the same query. As a corollary to Theorem 6 we have: Corollary 1. For any Pre-MSO formula (x), there is a streaming algorithm (which is effectively constructible) that processes documents and outputs the set of al l answers to the query defined by (x). Moreover, this algorithm processes each letter of the document in constant time, and utilizes space at most O(d), where d is the depth of the document. The algorithm simply simulates the deterministic VPA constructed in Theorem 6; processing a letter can b e done in constant time (for a fixed formula ) and the space required is to store the state and the stack, which is O(d). Sequential XPath is a restricted version of XPath that has the prop erty that a p osition satisfying a query is determined by the prefix till the node [8] (it however has further restrictions aimed at limiting buffering input). Sequential XPath has b een defined and studied precisely to facilitate one-pass querying with minimal buffering of input. Our result can b e seen as a generalization of this idea to the much larger class of MSO-definable queries. A similar precise characterization of queries that can b e answered by 1-pass streaming algorithms is provided in [19]. There the characterization is in terms of constraint systems and pushdown forest automata. Our results here can b e seen as a reformulation of those results in terms of logic and word automata. We conclude this section by observing that our results on dynamic pre-order and p ost-order typing can b e seen as sp ecial cases of Theorem 6. Proposition 8. For any SDTD (d, m0 ) and specialization m, there is a Pre-MSO formula Preorderm,m0 ) (x) such (d that for any document w, Answer s(w, Preorderm,m0 ) ) is ex(d actly the set of positions that can be pre-order typed with specialization m. Analogously, there is a Postorderm,m0 ) (x) (d that exactly describes the positions that can be post-order typed with specialization m. Proof. The construction of the VPA in Section 4.2, when restricted to only answering the p ositions getting typ e m, can b e seen as a VPA with output. Thus, the prop osition follows from Theorem 7. The proof for p ost-order typing is similar. Prop osition 8 can b e used obtain a streaming algorithm for dynamic pre-order (and p ost-order) typing. The streaming algorithm simulates the deterministic marking VPA from Theorem 6 for Preorderm,m0 ) (x) (or Postorderm,m0 ) (x)) for (d (d each sp ecialization m simultaneously on the document w, and outputs m whenever the marking VPA for m enters a marking state. While the results of Section 4.2 can thus b e seen as a corollary to the results presented in this section, the algorithm describ ed in the previous paragraph is likely to b e more inefficient when compared to that presented in Section 4.2. Thus, in practice it will b e more useful to use the direct construction presented earlier for determining typ es. Proposition 7. Let (x) be any Pre-MSO(x) formula and let i Answer s(w, ). Then, for any word w , |w | i, with w[1, i] = w [1, i], i Answer s(w , ) as wel l. Consequently the set of answers to a query (x) in a word w, which is the set of all p ositions i such that (x) is satisfied when x is interpreted to b e i, is determined when the prefix up till i has b een read. Hence, technically, we should b e able to output the answer p ositions i as soon as we read the i'th letter in a document. We show that indeed this is true, and there is a deterministic visibly pushdown automaton that can p erform this task. From this we obtain a streaming algorithm that uses space only linear in the depth of a document, and can output the set of al l answers to any Pre-MSO(x) query. Let us first define VPA with output. Definition 7 (VPA with output). A marking visibly pushdown automaton ( Vpa) over (, ) is a tuple A = (Q, q0 , QF , , , M ), where (Q, q0 , QF , , ) is a VPA and M Q is a subset of marked states. We say such a marking VPA A working on a word w marks position i (1 i |w|) if there is some run of A on w which reaches a marked state in M just after reading the i'th letter in w. (Note that the set of final states play no role in this definition.) We can now show: Theorem 6. Let be a Pre-MSO(x) formula over . Then there is a deterministic marking VPA A that on any word w marks exactly the set of positions Answer s(w, ). Proof. Let (x) b e a Pre-MSO(x) formula. Now, consider the set PrefixLang (), which is the set of all words u, not necessarily well-matched, such that |u| Answer s(u, ), i.e. the set of all words u such that (x) holds in u when x is interpreted to b e the last letter of u. It can b e seen that PrefixLang () is expressible in (full) MSO. Using the corresp ondence b etween monadic second-order logic on nested structures and visibly pushdown automata [4, 5], it follows that the set P r ef ix - Lang () can b e (effectively) recognized by a VPA. Let B = (Q, q0 , QF , , ) b e an automaton accepting this language. Then the automaton which will answer the queries will b e A = (Q, q0 , QF , , , M ), where M = QF . Intuitively, when reading a word w and when at p osition i, the automaton B will mark i iff the prefix u uptil p osition i already satisfies , when i is interpreted for x. Since we know that w satisfies when x is interpreted to b e i iff u satisfies when x is interpreted to b e i, the correctness follows. We can also show the converse, namely that every marking VPA (deterministic or nondeterministic) defines a query that is expressible by a Pre-MSO formula (x). The proof of this relies on the translation from VPAs to MSO, alongwith observations similar to the ones made in the ab ove proof; we skip further details: 7. CONCLUSIONS We have shown that visibly pushdown automata is a convenient and p owerful model for studying problems for XML 1061 WWW 2007 / Track: XML and Web Data that intrinsically involve processing documents from left to right. Modular Vpas emerge as an elegant model to study problems such as streaming and typing XML documents, and constructions and algorithms based on modular Vpas are intuitive and simple, giving clean proofs and efficient algorithms. Modular Vpas are useful in the program verification context as well, and hence our results make an unusual connection b etween the two fields, which could b e mutually b eneficial. For instance, the congruence based characterizations develop ed in the verification setting have b een useful in the XML setting, and the unique minimal modular Vpa result presented herein has p otential uses in building minimal program models. There are several other problems that can b e addressed using the XVPA model. In particular, checking whether SDTDs can b e transformed into another which is pre-order (or p ost-order) typ ed is decidable and these results can b e proved using XVPAs (we refer the reader to the technical rep ort [15]). Pre-order typability of SDTDs seems related to LL[1] grammars [12] and is worth studying. Intuitively, an LL[1] grammar requires that the rule to b e applied is determined when each symb ol is read, which greatly resembles pre-order typing and deterministic Xvpas. The most interesting future direction we see is in defining querying automata using the Vpa model. Note that while we have studied prefix querying in this pap er, general querying (like general XPath queries) can select a node dep ending on prop erties of the document that lie in the future of the node. In work not rep orted here, we have extended the Vpa model to a query model where the automaton is p owerful enough to store fragments of the document, and hence answer all XPath/MSO queries. In b oth this and the work rep orted in this pap er, the ability to minimize Vpa seems to reflect the kind of optimizations in memory prop osed by other researchers [7, 11], and we b elieve that minimization of visibly pushdown automata will b e a formal and p erhaps more effective way to achieve space optimizations. Finally, we plan to also study transformations of XML documents in the streaming setting using VPAs (see also [23], where visibly pushdown expressions have b een used to study effect systems on streaming XML). Session: Querying and Transforming XML [7] Yi Chen, Susan B. Davidson, and Yifeng Zheng. An efficient XPath query processor for XML streams. In ICDE, page 79, 2006. [8] A. Desai. Introduction to sequential xpath. In Proc. of IDEAl liance XML Conference, 2001. [9] Todd J. Green, Gerome Miklau, Makoto Onizuka, and Dan Suciu. Processing XML streams with deterministic automata. In ICDT '03, pages 173­189, 2003. Springer-Verlag. [10] Martin Grohe and Nicole Schweikardt. Lower b ounds for sorting with few random accesses to external memory. In PODS '05, pages 238­249, 2005. ACM Press. [11] Ashish Kumar Gupta and Dan Suciu. Stream processing of XPath queries with predicates. In SIGMOD, pages 419­430. ACM Press, 2003. [12] J.E. Hop croft and J.D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 1979. [13] Christoph Koch and Stefanie Scherzinger. Attribute grammars for scalable query processing on XML streams. In DBPL, volume 2921 of LNCS, pages 233­256, 2003. [14] V. Kumar, P. Madhusudan, and M. Viswanathan. Minimization, learning, and conformance testing of b oolean programs. In CONCUR, 2006. to app ear. [15] V. Kumar, P. Madhusudan, and M. Viswanathan. Visibly pushdown languages for xml. Technical Rep ort UIUCDCS-R-2006-2704, UIUC, 2006. [16] B. Lud¨scher, P. Mukhopadhyay, and a Y. Papakonstantinou. A transducer-based XML query processor. In VLDB, pages 227­238, 2002. [17] Wim Martens, Frank Neven, and Thomas Schwentick. Which XML schemas admit 1-pass preorder typing? In ICDT, pages 68­82, 2005. [18] M. Murata, D. Lee, and M. Mani. Taxonomy of XML Schema Languages using Formal Language Theory. In Extreme Markup Languages, Montreal, Canada, 2001. [19] Andreas Neumann and Helmut Seidl. Locating matches of tree patterns in forests. In FSTTCS, pages 134­145, 1998. Springer-Verlag. [20] Frank Neven. Automata, logic, and XML. In CSL, pages 2­26, London, UK, 2002. Springer-Verlag. [21] Frank Neven and Thomas Schwentick. On the p ower of tree-walking automata. Information Computing, 183(1):86­103, 2003. [22] Y. Papakonstantinou and V. Vianu. DTD inference for views of XML data. In PODS, pages 35­46, New York, NY, USA, 2000. ACM Press. [23] Corin Pitcher. Visibly pushdown expression effects for XML stream processing. In Programming Language Technologies for XML, pages 1­14, 2005. [24] Luc Segoufin and Victor Vianu. Validating streaming XML documents. In PODS, pages 53­64, New York, NY, USA, 2002. ACM Press. [25] Victor Vianu. XML: From practice to theory. In SBBD, pages 11­25, 2003. [26] World Wide Web Consortium. Extended Markup Language (XML). http://www.w3.org. 8. REFERENCES [1] Serge Abiteb oul, Peter Buneman, and Dan Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 1999. [2] A. V. Aho and J. D. Ullman. Translations on a context free grammar. Information and Control, 19(19):439­475, 1971. [3] R. Alur, V. Kumar, P. Madhusudan, and M. Viswanathan. Congruences for visibly pushdown languages. In ICALP, pages 1102­1114, 2005. [4] R. Alur and P. Madhusudan. Visibly pushdown languages. In STOC, pages 202­211. ACM Press, 2004. [5] R. Alur and P. Madhusudan. Adding nesting structure to words. In DLT, LNCS 4036, pages 1­13, 2006. [6] Mikola j Bo janczyk and Thomas Colcomb et. Tree-walking automata do not recognize all regular languages. In STOC, pages 234­243, New York, NY, USA, 2005. ACM Press. 1062