Multiway SLCA-based Keyword Search in XML Data

Chong Sun, School of Computing, National University of Singapore, sunchong@soe.ucsc.edu
Chee-Yong Chan, School of Computing, National University of Singapore, chancy@comp.nus.edu.sg
Amit K. Goenka, School of Computing, National University of Singapore, amitkuma@comp.nus.edu.sg

ABSTRACT
Keyword search for smallest lowest common ancestors (SLCAs) in XML data has recently been proposed as a meaningful way to identify interesting data nodes in XML data whose subtrees contain an input set of keywords. In this paper, we generalize this useful search paradigm to support keyword search beyond the traditional AND semantics to include both AND and OR boolean operators. We first analyze properties of the LCA computation and propose improved algorithms to solve the traditional keyword search problem (with only AND semantics). We then extend our approach to handle general keyword search involving combinations of AND and OR boolean operators. The effectiveness of our new algorithms is demonstrated with a comprehensive experimental performance study.

Categories and Subject Descriptors
H.2.4 [Database Management]: Systems--query processing, textual databases

General Terms
Algorithms

Keywords
keyword search query, smallest lowest common ancestor, XML

1. INTRODUCTION
Keyword search is a convenient and widely-used approach to retrieve information from both unstructured and structured data [1, 4, 7, 10, 11]. Its appeal stems from the fact that keyword queries can be easily posed without using a query language or knowing the schema or structure of the data being searched.
For XML data, where the data is viewed as a hierarchically structured rooted tree, a natural keyword search semantics is to return all the nodes in the XML tree whose subtrees contain all the keywords. However, this simple search semantics can return too many data nodes, many of which are only remotely related to the nodes containing the keywords. A recent direction to improve the effectiveness of keyword search in XML data is based on the notion of smallest lowest common ancestor (SLCA) semantics [14]. A keyword search using the SLCA semantics returns nodes in the XML data that satisfy the following two conditions: (1) the subtrees rooted at the nodes contain all the keywords, and (2) the nodes do not have any proper descendant node that satisfies condition (1). The set of returned data nodes are referred to as the SLCAs of the keyword search query. Another recent work on keyword search, based on the meaningful LCA (MLCA) semantics, shares a similar principle with SLCA [12]. The following example illustrates the difference between SLCA-based keyword search and conventional LCA-based keyword search.

[Figure 1: Example XML Trees T1 and T2]

Example 1.1 Consider the XML tree T1 shown in Figure 1(a), where the keyword nodes are annotated with subscripts for ease of reference. Consider a keyword search using the keywords {a, b, c, d, e} on T1. If the search is based on the conventional LCA semantics, then the result is given by {x1, x2, x3, x4}, as x1 is the LCA of {a2, b2, c2, d2, e2}, x2 is the LCA of {a1, b1, c2, d1, e1}, x3 is the LCA of {a1, b1, c1, d2, e1}, and x4 is the LCA of {a1, b1, c1, d1, e1}. However, if the search is based on the SLCA semantics, then the result is given by {x4}. Observe that each of x1, x2, and x3 is not an SLCA because it has a descendant node x4 that is an SLCA.

The state-of-the-art algorithms for keyword search using SLCA semantics are the Scan Eager (SE) and Indexed Lookup Eager (ILE) algorithms [14], which were shown to be more efficient than stack-based algorithms [8, 12]. The ILE algorithm is the algorithm of choice when the keyword search involves at least one low-frequency keyword, while the SE algorithm performs better when the frequencies of the keywords in the query do not vary significantly. We classify both of these algorithms as the binary-SLCA approach (BS), as they are both based on the same principle of computing the SLCAs for a query with k keywords in terms of a sequence of k - 1 intermediate SLCA computations, where each SLCA computation takes a pair of data node lists as input and outputs another data node list. Specifically, consider a search query with k keywords w1, ..., wk. Let Si denote the list of XML data nodes that are labeled with keyword wi, i ∈ [1, k]; and let Li denote the SLCAs for a query with the first i keywords, i ∈ [1, k]. The binary-SLCA algorithms compute the SLCAs for w1, ..., wk by computing the sequence L2, L3, ..., Lk, where each Li is computed by finding the SLCAs of Li-1 and Si (with L1 = S1). An important observation exploited in the binary-SLCA algorithms is that the result size is bounded by min{|S1|, ..., |Sk|}; therefore, by choosing the keyword with the lowest frequency as w1 (i.e., |S1| ≤ |Si| for i ∈ [1, k]), the algorithms can guarantee that each |Li| ≤ |S1|, for i ∈ [2, k].
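To make the BS principle concrete, the following is a deliberately naive Python sketch of the binary composition (an illustration added here; it is not the SE or ILE algorithm of [14], which avoid the quadratic pair enumeration). It assumes nodes are identified by Dewey labels stored as tuples of integers, so an LCA is simply a longest common prefix and tuple comparison is document order.

def lca(u, v):
    # longest common prefix of two Dewey labels
    i = 0
    while i < len(u) and i < len(v) and u[i] == v[i]:
        i += 1
    return u[:i]

def keep_lowest(nodes):
    # drop any node that has a proper descendant in the set (the SLCA filter)
    return [x for x in nodes
            if not any(y != x and y[:len(x)] == x for y in nodes)]

def binary_slca(L, S):
    # one BS step: all pairwise LCAs of L x S, reduced to the lowest ones
    return keep_lowest({lca(u, v) for u in L for v in S})

def bs_slca(keyword_lists):
    # compose k - 1 binary steps, starting from the lowest-frequency list
    lists = sorted(keyword_lists, key=len)
    L = lists[0]
    for S in lists[1:]:
        L = sorted(binary_slca(L, S))
    return L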
[Figure 2: Example XML Tree T3]

However, a drawback of the binary-SLCA approach is that, by computing the SLCAs in terms of a series of intermediate SLCA computations, it can often incur many unnecessary intermediate SLCA computations even when the result size is small, as the following example illustrates.

Example 1.2 Consider the XML tree T3 in Figure 2. The SLCAs for the keywords {a, b} in T3 are {x1, x2, ..., x10}. Since |Sa| < |Sb|, the BS approach will enumerate each of the "a" nodes in Sa to compute a potential SLCA with it. Clearly, this approach results in many redundant computations; for example, the SLCA of ai and b1 gives the same result x1 for i ∈ [1, 100]. In fact, the BS approach will incur a total of 1000 SLCA computations to produce a result of size 10.

Our first contribution in this paper (Section 3) is the proposal of a novel approach for processing SLCA-based keyword search queries, called the multiway-SLCA approach (MS). In contrast to the BS approach, our MS approach computes each potential SLCA by taking one data node from each keyword list Si in a single step instead of breaking the SLCA computation into a series of intermediate binary SLCA computations. Conceptually, each potential SLCA computed by the BS approach can be thought of as being driven by some node from S1 (i.e., the keyword list with the lowest frequency); in contrast, our MS approach picks an "anchor" node from among the k keyword data lists to drive the multiway SLCA computation. By doing so, our approach is able to optimize the selection of the anchor node (not necessarily from S1) to maximize the skipping of redundant computations. The following example gives an idea of the skipping optimization of our MS approach.

Example 1.3 Consider the processing of the SLCA-based keyword search with keywords {a, b} on T3 in Figure 2 using our MS approach. MS first considers the first data nodes in all the keyword lists and selects the node that occurs latest in T3 as the anchor node. Thus, between a1 ∈ Sa and b1 ∈ Sb, b1 is selected as the anchor node (the property behind this optimization is explained later in the paper). Next, using b1 as the anchor, our approach selects the closest data nodes from the other keyword lists to compute a potential SLCA. Thus, the first SLCA is computed for the set {b1, a100}. After the first potential SLCA is computed, our approach considers the first nodes in the keyword lists that occur after b1 for the next computation (i.e., nodes a101 and b2). The next anchor node selected is b2, and the next SLCA computation involves b2 and a200. Clearly, the MS approach is able to skip many unnecessary computations.

In addition to introducing the notion of anchor nodes for minimizing redundant computations, we also develop several optimizations to further maximize the skipping of data nodes in the keyword lists without compromising the correctness of query results.

Our second contribution in this paper (Section 4) is the generalization of the SLCA-based approach to handle more general keyword search queries beyond the implicit AND semantics, to support any combination of AND and OR semantics. This enables more flexible and expressive keyword search queries, such as "(a OR b) AND c AND (d OR e)", to be specified. We extend our MS approach to evaluate general keyword search queries involving both AND and OR operators.

Finally, our third contribution in this paper (Section 5) is a comprehensive experimental performance evaluation which demonstrates that our proposed multiway-SLCA approach outperforms the previous binary-SLCA approach for both traditional keyword search queries and generalized keyword search queries.

2. PRELIMINARIES
Let K = {w1, ..., wk} denote an input set of k keywords, where each keyword wi is associated with a set Si of nodes in an XML document T (sorted in document order; document order corresponds to a preorder traversal of the document nodes). A set of nodes S = {v1, ..., vk} is defined to be a match for K if |S| = |K| and vi ∈ Si for each i ∈ [1, k]. We use Si to denote the data node list (sorted in document order) associated with the keyword wi. For simplicity and without loss of generality, we assume w1 to be the lowest-frequency keyword among the keywords in K; i.e., |S1| ≤ |Si| for i ∈ [1, k]. A node v in T is a lowest common ancestor (or LCA) for K if v is the lowest common ancestor node of some match S. Moreover, v is also a smallest lowest common ancestor (or SLCA) for K if no descendant of v in T is an LCA for K. Given two nodes v and w in a document tree T, v ≺p w denotes that v precedes w (or w succeeds v) in document order in T; and v ⪯p w denotes that v ≺p w or v = w.
More generally, given two matches V = {v1, ..., vk} and W = {w1, ..., wk}, where vi, wi ∈ Si, i ∈ [1, k], we say that V precedes W (or W succeeds V), denoted by V ≺p W, if they satisfy both of the following properties: (1) vi ⪯p wi for each i ∈ [1, k]; and (2) V ≠ W. We use v ≺a w to denote that v is a proper ancestor of w in T, and v ⪯a w to denote that v = w or v ≺a w.

Consider a node v and a set of nodes S. The function first(S) returns the "first" node v ∈ S such that v ⪯p vi for each vi ∈ S. Similarly, the function last(S) returns the "last" node v ∈ S such that vi ⪯p v for each vi ∈ S. Both functions return null if any of their input argument values is null. The function out(v, S) returns the "first" node v' ∈ S such that v ≺p v' and v' is neither a descendant of v nor equal to v; i.e., out(v, S) = first({v' ∈ S | v ≺p v' and not v ⪯a v'}). The function returns null if no such node exists or if v is null. The function next(v, S) returns the first node in S that succeeds v if it exists; otherwise, it returns null. The function pred(v, S) returns the predecessor of v in S, that is, the last node in S that precedes v if it exists; otherwise, it returns null. The function closest(v, S) computes the closest node in S to v as follows:

    closest(v, S) = pred(v, S)   if lca(v, next(v, S)) ≺a lca(v, pred(v, S)),
                    next(v, S)   otherwise.

In addition, closest(v, S) returns null if both pred(v, S) and next(v, S) are null, and it returns the non-null value if exactly one of pred(v, S) and next(v, S) is null. The function lca(S) computes the lowest common ancestor (or LCA) of the set of nodes S and returns null if any of its arguments is null. For notational convenience, we assume that the root node of the data tree T has a virtual parent node, denoted by droot, such that droot is a proper ancestor of every node in T. The following example illustrates our definitions.

Example 2.1 Consider the XML document tree T1 shown in Figure 1(a) and the keyword search query K = {a, b, c}. Note that {a1, b1, c2} is a match for K, but neither {a1, c2} nor {x1, b2, c1} is a match for K. We have e1 ≺p b2 but e2 ⊀p x4. Moreover, {a1, b1, c1} ≺p {a2, b2, c2}. We have x2 ≺a e1 and x3 ≺a c2. If S = {d1, c1, d2}, then first(S) = d1, last(S) = d2, out(x4, S) = d2, next(x4, S) = c1, next(c1, S) = d2, next(e2, S) = null, pred(a1, S) = d1, pred(x4, S) = null, closest(b1, S) = c1, closest(a2, S) = d2, and lca(S) = x3.
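These primitives translate almost directly into code once each node carries its Dewey label (the representation the paper adopts in Section 3.5). The following is a minimal Python sketch under that assumption, with labels stored as tuples of integers so that plain tuple comparison is document order, and first/last of a node set are simply min/max; the function names (e.g. next_node, is_ancestor_or_self) are our own.

import bisect

def lca(u, v):
    # LCA of two nodes = longest common prefix of their Dewey labels
    i = 0
    while i < len(u) and i < len(v) and u[i] == v[i]:
        i += 1
    return u[:i]

def is_ancestor_or_self(a, d):
    # a is an ancestor of d, or a = d (prefix test on Dewey labels)
    return d[:len(a)] == a

def pred(v, S):
    # last node in the sorted list S that strictly precedes v, else None
    i = bisect.bisect_left(S, v)
    return S[i - 1] if i > 0 else None

def next_node(v, S):
    # first node in the sorted list S that strictly succeeds v, else None
    i = bisect.bisect_right(S, v)
    return S[i] if i < len(S) else None

def closest(v, S):
    # node of S whose LCA with v is deepest; ties go to the successor
    p, n = pred(v, S), next_node(v, S)
    if p is None or n is None:
        return p if n is None else n
    return p if len(lca(v, p)) > len(lca(v, n)) else n

def out(v, S):
    # first node of S that succeeds v and lies outside v's subtree, else None
    i = bisect.bisect_right(S, v)
    while i < len(S) and is_ancestor_or_self(v, S[i]):
        i += 1
    return S[i] if i < len(S) else None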
3. OUR APPROACH
In this section, we present our new approach for processing SLCA-based keyword search queries, called the multiway-SLCA approach (MS). As alluded to in the introduction, the key motivation behind our MS approach is to avoid the unnecessary overhead of the BS approach, where SLCAs are computed in terms of intermediate SLCA computations by enumerating each data node in the lowest-frequency keyword list. As Example 1.2 demonstrates, rigidly driving the SLCA computations from the lowest-frequency keyword list can result in many redundant computations, particularly when the result size is small. We first introduce the notion of anchor nodes in Section 3.1, which is a central idea in our MS approach. Section 3.2 then presents several important properties about anchored matches. We present our first MS-based algorithm, called basic multiway-SLCA (BMS), in Section 3.3, followed by our second, improved MS-based algorithm, called incremental multiway-SLCA (IMS), in Section 3.4.

3.1 Anchor Nodes
A match S = {v1, ..., vk} is said to be anchored by a node va ∈ S if, for each vi ∈ S - {va}, vi = closest(va, Si). We refer to va as the anchor node of S. The concept of anchor nodes is important to our multiway-SLCA approach as it enables us to restrict matches to those that are anchored by some node, as the following result demonstrates.

Lemma 3.1. If lca(S) is an SLCA and v ∈ S, then lca(S) = lca(S'), where S' is the match anchored by v.

Proof. The proof is established by contradiction. Suppose that lca(S) ≠ lca(S'). Since lca(S) is an SLCA and v ∈ S ∩ S', this implies that lca(S) is a proper ancestor of lca(S'). It follows that there must exist some Si such that lca(S') is not an ancestor of closest(v, Si). However, since closest(v, Si) ∈ S' and lca(S') is an ancestor of every node in S', we have a contradiction.

Thus, our MS approach only considers anchored sets for computing potential SLCAs. The optimization potential of the above result was illustrated earlier in Example 1.3. Recall that b1 is the selected anchor node for the first SLCA computation, with a100 = closest(b1, Sa). By choosing b1 as the anchor node (instead of a1, as in the BS approach) for the first SLCA computation, it follows from Lemma 3.1 that it is unnecessary to compute SLCAs for matches that include any ai, i ∈ [1, 99], because such matches would necessarily include b1, and Lemma 3.1 states that it is not possible to generate new SLCAs with such matches. Note that in our MS approach, an anchor node can be chosen from any of the keyword data lists (i.e., S1, ..., Sk). For notational convenience, when we denote an anchor node as vm, we also mean that vm is selected from Sm, m ∈ [1, k].

3.2 Properties
In this section, we present several important properties that form the basis of the optimizations in our MS-based algorithms. The following result states that if the LCAs of two matches S and S' are both distinct SLCAs, then the two matches must necessarily be disjoint.

Lemma 3.2. If lca(S) and lca(S') are distinct SLCAs, then S ∩ S' = ∅.

Proof. Suppose lca(S) and lca(S') are two distinct SLCAs. If S and S' contain some common node, then lca(S) and lca(S') must be related in one of three possible ways: (a) lca(S) = lca(S'); (b) lca(S) is an ancestor of lca(S'); or (c) lca(S) is a descendant of lca(S'). Case (a) contradicts the fact that lca(S) and lca(S') are distinct SLCAs. Cases (b) and (c) imply that either lca(S) or lca(S') is not an SLCA, again contradicting the fact that lca(S) and lca(S') are two distinct SLCAs. Thus, it follows by contradiction that S ∩ S' = ∅.

Lemma 3.3. Let V and W be two matches such that V ≺p W. If lca(W) is not a descendant of lca(V), then for any match X where W ≺p X, lca(X) is also not a descendant of lca(V).

Proof. Let V ≺p W and suppose lca(W) is not a descendant of lca(V). Then either (a) all the nodes in the subtree rooted at lca(V) precede all the nodes in the subtree rooted at lca(W), or (b) lca(W) is an ancestor of lca(V). For case (a), if lca(X) were a descendant of lca(V), then all the nodes in the subtree rooted at lca(X) would precede all the nodes in the subtree rooted at lca(W), contradicting the fact that X succeeds W. Thus, lca(X) cannot be a descendant of lca(V) in case (a).
For case (b), W must contain some node w such that w is not a descendant of lca(V) and w succeeds V; otherwise, lca(W) could not be an ancestor of lca(V). Therefore, if lca(X) were a descendant of lca(V), this would imply that w ∈ W succeeds X, contradicting the fact that X succeeds W. Thus, lca(X) cannot be a descendant of lca(V) in case (b) either.

Lemma 3.3 is a useful generalization of Lemma 2 in [14] that is exploited in our algorithms to determine whether a computed LCA is guaranteed to be an SLCA. Specifically, if V ≺p W and lca(W) is not a descendant of lca(V), then no match succeeding W can yield an LCA that is a descendant of lca(V), so the candidate lca(V) can no longer be invalidated and can be confirmed as an SLCA.

Lemma 3.4. Consider two matches S and S', where S ≺p S', and S is anchored by some node v. If S' contains some node u where u ⪯p v, then lca(S') is either equal to lca(S) or an ancestor of lca(S).

Proof. Let v ∈ Si, i ∈ [1, k], and let S' ∩ Si = {v'}. Since S ≺p S', we must have v ⪯p v'. Since {u, v'} ⊆ S' and u ⪯p v ⪯p v', v must be a descendant of lca(S'), which implies that lca(S') is either equal to lca(S), a descendant of lca(S), or an ancestor of lca(S). However, if lca(S') were a descendant of lca(S), it would contradict the fact that S is anchored by v. Therefore, lca(S') is either equal to lca(S) or an ancestor of lca(S).

Lemma 3.4 provides a useful property for optimizing the selection of the next match to be considered. Specifically, if we have considered a match S that is anchored by a node va, then we can skip matches that contain any node v with v ⪯p va.

Lemma 3.5. Let S and S' be two matches. If S' contains two nodes, one of which is a descendant of lca(S) while the other is not, then lca(S') is either equal to lca(S) or an ancestor of lca(S).

Proof. Let v, w ∈ S', where v is a descendant of lca(S) and w is not a descendant of lca(S). Since v is a descendant of both lca(S) and lca(S'), either lca(S) and lca(S') are equal or one is an ancestor of the other. However, since w is not a descendant of lca(S), lca(S') cannot be a descendant of lca(S); and the claim follows.

Lemma 3.5 provides another useful property for optimizing the selection of the next match to be considered. Specifically, if we have considered a match S and lca(S) has been confirmed to be an SLCA, then we can skip any match S' that contains some node that is a descendant of lca(S) as well as another node that is not a descendant of lca(S).

Lemma 3.6. Let S be a set of nodes. Then lca(S) = lca(first(S), last(S)).

Proof. Since {first(S), last(S)} ⊆ S, lca(S) is either equal to or an ancestor of lca(first(S), last(S)). Clearly, for each v ∈ S where first(S) ⪯p v ⪯p last(S), v is a descendant of lca(first(S), last(S)). It follows that lca(S) = lca(first(S), last(S)).

Lemma 3.6 states that the LCA of a set of nodes S is equal to the LCA of the two extreme nodes (i.e., the first and last nodes) in S. This property enables the LCA computation for a set of nodes S to be improved significantly, as it suffices to compute the LCA of S in terms of only its first and last nodes.

3.3 Basic Multiway-SLCA Algorithm (BMS)
In this section, we present our first multiway-SLCA-based algorithm, called Basic Multiway-SLCA (BMS), for computing the SLCAs for a set of keywords {w1, ..., wk}. The details are given in Algorithm 1, which takes the k keyword data lists S1, ..., Sk as input and returns the SLCAs as a collection of nodes. Each Si is the list of data nodes associated with keyword wi. The algorithm computes the SLCAs iteratively.
Algorithm 1 Basic Multiway-SLCA (S1, ..., Sk)
1: let vm = last({first(Si) | i ∈ [1, k]}), where vm ∈ Sm
2: initialize n = 1; α1 = droot
3: while (vm ≠ null) do
4:   if (m ≠ 1) then
5:     v1 = closest(vm, S1)
6:     if (vm ≺p v1) then
7:       vm = v1
8:     end if
9:   end if
10:  vi = closest(vm, Si) for each i ∈ [1, k], i ≠ m
11:  ℓ = lca(first(v1, ..., vk), last(v1, ..., vk))
12:  if (αn ≺a ℓ) then
13:    αn = ℓ
14:  else if (ℓ is not an ancestor of αn) then
15:    n = n + 1; αn = ℓ
16:  end if
17:  vm = last({next(vm, Si) | i ∈ [1, k], vi ⪯p vm})
18:  if (vm ≠ null) and (αn ⪯a vm) then
19:    vm = last({vm} ∪ {out(αn, Si) | i ∈ [1, k], i ≠ m})
20:  end if
21: end while
22: if (α1 = droot) then return ∅ else return {α1, ..., αn}

At each iteration, an anchor node vm is judiciously selected to compute the match anchored by vm and its LCA (denoted by ℓ). If ℓ is potentially an SLCA, it is maintained in an intermediate SLCA result list α1, ..., αn, n ≥ 1, where all the LCAs αi in the list are definite SLCAs except possibly the most recently computed candidate αn. To minimize the computation of LCAs that are not SLCAs, it is important to optimize the anchor node selected at each iteration.

Initially, step 2 initializes the first candidate SLCA α1 to be droot (the virtual root node of the data tree); if α1 remains droot at the end of the algorithm (step 22), then the SLCA result list is empty. The first anchor node vm is selected in step 1. Instead of choosing the first node v1 ∈ S1 as the anchor (as is done in the BS approach), BMS selects the first node vm ∈ Sm, m ∈ [1, k], that is the "furthest" node among all the first nodes in S1, ..., Sk. In doing so, all the nodes in S1 that precede u1 = closest(vm, S1) are skipped from consideration as anchor nodes. The correctness of this optimization stems from the fact that for each v ∈ S1 that precedes u1, closest(v, Sm) must be vm; therefore, by Lemma 3.1, no SLCAs will be missed by using vm as the first anchor node. Steps 4 to 9 further optimize the selection of the anchor node to ensure that the total number of candidate SLCAs computed is no more than |S1| (elaborated in Section 3.5). Specifically, if the selected anchor node vm precedes closest(vm, S1), then by Lemma 3.1, no SLCAs will be omitted by replacing the anchor node vm with closest(vm, S1). The usefulness of this optimization is illustrated in Example 3.2.

After an anchor node vm has been chosen, step 10 computes the match anchored by vm, and step 11 computes the LCA of this match in terms of only its first and last nodes (based on Lemma 3.6). Steps 12 to 16 check whether the newly computed LCA ℓ can be a candidate SLCA and, if so, whether ℓ can be used to eliminate the previous candidate SLCA αn. Steps 17 to 20 optimize the selection of the next anchor node by choosing the furthest possible node that maximizes the number of skipped nodes: step 17 is based on Lemma 3.4, while steps 18 to 20 are based on Lemma 3.5.
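The following minimal Python sketch mirrors the main loop of Algorithm 1 (an illustration, not the authors' implementation). It reuses the helper functions from the sketch in Section 2 and, for brevity, leaves out the additional anchor skipping of steps 18 to 20, which by Lemmas 3.4 and 3.5 affects only how many anchors are examined, not the output.

def bms_slca(keyword_lists):
    # keyword_lists: sorted lists of Dewey labels; keyword_lists[0] plays the role of S1
    S = keyword_lists
    DROOT = ()                                   # virtual parent of the document root
    # step 1: the anchor is the latest among the first nodes of all lists
    firsts = [(Si[0], i) for i, Si in enumerate(S)]
    vm, m = max(firsts)
    cands = [DROOT]                              # alpha_1..alpha_n; the last entry is unconfirmed
    while vm is not None:
        # steps 4-9: if possible, move the anchor into S[0] (bounds #candidates by |S1|)
        if m != 0:
            v1 = closest(vm, S[0])
            if vm < v1:                          # vm precedes v1 in document order
                vm, m = v1, 0
        # step 10: the match anchored by vm
        match = [vm if i == m else closest(vm, Si) for i, Si in enumerate(S)]
        # step 11: LCA of the match from its first and last nodes only (Lemma 3.6)
        ell = lca(min(match), max(match))
        # steps 12-16: candidate maintenance
        top = cands[-1]
        if top != ell and is_ancestor_or_self(top, ell):
            cands[-1] = ell                      # ell is lower: it replaces the candidate
        elif not is_ancestor_or_self(ell, top):
            cands.append(ell)                    # top can no longer be invalidated (Lemma 3.3)
        # step 17 (Lemma 3.4): next anchor = latest successor over the lists whose
        # matched node does not succeed the current anchor
        nxt = [(next_node(vm, Si), i) for i, Si in enumerate(S) if match[i] <= vm]
        if any(v is None for v, _ in nxt):
            vm = None                            # some keyword is exhausted: no further matches
        else:
            vm, m = max(nxt)
    return [] if cands == [DROOT] else cands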
Example 3.1 Consider computing the SLCAs for the set of keywords {a, b, c, d, e} on the data tree T1 in Figure 1(a) using the BMS algorithm. Since each keyword has the same frequency, let S1 be the list of data nodes for keyword "a". The first anchor node selected is c1, and the first candidate SLCA computed is α1 = lca(d1, e1, b1, a1, c1) = x4. The next anchor node selected is a2, and the second candidate SLCA computed is ℓ = lca(d2, e2, b2, c2, a2) = x1. However, since x1 ≺a x4, x1 is not an SLCA. The next anchor node selected has a null value (due to next(a2, S1) = null), and the algorithm terminates with x4 as the only SLCA.

The next example illustrates the importance of the optimization performed by steps 4 to 9.

Example 3.2 Consider computing the SLCAs for the set of keywords {a, b} on the data tree T2 in Figure 1(b). Here, S1 refers to the list of nodes for keyword "a". Using a non-optimized BMS algorithm that excludes steps 4 to 9, the first anchor node is b1 and the candidate SLCA computed is lca(b1, a2) = b1. Similarly, the subsequent sequence of anchor nodes selected is b2, b3, ..., bn, and the candidate SLCA computed for each of these anchor nodes bi is lca(bi, a2) = b1. Clearly, the non-optimized BMS incurs many redundant SLCA computations that involve the same node a2. In general, the non-optimized BMS performs poorly when many anchor nodes share the same closest node (w.r.t. some keyword) that succeeds the anchor nodes. The BMS algorithm avoids this problem by bounding the number of computed SLCAs to be no more than |S1| using steps 4 to 9. In this case, the first anchor node selected is optimized to a2 and the first candidate SLCA computed is lca(a2, b1) = b1. The next anchor node selected has a null value (since next(a2, S1) = null), and the algorithm terminates without any redundant SLCA computations. Thus, the number of candidate SLCA computations is reduced from n to just one.
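As a small, hypothetical test of the sketch above: the tree T3 of Figure 2 can be encoded with Dewey labels as root (0,), xi = (0, i) with a-children (0, i, 1), ..., (0, i, 100) and a b-child (0, i, 101), plus the remaining b-nodes (0, 11), ..., (0, 1001) directly under the root (an assumed encoding, chosen to be consistent with Examples 1.2 and 1.3). Per Example 1.2, the expected SLCAs are x1, ..., x10.

Sa = [(0, i, j) for i in range(1, 11) for j in range(1, 101)]      # a1 .. a1000
Sb = sorted([(0, i, 101) for i in range(1, 11)]                    # b1 .. b10
            + [(0, j) for j in range(11, 1002)])                   # b11 .. b1001 under the root
print(bms_slca([Sa, Sb]))   # -> [(0, 1), (0, 2), ..., (0, 10)], i.e. x1, ..., x10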
3.4 Incremental Multiway-SLCA Algorithm (IMS)
In this section, we present our second multiway-SLCA-based algorithm, called Incremental Multiway-SLCA (IMS), whose details are shown in Algorithm 2. IMS is an optimized variant of BMS that reduces the number of LCA computations. In BMS (Algorithm 1), each computation of ℓ incurs at least 2k - 1 LCA computations: each of the k - 1 calls to the closest function in step 10 requires two LCA computations, and step 11 adds another LCA computation. To avoid the large number of LCA computations incurred by an explicit computation of M, the IMS algorithm determines first(M) and last(M) without actually computing M. In the following, we analyze the properties of first(M) and last(M) and explain how this can be achieved.

Algorithm 2 Incremental Multiway-SLCA (S1, ..., Sk)
1: let vm = last({first(Si) | i ∈ [1, k]}), where vm ∈ Sm
2: initialize n = 1; α1 = droot
3: while (vm ≠ null) do
4:   if (m ≠ 1) then
5:     v1 = closest(vm, S1)
6:     if (vm ≺p v1) then
7:       vm = v1
8:     end if
9:   end if
10:  P = {pred(vm, Si) | i ∈ [1, k], i ≠ m} ∪ {vm}
11:  N = {next(vm, Si) | i ∈ [1, k], next(vm, Si) ≠ null}
12:  initialize rmax = last(N); r = vm
13:  repeat
14:    remove α from P, where α = first(P)
15:    ℓ = lca(α, r)
16:    r = last(r, next(vm, Sj)), where α ∈ Sj
17:  until (r = null) or (ℓ is not an ancestor of r) or (r = rmax)
18:  if (r = null) or (ℓ is not an ancestor of r) then
19:    if (αn ≺a ℓ) then
20:      αn = ℓ
21:    else if (ℓ is not an ancestor of αn) then
22:      n = n + 1; αn = ℓ
23:    end if
24:    vm = last(r, out(αn, S1), ..., out(αn, Sk))
25:  else
26:    vm = r
27:  end if
28: end while
29: if (α1 = droot) then return ∅ else return {α1, ..., αn}

By definition of the match M anchored by vm, M must satisfy the following three conditions:
1. M ⊆ {vm} ∪ P ∪ N, where P = {pred(vm, Si) | i ∈ [1, k], i ≠ m, pred(vm, Si) ≠ null} and N = {next(vm, Si) | i ∈ [1, k], i ≠ m, next(vm, Si) ≠ null};
2. M ∩ Si ≠ ∅ for each i ∈ [1, k]; and
3. vm ∈ M.

Since M must contain vm and every node in P precedes vm, it follows that first(M) ∈ P ∪ {vm}. Furthermore, last(M) can be determined once first(M) is known. Let P' ⊆ P denote the subset of nodes in P that precede first(M) (i.e., P' = {v ∈ P | v ≺p first(M)}), and let N' ⊆ N denote the subset of nodes in N that corresponds to P' and succeeds vm (i.e., N' = {next(vm, Si) | i ∈ [1, k], pred(vm, Si) ∈ P'}). Since P' ∩ M = ∅, in order for M to satisfy condition (2), it is necessary that N' ⊆ M. Moreover, since |P'| = |N'| and |M| = k, we must have M = (P - P') ∪ {vm} ∪ N' and last(M) = last(N').

Since there are |P| + 1 possible values for first(M), let M1, M2, ..., M|P|+1 denote the sequence of matches where for each i ∈ [1, |P|+1], we have (1) vm ∈ Mi; (2) first(Mi) ∈ P ∪ {vm}; and (3) first(M1) ≺p first(M2) ≺p ... ≺p first(M|P|+1). In other words,

    first(Mi) = first(P ∪ {vm})                                         if i = 1,
                first((P ∪ {vm}) - {first(M1), ..., first(Mi-1)})        otherwise.      (1)
Based on the preceding analysis of first(M) and last(M), last(Mi) can be computed incrementally as follows:

    last(Mi) = vm                                    if i = 1,
               last(Mi-1 ∪ {next(vm, Sx)})           otherwise,      (2)

where first(Mi-1) ∈ Sx. It follows that

    first(M1) ⪯p ... ⪯p first(M|P|+1) ⪯p last(M1) ⪯p ... ⪯p last(M|P|+1).      (3)

Clearly, M = Mj for some j ∈ [1, |P|+1], where lca(Mi) ⪯a lca(Mj) for all i ∈ [1, |P|+1]. We can characterize Mj by the following two properties: (P1) for i ∈ [1, j), lca(Mi) ≺a last(Mi+1); and (P2) if j < |P|+1, then lca(Mj) is not an ancestor of last(Mj+1).

Property (P1) implies that lca(Mi) ⪯a lca(Mi+1) for i ∈ [1, j). Specifically, since first(Mi) ⪯p first(Mi+1) ⪯p last(Mi) (by Equation (3)), it follows that lca(Mi) ⪯a first(Mi+1); combining this with property (P1), we have lca(Mi) ⪯a lca(Mi+1). Property (P2) implies that lca(Mi) ⪯a lca(Mj) for i ∈ (j, |P|+1]. To see this, note that lca(Mj) ⪯a first(Mi) for i ∈ (j, |P|+1] (by Equation (3)). Furthermore, since lca(Mj) is not an ancestor of last(Mj+1) (property (P2)) and last(Mj+1) ⪯p last(Mi) for i ∈ (j+1, |P|+1] (by Equation (3)), it follows that lca(Mj) is not an ancestor of last(Mi) for i ∈ (j, |P|+1]. Thus, for i ∈ (j, |P|+1], we have Mj ≺p Mi, lca(Mj) ⪯a first(Mi), and lca(Mj) is not an ancestor of last(Mi); it follows from Lemma 3.5 that lca(Mi) ⪯a lca(Mj).

The IMS algorithm (Algorithm 2) shares many similarities with the BMS algorithm in the previous section. The key difference lies in steps 10 to 17, which determine lca(M) for the match M anchored by a node vm without actually computing M. The repeat loop enumerates a sequence of matches M1, M2, ... to compute first(M) and last(M) and hence lca(M). In the ith iteration of the repeat loop (steps 13 to 17), first(Mi) is computed in step 14 as α, and lca(Mi) is computed in step 15 as ℓ (with r representing last(Mi)). Step 16 determines last(Mi+1) for the next iteration. The search for M is terminated when any one of the three conditions in step 17 is met. First, if r = null, then next(vm, Sj) = null and there are no further matches in the data; therefore, ℓ = lca(M), and the next anchor node is correctly set to null by step 24. Second, if ℓ is not an ancestor of r, then ℓ = lca(M), and Lemma 3.5 is applied to optimize the selection of the next anchor node in step 24. Finally, if r = last(N), then all the matches Mi subsequently enumerated within the repeat loop must have last(Mi) = last(N) as well; therefore, M must correspond to the very last match in the enumeration. To skip to this last match without continuing the enumeration, step 26 applies Lemma 3.1 to update the next anchor node to be r.

Note that the number of LCA computations incurred by IMS for each candidate SLCA computation is at least one (one iteration of the repeat loop) and at most k + 1 (k - 1 iterations of the repeat loop and one call to the closest function). In contrast, BMS requires between 2k - 1 and 2k + 1 LCA computations.

Example 3.3 Consider again computing the SLCAs for the set of keywords {a, b, c, d, e} on the data tree T1 in Figure 1(a), where S1 is associated with keyword "a". Using the IMS algorithm, the first anchor node selected is c1, P = {d1, e1, b1, a1, c1}, and N = {d2, e2, b2, c2, a2}. In the first iteration of the repeat loop, ℓ = lca(d1, c1) = x4. r is then updated to d2 and the repeat loop terminates since x4 is not an ancestor of d2. Therefore, the first candidate SLCA computed is α1 = x4. The next anchor node selected is a2, P = {d2, e2, b2, c2, a2}, and N = {}. In the first iteration of the repeat loop, ℓ = lca(d2, a2) = x1. r is then updated to null and the repeat loop terminates. Since ℓ ≺a α1, ℓ is definitely not an SLCA. The next anchor node has a null value and the algorithm terminates with x4 as the only SLCA. The number of LCA computations incurred by the IMS algorithm is only two. In contrast, the BMS algorithm incurs 18 LCA computations (Example 3.1).
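The enumeration described by Equations (1)-(3) can be phrased directly in code. The sketch below is our own illustration (reusing the helpers from the earlier sketches); it materializes only the (first, last) pair of each candidate match Mi and returns the lowest LCA among them, which by the analysis above equals lca(M) for the match M anchored by vm. Algorithm 2 additionally stops the enumeration early using properties (P1) and (P2) and reuses r to pick the next anchor.

def anchored_lca_incremental(vm, m, S):
    # Enumerate M1, ..., M_{|P|+1} via their (first, last) pairs only and
    # return the lowest lca(M_i), i.e. lca of the match anchored by vm.
    P = sorted([(pred(vm, Si), i) for i, Si in enumerate(S)
                if i != m and pred(vm, Si) is not None] + [(vm, m)])
    best = None
    r = vm                              # last(M1) = vm
    for node, j in P:                   # node = first(M_i), consumed in document order
        ell = lca(node, r)              # lca(M_i) = lca(first(M_i), last(M_i)), Lemma 3.6
        if best is None or is_ancestor_or_self(best, ell):
            best = ell                  # keep the lowest LCA seen so far
        nxt = next_node(vm, S[j])       # Equation (2): fold in next(vm, S_x), first(M_i) in S_x
        if nxt is None:                 # no further matches exist beyond M_i
            break
        r = max(r, nxt)
    return best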
3.5 Analysis
In this section, we analyze the time complexity of our new algorithms. We begin by establishing an upper bound on the number of candidate SLCAs computed by each of the BMS and IMS algorithms.

Lemma 3.7. The number of candidate SLCAs computed by each of the BMS and IMS algorithms is no more than |S1|.

Proof. We prove the claim for BMS; the proof for IMS follows similarly. The upper bound is established by showing that for any two candidate SLCAs computed by BMS, their corresponding matches do not contain the same node from S1. By step 1, the first anchor node selected either succeeds or is equal to first(S1). Let vm be the anchor node used to compute a candidate SLCA lca(M). By the optimization in steps 4 to 9, vm does not precede closest(vm, S1). Moreover, by step 17, if M ∩ S1 = {v1}, then the next anchor node selected either succeeds or is equal to next(v1, S1). Thus, since no two matches computed by BMS share the same node from S1, the claim is established for BMS.

We now consider the costs of the various functions. Let d denote the height of the XML data tree, and let S denote the data node list with the highest frequency; i.e., |Si| ≤ |S| for i ∈ [1, k]. As in [14], we assume that each data node is stored together with its Dewey label, which enables the lca function to be computed efficiently in O(d) time. The cost of first(v1, ..., vk) and last(v1, ..., vk) is each O(k). The cost of pred(v, Si), next(v, Si), out(v, Si), and closest(v, Si) is each O(d log(|Si|)), based on a binary search of Si and comparing nodes using their Dewey labels. For the BMS algorithm, the cost to compute a candidate SLCA is O(kd log(|S|)) (due to step 10). Since there are at most |S1| candidate SLCAs (by Lemma 3.7), the time complexity of the BMS algorithm is O(kd|S1| log(|S|)). For the IMS algorithm, the cost to compute a candidate SLCA is also O(kd log(|S|)) (due to steps 10 and 11); thus the time complexity of the IMS algorithm is also O(kd|S1| log(|S|)). However, our experimental results show that IMS outperforms BMS, as the number of candidate SLCAs computed by IMS is less than that computed by BMS (by up to a factor of 30). In comparison, the time complexities of SE and ILE are, respectively, O(kd|S|) and O(|S1|kd log(|S|) + |S1|^2) [14].
4. AND-OR KEYWORD SEARCH
In this section, we examine how to process more general SLCA-based keyword search queries that go beyond the AND semantics of conventional queries to support any combination of AND and OR boolean operators. We consider AND-OR keyword search queries of the form: Q = (Q) | Q and Q | Q or Q | w, where w denotes some keyword. In the following, we consider two approaches to process SLCA-based AND-OR keyword search queries. The first approach is a straightforward application of the algorithms presented in the preceding section for processing conventional SLCA-based AND keyword search queries, by expressing the general AND-OR queries in disjunctive normal form (DNF). The second approach is an extension of our multiway-SLCA approach (MS).

4.1 Simple AND-OR Algorithm (SA)
A straightforward approach to process a general AND-OR query Q is to rewrite Q in DNF and evaluate it in two stages: first, each disjunct in Q is evaluated using an existing AND-query evaluation algorithm; next, the results of the individual evaluations are combined by eliminating intermediate SLCAs that are ancestor nodes of some other intermediate SLCAs. The algorithm, referred to as Simple AND-OR (SA), is shown in Algorithm 3.

Algorithm 3 Simple AND-OR-SLCA (Q)
1: let Q = C1 ∨ C2 ∨ ... ∨ Ck, where each Ci = wi,1 ∧ wi,2 ∧ ... ∧ wi,ni, ni ≥ 1
2: initialize R to be empty
3: for i = 1 to k do
4:   let T = result of evaluating Ci using some AND-algorithm
5:   for each node v ∈ T do
6:     initialize toDeleteV = false
7:     for each node v' ∈ R do
8:       if (v ⪯a v') then
9:         toDeleteV = true
10:        exit inner for-loop
11:      else if (v' ≺a v) then
12:        remove v' from R
13:      end if
14:    end for
15:    if (toDeleteV) then
16:      delete v from T
17:    end if
18:  end for
19:  R = R ∪ T
20: end for
21: return R

4.2 AND-OR Multiway-SLCA (AOMS)
In this section, we show how our multiway-SLCA approach for processing conventional AND keyword search queries can be easily generalized to process AND-OR keyword search queries. The extended algorithm, called AND-OR Multiway-SLCA (AOMS), is shown in Algorithm 4. Our approach requires a general AND-OR query Q to be expressed in conjunctive normal form (CNF), C1 ∧ ... ∧ Cn, where each conjunct Ci = wi,1 ∨ wi,2 ∨ ... ∨ wi,ni is a disjunction of ni keywords, ni ≥ 1. We use Si,j to denote the list of data nodes associated with each keyword wi,j, i ∈ [1, n], j ∈ [1, ni]. Note that the AOMS algorithm is almost identical to the IMS algorithm except that the six functions first, last, pred, next, closest, and out now operate on a conjunct Ci instead of a keyword list Si. These mildly generalized versions of the definitions from Section 2 are as follows:

· first(Ci) = first({first(Si,j) | j ∈ [1, ni], first(Si,j) ≠ null}).
· last(Ci) = last({last(Si,j) | j ∈ [1, ni], last(Si,j) ≠ null}).
· pred(v, Ci) = last({pred(v, Si,j) | j ∈ [1, ni], pred(v, Si,j) ≠ null}).
· next(v, Ci) = first({next(v, Si,j) | j ∈ [1, ni], next(v, Si,j) ≠ null}).
· closest(v, Ci) = pred(v, Ci) if lca(v, next(v, Ci)) ≺a lca(v, pred(v, Ci)); otherwise, closest(v, Ci) = next(v, Ci).
· out(v, Ci) = first({out(v, Si,j) | j ∈ [1, ni], out(v, Si,j) ≠ null}).

For simplicity and without loss of generality, we assume that |C1| ≤ |Ci| for i ∈ [1, n], where |Ci| = Σ_{j=1}^{ni} |Si,j|.

Algorithm 4 AND-OR Multiway-SLCA (C1, ..., Cn)
1: initialize x = 1; α1 = droot
2: let vm = last({first(Ci) | i ∈ [1, n]}), where vm = first(Cm)
3: while (vm ≠ null) do
4:   if (m ≠ 1) then
5:     v1 = closest(vm, C1)
6:     if (vm ≺p v1) then
7:       vm = v1
8:     end if
9:   end if
10:  P = {pred(vm, Ci) | i ∈ [1, n], i ≠ m} ∪ {vm}
11:  N = {next(vm, Ci) | i ∈ [1, n], next(vm, Ci) ≠ null}
12:  initialize rmax = last(N); r = vm
13:  repeat
14:    remove α from P, where α = first(P)
15:    ℓ = lca(α, r)
16:    r = last(r, next(vm, Cj)), where α ∈ Cj
17:  until (r = null) or (ℓ is not an ancestor of r) or (r = rmax)
18:  if (r = null) or (ℓ is not an ancestor of r) then
19:    if (αx ≺a ℓ) then
20:      αx = ℓ
21:    else if (ℓ is not an ancestor of αx) then
22:      x = x + 1; αx = ℓ
23:    end if
24:    vm = last(r, out(αx, C1), ..., out(αx, Cn))
25:  else
26:    vm = r
27:  end if
28: end while
29: if (α1 = droot) then return ∅ else return {α1, ..., αx}
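The conjunct-level functions are simply the per-keyword functions folded with first/last. A minimal Python sketch (our own naming, with a conjunct represented as a list of sorted keyword lists; the per-keyword helpers from the earlier sketches are assumed to be in scope):

def pred_c(v, C):
    # latest predecessor of v across the keyword lists of conjunct C, else None
    cands = [p for S in C for p in [pred(v, S)] if p is not None]
    return max(cands) if cands else None

def next_c(v, C):
    # earliest successor of v across the keyword lists of conjunct C, else None
    cands = [n for S in C for n in [next_node(v, S)] if n is not None]
    return min(cands) if cands else None

def first_c(C):
    # earliest node of the conjunct, else None
    cands = [S[0] for S in C if S]
    return min(cands) if cands else None

def closest_c(v, C):
    # node of the conjunct whose LCA with v is deepest; ties go to the successor
    p, n = pred_c(v, C), next_c(v, C)
    if p is None or n is None:
        return p if n is None else n
    return p if len(lca(v, p)) > len(lca(v, n)) else n

With these in place, the BMS/IMS sketches shown earlier carry over to AND-OR queries by driving them over conjuncts instead of keyword lists, assuming the query has been put in CNF.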
5. EXPERIMENTAL STUDY
To verify the effectiveness of our proposed algorithms, we conducted extensive experiments to compare their performance against existing approaches for evaluating both AND and AND-OR keyword search queries.

5.1 Experimental Setup
Similar to what was done in [14], where there is both a non-indexed version of the algorithm (i.e., SE) and an indexed version of the algorithm (i.e., ILE), we also created two versions of each of our proposed algorithms. We use IBMS, IIMS, and IAOMS, respectively, to refer to the indexed versions of BMS, IMS, and AOMS. As in [14], in the non-indexed algorithms, all the data nodes are organized using a B-tree where the B-tree keys are the keywords of the data nodes and the B-tree data associated with each key is the list of Dewey labels of the data nodes having that key as a keyword. In the indexed algorithms, all the data nodes are organized using a B-tree where each B-tree key is a composite key consisting of a keyword (as primary key) and a Dewey number (as secondary key). No data values are associated with this composite-key organization.
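The paper does not spell out how the composite (keyword, Dewey) key is encoded; one hypothetical, order-preserving encoding that makes the secondary ordering coincide with document order is sketched below.

import struct

def composite_key(keyword: str, dewey: tuple) -> bytes:
    # keyword bytes, a 0x00 separator, then each Dewey component as a
    # fixed-width big-endian unsigned int, so that byte order of the keys
    # equals (keyword, document order) -- assuming components fit in 32 bits
    return keyword.encode("utf-8") + b"\x00" + b"".join(struct.pack(">I", c) for c in dewey)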
For the simple approach of evaluating AND-OR queries (Section 4.1), we have six variants of the algorithm, denoted by SA-SE, SA-BMS, SA-IMS, SA-ILE, SA-IBMS, and SA-IIMS, where SA-X denotes the variant using algorithm X to evaluate the AND subqueries of the AND-OR query.

The evaluation of the algorithms was carried out using different classes of queries. Each class of AND queries is denoted by kN-L-H, where N, L, and H are three positive integer values: N represents the number of keywords, and L and H, with L ≤ H, represent two keyword frequencies such that one of the N keywords has the low frequency L while each of the remaining N - 1 keywords has the high frequency H; thus, L = |S1|. Each class of AND-OR queries (in CNF) is denoted by cM-kN-L-H, where M represents the number of conjuncts in a query, N represents the number of keywords in each conjunct, L represents the frequency of each keyword in one conjunct, and H (with L ≤ H) represents the frequency of each keyword in the remaining M - 1 conjuncts. For each class of queries, a set of 10 random queries was generated, and each query was executed six times, with its average execution time computed over the last five runs. All the experiments were conducted using a DBLP dataset [6] with two million data nodes. Our implementation used Berkeley DB (Java Edition) [3] to store the keyword data lists, similar to what was done in [14]. The Berkeley DB database was configured with a page size of 8KB and a cache size of 1GB. All the experiments were conducted on a 3GHz dual-core desktop PC with 1GB of RAM.

5.2 AND Keyword Search Queries
[Figure 3: Comparison for AND queries — evaluation time (ms) of SE, IMS, ILE, and IIMS for queries Q1 to Q10; panels: (a) k2-1000-1000, (b) k4-1000-1000, (c) k2-100-1000, (d) k4-100-1000, (e) k2-10-10000, (f) k4-10-10000]

Figure 3 shows the comparison of the two binary-SLCA algorithms (SE and ILE) against our multiway-SLCA algorithms IMS and IIMS. To avoid cluttering the graphs, we have omitted the BMS and IBMS algorithms, as they were outperformed by the IMS and IIMS algorithms (by up to an order of magnitude), respectively. Compared to the BMS variants, the IMS variants not only incur fewer candidate SLCA computations but are also more efficient in each SLCA computation. Each graph in Figure 3 shows the performance comparison for a different query class. For each query class, the ten random queries shown (Q1 to Q10) are ordered in non-descending order of the number of candidate SLCA computations incurred by the IMS algorithm.

Figures 3(a) and 3(b) show the results for the case where the low and high frequencies are the same. Comparing IMS and IIMS, we observe that IIMS generally performs better than IMS only when the number of candidate SLCA computations is small. For the binary-SLCA algorithms, our results are consistent with [14], with SE outperforming ILE. Overall, IMS generally performs best for both k2-1000-1000 and k4-1000-1000, with IIMS performing better than IMS in some cases. Figures 3(c) and 3(d) show the results for the case where the low and high frequencies differ by a factor of 10. In this case, the non-indexed methods generally perform better than the indexed methods. For k2-100-1000, IMS has an edge over SE. For k4-100-1000, IIMS performs best when the number of its candidate SLCA computations is small; otherwise, SE generally has the best performance. Figures 3(e) and 3(f) show the results for the case where the low and high frequencies differ by a factor of 100. In this case, the indexed methods perform better than the non-indexed methods. IIMS and ILE are comparable when the number of candidate SLCA computations by IIMS is small; otherwise, ILE gives better performance.

5.3 AND-OR Keyword Search Queries
[Figure 4: Comparison for AND-OR queries — average evaluation time (ms) of SA-SE, SA-IMS, AOMS, SA-ILE, SA-IIMS, and IAOMS for query classes c2-k2, c2-k4, c4-k2; panels: (a) cM-kN-100-100, (b) cM-kN-1000-1000, (c) cM-kN-10-100, (d) cM-kN-10-1000]

Figure 4 compares the performance of algorithms SA-SE, SA-IMS, AOMS, SA-ILE, SA-IIMS, and IAOMS for AND-OR queries; algorithms SA-BMS and SA-IBMS have been omitted as they are outperformed by the IMS variants. The evaluation times shown in Figure 4 are the average evaluation times of ten queries. Figures 4(a) and 4(b) show the results for the case where the low and high frequencies are the same, while Figures 4(c) and 4(d) show the results for the cases where the low and high frequencies differ by a factor of 10 and 100, respectively. In all these cases, the non-indexed methods outperform their indexed counterparts, with AOMS giving the best performance.

6. RELATED WORK
Efficient algorithms for computing LCAs on trees have been carefully studied in a number of early works [2, 9]. However, the algorithms designed there are meant for main-memory-resident data. Schmidt et al. [13] propose the meet operator for querying XML documents by computing the LCAs of nodes in XML trees. XRANK [8] proposes a stack-based keyword search algorithm whose results are ranked by a PageRank hyperlink metric extended to XML; its ranking techniques are orthogonal to the retrieval and could easily be incorporated into our work. Another work, XSEarch [5], an extension of information-retrieval techniques, is mainly focused on the semantics and ranking of query results. The research in [14, 12] is the most closely related to our current work; both adopt the idea of the smallest LCA (SLCA) or the meaningful LCA (MLCA), which are similar notions. Li et al. [12] propose a novel schema-free way to incorporate keyword search in XQuery; they also develop an efficient stack-based MLCA search algorithm. XKSearch [14] focuses on finding the smallest LCAs of keywords in XML documents, and it proposes several algorithms, which we compared against in this paper. More recently, there has also been a lot of interest in keyword search in relational database systems [1, 4, 7, 10, 11], where the emphasis is mainly on optimizing join queries to generate tree tuples.
7. CONCLUSIONS
In this paper, we examined the problem of processing SLCA-based keyword search queries for XML data. We have presented a novel approach, called the multiway-SLCA approach, that is more efficient than the state-of-the-art binary-SLCA approach for evaluating SLCA-based keyword queries. In addition, we have extended our approach to process more general keyword search queries that go beyond the traditional AND semantics to support any combination of AND and OR boolean operators. Our experimental performance evaluation using real XML datasets demonstrates the efficiency of our new algorithms compared to previous algorithms. As part of our future work, we intend to extend our approach to handle complex keyword search queries with any combination of AND, OR, and NOT operators.

8. REFERENCES
[1] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search over relational databases. In ICDE, 2002.
[2] M. A. Bender and M. Farach-Colton. The LCA problem revisited. In Latin American Theoretical Informatics, 2000.
[3] Berkeley DB. http://www.sleepycat.com.
[4] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, 2002.
[5] S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv. XSEarch: A semantic search engine for XML. In VLDB, 2003.
[6] DBLP. http://www.informatik.uni-trier.de/~ley/db/.
[7] R. Goldman, N. Shivakumar, S. Venkatasubramanian, and H. Garcia-Molina. Proximity search in databases. In VLDB, 1998.
[8] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRANK: Ranked keyword search over XML documents. In SIGMOD, 2003.
[9] D. Harel and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 1984.
[10] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword search in relational databases. In VLDB, 2002.
[11] V. Hristidis, Y. Papakonstantinou, and A. Balmin. Keyword proximity search on XML graphs. In ICDE, 2003.
[12] Y. Li, C. Yu, and H. V. Jagadish. Schema-free XQuery. In VLDB, 2004.
[13] A. Schmidt, M. Kersten, and M. Windhouwer. Querying XML documents made easy: Nearest concept queries. In ICDE, 2001.
[14] Y. Xu and Y. Papakonstantinou. Efficient keyword search for smallest LCAs in XML databases. In SIGMOD, 2005.