WWW 2007 / Poster Paper Topic: XML Exploit Sequencing Views in Semantic Cache to Accelerate XPath Query Evaluation Jianhua Feng, Na Ta, Yong Zhang, Guoliang Li Department of Computer Science and Technology Tsinghua University, Beijing 100084, China {fengjh, liguoliang}@tsinghua.edu.cn, dan04@mails.thu.edu.cn, zhangy@tsinghua.org.cn ABSTRACT In XML databases, materializing queries and their results into views in a semantic cache can improve the performance of query evaluation by reducing computational complexity and I/O cost. Although there are a number of proposals of semantic cache for XML queries, the issues of fast cache lookup and compensation query construction could be further studied. In this paper, based on sequential XPath queries, we propose fastCLU, a fast Cache LookUp algorithm and effiCQ, an efficient Compensation Query constructing algorithm to solve these two problems. Experimental results show that our algorithms outperform previous algorithms and can achieve good performance of query evaluation. 2. Problem Definition Generally an XPath query can be modeled as a tree pattern composed of a node set, an edge set of child and descendant edges, a root node and an output node. To simplify the cache lookup process, we convert an XPath query into an equivalent sequential representation which has a Basic Path and a Predicate Condition Set. The Basic Path of an XPath query Q is the path containing all nodes from Q's root node to Q's output node. Nodes in the Basic Path the path nodes and other nodes are referred to as predicate nodes. The number of nodes in a Basic Path BP is the depth of BP, denoted as dBP. Child and descendant axes in a Basic Path are denoted explicitly by "/" and "//". For each path node nBP of an XPath query Q, suppose there are nc leaf nodes {ln1, ln2, ..., lnc}, which are leaves of sub-trees of nBP whose root nodes are not path nodes, we call them predicate leaf nodes. For all the predicate leaf nodes of nBP, we construct a including nc path expressions rooted at nBP and ended at one of the nc predicate leaf nodes, we call this set the Predicate Condition Set of nBP and denote it as PCSN(nBP)={pci | 1inc, pci is a path from nBP to the i-th predicate leaf node of nBP }. The set of all of Q's path nodes' Predicate Condition Sets is the Predicate Condition Sets of Q and is denoted as PCSQ(Q). The homomorphism from one query pattern to another ensures the containment relationship the other way round. In other words, for two query patterns P1 and P2, if there is a homomorphism from P1 to P2, P2 is contained in P1[3]. Thus a materialized view V can answer a query Q if Q is contained in V. Sequential representation of XPath queries can help reduce the time cost of homomorphism mapping checking from queries to views. Figure 1 gives examples of tree patterns and homomorphism. The Basic Paths of P1, P2 and P3 are a/d//e, a/d/e/f and a/d/k/e respectively. The depth of a/d//e is 3. There is a homomorphism from P1 to P2 in Figure 1(a). a * c p1 d e (a) * c p2 a d e f * c p1 a d e (b) * c p3 a d k e Categories and Subject Descriptors H.2.4 [Systems] Subjects: Query processing General Terms: Algorithms, Performance, Languages Keywords: XML, XPath, Query Evaluation, Semantic Cache 1. INTRODUCTION The popularity of XML inspires the need to quickly retrieve XML data. In XML databases, a semantic cache of materialized views, which are queries combined with their result nodes, can help accelerate the process of evaluating XML queries in that when there is a cache hit, there is no need to evaluate the query against the whole database and retrieve the result from lower storage, the cached data can accomplish the task simply. We study a group of XPath queries in XPath fragment XP{/, //, [], *}, which contains four features: child axes (/),descendant axes (//), wildcards (*) and predicates ([]). There are two steps in exploiting the semantic cache of an XML database to answer queries: cache lookup and compensation query construction for evaluation. We propose algorithm fastCLU to accomplish the first step based on Basic Path and Predicate Condition Sets of sequential XPath queries. A view V can answer Q if there exists another query CQ which gives the result of Q when queried against the result of V. CQ is the compensation query and usually has less executing cost than Q. V is the matching view of Q. The other algorithm effiCQ constructs the compensation query efficiently for the second step. For example, suppose there are three views: V1= a[[b[k<100]][j]]/f/g[c[d][.//e]], V2=a[b/c]/u//v, V3=a[b[k<50]]/*/x, and a query: Q1=a[[b[k<100]][j]]/f/g[c[d][e]][h]. Q1 can be answered by view V1 by restricting the e node in V1 to be the child of the c node and the output g node to have an h child. Thus compensation query CQ1=g/[c/e][h]. Figure 1. Homomorphism and containment of queries Definition 1. Basic Path Containment: for two XPath queries Q1 and Q2, let their corresponding Basic Paths be BP1=n1n2...nl1 and BP2=n1'n2'...nl2' respectively, BP2 is contained in BP1 if (1) l1l2 and (2) for any pair of symbols si, si' (1il1) at the i-th position of BP1 and BP2 respectively, one of the following conditions is satisfied: (a) si'.tagName=si.tagName, (b) si="*", (c) si'=si="/", (d) si'="/" or "//" while si="//". Copyright is held by the author/owner(s). WWW 2007, May 8­12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005. 1337 WWW 2007 / Poster Paper Definition 2. PCSN Containment: for two path nodes n1 and n2, PCSN(n1)={pi | 1inp1, np1 is the number of predicate leaf nodes of n1}, PCSN(n2)={pj | 1jnp2, np2 is the number of predicate leaf nodes of n2}, PCSN(n2) is contained in PCSN(n1) if (1) np1np2; (2) for each path expression p=s1s2...sl1 in PCSN(n1), there is p'=s1's2'...sl2' in PCSN(n2), such that l1l2 and p is segmented by "//"into k parts which do not contain "//" and have exactly the same occurrences in p', and the "//" symbols in p are mapped to "/", "//" or path fragments in p' between k segments. Definition 3. PCSQ Containment: for two queries Q1 and Q2 PCSQ(Q1)={PCSN(ni)|1idBP1,niBP1}, PCSQ(Q2)={PCSN(nj') | 1jdbp2, nj'BP2}, PCSQ(Q2) is contained in PCSQ(Q1) if (1) BP2 is contained in BP1; (2) PSCN(n)=PCSN(n') for all of P1's path nodes n except P1's output node no; and (3) let no maps to no' in Q2, PCSN(no') is contained in PCSN(no). Since PCSQ containment actually requests Basic Path containment, therefore, the criteria of query/view answerability can be put as follows: if PSCQ(Q) is contained in PCSQ(V) for a query Q and a view V, V can answer Q. This makes the foundation of our algorithms. Topic: XML 4. EXPERIMENTAL EVALUATION We compare our algorithms with the view selection method in [1], which is based on string matching and referred to as algSM, and the naive semantic cache, which requires exact equivalence of view and query. We used a 300 MB XML document generated by the XMark [2] generator. Testing programs run in Windows 2000 system with 768MB memory. Cache Lookup Performance. Figure 2 shows how the hit rate varies with the zipf exponent z used for creating attribute predicates. Hit rate of fastCLU is 1.29 and 7.48 times of that of algSM and the Naive Cache. This is because fastCLU can handle such cases that a descendant axis in Basic Path of a view is mapped to a child axis in Basic Path of a query, which algSM will treat as a cache miss. Query Processing Performance. Figure 3 shows the average time to answer a query by the three algorithms to illustrate the speedup gained by fastCLU and effiCQ. We cached 2,000 queries and submitted 20,000 test queries and set z=1.75. Our strategy of caching path nodes and effiCQ help to enlarge the answering capacity of our semantic cache; consequently, a higher hit rate and a shorter average processing time of one query is achieved. A v e r a g e P r o c e ssin g T im e ( m s) 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Hit Rate 3. Algorithms: fastCLU and effiCQ FastCLU runs like this: First find a set of candidate views whose Basic Paths contain the Basic Path of the input query Q, and rank the candidate views by depth of the Basic Paths, views with greater Basic Path depth precede views with smaller Basic Path depth. Then check Predicate Condition Sets containment between Q and the current view in candidate set. If a matching view is found, this view is passed to algorithm effiCQ to construct compensation query. If none of candidate views contains Q, there is a cache miss and Q has to be evaluated against data in lower storage. Note that although [1] also uses string matching in cache lookup, it considers a view in the cache as a whole, and its matching process involves a time-consuming predicate condition set generation and containment test. Meanwhile, our algorithm does not require such a generate-and-test course and does not need the superset of Q's predicate conditions, which makes it more time efficient. Due to space limit, details of fastCLU is omitted. EffiCQ is outlined as follows to present it clearly. Algorithm effiCQ: compensation query construction Input: Q, an XPath query; V, a matching view of Q Output: CQ, the compensation query of Q Let BPQ=n1/(or//)n2/(or//).../(or//)nd, BPV=n1/(or//)n2/(or//).../(or//)ndV 1: BPCQ=nk/(or//)nk+1/(or//)... ndV/(or//).../(or//)ndQ; /* nk is the node before the first different axis symbol of BPQ and BPV if there is any, otherwise it is the output node of V */ 2: for each path expression PEj in PCSN(ndV) of Q do { 3: if (PEj is contained in but not equal to some path expression PEj' of PCSN(ndV) of V) OR (PEj is not contained in any path expression PEj' of PCSN(ndV) of V) 4: put PEj into PCSN(ndV) of CQ; } 5: if (ndV is not the output of Q) 6: attach the predicate conditions of ni+1, ni+2, ..., ndQ to ni+1, ni+2, ..., ndQ to CQ; 7: return CQ; 8 00 6 00 4 00 2 00 0 fast CLU algSM Naive Cache Naive Cache algSM fastCLU 1 2 3 4 5 Workload size (Number of queries * 10000) Figure 2. Hit rate vs. workload size Figure 3. Average processing time 5. CONCLUSION In this paper, we propose algorithm fastCLU, which uses equivalent sequential representation of XPath queries to accelerate cache lookup, and agorithm effiCQ, which constructs compensation queries efficiently with lower computational cost to evaluate XPath queries. Experimental results demonstrate that our algorithms can achieve high performance for query evaluation. 6. ACKNOWLEDGEMENT This work is in part supported by the National Natural Science Foundation of China under Grant No.60573094, the National Grand Fundamental Research 973 Program of China under Grant No.2006CB303103, the National High Technology Development 863 Program of China under Grant No.2006AA01A101, and Tsinghua Basic Research Foundation under Grant No. JCqn2005022. 7. REFERENCES [1] Bhushan Mandhani, Dan Suciu. Query Caching and View Selection for XML Databases. VLDB, 2005. [2] A.R. Schmidt, F. Waas, M.L. Kersten, D. Florescu, I. Manolescu, M.J. Carey and R. Busse. The XML Benchmark Project. Technical Report INS-R0103, CWI, 2001. [3] Wanhong Xu, Z. Meral Ozsoyoglu. Rewriting XPath Queries Using Materialized Views. VLDB, 2005. As presented, EffiCQ constructs the compensation query CQ to answer a query Q by its matching view V found by fastCLU. CQ is queried against V to return result of Q. 1338