Querying and Maintaining a Compact XML Storage

Raymond K. Wong    Franky Lam    William M. Shui
{raymond.wong, franky.lam, bill.shui}@nicta.com.au
National ICT Australia, University of New South Wales & Green Pea Software
Sydney, Australia

ABSTRACT
As XML database sizes grow, the amount of space used for storing the data and auxiliary data structures becomes a major factor in query and update performance. This paper presents a new storage scheme for XML data that supports all navigational operations in near constant time. In addition to supporting efficient queries, the space requirement of the proposed scheme is within a constant factor of the information theoretic minimum, while insertions and deletions can be performed in near constant time as well. As a result, the proposed structure features a small memory footprint that increases cache locality, whilst still supporting standard APIs, such as DOM, and necessary database operations, such as queries and updates, efficiently. Analysis and experiments show that the proposed structure is space and time efficient.

Categories and Subject Descriptors
H.3.2 [Information Systems]: Information Storage; H.2.4.n [Textual Databases]: XML Databases

General Terms
Algorithms, Design, Performance

Keywords
XML, Compact Storage, Storage Optimization, Query Processing

1. INTRODUCTION
The popularity of XML as a data representation language has produced a wealth of research on efficiently storing and querying tree structured data. As the amount of XML data available increases, it is becoming vital to be able to not only query and maintain this information quickly, but also store it in a compact manner. Our work is also motivated by the mobile software development at National ICT Australia and Green Pea Software, in which managing large amounts of XML data on mobile devices is mandatory. We thus turn to the problem of finding a compact storage scheme for XML, i.e., a space-efficient representation of the data structure that also maintains low access and update costs for all of the desired primitive operations for data processing. The flexibility of XML makes finding a scheme that satisfies all these requirements at the same time extremely challenging.

When looking for a compact storage scheme for XML, several issues need to be addressed. For example, the scheme has to support fast operations, especially as we target software applications for people on the move. Moreover, if intensive compression methods are employed, they need to be optional and able to be switched on or off, given the low computation power of some mobile devices. In summary, from our experience, the major issues include:

· It must support fast navigational operations: Many XML applications, such as collaborative document editing systems, depend upon efficient tree traversal, using a standard interface such as DOM. Halverson et al [10] demonstrated that a combination of navigational and structural join operators is most effective for evaluating queries. Hence, it is imperative that the storage scheme supports fast traversal of the XML tree, in all possible directions, preferably in constant or near constant time. Previous work, such as that of Zhang et al [23], has addressed the issue of succinctly representing XML, but at the cost of linear time navigational operations, which is not acceptable for many practical applications. Our proposed structure supports tree navigation primitives in $O(\lg n / \lg\lg n)$ time, and also includes support for efficient structural joins.

· It must support efficient insertions and deletions: Several papers address the space issue by storing XML in compressed form [4, 16, 19, 22]. They also support path expression queries or fast navigational access, but do not allow efficient update operations such as node insertion. This can be a critical concern in many database applications. In this paper, we provide a scheme which allows near constant time update operations in practice, with a theoretical worst case time of $O(\lg^2 n)$.

· It must support efficient join operations: Current query optimization techniques for XML, such as the work of Halverson et al [10], make heavy use of the structural join [2], which relies on a constant time operator to determine the ancestor-descendant relationship between two nodes. Thus, any general XML storage scheme should also support such an operator in near constant time. Our scheme supports ancestor-descendant queries in $O(\lg n / \lg\lg n)$ time.

· It must be practical: Many succinct tree representation schemes are elegant theoretical structures that unfortunately do not translate well into practice. Thus, while theoretical guarantees are important for any proposed structure, practical considerations should not be forgotten. In this paper, we focus on developing a practical storage scheme, using values that fit the natural machine word size, block size and byte alignment, to allow our scheme to be used in real-world database systems.

· It should separate the topology, schema and text of the document: All XML query languages select and filter results based on some combination of the topology, schema and text data of the document. To allow efficient scans over these parts of the document, it is natural to find a representation that partitions them into separate physical locations.

· It should permit extra indexes: Many applications may require additional specialized indexes to be built upon their data. Therefore, a general purpose database system is required to provide a storage representation that is flexible enough to accommodate such needs. More specifically, the storage scheme used by the database system must provide a simple, efficient and stable way of referencing its stored data items.

In this paper, we propose a compact XML storage engine, called ISX (for Integrated Succinct XML system), to store XML in a more concise structure and address all of the above issues. Theoretically, ISX uses an amount of space near the information theoretic minimum on random trees. For a constant $\epsilon$, where $1 \le \epsilon \le 2$, and a document with $n$ nodes, we need $2\epsilon n + O(n)$ bits to represent the topology of the XML document. Node insertions can be handled in constant time on average but worst case $O(\lg^2 n)$ time, and all node navigation operations take worst case $O(\lg n / \lg\lg n)$ time but constant time on average.

The rest of this paper is organized as follows: Section 2 summarizes relevant work in the field. Section 3 presents the basics of ISX and its topology layer.
The fast node navigation operators, the querying interfaces and the update mechanism are then described in detail in Section 4. Finally, Section 5 presents the experimental results and Section 6 concludes the paper.

2. RELATED WORK
To the best of our knowledge, Liefke and Suciu [16] proposed the first compressed XML scheme, called XMill. Although XMill achieves a good compression ratio, its major drawback, the lack of support for query and update, hinders its broad application in database systems. Various approaches proposed after XMill share similar benefits and drawbacks, e.g., XMLPPM [7]. Related work that shares the same motivations as this paper includes Maneth et al [17], Tolani and Haritsa [22], Min et al [19] and Buneman et al [4]. Compared to XMill, XGrind [22] has a lower compression ratio but supports certain types of queries.

XPRESS [19] uses reverse arithmetic encoding to encode tags using start/end regions. Both XGrind and XPRESS require top-down query evaluation, and do not support set-based query evaluation such as structural joins. Buneman et al [4] separate the tree structure from its data, and then use bisimulation to compress documents that share the same subtrees; however, they can only support node navigation in linear time. With a similar idea but a different technique, Maneth et al [5, 17] also compress XML by computing the minimal sharing graph equivalent to the minimal regular tree grammar. To provide tree navigation, a DOM proxy that maintains runtime traversal information is needed [5]. Since only the compression efficiency was reported in the paper, both the query and navigation performance of their proposed scheme are unclear.

Most XML storage schemes, such as [9, 10, 12, 15], make use of interval and preorder/postorder labeling schemes to support constant time order lookup, but fail to address the issue of maintaining these labels during updates. Recently, Silberstein et al [21] proposed a data structure to handle ordered XML which guarantees both update and lookup costs. Similarly, the L-Tree labeling scheme proposed by Chen et al [6] addresses the same problem and has the same time and space complexity as [21]; however, it does not support persistent identifiers. The major difference between our proposal and these two works is that we try to minimize space usage while still allowing efficient access, query and update of the database. In this paper, we show that our proposed topology representation costs linear space, while [21] costs $O(n \log n)$ space.

The work most closely related to this paper regarding databases with efficient storage is that of Zhang et al [23]. Their succinct approach targeted secondary storage, and used a balanced parentheses encoding for each block of data. Unfortunately, their summary and partition schemes support rank and select operations in linear time only. Their approach also uses the Dewey encoding for node identifiers in its indexes. The drawbacks of the Dewey encoding are significant: updates to the labels require linear time, and the size of the labels is also linear in the size of the database in the worst case. Thus, the storage of the topology can require quadratic space in the worst case.

Finally, there are several related proposals published recently, e.g., [8, 9]. Grust et al [9] show that all XPath axes can be handled using a preorder/postorder labeling. Instead of maintaining these two labels (i.e., two integers), our proposed scheme requires fewer than 3 bits per node to process all XPath axes, which is an attractive alternative for applications that are both space and performance conscious. Ferragina et al [8] first shred the XML tree into a table of two columns, then sort and compress the columns individually. Their scheme does not offer an immediate capability of navigating or searching XML data unless an extra index is built; however, the extra index degrades the overall storage size (i.e., the compression ratio). Furthermore, the times for disk access and decompression of local regional blocks were omitted from their experiments, so the performance of actual applications may differ from what the experiments show. As with most other related work, data updates are disregarded.

3. ISX STORAGE AND TOPOLOGY LAYER

Figure 1: A DBLP XML document fragment (dblp → inproceedings with @mdate "2003-06-23", author "Jeffrey D. Ullman", title "Improving the Efficiency of Database-System Teaching.", year "2003", booktitle "SIGMOD Conference").

This section describes the storage layer of the ISX system. It consists of three layers, namely the topology layer, the internal node layer, and the leaf node layer, as shown in Figure 2. The topology layer stores the tree structure of the XML document and facilitates fast navigational access, structural joins and updates. The internal node layer stores the XML elements, attributes, and signatures of the text data for fast text queries. Finally, the leaf node layer stores the actual text data. Text data can be compressed by various common compression techniques and is referenced by the topology layer.

Figure 2: Overview of the data structure (topology layer: tiers 0-2; internal node layer: symbol table, topology labels and text data signatures; leaf node layer: offset table and character data).

3.1 ISX Topology Layer
Jacobson [11] showed that the lower bound on the space required to represent a binary tree is $\lg C_n = \lg(4^n \cdot \Theta(n^{-3/2})) = 2n + o(n)$ bits, where the Catalan number $C_n$ is the number of possible binary trees over $n$ nodes. Our storage scheme is based on the balanced parentheses encoding from [14] to represent the topology of XML. Different from [14], our topology layer (Figure 3) supports efficient node navigation and updates. The balanced parentheses encoding used in tier 0 reflects the nesting of element nodes within any XML document and can be obtained by a preorder traversal of the tree: we output an open parenthesis when we encounter an opening tag and a close parenthesis when we encounter a closing tag. In Figure 3, the topology of the DBLP XML fragment shown in Figure 1 is represented in tier 0 using the balanced parentheses encoding. In our implementation, we use a single 0 bit to represent an open parenthesis and a single 1 bit to represent a close parenthesis.

Figure 3: Balanced parentheses encoding of Figure 1 — (((())(())(())(())(()))).

Definition: The excess is the difference between the number of open and close parentheses occurring in a given section of the topology. For instance, in Figure 3, the excess between the open parenthesis of dblp and the close parenthesis of @mdate is 3. The excess between the close parenthesis of the text node "2003" and booktitle is -1. The depth of a node x in the XML document tree can be calculated by finding the excess between the open parenthesis of x and the beginning of the document.
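To make the tier 0 conventions concrete, the following is a minimal C++ sketch of the balanced parentheses bit array and its excess computations. It is illustrative only: the names (Tier0, excess, findClose) are ours, the parentheses are kept one per byte for clarity rather than packed into bits as in ISX, and the linear scan in findClose is exactly the behaviour that the auxiliary tiers of Section 4 are introduced to avoid.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch of the tier 0 encoding: 0 = open parenthesis,
// 1 = close parenthesis, produced by a preorder traversal of the tree.
struct Tier0 {
    std::vector<uint8_t> bits; // one parenthesis per entry (not packed)

    // Excess of the half-open range [from, to): open parentheses count
    // +1 and close parentheses -1, matching the definition above.
    int excess(std::size_t from, std::size_t to) const {
        int e = 0;
        for (std::size_t i = from; i < to; ++i)
            e += (bits[i] == 0) ? +1 : -1;
        return e;
    }

    // Depth of the node whose open parenthesis sits at position x:
    // the excess from the beginning of the document up to and including x.
    int depth(std::size_t x) const { return excess(0, x + 1); }

    // Matching close parenthesis of the open parenthesis at x, found by
    // scanning forward until the running excess returns to zero.
    std::size_t findClose(std::size_t x) const {
        int e = 0;
        for (std::size_t i = x; i < bits.size(); ++i) {
            e += (bits[i] == 0) ? +1 : -1;
            if (e == 0) return i;
        }
        return bits.size(); // NOT-FOUND sentinel
    }
};
```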
3.2 Representation of Elements, Attributes and Texts
We avoid any pointer based approach to link a parenthesis to its label, as it would increase the space usage from $2n$ to a less desirable $\Theta(n \lg n)$ bits. As our representation of the topology also does not include an $O(\lg n)$ bit persistent object identifier for each node in the document, we must find a way to link the open parenthesis of x in tier 0 to the actual label itself. To address this, we adapt the approach of Munro et al [20], although they do not use a balanced parentheses encoding; instead, they control the topology size by using multiple layers of variable-sized pointers, which may require many levels of indirection. In addition, we make the element structure an exact mirror of the topology structure instead of mirroring the pointers. This allows us to find the appropriate label for a node by simply finding the entry at the corresponding position in the element structure. As mentioned earlier, a pointer based approach would require $\Theta(n \lg n)$ space, which is undesirable.

The next issue is the variable length of XML element labels. We adopt the approach taken in previous work [22, 23] and maintain a symbol table, using a hash table to map the labels into a domain of fixed size. In the worst case, this does not reduce the space usage, as every node can have its own unique label. In practice, however, XML documents tend to have a very small number of unique labels. Therefore, we can assume that the number of unique labels used in the internal nodes ($E$) is very small, and essentially constant. This approach allows us to have fixed size records in the internal node layer.

Note that each element in the XML document actually has two available entries in the array, corresponding to its opening and closing tags. We could thus make the size of each entry $\frac{1}{2}\lg E$ bits and split the identifier for each element over its two entries. However, the two entries are in general not adjacent to each other, and hence splitting the identifier could slow down lookups, as we would need to find the closing tag corresponding to the opening tag, and would also decrease cache locality. Hence, we prefer to use entries of $\lg E$ bits and leave the second entry set to zero; this also provides us with some slack in the event that new element labels are used in updates.

Since text nodes are also leaf nodes, they are represented as pairs of adjacent unused entries in the internal node layer. We thus choose to make use of this "wasted" space by storing a hash value of the text node, of size $2 \lg E$ bits. This can be used in queries which test equality of text nodes, such as //*[year="2003"], by scanning the hash values before scanning the actual data, significantly reducing the lookup time. Since texts are treated independently from the topology and node layers, they can be optionally compressed by any compression scheme. Instead of employing more sophisticated compression techniques such as BWT [8] that are relatively slow on mobile devices, a standard LZW compression method (e.g., gzip) is used in this paper.
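As an illustration of the text signature idea, the sketch below shows one way such a filter could look. It is an assumption-laden sketch, not ISX's actual layout: the signature width kSignatureBits stands in for the paper's $2\lg E$ bits, and std::hash stands in for whatever hash function the implementation uses.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical text-signature filter for the internal node layer: a text
// node's two otherwise-unused label slots hold a small hash of its value,
// so predicates like //*[year="2003"] can reject most leaves without
// touching the leaf node layer at all.
constexpr unsigned kSignatureBits = 16; // stands in for 2 * lg(E)

uint16_t textSignature(const std::string& value) {
    std::size_t h = std::hash<std::string>{}(value);
    return static_cast<uint16_t>(h & ((1u << kSignatureBits) - 1u));
}

// A signature match is only a candidate: the stored text must still be
// fetched and compared, since distinct strings may share a hash value.
bool mayEqual(uint16_t storedSignature, const std::string& queryValue) {
    return storedSignature == textSignature(queryValue);
}
```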
4. QUERYING AND UPDATE MAINTENANCE
In addition to efficiently storing large volumes of data, an XML database system should also have the following features: 1) direct node navigation operators; 2) an XPath query processing interface; and 3) an efficient node insertion/deletion mechanism. For the rest of this section, we present algorithms and other auxiliary data structures satisfying the above features, utilising the ISX topology layer. Furthermore, we provide a detailed cost analysis of our proposed approach for the database operators.

4.1 Node Navigation with Topology Layer Primitives
Given an arbitrary node x of a large XML document, a navigation operator should be able to traverse back and forth across the entire document via the various step axes of node x. Frequently used step axes for an XML document tree are parent, first-child, next-sibling, previous-sibling, next-following and next-preceding. These step axes can then be used to provide programming interfaces, such as the DOM API, for external access to the XML database. Node navigation operators are described by the pseudo-code in Algorithm 1, which shows a tight coupling between the ISX topology layer primitives and the navigation operators. Each navigation operator in Algorithm 1 is mapped to a sequence of calls to the topology layer primitives described in Algorithm 2.

Algorithm 1 Node Navigation Operators
PARENT(node)
1  return BACKWARDEXCESS(node, 0, 2)

FIRSTCHILD(node)
1  if (tier0[NEXT(node)] is an open parenthesis) then
2    return NEXT(node)
3  else
4    return NOT-FOUND

NEXTSIBLING(node)
1  if (tier0[NEXT(FINDCLOSE(node))] is an open parenthesis) then
2    return NEXT(FINDCLOSE(node))
3  else
4    return NOT-FOUND

PREVIOUSSIBLING(node)
1  if (PREV(node) is a close parenthesis) then
2    return FINDOPEN(PREV(node))
3  else
4    return NOT-FOUND

NEXTPRECEDING(node)
1  prec ← PREV(node)
2  while (prec is an open parenthesis) do prec ← PREV(prec)
3  prec ← PREV(prec)
4  while (prec is a close parenthesis) do prec ← PREV(prec)
5  return prec

NEXTFOLLOWING(node)
1  follow ← NEXT(FINDCLOSE(node))
2  while (follow is a close parenthesis) do follow ← NEXT(follow)
3  return follow

Algorithm 2 Primitive Operators for Topology Layer Access
FORWARDEXCESS(start, end, k)
1  for each current from start to end do
2    if (tier0[current] is an open parenthesis) then
3      k ← k − 1
4    else
5      k ← k + 1
6    if (k = 0) then
7      return current
8  return NOT-FOUND

BACKWARDEXCESS(start, end, k)
1  for each current from start to end step −1 do
2    if (tier0[current] is an open parenthesis) then
3      k ← k − 1
4    else
5      k ← k + 1
6    if (k = 0) then
7      return current
8  return NOT-FOUND

PREV(node)
1  if (node > 0) then return node − 1 else return NOT-FOUND

NEXT(node)
1  if (node < |tier0|) then return node + 1 else return NOT-FOUND

FINDCLOSE(node)
1  return FORWARDEXCESS(node, |tier0|, 0)

FINDOPEN(node)
1  return BACKWARDEXCESS(node, 0, 0)

Node navigation operators are highly dependent on the topology layer primitives FORWARDEXCESS and BACKWARDEXCESS. In the worst case, node navigation operators could therefore take linear time.
4.2 Auxiliary Tiers
However, we can significantly improve the performance of the topology layer primitives by adding auxiliary data structures (tier 1 and tier 2 blocks) on top of the tier 0 layer described in Section 3.1. Figure 4 presents the auxiliary tiers 1 ($T^1$) and 2 ($T^2$), where each tier contains contiguous arrays of tuples, with each tuple holding summary information about one block in the lower tier. The tier 0 array in the figure corresponds to the balanced parentheses encoding of the topology of the XML document, which was described in Section 3.

Figure 4: Example of the tiers of the topology part — five tier 0 blocks of parentheses summarized by the tier 1 tuples $T^1_0 = (4,1,0,4,-1,3,1)$, $T^1_1 = (2,2,-1,1,-1,1,1)$, $T^1_2 = (3,3,-1,1,-1,1,1)$, $T^1_3 = (3,2,-1,1,0,2,1)$, $T^1_4 = (0,4,-4,0,-4,0,0)$, which are in turn summarized by the tier 2 tuples $T^2_0 = (9,6,0,4,-1,3,3)$ and $T^2_1 = (3,6,-3,1,-4,0,1)$.

For tiers 1 and 2, each tier 1 block stores an array of tier 0 tuples $T^0_1, T^0_2, \ldots, T^0_n$, where $n$ is the maximum number of tuples allowed per tier 1 block. Each $T^0_i$ for $0 < i \le n$ is defined as $(L^0, R^0, m^0, M^0, b^0, B^0, D^0)$, and the density of each tier 0 block can be calculated by the formula $density = (L^0 + R^0)/|B|$. For each tier 0 tuple: $L^0$ is the total number of left parentheses of the block; $R^0$ is the total number of right parentheses of the block; $m^0$ is the minimum excess within the block when traversing the parentheses array forward from the beginning of the block; $M^0$ is the maximum excess within the block when traversing forward from the beginning of the block; $b^0$ is the minimum excess within the block when traversing backward from the last parenthesis of the block; $B^0$ is the maximum excess within the block when traversing backward from the last parenthesis of the block; and $D^0$ is the total number of character data nodes.

In tier 2, each block stores an array of tier 1 tuples $T^1_1, T^1_2, \ldots, T^1_n$, where $n$ is the maximum number of tuples allowed per tier 2 block. Each tuple $T^1_i$ for $0 < i \le n$ is then defined as $(L^1, R^1, m^1, M^1, b^1, B^1, D^1)$, where: $L^1$ is the sum of $L^0$ over all of its tier 1 tuples ($\sum_{i=0}^{|B|/|T^0|} L^0_i$); $R^1$ is the sum of $R^0$ over all of its tier 1 tuples ($\sum_{i=0}^{|B|/|T^0|} R^0_i$); $m^1$ is the local forward minimum excess across all of its tier 1 tuples; $M^1$ is the local forward maximum excess across all of its tier 1 tuples; $b^1$ is the local backward minimum excess across all of its tier 1 tuples; $B^1$ is the local backward maximum excess across all of its tier 1 tuples; and $D^1$ is the total number of character data nodes over all of its tier 1 tuples ($\sum_{i=0}^{|B|/|T^0|} D^0_i$).

Although tier 1 and tier 2 tuples look similar, the values of $m^1$, $M^1$, $b^1$ and $B^1$ in tier 2 are calculated differently from those in tier 1. For tier 2, the function TIER2LOCALEXCESS in Algorithm 3 is used to calculate the local minimum/maximum excess, and it is not as trivial as the calculation for tier 1 blocks. Let $X = (L, R, m, M, b, B, D)$ be a tier 2 tuple holding the summary information for the tier 1 tuples $Y_1, \ldots, Y_n$. To calculate the local forward minimum excess $X.m$, we know the local minimum excess from the first parenthesis of $Y_1$ until the end of $Y_1$ equals $Y_1.m$, so we first assign this value to $X.m$. We know the excess at the end of $Y_1$ is $Y_1.L - Y_1.R$, so the minimum of $Y_1.m$ and $(Y_1.L - Y_1.R + Y_2.m)$ gives the forward minimum excess from the beginning parenthesis of $Y_1$ to the end parenthesis of $Y_2$. Similarly, the minimum of $(Y_1.m,\; Y_1.L - Y_1.R + Y_2.m,\; Y_1.L - Y_1.R + Y_2.L - Y_2.R + Y_3.m)$ gives the minimum excess between the beginning parenthesis of $Y_1$ and the end parenthesis of $Y_3$. Therefore, $X.m$ can be calculated by scanning its tier 1 tuples, updating the excess along the way. The maximum and minimum forward excesses can be calculated at the same time. For backward excesses, the algorithm is identical except for the direction of traversal of the tier 1 tuples.

Algorithm 3 Calculate Local Excess in a Tier 2 Block
TIER2LOCALEXCESS(t2)
1  {t1_start, t1_end} ← {t2 · |B|/|T^1|, (t2 + 1) · |B|/|T^1| − 1}
2  {tier2[t2].m, tier2[t2].M} ← {tier1[t1_start].m, tier1[t1_start].M}
3  excess ← tier1[t1_start].L − tier1[t1_start].R
4  for each t1 from t1_start + 1 to t1_end do
5    if (excess + tier1[t1].m < tier2[t2].m) then
6      tier2[t2].m ← excess + tier1[t1].m
7    if (excess + tier1[t1].M > tier2[t2].M) then
8      tier2[t2].M ← excess + tier1[t1].M
9    excess ← excess + tier1[t1].L − tier1[t1].R

Example. In Figure 4, to calculate the minimum forward excess for the tier 2 tuple $T^2_1$, we first assign $T^2_1.m = T^1_3.m = -1$. Now the excess at the end of $T^1_3$ is $T^1_3.L - T^1_3.R = 1$, and $1 + T^1_4.m = 1 + (-4) = -3$. As $-3$ is smaller than $-1$, $T^2_1.m$ is assigned $-3$.

In the ISX system, the fixed block size for each tier is 4 kilobytes. Therefore, each tier 0 block can hold up to 32768 bits, and each tier 1 block can hold summaries of $4KB/|T^0|$ tier 0 blocks. Similarly, each tier 2 block can hold summaries of $4KB/|T^1|$ tier 1 blocks, which is equivalent to roughly $(4KB/|T^0|)^2$ tier 0 blocks. For a 32-bit word machine, there are only 2 tier 2 blocks, but in theory there are $\Theta(n/\lg^2 n)$ tier 2 blocks. Therefore, the worst case for navigational accesses is $O(n/\lg^2 n)$, which is not much of an improvement on $O(n)$. Fortunately, it is relatively simple to fix this limitation: instead of having 3 tiers, we generalize the above structure in a straightforward fashion to use $O(\lg n/\lg\lg n)$ tiers. This means that the top-most tier has $\Theta(n/\lg^{\lg n/\lg\lg n} n) = \Theta(1)$ blocks, reducing the worst case navigational access time to $O(\lg n/\lg\lg n)$.

4.3 Improved Topology Layer Primitives
FORWARDEXCESS and BACKWARDEXCESS return the position of the first parenthesis matching the given excess k within a given range [start, end] (in the forward and backward direction respectively). Using the auxiliary structures (tiers 1 and 2), instead of just linearly scanning the tier 0 layer, we can use tier 1 to test whether the position of the parenthesis matching excess k lies within the i-th tier 0 block, i.e., checking whether $(m^0_i + e_i) \le k \le (M^0_i + e_i)$, where $e_i$ is the excess between start and the beginning of the i-th tier 0 block (excluding the first bit). However, as $|B| = \Theta(\lg n)$, there are potentially $n/|B|$ tier 1 tuples to scan. Hence, we use tier 2 to find the appropriate tier 1 block within which the excess lies, thus reducing the cost to a near constant in practice. Using the above approach, we can replace the primitives NEXT, FORWARDEXCESS and BACKWARDEXCESS of Algorithm 2 with the improved primitives in Algorithm 4.

Algorithm 4 Topology Primitives using Auxiliary Structures
NEXT(node)
1  if (I^0_node < L^0_node + R^0_node) then
2    return node + 1
3  else
4    if (B^0_node is the last tier 0 block) then
5      return NOT-FOUND
6    else
7      return B^0_node + |B|

FASTFORWARDEXCESS(start, end, k)
1  current ← FORWARDEXCESS(start, B^0_start + |B| − 1, k)
2  if (current ≠ NOT-FOUND) then
3    return current
4  for each T^0_i ∈ B^1_current where T^0_i > T^0_current do
5    if (current + m^0_i ≤ k ≤ current + M^0_i) then
6      return FORWARDEXCESS(T^0_i, B^0_{T^0_i} + |B| − 1, k)
7    current ← current + L^0_i − R^0_i
8  for each T^1_j ∈ B^2_current where T^1_j > T^1_current do
9    if (current + m^1_j ≤ k ≤ current + M^1_j) then
10     for each T^0_i ∈ B^1_j where T^0_i > T^0_j do
11       if (current + m^0_i ≤ k ≤ current + M^0_i) then
12         return FORWARDEXCESS(T^0_i, B^0_{T^0_i} + |B| − 1, k)
13       current ← current + L^0_i − R^0_i
14    current ← current + L^1_j − R^1_j

FASTBACKWARDEXCESS(start, end, k)
// Implemented in the same way as FASTFORWARDEXCESS, but in the backward direction.

Furthermore, since the depths of real-world XML documents are generally far smaller than $|B|$ (even the depth of the highly nested TreeBank dataset [18] is much less than 100), most matching parentheses lie within the same block, and are occasionally found in neighboring blocks. Therefore, when FASTFORWARDEXCESS is called from the navigation operations, we rarely need to access additional blocks in either the auxiliary data structure or the topology bit array. In the worst case, when the matching parentheses lie in different blocks, we only need to read two tier 1 blocks and two tier 2 blocks on a 32-bit word size machine, which is a very small amount of data.
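The core of FASTFORWARDEXCESS is the window test applied to each summary tuple: a block can only contain the target excess k if k falls between that block's minimum and maximum excess, shifted by the excess accumulated so far. Below is a hedged C++ sketch of that skipping step over tier 1 tuples; the type and function names are ours, and the excess convention is the open-parenthesis-counts-as-+1 one from Section 3.1.

```cpp
#include <cstddef>
#include <vector>

// Illustrative tier 1 summary of one tier 0 block (forward fields only):
// L/R = number of open/close parentheses; m/M = min/max running excess
// reachable inside the block, relative to the block's first parenthesis.
struct BlockSummary { int L, R, m, M; };

// Find the first tier 0 block at or after 'firstBlock' that can contain
// the position where the running excess reaches k; -1 if none can.
// This is the skipping step of FASTFORWARDEXCESS: the caller then scans
// only that one block's parentheses with the plain FORWARDEXCESS.
int findCandidateBlock(const std::vector<BlockSummary>& tier1,
                       std::size_t firstBlock, int k) {
    int excess = 0; // excess accumulated before the current block
    for (std::size_t i = firstBlock; i < tier1.size(); ++i) {
        if (excess + tier1[i].m <= k && k <= excess + tier1[i].M)
            return static_cast<int>(i); // k is reachable inside block i
        excess += tier1[i].L - tier1[i].R; // net excess of block i
    }
    return -1; // NOT-FOUND; tier 2 is scanned the same way to skip tier 1 blocks
}
```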
4.4 Update Operators
The ISX system also provides efficient update operators, such as node insertion. So far, we have treated the tier 0 balanced parentheses encoding as a contiguous array. This scheme is not suitable for frequent updates, as any insertion or deletion of data would require shifting the entire bit array.

4.4.1 Updating Tier 0
In this section, we present a modification to our storage scheme that changes the space usage from $2n$ to $2\epsilon n$ bits, where $\epsilon \ge 1$, so that we can efficiently accommodate frequent updates. In our approach, we first divide the array into blocks of $|B|$ bits each, and store the blocks contiguously. Within each block, we leave some empty space, kept at the rightmost portion of each block. Now, we only need to shift $O(|B|)$ entries per insertion or deletion. We can control the cost of shifting by adjusting the block size.

After the initial loading of an XML document, the empty space allocated to the blocks will eventually be used up as more data is inserted into the database. Therefore, we need to guarantee an even distribution of empty bits across the entire parentheses array, so that we can still maintain the $O(|B|)$ bound on the number of shifts needed for each data insertion. This can be achieved by deciding exactly when to redistribute empty space among the blocks, and which blocks are to be involved in the redistribution process.

To better understand our approach, we first visualize these blocks as the leaf nodes of a virtual balanced binary trie, with the position of a block in the array corresponding to the path to that block through the virtual binary trie. Figure 5 shows such a trie, where block 0 corresponds to the leaf node under the path 0 0 0, and similarly block 3 corresponds to the path 0 1 1. For each block, we define:

· L: the total number of left parentheses within a block.
· R: the total number of right parentheses within a block.
· DENSITY(b): the density of a block b, defined as $(L+R)/|B|$.

Figure 5: Densities of the parentheses array and the corresponding virtual balanced trie with block size |B| = 8 and height = 3 — leaf blocks b0-b4 with densities 62.5%, 50%, 75%, 62.5% and 50%, internal trie nodes v0-v3, and density threshold ranges [0.50, 0.75], [0.42, 0.83], [0.33, 0.92], [0.25, 1.00] for depths 0 to 3.

Given the above definition of density for leaf nodes, the density of a virtual node is the average density of its descendant leaf nodes. We then control the empty space within all nodes of the virtual binary trie by setting a density threshold [min, max] within which the block densities must lie. For a virtual node at height h and depth d in the virtual trie, we enforce a density threshold of $[\frac{1}{2} - \frac{d}{4h},\ \frac{3}{4} + \frac{d}{4h}]$. For example, the density threshold range for virtual node v0 in Figure 5 is $[\frac{1}{2} - \frac{2}{4 \times 3}, \frac{3}{4} + \frac{2}{4 \times 3}] = [0.33, 0.92]$, since the depth of v0 is 2 and the height of the trie is 3.

Why do we use this formula for the density threshold? There are two factors. First, to guarantee good space utilization, the maximum density of a leaf node should be 1, and the minimum density threshold of the root node should be 1/2. Secondly, the density thresholds should satisfy the following invariant: the density threshold range of an ancestor node should be tighter than the ranges of its descendant nodes. This ensures that after a space redistribution at an ancestor node v, the density thresholds of all its descendants are also immediately satisfied.

Algorithm 5 Node Insertion and Order Maintenance Operations
INSERT(x)
1  Right-shift tier0[x, L^0_x + R^0_x] to [x + 2, L^0_x + R^0_x + 2]
2  tier0[x, x + 1] ← {open parenthesis, close parenthesis}
3  Increment L^0_x, R^0_x, L^1_x and R^1_x
4  if (L^0_x + R^0_x > |B| − 2) then
5    MAINTAIN(x)

MAINTAIN(x)
1  {height, depth, δ} ← {lg n, height, 1}
2  {min, max} ← {B^0_x, B^0_x + |B|}
3  while (DENSITY(min, max) > 3/4 + depth/(4 · height)) do
4    depth ← depth − 1
5    δ ← 2δ
6    min ← MAX(0, min − δ)
7    max ← max + δ
8  Evenly distribute the bits in blocks [min, max] and update the corresponding tier 1 and tier 2 tuples.
// where DENSITY(min, max) = (Σ_{b=min}^{max} (L^0_b + R^0_b)) / ((max − min) · |B|)

In the worst case, we use 4 bits per node, since the root node can be only half full. Thus, on a 32-bit word machine, we can store at most $2^{32}/4 = 2^{30}$ nodes. However, by adjusting the minimum root node density threshold from $\frac{1}{2}$ to $\frac{1}{\epsilon}$, it is possible to store more than $2^{30}$ nodes by choosing a smaller $\epsilon$. In practice, $\epsilon$ should be 2, and therefore $2\epsilon n$ bits is in effect $4n$ bits. The factor $\epsilon$ should only be less than 2 when the document is relatively static.

Notice that although we shift parentheses within tier 0 during an update, we never need to shift the tuples in tier 1, because the same $T^0$ tuple always corresponds to the same tier 0 block, regardless of its density. Therefore, unlike tier 0, we do not need to redistribute tuples within tier 1 (and similarly for tier 2) during the update operation.

4.4.2 Updating Auxiliary Tiers
From Section 4.2, the auxiliary tiers may at first appear to increase the update cost to $O(\lg^3 n/\lg\lg n)$, since moving a node requires updating $O(\lg n/\lg\lg n)$ tiers. However, this overhead can be eliminated by updating the upper tiers once per redistribution, instead of once per node. A simple proof then demonstrates that the overall update cost is unaffected, and remains $O(\lg^2 n)$. During insertions and deletions in a tier 0 block, we simply update the appropriate tuples in the corresponding blocks of the higher tiers. Since the redistribution process described in Section 4.4.1 can be seen as a sequence of insertions and deletions, the corresponding updates to the auxiliary tiers do not affect the worst case complexity of updates.
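To make the density bookkeeping of Section 4.4.1 concrete, here is a small hedged sketch of the threshold test that MAINTAIN performs while walking up the virtual trie. The function names and the per-block summary type are ours, not ISX's actual code.

```cpp
#include <cstddef>
#include <vector>

// Illustrative per-block parenthesis counts (L + R used bits out of |B|).
struct BlockUse { int L, R; };

// Density of the block range [minBlock, maxBlock): DENSITY(b) generalized
// to a virtual trie node covering several consecutive leaf blocks.
double density(const std::vector<BlockUse>& blocks, std::size_t minBlock,
               std::size_t maxBlock, int bitsPerBlock) {
    long used = 0;
    for (std::size_t b = minBlock; b < maxBlock; ++b)
        used += blocks[b].L + blocks[b].R;
    return static_cast<double>(used) /
           (static_cast<double>(maxBlock - minBlock) * bitsPerBlock);
}

// Density threshold [lo, hi] for a virtual node at depth d of a trie of
// height h: [1/2 - d/(4h), 3/4 + d/(4h)] as in Section 4.4.1. A leaf
// (d = h) may fill up completely; ranges tighten toward the root, which
// must stay between half and three-quarters full.
void threshold(int d, int h, double& lo, double& hi) {
    lo = 0.5  - static_cast<double>(d) / (4.0 * h);
    hi = 0.75 + static_cast<double>(d) / (4.0 * h);
}
```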
4.5 Space Cost
With $2\epsilon n$ bits used per node (including updates) and a 32-bit word, we can store as many as $2^{30}$ nodes. In our implementation we also chose to use 4 kilobyte blocks. Based on these values, we now discuss the space cost of each component of our storage scheme. Of course, if larger documents need to be stored, we can increase the word size used in the data structure and adjust the bit lengths used in tiers 1 and 2.

Tier 0: From the above, tier 0 can take up at most $2^{32}$ bits of space, or $\frac{2\epsilon n}{|B|} = 2^{17}$ blocks.

Tier 1: We need $\lg |B| = 15$ bits for each variable $(L^0, R^0, m^0, M^0, b^0, B^0, D^0)$ of a $T^0$ tuple. Each $T^0$ tuple thus requires a total of $|T^0| = 7\lg|B| = 112$ bits including bit alignment, and based on this calculation, each tier 1 block can store up to $\lfloor|B|/|T^0|\rfloor = 292$ $T^0$ tuples. Since the maximum number of nodes that can be stored in tier 0 is $2^{30}$, we only need $\frac{2\epsilon n}{|B|} = 2^{17}$ $T^0$ tuples to represent all tier 0 blocks, and they can be stored in $\lceil 2^{17}/292 \rceil = 449$ tier 1 blocks.

Tier 2: We need a total of 24 bits for each variable $(L^1, R^1, m^1, M^1, b^1, B^1, D^1)$ of a $T^1$ tuple. This is derived from $\lg|B| + \lg(|B|/|T^0|) = \lg(\frac{|B|^2}{7\lg|B|})$, since each variable must be able to count all the bits summarized by one tier 1 block, i.e., up to $|B|^2/(7\lg|B|)$ of them. So each $T^1$ tuple requires a total of $|T^1| = 7\lg(\frac{|B|^2}{7\lg|B|}) = 168$ bits, and each tier 2 block holds up to $\lfloor|B|/|T^1|\rfloor = 195$ $T^1$ tuples. Thus, we only need a total of two tier 2 blocks to store the 449 tier 1 tuples. Since we need a maximum of only two tier 2 blocks, even for a $2^{30}$-node document, we can simply keep them in main memory. In fact, the entire tier 1 can also be kept in main memory, since it requires at most $449 \times 4KB < 2MB$.

In summary, the space required by the topology layer (in bits) is:

$$2\epsilon n + \frac{14 n \lg|B|}{|B|} + \frac{98\, n \lg|B| \lg\!\left(\frac{|B|^2}{7\lg|B|}\right)}{|B|^2} = 2\epsilon n + o(n)$$

and the space required by the internal node layer (in bits), plus the symbol table, is:

$$n \lg E + O(E)$$

We can use the above equations to estimate the space used by an XML file, using as our example a 100 MB copy of DBLP, which contains roughly 5 million nodes. If we assume there are no updates after the initial loading, we can set $\epsilon = 1$. According to the equations, we will use roughly $2n$ bits $\approx 1$ MB for the topology layer, and $n \lg E + O(E) \approx 8$ MB for the internal node layer. This, of course, disregards the space needed for the text data in the document.

Based on the block size $|B|$, we know the exact sizes of the tuples and tiers in our topology layer. Therefore, given a bit position x, we can calculate which tier 0 block the bit belongs to, and which tier 1 block contains the summary information for that tier 0 block. For a given x, Algorithm 6 lists the calculations needed to find its resident tier 0 to tier 2 blocks, and the indexes within those blocks at which the summary information is found.

Algorithm 6 Offset calculation for blocks and indexes within the blocks of all tiers
$B^0_x = \lfloor x/|B| \rfloor$,  $I^0_x = x \bmod |B|$
$B^1_x = \lfloor B^0_x \cdot |T^0| / |B| \rfloor$,  $I^1_x = (B^0_x \cdot |T^0|) \bmod |B|$
$B^2_x = \lfloor B^1_x \cdot |T^1| / |B| \rfloor$,  $I^2_x = (B^1_x \cdot |T^1|) \bmod |B|$
The tier 0 tuple fields $(L^0_x, R^0_x, m^0_x, M^0_x, \ldots, D^0_x)$ are stored consecutively, each $\lg|B|$ bits wide, starting at bit offset $I^1_x$ of tier 1 block $B^1_x$.
The tier 1 tuple fields $(L^1_x, R^1_x, m^1_x, M^1_x, \ldots, D^1_x)$ are stored consecutively, each 24 bits wide, starting at bit offset $I^2_x$ of tier 2 block $B^2_x$.
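The offset arithmetic of Algorithm 6 is plain integer division and modulo, as the hedged sketch below illustrates. The constants reflect the 4 KB (32768-bit) block configuration described above; the struct and function names are ours.

```cpp
#include <cstdint>

// Block/tuple geometry from Section 4.5 (illustrative constants):
// 4 KB blocks = 32768 bits; tier 1 tuples of 7 * lg|B| bits;
// tier 2 tuples of 7 * 24 bits.
constexpr uint64_t kB  = 32768;   // |B|: bits per block
constexpr uint64_t kT0 = 7 * 15;  // |T^0|: bits per tier 1 tuple
constexpr uint64_t kT1 = 7 * 24;  // |T^1|: bits per tier 2 tuple

struct Location { uint64_t block, index; }; // block number, bit offset

// Tier 0 block containing bit position x, and x's index within it.
Location tier0Of(uint64_t x) { return { x / kB, x % kB }; }

// Tier 1 block holding the summary tuple of x's tier 0 block, and the
// bit offset of that tuple within the tier 1 block (B^1_x and I^1_x).
Location tier1Of(uint64_t x) {
    uint64_t t = (x / kB) * kT0; // global bit offset of the T^0 tuple
    return { t / kB, t % kB };
}

// Tier 2 block and offset of the T^1 tuple summarizing x's tier 1 block.
Location tier2Of(uint64_t x) {
    uint64_t t = tier1Of(x).block * kT1; // global bit offset of the T^1 tuple
    return { t / kB, t % kB };
}
```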
5. EXPERIMENTS
The ISX system is implemented in C++ using the Expat XML parser (http://www.sourceforge.net/projects/expat/). In this section, we compare the performance of ISX with other related implementations, namely XMill [16], XGrind [22], NoK [23] and TIMBER [12]. Experiments were set up to measure various aspects of performance according to the feature matrix of these implementations, as shown in Table 1. We used an Apple G5 2.0 GHz machine with 2.5 GB RAM and a 160 GB, 7200 RPM IDE hard drive. The memory buffer pool of ISX was fixed at 64 MB for all experiments. Three XML datasets were used, namely DBLP [1], the Protein Sequence Database (PSD) [3], and TreeBank [18]. We found that the experimental results for PSD are very similar to those for DBLP, due to their regular, shallow tree structure; therefore, PSD results are omitted from some plots below for clarity. Large datasets (i.e., 1 GB and above) were generated by repeatedly duplicating and merging the source dataset, e.g., the 16 GB DBLP document contains more than 770 million nodes.

Table 1: Comparison of supported features — compression, document traversal, node navigation of all axes, update operation, and XPath query support — for XMill, XGrind, NoK (XPath support uncertain), TIMBER and ISX.

5.1 Storage Size Comparison
Tables 2 and 3 show that XMill has the best compression ratio for both the DBLP and TreeBank datasets. Compared to XMill, which does not support any direct data navigation or queries, XGrind does allow simple path expressions; it therefore has a relatively less attractive compression ratio. In fact, XGrind failed to run on the large datasets in our experiments. Both XMill and XGrind have better space consumption because they are primarily designed for read-only data and do not support efficient updates. Furthermore, they only support access to the compressed data in linear time. Tables 2 and 3 also show that ISX is relatively less sensitive to the structure of the data. The compression ratio of ISX for TreeBank is not as good as for DBLP because TreeBank has text content that is harder to compress (TreeBank text is more random than DBLP's). XMill's compression ratio on TreeBank is relatively much worse than on DBLP due to both the random text content and the more complex tree structure of the data.

Table 2: Storage size of ISX (with and without text compression), XMill and XGrind on DBLP
Source (MB) | ISX (MB) | ISX Compressed (MB) | XMill (MB) | XGrind (MB)
1     | 1    | 0.4    | 0.1     | 0.3
2     | 1    | 0.7    | 0.3     | 0.6
5     | 3    | 1.5    | 0.5     | 1.3
8     | 5    | 2.5    | 0.9     | 2.1
16    | 10   | 5      | 1.8     | 4.3
32    | 21   | 10     | 3.7     | 8.6
64    | 42   | 20     | 7.2     | 17.4
128   | 87   | 40.2   | 14.9    | 35.8
256   | 182  | 82.7   | 31.5    | 75.0
500   | 363  | 163.7  | 62.6    | Failed
750   | 549  | 249.7  | 94.0    | Failed
1000  | 726  | 327.5  | 125.3   | Failed
2000  | 1452 | 654.9  | 250.5   | Failed
4000  | 2903 | 1309.8 | 501.0   | Failed
8000  | 5807 | 2619.6 | 978.48  | Failed
16000 | 9411 | 4629.9 | 1952.81 | Failed

Table 3: Storage size of ISX (with and without text compression) and XMill on TreeBank
Source (MB) | ISX (MB) | ISX Compressed (MB) | XMill (MB)
1     | 0.51    | 0.41    | 0.30
2     | 1.02    | 0.81    | 0.58
4     | 2.04    | 1.63    | 1.16
8     | 4.09    | 3.26    | 2.30
16    | 8.19    | 6.53    | 4.60
32    | 16.39   | 13.07   | 9.19
64    | 32.77   | 44.49   | 18.35
128   | 65.54   | 52.26   | 36.69
256   | 131.08  | 104.53  | 73.38
500   | 243.72  | 192.79  | 146.74
750   | 365.50  | 289.21  | 220.10
1000  | 487.43  | 385.58  | 293.489
2000  | 974.69  | 770.98  | 586.969
4000  | 1949.39 | 1541.97 | 1173.93
8000  | 4052.58 | 3205.59 | 2347.85
16000 | 7797.56 | 6167.87 | 4695.7

5.2 Bulk Loading Performance
The performance comparison of bulk loading using ISX, NoK, XGrind and XMill is shown in Figure 6. For the smaller datasets (up to 500 MB of DBLP), Figure 6(a) shows that our ISX system significantly outperforms NoK and TIMBER in loading. It also highlights the scalability of ISX in loading large datasets.

To further test the scalability of loading even larger XML documents, we compared the loading time of ISX and other well known systems, such as XMill and XGrind, on 1 to 16 GB DBLP documents. During the loading process, XGrind failed to load XML documents greater than 100 MB. Although Figure 6(b) shows that the loading time for ISX is slower than XMill's, it still exhibits a similar trend (similar scalability). The gap between the two curves is due to the fact that ISX does not compress the XML data as much as XMill does; this results in a larger storage layer than XMill's, which in turn requires a higher number of disk writes.

Figure 6: Loading time comparison — (a) ISX vs. TIMBER and NoK (up to 500 MB of data); (b) ISX vs. XGrind and XMill (up to 16 GB of data), for DBLP and TreeBank.

5.3 Query Performance
When considering the proposed structure as the storage scheme of a full-fledged database system, one must consider its query performance. Figure 7 (with details listed in Table 4) shows the query performance of ISX against the other schemes. Note that the query times in Figure 7 are on a logarithmic scale. From this experiment, we found that ISX outperforms the other systems in both the ISX and ISX Stream (using the TurboXPath approach [13]) modes. The performance of ISX is measured by using binary structural joins to evaluate the XPath queries, while ISX Stream executes the same queries by scanning the ISX topology layer linearly.

Table 4: Test Queries and Final Result Sizes (|Final| per DBLP document size)
Query | XPath Expression | 1 GB | 2 GB | 4 GB | 8 GB | 16 GB
Q1 | //inproceedings | 402667 | 981484 | 2012761 | 4160339 | 8453066
Q2 | //mastersthesis | 74 | 156 | 315 | 627 | 1251
Q3 | /dblp/article | 442184 | 717449 | 1379945 | 2630711 | 5135130
Q4 | //inproceedings/title | 402667 | 981484 | 2012761 | 4160339 | 8453066
Q5 | //article[.//month/text() = "July"]//title | 857309 | 1729184 | 3454708 | 6920136 | 13848372
Q6 | //inproceedings[.//ee]//pages | 796742 | 1607116 | 3210628 | 6430194 | 12868471

Figure 7: Query performance (in log scale) of ISX vs. other systems — panels (a)-(f) show DBLP queries Q1-Q6, comparing XMill/LibXML, TIMBER, NoK, ISX and ISX Stream.

5.4 Navigation and Document Traversal
To test the performance and scalability of random node navigation, we pre-loaded our XML datasets and, for each database, randomly picked a node and called the node traversal functions (e.g., FIRSTCHILD, NEXTSIBLING) multiple times. The average access times for these node traversal operations are plotted in Figure 8(a). The graph shows that as the database size gets bigger, the running time for these functions remains constant. This is not surprising, since in general most nodes are located close to their siblings, and hence are likely to be in the same block. For example, it generally takes a scan of only a few bits on average to access either the first child or the next sibling of a node. Some operations are faster than others, due to their different implementation complexity (listed in Algorithm 1) and the characteristics of the encoding itself. For instance, as Figure 8(a) shows, FIRSTCHILD performs slightly faster than NEXTSIBLING, because the first child is always adjacent to a node, whereas its next sibling might be several nodes away. With fast traversal operations, ISX can traverse XML data in the proposed compact encoding significantly faster than other XML compression techniques such as XMill, as shown in Figure 8(b). We argue that this feature is important for examining the content of large XML databases or archives.

Figure 8: Navigation and traversal performance of ISX — (a) node navigation time in microseconds per axis (next following, next preceding, previous sibling, next sibling, parent, first child, find open, find close); (b) full document traversal time in seconds, ISX vs. XMill.

5.5 Update Performance
The worst case for Algorithm 5 occurs when nodes are inserted at the beginning of a completely packed database, i.e., one with no gaps between blocks. The insertion experiment was set up to measure this average worst case performance by inserting nodes at the beginning of the database. For each experiment, we did multiple runs (resetting the database after each run). The average insertion times (per node) are shown in Figure 9.

Figure 9: Insertion time of ISX using 128 MB - 16 GB DBLP (per-node insertion time in microseconds, roughly 12-24 µs, against the number of nodes inserted, for databases from 128 MB to 16 GB).

In Figure 9, we see an initial spike in the execution time for the worst case insertion. This corresponds to the initial packed state of the database, in which case the very first node insertion requires the redistribution of the entire leaf node layer. Clearly, in practice this is extremely unlikely to happen, but the remainder of the graph demonstrates that even this contrived situation has little effect on the overall performance. The graph also shows that the cost of all subsequent insertions increases at a rate of approximately $O(\lg^2 n)$. In fact, all subsequent insertions up to 100,000 took no more than 0.5 milliseconds. Updating the values of nodes does not cause extra processing time apart from the retrieval time for locating the nodes to be updated. In the case of deletion, the reverse sequence of steps for node insertion is performed (freed space is left as gaps to be filled by subsequent insertions).
6. CONCLUSIONS
A compact and efficient XML repository is critical for a wide range of applications, such as mobile XML repositories running on devices with severe resource constraints. For a heavily loaded system, a compact storage scheme can serve as an index storage that can be manipulated entirely in memory, and hence substantially improve overall performance. In this paper, we proposed a scalable and yet efficient, compact storage scheme for XML data. Our data structure is shown to be exceptionally concise without sacrificing query and update performance. While having the benefit of a small data footprint, experiments have shown that the proposed structure still outperforms other XML database systems and scales significantly better for large datasets. In particular, all navigational primitives run in near constant time. Furthermore, as shown in the experiments, our proposed structure allows direct document traversal and queries that are significantly faster and more scalable than previous compression techniques.

7. REFERENCES
[1] DBLP bibliography. See http://www.informatik.uni-trier.de/ley/db/.
[2] Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, and Jignesh M. Patel. Structural joins: A primitive for efficient XML query pattern matching. In Proceedings of the 18th International Conference on Data Engineering (ICDE), pages 141-153. IEEE Computer Society, 2002.
[3] Rolf Apweiler, Amos Bairoch, and Cathy H. Wu. Protein sequence databases. Current Opinion in Chemical Biology, 8:76-80, 2004.
[4] Peter Buneman, Martin Grohe, and Christoph Koch. Path queries on compressed XML. In Proceedings of the 29th International Conference on Very Large Databases (VLDB), pages 141-152. Morgan Kaufmann, 2003.
[5] Giorgio Busatto, Markus Lohrey, and Sebastian Maneth. Efficient memory representation of XML documents. In DBPL, 2005.
[6] Yi Chen, George A. Mihaila, Rajesh Bordawekar, and Sriram Padmanabhan. L-Tree: A dynamic labeling structure for ordered XML data. In Wolfgang Lindner, Marco Mesiti, Can Turker, Yannis Tzitzikas, and Athena Vakali, editors, EDBT Workshops, volume 3268 of Lecture Notes in Computer Science, pages 209-218. Springer, 2004.
[7] James Cheney. XMLPPM: XML-conscious PPM compression. See http://www.cs.cornell.edu/People/jcheney/xmlppm/xmlppm.html.
[8] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Compressing and searching XML data via two zips. In WWW, pages 751-760, 2006.
[9] Torsten Grust, Maurice van Keulen, and Jens Teubner. Accelerating XPath evaluation in any RDBMS. ACM Transactions on Database Systems, 29:91-131, 2004.
[10] Alan Halverson, Josef Burger, Leonidas Galanis, Ameet Kini, Rajasekar Krishnamurthy, Ajith Nagaraja Rao, Feng Tian, Stratis Viglas, Yuan Wang, Jeffrey F. Naughton, and David J. DeWitt. Mixed mode XML query processing. In Proceedings of the 29th International Conference on Very Large Databases (VLDB), pages 225-236. Morgan Kaufmann, 2003.
[11] Guy Jacobson. Succinct Static Data Structures. PhD thesis, Carnegie Mellon University, 1988.
[12] H. V. Jagadish, S. Al-Khalifa, A. Chapman, L. V. S. Lakshmanan, A. Nierman, S. Paparizos, J. M. Patel, D. Srivastava, N. Wiwatwattana, Y. Wu, and C. Yu. TIMBER: A native XML database. VLDB Journal, 11(4):274-291, 2002.
[13] Vanja Josifovski, Marcus Fontoura, and Attila Barta. Querying XML streams. VLDB Journal, 14(2):197-210, 2005.
[14] Jyrki Katajainen and Erkki Makinen. Tree compression and optimization with applications. International Journal of Foundations of Computer Science, 1:425-447, 1990.
[15] Quanzhong Li and Bongki Moon. Indexing and querying XML data for regular path expressions. In Proceedings of the 27th International Conference on Very Large Databases (VLDB), pages 361-370. Morgan Kaufmann, 2001.
[16] Hartmut Liefke and Dan Suciu. XMill: an efficient compressor for XML data. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 153-164. ACM Press, 2000.
[17] Sebastian Maneth and Giorgio Busatto. Tree transducers and tree compressions. In Igor Walukiewicz, editor, FoSSaCS, volume 2987 of Lecture Notes in Computer Science, pages 363-377. Springer, 2004.
[18] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19, 1993.
[19] Jun-Ki Min, Myung-Jae Park, and Chin-Wan Chung. XPRESS: A queriable compression for XML data. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 122-133. ACM Press, 2003.
[20] J. Ian Munro, Venkatesh Raman, and Adam J. Storm. Representing dynamic binary trees succinctly. In Proceedings of the 12th Annual Symposium on Discrete Algorithms (SODA), pages 529-536. SIAM, 2001.
[21] Adam Silberstein, Hao He, Ke Yi, and Jun Yang. BOXes: Efficient maintenance of order-based labeling for dynamic XML data. In Proceedings of the 21st International Conference on Data Engineering (ICDE), pages 285-296, 2005.
[22] Pankaj M. Tolani and Jayant R. Haritsa. XGRIND: A query-friendly XML compressor. In Proceedings of the 18th International Conference on Data Engineering (ICDE), pages 225-234. IEEE Computer Society, 2002.
[23] Ning Zhang, Varun Kacholia, and M. Tamer Ozsu. A succinct physical storage scheme for efficient evaluation of path queries in XML. In Proceedings of the 20th International Conference on Data Engineering (ICDE), pages 54-65. IEEE Computer Society, 2004.