WWW 2007 / Track: XML and Web Data Session: Parsing, Normalizing, and Storing XML XML Design for Relational Storage Solmaz Kolahi University of Toronto Leonid Libkin University of Edinburgh solmaz@cs.toronto.edu libkin@inf.ed.ac.uk ABSTRACT Design principles for XML schemas that eliminate redundancies and avoid up date anomalies have b een studied recently. Several normal forms, generalizing those for relational databases, have b een prop osed. All of them, however, are based on the assumption of a native XML storage, while in practice most of XML data is stored in relational databases. In this pap er we study XML design and normalization for relational storage of XML documents. To b e able to relate and compare XML and relational designs, we use an information-theoretic framework that measures information content in relations and documents, with higher values corresp onding to lower levels of redundancy. We show that most common relational storage schemes preserve the notion of b eing well-designed (i.e., anomalies- and redundancyfree). Thus, existing XML normal forms guarantee welldesigned relational storages as well. We further show that if this p erfect option is not achievable, then a slight restriction on XML constraints guarantees a "second-b est" relational design, according to p ossible values of the informationtheoretic measure. We finally consider an edge-based relational representation of XML documents, and show that while it has similar information-theoretic prop erties with other relational representations, it can b ehave significantly worse in terms of enforcing integrity constraints. 1. INTRODUCTION Database design and normalization, which originated with early pap ers by Codd [12, 13, 14], is one of the classical areas of relational database theory, and a standard textb ook topic. With a recent shift to XML as a data model, many of classical database sub jects have b een reexamined in the XML context, among them design and normalization [4, 35, 37, 36, 20, 38]. The goal of normalization is to eliminate redundancies from a database or an XML document, and by doing so, eliminate or reduce p otential up date anomalies. With that goal in mind, XML normal forms have b een introduced [4, 38, 36] and proved to eliminate redundancies in data storage [5]. However, all the work on XML design was making an assumption of a native XML storage for reasoning ab out redundancies and anomalies. While native XML storage facilities exist [31, 30, 23, 1], many XML processing systems take advantage of relational databases to store and query XML data [21, 39, 32, 34, 18]. The issues of storing and querying relational representations of XML have b een studied extensively (refer to [26] for a survey). So the rationale for good XML designs that can b e found in the literature applies to the storage model that is not the prevalent one in practice. Thus, a natural question to ask is: Do the existing principles of XML design apply when one stores XML in relational databases? And, in case there is a mismatch b etween the native XML representation and a relational one, how can one adjust XML design principles to guarantee well-designed relational representation of XML documents? As we try to formulate these questions in a more precise way, one issue arises immediately: how do we compare XML designs and designs of their relational translations? After all, the notions of redundancies, up dates, queries, etc. are rather different in these two worlds. To overcome this problem, we use an information-theoretic approach to database designs prop osed in [5] and further develop ed in [25]. The idea of this approach is that it measures, in a way indep endent of features such as up dates and queries, the information content of data, as an entropy of a suitably chosen probability distribution. The higher this information content is, the less redundancy the design carries. This measure applies across different data formats and integrity constraints, and allows us to reason ab out and compare data designs over different data models. The values of the information-theoretic measure are real numb ers b etween 0 and 1, with 0 meaning "completely redundant information" and 1 corresp onding to no redun- Categories and Subject Descriptors H.2.1 [Database Management]: Logical Design--Normal Forms ; H.1.1 [Models and Principles]: Systems and Information Theory--Information Theory, Value of Information General Terms Design, Management, Theory Keywords XML data, Design, Relational storage, Functional dep endencies, Equality-generating dep endencies, Information theory Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 8­12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005. 1083 WWW 2007 / Track: XML and Web Data dancy at all. Good designs are those where all data items have measure 1. It is known, for example, that over relational databases this corresp onds to BCNF (Boyce-Codd Normal Form) [2, 24, 28], and over XML documents, to a normal form XNF introduced in [4], if only functional dep endencies are involved. When XML documents are stored as relations, constraints may not remain in the same forms, e.g. XML keys may change to functional dep endencies over the relational schema [17]. However, the information-theoretic approach applies to any kind of constraints, and as our first result, we are able to show that the normal form XNF corresp onds precisely to the p erfect designs of relational translations of XML. For this result, we imp ose fairly mild conditions on translations of XML; in fact for most translations used in the literature [33, 21] the result holds. While one often tries to achieve a p erfect design, in practice it is not always p ossible. For example, if we deal with relational databases and functional dep endencies, the p erfect design that eliminates all redundancies is BCNF, but if one needs to enforce all the functional dep endencies at the same time, a decomp osition into BCNF may not exist [2]. In that case, one usually tries to obtain a 3NF (3rd Normal Form) design; in fact 3NF is much more commonly used in practice than BCNF. It was recently shown that 3NF can b e explained information-theoretically as the b est relational normal form achievable if all constraints are preserved [25]. Using the information-theoretic measure, one can also characterize it as a normal form always guaranteeing values of the measure at least 1 , which is the highest value that one 2 can guarantee if the preservation of functional dep endencies is imp ortant. Here we show that there is a simple XML design criterion that guarantees this second-b est relational design. Namely, our XML design criterion says that every constraint violating XNF should b e relative [10, 3], i.e. restricted to some element typ e of a DTD that occurs under the scop e of a Kleene star. Thus, our results suggest the following guidelines for XML design if one stores XML documents in relations: 1. try to achieve the normal form XNF (using, e.g., algorithm from [4]); 2. if that fails, try replacing all XNF-violating constraints by relative ones. This way one guarantees good design not only of an XML document itself, but also of its relational storage by removing redundancies. Our final result deals with a relational representation of documents in which we essentially store the document as a tree (the edge relation) [21]. This representation shares some of the information-theoretic characteristics with those more commonly used, but we show that it may require arbitrarily many more relational joins for enforcing XML constraints even for well-designed documents in XNF. Organization. The pap er is organized as follows. In Section 2 we define relational and XML constraints and DTDs. In Section 3 we give a brief overview of normal forms BCNF, 3NF for relational databases, and XNF for XML. In Section 4, we give an overview of the information-theoretic measure, explain how it characterizes good designs. The precise formal definition is in the app endix. In Section 5 we overview Session: Parsing, Normalizing, and Storing XML relational translations of XML and XML constraints. In Section 6, we show that p erfect XML designs (i.e., XNF) corresp ond precisely to p erfect designs of relational representations. In Section 7 we find conditions on XML documents guaranteeing the b est non-p erfect relational designs, akin to 3NF. In Section 8 we discuss the edge representation, and in Sections 9 and 10 we give conclusions and directions for future work. 2. NOTATIONS Relational schemas and instances. A relational schema, usually written as S , is a set of relation names. With each mary relation R S , we associate a set of attributes attr(R). We assume that all data values in a database instance are from a countably infinite domain, N+ (the set of p ositive natural numb ers) for simplicity. Therefore, an instance I of S assigns to each m-attribute relation R in S a finite subset I (R) of Nm . The active domain of I , denoted by adom (I ) + is the set of all elements of N+ that occur in I . The set of positions in an instance I of S , denoted by Pos (I ), is the set {(R, t, A) | R S, t I (R) and A attr(R)}. Schemas may contain integrity constraints. We denote such schemas by (S, ), where S is a set of relation names and is a set of constraints. In this pap er, we deal with constraints such as equality-generating dep endencies (EGDs), and functional dep endencies (FDs) as sp ecial cases of EGDs. An EGD is an expression of the form (R1 (x1 ) . . . Rm (xm ) x = y ) , ¯ ¯ where represents the universal closure of the formula, x, y x1 . . . xm , and there is an assignment of vari¯ ¯ ables to columns such that each variable occurs only in one column, and x, y are assigned to the same column. We assume that FDs are of the form X Y , where X, Y are nonempty sets of attribute names. If is a set of constraints, then + denotes the set of all constraints implied by it, and inst (S, ) stands for the set of all instances of S satisfying . We write inst k (S, ) for the set of instances I inst (S, ) with adom (I ) [1, k]. DTDs and XML trees. Assume that we have the following disjoint sets: El of element names, Att of attribute names, Str of p ossible values of string-valued attributes. All attribute names start with the symb ol @. A DTD D is defined to b e D = (E , A, P, R, r ), where · E E l is a finite set of element typ es; · A Att is a finite set of attributes; · P is a set of rules P for each E , where P is a regular expression over E - {r }; · R assigns a subset of A of attribute names to each element E ; · r E is the root element. For simplicity, we do not consider PCDATA elements in XML trees since they can b e represented by attributes. An XML tree is a finite rooted directed tree T = (N , G), where N is the set of nodes, and G is the set of edges, together with 1084 WWW 2007 / Track: XML and Web Data db Session: Parsing, Normalizing, and Storing XML student @name "Smith" student @name "Collins" student @name "Wilson" contact address contact address contact address phone phone phone @number @streetNo @city "416 2561212" "55 Queen St." "Toronto" @number @streetNo @city "416 3408888" "55 Queen St." "Toronto" @number @streetNo @city "613 1267777" "95 Dundas St." "Ottawa" @aptNo @postalCode "310" "M3Y 2P7" @aptNo @postalCode "210" "M3Y 2P7" @aptNo @postalCode "510" "K2P 3G5" Figure 1: An XML tree. 1. a lab eling function : N E l; 2. an attribute function @a : N S tr for each @a A t t. We say tree T conforms to DTD D = (E , A, P, R, r ), written as T |= D, if · the root of T is lab eled r ; · for every x N with (x) = a, the word (x1 ) . . . (xn ) is in the language defined by Pa where x1 , . . . , xn are children of x in order; · @l R(a) iff the function @l is defined on x. The set of positions in an XML document is intuitively the set of places where values (i.e. attribute values) occur. Formally, for a tree T = (N , G) that conforms to DTD D, the set of p ositions is defined as the set {(x, @l) | x N , @l R((x))}, and is denoted by Pos (T ). Given a DTD D = (E , A, P, R, r ), an element path q is a word in the language E , and an attribute path is a word of the form q .@l, where q E and @l A. An element path q is consistent with D if there is a tree T |= D that contains a node reachable by q ; if the nodes reachable by q have attribute @l, then the attribute path q .@l is consistent with D. The set of all paths consistent with a DTD D is denoted by paths(D). The last element typ e that occurs on a path q is called last(q ). A functional dependency over a DTD D [4] is an expression of the form {q1 , . . . , qn } q , where q , q1 , . . . , qn paths(D). To define satisfaction of functional dep endencies, we need a notion of tree tuples in XML documents. Given an XML tree T = (N , G) such that T |= D, a tree tuple [4] is intuitively a subtree of T rooted at r containing at most one occurrence of every path. Then satisfaction is defined in the usual way: if two tree tuples agree on all the paths q1 , . . . , qn , then they must agree on q . The precise definition requires a bit of care since a tree tuple may not b e defined on some paths (as tree tuples have at most one occurrence of every path, and may have zero occurrences). Let represents such missing values. A tree tuple is formally defined as a mapping t : paths(D) N S tr {} such that if for an element path q whose last letter is a, we have t(q ) = , then t(q ) N and (t(q )) = a; if q is a prefix of q , then t(q ) = and t(q ) lies on the path from the root to t(q ) in T ; if @l is defined for t(q ) and its value is v S tr , then t(q .@l) = v . Then a tree T satisfies an FD {q1 , . . . , qn } q if for any two tree tuples t1 , t2 in T , whenever t1 (qi ) = t2 (qi ) = for all i [1, n], then t1 (q ) = t2 (q ). If in an FD {q1 , . . . , qn } q , for some i the path qi is an element path, and for all j [1, n], j = i, the path qi is a prefix of qj , then we say that such an FD is relative (more precisely, relative to qi ). Example 1. Consider the XML tree of Figure 1 that conforms to DTD D = (E , A, P, R, db), where E = {db, student, contact, addr ess, phone} A = {@name, @str eetN o, @aptN o, @city , @postalC ode, @number } P = {db student , student contact, contact addr ess, addr ess , phone } R(student) = {@name} R(addr ess) = {@str eetN o, @aptN o, @city , @postalC ode} R(phone) = {@number } R(db) = R(contact) = This XML tree satisfies the following constraints: (1) for each student, no more than one address is kept with a single postalCode, and (2) the value of postalCode uniquely determines streetNo and city. These constraints can b e formulated as the following FDs: db.student, db.student.contact.addr ess.@postalC ode db.student.contact.addr ess (1) db.student.contact.addr ess.@postalC ode db.student.contact.addr ess.@city (2) The first FD is an example of a relative FD, where the constraint holds within each fixed element student. 3. OVERVIEW OF NORMALIZATION A normal form sp ecifies a set of syntactic conditions over the constraints defined for a database that will lead to less redundant instances. We briefly review some normal forms 1085 WWW 2007 / Track: XML and Web Data defined for relational and XML data and refer the reader to surveys [7, 22, 9], texts [2, 24, 28], and pap ers [4, 38] for additional information. Relational normal forms. A schema (S, ) is in BCNF if for every relation name R in it and every nontrivial FD X Y over attributes of R, X is a key of R. Prime attributes are those that b elong to a candidate (minimal) key. A schema (S, ) is in 3NF if for every relation name R in it and every nontrivial FD X Y over attributes of R, either X is a key, or every attribute in Y - X is prime. Given a single-relation schema (R, ) and some normal form NF , a lossless NF -decomp osition is a set of schemas (Rj , j ), j J , such that each (Rj , j ) is in NF , and for every I inst (R, ) we have attr(Rj ) (I ) |= j and furthermore I =1 {attr(Rj ) (I ) | j J }. Such a decomp osition is + j called dependency-preserving if j = + . It is wellknown that b oth 3NF and BCNF admit lossless decomp ositions, which in the case of 3NF can b e guaranteed to b e dep endency-preserving. In the case of BCNF dep endency preservation is not always p ossible (consider a schema with attributes A, B , C and FDs AB C and C A). A normal form for XML. Given a DTD D and a set of FDs over D, (D, )+ is the set of all FDs implied by (D, ). An FD is called trivial if it b elongs to (D, )+ . We say that (D, ) is in XML Normal Form (XNF) [4] if for every nontrivial FD X q .@l in (D, )+ , the FD X q is also in (D, )+ . Intuitively, this condition prevents redundancies among values of attributes @l of q . It was shown in [4] that if relational databases are translated into XML documents (say, as instances of DTDs r t , where t has all the attributes of a relation R), then XNF coincides with BCNF. Thus, it is a natural extension of BCNF to XML documents. For further results on XNF, and its justification, see [4, 5]. We can see an example of XNF violation by FD (2) in Example 1. While the path on the left-hand side db.student.contact.addr ess.@postalC ode implies the attribute path db.student.contact.addr ess.@city , it does not imply the element path db.student.contact.addr ess. Session: Parsing, Normalizing, and Storing XML ranges b etween 0 and 1 and shows how much redundancy is carried by p osition p. Numb ers close to 0 mean high redundancy, while numb ers close to 1 mean no or low redundancy for the data value in p osition p. Example 2. (see [5]) Consider relation R(A, B , C ) with the set of FDs = {A B }, and three instances I1 , I2 , and I3 of (R, ) shown in Figures 2(a), 2(b), and 2(c) respectively. Let p1 , p2 , and p3 denote the position of the gray cel l in the instances. We can observe that the information content of the gray cel l decreases as it becomes more redundant by adding tuples to the instances. To define this measure intuitively, supp ose that we have n p ositions (data elements) in a relational or XML instance I , whose values are drawn from a domain of size k (which we may assume without loss of generality to b e the interval [1, k]). We want to measure, on average, how much information is contained in p osition p with resp ect to all other sets X of p ositions -- or, in other words, how much we can derive ab out p from values in p ositions in X . When we can derive the value in p unambiguously, the information content will b e 0; when we cannot say anything ab out it, it will b e 1. This in turn is measured as follows. Supp ose X is a subset of P os(I ) - {p}. Supp ose the values of p ositions in X are lost, and then somehow restored from the set [1, k]. We measure how much information this gives us ab out the value of p. This is done in a standard information-theoretic fashion, by computing an entropy of a certain distribution1 . Then Rick (p|) is the average of such entropy over all such sets I X. We want the information content to b e a value in [0, 1]. It is known that the maximum entropy for a discrete distribution on k elements is log k [15]. So we consider the ratio Rick (p|)/ log k, and then define the relative information I content of p osition p as RicI (p|) = lim Rick (p | ) I . k log k 4. INFORMATION THEORY AND NORMALIZATION Since we shall b e comparing XML designs and relational designs, we would like to work in a framework that applies across several data models and is indep endent of concepts such as query languages and up dates (which are not yet as fully understood for XML as for relational world), and concepts such as dep endency preservation (which has not b een adequately explored in the XML context). Such framework is provided by the classical information theory and its central concept of entropy which measures information content. Information theory has recently b een used to characterize well-known relational normal forms, such as BCNF, 4NF, and 3NF, as well as XML normal forms [5, 25]. Given a database schema S , a set of integrity constraints and an instance I of (S, ), the informationtheoretic measure, introduced in [5], assigns a numb er to every p osition p in the instance that contains a data value. This numb er, which is called relative information content with resp ect to constraints and is written as RicI (p|), A key result of [5] is that this limit exists for all reasonable classes of constraints (e.g., all those definable in first-order logic, such as functional, multi-valued, join, etc. dep endencies). The following definition is from [5] and it applies across different data formats, including relational and XML data. Definition 1. A database schema (S, ), with a set of constraints , is well-designed if for every instance I of (S, ) and every position p in I , we have RicI (p|) = 1. In other words, a schema is well-designed if in every instance, no p osition has any redundancy at all. Here we concentrate on designs guided by functional dep endencies. For them, well-designed schemas have b een precisely characterized in [5]. Fact 1. (see [5]) 1. If S is a relational schema and is a set of functional dependencies, then (S, ) is wel l-designed if and only if it is in BCNF. 1 The precise definition of this distribution and the definition of entropy are in the app endix. 1086 WWW 2007 / Track: XML and Web Data Session: Parsing, Normalizing, and Storing XML A 1 1 1 1 B 2 2 2 2 C 3 4 5 6 A 1 1 B 2 2 C 3 4 A 1 1 1 B 2 2 2 C 3 4 5 RicI1 (p1 |) = 0.875 (a) RicI2 (p2 |) = 0.781 (b) RicI3 (p3 |) = 0.711 (c) Figure 2: Information content vs redundancy. 2. If D is a DTD and is a set of XML functional dependencies, then (D, ) is wel l-designed if and only if it is in XNF. We also note that the framework applies well b eyond functional dep endencies. For example, [5] used it to characterize designs based on multi-valued and join dep endencies, and we shall use it soon for schemas for equality-generating dep endencies (EGDs). Some p opular relational normal forms, such as 3NF, are not well-designed according to our definition, b ecause they allow some amount of redundancy to comp ensate for preserving all the constraints. A way to measure and compare information contents over multiple schemes satisfying a particular condition, such as a normal form, was introduced in [25]. Let C b e some condition on relational schemas, e.g., BCNF or 3NF. Now consider all instances I of these schemas (S, ), and all the p ossible values RicI (p|). Definition 2. The guaranteed information content of a condition C is the largest number c [0, 1] such that RicI (p|) c for al l such instances I and positions p. It is denoted by GIC(C ). Formal ly, GIC(C ) = inf {RicI (p|)}, where I ranges over instances of schemas (S, ) that satisfy C , and p ranges over positions in I . Using this, one can characterize 3NF. In fact, it was noticed long ago that 3NF designs can differ significantly [40], and in general those accepted as good ones are the ones produced by the standard 3NF synthesis algorithm by Bernstein [8]. In fact such a subset of 3NF, which we denote by 3NF+ , can b e characterized information-theoretically as follows: 1 Fact 2. (see [25]) GIC(3NF+ ) = 2 . Moreover, if C is any other normal form that guarantees dependencypreserving decompositions, then GIC(C ) 1 . 2 p ermit adding more redundant values, or redundant tree tuples, to XML documents. In particular, they occur when element typ es occur under the scop e of a Kleene star in the DTD. This is indeed what is required for the worst cases of redundancy, so for lower b ounds we can safely concentrate on DTDs that basically model nested relations, i.e. nonrecursive disjunction-free DTDs, where each element typ e app ears once or under a Kleene star in the production rule of its parent. We shall use the inlining technique [33] as our XML-torelational mapping scheme. While the inlining technique is not the only [21, 39, 32, 6, 19, 18] or necessarily the b est mapping scheme (see [26] for a survey), it produces the most natural relational schema for DTDs that are essentially nested relations. But our results are more general: they apply to any other mapping scheme that produces a similar relational schema for nested-relation-like DTDs. Given a DTD, the basic idea of the inlining mapping is that separate relations are created for the root and the element typ es that app ear under a star, and the other element typ es are inlined in the relations corresp onding to their parents. Each relation corresp onding to an element typ e has an ID attribute that is a key for that relation as well as a parent ID attribute that is a foreign key p ointing to the parent of that element in the document. All the attributes of a given element typ e in the DTD b ecome attributes in the relation corresp onding to that element typ e. For example, the relational schema for storing XML documents conforming to the DTD in Example 1 would b e student(stID, name, conID ) addr ess(addID, conID, postalC ode, str eetN o, aptN o, city ) phone(phID, conID, number ) Key attributes are underlined, and the following foreign keys also hold: addr ess[conID] FK student[conID] and phone[conID] FK student[conID]. The inlining schema generation can b e formally captured as follows. Definition 3. Given a DTD D = (E , A, P, R, r ), we define the inlining of D to be a relational schema S , where · S = {Re | e E , and e occurs in Pe E or e occurs in Pr }, and for Note that we can reformulate Fact 1 as GIC(BCNF) = 1. This means that if we cannot achieve the maximum information content equal to 1 for all p ositions, the next b est thing that we guarantee is the value of the information-theoretic measure equal to 1 . 2 some e 5. RELATIONAL TRANSLATION OF XML The main goal of this pap er is to show how a good XML design can result in having a less redundant relational storage for XML documents. Redundancies occur when DTDs · there is a R apping : E S recursively defined as m if Re S e (e ) = (e ), where e occurs in Pe if Re S such that for each Re S , 1087 WWW 2007 / Track: XML and Web Data attr(Re ) = ({e ID} A(e )) . {e D | e occurs in Pe } e -1 (R e) Session: Parsing, Normalizing, and Storing XML by shredding T into relations of S . Now every p osition p in T is naturally mapp ed into a unique p osition (p) in IT . Definition 4. We say that an inlining translation (S, S ) of (D, ) is well-designed iff for every XML tree T conforming to D and satisfying and every position p in T , we have RicIT ( (p)|S ) = 1. In other words, in p ositions corresp onding to p ositions from the XML document, there is no redundancy whatsoever in the shredded documents according to the translation of XML constraints. Theorem 1. The fol lowing are equivalent for an XML specification (D, ) and its inlining translation (S, S ): 1. (D, ) is wel l-designed (or, equivalently, is in XNF); 2. (S, S ) is wel l-designed. This means that in order to ensure a non-redundant relational storage for our XML data, we need to have an XNF design. In other words, XNF achieves nonredundant design no matter what typ e of storage ­ native or relational ­ is used. I Then given an XML tree T = (N , G) conforming to D and satisfying , we can straightforwardly shred it into an instance IT of relational schema S , the inlining of D. Note that we can use node identifiers in N for values of ID attributes when p opulating IT . The precise definition of this transformation is omitted here. It is easy to observe that the FDs defined over a DTD do not necessarily translate into FDs over the relational schema, simply b ecause the paths involved in an XML functional dep endency may not all occur in a single relation. Therefore, we need to join different relations to enforce the integrity constraints that are now in the form of equalitygenerating dependencies (EGDs). For example, the FDs in Example 1 would translate into the following EGDs. Note that EGD (4) is an FD, but EGD (3) is not. student(s, n, c) addr ess(a, c, pc, st, apt, ct) student(s, n, c) addr ess(a , c, pc, st , apt , ct ) a=a, addr ess(a, c, pc, st, apt, ct) addr ess(a , c , pc, st , apt , ct ) ct = ct (3) . (4) We can show that every constraint expressed in the form of an FD for XML can b e written as an EGD over the inlining relational storage. More formally, Proposition 1. For every DTD D and set of XML functional dependencies defined over D, there is a set of EGDs E over the relations of S , the inlining of D, such that for every XML tree T conforming to D, the tree T satisfies if and only if IT satisfies E . Note that there are also some key and foreign key constraints K,FK over the ID attributes of S , as shown in previous example. We then call (S, S ) the inlining translation of (D, ), where S = E K,FK . 7. RELATIVE CONSTRAINTS AND GOOD DESIGNS Like in the relational case, bad XML designs may lead to very low information contents for p ositions in the relational storage. In fact, the information content of a p osition in a relational storage of an XML document can p otentially b e arbitrarily low. Proposition 2. For every > 0, we can find an XML design (D, ) with inlining translation (S, S ) and an XML tree T conforming to D and satisfying with a position p, such that for the corresponding position (p) in IT , we have RicIT ( (p)|S ) < . To avoid the p ossibility of having such a high redundancy, we need some restrictions that guarantee a reasonable information content for all p ositions in the relational storage of XML. We showed in the previous section that an XNF design corresp onds to maximum information content for the relational storage, but is it always p ossible to achieve XNF? Recall that in the relational context it is not always p ossible to achieve a dep endency-preserving p erfect design. Therefore to guarantee dep endency preservation, we may have to tolerate some redundancy, at most equal to one half of the maximum information content, and this is exactly what a good 3NF (i.e., 3NF+ ) normalization gives us. Here we want to show that if a dep endency-preserving XNF design is not achievable, then a simple restriction on FDs defined for XML guarantees the second b est information content for the relational storage of an XML document. The restriction is simply that all FDs should satisfy XNF or b e relative to an element that occurs under the scop e of a Kleene star. Then the information content of all p ositions of the relational storage of the XML document will b e at least 1 . 2 Let D = (E , A, P, R, r ) b e a DTD, b e a set of XML functional dep endencies defined over D, and (S, S ) b e the inlining translation of (D, ). 6. RELATIONAL TRANSLATIONS PRESERVE PERFECT DESIGNS We know that if we want a p erfectly non-redundant design for XML, we should try to achieve XNF [4]. XML is most commonly stored in relational databases, in order to take advantage of fast relational query engines and well-develop ed storage facilities. We therefore need to study the design and normalization not only for XML p er se, but also for the relational storage of XML documents. Here we study the effect of XNF normalization on the relational storage, and in the next section we will see how the relational storage looks, in terms of redundancy, for non-p erfect XML designs. We observe that the inlining mapping preserves a good XML design by showing that the information content of data values will remain maximum after transforming XML data into relational storage. Note that the FDs over XML data will b ecome EGDs over the relational storage. Let D b e a DTD, b e a set of XML functional dep endencies defined over D, and (S, S ) b e the inlining translation of (D, ). Consider an XML tree T conforming to D and satisfying . Let IT b e the instance of (S, S ) that is obtained 1088 WWW 2007 / Track: XML and Web Data Definition 5. We say that an FD {q1 , . . . , qn } q is relative under the Kleene star if · qi is an element path for some i [1, n], · for al l j [1, n], qi is a prefix of qj , and · for some p which is a prefix of qi and E , last(p) occurs under a Kleene star in P . An example of FD that violates this condition is FD (2) in Example 1. We refer to such FDs as absolute or global FDs. We now extend the definition of guaranteed information content for relational storage of XML documents. Let C b e some condition on XML functional dep endencies defined over a DTD, e.g. XNF or relative under the Kleene star. Now consider all XML trees T conforming to some DTD D and satisfying FDs , such that FDs in are of typ e C . The guaranteed information content of a condition C for inlining translation, written as GICinl (C ) is the largest numb er c [0, 1] such that for all such trees T and p osition p in T , RicIT ( (p)|S ) c, where IT is the shredding of T into the inlining translation (S, S ) of (D, ), and (p) is the p osition in IT to which p is mapp ed. Formally, GICinl (C ) = inf {RicIT ( (p)|S )}, where T ranges over trees conforming to D and satisfying FDs of typ e C , and p ranges over p ositions in T . Using this, we can reformulate Theorem 1 as GICinl (XNF) = 1. Now let relative denote the condition of b eing relative under star. Then we can formally state the main result of this section: Theorem 2. GICinl (relative) 1 . 2 Session: Parsing, Normalizing, and Storing XML of view: 1) redundancy and information content, and 2) the complexity of enforcing integrity constraints. In the Edge representation, an XML tree is viewed as an edge-lab eled graph. Each element-to-element and elementto-attribute edge of the tree has a tuple in E dg e table, and for each data value in the tree, there is a tuple in V alue table associating the node identifier to a value. The relational storage for any XML tree, regardless of its schema, has the following relations: E dg e(sour ce, tar g et, label), V alue(v id, v al). In the original definition of this schema [21], there are two more attributes in the E dg e relation: or dinal, which sp ecifies the ordinal of the tar g et among children of the sour ce, and f lag which sp ecifies whether the tuple corresp onds to an element-to-element edge or an element-to-attribute edge. We simplify the schema by removing these attributes, as they do not have any effect on the redundancy of data values. Assume that we have a set E l of element names, Attr of attribute names, and S tr of all p ossible string-valued attributes. Then given an XML tree T = (N , G), with lab eling function : N E l and attribute functions @a : N S tr for each @a Attr, we can p opulate E dg e and V alue such that · for each edge (x, y ) G, there is a tuple (x, y , (y )) E dg e, and · for each x N and @a Attr such that @a is defined for x, there is a tuple (x, y , @a) E dg e as well as a tuple (y , @a (x)) V alue, where y N is a fresh node identifier. Other variants of this approach also exist. One of them is the binary approach [21, 32], where the E dg e relation is horizontally partitioned based on attribute label. The schema would then have the following relations: Blabel (sour ce, tar g et) V alue(v id, v al). In this representation, edge lab els or element typ es are not stored as attributes. We can therefore assume that the domain of each attribute is an infinite set, and thus the measure of information content can b e directly applied. Here we focus on the binary representation and show that redundancy-wise it looks the same as the inlining representation. Given a set of FDs defined over a DTD D, the translation of over the binary representation of D will again b e a set of EGDs. We denote the binary translation of (D, ) by (SB , B ), where B contains some key and foreign key constraints, as well as some EGDs obtained from . Every T conforming to D and satisfying can b e trivially shredded into an instance IB of (SB , B ), and each p osition p in T is mapp ed to a unique p osition B (p) in IB . Then the definitions of b eing well-designed and guaranteed information content for binary translation, GICbin , would b e very similar to those of inlining translation. Not surprisingly, the binary translation also preserves a p erfect XML design: Theorem 3. The fol lowing are equivalent for an XML specification (D, ) and its binary translation (SB , B ): In other words, if we manage to design an XML document in such a way that there is no global FDs, then the redundancy of each data value in the relational storage of 1 the XML document would not b e worse than 2 . We can explain this result more intuitively by looking at up date anomalies that could happ en in the relational storage of an XML document. Most database management systems disallow up dates that violate key constraints, but the only mechanism to enforce FDs or EGDs would b e through writing assertions. In the absence of global or absolute FDs on the XML side, the p ossibility of FD or EGD violation due to a bad up date in the relational storage will b e restricted to a small p ortion of the entire database. Informally sp eaking, if we are to numerically evaluate the p ossibility of having update anomalies, our results state that by having global FDs, the relational storage of an XML document could b e exp onentially more prone to anomalies, compared to the case when we are restricted to XNF and relative FDs. 8. A DIFFERENT MAPPING SCHEME Beside inlining, there are other XML-to-relational data and query mapping schemes [21, 39, 32, 6, 19, 18], among them b eing the Edge representation [21], which is used as a basis for many XML query translation techniques. Here we would like to study this mapping scheme from two p oints 1089 WWW 2007 / Track: XML and Web Data 1. (D, ) is wel l-designed (or, equivalently, is in XNF); 2. (SB , B ) is wel l-designed. Similarly, to guarantee an information content of 1 for p o2 sitions in the binary representation of an XML document, it is enough to make sure that all the XML functional dep endencies are relative: Theorem 4. GICbin (relative) 1 . 2 This might make us think that, from the design p oint of view, binary and inlining representations are equivalent. However, we will next show that they differ significantly when it comes to the complexity of enforcing integrity constraints. Here by complexity, we mean the numb er of joins needed to write a SQL assertion that enforces an EGD translated from an XML functional dep endency. Given an XML functional dep endency f over a DTD D, we use the notation # 1fnl i and # 1fin b Session: Parsing, Normalizing, and Storing XML 2. If XNF is not achievable, the next b est p ossible design is achieved by relativizing all constraints that violate XNF. By relativizing we mean restricting them to the scop e of an element that occurs in a DTD with a Kleene star (or under the scop e of another element with a Kleene star). When we say "the next b est p ossible design", our criterion is the information content of redundant values in the relational storage of XML documents. 3. The information-theoretic framework allows us to compare different shredding techniques from the p oint of view of redundancies in XML documents. We show that two p opular techniques ­ inlining [33] and the edge representation [21] ­ b ehave in the same way information-theoretically, while the latter can b ehave arbitrarily worse in terms of enforcing integrity constraints. 10. FUTURE WORK While the situation with the b est p ossible design from the p oint of view of eliminating up date anomalies has now b een completely clarified for b oth native and relational storage, it is not yet entirely clear how to handle non-p erfect designs that do not eliminate all redundancies. Such designs are necessary, and in fact essential, in the relational world, where they guarantee database consistency by means of dep endency preservation. This in fact explains why 3NF designs in practice are more p opular than BCNF designs. In the XML world, we do not yet have an adequate understanding of the notion of dep endency preservation and its impact on XML design, whether for native or relational storage. Another op en issue is whether one can relativize constraints, as suggested in this pap er for achieving the b est non-XNF design, and do it while preserving XML constraints. We also would like to extend the idea of using the information-theoretic framework for reasoning ab out and comparing different shredding techniques for XML documents. Acknowledgments. We thank Wenfei Fan and Rene´ e Miller for comments and suggestions. Both authors are supp orted by a grant from NSERC. Part of this work was done while Kolahi was visiting the University of Edinburgh. Libkin is on leave from the University of Toronto, supp orted by the Europ ean Commission Marie Curie Excellence grant MEXC-CT-2005-024502 and EPSRC grant E005039. to denote the numb er of joins required to enforce the translation of f on the inlining and binary relational representation of D resp ectively. Consider for instance FD (2) of Example 1. The translation of this FD over the binary representation of the DTD is an EGD that involves five joins, whereas over the inlining translation, it can b e written as an FD requiring only one join as shown by (4). In fact, we can show that even for key or XNF dep endencies, the numb er of joins needed for the binary representation is never smaller than under inlining, and can b e arbitrarily higher. Proposition 3. 1. For every DTD D and XML functional dependency f , we have # 1fin # 1fnl . i b 2. For every m > 0, we can find a DTD D and an XML functional dependency f over D, such that # 1fin b # 1fnl i > m. 9. CONCLUSIONS We have studied design and normalization for XML documents stored in relations, rather than in native XML storage facilities (which was the assumption of the previous work on XML normalization). To b e able to compare relational and XML designs, we applied the information-theoretic framework of [5] that can b e used for many different data models, and is indep endent of data model features such as update and query languages (which are still in the development stage for XML). The main conclusions are as follows: 1. The XML normal form XNF, prop osed in [4] as a generalization of BCNF for XML documents, achieves the b est p ossible design from the p oint of view of eliminating redundancies in b oth native and relational storage of XML. Note that algorithms for converting XML designs into XNF exist. 11. REFERENCES [1] Xyleme XML server & business document management. http://www.xyleme.com/page/xml storage/. [2] S. Abiteb oul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. [3] M. Arenas, W. Fan, and L. Libkin. On verifying consistency of XML sp ecifications. In PODS, pages 259­270, 2002. [4] M. Arenas and L. Libkin. A normal form for XML documents. ACM Trans. Database Syst., 29:195­232, 2004. 1090 WWW 2007 / Track: XML and Web Data [5] M. Arenas and L. Libkin. An information-theoretic approach to normal forms for relational and XML data. J. ACM, 52(2):246­283, 2005. [6] D. Barb osa, J. Freire, and A. O. Mendelzon. Designing information-preserving mapping schemes for XML. In VLDB, pages 109­120, 2005. [7] C. Beeri, P. A. Bernstein, and N. Goodman. A sophisticate's introduction to database normalization theory. In VLDB, pages 113­124, 1978. [8] P. A. Bernstein. Synthesizing third normal form relations from functional dep endencies. ACM Trans. Database Syst., 1(4):277­298, 1976. [9] J. Biskup. Achievements of relational database schema design theory revisited. In Semantics in Databases, pages 29­54, 1995. [10] P. Buneman, S. B. Davidson, W. Fan, C. S. Hara, and W. C. Tan. Keys for XML. In World Wide Web, pages 201­210, 2001. [11] R. Cavallo and M. Pittarelli. The theory of probabilistic databases. In VLDB, pages 71­81, 1987. [12] E. F. Codd. Normalized database structure: A brief tutorial. In ACM SIGFIDET Workshop on Data Description, Access and Control, 1971. [13] E. F. Codd. Further normalization of data base relational model. In Courant Computer Science Symposium 6: Data Base Systems, pages 33­64, 1972. [14] E. F. Codd. Recent investigations in relational database systems. In Information Processing, pages 1017­1021, 1974. [15] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & sons, 1991. [16] M. M. Dalkilic and E. L. Rob erston. Information dep endencies. In PODS, pages 245­253, 2000. [17] S. Davidson, W. Fan, C. Hara, and J. Qin. Propagating XML constraints to relations. In ICDE, pages 543­554, 2003. [18] A. Deutsch, M. Fernandez, and D. Suciu. Storing semistructured data with STORED. In SIGMOD, pages 431­442, 1999. [19] A. Deutsch and V. Tannen. MARS: A system for publishing XML from mixed and redundant storage. In VLDB, pages 201­212, 2003. [20] D. W. Embley and W. Y. Mok. Developing XML documents with guaranteed "good" prop erties. In ER'01, pages 426­441. [21] D. Florescu and D. Kossmann. Storing and querying XML data using an RDBMS. IEEE Data Eng. Bul letin, 22(3):27­34, 1999. [22] P. C. Kanellakis. Elements of relational database theory. pages 1073­1156, 1990. [23] C.-C. Kanne and G. Moerkotte. A linear-time algorithm for optimal tree sibling partitioning and approximation algorithms in Natix. In VLDB, 2006. [24] M. Kifer, A. Bernstein, and P. M. Lewis. Database systems : an application-oriented approach. Addison-Wesley, 2006. [25] S. Kolahi and L. Libkin. On redundancy vs dep endency preservation in normalization: An information-theoretic study of 3NF. In PODS, pages 114­123, 2006. [26] R. Krishnamurthy, R. Kaushik, and J. F. Naughton. Session: Parsing, Normalizing, and Storing XML XML-to-SQL query translation literature: The state of the art and op en problems. In XSym'03, LNCS 2824, pages 1­18, 2003. T. T. Lee. An information-theoretic analysis of relational databases - part i: Data dep endencies and information metric. IEEE Trans. Software Eng., 13(10):1049­1061, 1987. M. Levene, M. Levene, and G. Loizou. A Guided Tour of Relational Databases and Beyond. Springer-Verlag, London, UK, 1999. M. Levene and G. Loizou. Why is the snowflake schema a good data warehouse design? Inf. Syst., 28(3):225­240, 2003. Z. H. Liu, M. Krishnaprasad, and V. Arora. Native XQuery processing in oracle XMLDB. In SIGMOD, pages 828­833, 2005. M. Nicola and B. van der Linden. Native XML supp ort in DB2 universal database. In VLDB, pages 1164­1174, 2005. A. Schmidt, M. L. Kersten, M. Windhouwer, and F. Waas. Efficient relational storage and retrieval of XML documents. In Selected papers from the Third International Workshop WebDB 2000 on The World Wide Web and Databases, pages 137­150. Springer-Verlag, 2001. J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. Naughton. Relational databases for querying XML documents: Limitations and opp ortunities. In VLDB, pages 302­314, 1999. I. Tatarinov, S. D. Viglas, K. Beyer, J. Shanmugasundaram, E. Shekita, and C. Zhang. Storing and querying ordered XML using a relational database system. In SIGMOD, pages 204­215, 2002. M. W. Vincent and J. Liu. Functional dep endencies for XML. In APWEB, pages 22­34, 2003. M. W. Vincent, J. Liu, and C. Liu. A redundancy free 4NF for XML. In XSym, pages 254­266, 2003. M. W. Vincent, J. Liu, and C. Liu. Strong functional dep endencies and their application to normal forms in XML. ACM TODS, 29(3):445­462, 2004. J. Wang and R. W. Top or. Removing XML data redundancies using functional and equality-generating dep endencies. In Australian Database Conference, pages 65­74, 2005. M. Yoshikawa, T. Amagasa, T. Shimura, and S. Uemura. XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Trans. Inter. Tech., 1(1):110­141, 2001. C. Zaniolo. A new normal form for the design of relational database schemata. ACM Transactions on Database Systems, 7(3):489­499, 1982. [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] APPENDIX A. DEFINITION OF THE INFORMATIONTHEORETIC MEASURE A.1 Basics of information theory The main concept of information theory is that of entropy, which measures the amount of information provided by a 1091 WWW 2007 / Track: XML and Web Data certain event. Assume that an event can have n different outcomes s1 , . . ., sn . Then for a probability space A = ({s1 , . . . , sn }, PA ), where PA is a probability distribution, its entropy is defined as H (A) = in PA (si ) log 1 . PA (si ) Session: Parsing, Normalizing, and Storing XML {p}, ai is either vi or the value in the i-th p osition of I . We make this into a probability space A(I , p) = ((I , p), Pu ) with the uniform distribution Pu (a) = 21-n . ¯ We next define conditional probabilities Pk (a | a) that ¯ show how likely a is to occur in p osition p, if values are removed from I according to the tuple a (I , p) Let I(a,a) b e ¯ ¯ obtained from I by putting a in p osition p, and ai in p osition i = p. A substitution is a map : a [1, k] that assigns ¯ a value to each ai which is a variable, and leaves other ai s intact. We let SATk (I(a,a) ) b e the set of all substitutions ¯ such that (I(a,a) ) |= and | (I(a,a) )| = |I | (the latter ¯ ¯ ensures that no two tuples collapse as the result of applying ). Then Pk (a | a) is defined as: ¯ Pk (a | a) ¯ = b |SATk (I(a,a) )| ¯ . k ¯ [1,k] |SAT (I(b,a) )| =1 For probabilities that are zero, we adopt the convention that 1 1 0 log 0 = 0, since limx0 x log x = 0. It is known that 0 H (A) log n, with H (A) = log n only for the uniform distribution PA (si ) = 1/n [15]. We shall also need the concept of conditional entropy of B assuming A. Supp ose we have two probability spaces A B = ({s1 , . . . , sn }, PA ), = ({s1 , . . . , sm }, PB ) and probabilities P (sj , si ) of all the events (sj , si ). Note that PA and PB need not b e indep endent. Then the conditional entropy of B given A, denoted by H (B | A), gives the average amount of information provided by B if A is known [15]. If P (s j | s i ) = P (s j , s i ) PA (si ) With this, we define Rick (p | ) as I 1a . ¯ 1 Pk (a | a) log ¯ 2n-1 Pk (a | a) ¯ a(I ,p) [1,k] 1 Since ¯ [1,k] Pk (a | a) log Pk (a|a) measures the amount of ¯ information in p, given constraints and some missing values in I , represented by the variables in a, our measure ¯ k RicI (p | ) is the average such amount over all a (I , p). ¯ To see that Rick (p | ) is a conditional entropy, define I 1¯ Pk (a) = n-1 Pk (a | a) . ¯ 2 a are conditional probabilities, then the conditional entropy is defined as . jm in P 1 j H (B | A) = P (s | si ) log A (s i ) P (s j | s i ) =1 =1 a(I ,p) A.2 Relative information content We now give a detailed definition of relative information content from [5] that was used to justify relational and XML normal forms [5, 25]. Unlike other prop osed information-theoretic measures [27, 11, 16, 29] that work only at the level of data, this measure takes into account b oth data and schema constraints. Fix a schema S and a set of constraints, and let I inst (S, ). We want to define RicI (p | ), the relative information content of a p osition p Pos (I ) with resp ect to the set of constraints . Formally, we assume that I has n p ositions (which we enumerate as 1, . . . , n), and fix an n-element set of variables {vi | 1 i n}. Let (I , p) b e the set of all 2n-1 vectors (a1 , . . . , ap-1 , ap+1 , . . . , an ) such that for every i [1, n] - It is a probability distribution on [1, k] (intuitively, it says how likely an element from [1, k] is to satisfy when put in p osition p, given all p ossible interactions b etween p and k sets of p ositions in I ). If B (I , p) is the probability space ([1, k], Pk ), then Rick (p | ) is the conditional entropy: I k Rick (p | ) = H (B (I , p) | A(I , p)). I k Since the domain of B (I , p) is [1, k], we have 0 Rick (p | I ) log k. To normalize this, we consider the ratio k RicI (p | )/ log k. The key observation of [5] is that for most reasonable constraints (certainly for all definable in first-order logic), this sequence converges as k , and we thus define RicI (p|) = lim Rick (p | ) I . k log k 1092