WWW 2007 / Track: Semantic Web Session: Query Languages and DBs Bridging the Gap Between OWL and Relational Databases Boris Motik University of Manchester Manchester, UK Ian Horrocks University of Manchester Manchester, UK Ulrike Sattler University of Manchester Manchester, UK ABSTRACT Schema statements in OWL are interpreted quite differently from analogous statements in relational databases. If these statements are meant to b e interpreted as integrity constraints (ICs), OWL's interpretation may seem confusing and/or inappropriate. Therefore, we prop ose an extension of OWL with ICs that captures the intuition b ehind ICs in relational databases. We discuss the algorithms for checking IC satisfaction for different typ es of knowledge bases, and show that, if the constraints are satisfied, we can disregard them while answering a broad range of p ositive queries. Categories and Subject Descriptors I.2.4 [Knowledge Representation Formalisms and Methods]: OWL General Terms Theory, Languages Keywords Semantic Web, Relational Databases, OWL 1. INTRODUCTION The Web Ontology Language (OWL) is the W3C standard language for modeling ontologies in the Semantic Web. The logical underpinning for OWL is provided by description logics (DLs) [3]. OWL can b e seen as an expressive schema language; however, its axioms have a different meaning from analogous statements in relational databases. When OWL axioms are meant to b e interpreted as integrity constraints (ICs), the formal semantics of OWL may seem confusing and/or inappropriate. To understand the nature of the problem, consider an application for managing tax returns in which each p erson is required to have a social security numb er. In a relational database, this would b e captured by an inclusion dep endency stating that a social security numb er must exist for each p erson. During database up dates, such a dep endency is interpreted as a check: whenever a p erson is added to the database, a check is p erformed whether that p erson's social This work was funded by the EPSRC pro ject REOL. Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 8­12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005. security numb er has b een sp ecified as well; if not, the up date is rejected. An analogous dep endency can b e expressed in OWL by means of an existential restriction, but will result in a quite different b ehavior: adding a p erson without a social security numb er to the ontology does not raise an error, but only leads to the inference that the p erson in question has some (unknown) social security numb er. In Section 2, we study this problem by comparing OWL with relational databases w.r.t. their schema languages, reasoning problems, and constraint checking. Our analysis suggests that OWL ontologies can b e understood as incomplete databases [15]. Many databases encountered in practice are, however, complete. To obtain a flexible schema language, we would like to explicitly control the degree of incompleteness. Integrity constraints (ICs)--formulae that check whether all necessary information has b een provided explicitly--seem to b e the prop er mechanism for this purp ose. Extending logic-based KR formalisms with database-like ICs has already b een considered in the literature. Reiter observed that ICs are not ob jective sentences ab out the world, but are epistemic sentences ab out the database state [23]. Based on this idea, constraints have b een added to DLs using nonmonotonic modal [11] or rule extensions [24, 20]. Such ICs, however, just constrain the shap e of an ABox; they are not a part of the core conceptual model and do not affect schema reasoning at all. In contrast, ICs have a dual role in relational databases: they describ e the p ossible worlds as well as the states of the databases [1], so they are used b oth in data reasoning (e.g., for checking database integrity) and in schema reasoning (e.g., for computing query subsumption). The semantic relationship b etween these two roles of ICs is clear, which simplifies their usage. In Section 3, we introduce extended DL know ledge bases, which allow a modeler to designate a subset of the TBox axioms as ICs. For schema (TBox) reasoning, these axioms are treated as usual. For data (ABox) reasoning, these axioms do not derive new information; instead, they are interpreted as checks. Thus, the relationship b etween the interpretations of ICs in TBox and ABox reasoning is clear, and is analogous to relational databases. In fact, in Section 4, we show that, if an ABox satisfies the ICs, we can disregard the ICs while answering p ositive ABox queries. In Section 5, we discuss the typical usage patterns of our approach. Finally, in Section 6, we discuss algorithms for checking IC satisfaction. Due to space limitations, we present only the intuition b ehind our algorithms; for details, please refer to [19]. Our notion of ICs solves common problems of OWL-based applications: through constraint satisfaction checking, these 807 WWW 2007 / Track: Semantic Web applications can ensure that all required data has explicitly b een sp ecified in an ontology, and thus detect p otential errors in the data. Furthermore, we exp ect significant p erformance improvements in query answering, esp ecially in applications that would use ICs extensively. We assume the reader to b e familiar with the basics of OWL and DLs; please refer to [3] for an introduction. It is well-known that the OWL DL variant of OWL corresp onds to the DL S HOI N (D). Because of that, in this pap er we refer to OWL and DLs interchangeably. Session: Query Languages and DBs There is a slight technical difference b etween models and database instances: models can typically b e infinite, whereas database instances must typically b e finite. The latter is due to the fact that only finite databases can b e stored in practical systems. For certain classes of dep endencies, the restriction to finite structures is not really relevant: whenever an infinite relational structure satisfying the schema exists, a finite structure exists as well (this is the so-called finite model prop erty). For languages such as OWL, this does not necessarily hold: some ontologies are satisfied only in infinite models [3]. Even though the complexity of finite model reasoning is, for numerous DLs, the same as the complexity of reasoning w.r.t. arbitrary models, the former is usually more involved [18]. Hence, in the rest of this pap er, we drop the restriction to finite database instances and consider models and database instances to b e synonymous. 2. OWL VS. RELATIONAL DATABASES An obvious distinction b etween OWL/DLs and relational databases is that the former use op en-world, whereas the latter use closed-word semantics. In this section, we argue that such a classification is too sup erficial; rather, we should see b oth semantics as complementary to each other. 2.1 Schema Language The schema part of a DL knowledge base is typically called a TBox (terminology b ox), and is a finite set of (p ossibly restricted) universally quantified implications. For example, a TBox can state that each p erson has a social security numb er (SSN), that a p erson can have at most one SSN, and that each SSN can b e assigned to at most one individual. These statements are expressed using these TBox axioms: Person hasSSN .SSN Person 1 hasSSN SSN 1 hasSSN - (1) (2) (3) 2.3 Domains and Typing Relational databases assign typ es to columns of relations; for example, the second p osition of hasSSN could b e restricted to strings of a certain form. Typing is used in practice to determine the physical layout of the database. In contrast, typing is often not considered in theory (e.g., in algorithms for checking query containment); rather, columns draw their values from a common countable domain set [1]. To provide for explicit typing of relationships, the DLs underlying OWL have datatypes --a simplified variant of the so-called concrete domains [4]. In this pap er, we consider neither typ ed relational schemas nor DL knowledge bases with concrete domains, and simply interpret b oth relational schemata and TBoxes in first-order logic. This simplifies b oth formalisms significantly. For example, adding key constraints to untyp ed DLs is not a problem [8], whereas adding them to DLs with typ ed predicates is much more involved [17]. Most DLs can b e seen as decidable fragments of first-order logic [7]. Hence, the axioms (1)­(3) can b e translated into the following first-order formulae: x : [Person (x) y : hasSSN (x, y ) SSN (y )] x, y1 , y2 : [Person (x) hasSSN (x, y1 ) hasSSN (x, y2 ) y1 y2 ] x, y1 , y2 : [SSN (x) hasSSN (y1 , x) hasSSN (y2 , x) y1 y2 ] (4) (5) (6) 2.4 Schema Reasoning Checking subsumption relationships b etween concepts has always b een a central reasoning problem for DLs. A concept C is subsumed by a concept D w.r.t. a DL TBox T if the extension of C is included in the extension of D in each model I of T . This inference has many uses; for example, in ontology modeling, derived subsumption relationships can b e used to detect modeling errors. Concept subsumption has b een used to optimize query answering [13], esp ecially when generalized to subsumption of conjunctive queries [10, 12]. Another imp ortant TBox inference is checking concept satisfiability--that is, determining whether a model of T exists in which a given concept has a nonempty extension. Concepts are usually unsatisfiable due to modeling errors, so this inference is also useful in ontology modeling. Reasoning ab out the schema is certainly not the most prominent feature of relational databases, yet a significant amount of research has b een devoted to it. The most imp ortant schema-related inference in databases is checking query containment [1]: a query Q1 is contained in a query Q2 w.r.t. a schema T if the answer of Q1 is contained in the answer of Q2 for each database instance that satisfies T . This inference is used by virtually all database systems to rewrite queries into equivalent ones that can b e answered more efficiently. Another useful inference is dep endency minimization--that is, computing a minimal schema that is equivalent to the given one. The schema of a relational database is defined in terms of relations and dep endencies. Different typ es of dep endencies have b een considered, such as functional, inclusion, or join dep endencies. As discussed in [1], most dep endencies can b e represented as first-order formulae of the form (7), where and are conjunctions of function-free atoms: x1 , . . . , xn : [ (x1 , . . . , xn ) y 1 , . . . , y m : (x 1 , . . . , x n , y 1 , . . . , y m )] (7) Although the expressivity of DLs underlying OWL and of relational dep endencies is clearly different, the schema languages of the two are quite closely related. In fact, the formula (4) corresp onds to an inclusion dep endency, whereas (5) and (6) corresp ond to key dep endencies. 2.2 Interpreting the Schema DL TBoxes and relational schemas are interpreted according to standard first-order semantics: they distinguish the legal from the illegal relational structures--that is, the structures that satisfy all axioms from the structures that violate some axiom. In DLs, the legal structures are called models, whereas in relational databases they are called database instances ; the underlying principle is, however, the same. 808 WWW 2007 / Track: Semantic Web In b oth DLs and relational databases, schema reasoning problems corresp ond to checking whether some formula holds in each model (i.e., database instance) of T --that is, checking whether T |= . In other words, the terminological problems in b oth DLs and relational databases corresp ond to first-order consequences of a first-order theory. Since the problems are the same, it should not come as a surprise that the methods used to solve them are closely related. Namely, reasoning in DLs is typically p erformed by tableau algorithms [5], whereas the state-of-the-art reasoning technique in relational databases is chase [1]. Apart from notational differences, the principles underlying these two techniques are the same: they b oth try to construct a model that satisfies the schema T but not the formula . To summarize, DLs and databases treat schema reasoning problems in the same way. Thus, DLs can b e understood as expressive but decidable database schema languages. Session: Query Languages and DBs 2.6 Constraint Checking In DLs, one can check whether an ABox A is consistent with a TBox T --that is, whether a model I of b oth A and T exists--and thus detect p ossible contradictions in A and T . This inference is not, however, a suitable basis for constraint checking. For example, let T contain the axioms (1)­(3), and let A contain only the following axiom: Person (Peter ) (8) 2.5 Query Answering Apart from the schema (or TBox) part, a DL knowledge base K typically also has a data (or ABox) part. The main inference for ABoxes is instance checking --that is, checking whether an individual a is contained in the extension of a concept C in each model of K, commonly written as K |= C (a). Instance checking can b e generalized to answering conjunctive queries over DL knowledge bases [10, 12]. Thus, a DL query can b e viewed as a first-order formula with free variables x1 , . . . , xn . Like schema reasoning, the semantics of query answering in DLs is defined as first-order entailment; hence, it takes into account all models of K: a tuple a1 , . . . , an is an answer to over K if K |= [a1 /x1 , . . . , an /xn ], where the formula [a/x] is obtained by replacing in all free occurrences of x with a. Queries in relational databases are first-order formulae (restricted in a way to make them domain indep endent) [1], so they are similar to queries in DLs. A significant difference b etween DLs and relational databases is the way in which the queries are evaluated. Let b e a first-order formula with the free variables x1 , . . . , xn . A tuple a1 , . . . , an is an answer to over a database instance I if I |= [a1 /x1 , . . . , an /xn ]. Hence, unlike in DLs, query answering in relational databases does not consider all databases instances that satisfy the knowledge base K; instead, it considers only the given instance I . In other words, query answering in relational databases is not defined as entailment, but as model checking, where the model is the given database instance. Although the definition of query answering in relational databases from the previous paragraph is the most widely used one, a significant amount of research has also b een devoted to answering queries over incomplete databases [15]-- a problem that is particularly interesting in information integration. An incomplete database K is describ ed by a set A of incomplete extensions of the schema relations and a set T of dep endencies sp ecifying how the incomplete extensions relate to the actual database instance. Queries in incomplete databases are also (p ossibly restricted) first-order formulae. In contrast to complete databases, a tuple a1 , . . . , an is a certain answer to over K if I |= [a1 /x1 , . . . , an /xn ] for each database instance I that satisfies A and T . In other words, query answering in incomplete databases is defined as first-order entailment, exactly as in DLs. Consequently, DL query answering can b e understood as answering queries in incomplete databases. If (1) were interpreted as an integrity constraint, we would exp ect it to b e violated by A T : the ABox states that Peter is a p erson without sp ecifying his social security numb er. The knowledge base A T is, however, satisfiable: the axiom (1) is not interpreted as a check; rather, it implies the existence of some (unknown) SSN. Thus, T describ es the legal models, but not the legal ABoxes. In fact, the DLs underlying OWL do not provide any means to express database-like integrity constraints. In contrast, constraints play a central role in relational databases, where they are used to ensure the integrity of data. As in DLs, a relational schema T is a set of formulae that must hold for any database instance. A relational database instance, however, corresp onds to exactly one model, whereas a DL ABox typically has infinitely many models. Just like query answering, constraint checking in relational databases corresp onds to model checking: given a database instance I , a relational database checks the satisfaction of the schema constraints T by checking whether I |= T .1 In our example, if the ABox A is taken as the database instance I , then, clearly, I |= T --as exp ected, the integrity constraints are not satisfied. 2.7 Discussion OWL users have frequently complained ab out the "op enworld semantics of OWL" or "the unintuitive constraints in OWL." From our exp erience, this is often due to incorrect interpretation of the fundamental assumptions b ehind OWL and relational databases, resp ectively. From the standp oint of conceptual modeling, DLs provide a very expressive, but still decidable language that has proven to b e implementable in practice. The op en-world semantics is natural for a schema language since a schema selects the legal database instances. In fact, when computing the subsumption relationship b etween concepts or queries, we do not have a fixed instance. Therefore, we cannot interpret the schema in either OWL or the relational databases under closed-world assumption; rather, we must employ the op en-world semantics in order to consider all instances. In our exp erience, complaints from OWL users are most common in data-centric applications--that is, applications that focus on the management of large volumes of data. In practice, relational databases are typically complete: any missing information is either encoded metalogically (e.g., users often include fields such as hasSpecifiedSSN to signal that particular data has b een supplied in the database), or it is represented by nul l-values (which are also interpreted outside first-order logic). In contrast, ABoxes in DLs are closely related to incomplete (relational) databases. Clearly, problems may arise if certain asp ects of the information ab out In practice, constraints are incrementally checked after database up dates; these dynamic asp ects are, however, not imp ortant for this discussion. 1 809 WWW 2007 / Track: Semantic Web individuals in ABoxes are exp ected to b e complete. To understand the problems that occur in such cases, consider the following example. Biopax2 is an ontology used for data exchange b etween biological databases. It defines a prop erty NAME and states its domain to b e the union of bioSource , entity , and dataSource : NAME . bioSource entity dataSource (9) Session: Query Languages and DBs Section 2 to DLs. In such an approach, an ABox would b e interpreted as a single model and the TBox axioms as formulae that must b e satisfied in a model; the constraints would b e satisfied if A |= T . Such an approach, however, is not satisfactory, as it requires an "all-or-nothing" choice: we then have to assume that al l information in the ABox is complete; furthermore, TBox axioms could only b e used to check whether an ABox is of an appropriate form and would not imply new facts. To obtain a more versatile formalism, we prop ose a combination of inferencing and constraint checking. For example, let A1 b e the following ABox: Student (Peter ) hasSSN (Peter , nr12345 ) SSN (nr12345 ) Student (Paul ) Furthermore, let T1 b e the following TBox: Student Person Person hasSSN .SSN (16) (17) (12) (13) (14) (15) The intention b ehind this axiom is to define which ob jects can b e named--that is, to ensure that a name is stated only for ob jects of the appropriate typ e. In fact, the data in the Biopax ontology is complete w.r.t. this constraint: each object with a name is also typ ed (sometimes indirectly through the class hierarchy) to at least one of the required classes. The axiom (9), however, does not act as a constraint; instead, it says that, if some ob ject has a name, then it can b e inferred to b e either a bioSource , an entity , or a dataSource . Therefore, (9) cannot b e used to check whether all data is correctly typ ed. Furthermore, since the concept in the consequent is a disjunction, the axiom (9) requires reasoning by case, which is one of the reasons why DL reasoning is intractable [3, Chapter 3 ]. Hence, the axiom (9) causes two typ es of problems: on the one hand, it does not have the desired semantics and, on the other hand, it introduces a p erformance p enalty during reasoning. Representing incomplete information is, however, needed in many applications. Consider the following axiom stating that married p eople are eligible for a tax cut: marriedTo . TaxCut (10) Let us assume that we want to treat (17) as a constraint, but (16) as a normal axiom. Then, we derive Person (Peter ) and Person (Paul ) by (16). The constraint (17) is satisfied for Peter due to (12), (13), and (14); however, an SSN has not b een sp ecified for Paul , so we exp ect (17) to b e violated. Following this intuition, we define extended DL knowledge bases to distinguish the axioms that imply new facts from the ones that check whether all necessary information is derivable. Our definition is applicable to any DL. Definition 3.1. An extended DL knowledge base is a triple K = (S , C , A) such that · S is a finite set of standard TBox axioms, · C is a finite set of constraint TBox axioms, and · A is a finite set of ABox assertions (¬)A(a), R(a, b), a b, or a b, for A an atomic concept, R a role, and a and b individuals. In Definition 3.1, we restrict ourselves to ABoxes with only p ossibly negated atomic concepts. This does not result in any loss of generality b ecause S can b e used to introduce names for nonatomic concepts. Next, we discuss how to define an appropriate semantics for extended DL knowledge bases. The simplest solution is to interpret A S in the standard first-order way and to require C to b e satisfied in each model I for which we have I |= A S . The following example, however, shows that this does not satisfy our intuition. Let A2 contain only the fact (12), S2 = , and let C2 contain only the axiom (17). The interpretation I = {Student (Peter ), Person (Peter )} is a model of A2 S2 that does not satisfy C2 , which would make C2 not satisfied for A2 S2 . Intuitively, though, the fact Person (Peter ) is not implied by A2 S2 , so we should not check whether Peter has an SSN at all; C2 should hold only for the facts that are implied by A2 S2 . These considerations might suggest that C should hold for all first-order consequences of A S . In the example from the previous paragraph, this produces the desired b ehavior: Person (Peter ) is not a consequence of A2 S2 , so the axiom from C2 should not b e checked for Peter . Consider, however, the ABox A3 containing only the following axiom: To apply this axiom, we do not necessarily need to know the name of the sp ouse; we only need to know that a sp ouse exists. Thus, we may state the following fact: (marriedTo .Woman )(Peter ) (11) We are now able to derive that Peter is eligible for a tax cut even without knowing the name of his sp ouse. Providing complete information can b e understood as filling in a "Sp ouse name" b ox on a tax return, whereas providing incomplete information can b e understood as just ticking the "Married" b ox. The existential quantifier can b e understood as a well-b ehaved version of null-values that explicitly sp ecifies the semantics of data incompleteness. For use cases that require reasoning with incomplete information, DLs provide a sound and well-understood foundation. Thus, one would ideally like to b e able to explicitly control "the amount of incompleteness" in an ontology. Such a mechanism would allow the ontology modeler to explicitly state which data must b e fully sp ecified and which can b e left incomplete. This goal can b e achieved through an appropriate form of integrity constraints that check whether all data has b een sp ecified as required. Transforming inappropriate and/or erroneously introduced axioms into integrity constraints should also sp eed up query answering by eliminating unintended and p otentially complex inferences. 3. CONSTRAINTS FOR OWL In this section, we extend DL knowledge bases with ICs in order to overcome the problems discussed in the previous section. Since TBoxes are first-order formulae, it is straightforward to apply the model checking approach describ ed in 2 http://www.biopax.org/ 810 WWW 2007 / Track: Semantic Web Session: Query Languages and DBs Furthermore, let S = and C contain the following axiom: Cat(ShereKahn ) Furthermore, let S3 contain the following axiom: Cat Tiger Leopard Finally, let C3 contain the following two axioms: Tiger Carnivore Leopard Carnivore (18) Woman Man (25) (19) (20) (21) Now neither Tiger (ShereKahn ) nor Leopard (ShereKahn ) is a first-order consequence of A3 S3 , which means that the axioms from C3 are satisfied; furthermore, we have A3 S3 |= Carnivore (ShereKahn ). This answer does not satisfy our intuition: in each model of A3 S3 , either Tiger (ShereKahn ) or Leopard (ShereKahn ) holds, but Carnivore (ShereKahn ) does not necessarily hold in either case. Hence, by treating (20)­(21) as constraints and not as standard axioms, we neither get a constraint violation nor derive the consequence Carnivore (ShereKahn ). Intuitively, the constraints should check whether the facts derivable from A S C are also derivable using A S only. This notion seems to b e nicely captured by minimal models; hence, we check C only w.r.t. the minimal models of A S . Roughly sp eaking, a model I with an interpretation domain I of a formula is minimal if each interpretation I over I such that I I is not a model of , where we consider an interpretation to b e represented by the set of all p ositive ground facts that are true in it. Consider again A2 , S2 , and C2 . The fact Person (Peter ) is not derivable from A2 S2 in any minimal model (in fact, there is only a single minimal model), so the constraint axiom (17) is not violated. In contrast, A3 S3 has exactly two minimal models: I1 = {Cat (ShereKahn ), Tiger (ShereKahn )} I2 = {Cat (ShereKahn ), Leopard (ShereKahn )} These two models can b e viewed as the minimal sets of derivable consequences. The constraint TBox C3 is not satisfied in all minimal models (in fact, it is violated in each of them). In contrast, let A4 = A3 and C4 = C3 , and let S4 contain the following axiom: Cat (Tiger Carnivore ) (Leopard Carnivore ) (22) Now Carnivore (ShereKahn ) is derivable whenever we can derive either Tiger (ShereKahn ) or Leopard (ShereKahn ), so the constraints should b e satisfied. Indeed, A4 S4 has the following two minimal models: I3 = I1 {Carnivore (ShereKahn )} I4 = I2 {Carnivore (ShereKahn )} Both I3 and I4 satisfy C4 . Furthermore, we do not lose any consequences, despite the fact that we treat (20)­(21) as constraints, since the following holds: A4 S4 |= Carnivore (ShereKahn ) Minimal models have b een used, with minor differences, in an extension of DLs with circumscription [6] and in the semantics of op en answer set programs [14]. Consider, however, the following ABox A5 : Woman (Alice ) Man (Bob ) (23) (24) No axiom implies that Alice and Bob should b e interpreted as the same individual, so we exp ect them to b e different "by default" and the constraint to b e satisfied. Now the definitions from [6, 14] consider all interpretation domains, so let I = {}. Because I contains only one object, we must interpret b oth Alice and Bob as . Clearly, I = {Woman (), Man ()} is a minimal model of A5 , and it does not satisfy C5 . This problem might b e remedied by making the unique name assumption (UNA)--that is, by requiring each constant to b e interpreted as a different individual. This is, however, rather restrictive, and is not compatible with OWL. Another solution is to interpret A S in a Herbrand model (i.e., a model in which each constant is interpreted by itself ) where is a congruence relation; then, we minimize the interpretation of together with all the other predicates. In such a case, the only minimal model of A5 is I = {Woman (Alice ), Man (Bob )} since the extension of is empty due to minimization, so C5 is satisfied in I . Unfortunately, existential quantifiers p ose a whole range of problems for constraints. Let A6 contain these axioms: HasChild (Peter ) HasHappyChild (Peter ) TwoChildren (Peter ) Furthermore, let S6 contain these axioms: HasChild hasChild .Child HasHappyChild hasChild .(Child Happy ) Finally, let C6 contain the following constraint: TwoChildren 2 hasChild .Child (31) (29) (30) (26) (27) (28) It seems intuitive for C6 to b e satisfied in A S6 : no axiom in S6 forces the children of Peter --the two individuals whose existence is implied by (29) and (30)--to b e the same, so we might conclude that they are different. Now consider the following quite similar example. Let C7 = C6 , and let A7 contain the following axioms: HasChild (Peter ) TwoChildren (Peter ) Furthermore, let S7 contain the following axiom: HasChild hasChild .Child hasChild .Child (34) (32) (33) As in the previous example, C7 is satisfied in A7 S7 since (34) introduces two (p ossibly identical) individuals in the extension of Child . Let S7 b e a standard TBox containing only the axiom (35): HasChild hasChild .Child (35) Now C7 is not satisfied in A S7 since (35) implies the exis tence of only one child. Given that S7 is semantically equivalent to S7 , this is rather unsatisfactory; furthermore, it suggests that C7 should not b e satisfied in A7 S7 , since (34) requires the existence of only one individual. Recall, however, that S6 and S7 are quite closely related: the effect of (34) with resp ect to Child is the same as that of (29) and (30). Hence, if (34) should introduce only one individual, 811 WWW 2007 / Track: Semantic Web then (29) and (30) should do so as well, which is in conflict with our intuition that C6 should b e satisfied in A6 S6 . Thus, our intuition does not give us a clear answer as to the appropriate treatment of existential quantifiers in the standard TBox: the names of the concepts and the structure of the axioms suggest that the existential quantifiers in (29) and (30) should introduce different individuals, whereas the existential quantifiers in (34) should "reuse" the same individual. These two readings pull in opp osite directions, so a choice b etween the two should b e based on other criteria. The example involving S7 and S7 reveals an imp ortant disadvantage of one p ossible choice: if we require each existential quantifier to introduce a distinct individual, then it is p ossible for a constraint TBox C to b e satisfied in A S , but not in A S , even though S and S are semantically equivalent. As we have seen, C7 is satisfied in A7 S7 , but not in A7 S7 , even though S7 and S7 are equivalent. It is clearly undesirable for IC satisfaction to dep end on the syntactic structure of the standard TBox. Introducing distinct individuals for each existential quantifier can b e justified by skolemization [21], the well-known process of representing existential quantifiers with new function symb ols. For example, for = y : [R(x, y ) C (y )], by skolemization we obtain sk() = R(x, f (x)) C (f (x)): the variable y is replaced by a term f (x), for f a new function symb ol. Skolemized formulae are usually interpreted in Herbrand models, whose domain consists of all ground terms built from constants and function symb ols in the formula. If the formula contains at least one function symb ol, then Herbrand models are infinite; furthermore, the models of DL axioms are forest-like (i.e., they can b e viewed as trees p ossibly interconnected at roots). We use these prop erties in our procedure for checking IC satisfaction in Section 6. Definition 3.2. Let be a first-order formula and sk() the formula obtained by outer skolemization of [21]. A Herbrand interpretation w.r.t. is a Herbrand interpretation defined over the signature of sk(). A Herbrand interpretation I w.r.t. is a model of , written I |= , if it satisfies in the usual sense. A Herbrand model I of is minimal if I |= for each Herbrand interpretation I such that I I . We write sk() |=MM if I |= for each minimal Herbrand model I of . We now define the notion of IC satisfaction. We use an op erator that translates a set of DL axioms S into an equivalent formula (S ) of first-order logic with equality and counting quantifiers [3, 7]. Definition 3.3. Let K = (S , C , A) be an extended DL know ledge base. The constraint TBox C is satisfied in K if sk( (A S )) |=MM (C ). By an abuse of notation, we often omit and simply write sk(A S ) |=MM C . Note that the addition of constraints does not change the semantics of DLs or OWL: Definition 3.3 is only concerned with the semantics of constraints, and a traditional knowledge base (T , A) can b e seen as an extended knowledge base (T , , A). For subsumption and concept satisfiability tests, we can use S C together as the schema, as usual. As discussed ab ove, skolemization introduces a new function symb ol for each existential quantifier, which effectively introduces a new individual for each quantifier. We invite the reader to convince himself that Definition 3.3 closely follows Session: Query Languages and DBs our intuition on the examples presented thus far. Furthermore, in Section 4 we show that, if the constraints are satisfied, we can throw them away without losing any p ositive consequences; that is, we can answer p ositive queries by taking into account only A and S . We take this as confirmation that our semantics of IC satisfaction is intuitive. We now discuss a nonobvious consequence of our semantics. Let A8 b e an ABox with only the following axioms: Vegetarian (Ian ) eats (Ian , soup ) (36) (37) Furthermore, let S8 = , and let C8 contain only the following constraint: Vegetarian eats .¬Meat (38) One might intuitively exp ect C8 not to b e satisfied for A8 since the ABox does not state ¬Meat (soup ). Contrary to our intuition, C8 is satisfied in A8 : the interpretation I containing only the facts (36) and (37) is the only minimal Herbrand model of A8 and I |= C8 . In fact, the axiom (38) is equivalent to the following axiom: Vegetarian eats .Meat (39) When written in the latter form, the axiom should b e intuitively satisfied, since Meat (soup ) is not derivable. As this example illustrates, the intuitive meaning of constraints is easier to grasp if we transform them into the form C D, where b oth C and D are negation-free concepts. Namely, our constraints check the p ositive facts. To check negative facts, we must give them atomic names. Let A9 = A8 ; furthermore, let S9 contain the following axiom: NotMeat ¬Meat Finally, let C9 contain the following axiom: Vegetarian eats .NotMeat (41) (40) The constraint (40) is now of the "p ositive" form C D , so it is easier to understand the intuition b ehind it: everything that is eaten by an instance of Vegetarian should provably b e NotMeat . Now, A9 S9 has the following two minimal models, and I5 |= C9 , so C9 is not satisfied for A9 : I5 = {Vegetarian (Ian ), eats (Ian , soup ), Meat (soup )} I6 = {Vegetarian (Ian ), eats (Ian , soup ), NotMeat (soup )} If we add to A9 the fact NotMeat (soup ), then only I6 is a minimal model, and C9 b ecomes satisfied as exp ected. Hence, it is advisable to restrict constraints to p ositive formulae in order to avoid such misunderstandings. We finish this section with a note that different applications might choose to treat different subsets of the same ontology as ICs. In practice, this might b e addressed by a mechanism that allows one to create an application-sp ecific view of an OWL ontology. The discussion of such a mechanism is, however, out of scop e of this pap er; here, we focus on the semantic and computational asp ects of ICs. 4. CONSTRAINTS AND QUERIES We now state an imp ortant result ab out answering unions of p ositive conjunctive queries in extended DL knowledge bases: if the ICs are satisfied, we need not consider them in query answering. This shows that our semantics of IC satisfaction is reasonable: constraints are checks and, if they 812 WWW 2007 / Track: Semantic Web are satisfied, we can discard them without losing relevant consequences. Moreover, this result is practically imp ortant b ecause it simplifies query answering. In Section 6, we show that, for certain typ es of OWL ontologies, b oth checking IC satisfaction and query answering can b e easier than standard DL reasoning. Before proceeding, we first remind the reader of the definition of unions of conjunctive queries. Definition 4.1. Let x be a set of distinguished and y a set of nondistinguished variables. A conjunctive query Q(x, y) is a finite conjunction of positive atoms of the form A(t1 , . . . , tm ), where ti are either constants, distinguished, or nondistinguished variables.3 A union of n conjunctive queries is the formula (x) = n 1 yi : Qi (x, yi ). A tuple i= of constants c is an answer to (x) over a DL know ledge base K, written K |= (c), if (K) |= (x)[c/x]. Session: Query Languages and DBs following example demonstrates. Let S11 = , C11 contain the axiom Cat Dog (48) and A11 contain the axiom (47). By taking S11 into account, we get the following inference: S11 C11 A11 |= ¬Dog (Garfield ) Furthermore, the constraint is satisfied; however, if we disregard C11 , we lose this consequence: S11 A11 |= ¬Dog (Garfield ) A similar example can b e given for queries containing universal quantifiers. Theorem 4.2 has an imp ortant implication with resp ect to TBox reasoning. Let 1 (x) and 2 (x) b e unions of conjunctive queries such that (K) |= x : [1 (x) 2 (x)]. Provided that C is satisfied in K, each answer to 1 (x) w.r.t. A S is also an answer to 2 (x) w.r.t. A S . To summarize, we can check subsumption of unions of conjunctive as usual, by treating C S as an ordinary DL TBox. Subsequently, for knowledge bases that satisfy C , we can ignore C when answering queries, but query answers will still satisfy the established subsumption relationships b etween queries. Ï Our result is captured by the following theorem, whose proof can b e found in [19]: Theorem 4.2. Let K be an extended DL know ledge base that satisfies C . Then, for any union of conjunctive queries (x) over K and any tuple of constants c, we have A S C |= (c) if and only if A S |= (c). Consider, for example, the following knowledge base. Let the standard TBox S10 contain the following axioms: Cat Pet hasPet .Pet PetOwner (42) (43) 5. TYPICAL USAGE PATTERNS The notion of ICs from Section 3 is very general. To provide practical guidance for modelers, we discuss in this section the typical usage patterns. Let the constraint TBox C10 contain the following axiom: CatOwner hasPet .Cat (44) 5.1 Participation Constraints Participation constraints involve two concepts C and D and a relation R b etween them, and they state that each instance of C must participate in one or more R-relationships with instances of D; often, they also define the cardinality of the relationship. The general form of such constraints is as follows, where {, , =} and n is a nonnegative integer: C n R.D (49) Finally, let the ABox A10 contain the following assertions: CatOwner (John ) hasPet (John , Garfield ) Cat (Garfield ) (45) (46) (47) Under the standard semantics, we can draw the following conclusion from K: S10 C10 A10 |= PetOwner (John ) Furthermore, it is easy to see that the constraint (44) is satisfied in K: the only derivable fact ab out CatOwner is CatOwner (John ) and the ABox contains the explicit information that John owns Garfield who is a Cat . Therefore, we do not need the axiom (44) to imply the existence of the owned cat: whenever we can derive CatOwner (x) for some x, we can derive the information ab out the cat of x as well. Hence, we can disregard (44) during query answering, but our conclusion holds just the same: S10 A10 |= PetOwner (John ) Note that b oth entailments in Theorem 4.2 use the standard semantics of DLs; that is, we do not assume a closedworld semantics for query answering. Furthermore, Theorem 4.2 does not guarantee preservation of negative consequences; in fact, such consequences may change, as the 3 The predicate A can b e the equality predicate , an atomic concept, a role, or an n-ary predicate in case of n-ary DLs. Participation constraints are similar to inclusion dep endencies in relational databases. A typical participation constraint is the axiom (1), which states that each p erson must have an explicitly sp ecified SSN. Another example is the following statement, which allowes each p erson to have at most one sp ouse: Person 1 marriedTo .Person (50) To understand the difference in treating (50) as a standard axiom or a constraint, consider the following ABox A: Person (Peter ) marriedTo (Peter , Ann ) marriedTo (Peter , Mary ) (51) (52) (53) If (50) were a part of the standard TBox S , then A S would b e satisfiable; furthermore, due to (50), we would derive Ann Mary . If we put (50) into the constraint TBox C , then the only minimal model of A contains exactly the facts (51)­(53). Namely, the equality predicate is minimized as well, so Ann is different from Mary . This matches our intuition b ecause no other knowledge requires Ann and Mary to b e the same. Thus, Peter is married to two different p eople, so the constraint (50) is not satisfied in A. 813 WWW 2007 / Track: Semantic Web Session: Query Languages and DBs Now in any minimal model of S A, the assertions of the form (60) ensure that O is interpreted exactly as the set of all known ob jects. Hence, (58) ensures that each PersonTR is known, and (59) ensures that the social security numb er for each p erson is known as well. One can ob ject that this solution is not completely modeltheoretic: it requires asserting (60) for each known individual, which is a form of procedural preprocessing. We agree that our solution is not completely clean in that sense; however, we b elieve that it is simple to understand and implement and is therefore acceptable. For TBox reasoning, statements of the form (60) are, by definition, not taken into account. Finally, instead of the axioms (60), one could b e tempted to use the following statement, where ai are all individuals from the ABox: O {a1 , . . . , an } (61) 5.2 Typing Constraints Constraints can b e used to check whether ob jects are correctly typ ed. A typical example of such constraints are domain and range restrictions: for a role R and a concept C , they state that R-links can only p oint from or to ob jects that are explicitly typ ed as C . In this way, these constraints act as checks, saying that R-relationships can b e asserted only for ob jects in C . The general form of domain constraints is R. C whereas for range constraints it is R.C. (55) (54) A typical example of a domain constraint is (9), which ensures that a name can b e given only to ob jects that are either bioSource , entity , or dataSource . Another example is the following axiom, which states that it is only p ossible to b e married to a Person : marriedTo .Person (56) To understand the difference in treating (56) as a standard axiom or a constraint, consider an ABox A containing only the fact (52). If (56) were a part of the standard TBox S , then A S would b e satisfiable; furthermore, due to (56), we would derive Person (Ann ). If we put (56) into the constraint TBox C , then the only minimal model of A contains only the fact (52). Thus, Ann is not explicitly typ ed to b e a Person , so the IC (56) is not satisfied in A. This, however, requires nominals in the DL language, which makes reasoning more difficult [25]. Furthermore, since O occurs only in the constraint axioms, the axioms of the form (60) are sufficient: the minimal model semantics ensures that O contains exactly the individuals a1 , . . . , an . 6. CHECKING IC SATISFACTION We now present an algorithm for checking IC satisfaction. Our algorithm uses an alternative characterization of Definition 3.3 based on logic programming. Note that, since the semantics of other reasoning problems is defined as usual, they can b e solved using existing algorithms. Definition 6.1. Let K = (S , C , A) be an extended DL know ledge base, = sk( (S )),4 and be the result of transforming into conjunctive normal form. Then, LP(S ) is the logic program obtained by ( i) converting each clause ¬A1 . . . ¬An B1 . . . Bm from into a rule A1 . . . An B1 . . . Bm ; ( ii) adding an atom HU (x) to the body of each rule in which the variable x occurs in the head but not in the body; ( iii) adding a fact HU (c) for each constant c; and ( iv) adding the fol lowing rule for each n-ary function symbol f : HU (x1 ) . . . HU (xn ) HU (f (x1 , . . . , xn )) Furthermore, CN() is the stratified datalog program defined as fol lows, where µ and sub are defined in Table 1: CN() = µ() 5.3 Restrictions to Known Individuals Sometimes, we might want to check whether certain objects are known by name. For example, an application for the management of tax returns might deal with two typ es of p eople: those who have submitted a tax return for processing, and those who are somehow related to the p eople from the first group (e.g., their sp ouses or children). For the application to function prop erly, it might not b e necessary to explicitly sp ecify the SSN for all p eople; only the SSNs for the p eople from the first group are of imp ortance. In such an application, we might use axioms (1)­(3) not as ICs, but as elements of the standard TBox S . Furthermore, to distinguish p eople who have submitted the tax return, we would introduce a concept PersonTR for p ersons with a tax return and would make it a subset of Person in S : PersonTR Person (57) Two things should hold for each instance of PersonTR : first, we might require each such p erson to b e explicitly known by name, and second, we might require the SSN of each such p erson to b e known by name as well. Although ICs can b e used to check whether an individual is present in an interpretation, they cannot distinguish named (known) from unnamed (unknown) individuals. We can, however, solve this problem using the following "trick." We can use a sp ecial concept O to denote all individuals known by name and state the following two constraints: PersonTR O PersonTR hasSSN .(O SSN ) (58) (59) CN( ) sub() Final ly, let CN(C ) = CN( (C )) and EC = E(C) . The following theorem, whose proof is given in [19], shows that IC satisfaction can b e checked using logic programming. Theorem 6.2. The constraint TBox C is satisfied in an an extended DL know ledge base K = (S , C , A) if and only if A LP(S ) CN(C ) |=c EC . The program LP(S ) is obtained by translating S into a first-order formula (S ), skolemizing it, and translating the result into conjunctive normal form. The last step does not 4 Please rememb er that is an op erator that translates a set of DL axioms into a first-order formula. Furthermore, we add the following ABox assertion for each individual a occurring in an ABox: O (a ) (60) 814 WWW 2007 / Track: Semantic Web Session: Query Languages and DBs 1 A (t 1 , . . . , t m ) 2 ¬ 1 2 3 4 5 6 7 1 2 y : y : k y : Table 1: Translating Formulae into Constraints µ() A (t 1 , . . . , t m ) E (x 1 , . . . , x n ) HU (x1 ) . . . HU (xn ) not E (x1 , . . . , xn ) E (x1 , . . . , xn ) E 1 (y 1 , . . . , y m ) E 2 (z 1 , . . . , z k ) E (x 1 , . . . , x n ) HU (x1 ) . . . HU (xn ) E1 (y1 , . . . , ym ) E (x1 , . . . , xn ) HU (x1 ) . . . HU (xn ) E2 (z1 , . . . , zk ) E (x1 , . . . , xn ) E (y 1 , . . . , y m ) E (x 1 , . . . , x n ) HU (x1 ) . . . HU (xn ) not Ey :¬ (x1 , . . . , xn ) E (x1 , . . . , xn ) sub() { } {1 , 2 } {1 , 2 } { } {y : ¬ } { } ÎE k (y1 , . . . , ym )[y i /y ] Î not y i y j E (x1 , . . . , xn ) i=1 1i