WWW 2007 / Track: Semantic Web
Session: Query Languages and DBs
From SPARQL to Rules (and back)
Axel Polleres
Universidad Rey Juan Carlos Tulipan s/n, 28933 Mostoles, Madrid, Spain ´ ´
axel@polleres.net
ABSTRACT
As the data and ontology layers of the Semantic Web stack have achieved a certain level of maturity in standard recommendations such as RDF and OWL, the current focus lies on two related asp ects. On the one hand, the definition of a suitable query language for RDF, SPARQL, is close to recommendation status within the W3C. The establishment of the rules layer on top of the existing stack on the other hand marks the next step to b e taken, where languages with their roots in Logic Programming and Deductive Databases are receiving considerable attention. The purp ose of this pap er is threefold. First, we discuss the formal semantics of SPARQL extending recent results in several ways. Second, we provide translations from SPARQL to Datalog with negation as failure. Third, we prop ose some useful and easy to implement extensions of SPARQL, based on this translation. As it turns out, the combination serves for direct implementations of SPARQL on top of existing rules engines as well as a basis for more general rules and query languages on top of RDF.
with defining asp ects such as a formal semantics or layering on top of OWL and RDFS. As for the second part, the RIF working group 1 , who is resp onsible for the rules layer, is just producing first concrete results. Besides asp ects like business rules exchange or reactive rules, deductive rules languages on top of RDF and OWL are of sp ecial interest to the RIF group. One such deductive rules language is Datalog, which has b een successfully applied in areas such as deductive databases and thus might b e viewed as a query language itself. Let us briefly recap our starting p oints: Datalog and SQL. Analogies b etween Datalog and relational query languages such as SQL are well-known and studied. Both formalisms cover UCQ (unions of conjunctive queries), where Datalog adds recursion, particularly unrestricted recursion involving nonmonotonic negation (aka unstratified negation as failure). Still, SQL is often viewed to b e more p owerful in several resp ects. On the one hand, the lack of recursion has b een partly solved in the standard's 1999 version [20]. On the other hand, aggregates or external function calls are missing in pure Datalog. However, also developments on the Datalog side are evolving and with recent extensions of Datalog towards Answer Set Programming (ASP) a logic programming paradigm extending and building on top of Datalog lots of these issues have b een solved, for instance by defining a declarative semantics for aggregates [9], external predicates [8]. The Semantic Web rules layer. Remarkably, logic programming dialects such as Datalog with nonmonotonic negation which are covered by Answer Set Programming are often viewed as a natural basis for the Semantic Web rules layer [7]. Current ASP systems offer extensions for retrieving RDF data and querying OWL knowledge bases from the Web [8]. Particular concerns in the Semantic Web community exist with resp ect to adding rules including nonmonotonic negation [3] which involve a form of closed world reasoning on top of RDF and OWL which b oth adopt an op en world assumption. Recent prop osals for solving this issue suggest a "safe" use of negation as failure over finite contexts only for the Web, also called scoped negation [17]. The Semantic Web query layer SPARQL. Since we base our considerations in this pap er on the assumption that similar corresp ondences as b etween SQL and Datalog can b e established for SPARQL, we have to observe that SPARQL inherits a lot from SQL, but there also remain substantial differences: On the one hand, SPARQL does not deal with nested queries or recursion, a detail which is indeed surpris1
Categories and Subject Descriptors
H.2.3 [Languages]: Query Languages; H.3.5 [Online Information Services]: Web-based services
General Terms
Languages, Standardization
Keywords
SPARQL, Datalog, Rules
1.
INTRODUCTION
After the data and ontology layers of the Semantic Web stack have achieved a certain level of maturity in standard recommendations such as RDF and OWL, the query and the rules layers seem to b e the next building-blocks to b e finalized. For the first part, SPARQL [18], W3C's prop osed query language, seems to b e close to recommendation, though the Data Access working group is still struggling An extended technical rep ort of this article is available at http://www.polleres.net/publications/.
Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 812, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.
http://www.w3.org/2005/rules/wg
787
WWW 2007 / Track: Semantic Web
ing by the fact that SPARQL is a graph query language on RDF where, typical recursive queries such as transitive closure of a prop erty might seem very useful. Likewise, aggregation (such as count, average, etc.) of ob ject values in RDF triples which might app ear useful have not yet b een included in the current standard. On the other hand, subtleties like blank nodes (aka bNodes), or optional graph patterns, which are similar but (as we will see) different to outer joins in SQL or relational algebra, are not straightforwardly translatable to Datalog. The goal of this pap er is to shed light on the actual relation b etween declarative rules languages such as Datalog and SPARQL, and by this also provide valuable input for the currently ongoing discussions on the Semantic Web rules layer, in particular its integration with SPARQL, taking the likely direction into account that LP style rules languages will play a significant role in this context. Although the SPARQL sp ecification does not seem 100% stable at the current p oint, just having taken a step back from candidate recommendation to working draft, we think that it is not too early for this exercise, as we will gain valuable insights and p ositive side effects by our investigation. More precisely, the contributions of the present work are: · We refine and extend a recent prop osal to formalize the semantics of SPARQL from P´rez et al. [16], presente ing three variants, namely c-joining, s-joining and bjoining semantics where the latter coincides with [16], and can thus b e considered normative. We further discuss how asp ects such comp ositionality, or idemp otency of joins are treated in these semantics. · Based on the three semantic variants, we provide translations from a large fragment of SPARQL queries to Datalog, which give rise to implementations of SPARQL on top of existing engines. · We provide some straightforward extensions of SPARQL such as a set difference op erator MINUS, and nesting of ASK queries in FILTER expressions. · Finally, we discuss an extension towards recursion by allowing bNode-free-CONSTRUCT queries as part of the query dataset, which may b e viewed as a lightweight, recursive rule language on top of of RDF. The remainder of this pap er is structured as follows: In Sec. 2 we first overview SPARQL, discuss some issues in the language (Sec. 2.1) and then define its formal semantics (Sec. 2.2). After introducing a general form of Datalog with negation as failure under the answer set semantics in Sec. 3, we proceed with the translations of SPARQL to Datalog in Sec. 4. We finally discuss the ab ove-mentioned language extensions in Sec. 5, b efore we conclude in Sec. 6.
Session: Query Languages and DBs
We assume the pairwise disjoint, infinite sets I , B , L and V ar , which denote IRIs, Blank nodes, RDF literals, and variables resp ectively. In this pap er, an RDF Graph is then a finite set, of triples from I B L × I × I B L,3 dereferenceable by an IRI. A SPARQL query is a quadruple Q = (V , P, DS, S M ), where V is a result form, P is a graph pattern, DS is a dataset, and S M is a set of solution modifiers. We refer to [18] for syntactical details and will explain these in the following as far as necessary. In this pap er, we will ignore solution modifiers mostly, thus we will usually write queries as triples Q = (V , P, DS ), and will use the syntax for graph patterns introduced b elow.
Result Forms. Since we will, to a large extent, restrict ourselves to SELECT queries, it is sufficient for our purp oses to describ e result forms by sets variables. Other result forms will b e discussed in Sec. 5. For instance, let Q = (V , P, DS ) denote the query from Fig. 1, then V = {?X, ?Y }. Query results in SPARQL are given by partial, i.e. p ossibly incomplete, substitutions of variables in V by RDF terms. In traditional relational query languages, such incompleteness is usually expressed using null values. Using such null values we will write solutions as tuples where the order of columns is determined by lexicographical ly ordering the variables in V . Given a set of variables V , let V denote the tuple obtained from lexicographically ordering V . The query from Fig. 1 with result form V = (?X, ?Y ) then has solution tuples ("Bob", : a), ("Alice", alice.org#me), ("Bob", : c). We write substitutions in sqare brackets, so these tuples corresp ond to the substitutions [?X "Bob", ?Y : a], [?X "Alice", ?Y alice.org#me ], and [?X "Bob", ?Y : c], resp ectively.
Graph Patterns. We follow the recursive definition of graph
patterns P from [16]: · a tuple (s, p, o) is a graph pattern where s, o I L V ar and p I V ar .4 · if P and P are graph patterns then (P AND P ), (P OPT P ), (P UNION P ), (P MINUS P ) are graph patterns.5 · if P is a graph pattern and i I V ar , then (GRAPH i P ) is a graph pattern. · if P is a graph pattern and R is a filter expression then (P FILTER R) is a graph pattern. For any pattern P , we denote by v ar s(P ) the set of all variables occurring in P . As atomic filter expression, SPARQL allows the unary predicates BOUND, isBLANK, isIRI, isLITERAL, binary equality predicates '=' for literals, and other features such as comparison op erators, data typ e conversion
3 Following SPARQL, we are slightly more general than the original RDF sp ecification in that we allow literals in sub ject p ositions. 4 We do not consider bNodes in patterns as these can b e semantically equivalently replaced by variables in graph patterns [6]. 5 Note that AND and MINUS are not designated keywords in SPARQL, but we use them here for reasons of readability and in order to keep with the op erator style definition of [16]. MINUS is syntactically not present at all, but we will suggest a syntax extension for this particular keyword in Sec. 5.
2.
RDF AND SPARQL
In examples, we will subsequently refer to the two RDF graphs in Fig. 1 which give some information ab out B ob and Alice. Such information is common in FOAF files which are gaining p opularity to describ e p ersonal data. Similarities with existing examples in [18] are on purp ose. We assume the two RDF graphs given in TURTLE [2] notation and accessible via the IRIs ex.org/bob and alice.org2
2 For reasons of legibility and conciseness, we omit the leading 'http://' or other schema identifiers in IRIs.
788
WWW 2007 / Track: Semantic Web
# Graph: ex.org/bob @prefix foaf: . @prefix bob: . foaf:maker : a. : a a foaf:Person ; foaf:name "Bob"; foaf:knows : b. : b a foaf:Person ; foaf:nick "Alice". foaf:maker : b PREFIX foaf: SELECT?Y ?X FROM FROM WHERE { ?Y foaf:name ?X .} ?X "Bob" "Bob" "Alice" # Graph: alice.org
Session: Query Languages and DBs
@prefix foaf: . @prefix alice: . alice:me a foaf:Person ; foaf:name "Alice" ; foaf:knows : c. :c a foaf:Person ; foaf:name "Bob" ; foaf:nick "Bobby".
?Y :a :c alice.org#me
Figure 1: Two RDF graphs in TURTLE notation and a simple SPARQL query. and string functions which we omit here, see [18, Sec. 11.3] for details. Complex filter expressions can b e built using the connectives '¬','',''. query with dataset DS = ({ex.org/bob }, ) which has an empty solution set. SELECT ?N WHERE {?G foaf:maker ?M . GRAPH ?G { ?X foaf:name ?N } } We will sometimes find the following assumption convenient to avoid such arguably unintuitive effects: Definition 1. (Dataset closedness assumption ) Given a dataset DS = (G, Gn ), Gn implicitly contains (i) all graphs mentioned in G and (ii) all IRIs mentioned explicitly in the graphs corresp onding to G. Under this assumption, the previous query has b oth ("Alice") and ("B ob") in its solution set. Some more remarks are in place concerning FILTER expressions. According to the SPARQL sp ecification "Graph pattern matching creates bindings of variables [where] it is possible to further restrict solutions by constraining the allowable bindings of variables to RDF Terms [with FILTER expressions]." However, it is not clearly sp ecified how to deal with filter constraints referring to variables which do not app ear in simple graph patterns. In this pap er, for graph patterns of the form (P FILTER R) we tacitly assume safe filter expressions, i.e. that all variables used in a filter expression R also app ear in the corresp onding pattern P . This corresp onds with the notion of safety in Datalog (see Sec.3), where the built-in predicates (which obviously corresp ond to filter predicates) do not suffice to safe unb ound variables. Moreover, the sp ecification defines errors to avoid mistyp ed comparisons, or evaluation of built-in functions over unb ound values, i.e. "any potential solution that causes an error condition in a constraint wil l not form part of the final results, but does not cause the query to fail." These errors propagate over the whole FILTER expression, also over negation, as shown by the following example. Example 2. Assuming the dataset does not contain triples for the foaf : dummy property, the example query SELECT ?X WHERE { {?X a foaf:Person . OPTIONAL { ?X foaf:dummy ?Y . } } FILTER ( ¬(isLITERAL (?Y)) ) } would discard any solution for ?X, since the unbound value for ?Y causes an error in the isLITERAL expression and thus the whole FILTER expression returns an error.
Datasets. The dataset DS = (G, {(g1 , G1 ), . . . (gk , Gk )}) of a SPARQL query is defined by a default graph G plus a set of named graphs, i.e. pairs of IRIs and corresp onding graphs. Without loss of generality (there are other ways to define the dataset such as in a SPARQL protocol query), we assume G given as the merge of the graphs denoted by the IRIs given in a set of FROM and FROM NAMED clauses. For instance, the query from Fig. 1 refers to the dataset which consists of the default graph obtained from merging alice.org ex.org/bob plus an empty set of named graphs. The relation b etween names and graphs in SPARQL is defined solely in terms of that the IRI defines a resource which is represented by the resp ective graph. In this pap er, we assume that the IRIs represent indeed network-accessible resources where the resp ective RDF-graphs can b e retrieved from. This view has also b e taken e.g. in [17]. Particularly, this treatment is not to b e confused with so-called named graphs in the sense of [4]. We thus identify each IRI with the RDF graph available at this IRI and each set of IRIs with the graph merge [13] of the resp ective IRIs. This allows us to identify the dataset by a pair of sets of IRIs DS = (G, Gn ) with G = {d1 , . . . , dn } and Gn = {g1 , . . . , gk } denoting the (merged) default graph and the set of named graphs, resp ectively. Hence, the following set of clauses
FROM FROM NAMED defines the dataset DS = ({ex.org/bob }, {alice.org }).
2.1 Assumptions and Issues
In this section we will discuss some imp ortant issues ab out the current sp ecification, and how we will deal with them here. First, note that the default graph if sp ecified by name in a FROM clause is not counted among the named graphs automatically [18, section 8, definition 1]. An unb ound variable in the GRAPH directive, means any of the named graphs in DS , but does NOT necessarily include the default graph. Example 1. This issue becomes obvious in the fol lowing
789
WWW 2007 / Track: Semantic Web
We will take sp ecial care for these errors, when defining the semantics of FILTER expressions later on.
Session: Query Languages and DBs
The semantics of a graph pattern P over dataset DS = (G, Gn ), can now b e defined recursively by the evaluation function returning sets of substitutions. Definition 2. (Evaluation, extends [16, Def. 2] ) Let t = (s, p, o) b e a triple pattern, P, P1 , P2 graph patterns, DS = (G, Gn ) a dataset, and i Gn , and v V ar , then the xjoining evaluation [[·]]x S is defined as follows: D [[t]]x S = { | dom() = v ar s(P ) and t G} D [[P1 AND P2 ]]x S = [[P1 ]]x S x [[P2 ]]x S D D D [[P1 UNION P2 ]]x S = [[P1 ]]x S [[P2 ]]x S D D D [[P1 MINUS P2 ]]x S = [[P1 ]]x S -x [[P2 ]]x S D D D [[P1 OPT P2 ]]x S = [[P1 ]]x S = x [[P2 ]]x S D D D x x [[GRAPH i P ]]D S = [[P ]](i,) [[GRAPH v P ]]x S = { [v g ] | g Gn , [[P [v g ] ]]xg,) } D ( [[P FILTER R]]x S = { [[P ]]x S | R = } D D Let R b e a FILTER expression, u, v V ar , c I B L. The valuation of R on substitution , written R takes one of the three values {, , }6 and is defined as follows. R = , if: (1) (2) (3) (4) (5) (6) (7) (8) (9) R = BOUND(v ) with v dom() v = null; R = isBLANK(v ) with v dom() v B ; R = isIRI(v ) with v dom() v I ; R = isLITERAL(v ) with v dom() v L; R = (v = c) with v dom() v = c; R = (u = v ) with u, v dom() u = v u = null; R = (¬R1 ) with R1 = ; R = (R1 R2) with R1 = R2 = ; R = (R1 R2) with R1 = R2 = .
2.2 Formal Semantics of SPARQL
The semantics of SPARQL is still not formally defined in its current version. This lack of formal semantics has b een tackled by a recent prop osal of P´rez et al. [16]. We will e base on this prop osal, but suggest three variants thereof, namely (a) bravely joining, (b) cautiously-joining, and (c) strictly-joining semantics. Particularly, our definitions vary from [16] in the way we define joining unb ound variables. Moreover, we will refine their notion of FILTER satisfaction in order to deal with error propagation prop erly. We denote by Tnull the union I B L {null }, where null is a dedicated constant denoting the unknown value not app earing in any of I , B , or L, how it is commonly introduced when defining outer joins in relational algebra. A substitution from V ar to Tnull is a partial function : V ar Tnull . We write substitutions in p ostfix notation: For a triple pattern t = (s, p, o) we denote by t the triple (s, p, o) obtained by applying the substitution to all variables in t. The domain of , dom(), is the subset of V ar where is defined. For a substitution and a set of variables D V ar we define the substitution D with domain D as follows: x if x dom() D x D = null if x D \ dom() Let 1 and 2 b e substitutions, then 1 2 is the substitution obtained as follows: 8