Block-Wise Construction of Acyclic Relational Features with Monotone Irreducibility and Relevancy Properties

Ondřej Kuželka (kuzelo1@fel.cvut.cz), Filip Železný (zelezny@fel.cvut.cz)
Czech Technical University in Prague, Technická 2, 166 27 Prague 6, Czech Republic

Abstract

We describe an algorithm for constructing a set of acyclic conjunctive relational features by combining smaller conjunctive blocks. Unlike traditional level-wise approaches, which preserve the monotonicity of frequency, our block-wise approach preserves a form of monotonicity of the irreducibility and relevancy feature properties, which are important in propositionalization employed in the context of classification learning. With pruning based on these properties, our block-wise approach efficiently scales to features including tens of first-order literals, far beyond the reach of state-of-the-art propositionalization or inductive logic programming systems.

1. Introduction

Propositionalization aims at converting structured descriptions of examples into attribute-value descriptions which can be processed by most established machine learning algorithms. A major stream of propositionalization approaches (Lavrač & Flach, 2001; Krogel et al., 2003; Železný & Lavrač, 2006) proceeds by constructing a set of features (first-order formulas) which follow some prescribed syntactical constraints and play the role of Boolean attributes.
Here we assume that examples are represented as first-order logic interpretations, features are first-order conjunctions and some Horn background theory B (possibly empty) is available. Feature F acquires the value true for example I (the example is then said to be covered by F) if m(B) ∪ I ⊨ F, where m(B) is the minimal model of B; otherwise it has the value false. With a slight abuse of notation, we will write this relation simply as B ∪ I ⊨ F. Propositionalization is obviously similar to the framework of frequent query discovery (Dehaspe & Toivonen, 1999). The two settings mainly differ in the choice of the feature (query) evaluation criterion; in the latter this is based on the frequency on a set of unlabeled examples (i.e. the number of examples covered by the query), whereas in propositionalization it usually combines the values of feature frequencies in the respective classes of a labeled example set into a single measure of feature relevancy.

Current systems working in both of the mentioned frameworks (e.g. RSD or WARMR) construct conjunctions in a level-wise manner: each conjunction in level n extends by one literal some conjunction in level n − 1. Monotonicity of frequency (if F is not frequent then F ∧ l is not frequent for any literal l) is greatly exploited for pruning in the frequent-pattern discovery setting. Unfortunately, the relevancy that is of interest in propositionalization is in general not monotone in this level-wise approach. Irreducibility, another desirable property of a feature, is not monotone either: for example, p(X) is irreducible, p(X) ∧ p(Y) is reducible, and p(X) ∧ p(Y) ∧ q(X, Y) is again irreducible. The purpose of this work is to remove these deficiencies by proposing a novel, block-wise approach to constructing a feature set, based on identifying small conjunctions ('building blocks') out of which all features can be composed. The properties of irreducibility and relevancy, and the extensions (the sets of examples covered) of the building blocks are exploited to compute the analogical properties of the resulting composition.

In the next section we formalize our chosen feature language bias, which requires a form of acyclicity. Then we provide our main theoretical contributions, expressed in two theorems: Theorem 3.3 in Section 3 mainly shows that irreducibility is monotone in the block-wise composition sense, and Theorem 4.4 in Section 4 asserts an analogical property for relevancy. In Section 5 we informally demonstrate the completeness of our feature construction algorithm ('RelF'). In Section 6, on three benchmark relational learning problems, we test RelF against the level-wise feature construction algorithm RSD (Železný & Lavrač, 2006) and the inductive logic programming system Progol (Muggleton, 1995). RelF is significantly faster than the competing systems and is able to efficiently search among conjunctions of tens of first-order literals, far beyond the reach of RSD or Progol. Of equal importance, the classification accuracies achieved with RelF's features are insignificantly different from those achieved with the competing systems, indicating that RelF's language bias is not over-constrained. Section 7 concludes the paper.

2. Features

The set of literals in a conjunction C is written as lits(C); |C| = |lits(C)| is the size of C. A set of conjunctions is said to be standardized apart if no two conjunctions in the set share a variable. Given a set A of atoms, we denote Args(A) = {(a, n) | a ∈ A, 1 ≤ n ≤ arity(a)}, i.e. Args(A) is the set of all argument places in A. We assume that no conjunction contains two identical literals: p(X) ∧ p(Y) is allowed, but p(a) ∧ p(a) is not. For an atom a, arg_i(a) is its i-th argument. The admissible feature syntax will be constrained by means of templates, which are a simple formalization of the mode declarations used in systems such as Progol (Muggleton, 1995).

Definition 2.1 (Template) A pre-template is a pair τ = (γ, µ) where γ is a finite set of ground atoms and µ ⊆ Args(γ). Elements of µ (of Args(γ) \ µ) are called inputs (outputs) in τ. A pre-template (γ, µ) is a template if every atom in γ has at most one input argument and there is a partial irreflexive order ≺ on the constants in γ such that c ≺ c′ whenever c (c′) occurs as an input (output) argument in the same atom of γ. A template τ = (γ, µ) is conveniently shown by writing γ with input (output) arguments marked with the sign + (−), such as {hasCar(−c), hasLoad(+c, −l), box(+l), large(+l), triangle(+l)}.

Definition 2.2 (Feature) Given a template τ = (γ, µ), a τ-pre-feature F is a finite conjunction of literals containing no constants or functions, such that lits(Fθ) ⊆ γ for some substitution θ. The occurrence of a variable in the i-th argument of a literal l in F is an input (output) occurrence in F if the i-th argument of l is (is not) in µ. A variable is neutral in F if it has [i] at least one input occurrence in F, and [ii] exactly one output occurrence in F. A variable is pos (neg) in F if it complies only with [i] ([ii]). A τ-pre-feature is a τ-feature F if all its variables are neutral in F; it is a pos (neg) τ-feature if it has exactly one pos (neg) variable, denoted p(F) (n(F)), and the remaining variables are neutral; it is a pos-neg τ-feature if it has exactly one pos variable and exactly one neg variable and the remaining variables are neutral.

For example, using the template τ defined under Def. 2.1, the conjunction
hasCar(C) ∧ hasLoad(C, L1) ∧ hasLoad(C, L2) ∧ box(L1) ∧ large(L1) ∧ triangle(L2)
is a τ-feature, hasCar(C) is a neg τ-feature, hasLoad(C, L1) ∧ box(L1) is a pos τ-feature and hasLoad(C, L1) is a pos-neg τ-feature. It is important to note that the only pos (neg) variable p(F⁺) (n(F⁻)) of a pos (neg) feature is uniquely given, despite the fact that inputs and outputs are not given uniquely in general. Wherever we deal with a single fixed template τ, we simply speak of (pos, neg, pos-neg) features instead of (pos, neg, pos-neg) τ-features.

Given the assumed partial order in templates, a feature F corresponds to a tree graph T_F, here called the F-tree, with vertices v_i corresponding to literals l_i. There is an edge between v_i and v_j if there is a variable which has an output occurrence in l_i and an input occurrence in l_j. We say that a (pos, neg) feature has depth d if the corresponding F-tree has depth d. Analogically, we say that a literal l is at depth d in a (pos, neg) feature if it is at depth d in the corresponding F-tree (if d = 0, we call l the root of F).

Definition 2.3 (Graft) Let F⁻ be a neg (pos-neg) feature and S⁺ = {F_i⁺} be a standardized-apart (possibly with the exception n(F⁻) = p(F_i⁺)) set of pos features. We define the graft
F⁻ ⊕_{n(F⁻)} S⁺ = F⁻ ∧ ⋀_i F_i⁺ θ_i,
where each substitution θ_i = {p(F_i⁺)/n(F⁻)}. A pos feature F⁺ is said to be contained in a (pos) feature F if and only if F⁺ ⊆ F and F \ F⁺ is a neg (pos-neg) feature. In what follows, the variable in the subscript of the graft operator ⊕ will be uniquely determined by context, hence we will mostly drop the subscript for simplicity.
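The following minimal sketch (our illustration, not the authors' implementation) shows one way the graft operation of Definition 2.3 can be realized concretely: a pos feature is a conjunction of literals with a designated pos variable, and grafting simply renames each block's pos variable to the neg variable of F⁻. All class and function names are our own assumptions.

```python
# Hypothetical sketch of Definition 2.3 (graft): conjoin F- with each block
# F_i+ under the substitution {p(F_i+)/n(F-)}.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Literal:
    pred: str
    args: Tuple[str, ...]          # variable names, e.g. ("C", "L1")

@dataclass
class PosFeature:
    pos_var: str                   # p(F+): the single pos variable
    literals: List[Literal]        # conjunction, root literal first

def rename(lit: Literal, old: str, new: str) -> Literal:
    return Literal(lit.pred, tuple(new if a == old else a for a in lit.args))

def graft(neg_literals: List[Literal], neg_var: str,
          blocks: List[PosFeature]) -> List[Literal]:
    """F- (+)_{n(F-)} {F_i+}: rename each block's pos variable to n(F-)."""
    result = list(neg_literals)
    for block in blocks:
        result.extend(rename(l, block.pos_var, neg_var) for l in block.literals)
    return result

# hasCar(C) is a neg feature with n(F-) = C; grafting it with two standardized-
# apart pos features yields a neutral feature.
f_minus = [Literal("hasCar", ("C",))]
b1 = PosFeature("C1", [Literal("hasLoad", ("C1", "L")), Literal("box", ("L",))])
b2 = PosFeature("C2", [Literal("hasLoad", ("C2", "L2")), Literal("tri", ("L2",))])
print(graft(f_minus, "C", [b1, b2]))
```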
Running Example. Let us have a set of two positive examples E⁺ and a set of two negative examples E⁻:
E⁺ = {{hasCar(c1), hasLoad(c1, l1), circ(l1), box(l1), hasLoad(c1, l2), tri(l2)},
      {hasCar(c2), hasLoad(c2, l3), box(l3), tri(l3)}}
E⁻ = {{hasCar(c3), hasLoad(c3, l4), box(l4), circ(l4)},
      {hasCar(c4), hasLoad(c4, l5), tri(l5), circ(l5)}}
We define a very simple template
τ = {hasCar(−c), hasLoad(+c, −l), box(+l), tri(+l), circ(+l)}
to constrain the features for this data. Although there is an infinite number of features correct w.r.t. τ, there are only finitely many features which are not H-reducible, as we will see in the next section.
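As a small illustration of Definition 2.2 applied to this template, the following sketch (ours, not from the paper; names are hypothetical) classifies the variables of a conjunction as neutral, pos or neg from the input/output declarations.

```python
# Variable classification of Definition 2.2 for the running-example template.
from typing import Dict, List, Tuple

# template {hasCar(-c), hasLoad(+c,-l), box(+l), tri(+l), circ(+l)}:
# for each predicate, the set of argument positions that are inputs
INPUTS = {"hasCar": set(), "hasLoad": {0}, "box": {0}, "tri": {0}, "circ": {0}}

def classify(feature: List[Tuple[str, Tuple[str, ...]]]) -> Dict[str, str]:
    """Map each variable of a conjunction to 'neutral', 'pos' or 'neg'."""
    ins, outs = {}, {}
    for pred, args in feature:
        for i, v in enumerate(args):
            bucket = ins if i in INPUTS[pred] else outs
            bucket[v] = bucket.get(v, 0) + 1
    kinds = {}
    for v in set(ins) | set(outs):
        has_input, one_output = v in ins, outs.get(v, 0) == 1
        if has_input and one_output:
            kinds[v] = "neutral"          # condition [i] and [ii]
        elif has_input:
            kinds[v] = "pos"              # only [i]
        elif one_output:
            kinds[v] = "neg"              # only [ii]
        else:
            kinds[v] = "none"             # complies with neither condition
    return kinds

# hasLoad(C, L1) ∧ box(L1) is a pos feature: C is pos, L1 is neutral
print(classify([("hasLoad", ("C", "L1")), ("box", ("L1",))]))
```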
3. Irreducibility

To define the reducibility property of conjunctive features, we borrow the notion of θ-subsumption, which is usually employed in the context of clauses.

Definition 3.1 (Reducibility) Let C and D be conjunctions of literals and let there be a substitution θ such that lits(Cθ) ⊆ lits(D). Then we say that C θ-subsumes D (written C ⪯θ D). If further D ⪯θ C, we call C and D θ-equivalent (written C ≈θ D). We say that C is θ-reducible if there exists a conjunction C′ such that C ≈θ C′ and |C| > |C′|. A conjunction C′ is said to be a θ-reduction of C if C ≈θ C′ and C′ is not θ-reducible.

An example of a θ-reducible conjunction of literals is C = hasCar(C) ∧ hasLoad(C, L1) ∧ hasLoad(C, L2) ∧ box(L1). Let D = hasCar(C) ∧ hasLoad(C, L1) ∧ box(L1); then C ≈θ D, where θ = {L2/L1}, and |D| < |C|.

In this work we cannot rely directly on the established notion of reducibility as defined above, because a θ-reduction of a τ-feature may not be a τ-feature itself. For example, for τ = {car(−c1), hasLoad(+c1, −l), hasLoad(−c2, +l), hasCar(+c2)}, the conjunction car(C1) ∧ hasLoad(C1, L1) ∧ hasLoad(C2, L1) ∧ hasCar(C2) is a correct τ-feature but its θ-reduction car(C1) ∧ hasLoad(C1, L1) ∧ hasCar(C1) is not. The fact that reduction does not preserve the correctness of feature syntax may represent a problem because, to avoid redundancy, we would like to work only with reduced features. The next definition introduces H-reduction, which has the property that the H-reduction of a τ-feature is always a τ-feature. While there is an infinite number of τ-features for a sufficiently rich template τ, there is always only a finite number of non-H-reducible ones.

Definition 3.2 (H-reduction) We say that a (pos) τ-feature f H-subsumes a (pos) τ-feature g (written f ⪯H g) if and only if there is a substitution θ (called an H-substitution) such that fθ ⊆ g and for every literal l ∈ lits(f) it holds that depth_f(l) = depth_g(lθ) for some correct assignment of inputs and outputs of f and g. If further g ⪯H f, we call f and g H-equivalent (written f ≈H g). We say that a (pos) τ-feature f is H-reducible if there is a (pos) τ-feature f′ such that f ≈H f′ and |f| > |f′|. A feature f′ is said to be an H-reduction of f if f ≈H f′ and f′ is not H-reducible.

In the proof of Theorem 3.3 we will speak interchangeably about a substitution from variables to terms and about the induced mapping from literals to literals.

Theorem 3.3 Let F⁺ be a pos feature and let F⁻ be a neg feature. Then the following holds: (i) F⁺ is H-reducible if and only if F⁺ contains pos features F₁⁺, F₂⁺ such that F₁⁺ ≠ F₂⁺, p(F₁⁺) = p(F₂⁺) and F₁⁺ ⪯H F₂⁺. (ii) If F⁺ is H-reducible, then F⁻ ⊕ F⁺ is also H-reducible. (iii) Whether a (pos) feature F is H-reducible can be decided in time polynomial in |F|.

Proof In this proof we will use the following observation. Let A, B be pos features and let θ be an H-substitution such that Aθ ⊆ B. Observe that if l ∈ B \ Aθ, then also l′ ∈ B \ Aθ for any literal l′ contained in the pos feature F_l⁺ ⊆ B which has l as its root.
(i, ⇒) Let F_r⁺ be an H-reduction of F⁺ and let θ₁, θ₂ be H-substitutions such that F_r⁺θ₁ ⊆ F⁺ and F⁺θ₂ ⊆ F_r⁺. The substitution θ₃ = θ₂θ₁ is a mapping θ₃ : lits(F⁺) → lits(F⁺). Since F⁺θ₂ ⊆ F_r⁺, |F⁺θ₂| ≤ |F_r⁺| and consequently |F⁺θ₃| ≤ |F_r⁺| < |F⁺|, because applying a substitution to a feature cannot increase its size. Therefore there is a literal l ∈ F⁺ \ F⁺θ₃ and, as we have observed, also a whole pos feature F₁⁺ ⊆ F⁺ \ F⁺θ₃. Thus there is a pos feature F₂⁺ (F₂⁺ ≠ F₁⁺) such that F₁⁺θ₃ ⊆ F₂⁺. It remains to show that for some such F₁⁺, F₂⁺ we have p(F₁⁺) = p(F₂⁺). If p(F₁⁺) ≠ p(F₂⁺), then there must be pos features F₁′⁺, F₂′⁺ such that F₁⁺ ⊆ F₁′⁺, F₂⁺ ⊆ F₂′⁺ and F₁′⁺θ₃ ⊆ F₂′⁺. For such F₁′⁺, F₂′⁺ with maximum size, p(F₁′⁺) = p(F₂′⁺).
(i, ⇐) Let θ be an H-substitution such that F⁺ \ F⁺θ = F₁⁺; then F⁺ ≈H F⁺θ and |F⁺θ| = |F⁺| − |F₁⁺| < |F⁺|.
(ii) This follows from (i).
(iii) An approach which tests whether F₁⁺ ⪯H F₂⁺ for all pairs of pos features F₁⁺, F₂⁺ contained in F⁺ with equal pos variables runs in time polynomial in |F| (cf. Algorithm 1).

The second property of H-reduction stated in Theorem 3.3 allows us to filter out H-reducible pos features during the propositionalization process.

Running Example. Let us return to our running example from Section 2. As we have already mentioned, there is an infinite number of features, but only a finite number of non-H-reducible ones. An example of an H-reducible feature for template τ is
F_reducible = hasCar(C) ∧ hasLoad(C, L) ∧ box(L) ∧ hasLoad(C, L2) ∧ box(L2) ∧ circ(L2) ∧ tri(L2).
This feature is indeed H-reducible, as may be seen from the fact that
hasLoad(C, L) ∧ box(L) ⪯H hasLoad(C, L2) ∧ box(L2) ∧ circ(L2) ∧ tri(L2).
When all H-reducible features are removed, there remain only 18 correct τ-features for τ from our running example.
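The following sketch (our illustration under stated assumptions, not the authors' code) shows the shape of the H-reducibility test suggested by Theorem 3.3(i): a pos feature is H-reducible iff it contains two distinct pos features with the same pos variable such that one H-subsumes the other. We assume tree-shaped pos features whose predicates have at most one output argument, so H-subsumption reduces to a depth-preserving tree-homomorphism test, and we only check sibling blocks hanging on one variable.

```python
# Hedged sketch of the H-reducibility check from Theorem 3.3(i).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:                      # one literal of a pos feature
    pred: str
    children: List["Node"] = field(default_factory=list)  # literals one level deeper

def h_subsumes(f: Node, g: Node) -> bool:
    """f preceq_H g for pos features rooted at the same pos variable."""
    if f.pred != g.pred:
        return False
    # every child block of f must map (depth-preservingly) into some child of g;
    # distinct children may share a target, as a substitution may merge variables
    return all(any(h_subsumes(fc, gc) for gc in g.children) for fc in f.children)

def h_reducible(sibling_blocks: List[Node]) -> bool:
    """Theorem 3.3(i), specialised to sibling pos features on one variable."""
    return any(i != j and h_subsumes(a, b)
               for i, a in enumerate(sibling_blocks)
               for j, b in enumerate(sibling_blocks))

# hasCar(C) ∧ hasLoad(C,L) ∧ box(L) ∧ hasLoad(C,L2) ∧ box(L2) ∧ circ(L2) ∧ tri(L2)
block1 = Node("hasLoad", [Node("box")])
block2 = Node("hasLoad", [Node("box"), Node("circ"), Node("tri")])
print(h_reducible([block1, block2]))   # True: block1 preceq_H block2
```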
4. Relevancy

Definition 4.1 (Domain) Let I be an interpretation, B a background theory, τ a template and T a set of terms. Let S be a standardized-apart set of pos τ-features. Then the domain w.r.t. I and B (dom_{I,B}) is a mapping dom_{I,B} : S → 2^T such that dom_{I,B}(F⁺) contains all terms t such that (I ∪ B) ⊨ F⁺θ, where θ = {p(F⁺)/t}. The domain thus assigns to each pos feature F⁺ the set of terms {t_i} for which there is a grounding θ of F⁺ such that p(F⁺)θ = t_i and F⁺θ is true in I ∪ B.

In order to allow efficient computation of the domains of pos features, we make the assumption that for every literal l with a subset of its output arguments out₁, ..., out_i grounded, it is possible to find the set of all possible groundings of this literal efficiently. If this assumption holds, then there is an algorithm which correctly computes the domain and works in time polynomial in the size of F⁺ (Algorithm 1).

Algorithm 1 dom_{I,B}(F⁺)
1: Input: pos feature F⁺, interpretation I, background theory B
2: litsDom ← {l | pred(l) = pred(root(F⁺)) ∧ (I ∪ B) ⊨ l}
3: for all output variables out_i of F⁺'s root do
4:   argDom_i ← ⋂_{c ∈ children_{out_i}(root(F⁺))} dom_{I,B}(c)
5:   litsDom ← litsDom ∩ {l | arg_i(l) ∈ argDom_i}
6: end for
7: return {t | t is the term at the input argument of l, l ∈ litsDom}

This algorithm is not novel; for B = ∅ it corresponds to the directed arc consistency algorithm known from constraint satisfaction and to the acyclic conjunctive query evaluation algorithm known from database theory (Yannakakis, 1981).

Once we have established how the domain of a pos feature is computed, it is easy to use this method to decide whether (I ∪ B) ⊨ F, where F is a neutral feature. If l = pred(X₁, ..., X_n) is the root of F, it suffices to replace l by l′ = pred(X₀, X₁, ..., X_n) in F and to extend the background theory to B′ = B ∪ {pred(yes, X₁, ..., X_n) ← pred(X₁, ..., X_n)}. If the domain of this newly created pos feature feat⁺(F) is non-empty, then (I ∪ B) ⊨ F. This simplifies notation because we do not need to treat neutral features separately.
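A sketch of Algorithm 1 for the special case B = ∅ follows (our reconstruction under stated assumptions, not the authors' code): pos features are tree-shaped, each predicate has its input argument first and at most one output argument, written last.

```python
# Bottom-up domain computation (directed arc consistency / acyclic query
# evaluation) for a tree-shaped pos feature, B = {}.
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Atom = Tuple[str, ...]            # ground atom, e.g. ("hasLoad", "c1", "l1")

@dataclass
class PosNode:
    pred: str
    children: List["PosNode"] = field(default_factory=list)

def dom(node: PosNode, interpretation: Set[Atom]) -> Set[str]:
    """Return dom_I(node): values of the node's input variable that extend to a
    true grounding of the whole subtree."""
    lits = {a for a in interpretation if a[0] == node.pred}       # line 2
    if node.children:                                             # lines 3-6
        arg_dom = set.intersection(*(dom(c, interpretation) for c in node.children))
        lits = {a for a in lits if a[-1] in arg_dom}              # output arg filtered
    return {a[1] for a in lits}                                   # line 7: input terms

# first positive example of the running example; pos feature hasLoad(C,L) ∧ box(L)
I1 = {("hasCar", "c1"), ("hasLoad", "c1", "l1"), ("circ", "l1"), ("box", "l1"),
      ("hasLoad", "c1", "l2"), ("tri", "l2")}
f = PosNode("hasLoad", [PosNode("box")])
print(dom(f, I1))    # {'c1'}: load l1 is a box, so car c1 is in the domain
```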
Example. Let us have the feature F and interpretation I,
F = hasCar(C) ∧ hasLoad(C, L) ∧ tri(L) ∧ box(L),
I = {hasCar(c), hasLoad(c, l1), hasLoad(c, l2), tri(l1), circ(l1), tri(l2), box(l2), hasLoad(c, l3), box(l4)},
B = ∅.
First, we modify F so that we can use Algorithm 1 to decide whether (I ∪ B) ⊨ F, i.e. we replace hasCar(C) by hasCar(X, C) and extend the background theory to B′ = {hasCar(yes, X) ← hasCar(X)}.¹ Then we may proceed as follows: (i) We compute the domains of the pos features box(L) and tri(L): dom_I(box(L)) = {l2, l4} and dom_I(tri(L)) = {l1, l2}. (ii) Then we compute the domain of the pos feature with root hasLoad(C, L); the admissible values of its output variable L are {l1, l2, l3} ∩ {l2, l4} ∩ {l1, l2} = {l2}, so the domain is {c}. (iii) Since no domain is empty so far, we proceed further and compute the domain of the pos feature with root hasCar(X, C), which becomes {yes}. Since this domain is non-empty, we see that (I ∪ B) ⊨ F.

¹ We assume that there were no literals hasCar(X, C) before we added them to the original feature and to the background theory.

The next definition introduces the irrelevancy of a Boolean attribute (Lavrač et al., 1999), which enables one to filter such irrelevant attributes from a dataset.

Definition 4.2 (Irrelevant Attribute) Let A be a set of Boolean attributes, let cov(a) denote the subset of examples covered by a ∈ A and let pos(a) (neg(a)) denote the subset of positive (negative) examples covered by a ∈ A. A Boolean attribute a ∈ A is said to be E⁺-irrelevant (E⁻-irrelevant) if and only if there is a′ ∈ A, a′ ≠ a, such that pos(a) ⊆ pos(a′) and neg(a′) ⊆ neg(a) (neg(a) ⊆ neg(a′) and pos(a′) ⊆ pos(a)). If one of the inclusions is strict for at least one example, a is said to be strictly E⁺-irrelevant (E⁻-irrelevant). A Boolean attribute a is called irrelevant if it is both E⁺-irrelevant and E⁻-irrelevant. It is called strictly irrelevant if it is irrelevant and strictly E⁺-irrelevant or strictly E⁻-irrelevant.
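A small sketch of Definition 4.2 (ours, not from the paper; names are hypothetical): an attribute a is E⁺-irrelevant if some other attribute covers at least the positives of a and at most the negatives of a, with attributes represented by their coverage sets.

```python
# E+-irrelevancy of Boolean attributes per Definition 4.2.
from typing import Dict, Set

def e_plus_irrelevant(a: str,
                      pos: Dict[str, Set[int]],
                      neg: Dict[str, Set[int]]) -> bool:
    return any(a2 != a and pos[a] <= pos[a2] and neg[a2] <= neg[a]
               for a2 in pos)

# toy coverage sets over positive examples {1, 2} and negative examples {3, 4}
pos = {"f1": {1},    "f2": {1, 2}}
neg = {"f1": {3, 4}, "f2": {3}}
print(e_plus_irrelevant("f1", pos, neg))   # True: f2 dominates f1
print(e_plus_irrelevant("f2", pos, neg))   # False
```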
In this section, we develop methods for the detection of pos features which give rise to irrelevant features when grafted with neg features.

Lemma 4.3 Let I be an interpretation, B a background theory and let S₁ = {F₁⁺, F₂⁺, ..., F_m⁺} and S₂ = {G₁⁺, G₂⁺, ..., G_n⁺} be standardized-apart sets of pos features such that ⋂_{i=1}^{m} dom_{I,B}(F_i⁺) ⊆ ⋂_{i=1}^{n} dom_{I,B}(G_i⁺). Then for any pos-neg feature F⁻ it holds that dom_{I,B}(F⁻ ⊕_V S₁) ⊆ dom_{I,B}(F⁻ ⊕_V S₂), where V = n(F⁻).

Proof Let us first consider the case depth_{F⁻}(V) = 0. The only place where the domains of the F_i⁺ ∈ S₁ (G_i⁺ ∈ S₂, respectively) are used is line 4 of Algorithm 1. Clearly argDom_{S₁} = ⋂_{i=1}^{m} dom_{I,B}(F_i⁺) ⊆ ⋂_{i=1}^{n} dom_{I,B}(G_i⁺) = argDom_{S₂} and consequently litsDom_{S₁} ⊆ litsDom_{S₂}, therefore also dom_{I,B}(F⁻ ⊕_V S₁) ⊆ dom_{I,B}(F⁻ ⊕_V S₂). The general case of the lemma may be proved by induction on the depth of V. (i) The case of depth 0 has already been proved. (ii) Suppose the lemma holds for depth d and let depth_{F⁻}(V) = d + 1. We may take the pos-neg feature T ⊆ F⁻ which contains V such that depth_T(V) = 0 and W = p(T) (respecting the inputs/outputs of F⁻ and n(F⁻) = V), and graft it with S₁ (S₂, respectively). We have dom_{I,B}(T ⊕_V S₁) ⊆ dom_{I,B}(T ⊕_V S₂) and, by the induction argument, finally also
dom_{I,B}(F⁻ ⊕_V S₁) = dom_{I,B}((F⁻ \ T) ⊕_W {T ⊕_V S₁}) ⊆ dom_{I,B}((F⁻ \ T) ⊕_W {T ⊕_V S₂}) = dom_{I,B}(F⁻ ⊕_V S₂),
which finishes the proof.

Theorem 4.4 Let E⁺ be a set of positive examples, E⁻ a set of negative examples and B a background theory. Let S = {F_i⁺}_{i=1}^{n} be a set of distinct pos features with equal types of input arguments such that for all i ≠ j there is an example I ∈ E⁺ ∪ E⁻ with dom_{I,B}(F_i⁺) ≠ dom_{I,B}(F_j⁺). Let dom_{I,B}(F₁⁺) ⊆ ⋂_{i=2}^{n} dom_{I,B}(F_i⁺) hold for all I ∈ E⁺ (I ∈ E⁻) and let ⋂_{i=2}^{n} dom_{I,B}(F_i⁺) ⊆ dom_{I,B}(F₁⁺) hold for all I ∈ E⁻ (I ∈ E⁺). (i) Then, for any neg feature F⁻, the feature F⁻ ⊕ {F₁⁺} is E⁺-irrelevant (E⁻-irrelevant). (ii) Let S′ be a set of pos features obtained from S by repeatedly removing irrelevant pos features. The set S′ is unique.

Proof We prove only the case of E⁺-irrelevancy, because the proof for E⁻-irrelevancy is analogous. (i) Let F⁻ be a neg feature and let F± = feat⁺(F⁻) be the corresponding pos-neg feature, which has the unique pos variable p(F±). By application of Lemma 4.3, if dom_{I,B}(F₁⁺) ⊆ ⋂_{i=2}^{n} dom_{I,B}(F_i⁺) for all I ∈ E⁺, then dom_{I,B}(F± ⊕ {F₁⁺}) ⊆ dom_{I,B}(F± ⊕ {F₂⁺, ..., F_n⁺}) for all I ∈ E⁺. Similarly, if ⋂_{i=2}^{n} dom_{I,B}(F_i⁺) ⊆ dom_{I,B}(F₁⁺) for all I ∈ E⁻, then dom_{I,B}(F± ⊕ {F₂⁺, ..., F_n⁺}) ⊆ dom_{I,B}(F± ⊕ {F₁⁺}) for all I ∈ E⁻. Therefore if (I ∪ B) ⊨ F⁻ ⊕ {F₁⁺}, then (I ∪ B) ⊨ F⁻ ⊕ {F₂⁺, ..., F_n⁺} for all I ∈ E⁺, and similarly if (I ∪ B) ⊨ F⁻ ⊕ {F₂⁺, ..., F_n⁺}, then (I ∪ B) ⊨ F⁻ ⊕ {F₁⁺} for all I ∈ E⁻. This means that F⁻ ⊕ {F₁⁺} is E⁺-irrelevant.
(ii) It could happen that by removing some irrelevant pos features from S, F₁⁺ could become relevant. We need to show that this is not the case. Let G be a graph with vertices v_i corresponding to the pos features F_i⁺ ∈ S. Let there be an edge (v_i, v_j) if and only if there is a set A with i ∈ A and j ∉ A such that dom_{I,B}(F_j⁺) ⊆ ⋂_{k∈A} dom_{I,B}(F_k⁺) for all I ∈ E⁺ and ⋂_{k∈A} dom_{I,B}(F_k⁺) ⊆ dom_{I,B}(F_j⁺) for all I ∈ E⁻. Any vertex with non-zero in-degree corresponds to an irrelevant pos feature. If we show that G is acyclic, then it is easy to find the unique set of relevant pos features as the set of pos features corresponding to vertices with zero in-degree. To show that G is acyclic, we first notice that if two distinct vertices v_i, v_j ∈ G are connected by a directed path, then there is also the edge (v_i, v_j). Therefore, if there was a directed cycle containing (v₁, v₂), there would have to be two edges (v₁, v₂) and (v₂, v₁). This would mean that for each positive example it would be true that dom_{I,B}(F₂⁺) ⊆ ⋂_{i=3}^{n} dom_{I,B}(F_i⁺) ∩ dom_{I,B}(F₁⁺) and dom_{I,B}(F₁⁺) ⊆ ⋂_{i=3}^{n} dom_{I,B}(F_i⁺) ∩ dom_{I,B}(F₂⁺), but then dom_{I,B}(F₁⁺) = dom_{I,B}(F₂⁺), and the same would be true for negative examples, implying dom_{I,B}(F₁⁺) = dom_{I,B}(F₂⁺) for all examples, which contradicts the assumption that no two distinct pos features have equal domains on all examples. Therefore G must be acyclic and the set of irrelevant pos features (in a given set) must be unique.
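The pruning rule justified by Theorem 4.4 can be sketched as follows (our reconstruction, not the authors' code): a pos feature is discarded as E⁺-irrelevant if, on every positive example, its domain is contained in the intersection of the domains of the remaining candidates, while the reverse inclusion holds on every negative example. The domains used below are those of the running-example blocks, also discussed in Section 5.

```python
# Pos-feature E+-irrelevancy check in the spirit of Theorem 4.4, with domains
# precomputed per example.
from typing import Dict, List, Set

Domains = Dict[str, List[Set[str]]]   # feature name -> one domain per example

def e_plus_irrelevant_pos_feature(f: str, others: List[str],
                                  dom_pos: Domains, dom_neg: Domains) -> bool:
    n_pos = len(next(iter(dom_pos.values())))
    n_neg = len(next(iter(dom_neg.values())))
    on_pos = all(dom_pos[f][i] <= set.intersection(*(dom_pos[g][i] for g in others))
                 for i in range(n_pos))
    on_neg = all(set.intersection(*(dom_neg[g][i] for g in others)) <= dom_neg[f][i]
                 for i in range(n_neg))
    return on_pos and on_neg

# F1 = hasLoad(C,L)∧box(L), F2 = hasLoad(C,L)∧box(L)∧tri(L), F3 = hasLoad(C,L)∧tri(L)
# domains on E+ = {I1, I2} and E- = {I3, I4} of the running example
dom_pos = {"F1": [{"c1"}, {"c2"}], "F2": [set(), {"c2"}], "F3": [{"c1"}, {"c2"}]}
dom_neg = {"F1": [{"c3"}, set()],  "F2": [set(), set()],  "F3": [set(), {"c4"}]}
print(e_plus_irrelevant_pos_feature("F2", ["F1", "F3"], dom_pos, dom_neg))  # True
```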
5. Algorithm

In this section we describe the propositionalization algorithm RelF (Algorithm 2). RelF merges the two usual phases of propositionalization, i.e. feature construction and extension computation. Specifically, the core algorithm accepts a set of learning examples and a feature template. The features are obtained by a combinatorial composition of pos features, which act as the primitive building blocks. The advantage of this assembly approach is that H-reducible or irrelevant pos features may be removed from the set of pos features while guaranteeing that all relevant features will be found. In the filtering phase, the algorithm first filters pos features which have equal domains for all examples, and only after that it also removes irrelevant pos features (thus satisfying the conditions of Theorem 4.4). The rules used for the detection of E⁺-irrelevant (E⁻-irrelevant) pos features are based on Theorem 4.4: a pos feature S₁ is E⁺-irrelevant if there is a set of pos features {S_i} and an index set Ind ⊆ N such that dom_{I,B}(S₁) ⊆ ⋂_{i∈Ind} dom_{I,B}(S_i) on positive examples and ⋂_{i∈Ind} dom_{I,B}(S_i) ⊆ dom_{I,B}(S₁) on negative examples.

Algorithm 2 RelF (sketch): given a template and a set of examples, RelF computes the propositionalized table.
1: Input: template τ, examples E
2: PosFeats ← {}
3: OrderedDefs ← topologically ordered predicate definitions computed from τ
4: for pred ∈ OrderedDefs do
5:   NewPosFeats ← {}
6:   NewPosFeats ← Combine(pred, PosFeats)
7:   filter H-reducible pos features
8:   filter pos features with equal domains for all examples
9:   filter irrelevant pos features
10:  add NewPosFeats to PosFeats
11: end for
12: save all correct features from PosFeats

Algorithm 3 Combine (procedure used by RelF): given a predicate symbol and a set of pos features, Combine constructs pos features.
1: Input: predicate, PosFeats
2: Constructed ← {}
3: ArgCombinations ← []
4: for output arguments out_i of predicate do
5:   Smaller ← pos features from PosFeats with type of input equal to type of out_i
6:   ArgCombinations[out_i] ← all combinations without repetition of F⁺ ∈ Smaller
7: end for
8: Constructed ← all possible graftings of predicate(X₁, ..., X_k) with the respective combinations from ArgCombinations
9: return Constructed

The algorithm exploits the partial irreflexive order which is imposed on the types of arguments by the template definition (Def. 2.1). Due to the existence of this order it is possible to sort all declared predicates topologically with respect to the graph induced by the partial order. When the topological order is found, it is possible to organize the generation of features in such a way that pos features are built iteratively by combining smaller pos features into larger ones. Called with input arguments τ and E, where τ is a template and E is a set of examples, RelF returns a set of neutral features T_Alg ⊆ T, where T is the set of all correct τ-features. An important property of Algorithm 2 is that for any neutral feature F ∈ T which is not strictly irrelevant, there is a neutral feature F′ ∈ T_Alg such that F and F′ cover the same set of examples. In other words, RelF correctly outputs all relevant Boolean attributes, which means that it is complete in a well-specified sense.
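To make the block-wise loop concrete, here is a toy end-to-end sketch (ours, not the authors' implementation) of Algorithms 2 and 3 for the running-example template, under strong simplifying assumptions: B = ∅, at most one output argument per predicate, the H-reducibility filter omitted, and only a pairwise (rather than set-wise) form of the irrelevancy pruning. Pos features are again encoded as predicate trees as in the earlier sketches; all names are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Set, Tuple

Atom = Tuple[str, ...]

@dataclass(frozen=True)
class PosNode:
    pred: str
    in_type: str
    children: Tuple["PosNode", ...] = ()

def dom(node: PosNode, I: Set[Atom]) -> Set[str]:
    """Algorithm 1 for B = {}: bottom-up domain of a tree-shaped pos feature."""
    lits = {a for a in I if a[0] == node.pred}
    if node.children:
        arg_dom = set.intersection(*(dom(c, I) for c in node.children))
        lits = {a for a in lits if a[-1] in arg_dom}
    return {a[1] for a in lits}

def combine(pred: str, in_type: str, out_type: Optional[str],
            smaller: List[PosNode]) -> List[PosNode]:
    """Algorithm 3 for one predicate: graft every non-empty combination of
    already built blocks onto the predicate's single output argument."""
    if out_type is None:
        return [PosNode(pred, in_type)]
    return [PosNode(pred, in_type, combo)
            for r in range(1, len(smaller) + 1)
            for combo in combinations(smaller, r)]

def relf(template: List[Tuple[str, str, Optional[str]]],
         E_pos: List[Set[Atom]], E_neg: List[Set[Atom]]) -> List[PosNode]:
    pos_feats: List[PosNode] = []
    for pred, in_type, out_type in template:        # predicates in topological order
        smaller = [f for f in pos_feats if f.in_type == out_type]
        new = combine(pred, in_type, out_type, smaller)
        examples = E_pos + E_neg
        # keep one block per domain signature (equal-domain filter, Alg. 2 line 8)
        new = list({tuple(frozenset(dom(f, I)) for I in examples): f
                    for f in new}.values())
        # pairwise E+-irrelevancy pruning, a restricted form of Theorem 4.4 (line 9)
        pool = [g for g in pos_feats + new if g.in_type == in_type]
        new = [f for f in new if not any(
            g is not f
            and all(dom(f, I) <= dom(g, I) for I in E_pos)
            and all(dom(g, I) <= dom(f, I) for I in E_neg)
            for g in pool)]
        pos_feats += new
    # blocks rooted in the input-less root predicate play the role of neutral features
    return [f for f in pos_feats if f.in_type == "top"]

E_pos = [{("hasCar", "c1"), ("hasLoad", "c1", "l1"), ("circ", "l1"), ("box", "l1"),
          ("hasLoad", "c1", "l2"), ("tri", "l2")},
         {("hasCar", "c2"), ("hasLoad", "c2", "l3"), ("box", "l3"), ("tri", "l3")}]
E_neg = [{("hasCar", "c3"), ("hasLoad", "c3", "l4"), ("box", "l4"), ("circ", "l4")},
         {("hasCar", "c4"), ("hasLoad", "c4", "l5"), ("tri", "l5"), ("circ", "l5")}]
# hasCar has no real input; the artificial "top" input type plays the role of the
# feat+ construction of Section 4, so its blocks behave like neutral features.
template = [("box", "load", None), ("tri", "load", None), ("circ", "load", None),
            ("hasLoad", "car", "load"), ("hasCar", "top", "car")]
for f in relf(template, E_pos, E_neg):
    print(f)     # for this toy data, a single separating hasCar-rooted feature remains
```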
Running Example. In Section 3 we made the set of correct τ-features finite; we have shown that, for our particular template τ, there are only 18 features. We now demonstrate how RelF constructs the E⁺-relevant features. We first generate and filter the three pos features box(L), tri(L), circ(L). We may check that circ(L) is E⁺-irrelevant (due to box(L)), so we may safely discard it. The next pos features, created by grafting with box(L) and tri(L), are F₁⁺ = hasLoad(C, L) ∧ box(L), F₂⁺ = hasLoad(C, L) ∧ box(L) ∧ tri(L) and F₃⁺ = hasLoad(C, L) ∧ tri(L). Notice that if we had not removed circ(L), there would have been seven such pos features. We can now filter F₁⁺, F₂⁺, F₃⁺ in exactly the same manner as we filtered box(L), tri(L), circ(L). In this case, F₂⁺ is E⁺-irrelevant because
dom_{I1}(F₂⁺) ⊆ dom_{I1}(F₁⁺) ∩ dom_{I1}(F₃⁺) = {c1},
dom_{I2}(F₂⁺) = {c2} ⊆ dom_{I2}(F₁⁺) ∩ dom_{I2}(F₃⁺) = {c2},
dom_{I3}(F₁⁺) ∩ dom_{I3}(F₃⁺) = dom_{I3}(F₂⁺) = ∅,
dom_{I4}(F₁⁺) ∩ dom_{I4}(F₃⁺) = dom_{I4}(F₂⁺) = ∅.
Finally, we may graft these pos features with hasCar(C) to obtain the resulting set of neutral features.

6. Experiments

In this section we evaluate the performance and accuracy of RelF.² We evaluate RelF in three relational learning domains in comparison to RSD (Železný & Lavrač, 2006) and Progol (Muggleton, 1995). Our intention is to demonstrate (i) that RelF can propositionalize relational data orders of magnitude faster than standard algorithms and (ii) that classifiers built using features generated by RelF are not much worse than those built using more general feature declarations.

² All program codes can be obtained on request from the first author.

In (Krogel et al., 2003) extensive experiments were conducted to compare three state-of-the-art propositionalization systems: RSD, SINUS and RELAGGS. In this study, each of the systems obtained the best predictive accuracy on exactly two out of six domains. RELAGGS proved superior especially in domains where numerical attributes were essential. On the other hand, SINUS and RSD performed well in typical ILP tasks such as predicting mutagenicity or learning legal positions of chess end-games. However, in all experiments RSD was several times faster than SINUS; that is why we chose RSD for comparisons. We also conduct experiments comparing RelF to the state-of-the-art ILP system Progol, and we compare our results with those presented in the literature.

In all experiments described in this section, random forest classifiers³ are used (Breiman, 2001). We follow the suggestions given in (Scheffer & Herbrich, 1997) to obtain unbiased estimates of the quality of the learned classifiers. We perform experiments both with RSD having the same feature declaration bias as RelF and with RSD allowing cyclic features. While for RelF the only necessary restriction on features is given by the templates (which implicitly restrict the depth of features), for RSD we also need to bound the maximum size of features.

³ We used the random forest classifier from the Weka package (Witten & Frank, 2005).
6.1. Mutagenesis

Our first set of experiments was done on the well-known Mutagenesis dataset (Lodhi & Muggleton, 2005), which consists of 188 organic molecules marked according to their mutagenicity. We used atom-bond descriptions and the numerical attributes lumo and logP. The longest features found by RelF had over 20 bond literals and were found in 116 seconds. The longest features found by RSD had 3 bond literals and RSD needed 272 seconds; bigger features could not be found by RSD within 15000 s. RSD without the acyclic feature bias was not able to find more features than with the acyclic bias.

Table 1 displays the predictive accuracies obtained on the Mutagenesis dataset. The third column refers to Progol with the maximum number of searched nodes set to 10000 and the maximum clause length set to 4; theory construction runtime was 398 s. All clauses found by Progol were acyclic. The fourth column displays the accuracy obtained by an ensemble method with a set of theories found by Progol, as reported in (Lodhi & Muggleton, 2005), which is to date the highest predictive accuracy for this dataset. However, in this last experiment more information about the molecules was used (functional groups and indicator variables). Therefore we repeated our experiments with the indicator variables and functional groups added and obtained an accuracy of 87.4%. Adding functional groups was not very useful in this case, because RelF was already able to capture the functional groups due to its ability to construct long features.

Table 1. Accuracies on the Mutagenesis dataset.
Algorithm      RelF   RSD    Progol   Progol Ens.
Accuracy [%]   89.8   87.8   82.0     95.8

6.2. CAD Documents

The second set of experiments was conducted in a domain describing CAD documents (product structures) (Žáková et al., 2007). The dataset consists of 96 class-labeled examples. This dataset is interesting because long features are needed to obtain reasonable classification accuracy. For RSD we used both the same template (acyclic bias) and a slightly more general template. We needed to significantly constrain the size of RSD's features to 12 literals in the acyclic case (resulting in a runtime of 12324 s) and 11 literals in the cyclic case (resulting in a runtime of 2198 s). This is in contrast with RelF, whose longest features had over 50 literals (with a runtime of 108 s). Importantly, the single feature that separated the dataset best was discovered only by RelF. The accuracy of Progol is low due to the fact that Progol is unable to find clauses of sufficient length within the 15000 s limit, which agrees with the findings reported in (Žáková et al., 2007).

Table 2. Accuracies on the CAD dataset.
Algorithm      RelF   RSD    RSD (cyclic)   Progol
Accuracy [%]   96.7   96.7   91.2           81.1

We performed an additional experiment on the CAD dataset to clarify to what extent the speedup achieved by RelF compared to RSD is due to the filtering of irrelevant pos features. With irrelevancy filtering enabled, RelF ran for 108 seconds, whereas with irrelevancy filtering disabled it crashed after running for several minutes because it ran out of memory. This shows that the key concept which makes RelF efficient is the filtering of irrelevant pos features.
6.3. Predictive Toxicology Challenge

The last set of experiments was done with data from the Predictive Toxicology Challenge (Helma et al., 2001). The PTC dataset consists of 344 organic molecules marked according to their carcinogenicity for male and female mice and rats. Our experiments were done for male rats. The longest features constructed by RelF had over 20 bond literals and were constructed in 762 seconds, while the longest features constructed by RSD had only 4 bond literals in 2971 seconds.

Table 3 shows the predictive accuracies obtained on the PTC dataset. With Progol, we were unable to obtain any theory compression. The third column in Table 3 refers to an approach based on the optimal assignment kernel (Fröhlich et al., 2005). The fourth column also refers to an approach based on kernels (Ralaivola et al., 2005). The predictive accuracy reported for this last approach is the highest presented in the literature; however, it is a leave-one-out estimate, as opposed to the 10-fold cross-validation estimates of the other discussed results.

Table 3. Accuracies on the PTC dataset for male rats.
Algorithm      RelF   RSD    Kernel1   Kernel2
Accuracy [%]   62.5   64.9   63.0      65.7

7. Conclusions

We have introduced RelF, an algorithm for the construction of acyclic relational features. We have shown that the block-wise construction of acyclic features enables RelF to remove H-reducible and irrelevant pos features. In experiments, we have shown that RelF is able to construct relevant features with sizes far beyond the reach of state-of-the-art propositionalization systems. Importantly, we have also shown that, for three typical ILP datasets, the restriction to acyclic features was not very detrimental with respect to predictive accuracy; in fact, the obtained results were very close to the best results reported in the literature. Although there are certainly datasets where cyclic features could provide better predictive accuracies, one can always merge the results of different propositionalization algorithms and feed propositional learners with such merged propositionalized tables. An algorithm for the construction of a limited class of features, such as RelF, would be useful also in such cases.

Acknowledgements. We are grateful to the ICML 2009 reviewers for their insightful comments. The first author is supported by project 1ET101210513 (Czech Academy of Sciences), the second author is supported by project MSM6840770038 (Czech Ministry of Education). Both authors are further supported by project 201/08/0509 (Czech Science Foundation).

References

Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.

Dehaspe, L., & Toivonen, H. (1999). Discovery of frequent datalog patterns. Data Mining and Knowledge Discovery, 3, 7-36.

Fröhlich, H., Wegner, J. K., Sieker, F., & Zell, A. (2005). Optimal assignment kernels for attributed molecular graphs. International Conference on Machine Learning (ICML '05) (pp. 225-232). ACM.

Helma, C., King, R. D., Kramer, S., & Srinivasan, A. (2001). The predictive toxicology challenge 2000-2001. Bioinformatics, 17, 107-108.

Krogel, M.-A., et al. (2003). Comparative evaluation of approaches to propositionalization. International Conference on Inductive Logic Programming (ILP '03). Springer.

Lavrač, N., & Flach, P. A. (2001). An extended transformation approach to inductive logic programming. ACM Transactions on Computational Logic, 2, 458-494.

Lavrač, N., Gamberger, D., & Jovanoski, V. (1999). A study of relevance for learning in deductive databases. Journal of Logic Programming, 40, 215-249.

Lodhi, H., & Muggleton, S. (2005). Is mutagenesis still challenging? International Conference on Inductive Logic Programming (ILP '05), Late-Breaking Papers (pp. 35-40).

Muggleton, S. (1995). Inverse entailment and Progol. New Generation Computing, Special Issue on Inductive Logic Programming, 13, 245-286.

Ralaivola, L., Swamidass, S. J., Saigo, H., & Baldi, P. (2005). Graph kernels for chemical informatics. Neural Networks, 18, 1093-1110.

Scheffer, T., & Herbrich, R. (1997). Unbiased assessment of learning algorithms. 15th International Joint Conference on Artificial Intelligence (IJCAI '97) (pp. 798-803).

Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques. 2nd edition. Morgan Kaufmann, San Francisco.

Yannakakis, M. (1981). Algorithms for acyclic database schemes. International Conference on Very Large Data Bases (VLDB '81) (pp. 82-94).

Žáková, M., Železný, F., Garcia-Sedano, J., Tissot, C. M., Lavrač, N., Křemen, P., & Molina, J. (2007). Relational data mining applied to virtual engineering of product designs. International Conference on Inductive Logic Programming (ILP '07). Springer.

Železný, F., & Lavrač, N. (2006). Propositionalization-based relational subgroup discovery with RSD. Machine Learning, 62(1-2), 33-63.