Maximum Likelihood Rule Ensembles

Krzysztof Dembczyński (kdembczynski@cs.put.poznan.pl)
Institute of Computing Science, Poznań University of Technology, Poznań, 60-965, Poland

Wojciech Kotłowski (wkotlowski@cs.put.poznan.pl)
Institute of Computing Science, Poznań University of Technology, Poznań, 60-965, Poland

Roman Słowiński (rslowinski@cs.put.poznan.pl)
Institute of Computing Science, Poznań University of Technology, Poznań, 60-965, Poland
Systems Research Institute, Polish Academy of Sciences, Warsaw, 01-447, Poland

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

We propose a new rule induction algorithm for solving classification problems via probability estimation. The main advantage of decision rules is their simplicity and good interpretability. While the early approaches to rule induction were based on sequential covering, we follow an approach in which a single decision rule is treated as a base classifier in an ensemble. The ensemble is built by greedily minimizing the negative log-likelihood, which results in estimating the class conditional probability distribution. The introduced approach is compared with other decision rule induction algorithms such as SLIPPER, LRI and RuleFit.

1. Introduction

A decision rule is a logical statement of the form: "if condition then response". It can be treated as a simple classifier that gives a constant response for the objects satisfying the condition part, and abstains from the response for all the other objects. Induction of decision rules has been widely considered in the early machine learning approaches (Michalski, 1983; Cohen, 1995; Fürnkranz, 1996), and in rough set approaches to knowledge discovery (Stefanowski, 1998). The most popular algorithms were based on a sequential covering procedure (also known as the separate-and-conquer approach). In this technique, a rule is learned that covers a part of the training examples; these examples are then removed from the training set, and the process is repeated until no examples remain.

Although decision (classification) trees seem much more popular in data mining and machine learning applications, recently we can observe a renewed interest in decision rule models. As examples, let us mention algorithms such as RuleFit (Friedman & Popescu, 2005), SLIPPER (Cohen & Singer, 1999), and Lightweight Rule Induction (LRI) (Weiss & Indurkhya, 2000). All these algorithms follow a specific iterative approach to decision rule generation by treating each decision rule as a subsidiary base classifier in the ensemble. This approach can be seen as a generalization of sequential covering, because it approximates the solution of the prediction task by sequentially adding new rules to the ensemble without adjusting those that have already been added (RuleFit is an exception, since it generates the trees first and then transforms them to rules). Each rule is fitted by concentrating on the objects which were hardest to classify correctly by the rules already present in the ensemble. All these algorithms can be explained within the framework of boosting (Freund & Schapire, 1997; Mason et al., 1999; Friedman et al., 2000) or forward stagewise additive modeling (FSAM) (Hastie et al., 2003), a greedy procedure for minimizing a loss function on the dataset.

The algorithm proposed in this paper, Maximum Likelihood Rule Ensembles (MLRules), benefits from the achievements in boosting machines (Freund & Schapire, 1997; Mason et al., 1999; Friedman et al., 2000; Friedman, 2001). Its main idea consists in rule induction by greedily minimizing the negative log-likelihood (also known as the logit loss in the binary classification case) to estimate the conditional class probability distribution. Minimization of such a loss function with a tree being the base classifier has already been used in LogitBoost (Friedman et al., 2000) and MART (Friedman, 2001); here, however, we show a modified procedure, adapted to the case when a decision rule is the base classifier in the ensemble. In contrast to RuleFit (where trees are generated first), rules are generated directly; in contrast to SLIPPER and LRI, the negative log-likelihood loss is used. Moreover, our approach is distinguished from other approaches to rule induction by the fact of estimating the class probability distribution instead of a single classification, and by using the same single measure (the value of the negative log-likelihood) at all stages of the learning procedure: setting the best cuts (conditions), stopping the rule's growth, and determining the response (weight) of the rule. We derive the algorithm for two optimization techniques, depending on whether we expand the loss function to the first order (fitting to the gradient) or to the second order (Newton steps). We report experiments showing the performance of MLRules and comparing it with competitive rule ensemble methods.

The paper is organized as follows. In Section 2, the classification problem is described. Section 3 presents a framework for learning rule ensembles. Section 4 is devoted to the problem of single rule generation. In Section 5 we discuss the issue of convergence of the method, and we propose a modification to the main algorithm. Section 6 contains experimental results. The last section concludes the paper and outlines further research directions.
2. Problem Statement

In the classification problem, the aim is to predict the unknown class label y ∈ {1, ..., K} of an object using the known values of its attributes x = (x1, x2, ..., xm). This is done by constructing a classification function f(x) that accurately predicts the value of y. The accuracy of a single prediction is measured in terms of a loss function L(y, f(x)), while the overall accuracy of the function f(x) is measured by the expected loss (risk) over the data distribution P(x, y): R(f) = E[L(y, f(x))]. Since P(x, y) is unknown, the risk-minimizing function (Bayes classifier), f* = arg min_f E[L(y, f(x))], is also unknown. The learning procedure uses only a set of training examples {(x1, y1), ..., (xn, yn)} to construct f as a good approximation of f*. Usually, this is performed by minimization of the empirical risk:

    R_emp(f) = (1/n) Σ_{i=1}^n L(yi, f(xi)),    (1)

where the function f is chosen from a restricted family of functions.

The most commonly used loss function is the 0-1 loss, L_{0-1}(y, f(x)) = 1 − δ_{y,f(x)}, where δ_{ij} = 1 if i = j, and δ_{ij} = 0 otherwise. If the correct class is predicted, the classification function is not penalized; otherwise the unit penalty is imposed. The Bayes classifier then has the following form:

    f*(x) = arg max_{k ∈ {1,...,K}} Pr(y = k | x).    (2)

The 0-1 loss has several drawbacks. Firstly, if we introduce unequal costs of misclassification, f*(x) no longer has the form (2). Moreover, the 0-1 loss is insensitive to the "confidence" of the prediction: minimization of the 0-1 loss results only in finding the most probable class, without estimating its probability. On the contrary, probability estimation provides us with the conditional class distribution P(y|x), by which we can measure the prediction confidence. Moreover, all we need to obtain the Bayes classifier for any loss function is the conditional probability distribution. Here we consider the estimation of probabilities using the well-known maximum likelihood estimation (MLE) method. MLE can be stated as empirical risk minimization by taking the negative logarithm of the conditional likelihood (negative log-likelihood) as the loss function:

    ℓ = Σ_{i=1}^n −log P(yi | xi).    (3)

We model the probabilities P(1|x), ..., P(K|x) with a vector function f(x) = (f1(x), ..., fK(x)) using the multinomial logistic transform:

    P(y|x) = e^{f_y(x)} / Σ_{k=1}^K e^{f_k(x)}.    (4)

Then (3) takes the form:

    ℓ(f) = Σ_{i=1}^n [ log Σ_{k=1}^K e^{f_k(xi)} − f_{yi}(xi) ].    (5)

This expression has the form of (1) (with the exception that the vector function f is used instead of the scalar f) if we identify L(yi, f(xi)) = log Σ_{k=1}^K e^{f_k(xi)} − f_{yi}(xi). It is worth mentioning that the Bayes function f*(x) is obtained by the inverse of (4).

3. Rule Ensembles

In this section, we describe the scheme of learning rule ensembles. Let Xj be the set of all possible values of attribute j. The condition part of a rule consists of a conjunction of elementary expressions of the form xj ∈ Sj, where xj is the value of object x on attribute j and Sj is a subset of Xj, j ∈ {1, ..., m}. We assume that in the case of ordered value sets, Sj has the form of an interval [sj, ∞) or (−∞, sj] for some sj ∈ Xj, so that the elementary expressions take the form xj ≥ sj or xj ≤ sj. For nominal attributes, we consider elementary expressions of the form xj = sj or xj ≠ sj. Let Φ be the set of elementary expressions constituting the condition part of the rule, and let Φ(x) be an indicator function equal to 1 if x satisfies the condition part of the rule (all elementary expressions in the condition part), and Φ(x) = 0 otherwise. We say that a rule covers an object x if Φ(x) = 1. The response of the rule is a vector α ∈ R^K assigned to the region defined by Φ. Therefore, we define a decision rule as:

    r(x) = α Φ(x).    (6)

Notice that the decision rule takes only two values, r(x) ∈ {α, 0}, depending on whether x satisfies the condition part or not. In this paper, we assume the classification function is a linear combination of M rules:

    f(x) = Σ_{m=1}^M rm(x).    (7)

Using (4), we can obtain conditional probabilities from (7). Moreover, from (4) it follows that P(y|x) is a monotone function of f_y(x). Therefore, from (2) we have that object x is classified to the class with the highest f_k(x). Thus, combination (7) has a very simple interpretation as a voting procedure: rules vote for each class k, and object x is classified to the class with the highest vote.

The construction of an optimal rule ensemble minimizing the negative log-likelihood (empirical risk) is a hard optimization problem. That is why we follow here a forward stagewise strategy (Hastie et al., 2003), i.e., the rules are added one by one, greedily minimizing the loss function:

    rm = arg min_r ℓ(f_{m−1} + r) = arg min_{Φ,α} ℓ(f_{m−1} + αΦ),    (8)

where rm is the rule obtained in the m-th iteration and f_{m−1} is the rule ensemble after m − 1 iterations. It has been shown (Hastie et al., 2003) that "shrinking" the base classifier while adding it to the ensemble improves the prediction accuracy. That is why we set fm(x) = f_{m−1}(x) + ν · rm(x), where ν ∈ (0, 1] is the shrinkage parameter, which constitutes a trade-off between accuracy and interpretability. Higher values of ν (close to 1) produce smaller ensembles, while low values (around 0.1) produce larger but more accurate ones.

4. Generation of a Single Rule

In this section, we describe how the algorithm generates single rules. In order to obtain a rule, one has to solve (8). This optimization problem is still computationally hard. Therefore, we restrict the analysis to rules voting for only one class, so that the response of the rule has the form α = αv, where v is a vector with only one non-zero coordinate, vk = 1 for some k ∈ {1, ..., K}, and α is a positive real value.

We propose two heuristic procedures for solving (8). The first, called the gradient method (Mason et al., 1999), approximates ℓ(f + αΦv) up to the first order with respect to α:

    ℓ(f + αΦv) ≈ ℓ(f) + α ∇ℓ(f, Φv),    (9)

where

    ∇ℓ(f, Φv) = ∂ℓ(f + αΦv)/∂α |_{α=0}.    (10)

Since the first term in (9) is constant, minimization of the loss for any positive α is equivalent to minimization of the second term. Thus, if we define:

    Lm(Φ) = min_v ∇ℓ(f, Φv)    (11)

(we remind that there are only K possible vectors v, so the minimum can be found by simply checking all K possibilities), then Φm can be obtained by minimizing Lm(Φ).
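The transform (4), the loss (5) and the directional derivative (10) are straightforward to compute. The sketch below is our own illustration (all function names are ours, not the authors' code); for each covered object the derivative (10) reduces to Σ_k vk · pik − v_{yi}, which can be sanity-checked against a finite-difference estimate.

```python
import math

def softmax(f):
    """Multinomial logistic transform (4): P(k|x) = exp(f_k) / sum_j exp(f_j)."""
    mx = max(f)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(v - mx) for v in f]
    s = sum(exps)
    return [e / s for e in exps]

def neg_log_likelihood(F, y):
    """Negative log-likelihood (5): sum_i [log sum_k exp(f_k(x_i)) - f_{y_i}(x_i)]."""
    total = 0.0
    for f, yi in zip(F, y):
        mx = max(f)
        total += mx + math.log(sum(math.exp(v - mx) for v in f)) - f[yi]
    return total

def directional_derivative(F, y, covered, v):
    """The derivative (10) of l(f + a*Phi*v) at a = 0: for each object covered
    by the rule (Phi(x_i) = 1) it contributes sum_k v_k * p_ik - v_{y_i}."""
    g = 0.0
    for i in covered:
        p = softmax(F[i])
        g += sum(vk * pk for vk, pk in zip(v, p)) - v[y[i]]
    return g
```

For a one-hot vote vector v (vk = 1 for a single class k), the per-object contribution simplifies to pik − δ_{k,yi}, which is exactly the summand appearing later in (16).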
The second heuristic, the Newton method, approximates ℓ(f + αΦv) up to the second order:

    ℓ(f + αΦv) ≈ ℓ(f) + α ∇ℓ(f, Φv) + (α²/2) ∇²ℓ(f, Φv),    (12)

where ∇ℓ(f, Φv) is defined as before, and:

    ∇²ℓ(f, Φv) = ∂²ℓ(f + αΦv)/∂α² |_{α=0}.    (13)

Due to the convexity of the negative log-likelihood, expression (12) is minimized by the Newton step:

    α = −∇ℓ(f, Φv) / ∇²ℓ(f, Φv).    (14)

By substituting (14) into (12) and taking the square root, we get:

    Lm(Φ) = min_v ∇ℓ(f, Φv) / √(∇²ℓ(f, Φv)),    (15)

and we can obtain Φm by minimizing Lm(Φ).

The expressions ∇ℓ(f, Φv) and ∇²ℓ(f, Φv) have a very simple form. Let v be such that vk = 1. Then:

    ∇ℓ(f, Φv) = Σ_{Φ(xi)=1} (pik − δ_{k,yi}),    (16)

    ∇²ℓ(f, Φv) = Σ_{Φ(xi)=1} pik (1 − pik),    (17)

where pik = P(k|xi) and δ_{ij} = 1 iff i = j. To calculate Lm(Φ), these expressions must be obtained for each k.

What we still need for finding Φm, using both the gradient and the Newton technique, is a fast procedure for minimizing Lm(Φ), regardless of whether it is defined by (11) or (15). We propose the following simple iterative procedure: at the beginning, Φm is empty (no elementary expressions are specified) and we set Lm(Φ) = 0. In each step, an elementary expression xj ∈ Sj that minimizes Lm(Φ) is added to Φm (if such an expression exists). The expression is searched for by sequentially testing the elementary expressions, attribute by attribute. For ordered attributes, each expression of the form xj ≥ sj or xj ≤ sj is tested, for every sj ∈ Xj; for nominal attributes, we test each expression of the form xj = sj or xj ≠ sj, for every sj ∈ Xj. Adding new expressions is repeated until Lm(Φ) cannot be decreased any more. We also simultaneously obtain vm, i.e., the value of v for which the minimum is reached in (11) or (15). Notice that since Lm(Φ) = 0 at the beginning, Lm(Φ) must be strictly negative at the end, otherwise no rule is generated. The procedure for finding the optimal Φ is very fast and proved to be efficient in computational experiments. The ordered attributes can be sorted once, before generating any rule.

Having found Φm, we can obtain αm by solving the following convex line-search problem:

    αm = arg min_α ℓ(f + α vm Φm).    (18)

To speed up the computations, however, we follow a simpler procedure and obtain αm by the Newton step (14). The whole procedure for constructing the rule ensemble is presented as Algorithm 1; we call it MLRules. Note that we start with f(x) equal to α0, which is a "default rule" with fixed Φ0(x) ≡ 1, while v0 and α0 are obtained as usual. Since the response always indicates the majority class, such a rule serves as a default classification when no other rule covers a given object.

Algorithm 1 MLRules
  input: set of n training examples {(yi, xi)}_{i=1}^n, M = number of decision rules.
  output: rule ensemble {rm(x)}_{m=1}^M.
  f0 := α0 (default rule, Φ0(x) ≡ 1)
  for m = 1 to M do
    Φm = arg min_Φ Lm(Φ)
    αm = −∇ℓ(f_{m−1}, Φm vm) / ∇²ℓ(f_{m−1}, Φm vm)
    rm(x) = αm vm Φm(x)
    fm(x) = f_{m−1}(x) + ν · rm(x)
  end for
(ν is the shrinkage parameter of Section 3.)

In our implementation of the algorithm, we employed a resampling technique (Friedman & Popescu, 2003), which is known to improve both accuracy and computational complexity. To obtain less correlated rules, we search for Φm, using (11) or (15), on a random subsample of the training set (drawn without replacement) of size η < n. The response αm is then, however, obtained using all of the training objects (including those objects which have not been used to obtain Φm). This usually decreases |αm|, so it plays the role of a regularization method and avoids overfitting the rule to the training set.
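Given the class probabilities pik of the covered objects, the statistics (16) and (17), the Newton step (14) and the criterion (15) reduce to a few sums. The sketch below is our own illustration (names are ours), assuming the sign convention in which a candidate vote with a negative criterion value is a descent direction:

```python
import math

def newton_response(probs, labels, k):
    """Gradient (16), Hessian (17), Newton step (14) and criterion (15)
    for a rule voting for class k (v_k = 1), restricted to covered objects.

    probs  -- class-probability vectors p_i of the objects with Phi(x_i) = 1
    labels -- their observed class labels y_i
    """
    grad = sum(p[k] - (1.0 if yi == k else 0.0) for p, yi in zip(probs, labels))
    hess = sum(p[k] * (1.0 - p[k]) for p in probs)
    alpha = -grad / hess               # Newton step (14)
    score = grad / math.sqrt(hess)     # criterion (15); negative for a useful rule
    return alpha, score

def best_vote(probs, labels, num_classes):
    """Try all K single-class vote vectors and keep the one minimizing (15)."""
    return min(range(num_classes), key=lambda k: newton_response(probs, labels, k)[1])
```

Scanning all K candidate classes per condition is exactly the minimum over v mentioned after (11): when the covered region is dominated by class k, the gradient for k is negative, the criterion is negative, and the Newton step α is positive.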
This procedure resembles the way decision trees are generated. Here, however, we look for only one path from the root to a leaf. Moreover, let us notice that the minimal value of Lm(Φ) is a natural stop criterion in building a single rule, and we do not use any other measures (e.g., impurity measures) for choosing the optimal cuts.

5. Extensions

In this section, we shortly discuss the problem of convergence and propose two simple extensions of the main algorithm.

5.1. Convergence

The procedure of obtaining αm by the Newton step does not always decrease the empirical risk and does not guarantee the convergence of the algorithm. However, a simple backtracking line search is sufficient for convergence: we start with αm obtained by the Newton step; if ℓ(f + αm vm Φm) < ℓ(f), the procedure stops; otherwise we repeat αm := αm / 2 until the above condition is satisfied. This procedure ends after a finite number of steps, since from the definition of Lm, either (11) or (15), it follows that Lm(Φ) = 0 if and only if ∇ℓ(f, Φv) = 0; so if a rule is generated, Lm(Φ) < 0 and vmΦm is a descent direction. Therefore, ℓ is decreased in each iteration. Since ℓ is bounded from below, the procedure converges, i.e., lim_{m→∞} ℓ(fm) = ℓ*. In the implementation of the algorithm we do not use such a procedure, since the algorithm is stopped after M rounds anyway.

This raises the question whether ℓ* is the minimum achievable value of the negative log-likelihood in the class of rule ensembles F, i.e., whether ℓ* = inf_{f∈F} ℓ(f). The answer is negative, because a greedy procedure is used to find the condition part Φ of a rule. Thus, even if a "descent direction" rule exists, the procedure may fail to find it (although the resampling strategy improves the procedure by randomly perturbing the training set in each iteration, which helps to avoid "local minima").
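The backtracking line search described in Section 5.1 can be sketched in a few lines (our illustration; the iteration cap guarding against a non-descent direction is our own safeguard, not part of the paper's description):

```python
def backtracked_step(loss, alpha0, max_halvings=60):
    """Backtracking line search of Section 5.1: accept the Newton step alpha0
    only if it decreases the ensemble loss, otherwise halve it and retry.

    loss(alpha) -- ensemble loss l(f + alpha * v * Phi) as a function of the step
    Returns the accepted step size, or 0.0 if no decrease was found.
    """
    base = loss(0.0)          # l(f): loss of the current ensemble
    alpha = alpha0
    for _ in range(max_halvings):
        if loss(alpha) < base:
            return alpha
        alpha /= 2.0          # alpha_m := alpha_m / 2
    return 0.0
```

When vΦ is a descent direction, some sufficiently small halved step always decreases a convex loss, which is why the paper's version terminates after a finite number of halvings.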
Nevertheless, the question whether ℓ* is actually reached seems not to be of practical importance here, since we fix the maximal number of rules M. This is due to the empirical evidence showing that ensemble methods sometimes overfit on real-life data when the size of the ensemble is too large. In the next subsection, we describe another stopping condition, independent of the parameter M.

5.2. Avoiding overfitting

A decision rule has the form of an m-dimensional rectangle. It can be shown that the class of m-dimensional rectangles has Vapnik-Chervonenkis (VC) dimension equal to 2m, and this VC dimension does not depend on the number of cuts. This is contrary to the tree classifier, for which the VC dimension grows to infinity with an increasing number of cuts (nodes). Therefore, in the case of tree ensembles, one usually imposes some constraints on tree complexity, e.g., a maximal number of nodes, while in the case of a rule ensemble no such constraints are necessary. The theoretical results (Schapire et al., 1998) suggest that an ensemble with a simple base classifier (with low VC dimension) and high prediction confidence (margin) on the dataset generalizes well, regardless of the size of the ensemble. Nevertheless, we conducted computational experiments which show that the performance of a rule ensemble can deteriorate as the number of rules grows, especially for problems with a high noise level. A similar phenomenon has been observed for other boosting algorithms, in particular for AdaBoost (Mason et al., 1999; Friedman et al., 2000; Dietterich, 2000).

Therefore, we propose a procedure for stopping the ensemble growth, based on a simple "holdout set" analysis. Each rule is induced from a subsample of size η < n drawn without replacement. Thus, there are n − η objects which do not take part in the induction procedure and can be used as a holdout set to estimate the quality of the induced rule. Since each rule votes for a single class, we calculate a simple 0-1 error of such a rule on the covered objects from the holdout set. A rule is acceptable if its holdout error is better (lower) than random guessing. The stopping condition then has the following form: in any p subsequent iterations, at least q rules are not acceptable. Such "averaging" over the iterations removes variations and allows us to observe the longer-term behavior of rule acceptability. We set p = 10 and q = 8; those values were obtained by noticing that, when the null hypothesis states that rules are not worse than random guessing, at least 8 unacceptable rules must be obtained in 10 trials to reject the null hypothesis in the binomial test at confidence level 0.05. Another possibility for stopping the ensemble growth is running an internal cross-validation, but such a procedure has not been used in the experiment due to its computational complexity.

5.3. Ordinal classification

It is often the case that a meaningful order relation between class labels exists. For example, in recommender systems, users are often asked to evaluate items on a five-value ("stars") scale. Such problems are often referred to as ordinal classification problems. Here we show how the order relation can be taken into account in MLRules. Without loss of generality, we assume that the order between classes is concordant with the order between class labels coded as natural numbers Y = {1, ..., K}. To capture the ordinal properties of Y, we only take into account rules voting for "at least" and "at most" class unions, where by an "at least" class union we mean a set {k, ..., K} for some k, while by an "at most" class union we mean {1, ..., k}. Such rules can be incorporated by considering the vectors v in the response of the rule to be of the form v = (−1, ..., −1, 1, ..., 1) (a vote for an "at least" union) or v = (1, ..., 1, −1, ..., −1) (a vote for an "at most" union), so that the rule increases the probability of a class union, and not of a single class. The whole algorithm remains the same, apart from the formulas (16) and (17), which now take the form:

    ∇ℓ(f, Φv) = Σ_{Φ(xi)=1} Σ_{k=1}^K vk (pik − δ_{k,yi}),    (19)

    ∇²ℓ(f, Φv) = Σ_{Φ(xi)=1} [ 1 − ( Σ_{k=1}^K vk pik )² ].    (20)

The experimental verification of the usefulness of such a rule representation is postponed to future research due to the lack of space.

6. Experimental Results

In this section, we show the results of computational experiments on real datasets. First, we examine the behavior of the ensemble as the number of rules increases. Then, we compare our algorithm with existing approaches to rule induction.

6.1. Error curves

In Figure 1, we present the train and test error as a function of the ensemble size M for two real datasets (sonar and haberman), taken from the UCI Repository (Asuncion & Newman, 2007).

Figure 1. Train and test errors as functions of the ensemble size, obtained by splitting the data into train (66%) and test (33%) sets and averaging over 50 random splits. The lower (dash-dotted) lines correspond to the train error, the upper solid lines to the test error with the stopping condition described in Section 5.2, and the upper dashed lines to the test error when the stopping condition was not applied. The MLRules parameters are ν = 0.1, η = 0.5n. Panels: sonar (top), haberman (bottom).

On the sonar dataset, both ensembles (gradient and Newton) do not overfit, and the test error decreases even as the number of rules reaches 1000; this is a rather typical situation. An atypical one can be found on the second dataset (haberman), where from some point the test error starts to increase. However, the stopping condition described in Section 5.2 is then satisfied, which prevents the rule ensemble from overfitting.

6.2. Comparison with other rule ensemble algorithms

To check the performance of MLRules on real datasets, we compare it with three existing rule induction algorithms. SLIPPER (Cohen & Singer, 1999) was proposed within the AdaBoost reweighting scheme and uses an induction procedure which involves pruning. LRI (Weiss & Indurkhya, 2000) generates rules in the form of a DNF formula and uses a specific reweighting scheme based on cumulative errors. RuleFit (Friedman & Popescu, 2005) is based on the FSAM framework (Hastie et al., 2003), but it uses regression trees as base classifiers and then transforms them to rules. All three approaches are thus based on some boosting/reweighting strategy. To our knowledge, RuleFit has not been compared with SLIPPER and LRI before.

We used 35 files taken from the UCI Repository (Asuncion & Newman, 2007), among which 20 files are binary classification tasks and 15 are multi-class tasks. We omit the characteristics of the datasets due to lack of space. We tested four classifiers on each dataset (MLRules with gradient and Newton steps, LRI and SLIPPER), and RuleFit on the binary datasets only (RuleFit does not handle the multi-class case). We selected the following parameters for each method:

- SLIPPER: we set the maximum number of iterations to 500; the rest of the parameters were set to their default values (we kept the internal cross-validation used to choose the optimal number of rules).
- LRI: following (Weiss & Indurkhya, 2000), we set the rule length to 5 and froze features after 50 rounds; we also chose 200 rules per class and 2 disjuncts, since some previous tests showed that those values work well in practice.
- RuleFit: following the experiment in (Friedman & Popescu, 2005), we chose the mixed rule-linear mode, set the average tree size to 4, increased the number of trees to 500, and chose the subsample size as min{0.5n, 100 + 6√n}.
- MLRules: we set η = 0.5n, ν = 0.1, M = 500, but for the three biggest datasets (letter, optdigits, pendigits) we increased M to 2000 (for comparison, LRI had 26 × 200 rules for letter). Those parameters were not optimized on the UCI data; we used artificial data to this end, and due to the space limit we omit the characteristics of the data-generating model.

Each test was performed using 10-fold cross-validation (with exactly the same train/test splits for each classifier), and the average 0-1 loss on the test set was calculated. The results are shown in Table 1.

Table 1. Test errors and ranks (in parentheses). MLRules.G and MLRules.N denote MLRules with the gradient and the Newton method, respectively. The average ranks in the last row of the multi-class part correspond to comparing LRI and MLRules on all 35 files.

Binary-class datasets:

  Dataset          SLIPPER      LRI          RuleFit      MLRules.G    MLRules.N
  haberman         0.268 (3.0)  0.275 (5.0)  0.272 (4.0)  0.262 (2.0)  0.249 (1.0)
  breast-c         0.279 (3.0)  0.293 (4.0)  0.297 (5.0)  0.259 (1.0)  0.273 (2.0)
  diabetes         0.254 (4.0)  0.254 (3.0)  0.262 (5.0)  0.247 (1.0)  0.253 (2.0)
  credit-g         0.277 (5.0)  0.239 (1.0)  0.259 (3.0)  0.241 (2.0)  0.260 (4.0)
  credit-a         0.170 (5.0)  0.122 (1.0)  0.132 (3.0)  0.133 (4.0)  0.130 (2.0)
  ionosphere       0.065 (3.0)  0.068 (4.0)  0.085 (5.0)  0.060 (1.0)  0.063 (2.0)
  colic            0.150 (4.0)  0.161 (5.0)  0.147 (3.0)  0.139 (2.0)  0.133 (1.0)
  hepatitis        0.167 (2.0)  0.180 (3.0)  0.194 (4.0)  0.162 (1.0)  0.201 (5.0)
  sonar            0.264 (5.0)  0.149 (2.0)  0.197 (4.0)  0.120 (1.0)  0.154 (3.0)
  heart-statlog    0.233 (5.0)  0.196 (4.0)  0.185 (3.0)  0.167 (1.0)  0.174 (2.0)
  liver-disorders  0.307 (5.0)  0.266 (1.0)  0.307 (4.0)  0.275 (2.0)  0.278 (3.0)
  vote             0.050 (5.0)  0.039 (3.0)  0.050 (5.0)  0.034 (1.0)  0.037 (2.0)
  heart-c          0.195 (5.0)  0.185 (3.0)  0.189 (4.0)  0.165 (2.0)  0.155 (1.0)
  heart-h          0.200 (5.0)  0.183 (3.0)  0.183 (4.0)  0.180 (2.0)  0.170 (1.0)
  breast-w         0.043 (5.0)  0.033 (2.0)  0.041 (4.0)  0.031 (1.0)  0.034 (3.0)
  sick             0.016 (2.0)  0.018 (4.0)  0.019 (5.0)  0.016 (3.0)  0.012 (1.0)
  tic-tac-toe      0.024 (2.0)  0.122 (5.0)  0.053 (3.0)  0.113 (4.0)  0.003 (1.0)
  spambase         0.059 (5.0)  0.049 (3.0)  0.059 (4.0)  0.047 (2.0)  0.046 (1.0)
  cylinder-bands   0.217 (4.0)  0.165 (2.0)  0.381 (5.0)  0.144 (1.0)  0.193 (3.0)
  kr-vs-kp         0.006 (2.0)  0.031 (5.0)  0.029 (4.0)  0.010 (3.0)  0.005 (1.0)
  avg. rank        3.9          3.15         4.05         1.85         2.05

Multi-class datasets:

  Dataset          SLIPPER      LRI          RuleFit      MLRules.G    MLRules.N
  anneal           0.018        0.007 (3.0)  --           0.006 (1.5)  0.006 (1.5)
  balance-scale    0.17         0.088 (2.0)  --           0.078 (1.0)  0.091 (3.0)
  ecoli            0.211        0.140 (2.0)  --           0.149 (3.0)  0.140 (1.0)
  glass            0.340        0.285 (3.0)  --           0.244 (1.0)  0.248 (2.0)
  iris             0.080        0.053 (2.0)  --           0.053 (2.0)  0.053 (2.0)
  letter           0.821        0.069 (1.0)  --           0.137 (3.0)  0.088 (2.0)
  segment          0.215        0.021 (2.0)  --           0.029 (3.0)  0.020 (1.0)
  soybean          0.505        0.413 (3.0)  --           0.073 (2.0)  0.067 (1.0)
  vehicle          0.301        0.210 (1.0)  --           0.236 (3.0)  0.216 (2.0)
  vowel            0.448        0.059 (1.0)  --           0.148 (3.0)  0.104 (2.0)
  car              0.045        0.054 (2.0)  --           0.057 (3.0)  0.028 (1.0)
  cmc              0.477        0.435 (1.0)  --           0.437 (2.0)  0.439 (3.0)
  dermatology      0.161        0.057 (3.0)  --           0.019 (1.0)  0.024 (2.0)
  optdigits        0.560        0.019 (1.0)  --           0.026 (3.0)  0.021 (2.0)
  pendigits        0.460        0.010 (2.0)  --           0.014 (3.0)  0.010 (1.0)
  avg. rank        --           2.26         --           1.9          1.84

We first restrict the analysis to the binary-class problems only. To compare multiple classifiers on multiple datasets, we follow Demšar (2006) and use the Friedman test, which uses the ranks of each algorithm to check whether all the algorithms perform equally well (the null hypothesis). The Friedman statistic equals 33.28, which exceeds the critical value 9.488 (for confidence level 0.05), so we can reject the null hypothesis and state that the classifiers are not equally good. Next, we proceed to a post-hoc analysis and calculate the critical difference (CD) according to the Nemenyi statistic. We obtain CD = 1.364, which means that algorithms whose average ranks differ by more than 1.364 are significantly different. In Figure 2, the average ranks are marked on a line, and groups of classifiers that are not significantly different are connected. This shows that both MLRules algorithms are not significantly different from LRI; however, they outperform both SLIPPER and RuleFit. On the other hand, none of the three well-known rule ensemble algorithms (LRI, SLIPPER, RuleFit) is significantly better than any other.

Figure 2. Critical difference diagram (CD = 1.364); SLIPPER, RuleFit, LRI, MLRules.N and MLRules.G are placed at their average ranks on a scale from 5 to 1, and groups of classifiers that are not significantly different are connected.

The situation remains roughly the same if we compare the algorithms using all 35 datasets. We exclude RuleFit (it does not work with multi-class problems) and SLIPPER (its results are very poor, the worst almost every time¹). Thus, we end up with 3 algorithms. The Friedman statistic equals 3.53, which does not exceed the critical value 5.991, so the null hypothesis cannot be rejected. Note that the difference in ranks decreased, mainly because LRI performs excellently on the largest datasets (letter and digit recognition). It is interesting to check how much of the improvement in accuracy of MLRules comes from shrinkage, resampling and regularizing the rule response, because those techniques can also easily be incorporated into SLIPPER and LRI. We plan to investigate this issue in our future research.

¹ The SLIPPER software documentation says: "it is an experimental multi-class version" and "there is a known bug which occasionally causes incorrect output", so we decided not to consider its multi-class results.

7. Conclusions and Future Research

We proposed a new rule induction algorithm for solving classification problems, called MLRules, based on the maximum likelihood estimation method and using a boosting strategy for rule induction. In contrast to previously considered algorithms, it estimates the conditional class probability distribution and can therefore work with any cost matrix for classification. We considered two optimization techniques, based on gradient and Newton steps, and introduced a stopping condition to avoid overfitting. The performance of MLRules was verified on a large collection of datasets, both binary- and multi-class. Our algorithm is competitive with, or outperforms, the best existing approaches to rule induction. We also suggested a way in which MLRules can capture the order between classes and thus solve ordinal classification problems. We plan to verify this issue experimentally in the future.

References

Asuncion, A., & Newman, D. J. (2007). UCI machine learning repository.

Cohen, W. W. (1995). Fast effective rule induction. ICML (pp. 115–123).

Cohen, W. W., & Singer, Y. (1999). A simple, fast, and effective rule learner. National Conference on Artificial Intelligence (pp. 335–342).

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.

Dietterich, T. G. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40, 139–158.

Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189–1232.

Friedman, J. H., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28, 337–407.

Friedman, J. H., & Popescu, B. E. (2003). Importance sampled learning ensembles (Technical Report). Dept. of Statistics, Stanford University.

Friedman, J. H., & Popescu, B. E. (2005). Predictive learning via rule ensembles (Technical Report). Dept. of Statistics, Stanford University.

Fürnkranz, J. (1996). Separate-and-conquer rule learning. Artificial Intelligence Review, 13, 3–54.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2003). The elements of statistical learning: Data mining, inference, and prediction. Springer.

Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999). Functional gradient techniques for combining hypotheses (pp. 33–58). MIT Press.

Michalski, R. S. (1983). A theory and methodology of inductive learning (pp. 83–129). Tioga Publishing.

Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26, 1651–1686.

Stefanowski, J. (1998). On rough set based approaches to induction of decision rules. Rough Sets in Knowledge Discovery (pp. 500–529). Physica-Verlag.

Weiss, S. M., & Indurkhya, N. (2000). Lightweight rule induction. ICML (pp. 1135–1142).