Boosting products of base classifiers

Balázs Kégl¹  balazs.kegl@gmail.com
Róbert Busa-Fekete¹,²  busarobi@gmail.com

¹ LAL/LRI, University of Paris-Sud, CNRS, 91898 Orsay, France
² Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged, Aradi vértanúk tere 1., H-6720 Szeged, Hungary

Abstract

In this paper we show how to boost products of simple base learners. Similarly to trees, we call the base learner as a subroutine, but in an iterative rather than a recursive fashion. The main advantage of the proposed method is its simplicity and computational efficiency. On benchmark datasets, our boosted products of decision stumps clearly outperform boosted trees, and on the MNIST dataset the algorithm achieves the second best result among no-domain-knowledge algorithms after deep belief nets. As a second contribution, we present an improved base learner for nominal features and show that boosting the product of two of these new subset indicator base learners solves the maximum margin matrix factorization problem used to formalize the collaborative filtering task. On a small benchmark dataset, we get experimental results comparable to the semi-definite-programming-based solution but at a much lower computational cost.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).
1. Introduction

AdaBoost (Freund & Schapire, 1997) is one of the best off-the-shelf learning methods developed in the last decade. It constructs a classifier in an incremental fashion by adding simple classifiers to a pool and using their weighted "vote" to determine the final classification. AdaBoost was later extended to multi-class classification problems (Schapire & Singer, 1999). Although various other attempts have been made to deal directly with the multi-class setting, AdaBoost.MH has become the gold standard of multi-class boosting due to its simplicity and versatility.

In practice, boosting simple learners like decision stumps has often been found to be sub-optimal. Indeed, the class of linear combinations of stumps is not a universal approximator (e.g., the XOR problem is not in the class), and it is easy to find even toy examples for which there is a mismatch between the function class and the data distribution. The most common solution for overcoming this problem is to use trees as base learners, which call the simple base learner in a recursive fashion to partition the input space.

As the main contribution of this paper, we propose and explore another possibility: learning products of simple base learners. The main advantage of the proposed method is its simplicity and computational efficiency. Similarly to trees, we call the base learner as a subroutine, but in an iterative rather than a recursive fashion. In the optimization loop we fix all but one of the base classifier terms, temporarily re-label the points, and call the base learner using the "virtual" labels. On benchmark datasets, AdaBoost.MH using products of decision stumps clearly outperforms boosted trees. As a highlight, note that the 1.26% test error on the MNIST dataset is the best reported error rate among no-domain-knowledge algorithms after Salakhutdinov and Hinton's (2007) deep belief nets (1.00%); it is significantly better than the error rates of support vector machines (1.4%), randomly initialized back-propagation neural nets (1.6%), or boosted trees (1.7%).

As a second contribution, we propose an improved base learner for nominal features. In the standard approach, a nominal base learner selects only one value to test in each boosting iteration. Here we show how we can optimize a nominal base learner that acts as a subset indicator using the same computational effort. We will also show that boosting the product of two indicator base learners solves the maximum margin matrix factorization (MMMF) problem (Srebro et al., 2005), used to formalize and solve the collaborative filtering problem. We carried out experiments on a small benchmark dataset and compared the method to the semi-definite-programming-based (SDP) solution (Srebro et al., 2005). The results are slightly worse, but the computational effort needed to produce them was an order of magnitude smaller than in the case of SDP MMMF. Since AdaBoost.MH scales linearly with the data size, the approach holds greater promise for large collaborative filtering problems.

The paper is organized as follows. First we describe the algorithm in Section 2, then we present the experimental results in Section 3. Lastly we draw some brief conclusions and make a few pertinent remarks.

2. The algorithm

In Section 2.1 we first describe AdaBoost.MH to the extent that is necessary to understand our contribution. For further details we refer the reader to the original paper (Schapire & Singer, 1999). Section 2.2 contains details on the base learning, including our contribution on the subset indicator base learner. Then we describe the new product base learner in Section 2.3.

2.1. AdaBoost.MH

For the formal description, let X = (x_1, ..., x_n) be the n × d observation matrix, where the elements x_i^(j) of the d-dimensional observation vectors x_i ∈ X are either real numbers, or they come from an unordered set of cardinality M. In the latter case, without loss of generality, we will assume that x_i^(j) ∈ I = {1, ..., M}. The column vectors of X will be denoted by x^(j), j = 1, ..., d. We are also given a label matrix Y = (y_1, ..., y_n) of dimension n × K, where y_i ∈ {+1, −1}^K. In multi-class classification one and only one of the elements of y_i is +1, whereas in multi-label (or multi-task) classification y_i is arbitrary, meaning that the observation x_i can belong to several classes at the same time. In the former case we will denote the index of the correct class by ℓ(x_i).

The goal of the AdaBoost.MH algorithm (Schapire & Singer, 1999; Figure 1) is to return a vector-valued classifier f : X → R^K with a small Hamming loss

    R_H\big(f^{(T)}, W^{(1)}\big) = \sum_{i=1}^{n} \sum_{\ell=1}^{K} w_{i,\ell}^{(1)} \, I\big\{ \mathrm{sign}\big(f_\ell^{(T)}(x_i)\big) \neq y_{i,\ell} \big\}

by minimizing its upper bound (the exponential margin loss)

    R_e\big(f^{(T)}, W^{(1)}\big) = \sum_{i=1}^{n} \sum_{\ell=1}^{K} w_{i,\ell}^{(1)} \exp\big( -f_\ell^{(T)}(x_i)\, y_{i,\ell} \big),        (1)

where f_ℓ(x_i) is the ℓ-th element of f(x_i). The user-defined weights W^(1) = (w_{i,ℓ}^(1)) are usually set either uniformly to w_{i,ℓ}^(1) = 1/(nK), or, in the case of multi-class classification, to

    w_{i,\ell}^{(1)} = \begin{cases} \frac{1}{2n} & \text{if } \ell = \ell(x_i) \text{ (i.e., if } y_{i,\ell} = +1\text{)} \\ \frac{1}{2n(K-1)} & \text{otherwise (i.e., if } y_{i,\ell} = -1\text{)} \end{cases}        (2)

to create K well-balanced one-against-all classification problems. AdaBoost.MH builds the final classifier f as a sum of base classifiers h^(t) : X → R^K returned by a base learner algorithm BASE(X, Y, W^(t)) in each iteration t. In general, the base learner should seek to minimize the base objective

    E\big(h, W^{(t)}\big) = \sum_{i=1}^{n} \sum_{\ell=1}^{K} w_{i,\ell}^{(t)} \exp\big( -h_\ell(x_i)\, y_{i,\ell} \big).        (3)

Using the weight update formula of line 5 (Figure 1), it can be shown that

    R_e\big(f^{(T)}, W^{(1)}\big) = \prod_{t=1}^{T} E\big(h^{(t)}, W^{(t)}\big),

so minimizing (3) in each iteration is equivalent to minimizing (1) in an iterative greedy fashion. Defining the multi-class prediction as ℓ̂(x) = arg max_ℓ f_ℓ^(T)(x), it can also be proven that the "traditional" multi-class loss (or one-error)

    R\big(f^{(T)}\big) = \frac{1}{n} \sum_{i=1}^{n} I\big\{ \hat{\ell}(x_i) \neq \ell(x_i) \big\}

has an upper bound K R_e(f^(T), W^(1)) if the weights are initialized uniformly, and \sqrt{K-1}\, R_e(f^(T), W^(1)) with the multi-class initialization (2). This justifies the minimization of (1).

ADABOOST.MH(X, Y, W^(1), BASE(·,·,·), T)
 1  for t ← 1 to T
 2      (α^(t), v^(t), φ^(t)(·)) ← BASE(X, Y, W^(t))
 3      h^(t)(·) ← α^(t) v^(t) φ^(t)(·)
 4      for i ← 1 to n   for ℓ ← 1 to K
 5          w_{i,ℓ}^(t+1) ← w_{i,ℓ}^(t) exp(−h_ℓ^(t)(x_i) y_{i,ℓ}) / Σ_{i'=1}^n Σ_{ℓ'=1}^K w_{i',ℓ'}^(t) exp(−h_{ℓ'}^(t)(x_{i'}) y_{i',ℓ'})
 6  return f^(T)(·) = Σ_{t=1}^T h^(t)(·)

Figure 1. The pseudocode of the AdaBoost.MH algorithm. X is the observation matrix, Y is the label matrix, W^(1) is the initial weight matrix, BASE(·,·,·) is the base learner algorithm, and T is the number of iterations. α^(t) is the base coefficient, v^(t) is the vote vector, φ^(t)(·) is the scalar base classifier, h^(t)(·) is the vector-valued base classifier, and f^(T)(·) is the final (strong) classifier.
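To make the loop of Figure 1 concrete, here is a minimal NumPy sketch of the multi-class weight initialization (2) and the boosting iteration. It assumes a `base(X, Y, W)` routine returning `(alpha, v, phi)` as in line 2 of Figure 1; the function names are illustrative and not part of the released MultiBoost code.

```python
import numpy as np

def init_weights_multiclass(Y):
    """Multi-class initialization (2): 1/(2n) on the correct label,
    1/(2n(K-1)) on the other labels; the weights sum to 1."""
    n, K = Y.shape
    return np.where(Y > 0, 1.0 / (2 * n), 1.0 / (2 * n * (K - 1)))

def adaboost_mh(X, Y, base, T):
    """Sketch of ADABOOST.MH (Figure 1).  `base(X, Y, W)` is assumed to
    return (alpha, v, phi), where phi maps one observation to +/-1 and
    h(x) = alpha * v * phi(x)."""
    n, K = Y.shape
    W = init_weights_multiclass(Y)
    H = []
    for t in range(T):
        alpha, v, phi = base(X, Y, W)
        phi_out = np.array([phi(x) for x in X])
        hy = alpha * np.outer(phi_out, v) * Y      # h_l(x_i) * y_{i,l}
        W = W * np.exp(-hy)
        W /= W.sum()                               # weight update (line 5)
        H.append((alpha, v, phi))
    def f(x):                                      # strong classifier f^(T)
        return sum(a * v * phi(x) for a, v, phi in H)
    return f
```

The multi-class prediction is then simply the arg max of the K-vector returned by `f(x)`.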
2.2. Learning the base classifier

In this section we will first describe the details of multi-class base learning, then we will briefly define the decision stumps used on numerical features, and finally we will introduce a new technique for optimizing a nominal base learner.

There are two generic learning schemes available that are used to transform a scalar binary base classifier φ : R → {+1, −1} (or φ : I → {+1, −1}) into a (real) vector-valued base classifier h. In discrete AdaBoost.MH, h(x) is represented by h(x) = α v φ(x), where α ∈ R^+ is the base coefficient and v ∈ {+1, −1}^K is the vote vector, whereas in real AdaBoost.MH the vote vector v is real-valued and α = 1 (therefore it can be omitted). In discrete AdaBoost.MH it can be shown that, for a given φ, (3) is minimized by

    v_\ell = \begin{cases} +1 & \text{if } \mu_{\ell+} > \mu_{\ell-} \\ -1 & \text{otherwise,} \end{cases} \qquad \ell = 1, \ldots, K,        (4)

and

    \alpha = \frac{1}{2} \ln \frac{ \sum_{\ell=1}^{K} \big( \mu_{\ell+} I\{v_\ell = +1\} + \mu_{\ell-} I\{v_\ell = -1\} \big) }{ \sum_{\ell=1}^{K} \big( \mu_{\ell-} I\{v_\ell = +1\} + \mu_{\ell+} I\{v_\ell = -1\} \big) },

where

    \mu_{\ell-} = \sum_{i=1}^{n} w_{i,\ell} \, I\{ \varphi(x_i) \neq y_{i,\ell} \}, \qquad \ell = 1, \ldots, K,        (5)

is the weighted per-class error rate, and

    \mu_{\ell+} = \sum_{i=1}^{n} w_{i,\ell} \, I\{ \varphi(x_i) = y_{i,\ell} \}, \qquad \ell = 1, \ldots, K.        (6)

The goal of the scalar base learner is to return a φ that maximizes the edge

    \gamma(\varphi) = \sum_{i=1}^{n} \sum_{\ell=1}^{K} w_{i,\ell} \, v_\ell \, \varphi(x_i) \, y_{i,\ell}.        (7)

In real AdaBoost.MH, φ is found in the same way, by maximizing the edge (7) using the discrete votes (4), but then we set

    v_\ell = \frac{1}{2} \ln \frac{\mu_{\ell+}}{\mu_{\ell-}}, \qquad \ell = 1, \ldots, K.

This procedure finds the minimizer of (3) in discrete AdaBoost.MH. In real AdaBoost.MH it is suboptimal (corresponding to the two-stage greedy functional gradient descent approach of Mason et al. (2000)); however, it ensures that E(h^(t), W^(t)) < 1, so the algorithm will converge.
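As a concrete illustration of (4)-(7), the sketch below computes the per-class weights μ_{ℓ±}, the discrete vote vector, the coefficient α, and the edge for a scalar classifier whose ±1 outputs on the training set are given. It only covers this finishing step, not the search over candidate classifiers; the function name is illustrative.

```python
import numpy as np

def finish_discrete(phi_out, Y, W):
    """Given the outputs phi_out[i] in {+1,-1} of a fixed scalar base
    classifier, compute the per-class weights (5)-(6), the vote vector (4),
    the coefficient alpha, and the edge (7) for discrete ADABOOST.MH."""
    agree = (phi_out[:, None] == Y)                     # phi(x_i) == y_{i,l}
    mu_plus = np.where(agree, W, 0.0).sum(axis=0)       # (6): correct weight
    mu_minus = np.where(~agree, W, 0.0).sum(axis=0)     # (5): error weight
    v = np.where(mu_plus > mu_minus, 1.0, -1.0)         # votes (4)
    num = np.where(v > 0, mu_plus, mu_minus).sum()
    den = np.where(v > 0, mu_minus, mu_plus).sum()
    alpha = 0.5 * np.log(num / den)
    edge = float(((mu_plus - mu_minus) * v).sum())      # edge (7)
    return alpha, v, edge
```

In real AdaBoost.MH the same μ quantities give the real-valued votes v_ℓ = ½ ln(μ_{ℓ+}/μ_{ℓ−}) with α = 1.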
The simplest scalar base learner used in practice on numerical features is the decision stump, a one-decision two-leaf decision tree of the form

    \varphi_{j,b}(x) = \begin{cases} +1 & \text{if } x^{(j)} \ge b \\ -1 & \text{otherwise,} \end{cases}

where j is the index of the selected feature and b is the decision threshold. If the features are pre-ordered before the first boosting iteration, a decision stump maximizing the edge (7) can be found very efficiently in Θ(ndK) time (making the total running time Θ(nd(log n + KT))).

To handle nominal features x_i^(j) ∈ I^(j) = {1, ..., M^(j)} with a possibly large number of values M^(j), we introduce a novel subset indicator base learner (Figure 2). The standard procedure (Schapire & Singer, 1999) is equivalent to using stumps on one-hot encoded nominal features: only one of the nominal values ι ∈ I^(j) is tested in each iteration, using base learners of the form

    \varphi_{j,\iota}(x) = \begin{cases} +1 & \text{if } x^{(j)} = \iota \\ -1 & \text{otherwise.} \end{cases}        (8)

Optimizing this selector base learner takes Θ(ndK + M̄K) time, where M̄ = Σ_{j=1}^d M^(j) is the number of all the different feature values. Hence, to obtain a strong learner that potentially uses (tests) Θ(M^(j)) values for each feature, we will need Θ(M̄ndK + M̄²K) operations, which can be prohibitive if M̄ is large. On the other hand, we can optimize a base learner that makes a decision on all values of I^(j) in each iteration, using essentially the same number Θ(ndK + M̄K) of operations per iteration as the selector base learner. Formally, we consider base learners of the form

    \varphi_{j,u}(x) = u_{x^{(j)}},        (9)

where u is a vector over the index set I^(j). In the base learner, we initialize u randomly, and then we optimize v and u in an alternating iteration until convergence is achieved. Convergence is guaranteed since E(α v φ_{j,u}, W) is bounded from below by zero, it must decrease in each iteration, and the decrease is bounded away from zero because the weights w_{i,ℓ} are bounded away from zero. In practice INDICATORBASE always stopped after a few iterations. There is no guarantee that the global minimum of the base objective will be found, but this is not a problem in AdaBoost.MH: the boosting loop can continue with any base learner h with E(h, W^(t)) < 1.

INDICATORBASE(X, Y, W)
 1  for j ← 1 to d                                        ⊳ all (nominal) features x^(j) = (x_1^(j), ..., x_n^(j))
 2      (α_j, v_j, u_j) ← BESTINDICATOR(x^(j), Y, W, I^(j))
 3  j* ← arg min_j E(α_j v_j φ_{j,u_j}, W)
 4  return (α_{j*}, v_{j*}, φ_{j*,u_{j*}}(·))

BESTINDICATOR(x, Y, W, I)
 1  for ι ∈ I
 2      for ℓ ← 1 to K
 3          ω⁺_{ι,ℓ} ← 0, ω⁻_{ι,ℓ} ← 0
 4      u_ι ← RANDOM(±1)
 5  for i ← 1 to n
 6      for ℓ ← 1 to K
 7          if w_{i,ℓ} y_{i,ℓ} > 0 then
 8              ω⁺_{x_i,ℓ} ← ω⁺_{x_i,ℓ} + w_{i,ℓ} y_{i,ℓ}
 9          else
10              ω⁻_{x_i,ℓ} ← ω⁻_{x_i,ℓ} − w_{i,ℓ} y_{i,ℓ}
11  α ← 0, v ← 0
12  while TRUE
13      α_prev ← α, v_prev ← v                            ⊳ save current optimal α and v
14      for ℓ ← 1 to K:  v_ℓ ← sign( Σ_{ι∈I} (ω⁺_{ι,ℓ} − ω⁻_{ι,ℓ}) u_ι )   or   v_ℓ ← ½ ln [ Σ_{ι∈I} (ω⁺_{ι,ℓ} I{u_ι>0} + ω⁻_{ι,ℓ} I{u_ι<0}) / Σ_{ι∈I} (ω⁻_{ι,ℓ} I{u_ι>0} + ω⁺_{ι,ℓ} I{u_ι<0}) ]
15      α ← ½ ln [ Σ_{ℓ=1}^K Σ_{ι∈I} (ω⁺_{ι,ℓ} I{u_ι>0} + ω⁻_{ι,ℓ} I{u_ι<0}) / Σ_{ℓ=1}^K Σ_{ι∈I} (ω⁻_{ι,ℓ} I{u_ι>0} + ω⁺_{ι,ℓ} I{u_ι<0}) ]   or   α ← 1
16      if E(α v φ_u, W) ≥ E(α_prev v_prev φ_u, W) then
17          return (α_prev, v_prev, u)
18      α_prev ← α, u_prev ← u                            ⊳ save current optimal α and u
19      for ι ∈ I
20          u_ι ← sign( Σ_{ℓ=1}^K (ω⁺_{ι,ℓ} − ω⁻_{ι,ℓ}) v_ℓ )   or   u_ι ← ½ ln [ Σ_{ℓ=1}^K (ω⁺_{ι,ℓ} I{v_ℓ>0} + ω⁻_{ι,ℓ} I{v_ℓ<0}) / Σ_{ℓ=1}^K (ω⁻_{ι,ℓ} I{v_ℓ>0} + ω⁺_{ι,ℓ} I{v_ℓ<0}) ]
21      α ← ½ ln [ Σ_{ι∈I} Σ_{ℓ=1}^K (ω⁺_{ι,ℓ} I{v_ℓ>0} + ω⁻_{ι,ℓ} I{v_ℓ<0}) / Σ_{ι∈I} Σ_{ℓ=1}^K (ω⁻_{ι,ℓ} I{v_ℓ>0} + ω⁺_{ι,ℓ} I{v_ℓ<0}) ]   or   α ← 1
22      if E(α v φ_u, W) ≥ E(α_prev v φ_{u_prev}, W) then
23          return (α_prev, v, u_prev)

Figure 2. The pseudocode of the subset indicator base learner. Between lines 12 and 17 we optimize v and α given a fixed u, and between lines 18 and 23 we optimize u and α given a fixed v. The two alternatives in lines 14, 15, 20, and 21 refer to discrete and real AdaBoost.MH, respectively.
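For intuition, here is a minimal sketch of the alternating optimization inside BESTINDICATOR for a single nominal feature, in the discrete case. The edge-based stopping rule and the `max_rounds` cap are simplifications of the base-objective test used in Figure 2, and the function name is illustrative.

```python
import numpy as np

def indicator_base(x, Y, W, M, max_rounds=20, rng=None):
    """Alternating optimization of the subset indicator phi_u(x) = u[x] and
    the vote vector v for one nominal feature with values x[i] in {0,...,M-1}.
    Returns (u, v); the coefficient alpha can then be computed as in (4)."""
    rng = np.random.default_rng() if rng is None else rng
    sgn = lambda a: np.where(a >= 0.0, 1.0, -1.0)       # zero-safe sign
    # per-(value, class) positive / negative weight mass (omega^+, omega^-)
    n, K = Y.shape
    omega_pos = np.zeros((M, K))
    omega_neg = np.zeros((M, K))
    np.add.at(omega_pos, x, np.where(Y > 0, W, 0.0))
    np.add.at(omega_neg, x, np.where(Y < 0, W, 0.0))
    diff = omega_pos - omega_neg
    u = rng.choice([-1.0, 1.0], size=M)                 # random initialization of u
    v = sgn(diff.T @ u)                                 # optimize v given u
    best_edge = float(v @ (diff.T @ u))                 # edge (7) of h(x) = v * u[x]
    for _ in range(max_rounds):
        u_new = sgn(diff @ v)                           # optimize u given v
        v_new = sgn(diff.T @ u_new)                     # re-optimize v given the new u
        edge = float(v_new @ (diff.T @ u_new))
        if edge <= best_edge:                           # no improvement: stop
            break
        u, v, best_edge = u_new, v_new, edge
    return u, v
```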
Another advantage of the new indicator base learner (besides being faster) is that the vote vector v is shared among the selectors, while in the standard learner we learn a vote vector for each selector. This means that the capacity of the base classifier is considerably reduced (compared to the sum of selectors), suggesting that the algorithm is less susceptible to overfitting. Sharing the vote vector may also help to "pick up" dependencies between feature values in the case where there is structure in the feature space (e.g., using discrete AdaBoost.MH, each indicator φ_{j,u} clusters the feature values I^(j) into two groups).

In a broader sense, the proposed learner is related to Cohen's (1996) model, which allows set-valued features as well as the common real-valued and nominal features. The main difference is that in (Cohen, 1996) it is the features of the observations that can take a subset of a large set of possible values, but the decision functions are still selectors of the form (8), whereas what we propose here are subset-indicator decision functions (9). In another related algorithm (SLIPPER), Cohen and Singer (1999) build rules that are conjunctions of selectors (8) using a growing/pruning procedure. The selectors in the same base rule can use different features, whereas INDICATORBASE acts on different values of one feature. Although superficially similar to our method, both approaches use different base classifiers and different algorithms to learn the classifiers. On the other hand, they can be incorporated into our approach with just a few modifications to INDICATORBASE.

2.3. The product base learner

The goal of the procedure is to optimize base learners of the form

    h(\cdot) = \prod_{j=1}^{m} v_j \varphi_j(\cdot),

where the vote vectors v_j are multiplied element-wise, and the number of terms m is a complexity parameter that should be validated (similarly to the size of decision trees). We suppose that φ_j(·) is a simple base classifier equipped with a base learner that returns the optimal α, v_j, and φ_j(·), as described in Section 2.2.

To maximize the edge (7), we follow a simple iterative approach (Figure 3). In each iteration we fix each base learner except for one, φ_j, and maximize the edge with respect to φ_j. Because of the product form of the edge (7), we can carry out this maximization by simply calling the base learner of φ_j with "virtual" labels, defined as the sign of the product of the real labels and the outputs of the remaining base learners (line 7 in Figure 3). The procedure is guaranteed to converge since in each batch of m iterations at least one "virtual" sign y'_{i,ℓ} must change (otherwise the base learner in line 8 returns the same set of m classifiers, and we have equality in line 9), and the number of different sign vectors is finite. As with subset indicator base learners, we found that in practice PRODUCTBASE always returns after a few iterations.

PRODUCTBASE(X, Y, W, BASE(·,·,·), m)
 1  for j ← 1 to m
 2      φ_j(·) ← 1, v_j ← 1, α ← 1                        ⊳ terms are initialized to the constant 1 function
 3  while TRUE
 4      for j ← 1 to m
 5          φ̄(·) ← Π_{j'=1}^m φ_{j'}(·), v̄ ← Π_{j'=1}^m v_{j'}, ᾱ ← α    ⊳ save current optimal classifier
 6          for i ← 1 to n   for ℓ ← 1 to K
 7              y'_{i,ℓ} ← sign( y_{i,ℓ} Π_{j'≠j} v_{j',ℓ} φ_{j'}(x_i) )    ⊳ "virtual" labels
 8          (α, v_j, φ_j(·)) ← BASE(X, Y', W)
 9          if E( α Π_{j'=1}^m v_{j'} φ_{j'}, W ) ≥ E( ᾱ v̄ φ̄, W ) then
10              return (ᾱ, v̄, φ̄(·))

Figure 3. The pseudocode of the product base learner. m is the number of base classifier terms to be optimized. The vote vectors are multiplied element-wise in lines 5 and 9. The sign(·) operator in line 7 can be omitted in the case of discrete AdaBoost.MH.
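Below is a hedged NumPy sketch of the virtual-label loop of Figure 3, written against the same assumed `base(X, Y, W)` interface as before. The edge-based acceptance test and the `max_rounds` cap are simplifications of the base-objective comparison in lines 9-10 of Figure 3.

```python
import numpy as np

def product_base(X, Y, W, base, m, max_rounds=20):
    """Optimize h(x) = prod_j v_j * phi_j(x) (vote vectors multiplied
    element-wise over the K classes) by cyclically re-fitting one factor on
    'virtual' labels.  `base(X, Y, W)` is assumed to return (alpha, v, phi)
    as in Section 2.2; in discrete ADABOOST.MH all factors are +/-1, so
    np.sign below is the identity."""
    n, K = Y.shape
    vs = [np.ones(K) for _ in range(m)]              # factors initialized to the
    outs = [np.ones(n) for _ in range(m)]            # constant 1 function
    phis = [(lambda x: 1.0) for _ in range(m)]
    best_edge = -np.inf
    for _ in range(max_rounds):
        improved = False
        for j in range(m):
            rest = np.ones((n, K))                   # product of the other terms
            for jj in range(m):
                if jj != j:
                    rest *= outs[jj][:, None] * vs[jj]
            Y_virtual = np.sign(Y * rest)            # 'virtual' labels (line 7)
            _, v, phi = base(X, Y_virtual, W)
            phi_out = np.array([phi(x) for x in X])
            edge = float((W * Y * rest * phi_out[:, None] * v).sum())   # edge (7)
            if edge > best_edge:                     # keep the improved factor
                vs[j], outs[j], phis[j] = v, phi_out, phi
                best_edge, improved = edge, True
        if not improved:
            break
    return vs, phis                                  # h(x) = prod_j vs[j] * phis[j](x)
```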
REMARK 1: PRODUCTS OF STUMPS. In experiments (see Section 3) we found that AdaBoost.MH with products of stumps achieves excellent test results on benchmark data sets, for which, at this point, we can only provide a partial explanation. First, it is easy to see that the algorithm solves the XOR problem: a product of two stumps can obviously implement the XOR function. As an extension, the product of d stumps can implement the parity function on any subsequence of the (binary or thresholded real-valued) feature vector up to d elements. It can also be shown that the class of sums of products of d stumps is a universal approximator if and only if the observation space X is at most d-dimensional. (We only give an outline of the constructive proof here: f can be set to an arbitrary value on any hypercube in R^d, independently of the rest of the space, by carefully selecting and weighting products of stumps placed at the corners of the hypercube.) Second, it is clear that the class of sums of products of stumps is a subclass of decision trees: there are constraints on how homogeneous regions are created, and each base learner term of the product has a "global" contribution. Nevertheless, we also found that one product of stumps does not have enough capacity to achieve a low test error even if m is set to infinity: PRODUCTBASE underfits the data, probably due to the greediness of the optimization loop. Hence, unlike trees, products cannot be used as standalone strong learners. On the other hand, boosting products is less susceptible to overfitting than boosting trees, for which we also provide some experimental evidence.

REMARK 2: PRODUCTS OF SUBSET INDICATORS. Boosting products of indicator base classifiers (9) is a solution of maximum margin matrix factorization (MMMF) (Srebro et al., 2005), used to formalize and solve the collaborative filtering problem. Take the simple case when the observations consist of two index features x^(1) ∈ I^(1) and x^(2) ∈ I^(2) (for example, user and movie ids), and the classification is binary (for example, y = +1 if user x^(1) liked movie x^(2) and y = −1 if he or she disliked it). Then individual base classifiers take the form

    h_{u^{(t)}, z^{(t)}}\big(x^{(1)}, x^{(2)}\big) = \alpha^{(t)} \, u^{(t)}_{x^{(1)}} \, z^{(t)}_{x^{(2)}},

where u^(t) and z^(t) are vectors over I^(1) and I^(2), respectively. The strong classifier f can then be used to classify every pair of I^(1) × I^(2). The resulting M^(1) × M^(2) matrix can be expressed as the product of two matrices, an M^(1) × T "user feature" matrix containing the column vectors √(α^(t)) u^(t), and a T × M^(2) "movie feature" matrix containing the row vectors √(α^(t)) z^(t). The minimization of the exponential margin loss (1) leads to a large-margin solution, and all the relevant results on the generalization error (e.g., Schapire et al., 1998) apply immediately. Finding two low-rank (or otherwise low-complexity) matrices whose product produces large margins on the training data is the exact goal of MMMF (Srebro et al., 2005).

In a certain sense, boosting products of two index learners is related to the semi-definite-programming-based solution of MMMF as the classical binary AdaBoost is related to support vector machines. The algorithm can be applied directly to multi-valued preferences by adding the optimization of the vote vectors v^(t); in this case, the preference prediction matrix becomes a hyper-matrix with vector-valued elements. The formulation also permits one to boost products of more than two indices (or to use a base learner two or more times on the same index), which goes beyond the matrix-decomposition-based formulation of the collaborative filtering problem. It is also quite straightforward to combine collaborative features with "traditional" numerical or nominal covariates by allowing mixed products. The main shortcoming of the method at this point is that the exponential margin loss (1) is designed for multi-class or multi-label classification, and it is sub-optimal if the preferences come from an ordered set. One of our aims for future study is to examine the possibility of using product learners in regression or ranking.
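The factorization of Remark 2 is easy to state in code: stacking √(α^(t)) u^(t) as columns and √(α^(t)) z^(t) as rows reproduces the boosted preference matrix. The sketch below (illustrative names only, with random terms standing in for the boosted ones) simply verifies this identity.

```python
import numpy as np

def factor_matrices(alphas, us, zs):
    """Given boosted terms h_t(x1, x2) = alpha_t * u_t[x1] * z_t[x2], build
    the M1 x T 'user feature' matrix U and the T x M2 'movie feature'
    matrix Z so that the full preference matrix is simply U @ Z."""
    U = np.column_stack([np.sqrt(a) * u for a, u in zip(alphas, us)])
    Z = np.vstack([np.sqrt(a) * z for a, z in zip(alphas, zs)])
    return U, Z

# tiny consistency check on random +/-1 indicator vectors
rng = np.random.default_rng(0)
T, M1, M2 = 5, 4, 6
alphas = rng.uniform(0.1, 1.0, size=T)
us = [rng.choice([-1.0, 1.0], size=M1) for _ in range(T)]
zs = [rng.choice([-1.0, 1.0], size=M2) for _ in range(T)]
U, Z = factor_matrices(alphas, us, zs)
F = sum(a * np.outer(u, z) for a, u, z in zip(alphas, us, zs))
assert np.allclose(U @ Z, F)      # the boosted sum is a rank-T factorization
```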
3. Experiments

To test the algorithm, we carried out two sets of experiments. (The multi-platform C++ source code is available at http://users.web.lal.in2p3.fr/kegl/research/multiboost.) In the first we boosted products of decision stumps on five benchmark datasets, using the standard train/test cuts. (The data sets are available at yann.lecun.com/exdb/mnist (MNIST), www.kernel-machines.org/data.html (USPS), www.ics.uci.edu/~mlearn/MLRepository.html (letter, pendigit, isolet), and www.cs.umn.edu/Research/GroupLens (MovieLens).) We used discrete AdaBoost.MH with multi-class initial weights (2). We found no overfitting whatsoever in terms of the number of boosting iterations T (Figure 4), so instead of validating T and performing early stopping, we decided to run the algorithm for a long time after convergence (T = 10^5 in each experiment) and measure the average test error R̄(f^(T)) on the last T/2 iterations. The advantage of this approach is that the estimate is more robust to random fluctuations after convergence than the error at a given iteration. It is also a pessimistic estimate of the error when there is slight overfitting (since the average is always an upper bound of the minimum). For the two image datasets we also report results using AdaBoost.MH with stumps on Haar filters (Viola & Jones, 2004). (These experiments are not the subject of this paper, but we give some details for reproducibility: we used five filter types ("bw" and "bwb", vertical and horizontal, and a four-field checkerboard), and due to the large number of possible filters (of the order of 10^5), we selected just the best of a random 100 in each boosting iteration.)

We conducted experiments using a range of values for the number of terms m, for which we report the test errors in Table 1. The optimal number of terms was selected by an 80%-20% simple validation on the training set. We compared our algorithm with two versions of boosted decision trees. We implemented a tree learner similar to ID3 in our AdaBoost.MH version; it uses boosting's base objective (3) (instead of the entropy-based criterion) to learn stumps in a recursive fashion. Since this algorithm was more prone to overfitting (last plot in Figure 4), we validated both the number of leaves N and the number of boosting iterations T. Table 1 indicates that, with one exception, boosting products is significantly better than boosting trees. We also tested the Weka (Witten & Frank, 2005) implementation of boosted trees. However, the comparison is slightly unfair for two reasons: first, the Weka implementation uses AdaBoost.M1, which is known to be inferior to AdaBoost.MH for multi-class boosting, and second, the high computational complexity did not allow us a complete exploration of the hyperparameter space, so we cannot claim that we have found the optimal tree size and number of iterations T. Nevertheless, these results give a reasonable idea of what the novice user's experience with AdaBoost would be.
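For reference, the error figures reported in Table 1 below are this averaged estimate: the mean of the per-iteration test errors R(f^(t)) over the last T/2 iterations (expressed as a percentage). A trivial sketch, assuming the per-iteration test errors have been recorded during the run:

```python
import numpy as np

def averaged_test_error(test_errors):
    """Robust error estimate of Section 3: the mean of the per-iteration test
    errors R(f^(t)) over the last T/2 boosting iterations.  It is less
    sensitive to fluctuations than the error at any single iteration and is
    an upper bound on the minimum; multiply by 100 for the percentages
    reported in Table 1."""
    T = len(test_errors)
    return float(np.mean(test_errors[T // 2:]))
```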
[Figure 4 consists of eight learning-curve panels plotting the test error against the number of boosting iterations T (T = 10^2 to 10^5) for the winning settings: MNIST test error with m = 3; MNIST test error with m = 3 (Haar); USPS test error with m = 3; USPS test error with m = 2 (Haar); UCI pendigit test error with m = 2; UCI isolet test error with m = 8; UCI letter test error with m = 10; UCI letter test error with trees, N = 38.]

Figure 4. The "winning" learning curves on the five benchmark datasets indicate no overfitting in terms of the number of boosting iterations T when products are used. With a tree base learner (last plot), on the other hand, we do observe slight overfitting sometimes.

Table 1. Test error percentages (the average of R(f^(t)) over the last T/2 iterations, in percent, with standard deviations in parentheses) on benchmark datasets using discrete AdaBoost.MH with products of stumps. The results with the optimal number of terms m, selected by 80%-20% simple validation on the training set, are marked with *. Tree and Haar/Tree are "in-house" implementations of a tree base learner; in this case both T and the number of leaves N were validated. For AdaBoost.M1/C4.5 we used the Weka implementation.

learner \ data set                    MNIST          USPS           UCI pendigit   UCI isolet     UCI letter
Stump (m = 1)                         7.71 (0.05)    6.48 (0.05)    4.97 (0.00)    4.88 (0.07)    14.74 (0.06)
Product / m = 2                       1.56 (0.03)    4.92 (0.06)    1.89 (0.02)*   3.91 (0.10)     3.65 (0.04)
Product / m = 3                       1.26 (0.02)*   4.24 (0.04)*   2.07 (0.03)    3.97 (0.04)     2.71 (0.02)
Product / m = 4                       1.38 (0.02)    4.72 (0.07)    -              3.95 (0.05)     2.54 (0.06)
Product / m = 5                       -              -              -              3.87 (0.07)     2.41 (0.04)
Product / m = 6                       -              -              -              3.92 (0.08)     2.40 (0.03)
Product / m = 8                       -              -              -              3.89 (0.04)*    2.38 (0.04)
Product / m = 10                      -              -              -              4.11 (0.04)     2.35 (0.04)*
Haar (m = 1) (Viola & Jones, 2004)    1.02 (0.02)    4.29 (0.05)    -              -               -
Haar / Product / m = 2                0.84 (0.02)    3.84 (0.07)*   -              -               -
Haar / Product / m = 3                0.87 (0.02)*   4.03 (0.06)    -              -               -
Haar / Product / m = 4                0.90 (0.02)    4.05 (0.04)    -              -               -
Tree                                  1.53 (0.02)    4.73 (0.04)    2.14 (0.04)    3.69 (0.04)     2.62 (0.04)
Haar / Tree                           1.08 (0.02)    4.98 (0.05)    -              -               -
AdaBoost.M1 / C4.5                    4.05           5.98           2.66           4.81            2.88

On an absolute scale the results are on a par with the state-of-the-art results reported in the literature on the same data sets. In particular, they are the best results obtained by any ensemble method. We would especially like to underline the 1.26% test error on the MNIST dataset, which is the best reported error rate among generic classification algorithms (algorithms that do not make explicit use of the fact that the observation vectors represent images) after Salakhutdinov and Hinton's (2007) deep belief nets (1.00%); it is significantly better than the error rates of support vector machines (1.4%) and randomly initialized two-layer back-propagation neural nets using the cross-entropy loss (1.6%).

It also seems quite surprising that we were able to improve on boosting stumps over the feature space generated by Haar filters. Boosting stumps over Haar filters is already one of the best semi-generic methods (algorithms that do make explicit use of the fact that the observation vectors represent images, but not of the fact that the images depict characters), achieving error rates similar to those of convolutional neural nets (LeCun et al., 1998).
Boosting these feature extractors also outperforms boosting stumps and even boosting products of stumps (on MNIST). The feature space has a large dimension (of the order of 10^5) and the Haar filters are well adapted to natural images, so one would expect that a simple linear combination over this space can achieve the best results. While this intuition is reaffirmed when using trees over the Haar space, we were genuinely surprised to see that using the product of a small number of Haar filters as a base learner can significantly outperform boosting single Haar filters.

In the second set of experiments we tested the algorithm on a small subset of the MovieLens collaborative filtering database using the same experimental settings as Srebro et al. (2005): the data is divided into four sets, and for each of the four test sets the algorithms are trained and validated on the remaining three in a 3-fold cross-validation. The two best of the three learners are selected and tested on the hold-out test set. In the case of AdaBoost, we validated the real/discrete version, the weight initialization, the number of terms m, and the number of iterations T. In each experiment, real AdaBoost.MH with uniform weight initialization w_{i,ℓ}^(1) = 1/(nK) and m = 2 proved to be the best approach. The results in Table 2 place this approach between WLRA (Marlin, 2004) and semi-definite-programming-based MMMF (Srebro et al., 2005). The computational effort needed to produce the results was an order of magnitude smaller than in the case of SDP MMMF (minutes vs. hours). Since AdaBoost.MH scales linearly with the data size, this approach holds greater promise for large collaborative filtering problems.

Table 2. Test error percentages on the MovieLens dataset using the experimental setup of (Srebro et al., 2005). The WLRA and MMMF SDP errors were taken directly from (Srebro et al., 2005).

set    WLRA (Marlin, 2004)    MMMF SDP (Srebro et al., 2005)    AdaBoost
1      rank 2: 57.5           max-norm, C = 0.0012: 56.2        T = 7275: 56.3
2      rank 2: 56.2           trace norm, C = 0.24: 55.2        T = 9940: 54.5
3      rank 1: 54.3           max-norm, C = 0.0012: 52.7        T = 475:  53.9
4      rank 2: 55.3           max-norm, C = 0.0012: 55.0        T = 9515: 56.6
Avg    55.8                   54.8                              55.3
4. Conclusions

In this paper we described and tested AdaBoost.MH with products of simple base learners, introduced a new nominal base learner, and demonstrated that boosting products of this new base learner solves the MMMF problem. We found that boosting products outperforms boosting trees, that it is less prone to overfitting, and that it can even improve on boosting stumps in complex feature spaces where boosting stumps is expected to be the state of the art. The main issues that we plan to attack in the near future are 1) the extension of the algorithm to regression and ranking, and 2) initializing the subset indicators in a more sophisticated way than the random initialization used here.

References

Cohen, W. (1996). Learning trees and rules with set-valued features. Proceedings of the AAAI Conference on Artificial Intelligence (pp. 709–716).

Cohen, W., & Singer, Y. (1999). A simple, fast, and effective rule learner. Proceedings of the AAAI Conference on Artificial Intelligence (pp. 335–342).

Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.

Marlin, B. (2004). Collaborative filtering: a machine learning perspective. Master's thesis, University of Toronto.

Mason, L., Bartlett, P., Baxter, J., & Frean, M. (2000). Boosting algorithms as gradient descent. Advances in Neural Information Processing Systems (pp. 512–518).

Salakhutdinov, R., & Hinton, G. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. International Conference on Artificial Intelligence and Statistics (pp. 412–419).

Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics, 26, 1651–1686.

Schapire, R. E., & Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37, 297–336.

Srebro, N., Rennie, J. D. M., & Jaakkola, T. (2005). Maximum margin matrix factorization. Advances in Neural Information Processing Systems (pp. 1329–1336). The MIT Press.

Viola, P., & Jones, M. (2004). Robust real-time face detection. International Journal of Computer Vision, 57, 137–154.

Witten, I., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques. Morgan Kaufmann.