ManifoldBoost: Stagewise Function Approximation for Fully-, Semi- and Un-supervised Learning

Nicolas Loeff (loeff@uiuc.edu)
Department of Electrical and Computer Engineering, University of Illinois, Urbana, IL 61801

David Forsyth (daf@uiuc.edu), Deepak Ramachandran (dramacha@uiuc.edu)
Department of Computer Science, University of Illinois, Urbana, IL 61801

Abstract

We describe a manifold learning framework that naturally accommodates supervised learning, partially supervised learning and unsupervised clustering as particular cases. Our method chooses a function by minimizing loss subject to a manifold regularization penalty. This augmented cost is minimized using a greedy, stagewise, functional minimization procedure, as in GradientBoost. Each stage of boosting is fast and efficient. We demonstrate our approach using both radial basis function approximations and trees. The performance of our method is at the state of the art on many standard semi-supervised learning benchmarks, and we produce results for large-scale datasets.

(Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).)

1. Introduction

Manifold learning algorithms exploit geometric (or correlation) properties of datasets in high-dimensional spaces. The literature is too large to review in detail here (163 references in a recent review (Zhu, 2006)). Many different approaches have been pursued that utilize manifold structure, such as constructing an explicit parametrization (e.g. (Tenenbaum et al., 2000; Roweis & Saul, 2000; Donoho & Grimes, 2003)), introducing a penalty term that imposes smoothness conditions on functions restricted to the manifold (e.g. (Sindhwani et al., 2006)), adjusting kernel smoothing bandwidths to account for manifold properties (e.g. (Bickel & Li, 2007)), and inferring labels for unlabeled data using a harmonic smoother (e.g. (Zhu et al., 2003)).

There is a rough distinction in semi-supervised learning between manifold-based algorithms, which expect data to lie embedded in a space of lower intrinsic dimensionality, and cluster-based algorithms, which expect data to lie in clumps (the distinction seems to explain some differences in performance on different datasets (Chapelle et al., 2006)). There is some disagreement about the benefits of using unlabeled data, which may not always improve the asymptotic error rate of a regression estimator (Lafferty & Wasserman, 2007). On the other hand, (Niyogi, 2008) argues that manifold learning is useful insofar as the marginal of the data $P_x$ can be linked with the conditional $P_{y|x}$ via the manifold.

Computational complexity is a common problem for most semi-supervised approaches. Write $l$ for the number of labeled data items and $u$ for the number of unlabeled data items. Many algorithms scale as badly as $O((l+u)^3)$ (Zhu, 2006). Transductive support vector machines must solve a quadratic programming problem in $(l+u)$ variables (Joachims, 1999). Manifold smoothing of an SVM solves a quadratic programming problem in $l$ variables, followed by a linear problem in $l+u$ variables; the situation is better for a linear SVM if feature vectors are sufficiently sparse (Sindhwani et al., 2006). Harmonic smoothing solves a relatively sparse linear system in $l$ variables. This problem is relatively tractable, because the linear system involves the Laplacian of the smoothing kernel and so should be diagonally dominant (see (Dyn et al., 1986) for relevant observations in the context of radial basis functions). Each method must pay the cost of forming the Laplacian. For functional approximation schemes other than kernel smoothing, the complexity of current manifold learning methods in the number of training examples appears to be high. This is a problem: it is natural to want to use a manifold regularization term with such methods as tree-structured classifiers, and with very large datasets.

Gradient boosting poses function approximation as a variational problem, then uses a form of coordinate ascent on that problem ((Friedman, 1999); section 2). In this paper, we describe a variation on gradient boosting that can exploit a manifold regularization term, is fast and efficient for many forms of functional approximation (section 2.1), provides out-of-sample extensions (section 4), offers performance at the state of the art on standard datasets, and is capable of handling very large datasets (section 6). In the extreme case, when there is no supervision, the generalized method gracefully degrades into a clustering method (section 4). Finally, we show that our framework also easily extends to multi-class problems by choosing suitable loss functions (section 5).
2. Semi-Supervised Boosting

We follow convention by minimizing the sum of an expected loss and a regularization term. We must predict labels $y \in \mathcal{Y}$ for patterns $x \in \mathcal{X}$. We assume a probability distribution $P_{x,y}$ over $\mathcal{X} \times \mathcal{Y}$. We will further assume that the support of the marginal $P_x$ lies on a domain $\mathcal{M} \subset \mathcal{X}$. Typically, this domain is of lower intrinsic dimension than $\mathcal{X}$; the term manifold is widely used to refer to such domains, though we require no manifold properties. Write the predictor as $F(x)$, and the cost function as $\phi(y, F(x))$. We would like to find the minimizer $F^* = \arg\min_{F \in \mathcal{H}} V[F]$ of the cost functional

\[
V[F] = \int \phi(y, F(x))\, dP_{x,y} \;+\; \gamma_M \int_{\mathcal{M}} \|\nabla_M F(x)\|^2\, dP_x \qquad (1)
\]

restricted to some function family $\mathcal{H}$; the first term is the expected loss, the second the manifold regularization. Our regularization term is of the same form as that of (Sindhwani et al., 2006), and encourages smoothness of the solution in regions of high probability density. We control the complexity of the solution by choosing $\mathcal{H}$ and using the shrinkage approach of (Friedman, 1999).

This expression is very general. There are many possible choices for $\phi(y, F)$. Expressions such as $|y - F|$ and $(y - F)^2$ are typically used for regression. Expressions such as $\exp(-yF)$ and the binomial log-likelihood $\log(1 + \exp(-2yF))$ penalize the margin $yF$, and are typically used for classification.
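For concreteness, these candidate losses are written out below in plain NumPy (an illustration of ours, not code from the paper); the snippet also includes the derivative of the binomial log-likelihood with respect to $F$, since that derivative is exactly the data term of the finite-sample "gradient" that reappears in Section 3.

```python
import numpy as np

# Candidate losses phi(y, F); y is a label in {-1, +1}, F the value of the predictor.
def absolute_loss(y, F):            # |y - F|, typically used for regression
    return np.abs(y - F)

def squared_loss(y, F):             # (y - F)^2, typically used for regression
    return (y - F) ** 2

def exponential_loss(y, F):         # exp(-y F), penalizes the margin y F
    return np.exp(-y * F)

def binomial_log_likelihood(y, F):  # log(1 + exp(-2 y F)), the loss used in Section 3
    return np.logaddexp(0.0, -2.0 * y * F)

def binomial_dloss_dF(y, F):
    # d/dF log(1 + exp(-2 y F)) = -2 y / (1 + exp(2 y F));
    # a base regressor is later fitted to (the negative of) this quantity.
    return -2.0 * y / (1.0 + np.exp(2.0 * y * F))
```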
2.1. ManifoldBoost Framework

2.1.1. Stagewise Functional Minimization ($P_x$ Known)

Following the work of Friedman (Friedman, 1999), we will find an additive solution of the form

\[
F_M(x) = \sum_{m=0}^{M} f_m(x) \qquad (2)
\]

We will proceed in a greedy fashion. Assume we have a solution for $M = m$; we will then minimize $V[F_m + f_{m+1}]$ with respect to $f_{m+1}$. Following (Friedman, 1999), we obtain a descent direction from the first variation of $V$,

\[
V[F_m + \epsilon f] = V[F_m] + \epsilon\, \delta V[F_m, f] + O(\epsilon^2), \quad \text{where} \quad \delta V[F_m, f] = \frac{d}{d\epsilon} V[F_m + \epsilon f]\Big|_{\epsilon=0} \qquad (3)
\]

Write $\langle u, v\rangle$ for the usual inner product in $L^2$. Now $\delta V[F_m, f]$ is a linear functional of $f$, so there is some $G_V(F_m)$ -- which we regard as the "gradient" of the cost -- such that $\delta V[F_m, f] = \langle G_V(F_m), f\rangle$. Now we have that $\langle G_V(F_m), f\rangle$ is equal to

\[
\int_{\mathcal{M}} f(x) \left[ \int_{\mathcal{Y}} \frac{\partial \phi(y, u)}{\partial u}\Big|_{u = F_m(x)} dP_{y|x} \right] dP_x \;+\; 2\gamma_M \int_{\mathcal{M}} \nabla_M f(x)^{\top} \nabla_M F_m(x)\, dP_x \qquad (4)
\]

Assuming sufficient regularity, recalling that $P_x = 0$ on the boundary of the support of $P_x$, and using the first Green identity, we have that $\langle G_V(F_m), f\rangle$ is equal to

\[
\int_{\mathcal{M}} f(x) \left[ \int_{\mathcal{Y}} \frac{\partial \phi(y, u)}{\partial u}\Big|_{u = F_m(x)} dP_{y|x} \;+\; 2\gamma_M \nabla^2_M F_m(x) \right] dP_x \qquad (5)
\]

where $\nabla^2_M = -\nabla \cdot \nabla_M$ is the Laplace-Beltrami operator. The optimal descent direction is a function $f$ that maximizes $-\langle G_V(F_m), f\rangle$ (subject, if necessary, to a norm constraint on $f$). The term $f_{m+1} = \epsilon f$ is obtained using line search, minimizing the true cost $V[F_m + \epsilon f]$ with respect to $\epsilon$.

2.1.2. Finite Data

Generally, neither $P_x$ nor $P_{x,y}$ is known. Instead, we have a sample of labelled data $\{x_i, y_i\}_{i=1}^{l}$ and of unlabelled data $\{x_i\}_{i=l+1}^{l+u}$. Now integrals become sums over data points. Generally, $\{f_m(x)\}$ will belong to a parametric family of functions (e.g. radial basis functions, decision trees, etc.).

The Laplacian operator in equation 5 must be discretized. In high dimensions, we cannot triangulate the data set. A smoothed Laplacian is equivalent to the difference between a short-scale average of the data and a long-scale average (e.g. the use of unsharp masking in photography, or the difference of Gaussians in computer vision). The graph Laplacian is a linear operator that takes a function on the graph to the weighted difference between the function value and the average over the K nearest neighbours. This means it is usual to approximate the Laplacian operator with the graph Laplacian (e.g. see (Sindhwani et al., 2006)). Write the graph Laplacian as $L^M$. The cost function becomes

\[
V[F] = \frac{1}{l} \sum_{i=1}^{l} \phi(y_i, F(x_i)) \;+\; \frac{\gamma_M}{(l+u)K} \sum_{i,j} F(x_i)\, L^M_{ij}\, F(x_j) \qquad (6)
\]
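As a rough illustration of this discretization, the sketch below builds an unnormalized graph Laplacian over a symmetrized K-nearest-neighbour graph with binary weights (the weighting used in the experiments of Section 6.1) and evaluates the cost of equation (6) for the binomial loss. The helper names and the dense-matrix implementation are our own simplification, not the authors' code.

```python
import numpy as np

def knn_graph_laplacian(X, K=8):
    """Unnormalized graph Laplacian L = D - W of a symmetrized K-NN graph with binary weights."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                          # a point is not its own neighbour
    W = np.zeros((n, n))
    nbrs = np.argsort(d2, axis=1)[:, :K]                  # indices of the K nearest neighbours
    W[np.repeat(np.arange(n), K), nbrs.ravel()] = 1.0
    W = np.maximum(W, W.T)                                # symmetrize the adjacency matrix
    return np.diag(W.sum(axis=1)) - W

def manifold_cost(F, y, L, l, gamma, K=8):
    """Finite-sample objective of eq. (6): mean binomial loss on the l labelled points
    plus the graph-Laplacian smoothness penalty over all l + u points."""
    n = len(F)
    data_term = np.mean(np.logaddexp(0.0, -2.0 * y * F[:l]))
    smooth_term = gamma / (n * K) * (F @ L @ F)
    return data_term + smooth_term
```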
Again, assume we know $F_m$, and seek $f_{m+1}$. We will find a function $f$ that maximizes $-\langle G_V(F_m), f\rangle$; then we will weight this function using line search. The inner product is $\langle a, b\rangle = \frac{1}{N}\sum_{i=1}^{N} a(x_i)\, b(x_i)$, and we have that

\[
\langle G_V(F_m), f\rangle = \frac{1}{l} \sum_{i} f(x_i)\, \frac{\partial \phi(y_i, u)}{\partial u}\Big|_{u=F_m(x_i)} \;+\; \frac{2\gamma_M}{(l+u)K} \sum_{i,j} f(x_i)\, L^M_{ij}\, F_m(x_j) \qquad (7)
\]

Now $-\langle G_V, f\rangle$ is linear in $f$, and so we should maximize it subject to a norm constraint on $f$. If the norm is fixed, then maximizing this expression is equivalent to minimizing $\|G_V - f\|^2 = \|G_V\|^2 - 2\langle G_V, f\rangle + \|f\|^2$. This means any squared-loss regression algorithm can be used to find the optimal parameters. Our variational formulation explains why Friedman's choice of making $f$ parallel to the gradient $G_V$ and posing the problem as squared-error minimization is natural. Once the descent direction $f$ is found, the final $f_{m+1} = \epsilon f$ is obtained using line search, minimizing the true cost $V[F_m + \epsilon f]$.

3. Two Examples: Tree and RBF ManifoldBoost Algorithms

We offer two example algorithms with calculations to illustrate our extremely general formalism. For each example, we consider the binary case ($y \in \{-1, 1\}$, with $y = 0$ for unlabeled data), and use the negative binomial log-likelihood as the loss function (Friedman, 1999):

\[
\phi(y, F) = \log(1 + \exp(-2yF))
\]

For this case, whatever classifier we use represents $F(x) = \frac{1}{2}[\log p(y=1|x) - \log p(y=-1|x)]$, and so at round $m$ the inner product with the "gradient" becomes

\[
\langle G_V(F_m), f\rangle = -\frac{1}{l} \sum_{i} \frac{2 y_i}{1 + \exp(2 y_i F_m(x_i))}\, f(x_i) \;+\; \frac{2\gamma_M}{(l+u)K} \sum_{i,j} f(x_i)\, L^M_{ij}\, F_m(x_j) \qquad (8)
\]

The cases now differ by the procedures used to choose the optimal $f$.

Tree-ManifoldBoost: As in L2 TreeBoost (Friedman, 1999), we use regression trees as base learners. A tree has the form $f_{m+1}(x) = \sum_{s=1}^{S} \rho_{m+1,s}\, I[x \in R_s]$, where $I[\cdot] = 1$ if the expression inside is true, and $I[\cdot] = 0$ otherwise. To minimize $\|G_V - f\|^2$, we must search for the parameters $R_s$ (which determine the geometry of the tree) and $\rho_s$ (which determine the weights within each region). Once a tree has been found, we fix $R_s$ and minimize $V(F_m(x) + \sum_{s=1}^{S} \rho_{m,s} I[x \in R_s])$ with respect to $\{\rho_s\}$, using a standard continuous optimization method (BFGS; see (Bertsekas, 1996)). In each round, we use a small number of descent steps to prevent overfitting.

Algorithm 1 Tree ManifoldBoost Algorithm
1: $F_0(x) = \frac{1}{2}[\log(1 + \bar y) - \log(1 - \bar y)]$, where $\bar y$ is the mean of the labels
2: for m = 1 to M do
3:   Compute $G_V$ as in (8)
4:   Obtain regression tree $\{R_{m,s}\}$ by minimizing $\sum_i \big(G_V(x_i) - \sum_s \rho_{m,s} I[x_i \in R_{m,s}]\big)^2$
5:   Find $\{\rho_{m,s}\}$ using BFGS on $V$ and $\partial V/\partial \rho_{m,s}$, fixing $\{R_{m,s}\}$
6:   $F_m(x) = F_{m-1}(x) + \sum_s \rho_{m,s}\, I[x \in R_{m,s}]$
7: end for

The algorithm terminates when M rounds have been run, or when the relative change in the cost function in a round falls below a threshold. Probability estimates for each $x$ can then be obtained by inverting the loss function: $p(y=1|x) = 1/(1 + \exp(-2F_M(x)))$. This in turn can be used for classification:

\[
\tilde y_i = \begin{cases} 1 & \text{if } p(y=1|x)\, k_{-1,1} > p(y=-1|x)\, k_{1,-1} \\ -1 & \text{otherwise} \end{cases} \qquad (9)
\]

where the cost $k_{a,b}$ is the penalty for choosing label $a$ when $b$ is the correct label.

Figure 1 shows a toy example of semi-supervised classification taken from (Sindhwani et al., 2006) (the two moons dataset). The unlabeled datapoints are depicted in green and the diamonds represent the labeled examples (one for each class). The algorithm can also provide likelihood estimates, as seen in the right-hand figures.

Figure 1. Semi-supervised learning using Tree- and RBF-ManifoldBoost. The first figure (from left to right) shows a toy example introduced in (Sindhwani et al., 2006) (the two moons dataset): the unlabeled datapoints are depicted in green, and the diamonds represent the labeled examples (one for each class). The output classification of both algorithms is the same and is depicted in the second image (for the datapoints). The third and fourth figures depict the likelihood predicted over the whole space by the tree- and RBF-based classifiers, respectively.

RBF-ManifoldBoost: Tree functions are not the only possible approximation to the "gradient". Step 4 in Algorithm 1 can be modified so that $R$ radial basis functions of width $\sigma$, each with a weight $w_r$ and centered on a datapoint, are chosen as the approximation. Again, a BFGS step can be performed to improve the loss by fitting the weights $w_r$. Algorithm 2 describes this.

Algorithm 2 RBF ManifoldBoost Algorithm
1: $F_0(x) = \frac{1}{2}[\log(1 + \bar y) - \log(1 - \bar y)]$
2: for m = 1 to M do
3:   Compute $G_V$ as in (8)
4:   Choose $R$ RBFs greedily to minimize $\sum_i \big(G_V(x_i) - \sum_r w_r\, \mathrm{RBF}_{r,\sigma}(x_i)\big)^2$
5:   Find $\{w_r\}$ using BFGS on $V$ and $\partial V/\partial w_r$
6:   $F_m(x) = F_{m-1}(x) + \sum_r w_r\, \mathrm{RBF}_{r,\sigma}(x)$
7: end for

Complexity: The procedure itself is linear in $n = l + u$, in the Laplacian neighborhood size $K$, in the dimensionality of $x$, and in the number of rounds. The complexity of the algorithm then depends on the base regressor and on the computation of the Laplacian matrix. Influence trimming can also be used to obtain tenfold speedups (Friedman, 1999), although the algorithm is still linear in the number of datapoints.
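The sketch below shows what one round of a simplified Tree-ManifoldBoost could look like, using scikit-learn's DecisionTreeRegressor as the base regressor, fitting it to the negative of the pointwise gradient in (8) in the spirit of Friedman's pseudo-residuals, and replacing the per-region BFGS refinement of step 5 with a single scalar line search. It is a paraphrase under those assumptions, not the authors' implementation (which, per Section 6, was written in MATLAB).

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def boosting_round(F, X, y, L, l, gamma, K=8, depth=3, shrinkage=0.1):
    """One simplified round of Tree-ManifoldBoost.

    F: current scores at all n = l + u points; X: (n, d) inputs;
    y: labels in {-1, +1} for the first l points; L: (n, n) graph Laplacian.
    Returns the updated scores."""
    n = len(F)

    # Pointwise "gradient" of eq. (8): Laplacian term everywhere, loss term on labelled points.
    G = 2.0 * gamma / (n * K) * (L @ F)
    G[:l] += -(2.0 * y / (1.0 + np.exp(2.0 * y * F[:l]))) / l

    # Step 4 of Algorithm 1: squared-error fit of a regression tree to the descent direction -G.
    f = DecisionTreeRegressor(max_depth=depth).fit(X, -G).predict(X)

    # In place of the per-region BFGS step, a scalar line search on the true cost V[F + rho * f].
    def cost(FF):
        return (np.mean(np.logaddexp(0.0, -2.0 * y * FF[:l]))
                + gamma / (n * K) * (FF @ L @ FF))
    rho = minimize_scalar(lambda r: cost(F + r * f),
                          bounds=(0.0, 10.0), method="bounded").x

    # shrinkage plays the role of the learning rate (0.1 in the experiments of Section 6).
    return F + shrinkage * rho * f
```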
4. Unsupervised Boosting

The essential step in semi-supervised learning is the observation that similar data items should tend to have similar labels, which means that a semi-supervised learning method should be capable of clustering. Our framework can naturally be extended to unsupervised learning, where one wishes to cluster data and the choice of label for a cluster is arbitrary. As there are no labeled data, the first term in equation 6 becomes zero and the problem is

\[
F^* = \arg\min_{F \in \mathcal{H}} \sum_{i,j} F(x_i)\, L^M_{ij}\, F(x_j) \qquad (10)
\]

under the constraints $\sum_i F(x_i) = 0$ and $\sum_i F(x_i)^2 = N$ (this is a form of spectral clustering problem, see (Sindhwani et al., 2006); without the constraints, the problem is ill-posed). Our formalism yields a greedy method for this problem, rather than the usual generalized eigenvalue problem. To manage the constraints, we use the Augmented Lagrangian method (Bertsekas, 1996), which in each round adds a penalty for constraint violations to the unconstrained problem. We choose F to be

\[
\arg\min_{F \in \mathcal{H}} \sum_{i,j} F(x_i)\, L^M_{ij}\, F(x_j) + \lambda_1^m \sum_i F(x_i) + \lambda_2^m \Big(\sum_i F(x_i)^2 - N\Big) + \frac{c_1^m}{2} \Big(\sum_i F(x_i)\Big)^2 + \frac{c_2^m}{2} \Big(\sum_i F(x_i)^2 - N\Big)^2 \qquad (11)
\]

for non-decreasing sequences $\{c_1^m, c_2^m\}_{m=1}^{M}$. After each round, the values of the Lagrange multipliers are increased by the constraint violations (Bertsekas, 1996): $\lambda_1^{m+1} \leftarrow \lambda_1^m + c_1^m \sum_i F(x_i)$ and $\lambda_2^{m+1} \leftarrow \lambda_2^m + c_2^m (\sum_i F(x_i)^2 - N)$. As before, BFGS is applied in each round. Algorithm 3 describes the tree-based version. The algorithm converges to a local minimum of the constrained problem.

This formulation, unlike ISOMAP, naturally takes care of out-of-sample evaluation. Compared to (Sindhwani et al., 2006), the computational complexity is greatly reduced. On the other hand, the solution is greedy, and there is no straightforward term for controlling the complexity of the function in the ambient space; this is achieved through the depth of the trees used in the algorithm.

Algorithm 3 Unsupervised Tree ManifoldBoost Algorithm
1: Initialize $F_0(x)$ randomly, with zero mean and low variance.
2: for m = 1 to M do
3:   Compute $G_V$ of the penalized, unconstrained problem.
4:   Obtain regression tree $\{R_{m,s}\}$ by minimizing $\sum_i \big(G_V(x_i) - \sum_s \rho_{m,s} I[x_i \in R_{m,s}]\big)^2$
5:   Find $\{\rho_{m,s}\}$ using BFGS, fixing $\{R_{m,s}\}$
6:   $F_m(x) = F_{m-1}(x) + \sum_s \rho_{m,s}\, I[x \in R_{m,s}]$
7:   Update the Lagrange multipliers using the constraint violations.
8: end for

Figure 2. Unsupervised learning. The framework can be extended to unsupervised learning. The same problem as in figure 1 is presented without labels (left image). The result of the tree-based algorithm is shown in the center image. The plots on the right show the constraint violations going to zero as learning progresses. This figure is best viewed in color.
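A minimal sketch of the unsupervised objective follows: the penalized, unconstrained cost of equation (11) and the multiplier update performed after each round. The function and variable names are ours, and F is simply the vector of scores at the N data points.

```python
import numpy as np

def penalized_unsupervised_cost(F, L, lam1, lam2, c1, c2):
    """Augmented-Lagrangian cost of eq. (11): Laplacian smoothness plus multiplier
    and quadratic-penalty terms for the constraints sum(F) = 0 and sum(F^2) = N."""
    N = len(F)
    g1 = F.sum()                 # violation of the zero-mean constraint
    g2 = (F ** 2).sum() - N      # violation of the norm constraint
    return (F @ L @ F
            + lam1 * g1 + lam2 * g2
            + 0.5 * c1 * g1 ** 2 + 0.5 * c2 * g2 ** 2)

def update_multipliers(F, lam1, lam2, c1, c2):
    """After each round, each multiplier grows by (penalty weight) x (constraint violation)."""
    N = len(F)
    lam1 += c1 * F.sum()
    lam2 += c2 * ((F ** 2).sum() - N)
    return lam1, lam2
```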
5. Multiclass Case

Algorithm 1 can be extended to problems with $C$ classes by introducing a multinomial cost in equation 1,

\[
\phi\big(\{y^{(c)}, F^{(c)}(x)\}_{c=1}^{C}\big) = -\sum_{c=1}^{C} y^{(c)} \log p^{(c)}(x) \qquad (12)
\]

where $p^{(c)}(x)$ represents the belief that example $x$ belongs to class $c$, and $y^{(c)}$ is a binary variable which is one if example $x$ belongs to class $c$. As in (Friedman, 1999), we use the symmetric multiple logistic transform

\[
p^{(c)}(x) = \exp\big(F^{(c)}(x)\big) \cdot \Big[\sum_{c'=1}^{C} \exp\big(F^{(c')}(x)\big)\Big]^{-1} \qquad (13)
\]

Smoothness of $F^{(c)}$ is enforced by defining the cost $V(\{F^{(c)}\})$ to be

\[
\frac{1}{l} \sum_i \phi\big(\{y_i^{(c)}, F^{(c)}(x_i)\}_{c=1}^{C}\big) \;+\; \sum_c \frac{\gamma_M}{C\,(l+u)\,K} \sum_{i,j} F^{(c)}(x_i)\, L^M_{ij}\, F^{(c)}(x_j)
\]

The inner product of $f^{(c)}$ with the gradient of $V$ becomes, for class $c$,

\[
\langle G_V(F_m^{(c)}), f^{(c)}\rangle = \frac{1}{l} \sum_i \big(-y_i^{(c)} + p^{(c)}(x_i)\big)\, f^{(c)}(x_i) \;+\; \frac{2\gamma_M}{C\,(l+u)\,K} \sum_{i,j} f^{(c)}(x_i)\, L^M_{ij}\, F_m^{(c)}(x_j) \qquad (14)
\]

Now one regression tree is fitted per class at each round to approximate each descent direction. As in the two-class problem, the $S$ regions $\{R^{(c)}_{m,s}\}_{s=1}^{S}$ defined by the terminal nodes are fixed, and the parameters $\rho^{(c)}_{m,s}$ for the regions in each tree are learned in order to minimize the total cost $V$. We use a couple of BFGS iterations per round to find these parameters; in order to do this, the derivatives of the cost with respect to $\rho^{(c)}_{m,s}$ have to be computed. Once the final $\{F^{(c)}_M(x)\}$ are computed, the probability of each label can be estimated for a given example, and the label can thus be classified as $\hat c(x_i) = \arg\min_c \sum_{c'=1}^{C} k_{c,c'}\, p^{(c')}_M(x_i)$, for costs $k_{c,c'}$ incurred when label $c$ is assigned and label $c'$ is correct. The complexity of this algorithm is also linear in the number of classes, but it scales highly sub-linearly with the number of rounds $M$ when influence trimming is used (Friedman, 1999).

Algorithm 4 K-Tree ManifoldBoost Algorithm
1: Let $p_0^{(c)}$ be the frequency of each class $c$.
2: $F_0^{(c)}(x) = \log p_0^{(c)} - \frac{1}{C} \sum_{c'=1}^{C} \log p_0^{(c')}$
3: for m = 1 to M do
4:   Compute $p_m^{(c)}(x)$ as in eq. 13 for all $c$.
5:   for c = 1 to C do
6:     Compute $G_V^{(c)}$ as in (14)
7:     Obtain regression tree $\{R^{(c)}_{m,s}\}$ by minimizing $\sum_i \big(G_V^{(c)}(x_i) - \sum_s \rho^{(c)}_{m,s} I[x_i \in R^{(c)}_{m,s}]\big)^2$
8:   end for
9:   Find $\{\rho^{(c)}_{m,s}\}$ using BFGS on $V$ and $\partial V/\partial \rho^{(c)}_{m,s}$, fixing $\{R^{(c)}_{m,s}\}$ for all $c$.
10:  $F_m^{(c)}(x) = F_{m-1}^{(c)}(x) + \sum_s \rho^{(c)}_{m,s}\, I[x \in R^{(c)}_{m,s}]$ for all $c$.
11: end for

Figure 3. The framework also handles multiclass learning. The left figure shows another toy example (three rings): the unlabeled datapoints are depicted in green, and the diamonds represent the labeled examples (one for each class). The output of the algorithm is depicted in the center figure. The blue classification function $F^{(2)}(x)$ is shown on the right. This figure is best viewed in color.
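These multiclass quantities vectorize naturally; the sketch below computes the symmetric logistic transform of (13) and the pointwise gradient of (14) for all classes at once, using the normalization read off the multiclass cost above. This is our illustration, not the paper's code.

```python
import numpy as np

def class_probabilities(F):
    """Eq. (13): softmax over the C per-class scores; F has shape (n, C)."""
    E = np.exp(F - F.max(axis=1, keepdims=True))   # subtract the row max for numerical stability
    return E / E.sum(axis=1, keepdims=True)

def multiclass_gradient(F, Y, L, l, gamma, K=8):
    """Pointwise gradient of eq. (14) for every class.

    F: (n, C) scores; Y: (l, C) one-hot labels for the labelled points;
    L: (n, n) graph Laplacian. Returns an (n, C) array whose column c is G_V for class c."""
    n, C = F.shape
    P = class_probabilities(F)
    G = np.zeros((n, C))
    G[:l] = (P[:l] - Y) / l                          # (-y^(c) + p^(c)(x_i)) on labelled points
    G += 2.0 * gamma / (C * n * K) * (L @ F)         # Laplacian smoothness term, per class
    return G
```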
6. Experiments and Discussion

6.1. Comparison to Other Regularized Boosting Algorithms

Kégl and Wang (Kégl & Wang, 2005) introduce RegBoost, an extension of AdaBoost that incorporates a weight decay depending on a Laplacian regularizer. Our approach differs in several senses: first, ours is based on the GradientBoost framework while theirs is based on AdaBoost; second, ManifoldBoost does not require manifold-regularized base learners. This makes their approach limited in the types of learners that can be used (they use stumps only). Also, the ensemble classifier should be smooth on the manifold, but regularizing each of the base learners may result in over-smoothing of the overall solution.

We compare our results with (Kégl & Wang, 2005) on standard UCI benchmark datasets. Whenever possible we tried to use the same configuration as (Kégl & Wang, 2005). (The breast cancer dataset was not used because (Kégl & Wang, 2005) does not explain what metric they use for categorical data in the Laplacian, making any comparison meaningless.) We set the number of nearest neighbors to $K = 8$ and used binary weights to compute the graph Laplacian. We used regression trees of fixed depth 3 as learners. The datasets were normalized to zero mean and unit variance. The learning rate was set to $\nu = 0.1$ after (Friedman, 1999). Only $\gamma_M$ was explored over different values. We used 5-fold cross-validation for determining parameters and 10-fold cross-validation for error estimation. Table 1 compares our performance with that of AdaBoost, RegBoost, and (M. Belkin & Niyogi, 2004), as reported in (Kégl & Wang, 2005).

In the fully supervised problems, there is a difference in performance on the Sonar dataset, an improvement on the Ionosphere dataset, and a very slight decrease in performance on the Pima Indians dataset with respect to RegBoost, well within a standard deviation. It should be noted that the variance in performance is consistently smaller for our algorithm. (Kégl & Wang, 2005) also test their algorithm under semi-supervision, using 100 labeled and 251 unlabeled examples. We ran our algorithm under the same conditions, using stumps to prevent overfitting. In this case our algorithm outperforms (Kégl & Wang, 2005) and (M. Belkin & Niyogi, 2004), as our mean performance over 10 runs is more than a standard deviation above theirs. No variance of results is reported in (Kégl & Wang, 2005).

Table 1. Performance results of Tree-ManifoldBoost and different boosting approaches on standard UCI datasets, as reported in (Kégl & Wang, 2005) (variance in parentheses).

Algorithm            | Ionosphere Train / Test | Pima Indians Train / Test  | Sonar Train / Test    | Ionosphere (semisup.)
AdaBoost             | 0% / 9.2% (7.1)         | 10.9% / 25.3% (5.3)        | 0% / 32.5% (19.8)     | -
RegBoost             | 0% / 7.7% (6.0)         | 16.0% / 23.3% (6.8)        | 0% / 29.8% (18.8)     | 12%
Tree-ManifoldBoost   | 0% (0) / 6.5% (4.8)     | 6.5% (0.5) / 24.0% (5.2)   | 0% (0) / 18.7% (6.1)  | 10.4% (0.8)
Belkin et al., 04    | - / -                   | - / -                      | - / -                 | 18%

(Chen & Wang, 2008) propose an interesting alternative approach to regularized boosting, based on the more traditional framework of boosting "weak" learners outlined in (Mason et al., 2000). As a consequence, they need to assign pseudo-class labels to unlabeled data (labels assigned with the current $F_t(x)$) while learning the ensemble. In contrast, ManifoldBoost uses base regressors to measure the confidence of the prediction and does not commit to a $\{-1, +1\}$ classification at each step. Smoothing this seems more natural in a formulation that penalizes second derivatives (the Laplacian cost).

6.2. Comparison to Other Semi-Supervised Learning Algorithms

We measured the performance of our two-class RBF-ManifoldBoost algorithm on the SSL data sets, a standard benchmark for semi-supervised learning problems introduced in (Chapelle et al., 2006), and compared with the 14 other state-of-the-art semi-supervised learning algorithms discussed there. In Table 2 we present results for five data sets, two of which are cluster-like and three manifold-like. On the manifold-like data sets, we are at the state of the art and no single algorithm does uniformly better than us. On the cluster-like data sets, our performance is good compared to most other regularization-based and manifold learners, but is not as good as the specialized clustering algorithms Cluster-Kernel and SGT (Spectral Graph Transducer).

Parameter search was performed following section 21.2.5 of (Chapelle et al., 2006) when possible, using the same ranges for $\gamma_M$, the RBF width $\sigma$, the distance metric, $K$, etc. For the base regressors, we used $R \in \{15, 30\}$ as the number of RBFs, and $M = 500$ rounds. The learning rate was again chosen as $\nu = 0.1$. These parameters were obtained in small-scale experiments and then fixed. The results reported are the means over the different splits. The running times for a MATLAB implementation on a 2 GHz machine were on the order of minutes.
Unfortunately, running times for the other algorithms were not reported in (Chapelle et al., 2006).

Table 2. Error rates (%) for data sets from (Chapelle et al., 2006). l is the number of labeled examples. (The 21.2.x labels refer to sections of (Chapelle et al., 2006).)

                                     Manifold-like                        Cluster-like
                              l = 10              l = 100            l = 10         l = 100
Algorithm                  BCI   Digit1  USPS    BCI   Digit1  USPS   g241c  g241d   g241c  g241d
1-NN                      49.00  13.65  16.66   48.67   3.89   5.81   47.88  46.72   43.93  42.45
SVM                       49.85  30.60  20.03   34.31   5.53   9.75   47.32  46.66   23.11  24.64
21.2.8 MVU + 1-NN         47.95  14.42  23.34   47.89   2.83   6.50   47.15  45.56   43.01  38.20
21.2.8 LEM + 1-NN         48.74  23.47  19.82   44.83   6.12   7.64   44.05  43.22   40.28  37.49
21.2.4 QC + CMN           50.36   9.80  13.61   46.22   3.15   6.36   39.96  46.55   22.05  28.20
21.2.6 Discrete Reg.      49.51  12.64  16.07   47.67   2.77   4.68   49.59  49.05   43.65  41.65
21.2.1 TSVM               49.15  17.77  25.20   33.25   6.15   9.77   24.71  50.08   18.46  22.42
21.2.1 SGT                49.59   8.92  25.36   45.03   2.61   6.80   22.76  18.64   17.41   9.11
21.2.10 Cluster-Kernel    48.31  18.73  19.41   35.17   3.79   9.68   48.28  42.05   13.49   4.95
21.2.3 Data-Dep. Reg.     50.21  12.49  17.96   47.47   2.44   5.10   41.25  45.89   20.31  32.82
21.2.11 LDS               49.27  15.63  17.57   43.97   3.46   4.96   28.85  50.63   18.04  23.74
21.2.5 Laplacian RLS      48.97   5.44  18.99   31.36   2.92   4.68   43.95  45.68   24.36  26.46
21.2.7 CHM (normed)       46.90  14.86  20.53   36.03   3.79   7.65   39.03  43.01   24.82  25.67
RBF-ManifoldBoost         47.12  19.42  19.97   32.17   4.29   6.65   42.17  42.80   22.87  25.00

6.3. SecStr Data Set

We also ran experiments on the SecStr data set (Chapelle et al., 2006), which poses the problem of predicting the secondary structure of protein sequences from their amino acid chains. This is a large-scale and challenging data set, with 83,000 labeled and 1.2 million unlabeled examples. Semi-supervised algorithms have made little improvement on this benchmark so far (Table 3); the best result is obtained by the manifold-regularized learning algorithm of (Sindhwani et al., 2006), which yields a 29% error rate on a subset of the data with 10,000 labeled and 73,000 unlabeled examples. Tree-ManifoldBoost with $\gamma_M \in \{0, 10^{-5}, 10^{-3}, 0.1, 1\}$ achieved similar performance on the same subset in approximately 45 minutes of training time (after computing the Laplacian matrix). We used stumps, $K = 6$ and $\nu = 0.05$. No model selection was performed. As the similarity measure we used the Hamming distance between the best alignment of the sequences. The results reported are the means over the 10 splits.

When we used the whole dataset (1.3 million sequences) with $\gamma_M \in \{0, 10^{-3}, 1\}$, there was virtually no performance improvement. This may be due to the smaller parameter search space, or to peculiarities of the dataset. When we analysed the structure of the manifold on the labeled subset, we observed that almost 20% of sequences at distance 1 (that is, shifted by one position to the left or right) had a different label. Therefore the manifold assumption is not strong on this set. As far as we know, this is the first time results are reported on the complete SecStr dataset. Our algorithm is efficient and can therefore handle datasets of this size. Learning time is on the order of three hours for the 1.3 million samples (leaving aside the computation of the graph Laplacian, which took significantly longer).

Table 3. Error rates (%) on the SecStr dataset. l is the number of labeled examples.

Algorithm                      l = 100   l = 1000   l = 10000
SVM                             44.59     33.71      32.21
Cluster Kernel                  42.95     34.03        -
QC randsub (CMN)                42.32     40.84        -
QC smartonly (CMN)              42.14     40.71        -
QC smartsub (CMN)               42.26     40.84        -
Boosting (assemble)             42.59     34.17        -
LapRLS                          43.42     33.96      28.55
LapSVM                          42.70     33.43      28.53
Tree-ManifoldBoost (83K)        43.28     33.42      28.96
Tree-ManifoldBoost (1.3M)         -         -        29.07
7. Conclusion

We have presented a new boosting framework for regularized learning in a greedy, stage-wise procedure. It is flexible enough to handle the whole range of supervision, from fully supervised classification to unsupervised clustering. The framework is general, accepts many different function approximation techniques, is efficient and fast at each round of boosting, handles multi-class and wholly unsupervised problems, and produces results at the state of the art. We are working on understanding important aspects of the algorithm, in particular generalization, error bounds, convergence and local minima.
References

Bertsekas, D. P. (1996). Constrained optimization and Lagrange multiplier methods. Academic Press.

Bickel, P., & Li, B. (2007). Local polynomial regression on unknown manifolds. IMS Lecture Notes Monograph Series: Complex Datasets and Inverse Problems: Tomography, Networks and Beyond (pp. 177-186).

Chapelle, O., Schölkopf, B., & Zien, A. (2006). Semi-supervised learning. MIT Press.

Chen, K., & Wang, S. (2008). Regularized boost for semi-supervised learning. NIPS 20 (pp. 281-288).

Donoho, D. L., & Grimes, C. (2003). Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. USA, 100, 5591-5596.

Dyn, N., Levin, D., & Rippa, S. (1986). Numerical procedures for surface fitting of scattered data by radial functions. SIAM Journal on Scientific and Statistical Computing, 7, 639-659.

Friedman, J. (1999). Greedy function approximation: a gradient boosting machine (Technical Report). Dept. of Statistics, Stanford University.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. ICML (pp. 200-209).

Kégl, B., & Wang, L. (2005). Boosting on manifolds: Adaptive regularization of base classifiers. NIPS 17 (pp. 665-672).

Lafferty, J., & Wasserman, L. (2007). Statistical analysis of semi-supervised regression. NIPS 20 (pp. 801-808).

M. Belkin, I. M., & Niyogi, P. (2004). Regression and regularization on large graphs. COLT (pp. 824-831).

Mason, L., Baxter, J., Bartlett, P., & Frean, M. (2000). Boosting algorithms as gradient descent. NIPS 12 (pp. 512-518).

Niyogi, P. (2008). Manifold regularization and semi-supervised learning: Some theoretical analyses (Technical Report TR-2008-01). Computer Science Dept., University of Chicago.

Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323-2326.

Sindhwani, V., Belkin, M., & Niyogi, P. (2006). The geometric basis of semi-supervised learning. In Chapelle, Schölkopf and Zien (Eds.), Semi-supervised learning. MIT Press.

Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319-2323.

Zhu, X. (2006). Semi-supervised learning literature review (Technical Report). University of Wisconsin.

Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. ICML (pp. 912-919).