A Decoupled Approach to Exemplar-based Unsupervised Learning

Sebastian Nowozin  (sebastian.nowozin@tuebingen.mpg.de)
Max Planck Institute for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany

Gökhan Bakır  (ghb@google.com)
Google GmbH, Brandschenkestrasse 110, 8002 Zurich, Switzerland

Abstract

A recent trend in exemplar-based unsupervised learning is to formulate the learning problem as a convex optimization problem. Convexity is achieved by restricting the set of possible prototypes to training exemplars. In particular, this has been done for clustering, vector quantization and mixture model density estimation. In this paper we propose a novel algorithm that is theoretically and practically superior to these convex formulations. This is possible by posing the unsupervised learning problem as a single convex "master problem" with non-convex subproblems. We show that for the above learning tasks the subproblems are extremely well-behaved and can be solved efficiently.

1. Introduction

Methods for unsupervised learning aim at recovering underlying structure from data. In this paper, we are concerned with exemplar-based models in which this structure is represented by a weighted set of points in input space. Depending on the model used, these points can be interpreted as clusters, codebook vectors or mixture components. Although the representation is a finite point set, the structure being represented, such as a density, is defined on the entire input space by expanding a smoothing kernel function around each representing point. In this setting, learning simply becomes deciding on the number of points and their weights, as well as their location in input space, by means of a suitable objective.

In EM-learning of mixture models and in k-means clustering one fixes the number of points and adjusts their position by performing descent steps on the objective function starting from a random initialization. This leads to well-behaved but usually non-convex learning problems. Recently, a number of convex approaches have been proposed. Our goal in this paper is to improve on these approaches.

In section 2 we review convex formulations for unsupervised learning tasks and discuss two recent methods. We show how convexity is achieved and describe a small experiment whose result suggests a way to improve on the established models. We describe our model in section 3, together with an algorithm and a theoretical justification. The model is validated experimentally in section 4 and we conclude in section 5.

2. Review of Convex Approaches

We now discuss two convex approaches to unsupervised learning from the literature. We denote the training set as X = {x_i}_{i=1,...,N}, with x_i ∈ 𝒳 and usually 𝒳 = R^d.

Kernel Vector Quantization (Tipping & Schölkopf, 2001) learns a small set of codebook vectors such that the distance from any training sample to its nearest codebook vector is bounded above by a given maximum distortion h. In (Tipping & Schölkopf, 2001), this is done by formulating a linear programming problem, of which the following problem is an equivalent reformulation.^1

\[
\max_{q,\rho}\;\rho \qquad \text{s.t.}\quad Kq \ge \rho\mathbf{1},\quad \mathbf{1}^\top q = 1,\quad q \ge 0. \qquad (1)
\]

Here K is an (N, N) matrix with K_{i,j} = I(||x_i - x_j|| ≤ h), where I(·) evaluates to one if the predicate is true and to zero otherwise; therefore, K_{i,j} is one if a ball of radius h centered on x_j contains x_i.

^1 Subject to rescaling of q.
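To make reformulation (1) concrete, the following is a minimal sketch that solves it as a linear program on synthetic data with scipy; the toy data, the bandwidth h and all variable names are our own illustrative assumptions, not the authors' setup.

```python
# Minimal sketch: solve the KVQ reformulation (1) as a linear program.
# Variables are x = [q_1, ..., q_N, rho]; we minimize -rho subject to
# K q >= rho * 1, sum(q) = 1, q >= 0. Data and bandwidth are illustrative.
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                      # toy training set
h = 1.0                                           # maximum allowed distortion
K = (cdist(X, X) <= h).astype(float)              # K_ij = I(||x_i - x_j|| <= h)
N = len(X)

c = np.zeros(N + 1); c[-1] = -1.0                 # minimize -rho
A_ub = np.hstack([-K, np.ones((N, 1))])           # rho - (K q)_i <= 0
b_ub = np.zeros(N)
A_eq = np.zeros((1, N + 1)); A_eq[0, :N] = 1.0    # sum(q) = 1
b_eq = np.array([1.0])
bounds = [(0, None)] * N + [(None, None)]         # q >= 0, rho free

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
q, rho = res.x[:N], res.x[-1]
print("margin rho:", rho, "selected codebook vectors:", int(np.sum(q > 1e-8)))
```

The balls with q_j above a small threshold form the sparse covering discussed next.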
In the solution of (1) the balls selected by q_j > 0 form a sparse covering of the training set, and the distance of each sample to its closest covering ball is bounded by h.

Convex Clustering (Lashkari & Golland, 2007) was recently proposed for clustering. In Lashkari and Golland's model, a mixture model is fit to an observed training set such that a candidate mixture component is centered on each training set exemplar. Using the framework of Bregman clustering (Banerjee et al., 2005), their objective maximizes the log-likelihood subject to the constraint that the resulting model is a proper mixture model. In the optimal solution of the model, a sparse set of exemplars is selected, allowing the interpretation as clusters. Formally, Lashkari and Golland maximize

\[
\frac{1}{N}\sum_{i=1}^{N}\log\sum_{j=1}^{N} q_j\, e^{-\beta\, d(x_i, x_j)}
\]

over the mixture parameters q_j ≥ 0, j = 1,...,N, with Σ_{j=1}^N q_j = 1. The model allows all exponential family distributions with a corresponding Bregman divergence d (Banerjee et al., 2005). For the maximization, a multiplicative update is used, which leads to slow convergence once elements of q approach zero.

We reformulate the above objective function by introducing a new set of variables ν_i, i = 1,...,N, as follows.

\[
\max_{q,\nu}\;\;\frac{1}{N}\sum_{i=1}^{N}\log \nu_i \qquad (2)
\]
\[
\text{s.t.}\quad Kq = \nu,\quad \mathbf{1}^\top q = 1,\quad q_j \ge 0,\;\; j = 1,\dots,N, \qquad (3)
\]

where K is an (N, N) matrix with K_{i,j} = e^{-β d(x_i, x_j)}. Clearly, problem (2) is equivalent to the previous one, because the constraints (3) only serve to evaluate the likelihood ν_i for each sample x_i.
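The multiplicative update mentioned above can be sketched as follows. This is a generic EM-style update for the mixture weights that is consistent with the description in the text, with the squared Euclidean distance standing in for the Bregman divergence; it is our illustration, not code from Lashkari and Golland.

```python
# Sketch of a multiplicative (EM-style) update for the convex clustering weights
# q_j, using the squared Euclidean distance as the Bregman divergence. This is
# the standard mixture-weight update consistent with the description in the text.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))                          # toy training set
beta = 1.0
d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # d(x_i, x_j), squared Euclidean
S = np.exp(-beta * d)                                 # S_ij = exp(-beta d(x_i, x_j))
N = len(X)

q = np.full(N, 1.0 / N)                               # start from uniform weights
for _ in range(200):
    resp = S * q / (S @ q)[:, None]                   # responsibility of exemplar j for x_i
    q = resp.mean(axis=0)                             # multiplicative: q_j stays zero once zero

print("objective:", np.mean(np.log(S @ q)), "active exemplars:", int(np.sum(q > 1e-4)))
```

Because the update is multiplicative, components whose weight approaches zero are slow to move, which is the convergence issue noted above.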
2.1. Where does Convexity come from?

Models as proposed in (Tipping & Schölkopf, 2001) and (Lashkari & Golland, 2007) achieve convexity by changing the problem parametrization. Instead of learning the coordinates of a fixed number of exemplars z_j, j = 1,...,M, there is now a larger set of possible candidate exemplars with fixed coordinates. Learning is performed by optimizing over indicator variables, selecting a sparse subset of the candidates. This reparametrization makes the problem convex but also changes the regularization: whereas usually the number of exemplars M is the main regularization parameter, it is now an implicit guarantee on the quality of the solution. In (Tipping & Schölkopf, 2001) this is the maximum distortion h, whereas in (Lashkari & Golland, 2007) the regularization parameter β controls the smoothness of the density.

Figure 1. Exemplar selection within the training set versus the finest dense set of 900 exemplars on a regular grid. In this toy example, there are 66 data points.

2.2. Motivating Experiment: More Exemplars

Restricting the set of possible prototype candidates to the training set might result in a suboptimal solution if there is no exemplar close to the true mean of a cluster. If the data is low-dimensional, normally distributed within each cluster, has low noise, and there are enough training examples, this effect is small and can be ignored. But in high dimensions the true mean might be far away from any exemplar.

To demonstrate the effect of restricting the prototype candidate set we perform an experiment. A simple two-dimensional data set is created by sampling from an isotropic Gaussian and a ring of uniform density, forming two well-separated clusters, see Figure 1. We compare convex clustering (Lashkari & Golland, 2007) with a modified model where the objective is changed to

\[
\frac{1}{N}\sum_{i=1}^{N}\log\sum_{j=1}^{M} q_j\, e^{-\beta\, d(x_i, z_j)},
\]

with N training samples x_i and M cluster center candidates z_j. This convex objective still represents the log-likelihood of the training samples under a mixture model. We generate the z_j by densely discretizing the [-2, 7]^2 box on a regular grid. Our hope is that a fine discretization increases the chance that {z_j}_{j=1,...,M} contains exemplars close to the true center of each cluster. For both models we use an isotropic multivariate normal distribution with covariance matrix Σ = σ^2 I, σ = 2.5.

The clustering result is shown in Figure 1. For the cluster around the origin there is indeed a training set exemplar close to the mean of the generating Gaussian, and the difference between the convex clustering and dense selection is small. However, for the ring-like structure, the training set exemplars cannot represent the cluster center adequately. This causes convex clustering to select two exemplars, while in the dense set a single good candidate is selected. A slight perturbation in the training data would lead to a different selection by the convex clustering method, as all samples bordering the interior of the ring are roughly equally bad. For this data set, the solution produced by convex clustering is not only qualitatively disappointing but also unstable.

Figure 2. Training set vs. dense set log-likelihood.

The achieved objectives are shown in Figure 2, where the convex clustering objective is drawn as a horizontal line and the dense exemplar model forms a curve as the discretization becomes finer and finer. At around eight discretizations per dimension our modified model surpasses the log-likelihood of the convex clustering model. At around 30 discretizations per dimension the log-likelihood levels out and adding more cluster candidates does not improve the solution. This experiment suggests that a larger set of candidate clusters can lead to higher quality results which are also more robust.
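A rough sketch of the dense-candidate variant used in this experiment follows: the N x M response matrix is built from grid candidates z_j instead of training exemplars, and the same EM-style weight update as in the previous sketch is reused. The grid resolution, the box, the data and the value of β below are our own illustrative choices and only mirror the setup roughly.

```python
# Sketch of the dense-candidate variant: responses are computed against a grid of
# candidate centers z_j rather than against the training exemplars themselves.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)),
               3 + rng.normal(0, 1, (36, 2))])             # toy stand-in for the 66 points
gx, gy = np.meshgrid(np.linspace(-2, 7, 30), np.linspace(-2, 7, 30))
Z = np.column_stack([gx.ravel(), gy.ravel()])               # M dense candidates z_j
beta = 1.0 / (2 * 2.5 ** 2)                                 # roughly matches sigma = 2.5

S = np.exp(-beta * ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1))   # N x M responses
q = np.full(S.shape[1], 1.0 / S.shape[1])
for _ in range(300):
    q = (S * q / (S @ q)[:, None]).mean(axis=0)             # same multiplicative update

print("dense-candidate objective:", np.mean(np.log(S @ q)))
print("selected candidates:", Z[q > 1e-3])
```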
While dense discretization is only feasible when the input space is low-dimensional, ideally we would like to use an infinitely fine discretization and thus use the set of all possible input points as candidates. This idea is the basis for our method.

3. A Decoupled Model

We now introduce our model for unsupervised learning together with an efficient solution algorithm. Essential to the solution is the ability to solve a certain subproblem, which we analyze in detail.

3.1. Model

Our model for unsupervised learning generalizes convex clustering (Lashkari & Golland, 2007) and kernel vector quantization (Tipping & Schölkopf, 2001). Let k_z(·) be a non-negative smoothing kernel centered at z ∈ Z, with Z ⊆ 𝒳. Let {x_i}_{i=1,...,N}, x_i ∈ 𝒳, denote the training set. The following semi-infinite convex programming problem learns a convex combination of response functions such that an objective φ is minimized.

\[
\min_{q,\nu,\rho}\;\;\varphi(\nu,\rho) \qquad (4)
\]
\[
\text{s.t.}\quad \int_Z q_z\, k_z(x_i)\, dz = \nu_i \;:\; \lambda_i, \qquad i = 1,\dots,N, \qquad (5)
\]
\[
\phantom{\text{s.t.}}\quad \nu_i \ge \rho \;:\; \omega_i, \qquad i = 1,\dots,N, \qquad (6)
\]
\[
\phantom{\text{s.t.}}\quad q_z \ge 0 \;:\; \mu_z, \qquad z \in Z, \qquad (7)
\]
\[
\phantom{\text{s.t.}}\quad \int_Z q_z\, dz = 1 \;:\; \delta, \qquad (8)
\]

where λ, ω, µ and δ are the Lagrange multipliers for the respective constraints. Before discussing the choice of the objective function φ, let us discuss the purpose of the constraints.

- Constraint (5) evaluates a convex combination of responses for each sample; ν_i contains the combined response for sample x_i.
- Constraint (6) identifies, if ∂φ(ν, ρ)/∂ρ < 0, the lowest response among all samples. The value of the lowest combined response is ρ.
- Constraints (7) and (8) define the combination simplex of the response functions.

For the special case where Z is a finite set of points in 𝒳, we can replace the integrals and the infinite constraint sets with a finite sum and a finite set of constraints, respectively. Constraints (5) can then be compactly written as Kq = ν, where K is an (N, |Z|) matrix storing the kernel responses.

The dual problem of (4) can be derived from the conjugate function φ*(λ, ω, δ, µ) and its respective domain (Boyd & Vandenberghe, 2004, result (5.11)). The dual problem is

\[
\max_{\lambda,\omega,\delta,\mu}\;\; -\varphi^*(\lambda,\omega,\delta,\mu) - \delta \qquad (9)
\]
\[
\text{s.t.}\quad (\lambda,\omega,\delta,\mu) \in \operatorname{dom}(\varphi^*),
\]
\[
\phantom{\text{s.t.}}\quad \sum_{i=1}^{N}\lambda_i\, k_z(x_i) = \mu_z - \delta, \qquad z \in Z, \qquad (10)
\]
\[
\phantom{\text{s.t.}}\quad \omega \ge 0, \quad \mu_z \ge 0, \quad z \in Z.
\]

We propose the following choices of convex objective functions φ(ν, ρ).

1. φ(ν, ρ) = -ρ. This objective states that the lowest response among all samples is to be maximized. All other samples have equal or higher responses but are ignored by this objective, hence a single exemplar can have a large influence on the overall objective. The KVQ problem (1) corresponds to this objective with K chosen as discussed in section 2. The conjugate is φ*(λ, ω, δ, µ) = 0 with domain dom(φ*) = {(λ, ω, δ, µ) : λ + ω = 0, 1ᵀω = 1}. With (10) we have δ ≥ 0.

2. φ(ν, ρ) = -(1/N) Σ_{i=1}^N log(ν_i). This objective maximizes the product Π_{i=1}^N ν_i. For the special case where the columns of K correspond to evaluations of probability density functions at the training samples, this objective maximizes the log-likelihood of the samples under a mixture model, resulting in convex clustering (2). A single exemplar can have a significant effect on the overall objective, but all sample responses are considered, in contrast to the previous objective function. The conjugate is φ*(λ, ω, δ, µ) = -(1/N) Σ_{i=1}^N log(-λ_i) + log(N) with domain dom(φ*) = {(λ, ω, δ, µ) : λ < 0, ω = 0}.

3. φ(ν, ρ) = -ρ + (C/N) Σ_{i=1}^N (ν_i - ρ)^2. This objective maximizes the margin while penalizing large deviations from the margin, where the penalty strength is determined by C ≥ 0. The objective may prefer a smaller margin if the corresponding choice of q leads to more uniform ν_i. This margin-minus-variance (MMV) objective was first proposed in (Rückert & Kramer, 2006) for supervised learning.

4. φ(ν, ρ) = -(1/N) Σ_{i=1}^N ν_i + (C/N) Σ_{i=1}^N (ν_i - (1/N) Σ_{j=1}^N ν_j)^2. This objective maximizes the mean response while penalizing large deviations from it, where the penalty strength is determined by C ≥ 0. This maximizes the mean-minus-variance criterion popular in applications such as portfolio optimization, see for example (Cornuejols & Tütüncü, 2007).

In order to be able to compare our method with established methods from the literature, we only use the first two objectives in the experiments.

3.1.1. Relation to existing methods

Most relevant for our approach is Boosting Density Estimation (Rosset & Segal, 2002). We note the following differences: i) our model includes different objectives, ii) in our solution algorithm, we use totally-corrective weight updates^2 instead of a simple line-search procedure, and iii) we identify each weak learner uniquely with a point in input space. Also related is the hard-margin case of 1-class Boosting (Rätsch et al., 2001). With exemplar-based weak learners it is a special case of our model with the first objective.

^2 Totally-corrective steps update all weights individually in each iteration, leading to faster convergence.
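To make the objective choices of Section 3.1 concrete, here is a small NumPy sketch that evaluates each φ(ν, ρ) for a given response vector ν and margin ρ; the function names and the constant C are ours, and the snippet is purely illustrative.

```python
# Illustrative evaluation of the four objective choices phi(nu, rho).
import numpy as np

def phi_margin(nu, rho):
    """Objective 1: maximize the lowest combined response (KVQ-style)."""
    return -rho

def phi_loglik(nu, rho):
    """Objective 2: negative mean log-response (mixture log-likelihood)."""
    return -np.mean(np.log(nu))

def phi_mmv(nu, rho, C=1.0):
    """Objective 3: margin-minus-variance, penalizing deviation from the margin."""
    return -rho + C / len(nu) * np.sum((nu - rho) ** 2)

def phi_meanvar(nu, rho, C=1.0):
    """Objective 4: mean-minus-variance of the responses (rho is unused)."""
    return -np.mean(nu) + C / len(nu) * np.sum((nu - np.mean(nu)) ** 2)

# Example: combined responses of five samples under some weighting q, rho = min(nu).
nu = np.array([0.8, 0.6, 0.9, 0.5, 0.7])
rho = nu.min()
for phi in (phi_margin, phi_loglik, phi_mmv, phi_meanvar):
    print(phi.__name__, phi(nu, rho))
```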
Algorithm 1  Infinite Exemplar Column Generation: (Z, q) = Infex(X, ε, k, Z_0)

Input:
  Sample set X = {x_i}_{i=1,...,N}, x_i ∈ 𝒳
  Convergence tolerance ε > 0
  Non-negative smoothing kernel k_z : Z × 𝒳 → R_+
  Initial exemplar set Z_0 = {z_j}_{j=1,...,|Z_0|}, z_j ∈ Z
Output:
  Column exemplar set Z = {z_j}_{j=1,...,R}, z_j ∈ Z
  Weightings q_{z_j} ∈ R_+, j = 1,...,R
Algorithm:
  λ ← -(1/N)·1,  Z ← Z_0,  R ← |Z_0| + 1,  δ ← 0
  loop
    z_R ← argmax_{z ∈ Z} -Σ_{i=1}^N λ_i k_z(x_i)      {solve (SP)}
    θ ← -Σ_{i=1}^N λ_i k_{z_R}(x_i)                    {score of z_R}
    if θ ≤ δ + ε then
      break                                            {convergence to tolerance ε}
    end if
    Z ← Z ∪ {z_R}
    K ← [k_{z_j}(x_i)]_{i=1,...,N; j=1,...,R}          {response matrix}
    p, (q, ν, ρ), (λ, ω, µ, δ) ← objective value, primal and dual solution to problem (4) with the finite (N, R) matrix K
    R ← R + 1
  end loop

3.2. Algorithm

To solve problem (4), we propose Algorithm 1 (INFEX), a delayed column generation algorithm. The algorithm works with a finite and usually small set of candidate prototypes z_j. This set is iteratively enlarged by adding good candidates. Selecting the candidates to add in each iteration becomes a subproblem, which we define now.

Problem 1 (Subproblem (SP))  Given a set of samples x_i ∈ 𝒳, i = 1,...,N, a corresponding non-positive sample weighting λ_i ≤ 0, i = 1,...,N, and a non-negative smoothing kernel k_z(x) : Z × 𝒳 → R_+, obtain z* as the solution of

\[
z^* = \operatorname{argmax}_{z \in Z}\; -\sum_{i=1}^{N}\lambda_i\, k_z(x_i).
\]

The solution to this subproblem provides a candidate z* that, when added to the set of considered candidates, will reduce the global objective.^3 We will now rigorously derive the subproblem from the global optimality conditions of problem (4).

^3 In the optimization literature such columns are referred to as having negative reduced cost. The overall decoupled solution approach is closely related to the generalized Benders decomposition (Geoffrion, 1972).

Theorem 1  Assume that the subproblem (SP) can be solved exactly in each iteration. Then Algorithm 1 solves problem (4) globally to the desired accuracy ε.

Proof. Consider a slightly modified version of problem (4), where part of the constraints (7) is replaced by equality constraints. We replace (7) by the following constraint set, parametrized by a finite set of points Z_R = {z_1,...,z_{|Z_R|}}:

\[
q_z \ge 0 \;:\; \mu_z, \qquad z \in Z_R, \qquad (11)
\]
\[
q_z = v_z \;:\; \mu_z, \qquad z \in Z \setminus Z_R, \qquad (12)
\]

where v_z = 0 is constant for all z ∈ Z∖Z_R. Together, constraints (11) and (12) restrict problem (4) such that only a finite subset of the variables q is used. For a given finite Z_R, we can obtain an optimal primal (q*, ν*, ρ*) and dual (λ*, ω*, µ*, δ*) solution to the modified problem by solving a finite problem in the restricted set of variables {q_z : z ∈ Z_R}. Let the optimal function value of this solution be denoted by p(v). Because the optimal solution must be feasible, we have q_z = v_z = 0 for all z ∈ Z∖Z_R.

How would the objective function value p(v) change if we force a q_z to become non-zero? That is, if we increase v_z by a very small amount, can we improve the solution? The sensitivity theorem (Bertsekas, 1999, Proposition 3.3.3) provides a definite answer, namely for all z ∈ Z∖Z_R we have

\[
\nabla_{v_z}\, p(v) = -\mu_z^*.
\]

If we have ∇_{v_z} p(v) ≥ 0 for all z ∈ Z∖Z_R, then this implies that we cannot decrease p(v) by making q_z > 0. Conversely, this observation provides us with a global optimality condition: if and only if Z_R contains all relevant (positive q_z) exemplars, we have µ_z* ≤ 0 for all z ∈ Z∖Z_R.

Given Z_R and a primal-dual optimal solution, we can find an alternative expression for µ_z*. Consider the Lagrangian of the modified problem,

\[
L(q,\nu,\rho,\lambda,\omega,\mu,\delta) = \varphi(\nu,\rho) + \sum_{i=1}^{N}\lambda_i\Big(\int_Z q_z k_z(x_i)\,dz - \nu_i\Big) + \omega^\top(\rho\mathbf{1}-\nu) - \sum_{z\in Z_R}\mu_z q_z + \int_{Z\setminus Z_R}\mu_z q_z\,dz + \delta\Big(\sum_{z\in Z_R}q_z + \int_{Z\setminus Z_R}q_z\,dz - 1\Big).
\]

Because of optimality, the solution must satisfy the Karush-Kuhn-Tucker necessary conditions (Bertsekas, 1999); therefore the gradient with respect to the primal variables must be zero. Specifically, for all z ∈ Z∖Z_R we must have

\[
\nabla_{q_z} L(q^*,\nu^*,\rho^*,\lambda^*,\omega^*,\mu^*,\delta^*) = \sum_{i=1}^{N}\lambda_i^* k_z(x_i) + \mu_z^* + \delta^* = 0.
\]

This allows us to express µ_z* as

\[
\mu_z^* = -\delta^* - \sum_{i=1}^{N}\lambda_i^* k_z(x_i). \qquad (13)
\]

Therefore, if for all z ∈ Z∖Z_R we have µ_z* ≤ 0, then the current solution is optimal, despite the restrictions imposed by constraints (12). If we satisfy the optimality condition, then replacing (12) with constraints (11) does not change the solution, which remains optimal in the original problem (4).

What remains to be shown is that Algorithm 1 makes progress in each iteration and thus in the limit will satisfy the optimality condition. Consider the case where the above optimality condition is violated for one or more z ∈ Z∖Z_R. Then, let z' = argmax_{z ∈ Z∖Z_R} -Σ_{i=1}^N λ_i* k_z(x_i) be the candidate corresponding to the most negative partial derivative ∇_{v_{z'}} p(v) < 0. Because of the sensitivity theorem, adding z' to Z_R (making q_{z'} a free variable) and re-solving (4) will reduce the objective value. Therefore, either no z with ∇_{v_z} p(v) < -ε is found and convergence to the tolerance ε is established, or a strict decrease in the objective is obtained.
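The following is a minimal, self-contained sketch of the column-generation idea behind Algorithm 1, specialized to the log-likelihood objective and a Gaussian kernel. It deviates from the paper's implementation in several ways noted in the comments: the restricted master is solved with cvxpy rather than IpOpt, the subproblem is brute-forced over a dense two-dimensional grid rather than solved by mean shift, the sample weights are taken as 1/(N ν_i) (our reading of the multipliers under the log-likelihood objective), and convergence is checked by objective stagnation rather than via the dual variable δ.

```python
# A minimal sketch of the column-generation loop, NOT the authors' implementation.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(40, 2)),      # toy data: two blobs
               rng.normal(4.0, 0.5, size=(40, 2))])
N, h = len(X), 1.0                                       # bandwidth h (illustrative)

def k(Z, X):
    """Gaussian kernel responses, one row per sample, one column per exemplar z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / h ** 2)

def solve_master(K):
    """Restricted master: maximize (1/N) sum_i log((K q)_i) over the simplex."""
    q = cp.Variable(K.shape[1], nonneg=True)
    prob = cp.Problem(cp.Maximize(cp.sum(cp.log(K @ q)) / N), [cp.sum(q) == 1])
    prob.solve()
    return q.value, prob.value

# Dense candidate grid standing in for the continuous exemplar space Z.
gx, gy = np.meshgrid(np.linspace(-2, 6, 60), np.linspace(-2, 6, 60))
grid = np.column_stack([gx.ravel(), gy.ravel()])

Z = X[:1].copy()                                 # initial exemplar set Z_0 (one point)
prev = -np.inf
for it in range(30):
    q, obj = solve_master(k(Z, X))
    nu = k(Z, X) @ q                             # combined responses nu_i
    w = 1.0 / (N * nu)                           # sample weights (= -lambda_i under the
                                                 # log-likelihood objective; an assumption
                                                 # of this sketch)
    scores = k(grid, X).T @ w                    # subproblem scores for all candidates
    Z = np.vstack([Z, grid[np.argmax(scores)]])  # add the best new exemplar
    if obj - prev < 1e-6:                        # stagnation-based stopping test
        break
    prev = obj
print("objective", obj, "active exemplars", int((q > 1e-6).sum()))
```

In the paper, the brute-force grid search over candidates is replaced by the mean shift hill-climbing described in Section 3.3, which makes the approach applicable in high dimensions.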
Note that in practice we can add multiple exemplars in each iteration. Suppose that while solving the subproblem (SP) we obtain a number of good local maximizers. Then we can add all these local maximizers in order to obtain faster convergence. Adding redundant exemplars with ∇_{v_z} p(v) > 0 does not have an effect, as they will receive a zero weight q_z = 0.

3.3. On the Nature of the Subproblem

The subproblem (SP) is completely determined by the negative weighting of the training set and the shape of the smoothing kernel function. For the further discussion let us define α_i = -λ_i and rewrite the subproblem as argmax_{z ∈ Z} Σ_{i=1}^N α_i k(x_i, z). From the definition it follows that all α_i are non-negative. Clearly, this problem is non-concave whenever k is non-concave in z, which is true for all smoothing functions we consider. However, for kernel functions of the form k_z(x) = k(||x - z||), the optima of the subproblem, and thus the new candidates, are located at the modes of the expansion Σ_{i=1}^N α_i k_z(x_i). It is this fact that can be exploited to efficiently solve the subproblem by standard hill-climbing algorithms. Such algorithms start at a point z^(0) in input space and iteratively generate better candidates such that Σ_{i=1}^N α_i k_{z^(t+1)}(x_i) > Σ_{i=1}^N α_i k_{z^(t)}(x_i). In this paper, we use the weighted mean shift procedure, which was introduced by (Fukunaga & Hostetler, 1975; Cheng, 1995) and gained popularity due to (Comaniciu & Meer, 2002). Given an initial starting point z^(0), the iterates are produced by

\[
z^{(t+1)} = \frac{\sum_{i=1}^{N}\alpha_i\, g\!\left(\left\|\frac{z^{(t)}-x_i}{h}\right\|^2\right) x_i}{\sum_{i=1}^{N}\alpha_i\, g\!\left(\left\|\frac{z^{(t)}-x_i}{h}\right\|^2\right)}, \qquad (14)
\]

where g : R_+ → R_+ is the negative derivative of the so-called kernel profile.
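A small NumPy sketch of the weighted mean shift iteration (14) for the Gaussian kernel, whose profile derivative g is again Gaussian, is given below; the data, weights and bandwidth are illustrative, and in Algorithm 1 the weights α_i = -λ_i would come from the dual solution of the restricted master problem.

```python
# Weighted mean shift iteration (14) for the Gaussian kernel (illustrative sketch).
import numpy as np

def weighted_mean_shift(z0, X, alpha, h, tol=1e-8, max_iter=500):
    """Hill-climb from z0 to a mode of sum_i alpha_i * exp(-||x_i - z||^2 / (2 h^2))."""
    z = np.asarray(z0, dtype=float)
    for _ in range(max_iter):
        d2 = ((X - z) ** 2).sum(axis=1)            # squared distances ||z - x_i||^2
        w = alpha * np.exp(-0.5 * d2 / h ** 2)     # alpha_i * g(||(z - x_i)/h||^2)
        z_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z

# Toy usage: start from every training sample and keep the distinct modes found.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
alpha = np.full(len(X), 1.0 / len(X))              # uniform weights for illustration
modes = np.array([weighted_mean_shift(x, X, alpha, h=0.5) for x in X])
print(np.unique(modes.round(2), axis=0))           # roughly the two cluster centers
```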
If for a continuous kernel the function g is convex and non-increasing, then the mean shift procedure is guaranteed to converge to a local maximum (Comaniciu & Meer, 2002). For each of the common continuous smoothing kernels a unique function g exists; some popular kernels and their profile derivatives are discussed in section 4. For the Gaussian kernel, g is a scaled version of the original kernel profile and thus particularly easy to maximize.^4 Mean shift is popular in computer vision, where specialized procedures have been developed to efficiently find globally good modes, for example the annealed mean shift procedure (Shen et al., 2007).

If the smoothing kernel function is a reproducing Hilbert kernel (Schölkopf & Smola, 2002), then problem (SP) is known as the pre-image problem (Schölkopf et al., 1999). An important difference which simplifies our subproblem considerably is that all our weights are of the same sign. In the general pre-image problem the sign is not fixed, and procedures such as the one of (Schölkopf et al., 1999) can be unstable and do not have a convergence guarantee.

^4 The Gaussian kernel has received special attention in the literature. In (Carreira-Perpiñán, 2000) it was conjectured that the number of modes in a Gaussian mixture is bounded above by the number of components. While this is true in the univariate case, it has been proven wrong in general in (Carreira-Perpiñán & Williams, 2003). See also the counter-example at http://www.inference.phy.cam.ac.uk/mackay/gaussians/.

3.4. Optimality Bound

The proof of global optimality of the solution obtained by Algorithm 1 was based on the assumption that the subproblem (SP) can be solved globally. We now show that even without this assumption, the method can be no worse than methods using a fixed exemplar set.

Theorem 2  Given φ(ν, ρ), a set X = {x_i}_{i=1,...,N}, x_i ∈ 𝒳, and a finite set of exemplars Z_F = {z_j}_{j=1,...,M}, the solution obtained by solving problem (4) with Z = Z_F cannot achieve a better objective than the solution obtained by Algorithm 1 with Z = 𝒳, Z_0 = Z_F.

Proof. Let Algorithm 1 be called with Z_0 = Z_F. In the first iteration of Algorithm 1, the solved problem is identical to problem (4) with Z = Z_F. Therefore, after the first iteration, the objective of Algorithm 1 is equal to the one obtained by solving problem (4). In all later iterations, the objective can only improve.

4. Experiments and Results

For the following experiments, we solve the restricted master problem (4) using IpOpt (Wächter & Biegler, 2006), a modern primal-dual interior point solver for non-linear programming, available as open source. For each master problem, we obtain accurate convergence in a few dozen solver iterations. We use tolerances of 10^{-10} for the restricted master problem and 10^{-7} for the subproblems in all experiments.^5

^5 Our implementation is available at http://www.kyb.mpg.de/bs/people/nowozin/infex/.

As smoothing kernels we use the unnormalized Gaussian, the unnormalized Epanechnikov, and a simple uniform disc kernel. All are parametrized by a bandwidth parameter h. The following are the kernel functions k and the profile derivatives g used in the mean shift procedure.

1. Gaussian, bandwidth h:
\[
k_z(x) = e^{-\frac{1}{2}\left\|\frac{x-z}{h}\right\|^2}, \qquad g(y) = \tfrac{1}{2}\, e^{-\frac{1}{2}y}.
\]

2. Epanechnikov, bandwidth h:
\[
k_z(x) = \begin{cases} 1 - \left\|\frac{x-z}{h}\right\|^2 & \left\|\frac{x-z}{h}\right\| \le 1 \\ 0 & \text{otherwise,} \end{cases} \qquad g(y) = \begin{cases} 1 & 0 \le y \le 1 \\ 0 & y > 1. \end{cases}
\]

3. Uniform disc, maximum distortion h:
\[
k_z(x) = \begin{cases} 1 & \left\|\frac{x-z}{h}\right\| \le 1 \\ 0 & \text{otherwise.} \end{cases}
\]

The first two kernels are common in non-parametric density estimation, whereas the last one is used by (Tipping & Schölkopf, 2001) for vector quantization.
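For illustration, the three kernels and their mean shift profile derivatives g can be written as follows; the function and argument names are ours.

```python
# Illustrative NumPy versions of the three smoothing kernels and the profile
# derivatives g listed above.
import numpy as np

def gaussian_k(x, z, h):
    return np.exp(-0.5 * np.sum(((x - z) / h) ** 2, axis=-1))

def gaussian_g(y):
    return 0.5 * np.exp(-0.5 * y)

def epanechnikov_k(x, z, h):
    r2 = np.sum(((x - z) / h) ** 2, axis=-1)
    return np.where(r2 <= 1.0, 1.0 - r2, 0.0)

def epanechnikov_g(y):
    return np.where((0.0 <= y) & (y <= 1.0), 1.0, 0.0)

def uniform_disc_k(x, z, h):
    r2 = np.sum(((x - z) / h) ** 2, axis=-1)
    return np.where(r2 <= 1.0, 1.0, 0.0)

# Example: responses of three 2-D samples to one exemplar z with bandwidth h = 1.
x = np.array([[0.0, 0.0], [0.5, 0.0], [2.0, 0.0]])
z = np.zeros(2)
print(gaussian_k(x, z, 1.0), epanechnikov_k(x, z, 1.0), uniform_disc_k(x, z, 1.0))
```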
We use the mean shift procedure (14), started from all training samples, to solve the subproblem (SP) for the Gaussian and Epanechnikov kernels. We collect the result of each run and add the set of unique local maximizers to the restricted master problem. However, mean shift cannot be used to solve subproblem (SP) for the non-continuous uniform disc kernel. Instead, when using the uniform disc kernel, we find new codebook candidates by solving the subproblem with the Epanechnikov kernel. This is a reasonable approximation, as the Epanechnikov kernel response lower-bounds the uniform disc kernel response and its maximum lies in the center of the disc.

4.1. Comparison with KVQ

In the first experiment we compare the original Kernel Vector Quantization formulation (1), with all training exemplars as possible prototypes, to our Algorithm 1, where the initial exemplar set is empty, Z_0 = ∅. We use the first objective φ(ν, ρ) = -ρ and the uniform disc kernel. As dataset we use a subset of 1100 exemplars from the USPS digits machine learning dataset, with all labels removed and each class sampled equally such that there are 110 exemplars from each class. We evaluate by selecting the maximum allowed distortion h from {800, 1000, 1200, 1400, 1600, 1800, 2000}, where 2000 is the mean inter-class L2-distance in the dataset. We compare the achieved margin ρ_KVQ(h) with ρ_INFEX(h), and the number of codebook vectors ||q_KVQ||_0 with ||q_INFEX||_0. Figures 3 and 4 show these quantities as the maximum allowed distortion is varied.

The proposed method outperforms KVQ, selecting a smaller number of codebook vectors and achieving a better objective value. Especially for larger allowed distortions, the benefit of selecting an arbitrary point in input space is substantial, as due to the high dimensionality of the data set all input samples are relatively far away from each other. Because we use Z_0 = ∅ to initialize our method, the results show that our subproblem approximation using the Epanechnikov kernel is an effective way to find good codebook candidates.

4.2. Comparison with Gaussian Mixture EM

In the second experiment we consider mixture model density estimation and compare our method with Convex Clustering and a homoscedastic Gaussian mixture (Σ = σ^2 I) learned with Expectation Maximization (EM).^6 The log-likelihood objective and the same USPS dataset as before are used. The experimental protocol is as follows. For a range of bandwidths, our model and convex clustering are run once per bandwidth. For each run, the number of components of our model is used to fix the number of components in the Gaussian mixture model, which is trained by EM starting 20 times from random initial sample points.

The results are shown in Table 1. Clearly, a single run of our model is consistently the best. The best EM run is always close to our result and Convex Clustering is always the worst. (Lashkari & Golland, 2007) mention that their solution "can be improved in practice with a few extra steps of the EM algorithm". From Table 1, we conclude that the results of convex clustering are qualitatively inferior to plain EM and that such refitting is actually essential for obtaining good results.

Table 1. Achieved log-likelihoods. CC is Convex Clustering; for EM the best and mean of 20 runs are shown.

Bandwidth   CC        Infex     EM best   EM mean
440         -6.3356   -5.1370   -5.1442   -5.1485
460         -6.1269   -4.7424   -4.7486   -4.7503
480         -5.8705   -4.3796   -4.3823   -4.3834
500         -5.5813   -4.0499   -4.0507   -4.0520
520         -5.2780   -3.7499   -3.7502   -3.7512
540         -4.9779   -3.4788   -3.4789   -3.4795

^6 A similar experiment is in (Lashkari & Golland, 2007).
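The homoscedastic Gaussian mixture baseline of this comparison can be sketched with a generic textbook EM for a shared spherical covariance σ²I; this is our own illustration with simplistic initialization and stopping, not the code used to produce Table 1.

```python
# Compact EM sketch for a homoscedastic (shared sigma^2 I) Gaussian mixture.
import numpy as np

def em_spherical_gmm(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    means = X[rng.choice(N, K, replace=False)]     # random sample points as initial means
    pi = np.full(K, 1.0 / K)
    sigma2 = X.var()
    for _ in range(n_iter):
        # E-step: responsibilities; the shared normalization constant cancels.
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)      # (N, K)
        logr = np.log(pi) - 0.5 * d2 / sigma2
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: mixture weights, means, and the single shared variance.
        nk = r.sum(axis=0)
        pi = nk / N
        means = (r.T @ X) / nk[:, None]
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        sigma2 = (r * d2).sum() / (N * d)
    return pi, means, sigma2

# Toy usage on synthetic data with two components.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(5, 1, (100, 3))])
pi, means, sigma2 = em_spherical_gmm(X, K=2)
print(pi, means.round(2), sigma2)
```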
4.3. Subproblem Modes

In the last experiment we show the qualitative behavior of our model with the Epanechnikov kernel with h = 1500 and the log-likelihood objective. Because the Epanechnikov kernel has finite support, if we start with Z_0 = ∅ we could have some samples x_i with zero response, because k_{z_j}(x_i) = 0 for all j. Then the restricted variables q_j are too few and problem (4) would be infeasible. Thus, in order to ensure feasibility of the initial master problems, we use Z_0 = X. Some subproblem modes are shown in Figure 5. The modes approximate the "natural" clusters well, except for classes such as 3, 8 and 9, which seem to be explained by one joint region with many local modes in it, for example in the first and second row.

Figure 3. Optimal margin as a function of the maximum allowed distortion. Note the log-scale.
Figure 4. The number of selected prototypes as a function of the maximum allowed distortion.
Figure 5. Subproblem modes found in different iterations.

5. Discussion and Conclusion

We presented a unifying perspective on existing exemplar-based methods that aim at density estimation, clustering and vector quantization. Existing methods were either non-convex or achieved convexity by severe restrictions. In contrast, our approach, although still non-convex as a whole, is provably better than all existing methods. This is achieved by isolating a non-convex but still efficiently solvable subproblem. The non-convex subproblem is embedded into a convex master problem that steers towards an optimal solution.

One limitation of our model is that one cannot fix ||q||_0, the number of components. For problems where guarantees such as maximum distortion or smoothness are more natural constraints, this is not an issue.

There are open questions that result from our work:
1. Does there exist a response function k that is useful for unsupervised learning and at the same time yields a globally solvable subproblem?
2. What is the relation between the objective φ, the kernel k and the number of components ||q||_0?
3. Can a decomposition similar to ours yield a training scheme for supervised learning of RBF networks in the line of (Bengio et al., 2005)?

Acknowledgments

This work is funded in part by the EU CLASS project, IST 027978. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.
References

Banerjee, A., Merugu, S., Dhillon, I. S., & Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6, 1705-1749.

Bengio, Y., Roux, N. L., Vincent, P., Delalleau, O., & Marcotte, P. (2005). Convex neural networks. NIPS.

Bertsekas, D. P. (1999). Nonlinear programming. Athena Scientific, 2nd edition.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.

Carreira-Perpiñán, M. Á. (2000). Mode-finding for mixtures of Gaussian distributions. IEEE Trans. Pattern Anal. Mach. Intell., 22, 1318-1323.

Carreira-Perpiñán, M. Á., & Williams, C. K. I. (2003). An isotropic Gaussian mixture can have more modes than components.

Cheng, Y. (1995). Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Analysis and Machine Intelligence, 17, 790-799.

Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell., 24, 603-619.

Cornuejols, G., & Tütüncü, R. (2007). Optimization methods in finance. Mathematics, Finance and Risk.

Fukunaga, K., & Hostetler, L. D. (1975). The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Information Theory, 21, 32-40.

Geoffrion, A. M. (1972). Generalized Benders decomposition. Journal of Optimization Theory and Applications, 10, 237-260.

Lashkari, D., & Golland, P. (2007). Convex clustering with exemplar-based models. NIPS.

Rätsch, G., Schölkopf, B., & Mika, S. (2001). SVM and boosting: One class.

Rosset, S., & Segal, E. (2002). Boosting density estimation. NIPS (pp. 641-648). MIT Press.

Rückert, U., & Kramer, S. (2006). A statistical approach to rule learning. ICML (pp. 785-792).

Schölkopf, B., Mika, S., Burges, C. J. C., Knirsch, P., Müller, K.-R., Rätsch, G., & Smola, A. J. (1999). Input space versus feature space in kernel-based methods. IEEE-NN, 10, 1000.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. MIT Press.

Shen, C., Brooks, M. J., & van den Hengel, A. (2007). Fast global kernel density mode seeking: Applications to localization and tracking. IEEE Transactions on Image Processing, 16, 1457-1469.

Tipping, M., & Schölkopf, B. (2001). A kernel approach for vector quantization with guaranteed distortion bounds. AISTATS.

Wächter, A., & Biegler, L. T. (2006). On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106, 25-57.