Clustering via LP-based Stabilities

Nikos Komodakis, University of Crete, komod@csd.uoc.gr
Nikos Paragios, Ecole Centrale de Paris / INRIA Saclay Ile-de-France, nikos.paragios@ecp.fr
Georgios Tziritas, University of Crete, tziritas@csd.uoc.gr

Abstract

A novel center-based clustering algorithm is proposed in this paper. We first formulate clustering as an NP-hard linear integer program and we then use linear programming and duality theory to derive the solution of this optimization problem. This leads to an efficient and very general algorithm, which works in the dual domain and can cluster data based on an arbitrary set of distances. Despite its generality, it is independent of initialization (unlike EM-like methods such as K-means), has guaranteed convergence, can automatically determine the number of clusters, and can also provide online optimality bounds about the quality of the estimated clustering solutions. To deal with the most critical issue in a center-based clustering algorithm (the selection of cluster centers), we also introduce the notion of stability of a cluster center, which is a well-defined LP-based quantity that plays a key role in our algorithm's success. Furthermore, we introduce what we call the margins (another key ingredient of our algorithm), which can be roughly thought of as dual counterparts to stabilities and allow us to obtain computationally efficient approximations to the latter. Promising experimental results demonstrate the potential of our method.

1 Introduction

Clustering is considered one of the most fundamental unsupervised learning problems. It lies at the heart of many important tasks in machine learning, pattern recognition, computer vision, data mining, biology, and marketing, to mention just a few of its application areas. Most clustering methods are center-based, trying to extract a set of cluster centers that best `describe' the input data. Typically, this translates into an optimization problem where one seeks to assign each input data point to a unique cluster center such that the total sum of the corresponding distances is minimized. These techniques are extremely popular and are thus essential even to other types of clustering algorithms, such as spectral clustering methods [1],[2]. Currently, most center-based clustering methods rely on EM-like schemes for optimizing their clustering objective function [3]. K-means is the most characteristic (and perhaps the most widely used) technique from this class. It keeps greedily refining a current set of cluster centers based on a simple gradient descent scheme. As a result, it can very easily get trapped in bad local minima and is extremely sensitive to initialization. It is thus likely to fail in problems with, e.g., a large number of clusters. A second very important drawback of many center-based clustering methods, which severely limits their applicability, is that they either require the input data to be of vectorial form and/or impose strong restrictions on the type of distance functions they can handle. Ideally, one would like to be able to cluster data based on arbitrary distances. This is an important point because, by an appropriate choice of these distances, clustering results with completely different characteristics can be achieved [4]. In addition, one would prefer that the number of clusters is automatically estimated by the algorithm (e.g., as a byproduct of the optimization process) and not given as input.
In contrast, however, many algorithms assume that this number is known a priori. To circumvent all the issues mentioned above, a novel center-based clustering algorithm is proposed in this paper. Similarly to other methods, it reduces clustering to a well-defined (but NP-hard) minimization problem, where, of course, the challenge now is how to obtain solutions of minimum objective value. To this end, we rely on the fact that the above problem admits a linear integer programming formulation. By making heavy use of a dual LP relaxation to that program, we then manage to derive a dual-based algorithm for clustering. As in all center-based clustering techniques, the most critical component in the resulting algorithm is deciding what cluster centers to choose. To this end, we introduce what we call the stability of a data point as a cluster center (an LP-based quantity), which we consider as another contribution of this work. Intuitively, the stability of a data point as a cluster center tries to measure how much we need to penalize that point (by appropriately modifying the objective function) such that it can no longer be chosen as a center in an optimal solution of the modified problem. Obviously, one would like to choose as centers those points having high stability. For applying this idea in practice, however, a crucial issue that one needs to deal with is how to efficiently approximate these stability measures. To this end, we introduce what we call the margins, another very important concept in our algorithm and a key contribution of our work. As we prove in this paper, margins can be considered as dual to stabilities. Furthermore, they allow us to approximate the latter on the fly, i.e., as our algorithm runs. The outcome is an efficient and very easily implementable optimization algorithm, which works in the dual domain by iteratively updating a dual solution via two very simple operations: DISTRIBUTE and PROJECT. It can cluster data based on an arbitrary set of distances, which is the only input required by the algorithm (as a result, it can find use in a wide variety of applications, even in cases where non-vectorial data must be used). Furthermore, an important point is that, despite its generality, it does not get trapped in bad local minima. It is thus insensitive to initialization and can always compute clusterings of very low cost. Similarly to [5], the number of clusters does not need to be predefined, but is decided on the fly during the optimization process. However, unlike [5], convergence of the proposed method is always guaranteed and no parameter adjustment is needed for this. Finally, an additional advantage of our method is that it can provide online optimality guarantees, which can be used for assessing the quality of the generated clusterings. These guarantees come in the form of lower bounds on the cost of the optimal clustering and are computed (for free) simply by using the cost of the dual solutions generated during the course of the algorithm.

2 Clustering via stabilities based on Linear Programming

Given a set of objects V with distances d = {d_pq}, clustering amounts to choosing a set of cluster centers from V (say {q_i}_{i=1}^k) such that the sum of distances between each object and its closest center is minimized.
To this end, we are going to use the following objective function E(·) (which will be referred to as the primal cost hereafter):

  min_{k, {q_i}_{i=1}^k} E({q_i}_{i=1}^k) = Σ_{p∈V} min_i d_{p q_i} + Σ_i d_{q_i q_i}   (1)

Note that, in this case, we require that each cluster center is chosen from the set V. Also note that, besides {q_i}, here we optimize over the number of cluster centers k as well. Of course, to avoid the trivial solution of choosing all objects as centers, we regularize the problem by assigning a penalty d_qq to each chosen center q. Problem (1) has an equivalent formulation as a 0-1 linear integer program [6], whose relaxation leads to the following LP (denoted by PRIMAL hereafter):

  PRIMAL ≡ min Σ_{p,q∈V} d_pq x_pq   (2)
  s.t.  Σ_{q∈V} x_pq = 1,  ∀p ∈ V   (3)
        x_pq ≤ x_qq,  ∀p, q ∈ V   (4)
        x_pq ≥ 0   (5)

To get a problem equivalent to (1), we simply have to replace x_pq ≥ 0 with x_pq ∈ {0, 1}. In this case, each binary variable x_pq with p ≠ q indicates whether object p has been assigned to cluster center q or not, while binary variable x_qq indicates whether object q has been chosen as a cluster center or not. Constraints (3) simply express the fact that each object must be assigned to exactly one center, while constraints (4) require that if p has been assigned to q then object q must obviously be chosen as a center.

Obviously, at the core of any clustering problem of this type lies the issue of deciding which objects will be chosen as centers. To deal with that, a key idea of our approach is to rely on what we call the stability of an object. This will be a well-defined measure which, intuitively, tries to quantitatively answer the following question: "How much do we need to penalize an object in order to ensure that it is never selected as an optimal cluster center?" For formalizing this concept, we will make use of the LP relaxation PRIMAL. We will thus define the stability S(q) of an object q as follows:

  S(q) = inf{perturbation s that has to be applied to the penalty d_qq (i.e., d_qq → d_qq + s) such that PRIMAL has no optimal solution x with x_qq > 0}   (6)

An object q can be stable or unstable depending on whether S(q) ≥ 0 or S(q) < 0. To select a set of centers Q, we will then rely on the following observation: a stable object with high stability is also expected to be, with high probability, an optimal center in (1). The reason is that the assumption of a high S(q) ≥ 0 is essentially a very strong requirement (much stronger than simply requiring q to be active in the relaxed problem PRIMAL): it further requires that q will be active for all problems PRIMAL(d_qq + s) as well (where s ≤ S(q)); here PRIMAL(z) denotes a modified problem PRIMAL in which the penalty for q has been set equal to z. Hence, our strategy for generating Q will be to sequentially select a set of stable objects, trying, at each step, to select an object of approximately maximum stability (as already explained, there is a high chance that this object will be an optimal center in (1)). Furthermore, each time we insert a stable object q to Q, we re-estimate the stabilities of the remaining objects in order to take this fact into account (e.g., an object may become unstable if we know that x_q'q' = 1 holds for another object q'). To achieve that, we will need to impose extra constraints on PRIMAL (as we shall see, this will help us obtain an accurate estimation of the stabilities of the remaining objects given that the objects in Q are already chosen as centers). Of course, this process repeats until no more stable objects can be found.
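To make the quantity being optimized concrete, the short Python sketch below evaluates the cost of an integral solution of (2)-(5) for a given set of centers. The convention that the diagonal of the distance matrix stores the penalties d_qq, as well as the toy data at the end, are assumptions made purely for illustration.

```python
import numpy as np

def clustering_cost(d, centers):
    """Cost of the integral solution of (2)-(5) induced by a set of centers:
    every object outside `centers` is assigned to its closest center, and every
    chosen center q pays its penalty d[q, q] (stored on the diagonal here)."""
    centers = np.asarray(list(centers))
    n = d.shape[0]
    non_centers = np.setdiff1d(np.arange(n), centers)
    assignment = d[np.ix_(non_centers, centers)].min(axis=1).sum()
    penalties = d[centers, centers].sum()
    return assignment + penalties

# Tiny usage example on hypothetical 2D points:
rng = np.random.default_rng(0)
pts = rng.random((20, 2))
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
np.fill_diagonal(d, np.median(d[d > 0]))   # common penalty: median pairwise distance
print(clustering_cost(d, centers=[3, 11]))
```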
2.1 Margins and dual-based clustering

Of course, for having a practical algorithm, the most critical issue is how to obtain a rough approximation to the stability of an object q in a computationally efficient manner. As we shall see, to achieve this we will need to move to the dual domain and introduce a novel concept that lies at the core of our approach: the margin of dual solutions. But, first, we need to introduce the dual to problem PRIMAL, which is the linear program called DUAL in (7) (DUAL results from the standard dual to PRIMAL after applying a transformation to the dual variables):

  DUAL ≡ max D(h) = Σ_{p∈V} h_p   (7)
  s.t.  h_p = min_{q∈V} h_pq,  ∀p ∈ V   (8)
        Σ_{p∈V} h_pq = Σ_{p∈V} d_pq,  ∀q ∈ V   (9)
        h_pq ≥ d_pq,  ∀p ≠ q   (10)

Dual variables h_pq can be thought of as representing pseudo-distances between objects, while each variable h_p represents the minimum pseudo-distance from p (which is, in fact, `thought' by the dual as an estimate of the actual distance between p and its closest active center). Given a feasible dual solution h, we can now define its margin Δ_q(h) (with respect to object q) as follows:

  Δ_q(h) = Σ_{p: h_pq = h_p} (ĥ_p − h_p) − Σ_{p≠q: h_pq > h_p} (h_pq − max(h_p, d_pq)) − (h_qq − h_q),   (11)

where (for any h) ĥ_p hereafter denotes the next-to-minimum pseudo-distance from p. There is a very tight connection between the margins of dual solutions and the stabilities of objects. The following lemma provides a first indication of this fact and shows that we can actually use margins to decide whether an object is stable or not and also to lower-bound or upper-bound its stability accordingly (see [7] for proofs):

Lemma 1 ([7]). Let h be an optimal dual solution to DUAL. 1. If Δ_q(h) > 0 then S(q) ≥ Δ_q(h). 2. If Δ_q(h) < 0 then S(q) ≤ Δ_q(h).

In fact, the following fundamental theorem goes even further by proving that stabilities can be fully characterized solely in terms of margins. Hence, margins and stabilities are two concepts that can roughly be considered as dual to each other:

Theorem 2 ([7]). The following equalities hold true:

  S(q) ≥ 0 ⟹ S(q) = sup{Δ_q(h) | h optimal solution to DUAL},   (12)
  S(q) ≤ 0 ⟹ S(q) = inf{Δ_q(h) | h optimal solution to DUAL}.   (13)

Furthermore, it can be shown that:

  S(q) = sign(S(q)) · sup{|Δ_q(h)| : h optimal solution to DUAL}.   (14)

What the above theorem essentially tells us is that one can compute S(q) exactly, simply by considering the margins of optimal dual solutions. Based on this fact, it is therefore safe to assume that solutions h with a high (but not necessarily maximum) dual objective D(h) will have margins that are good approximations to S(q), i.e., it holds that:

  S(q) ≈ Δ_q(h).   (15)

This is exactly the idea that our clustering algorithm relies on in order to efficiently discover objects that are stable. It thus maintains a dual solution h and a set Q containing all stable objects chosen as centers up to the current point (Q is empty initially). At each iteration, it increases the dual objective D(h) by updating the solution h via an operation called DISTRIBUTE. This operation is repeatedly applied until a high enough objective value D(h) is obtained such that at least one stable object is revealed based on the estimated margins of h. At that point, the set Q is expanded and h is updated (via an operation called PROJECT) to take account of this fact. The process is then repeated until no more stable objects can be found.
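As a concrete illustration of the bookkeeping behind (11), the following sketch computes the margin Δ_q(h) of a given object from a dense pseudo-distance matrix. It follows the reconstruction of (11) given above, so it should be read as an illustrative approximation rather than the authors' code, and it handles ties for the minimum pseudo-distance naively.

```python
import numpy as np

def margin(h, d, q):
    """Margin Delta_q(h) of Eq. (11) for object q.  `h` is a feasible
    pseudo-distance matrix and `d` the distance matrix whose diagonal holds the
    penalties d_qq (same convention as the earlier snippet)."""
    n = h.shape[0]
    h_sorted = np.sort(h, axis=1)
    h_min = h_sorted[:, 0]       # h_p: minimum pseudo-distance from p
    h_next = h_sorted[:, 1]      # \hat h_p: next-to-minimum pseudo-distance from p
    delta = -(h[q, q] - h_min[q])                 # last term: -(h_qq - h_q)
    for p in range(n):
        if np.isclose(h[p, q], h_min[p]):         # q attains p's minimum
            delta += h_next[p] - h_min[p]
        elif p != q:                              # h_pq > h_p
            delta -= h[p, q] - max(h_min[p], d[p, q])
    return delta
```

By Lemma 1, a positive value returned for an optimal (or near-optimal) h can be read as a lower bound on the stability of q, and a negative one as an upper bound.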
A remarkable thing to note in this process is that, as we shall see, determining how to update h during the DISTRIBUTE operation (i.e., for increasing the dual objective) also relies critically on the use of margins. Another technical point that we need to resolve comes from the fact that Q gets populated with objects as the algorithm proceeds, which is something that we certainly need to take into account when estimating object stabilities. Fortunately, there is a very elegant solution to this problem: since all objects in Q are assumed to be cluster centers (i.e., x_qq = 1, ∀q ∈ Q), instead of working with problems PRIMAL and DUAL, it suffices to work with the following primal-dual pair of LPs, called PRIMAL_Q and DUAL_Q (to represent the dual of PRIMAL_Q exactly, we would actually need to add a constant to the objective function of DUAL_Q; since this constant does not affect the maximization, it is omitted for clarity):

  PRIMAL_Q ≡ min PRIMAL  s.t.  x_qq = 1, ∀q ∈ Q
  DUAL_Q ≡ max DUAL  s.t.  h_pq = d_pq, ∀{p, q}: {p, q} ∩ Q ≠ ∅

This means, e.g., that the stability S(q) is now defined by using PRIMAL_Q (instead of PRIMAL) in (6). Likewise, Lemma 1 and Theorem 2 continue to hold true provided that DUAL is replaced with DUAL_Q in the statement of these theorems. In addition, the definition of the margin Δ_q(h) needs to be modified as follows:

  Δ_q(h) = Σ_{p∉Q: h_pq = h_p} (ĥ_p − h_p) − Σ_{p∉Q∪{q}: h_pq > h_p} (h_pq − max(h_p, d_pq)) − (h_qq − h_q).   (16)

The PROJECT operation: Given this modified definition of margins, we can now update Q at any iteration in the following manner:

  EXPAND: Compute q̄ = argmax_{q∉Q} Δ_q(h) and, if Δ_q̄(h) ≥ 0, set Q = Q ∪ {q̄}.   (17)

Based on the fact that margins are used as approximations to the stabilities of objects, the above update simply says that the object q̄ with maximum stability should be chosen as the new center at the current iteration, provided of course that this object q̄ is stable. Furthermore, in this case, we also need to update the current dual solution h in order to take account of the fact that extra constraints have been added to DUAL_Q (these are a result of the extra constraint x_q̄q̄ = 1 that has been added to PRIMAL_Q). By definition of DUAL_Q, the new constraints are h_q̄p = d_q̄p, h_pq̄ = d_pq̄ for all p ∉ Q and, so, one has to apply the following operation, which simply projects the current dual solution onto the feasible set of the updated linear program DUAL_Q:

  PROJECT: h_pp += h_q̄p − d_q̄p,  h_q̄p = d_q̄p,  h_pq̄ = d_pq̄,  ∀p ∉ Q.   (18)

Note that the update h_pp += h_q̄p − d_q̄p is needed for maintaining the dual feasibility constraint (9). Essentially, PROJECT is a warm-start operation that allows us to reuse existing information for computing a solution h that has a high dual objective value D(h) and is also feasible for the updated DUAL_Q.
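A minimal sketch of the PROJECT update (18) on a dense pseudo-distance matrix is given below; representing h and d as numpy arrays and Q as a Python set are implementation assumptions, and the snippet is not the authors' implementation.

```python
def project(h, d, q_bar, Q):
    """PROJECT step of Eq. (18), applied right after q_bar has been chosen as a
    new center.  The pseudo-distances pinned down by the new DUAL_Q constraints
    are clamped to the corresponding distances, and the removed mass is shifted
    onto h_pp so that the balance constraint (9) keeps holding for every column p.
    `h` and `d` are dense |V| x |V| arrays (h is modified in place); `Q` is the
    set of centers chosen *before* q_bar was added."""
    for p in range(h.shape[0]):
        if p in Q or p == q_bar:
            continue
        h[p, p] += h[q_bar, p] - d[q_bar, p]   # keep column p balanced (Eq. 9)
        h[q_bar, p] = d[q_bar, p]
        h[p, q_bar] = d[p, q_bar]
    return h
```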
The DISTRIBUTE operation: In case Δ_q(h) < 0 holds for all q ∉ Q, we are unable to find an object with good stability at the current iteration. To counter that, we thus need to update the solution h in order to increase its dual objective value (recall that, by Lemma 1, stable objects will necessarily be revealed at an optimal dual solution, i.e., at a dual solution of maximum objective). Intuitively, what happens is that, as we increase the dual objective D(h), the objects not in Q actually compete with each other for achieving a large margin. Interestingly enough, in order to increase D(h), we again have to rely on the margins of the current dual solution. In particular, it turns out that, if Δ_q(h) < 0 holds true for all q ∉ Q, then the following very simple update of h is guaranteed to increase the dual objective (a small implementation sketch is given at the end of this section):

  DISTRIBUTE: ∀p, q ∉ Q,
    h_pq = max(h_p, d_pq),            if p ≠ q and (p ∈ L_Q or h_p < d_pq)
    h_pq = h_p − Δ_q(h)/|V_q|,        else if h_pq > h_p
    h_pq = ĥ_p − Δ_q(h)/|V_q|,        else if h_pq = h_p

In the above update, we denote by L_Q the set of objects whose minimum pseudo-distance h_p is attained at an object from Q, i.e., L_Q = {p ∉ Q | h_p = min_{q∈Q} h_pq}, while |V_q| denotes the cardinality of the set V_q = {p ∉ Q ∪ L_Q | h_p ≥ d_pq} ∪ {q}. The following theorem then holds true:

Theorem 3. If max_{q∉Q} Δ_q(h) < 0, then the DISTRIBUTE operation maintains feasibility and, unless V = Q ∪ L_Q, it also strictly increases the dual objective.

The pseudocode of the resulting algorithm is shown in Fig. 1. As already explained, it is an iterative algorithm, which keeps updating a dual solution h by using the DISTRIBUTE and PROJECT operations (the latter applied only when needed) until the dual objective can no longer increase.

  1: h ← d;
  2: while max_{q∉Q} Δ_q(h) < 0 do
  3:   D_prev ← D(h); h ← DISTRIBUTE(h);
  4:   if D_prev = D(h) then exit;
  5: end
  6: q̄ ← argmax_{q∉Q} Δ_q(h); Q ← Q ∪ {q̄}; h ← PROJECT(h);
  7: goto 2;
  Fig. 1: Pseudocode of our clustering algorithm.

Note also that, besides maintaining a dual solution h, the algorithm maintains the set Q, which provides a current clustering and has a primal cost E(Q). With respect to this cost, the following theorem can be shown to hold true:

Theorem 4. If max_{q∉Q} Δ_q(h) > 0, then the EXPAND operation strictly decreases the primal cost E(Q).

This implies that the sequence of primal costs E(Q) generated by the algorithm is decreasing (recall that we actually want to minimize E(·)). It is worth noting at this point that nowhere have we tried to enforce this property by explicitly considering the primal cost when updating Q. It is achieved simply thanks to the requirement of always selecting objects with high stability, which shows how powerful this requirement actually is. We also note that the algorithm's convergence is always guaranteed: the algorithm terminates when neither the primal cost E(Q) decreases nor the dual objective D(h) increases during the current iteration. Finally, we note that exactly the same algorithm applies to the general case where the objects in V form a graph with edges E (the distance d_pq is then defined only for pq ∈ E). In this case, it is easy to verify that the cost of each iteration is O(|E|). Furthermore, the algorithm converges extremely fast in practice (i.e., in very few iterations).
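For completeness, the sketch below spells out one DISTRIBUTE sweep under the update rule reconstructed above. The dense-matrix conventions and the precomputed vector of margins (e.g., obtained with the margin helper sketched earlier) are assumptions of this illustration rather than the authors' implementation.

```python
import numpy as np

def distribute(h, d, Q, margins):
    """One DISTRIBUTE sweep.  `h`, `d` are dense |V| x |V| arrays (diagonal of d
    holds the penalties), `Q` is the set of already-chosen centers, and
    `margins[q]` holds Delta_q(h) for the current h (entries for q in Q are
    ignored).  All columns are recomputed from the old matrix, so the update is
    simultaneous."""
    n = h.shape[0]
    h_sorted = np.sort(h, axis=1)
    h_min, h_next = h_sorted[:, 0], h_sorted[:, 1]   # h_p and \hat h_p
    # L_Q: objects whose minimum pseudo-distance is attained at some center in Q
    L = {p for p in range(n)
         if p not in Q and Q and np.isclose(h_min[p], min(h[p, c] for c in Q))}
    new_h = h.copy()
    for q in range(n):
        if q in Q:
            continue
        # V_q: objects that can still contribute towards q, plus q itself
        V_q = [p for p in range(n)
               if p not in Q and p not in L and p != q and h_min[p] >= d[p, q]] + [q]
        for p in range(n):
            if p in Q:
                continue
            if p != q and (p in L or h_min[p] < d[p, q]):
                new_h[p, q] = max(h_min[p], d[p, q])
            elif h[p, q] > h_min[p]:
                new_h[p, q] = h_min[p] - margins[q] / len(V_q)
            else:                                    # h_pq == h_p
                new_h[p, q] = h_next[p] - margins[q] / len(V_q)
    return new_h
```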
3 Related work

Before proceeding, let us briefly mention how our method relates to some state-of-the-art exemplar-based clustering techniques. Affinity propagation [5] is a recently proposed method for clustering which relies on minimizing exactly the same objective function (1). It is an iterative algorithm that repeatedly updates (through messages) the so-called responsibilities and availabilities. These can be considered as counterparts to our pseudo-distances h_pq. Affinity propagation also estimates the so-called self-availabilities for measuring the likelihood of an object being a cluster center. On the contrary, we use for the same purpose the margins, which approximate the stability of an object. Furthermore, compared to affinity propagation, our method offers the following significant advantages: its convergence is always guaranteed, it is parameter-free (there is no need to adjust parameters such as damping factors in order to ensure convergence), it is a descent method (the objective function (1) always decreases), and it can make use of the computed dual solutions for deriving online optimality bounds for free (these can be used for assessing that the derived solutions are almost optimal). At the same time, our method performs equally well or better in practice. Very recently, another exemplar-based algorithm has been proposed as well, which relies on solving a convex formulation of clustering [8]. We note, however, that this method is used for solving a different and much easier problem, which is that of soft clustering. Furthermore, it relies on a convex relaxation which is known to be much less tight than the LP relaxation PRIMAL we use here (essentially, [8] replaces all constraints x_pq ≤ x_qq, ∀p ∈ V with the much looser constraint Σ_{p∈V} x_pq ≤ |V| · x_qq). As a result, the generated solutions are expected to be of much lower quality. We also note that, unlike EM-like clustering algorithms such as K-means, our method is totally insensitive to initialization conditions and does not get stuck at bad local minima (thus yielding solutions of much better quality). Also, it is much more efficient than methods like [6], which require solving very large linear programs.

4 Experimental results

To illustrate the robustness of our algorithm to noise and its insensitivity to initialization, we start by showing clustering results on synthetic data. The synthetic datasets were generated using the following procedure: 2D points were sampled from a mixture of Gaussian distributions, where the centers of the Gaussians were arranged in an approximately grid-like fashion over the plane. In addition, random outliers were generated uniformly all over the grid, with their number being equal to half the number of the points drawn from the Gaussian distributions. One such dataset (consisting of 24 Gaussians) is displayed in Fig. 2, where colored crosses correspond to samples from the Gaussians and black dots correspond to outliers. The clustering result produced by our algorithm is shown in Fig. 2(a). As can be seen from that figure, despite the heavy percentage of noise, our method has been able to accurately detect all Gaussian centers and successfully cluster this 2D dataset. Note that the number of Gaussians was not given as input to our algorithm. Instead, it was inferred based on a common penalty term d_qq for all objects q, which was set roughly equal to the median distance between points. On the contrary, K-means was unable to produce a good result for this dataset despite being restarted multiple times (100 runs were used in this case). This is, of course, due to its well-known sensitivity to initialization conditions. We repeated the experiment multiple times while varying the number of Gaussians. Contrary to our algorithm, the behavior of K-means gets even worse as this number increases. We have also plotted in Fig. 2(c) the primal and dual costs that were generated by our algorithm when it was applied to the example of Fig. 2(a). These correspond to the solid red and dashed blue curves respectively. Note that the dual costs represent lower bounds on the optimum value of the objective function E(·), while the primal costs obviously represent upper bounds.
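In other words, a per-iteration optimality certificate comes for free: every dual value D(h) lower-bounds the optimum, while the current E(Q) upper-bounds it. A minimal sketch of this bookkeeping follows (the relative-gap convention is our own choice, not something prescribed by the method):

```python
def optimality_gap(primal_cost, dual_cost):
    """Relative gap between the current clustering cost E(Q) (an upper bound on
    the optimum of E) and the current dual objective D(h) (a lower bound).  A
    value of 0.01 certifies that the clustering is within 1% of optimal
    (assuming positive costs; for negative costs an absolute gap is more natural)."""
    return (primal_cost - dual_cost) / abs(dual_cost)

# e.g. log optimality_gap(E_of_Q, D_of_h) after every iteration of the algorithm.
```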
These bounds tell us how far the current primal solution Q is from the unknown optimum of E(·). They are, of course, refined continuously as the algorithm proceeds and can be useful for assessing its performance. For instance, in this particular example, we can be sure that the primal cost of our final solution is within 1% of the unknown optimum of the function E(·), i.e., an approximately optimal solution has been obtained.

Fig. 2: Clustering results for synthetic data: (a) our algorithm, (b) K-means, (c) primal and dual costs. The centers of the big circles represent the points chosen as cluster centers by the two algorithms. The primal and dual costs in (c) verify that the cost of our algorithm's solution is within 1% of the optimum cost.

Next we show some results from applying our algorithm to the challenging problem of multibody 3D segmentation, which has several applications in computer vision. As we shall see, a non-Euclidean distance for clustering has to be used in this case. According to the 3D segmentation problem, we are given a set of N pixel correspondences between two images. These correspondences result from K objects undergoing K 3D rigid-body motions relative to a moving camera. The 3D-motion segmentation problem is the task of clustering these N pixel pairs according to the K moving objects. We consider the more general and difficult scenario of a fully projective camera model. In this case, each pixel pair, say, p_i = (y_i, z_i), that belongs to a moving object k should satisfy an epipolar constraint:

  y_i^T F_k z_i = 0,   (19)

where F_k represents the fundamental matrix associated with the k-th 3D motion. Of course, the matrices F_k corresponding to different motions are unknown to us. Hence, to solve the 3D segmentation problem, we need to estimate both the matrices F_k as well as the association of each pixel pair p_i = (y_i, z_i) with the correct fundamental matrix F_k. To this end, we sample a large set of fundamental matrices by using a RANSAC-based scheme (recall that a random set of, e.g., 8 pixel pairs p_i is enough for generating a new fundamental matrix). The resulting matrices, say, {F_k}, will then correspond to cluster centers, whereas all the input pixel pairs {p_i} will correspond to objects that need to be assigned to an active cluster center. A clustering objective function of the form (1) thus results, and by minimizing it we also obtain a solution to the 3D segmentation problem. Of course, in this case, the distance function d(p_i, F_k) between an object p_i = (y_i, z_i) and a cluster center will not be Euclidean. Instead, based on (19), we can use a distance of the following form:

  d(p_i, F_k) = |y_i^T F_k z_i|.   (20)

Since it is more robust, a normalized version of the above distance is usually preferred in practice. Figure 3 displays 3D motion segmentation results that were obtained by applying our algorithm to two image pairs (points with different colors correspond to different motions). These examples were downloaded from a publicly available motion segmentation database [9] with ground truth. The ground-truth motion segmentation is also shown for each example and, as can be seen, it is almost identical to the segmentation estimated by our algorithm.
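To make the construction of the clustering input concrete, the sketch below evaluates the epipolar residual of (20) for one correspondence and one candidate fundamental matrix. The particular normalization used is only an assumption, since the text merely states that a normalized variant is preferred in practice.

```python
import numpy as np

def motion_distance(y, z, F, normalized=True):
    """Distance of Eq. (20) between a pixel correspondence p_i = (y_i, z_i) and a
    candidate fundamental matrix F_k: the (optionally normalized) algebraic
    epipolar residual |y_i^T F_k z_i|.  Points are homogeneous 3-vectors; the
    Sampson-style normalization below is one common choice, not the paper's."""
    r = float(y @ F @ z)
    if not normalized:
        return abs(r)
    Fz, Fty = F @ z, F.T @ y
    denom = np.sqrt(Fz[0] ** 2 + Fz[1] ** 2 + Fty[0] ** 2 + Fty[1] ** 2)
    return abs(r) / denom if denom > 0 else abs(r)

# Building the clustering input: rows = correspondences, columns = RANSAC-sampled
# fundamental matrices (the candidate centers); penalties are then handled as in
# Section 2.  Hypothetical usage:
# d = np.array([[motion_distance(y, z, F) for F in Fs] for (y, z) in pairs])
```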
We next compare our method to Affinity Propagation (AP). Some really impressive results on 4 very challenging datasets have been reported for that algorithm in [5], indicating that it outperforms any other center-based clustering method.

Fig. 3: Two 3D motion segmentation results. For each one we show (left) the ground-truth segmentation of the feature points and (right) the estimated segmentation along with the input optical flow vectors.

In particular, AP has been used for: clustering images of faces (using the squared error distance), detecting genes in microarray data (using a distance based on exons' transcription levels), identifying representative sentences in manuscripts (using the relative entropy as distance), and identifying cities that can be easily accessed by airline travel. In Fig. 4(a), we compare our method to AP on these publicly available problems. Since both methods rely on optimizing the same objective function, we list the values obtained by the two methods for the corresponding problems:

  Dataset     E(Q) (Ours)   E(Q) (AP)   #clusters (Ours)   #clusters (AP)
  Faces         13430         13454           60                 62
  Genes       -210595       -210539          1301               1290
  Cities        92154         92154            7                  7
  Sentences     10234         10241            4                  4

Exactly the same settings have been used for both algorithms, with AP using the parameters proposed in [5]. Note that in all cases our algorithm manages to obtain a solution of equal or lower value than AP. This is true even, e.g., in the Genes dataset, where a higher number of clusters is selected by our algorithm (and thus a higher penalty for activating them is paid). The running times of our algorithm on these datasets were: less than 1 sec for the `Travel' and `Sentences' datasets, 7 secs for the `Faces' dataset, and 5.7 mins for the `Genes' dataset (the `Genes' dataset consists of approximately 75000 data points). Furthermore, an additional advantage of our algorithm is that, unlike AP, it is always guaranteed to converge (e.g., see Figs. 4(b) and 4(c)).

Fig. 4: (a) Comparison of our algorithm with affinity propagation [5] on the 4 very challenging datasets `Faces', `Genes', `Cities' and `Sentences' from [5] (see the table above). Since the goal of both algorithms is to minimize the objective function E(Q), for each dataset we report the final value of this function and the number of estimated clusters. We have used exactly the same settings for both methods. (b) Our algorithm's clustering when applied to the `fourclouds' dataset from [1]. (c) The primal costs generated by AP for this dataset demonstrate that AP fails to converge in this case (to prevent that, a properly chosen damping factor has to be used).

5 Conclusions

In this paper we have introduced a very powerful and efficient center-based clustering algorithm, derived from LP duality theory. The resulting algorithm has guaranteed convergence and can handle data sets with arbitrary distance functions. Furthermore, despite its extreme generality, the proposed method is insensitive to initialization and computes clusterings of very low cost. As such, and considering the key role that clustering plays in many problems, we believe that our method can find use in a wide variety of tasks.
As another very important (both practical and theoretical) contribution of this work, we also consider the introduction of the notions of LP-based stabilities and margins, two quantities that, as we have proved, are dual to each other and can be used for deciding which objects should be chosen as cluster centers. We strongly believe that these ideas can be of both practical and theoretical interest, not just for designing center-based clustering algorithms, but in many other contexts as well.

References

[1] A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in NIPS, 2001.
[2] D. Verma and M. Meila, "A comparison of spectral clustering algorithms," Tech. Rep., 2001.
[3] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," J. Mach. Learn. Res., vol. 6, pp. 1705-1749, 2005.
[4] B. Fischer, V. Roth, and J. Buhmann, "Clustering with the connectivity kernel," in NIPS, 2004.
[5] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, 2007.
[6] M. Charikar, S. Guha, É. Tardos, and D. B. Shmoys, "A constant-factor approximation algorithm for the k-median problem," J. Comput. Syst. Sci., vol. 65, no. 1, pp. 129-149, 2002.
[7] N. Komodakis, N. Paragios, and G. Tziritas, "Clustering via LP-based stabilities," Technical Report, 2008.
[8] D. Lashkari and P. Golland, "Convex clustering with exemplar-based models," in NIPS, 2008.
[9] R. Tron and R. Vidal, "A benchmark for the comparison of 3-D motion segmentation algorithms," in CVPR, 2007.