Ensemble Clustering using Semidefinite Programming

Vikas Singh, Biostatistics and Medical Informatics, University of Wisconsin-Madison, vsingh@biostat.wisc.edu
Lopamudra Mukherjee, Computer Science and Engineering, State University of New York at Buffalo, lm37@cse.buffalo.edu
Jiming Peng, Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign, pengj@uiuc.edu
Jinhui Xu, Computer Science and Engineering, State University of New York at Buffalo, jinhui@cse.buffalo.edu

Abstract

We consider the ensemble clustering problem, where the task is to 'aggregate' multiple clustering solutions into a single consolidated clustering that maximizes the information shared among the given clustering solutions. We obtain several new results for this problem. First, we note that the notion of agreement under such circumstances can be better captured using an agreement measure based on a 2D string encoding rather than the voting-strategy-based measures proposed in the literature. Using this generalization, we derive a nonlinear optimization model to maximize the new agreement measure. We then show that our optimization problem can be transformed into a strict 0-1 Semidefinite Program (SDP) via novel convexification techniques, which can subsequently be relaxed to a polynomial-time solvable SDP. Our experiments indicate improvements not only in terms of the proposed agreement measure but also in the existing agreement measures based on voting strategies. We discuss evaluations on clustering and image segmentation databases.

1 Introduction

In the so-called Ensemble Clustering problem, the target is to 'combine' multiple clustering solutions or partitions of a set into a single consolidated clustering that maximizes the information shared (or 'agreement') among all available clustering solutions. The need for this form of clustering arises in many applications, especially real-world scenarios with a high degree of uncertainty, such as image segmentation with poor signal-to-noise ratio and computer-assisted disease diagnosis. It is quite common that a single clustering algorithm may not yield satisfactory results; moreover, multiple algorithms may individually make imperfect choices, assigning some elements to wrong clusters. Usually, by considering the results of several different clustering algorithms together, one may be able to mitigate degeneracies in individual solutions and consequently obtain better solutions. The idea has been employed successfully for microarray data classification analysis [1], computer-assisted diagnosis of diseases [2], and a number of other applications [3]. Formally, given a data set $D = \{d_1, d_2, \ldots, d_n\}$, a set of clustering solutions $\mathcal{C} = \{C_1, C_2, \ldots, C_m\}$ obtained from $m$ different clustering algorithms is called a cluster ensemble. Each solution, $C_i$, is a partition of the data into at most $k$ different clusters. The Ensemble Clustering problem requires one to use the individual solutions in $\mathcal{C}$ to partition $D$ into $k$ clusters such that the information shared ('agreement') among the solutions of $\mathcal{C}$ is maximized.

1.1 Previous works

The Ensemble Clustering problem was recently introduced by Strehl and Ghosh [3]; in [4], a related notion of correlation clustering was independently proposed by Bansal, Blum, and Chawla. The problem has attracted a fair amount of attention and a number of interesting techniques have been proposed [3, 2, 5, 6]; also see [7, 4].
Formulations primarily differ in how the objective of shared information maximization, or agreement, is chosen; we review some of the popular techniques next. The Instance Based Graph Formulation (IBGF) [2, 5] first constructs a fully connected graph $G = (V, W)$ for the ensemble $\mathcal{C} = (C_1, \ldots, C_m)$, where each node represents an element of $D = \{d_1, \ldots, d_n\}$. The edge weight $w_{ij}$ for the pair $(d_i, d_j)$ is defined as the number of algorithms in $\mathcal{C}$ that assign the nodes $d_i$ and $d_j$ to the same cluster (i.e., $w_{ij}$ measures the togetherness frequency of $d_i$ and $d_j$). Then, standard graph partitioning techniques are used to obtain a final clustering solution. In the Cluster Based Graph Formulation (CBGF), a given cluster ensemble is represented as $\mathcal{C} = \{C_{11}, \ldots, C_{mk}\} = \{\bar{C}_1, \ldots, \bar{C}_{m\cdot k}\}$, where $C_{ij}$ denotes the $i$th cluster of the $j$th algorithm in $\mathcal{C}$. Like IBGF, this approach also constructs a graph, $G = (V, W)$, to model the correspondence (or 'similarity') relationship among the $mk$ clusters, where the similarity matrix $W$ reflects the Jaccard similarity measure between clusters $\bar{C}_i$ and $\bar{C}_j$. The graph is then partitioned so that the clusters of the same group are similar to one another. Variants of the problem have also received considerable attention in the theoretical computer science and machine learning communities. A recent paper by Ailon, Charikar, and Newman [7] demonstrated connections to other well known problems such as Rank Aggregation; their algorithm is simple and obtains an expected constant-factor approximation guarantee (via linear programming duality). In addition to [7], other results include [4, 8]. A commonality of existing algorithms for Ensemble Clustering [3, 2, 9] is that they employ a graph construction as a first step. Element pairs (cluster pairs or item pairs) are then evaluated and their edges are assigned a weight that reflects their similarity. A natural question relates to whether we can find a better representation of the available information. This will be the focus of the next section.

2 Key Observations: Two is a company, is three a crowd?

Consider an example where one is 'aggregating' recommendations made by a group of family and friends for dinner table seating assignments at a wedding. The hosts would like each 'table' to be able to find a common topic of dinner conversation. Now, consider three persons, Tom, Dick, and Harry, invited to this reception. Tom and Dick share a common interest in Shakespeare, Dick and Harry are both surfboard enthusiasts, and Harry and Tom attended college together. Because they had strong pairwise similarities, they were seated together but had a rather dull evening. A simple analysis shows that the three guests had strong common interests when considered two at a time, but there was weak communion as a group. The connection of this example to the ensemble clustering problem is clear. Existing algorithms represent the similarity between elements in $D$ as a scalar value assigned to the edge joining their corresponding nodes in the graph. This weight is essentially a 'vote' reflecting the number of algorithms that assigned those two elements to the same cluster. The mechanism seems perfect until we ask whether strong pairwise coupling necessarily implies coupling for a larger group as well. The weight metric considering two elements does not retain information about which algorithms assigned them together.
When expanding the group to include more elements, one is not sure whether a common feature exists under which the larger group is similar. It seems natural to assign a higher priority to triples or larger groups of people that were recommended to be seated together (they must be similar under at least one feature) compared to groups that were never assigned to the same table by any person in the recommendation group (clustering algorithm), notwithstanding pairwise evaluations; for an illustrative example, see [10]. While this problem may seem to be a disadvantage of only the IBGF approach, it also affects the CBGF approach. This can be seen by viewing clusters as items and the Jaccard similarity measure as the vote (weight) on the edges.

3 Main Ideas

To model the intuition above, we generalize the similarity metric so as to maximize similarity or 'agreement' by an appropriate encoding of the solutions obtained from the individual clustering algorithms. More precisely, in our generalization the similarity is no longer just a scalar value but a 2D string. The ensemble clustering problem thus reduces to a form of string clustering problem where our objective is to assign similar strings to the same cluster. The encoding into a string is done as follows. The data item set is given as $D$ with $|D| = n$. Let $m$ be the number of clustering algorithms, with each solution having no more than $k$ clusters. We represent all input information (the ensemble) as a single 3D matrix, $A \in \{0,1\}^{n\times m\times k}$. For every data element $d_l \in D$, $A_l \in \{0,1\}^{m\times k}$ is a matrix whose elements are defined by

$$A_l(i,j) = \begin{cases} 1 & \text{if } d_l \text{ is assigned to cluster } i \text{ by } C_j;\\ 0 & \text{otherwise.}\end{cases} \tag{1}$$

It is easy to see that the summation of every row of $A_l$ equals 1. We call each $A_l$ an A-string. Our goal is to cluster the elements $D = \{d_1, d_2, \ldots, d_n\}$ based on the similarity of their A-strings. We now consider how to compute the clusters based on the similarity (or dissimilarity) of strings. We note that the paper [11] by Gasieniec et al. discussed the so-called Hamming radius p-clustering and Hamming diameter p-clustering problems on strings. Though their results shed considerable light on the hardness of string clustering with the selected distance measures, those techniques cannot be directly applied to the problem at hand because the objective here is fairly different from the one in [11]. Fortunately, our analysis reveals that a simpler objective is sufficient to capture the essence of similarity maximization in clusters, using certain special properties of the A-strings. Our approach is partly inspired by classical $k$-means clustering, where every data point is assigned to the cluster whose center is nearest. Imagine an ideal input instance for the ensemble clustering problem (all clustering algorithms behave similarly): one with only $k$ unique members among the $n$ A-strings. The partitioning simply assigns identical strings to the same partition. The representative for each cluster is then exactly like its members, is a valid A-string, and can be viewed as a center in a geometric sense. General input instances will obviously be non-ideal and are likely to contain far more than $k$ unique members. Naturally, the centers of the clusters will vary from their members. This variation can be thought of as noise or disagreement within the clusters; our objective is to find a set of clusters (and centers) such that the noise is minimized and we move very close to the ideal case.
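Before moving on, the encoding in (1) can be made concrete with a short sketch. This is an illustrative reconstruction under our own conventions (NumPy, an (n, m, k) array layout, and label vectors as input), not the authors' implementation.

```python
# A minimal illustrative sketch (not the authors' code) of the A-string encoding in Eq. (1).
# Assumption: each ensemble member is given as a length-n label vector with values in {0,...,k-1}.
import numpy as np

def build_A(labels, k):
    """labels: (m, n) integer array; labels[j, l] is the cluster of item d_l under algorithm C_j.
    Returns A of shape (n, m, k): A[l, j, i] = 1 iff C_j assigns d_l to cluster i, so that
    A[l] is the m x k A-string of d_l (the paper's A_l, with A_l(i, j) stored at A[l, j, i])."""
    labels = np.asarray(labels)
    m, n = labels.shape
    A = np.zeros((n, m, k), dtype=int)
    for j in range(m):
        A[np.arange(n), j, labels[j]] = 1
    return A

# toy ensemble: 3 algorithms, 5 items, k = 2
labels = np.array([[0, 0, 1, 1, 1],
                   [0, 0, 0, 1, 1],
                   [1, 1, 0, 0, 0]])   # same partition as the first row, with cluster ids renamed
A = build_A(labels, k=2)
assert (A.sum(axis=2) == 1).all()      # every row of every A_l sums to 1
```

Two items receive identical A-strings exactly when every algorithm places them in the same cluster; partial agreement is preserved algorithm-by-algorithm rather than being collapsed into a single pairwise vote.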
To model this, we consider the centers to be in the same high-dimensional space as the A-strings in $D$ (though a center need not belong to $D$). Consider an example where a cluster $i$ in this optimal solution contains items $(d_1, d_2, \ldots, d_7)$. A certain algorithm $C_j$ in the input ensemble places items $(d_1, d_2, d_3, d_4)$ in cluster $s$ and $(d_5, d_6, d_7)$ in cluster $p$. How would $C_j$ behave if evaluating the center of cluster $i$ as a data item? The probability that it assigns the center to cluster $s$ is 4/7, and the probability that it assigns the center to cluster $p$ is 3/7. If we emulate this logic, we must pick the choice with the higher probability and assign the center to that cluster. It can be verified that this choice minimizes the dissent of all items in cluster $i$ to the center. The A-string for the center of cluster $i$ will therefore have a "1" at position $(j, s)$. The assignment of A-strings (items) to clusters is unknown; however, if it were somehow known, we could find the centers of all clusters $i \in [1, k]$ by computing the average value at every cell of the $A$ matrices corresponding to the members of the cluster, rounding the largest value in every row to 1 (and the rest to 0), and assigning this as the cluster center. Hence, the dissent within a cluster can be quantified simply by averaging the matrices of the elements that belong to the cluster and computing the difference to the center. Our goal is to find such an assignment and group the A-strings so that the sum of the absolute differences between the averages of clusters and their centers (the dissent) is minimized. In the subsequent sections, we introduce our optimization framework for ensemble clustering based on these ideas.

4 Integer Program for Model 1

We start with a discussion of an Integer Program (IP, for short) formulation for ensemble clustering. For convenience, we denote the final clustering solution by $\mathcal{C} = \{C_1, \ldots, C_k\}$, and $C_{ij}$ denotes cluster $i$ of algorithm $j$. The variables that constitute the IP are as follows.

$$X_{li'} = \begin{cases} 1 & \text{if } d_l \in C_{i'};\\ 0 & \text{otherwise,}\end{cases} \tag{2}$$

$$s_{iji'} = \begin{cases} 1 & \text{if } C_{ij} = \arg\max_{\hat{i}=1,\ldots,k} \{|C_{i'} \cap C_{\hat{i}j}|\};\\ 0 & \text{otherwise.}\end{cases} \tag{3}$$

We mention that the above definition implies that, for a fixed index $i'$, the center variable $s_{iji'}$ also provides an indicator of the cluster most similar to $C_{i'}$ among the clusters produced by the clustering algorithm $C_j$. We are now ready to introduce the following IP.

$$\min \sum_{i'=1}^{k}\sum_{j=1}^{m}\sum_{i=1}^{k} \left| s_{iji'} - \frac{\sum_{l=1}^{n} A_{lij} X_{li'}}{\sum_{l=1}^{n} X_{li'}} \right| \tag{4}$$

$$\text{s.t.}\quad \sum_{i'=1}^{k} X_{li'} = 1 \;\;\forall\, l \in [1,n], \qquad \sum_{l=1}^{n} X_{li'} \ge 1 \;\;\forall\, i' \in [1,k], \tag{5}$$

$$\sum_{i=1}^{k} s_{iji'} = 1 \;\;\forall\, j \in [1,m],\; i' \in [1,k], \qquad X_{li'} \in \{0,1\}, \quad s_{iji'} \in \{0,1\}. \tag{6}$$

(4) minimizes the sum of the absolute differences between $s_{iji'}$ (the center entries for cluster $C_{i'}$) and the average of the $A_{lij}$ bits of the data elements $d_l$ assigned to cluster $C_{i'}$. Recall that $s_{iji'}$ will be 1 if $C_{ij}$ is the most similar cluster to $C_{i'}$ among all the clusters produced by algorithm $C_j$. Hence, when $s_{iji'} = 0$, the value $\frac{\sum_{l=1}^{n} A_{lij} X_{li'}}{\sum_{l=1}^{n} X_{li'}}$ represents the percentage of data elements in $C_{i'}$ that do not consent with the majority of the other elements in the group w.r.t. the clustering solution provided by $C_j$. In other words, we are trying to minimize the dissent and maximize the consent simultaneously. The remaining constraints are relatively simple: (5) enforces the condition that a data element should belong to precisely one cluster in the final solution and that every cluster must have size at least 1; (6) ensures that $s_{iji'}$ forms an appropriate A-string for every cluster center.
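To make the objective in (4) concrete, the sketch below (again illustrative only, reusing the array layout of the earlier snippet) evaluates the dissent of a candidate assignment $X$: for each final cluster it averages the member A-strings to obtain the $c_{iji'}$ values, rounds the per-algorithm maximum to obtain the center string, and sums the absolute differences.

```python
import numpy as np

def model1_dissent(A, X):
    """A: (n, m, k) ensemble tensor as in build_A above; X: (n, k') 0/1 assignment matrix.
    Returns the value of objective (4) for this assignment (brute-force evaluation only)."""
    n, m, k = A.shape
    total = 0.0
    for ip in range(X.shape[1]):               # final cluster index i'
        members = X[:, ip].astype(bool)
        if not members.any():
            continue
        c = A[members].mean(axis=0)            # c[j, i] corresponds to c_{i j i'}
        s = np.zeros_like(c)
        s[np.arange(m), c.argmax(axis=1)] = 1  # center A-string: majority cluster per algorithm
        total += np.abs(s - c).sum()           # sum_{i,j} |s_{i j i'} - c_{i j i'}|
    return total

# example: score the assignment that simply copies the first algorithm's labels
labels = np.array([[0, 0, 1, 1, 1], [0, 0, 0, 1, 1], [1, 1, 0, 0, 0]])
A = np.eye(2, dtype=int)[labels].transpose(1, 0, 2)   # (n, m, k) tensor for the toy ensemble
X = np.eye(2, dtype=int)[labels[0]]                   # candidate assignment
print(model1_dissent(A, X))
```

The IP in (4)-(6) searches over all such assignments (and the corresponding binary centers), which is what makes the problem hard; the next section rewrites it in a form amenable to semidefinite programming.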
5 0-1 Semidefinite Program for Model 1

The formulation given by (4)-(6) is a mixed integer program (MIP, for short) with a nonlinear objective function in (4). Solving this model optimally is extremely challenging: (a) the constraints in (5)-(6) are discrete; (b) the objective is nonlinear and nonconvex. One possible way of attacking the problem is to 'relax' it to some polynomially solvable problem such as an SDP (the problem of minimizing a linear function over the intersection of a polyhedron and the cone of symmetric positive semidefinite matrices; see [12] for an introduction). Our effort will be to convert the nonlinear form in (4) into a 0-1 SDP form. By introducing artificial variables $t_{iji'}$, we rewrite (4) as

$$\min \sum_{i'=1}^{k}\sum_{j=1}^{m}\sum_{i=1}^{k} t_{iji'} \tag{7}$$

$$\text{s.t.}\quad s_{iji'} - c_{iji'} \le t_{iji'}, \qquad c_{iji'} - s_{iji'} \le t_{iji'} \qquad \forall\, i, i', j, \tag{8}$$

where the term $c_{iji'}$ represents the second term in (4), defined by

$$c_{iji'} = \frac{\sum_{l=1}^{n} A_{lij} X_{li'}}{\sum_{l=1}^{n} X_{li'}} \qquad \forall\, i, i', j. \tag{9}$$

Since both $A_{lij}$ and $X_{li'}$ are binary, (9) can be rewritten as

$$c_{iji'} = \frac{\sum_{l=1}^{n} A_{lij} X_{li'}^2}{\sum_{l=1}^{n} X_{li'}^2} \qquad \forall\, i, i', j. \tag{10}$$

Let us introduce a vector variable $y_{i'} \in \mathbb{R}^n$ whose $l$th entry is defined by

$$y_{i'}^{(l)} = \frac{X_{li'}}{\sqrt{\sum_{l=1}^{n} X_{li'}^2}}. \tag{11}$$

Let $A_{ij} \in \mathbb{R}^n$ be a vector whose $l$th element has value $A_l(i,j)$. This allows us to represent (10) as

$$c_{iji'} = \mathrm{tr}(B_{ij} Z_{i'}), \qquad Z_{i'}^2 = Z_{i'}, \qquad Z_{i'} \succeq 0, \tag{12}$$

where $B_{ij} = \mathrm{diag}(A_{ij})$ is a diagonal matrix with $(B_{ij})_{ll} = A_l(i,j)$; the second and third properties follow from $Z_{i'} = y_{i'} y_{i'}^T$ being a positive semidefinite projection matrix. Now, we rewrite the constraints on $X$ in terms of $Z$. (5) is automatically satisfied by the following constraints on the elements of $Z_{i'}$:

$$\sum_{l=1}^{n} Z_{i'}^{(ll)} = 1 \;\;\forall\, i' \in [1,k], \qquad \sum_{l'=1}^{n} Z_{i'}^{(ll')} \le 1 \;\;\forall\, i' \in [1,k],\; l \in [1,n], \tag{13}$$

where $Z_{i'}^{(uv)}$ refers to the $(u,v)$ entry of the matrix $Z_{i'}$. Since $Z_{i'}$ is a symmetric projection matrix by construction, (7)-(13) constitute a precisely defined 0-1 SDP that can be expressed in trace form as

$$\min \sum_{i'=1}^{k} \mathrm{tr}(\mathrm{diag}(T_{i'} e_k)) \tag{14}$$

$$\text{s.t.}\quad (S_{i'} - Q_{i'}) - T_{i'} \le 0, \qquad -(S_{i'} - Q_{i'}) - T_{i'} \le 0 \qquad \forall\, i' \in [1,k], \tag{15}$$

$$\mathrm{tr}\Big(\sum_{i'=1}^{k} Z_{i'}\Big) = k, \qquad \Big(\sum_{i'=1}^{k} Z_{i'}\Big) e_n = e_n, \tag{16}$$

$$\mathrm{tr}(Z_{i'}) = 1, \quad Z_{i'}^2 = Z_{i'}, \quad Z_{i'} = Z_{i'}^T, \quad Z_{i'} \succeq 0, \quad S_{i'} e_k = e_m, \quad S_{i'} \text{ binary} \qquad \forall\, i' \in [1,k], \tag{17}$$

where $Q_{i'}(i,j) = c_{iji'} = \mathrm{tr}(B_{ij} Z_{i'})$ and $e_n \in \mathbb{R}^n$ is the vector of all 1s. The experimental results for this model indicate that it performs very well in practice (see [10]). However, because we must solve the model while maintaining the requirement that $S_{i'}$ be binary (otherwise, the problem becomes ill-posed), a branch-and-bound type method is needed. Such approaches are widely used in many application areas, but their worst-case complexity is exponential in the input size. In the subsequent sections, we make several changes to this framework, based on additional observations, in order to obtain a polynomial-time algorithm for the problem.

6 Integer Program and 0-1 Semidefinite Program for Model 2

Recall the definition of the variables $c_{iji'}$, which can be interpreted as the size of the overlap between the cluster $C_{i'}$ in the final solution and $C_{ij}$, as a fraction of the cardinality of $C_{i'}$. Let us define $c_{i^* j i'} = \max_{i=1,\ldots,k} c_{iji'}$. Let us also define vector variables $q_{ji'}$ whose $i$th element is $s_{iji'} - c_{iji'}$. In IP Model 1, we try to minimize the sum of the $L_1$-norms of all the $q_{ji'}$. The main difficulty in the previous formulation stems from the fact that $c_{iji'}$ is a fractional function w.r.t. the assignment matrix $X$.
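As an aside, the identity $c_{iji'} = \mathrm{tr}(B_{ij} Z_{i'})$ in (12) is what allows this fractional quantity to enter a linear (trace) objective, and it is easy to sanity-check numerically. The snippet below is an illustrative check on random binary data, not part of the authors' pipeline.

```python
import numpy as np
rng = np.random.default_rng(0)

n = 8
A_ij = rng.integers(0, 2, size=n)                 # l-th entry plays the role of A_l(i, j)
X_ip = np.zeros(n)
X_ip[rng.choice(n, size=3, replace=False)] = 1    # column i' of X: a cluster with 3 members

y = X_ip / np.sqrt(X_ip.sum())                    # Eq. (11); note X^2 = X for binary X
Z = np.outer(y, y)                                # Z_{i'} = y y^T, a rank-1 projection matrix
B = np.diag(A_ij)                                 # B_{ij} = diag(A_{ij})

c_frac = A_ij @ X_ip / X_ip.sum()                 # Eq. (9)
assert np.isclose(np.trace(B @ Z), c_frac)        # Eq. (12): c_{iji'} = tr(B_{ij} Z_{i'})
```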
Fortunately, we note that since the entries $c_{iji'}$ are fractional and satisfy $\sum_{i=1}^{k} c_{iji'} = 1$ for any fixed $j, i'$, their sum of squares is maximized when the largest entry is as large as possible. Thus, minimizing the function $1 - \sum_{i=1}^{k} (c_{iji'})^2$ is a reasonable substitute for minimizing the sum of the $L_1$-norms in IP Model 1. The primary advantage of this observation is that we do not need to know the 'index' ($i^*$) of the maximal element $c_{i^* j i'}$. As before, $X$ denotes the assignment matrix. We no longer need the variable $s$, as it can easily be determined from the solution. This yields the following IP.

$$\min \sum_{i'=1}^{k}\sum_{j=1}^{m} \Big(\sum_{l=1}^{n} X_{li'}\Big)\Big(1 - \sum_{i=1}^{k} (c_{iji'})^2\Big) \tag{18}$$

$$\text{s.t.}\quad \sum_{i'=1}^{k} X_{li'} = 1 \;\;\forall\, l \in [1,n], \qquad \sum_{l=1}^{n} X_{li'} \ge 1 \;\;\forall\, i' \in [1,k], \qquad X_{li'} \in \{0,1\}. \tag{19}$$

We next discuss how to transform the above problem into a 0-1 SDP. For this, we first note that the objective function (18) can be expressed as

$$\min \sum_{i'=1}^{k}\sum_{j=1}^{m} \Big(\sum_{l=1}^{n} X_{li'} - \sum_{i=1}^{k} \frac{(\sum_{l=1}^{n} A_{lij} X_{li'})^2}{\sum_{l=1}^{n} X_{li'}}\Big), \tag{20}$$

which can be equivalently stated as

$$\min\; nm - \sum_{i'=1}^{k}\sum_{j=1}^{m}\sum_{i=1}^{k} \frac{(\sum_{l=1}^{n} A_{lij} X_{li'})^2}{\sum_{l=1}^{n} X_{li'}}. \tag{21}$$

The numerator of the second term above can be rewritten as

$$\Big(\sum_{l=1}^{n} A_{lij} X_{li'}\Big)^2 = (A_{1ij} X_{1i'} + \ldots + A_{nij} X_{ni'})^2 = (A_{ij}^T X_{i'})^2 = X_{i'}^T A_{ij} A_{ij}^T X_{i'}, \tag{22}$$

where $X_{i'}$ is the $i'$th column vector of $X$. Therefore, the second term of (21) can be written as

$$\sum_{i'=1}^{k}\sum_{j=1}^{m}\sum_{i=1}^{k} \mathrm{tr}\big(X_{i'}^T A_{ij} A_{ij}^T X_{i'} (X_{i'}^T X_{i'})^{-1}\big) = \sum_{i'=1}^{k}\sum_{j=1}^{m}\sum_{i=1}^{k} \mathrm{tr}(A_{ij} A_{ij}^T Z_{i'}) = \sum_{j=1}^{m}\sum_{i=1}^{k} \mathrm{tr}(A_{ij} A_{ij}^T Z) = \sum_{j=1}^{m} \mathrm{tr}(B_j Z) = \mathrm{tr}(BZ). \tag{23}$$

In (23), $Z_{i'} = X_{i'}(X_{i'}^T X_{i'})^{-1} X_{i'}^T$ (the same as in IP Model 1), $Z = \sum_{i'=1}^{k} Z_{i'}$, $B_j = \sum_{i=1}^{k} A_{ij} A_{ij}^T$, and $B = \sum_{j=1}^{m} B_j$. Since each matrix $Z_{i'}$ is a symmetric projection matrix and $X_{i_1}$ and $X_{i_2}$ are orthogonal to each other when $i_1 \neq i_2$, $Z$ is a projection matrix of the form $X(X^T X)^{-1} X^T$. The last fact, also used in [13], is originally attributed to an anonymous referee in [14]. Finally, we derive the 0-1 SDP formulation for the problem (18)-(19) as follows:

$$\min\; nm - \mathrm{tr}(BZ) \tag{24}$$

$$\text{s.t.}\quad Z e_n = e_n, \qquad \mathrm{tr}(Z) = k, \tag{25}$$

$$Z \ge 0; \qquad Z^2 = Z; \qquad Z = Z^T. \tag{26}$$

Relaxing and solving the 0-1 SDP: The relaxation of (24)-(26) exploits the fact that $Z$ is a projection matrix satisfying $Z^2 = Z$. This allows replacing the last three constraints in (26) with $I \succeq Z \succeq 0$. After establishing the result that any feasible solution $Z^{\mathrm{feas}}$ of the second 0-1 SDP formulation is a rank-$k$ matrix, we first solve the relaxed SDP using SeDuMi [15], take the rank-$k$ projection of $Z$, and then apply a rounding step based on a variant of the winner-takes-all approach to obtain a solution in polynomial time. For the technical details and their proofs, please refer to [10].

7 Experimental Results

Our experiments included evaluations on several classification datasets, segmentation databases and simulations. Due to space limitations, we provide a brief summary here. Our first set of experiments illustrates an application to several datasets from the UCI Machine Learning Repository: (1) the Iris dataset, (2) the Soybean dataset and (3) the Wine dataset; these include ground-truth data, see http://www.ics.uci.edu/~mlearn/MLRepository.html. To create the ensemble, we used a set of 4 to 10 clustering schemes (obtained by varying the clustering criterion and/or algorithm) from the CLUTO clustering toolkit. The multiple solutions comprised the input ensemble; our model was then used to determine an agreement-maximizing solution. The ground-truth data was used at this stage to evaluate the accuracy of the ensemble (and of the individual schemes). The results are shown in Figure 1(a)-(c).
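For readers who want to prototype this pipeline, the relaxed Model 2 SDP (24)-(26), with $I \succeq Z \succeq 0$ in place of the projection constraints, can be written down in a few lines. The sketch below is illustrative only: the paper's experiments use SeDuMi [15] (via Yalmip [16]) and the winner-takes-all style rounding detailed in [10], whereas here CVXPY and a k-means step on the rank-k spectral embedding are stand-ins chosen for brevity.

```python
import numpy as np
import cvxpy as cp
from sklearn.cluster import KMeans

def ensemble_cluster_sdp(A, k):
    """A: (n, m, k_in) ensemble tensor as in the earlier snippets.
    Returns a length-n vector of consensus cluster labels (illustrative prototype)."""
    n, m, k_in = A.shape
    # B = sum_{i,j} A_ij A_ij^T, where A_ij is the n-vector whose l-th entry is A_l(i, j)
    B = sum(np.outer(A[:, j, i], A[:, j, i]) for j in range(m) for i in range(k_in))

    Z = cp.Variable((n, n), symmetric=True)
    ones = np.ones(n)
    cons = [Z @ ones == ones,        # (25): Z e_n = e_n
            cp.trace(Z) == k,        # (25): tr(Z) = k
            Z >> 0,                  # relaxation of (26):
            np.eye(n) - Z >> 0]      #   I >= Z >= 0 in the semidefinite order
    cp.Problem(cp.Minimize(n * m - cp.trace(B @ Z)), cons).solve()

    # rank-k projection of Z, then a simple rounding step (a stand-in for the
    # winner-takes-all variant used in the paper)
    w, V = np.linalg.eigh(Z.value)
    emb = V[:, -k:] * np.sqrt(np.clip(w[-k:], 0.0, None))
    return KMeans(n_clusters=k, n_init=10).fit_predict(emb)
```

The k-means rounding of the spectral embedding is a common heuristic for projection-matrix relaxations, but it is not the rounding analyzed in [10], and no properties beyond those of a heuristic are claimed for this sketch.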
For each of the datasets in Figure 1, we can see that the ensemble clustering solution is at least as good as the best individual clustering algorithm. Observe, however, that while such results are expected for this and many other datasets (see [3]), the consensus solution may not always be superior to the 'best' clustering solution. For instance, in Fig. 1(c) (for m = 7) the best solution has a marginally lower error rate than the ensemble. An ensemble solution is useful because we do not know a priori which algorithm will perform best (especially if ground truth is unavailable).

[Figure 1 appears here: three plots, (a)-(c), each showing the fraction of mislabeled cases (y-axis) against the number of clustering algorithms in the ensemble, m = 3 to 11 (x-axis), for the cluster ensemble and the individual algorithms.]

Figure 1: Synergy. The fraction of mislabeled cases (in [0, 1]) in a consensus solution is compared to the fraction of mislabeled cases in the individual clustering algorithms. We illustrate the ensemble effect for the Iris dataset in (a), the Soybean dataset in (b), and the Wine dataset in (c).

Our second set of experiments focuses on a novel application of ensembles to the problem of image segmentation. Even sophisticated segmentation algorithms may yield 'different' results on the same image; when multiple segmentations are available, it seems reasonable to 'combine' them to reduce degeneracies. Our experimental results indicate that in many cases we can obtain a better overall segmentation that captures (more) details in the images more accurately, with fewer outlying clusters. In Fig. 2, we illustrate the results on an image from the Berkeley dataset. The segmentations were generated using several powerful algorithms including (a) Normalized Cuts, (b) Energy Minimization by Graph Cuts, and (c)-(d) Curve Evolution. Notice that each algorithm performs well but misses out on some details. For instance, (a) and (d) do not segment the eyes; (b) does well in segmenting the shirt collar region but can only recognize one of the eyes; and (c) creates an additional cut across the forehead. The ensemble (extreme right) is able to segment these details (eyes, shirt collar and cap) nicely by combining (a)-(d). For implementation details of the algorithm, including settings, preprocessing and additional evaluations, please refer to [10].

[Figure 2 appears here: five panels, (a)-(d) and the ensemble result.]

Figure 2: A segmentation ensemble on an image from the Berkeley Segmentation dataset. (a)-(d) show the individual segmentations overlaid on the input image; the right-most image shows the segmentation generated from ensemble clustering.

The final set of our experiments was performed on 500 runs of artificially generated cluster ensembles. We first constructed an initial segmentation of the set; this was then repeatedly permuted (by up to 15%), yielding a set of clustering solutions. The solutions from our model and from [3] were compared w.r.t. our objective functions and the Normalized Mutual Information measure used in [3]. In Figure 3(a), we see that our algorithm (Model 1) outperforms [3] on all instances. In the average case, the ratio is slightly more than 1.5. We must note the time-quality trade-off, because solving Model 1 requires a branch-and-bound approach.
In Fig. 3(b), we compare the results of [3] with solutions from the relaxed SDP Model 2 given by (24). We can see that our model performs better in 95% of the cases. Finally, Figure 3(c) shows a comparison of the relaxed SDP Model 2 with [3] on the objective function optimized in [3] (using the better of the two techniques). We observed that our solutions achieve superior results in 80% of the cases. The results show that, even empirically, our objective functions model similarity rather well, and that Normalized Mutual Information may be implicitly optimized within this framework.

Remarks. We note that the graph partitioning methods used in [3] are typically much faster than the time needed by SDP solvers (e.g., SeDuMi [15] and SDPT3) for comparable problem sizes. However, given the increasing interest in SDP in the last few years, we may expect the development of new algorithms and faster, more efficient software tools.

8 Conclusions

We have proposed a new algorithm for ensemble clustering based on an SDP formulation. Among the important contributions of this paper is, we believe, the observation that the notion of agreement in an ensemble can be captured better using a string encoding than a voting strategy. While a partition problem defined directly on such strings yields a nonlinear optimization problem, we illustrate a transformation into a strict 0-1 SDP via novel convexification techniques. The last result of this paper is the design of a modified model of the SDP based on additional observations on the structure of the underlying matrices. We discuss extensive experimental evaluations on simulations and real datasets; in addition, we illustrate an application of the algorithm to segmentation ensembles. We feel that the latter application is of independent research interest; to the best of our knowledge, this is the first algorithmic treatment of generating segmentation ensembles for improving accuracy.

[Figure 3 appears here: three histograms, (a)-(c). In (a) and (b), the y-axis is the number of instances and the x-axis is the ratio of the objective values of the two algorithms, with "worse"/"better" regions marked; in (c), the x-axis is the normalized difference in the NMI objective.]

Figure 3: A comparison of [3] with SDP Model 1 in (a), and with SDP Model 2 on (24) in (b); the solution from [3] was used as the numerator. In (c), comparisons (difference in normalized values) between our solution and the best among IBGF and CBGF, based on the Normalized Mutual Information (NMI) objective function used in [3].

Acknowledgments. This work was supported in part by NSF grants CCF-0546509 and IIS-0713489. The first author was also supported by start-up funds from the Dept. of Biostatistics and Medical Informatics, UW-Madison. We thank D. Sivakumar for useful discussions, Johan Löfberg for a thorough explanation of the salient features of Yalmip [16], and the reviewers for suggestions regarding the presentation of the paper. One of the reviewers also pointed out a typo in the derivations in §6.

References

[1] V. Filkov and S. Skiena. Integrating microarray data by consensus clustering. In Proc. of International Conference on Tools with Artificial Intelligence, page 418, 2003.
[2] X. Z. Fern and C. E. Brodley. Solving cluster ensemble problems by bipartite graph partitioning. In Proc. of International Conference on Machine Learning, page 36, 2004.
[3] A. Strehl and J. Ghosh. Cluster Ensembles – A Knowledge Reuse Framework for Combining Partitionings. In Proc. of AAAI 2002, pages 93–98, 2002.
[4] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. In Proc. of Symposium on Foundations of Computer Science, page 238, 2002.
[5] S. Monti, P. Tamayo, J. Mesirov, and T. Golub. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn., 52(1-2):91–118, 2003.
[6] A. Gionis, H. Mannila, and P. Tsaparas. Clustering aggregation. In Proc. of International Conference on Data Engineering, pages 341–352, 2005.
[7] N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: ranking and clustering. In Proc. of Symposium on Theory of Computing, pages 684–693, 2005.
[8] M. Charikar, V. Guruswami, and A. Wirth. Clustering with qualitative information. J. Comput. Syst. Sci., 71(3):360–383, 2005.
[9] X. Z. Fern and C. E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proc. of International Conference on Machine Learning, 2003.
[10] V. Singh. On Several Geometric Optimization Problems in Biomedical Computation. PhD thesis, State University of New York at Buffalo, 2007.
[11] L. Gasieniec, J. Jansson, and A. Lingas. Approximation algorithms for Hamming clustering problems. In Proc. of Symposium on Combinatorial Pattern Matching, pages 108–118, 2000.
[12] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, 2004.
[13] J. Peng and Y. Wei. Approximating k-means-type clustering via semidefinite programming. SIAM Journal on Optimization, 18(1):186–205, 2007.
[14] A. D. Gordon and J. T. Henderson. An algorithm for Euclidean sum of squares classification. Biometrics, 33:355–362, 1977.
[15] J. F. Sturm. Using SeDuMi 1.02, a Matlab toolbox for optimization over symmetric cones. Optimization Methods and Software, 11-12:625–653, 1999.
[16] J. Löfberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In CCA/ISIC/CACSD, September 2004.