Finding Metric Structure in Information Theoretic Clustering

Kamalika Chaudhuri, University of California, San Diego, kamalika@soe.ucsd.edu
Andrew McGregor, University of California, San Diego, andrewm@ucsd.edu

Abstract

We study the problem of clustering discrete probability distributions with respect to the Kullback-Leibler (KL) divergence. This problem arises naturally in many applications. Our goal is to pick $k$ distributions as "representatives" such that the average or maximum KL-divergence between an input distribution and the closest representative distribution is minimized. Unfortunately, no polynomial-time algorithms with worst-case performance guarantees are known for either of these problems. The analogous problems for $\ell_1$, $\ell_2$ and $\ell_2^2$ (i.e., $k$-center, $k$-median and $k$-means) have been extensively studied and efficient algorithms with good approximation guarantees are known. However, these algorithms rely crucially on the (geo-)metric properties of these metrics and do not apply to KL-divergence. In this paper, our contribution is to find a "relaxed" metric structure for KL-divergence. In doing so, we provide the first polynomial-time algorithm for clustering using KL-divergences with provable guarantees for general inputs.

1 Introduction

In this paper, we consider the problem of clustering discrete probability distributions with respect to the Kullback-Leibler (KL) divergence, where the KL-divergence from $p = (p_1, \ldots, p_d)$ to a distribution $q = (q_1, \ldots, q_d)$ is defined as
$$\mathrm{KL}(p, q) = \sum_{i \in [d]} p_i \ln \frac{p_i}{q_i} .$$
Specifically, we consider two problems that take $n$ distributions $p^1, \ldots, p^n$ on $[d]$ as input. In $MTC_{KL}$ (minimum total cost), the goal is to find distributions $c^1, \ldots, c^k$ such that the total KL-divergence from each $p^j$ to its closest $c^i$, i.e.,
$$\sum_{j \in [n]} \min_{i \in [k]} \mathrm{KL}(p^j, c^i) ,$$
is minimized. In $MMC_{KL}$ (minimum maximum cost), the goal is to find distributions $c^1, \ldots, c^k$ such that the maximum KL-divergence from each $p^j$ to its closest $c^i$,
$$\max_{j \in [n]} \min_{i \in [k]} \mathrm{KL}(p^j, c^i) ,$$
is minimized. It turns out that polynomial-time algorithms do not exist for either of these problems unless $P = NP$. Therefore, we are interested in $\alpha$-approximation algorithms, i.e., algorithms that find $\tilde{c}^1, \ldots, \tilde{c}^k$ satisfying the guarantee that
$$\frac{\sum_{j \in [n]} \min_{i \in [k]} \mathrm{KL}(p^j, \tilde{c}^i)}{\min_{c^1, \ldots, c^k} \sum_{j \in [n]} \min_{i \in [k]} \mathrm{KL}(p^j, c^i)} \le \alpha$$
for some $\alpha \ge 1$ (and analogously for $MMC_{KL}$). The smaller the value of $\alpha$, the better the approximation.

Both problems have been studied extensively when the input is a set of arbitrary points (not necessarily distributions) and, instead of KL, the measure of distance between two points is either a metric ($\ell_1$ or $\ell_2$ or an arbitrary metric), with symmetry and the triangle inequality, or a measure such as $\ell_2^2$. The problems are usually referred to as $k$-median if the measure is a metric or $k$-means if the measure is $\ell_2^2$. However, previous algorithms for these problems typically rely crucially on the (geo-)metric properties of these distances, which do not hold for the KL-divergence. For example, KL is not symmetric and does not satisfy the triangle inequality.

In the remainder of the introduction, we motivate the need to cluster distributions and the reason why KL is a natural measure in this context. We then review the related work and summarize our contributions.

Why cluster distributions? A natural application of distributional clustering is in clustering words for document classification by topic [4]. In document classification, we are given a training set of documents (or collections of words) whose labels indicate the topic they represent, and the goal is to classify other similar documents according to topic. A natural approach is to look at the words in a document as features that are somehow correlated with the document labels; each word is viewed as a frequency distribution over labels, and given a new document containing a set of words, the distributions corresponding to the words in it are used to find the most likely label for the new document. However, such data is typically very sparse, because each specific word occurs only a few times in the document corpora. So a common approach is to cluster similar word distributions together in order to obtain more robust inference algorithms. Other applications of distributional clustering include clustering words according to context for language modeling [24], information bottleneck techniques [27, 24, 25], and clustering users according to their preference for movies in collaborative filtering.

Why KL-divergence? KL-divergence arises as a natural measure of the dissimilarity between two distributions in numerous ways. We direct the interested reader to Pereira et al. [24] for a wider discussion of the motivations. In what follows, we describe the motivation in terms of compressibility. Given an alphabet of size $d$ where the $i$-th symbol has relative frequency $p_i$, an important question is to find the binary encoding of the alphabet such that the average number of bits required for an encoded symbol is minimized. This classic problem in information theory was essentially solved by Huffman, who presented a simple encoding scheme that achieves the optimum value of
$$H(p) = -\sum_{i \in [d]} p_i \lg p_i$$

the aggregate statistics of the distributions that use $E_j$. Hence we may rewrite the quantity to be minimized as
$$\sum_{j \in [n]} \sum_{i \in [d]} -p^j_i \lg q^{f(j)}_i + \sum_{j \in [n]} \sum_{i \in [d]} p^j_i \lg p^j_i = \sum_{j \in [n]} \sum_{i \in [d]} p^j_i \lg \frac{p^j_i}{q^{f(j)}_i} = (\lg e) \sum_{j \in [n]} \mathrm{KL}(p^j, q^{f(j)}) ,$$
which is exactly the objective function to be minimized in $MTC_{KL}$.

1.1 Prior Work on Clustering

There has been a rich body of research on approximation algorithms for various forms of clustering. We restrict ourselves to work on hard clustering, i.e., where each input distribution is "assigned" only to the closest picked center. Even so, there is a considerable number of incomparable results in a variety of settings. The common optimization measures when clustering points in general metrics are (a) $k$-median, in which the goal is to partition the input points into $k$ sets while minimizing the sum of the distances between each point and the center of the cluster it is assigned to, and (b) $k$-center, where the goal is again to partition the input points into $k$ sets, while minimizing the maximum diameter of a cluster. When clustering in Euclidean spaces, an additional optimization measure which is commonly used is $k$-means, in which the goal is to partition the input points into $k$ clusters while minimizing the sum of the squares of the Euclidean distances between each point and the center of the cluster it is assigned to.
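Before surveying this prior work, it may help to see the two KL objectives defined above written out operationally. The following minimal Python sketch is our own illustration (the function and variable names are not from the paper, and this is not the approximation algorithm developed later): it simply evaluates the $MTC_{KL}$ and $MMC_{KL}$ costs of a fixed set of candidate centers.

```python
import math

def kl(p, q):
    """KL divergence from p to q (natural log); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mtc_cost(points, centers):
    """MTC_KL objective: total KL divergence from each input to its closest center."""
    return sum(min(kl(p, c) for c in centers) for p in points)

def mmc_cost(points, centers):
    """MMC_KL objective: maximum KL divergence from an input to its closest center."""
    return max(min(kl(p, c) for c in centers) for p in points)

# Toy usage: three distributions on a 3-letter alphabet, two candidate centers.
points = [(0.7, 0.2, 0.1), (0.6, 0.3, 0.1), (0.1, 0.2, 0.7)]
centers = [(0.65, 0.25, 0.1), (0.1, 0.2, 0.7)]
print(mtc_cost(points, centers), mmc_cost(points, centers))
```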
General "Metrics": For metric k -center, the best approximation algorithm is due to [16], which achieves an approximation factor of 2 and this is the best possible in polynomial time unless P = N P . For asymmetric k -center, when the directed triangle inequality holds, the best known approximation algorithm is due to [23], which achieves a factor of O(log n), and this is also optimal in terms of hardness [7]. For metric k -median, the best known approximation algorithm is due to [2], which achieves an approximation factor of 3, when the distances between points are symmetric, and there is a triangle inequality. Euclidean Space: When the input points lie in Euclidean space, two versions of the clustering problems have been studied. In the restricted version, we require the cluster centers to be input points, while in the unrestricted version,we allow the cluster centers to be any point in the Euclidean space. For more details about restricted and unrestricted versions of the problems, see Section 5. Most results for clustering in Euclidean space deal with the unrestricted version of the problem. When the input points lie in d-dimensional Euclidean spaces, Kolliopoulos and Rao [21] showed an algorithm for k -median which provides a (1 + ) approximation, and runs in time O(2(O(1+ -1 log -1 ))d-1 if all pi were negative powers of two. We consider an issue that arises when we have two or more distributions over . Consider the problem of trying to encode multiple texts with different statistics such as texts written in different languages or magazine articles covering different topics. For example, the word "perforation" may be common in articles from Gibbons Stamp Monthly Magazine1 whereas "peroxide" may be more frequent in issues of Hairdressers Journal International2 . Hence, if the origin of the text is known it will make sense to tailor the encoding to the statistics of the the source. However, it is likely to be unfeasible to have a different scheme for every possible periodical. Rather, we consider the problem of designing k encoding schemes and assigning each of n periodicals to one of the encoding schemes. How should this be done such that extra cost of using k encoding schemes rather than n is minimized? More formally, let pj be distribution over symbols in the j -th periodical. We wish to design k encoding schemes E1 , . . . , Ek : {0, 1} along with an assignment of distributions to encoding schemes f : [n] [k ] such that the increase in average encoding length, j i j i pj |Ef (j ) (i)| + pj lg pj i i i [n] [d] [n] [d] is minimized. Each encoding scheme Ej can be characterized by a distribution q j over [d] that will capture 1 2 http://www.gibbonsstampmonthly.com/ http://www.hji.co.uk/ n log k log n) . Har-Peled and Mazumdar [19] gave a (1 + ) approximation algorithm which runs in time O(n + 2O(1+ -1 log -1 )d-1 O (1) k logO(1) n) . A third algorithm was proposed by Badoiu et al. [3] with a running time of O(dO(1) n logO(k) n2O(k/ ) ) . For Euclidean k -means, Har-Peled and Mazumdar [19] provided an (1 + ) approximation algorithm with running time O(n + ( -1 )2d+1 k k+2 logk+1 n logk -1 ) . A second (1 + ) approximation algorithm is due to Feldman et al. [15], which achieves a running time of O(ndk + d(k -1 )O(1) + 2O(k -1 ) There has been previous work on methods for KLclustering [24, 4, 26, 5, 11]. However, none of these algorithms achieve guaranteed approximations in the worst case. The most directly relevant paper is a recent paper by Ackermann et al. [1]. 
They present a very nice algorithm that returns a good approximation for $MTC_{KL}$ under the assumption that all distributions to be clustered have constant mass on each coordinate, i.e., for some constant $\gamma$, $p^j_i \ge \gamma$ for all $j \in [n]$ and $i \in [d]$. This implies that $d \le 1/\gamma$ is also constant and, even for distributions of constant dimension, it rules out any sparse data where some coordinates have zero mass. Sparse data is common in many applications. In contrast, the algorithms we present are fully general and require no assumptions on the sparsity or the dimensionality of the input distributions.

Kumar et al. [22] provided a simple algorithm based on sampling for Euclidean $k$-means which gives a $(1+\epsilon)$-approximation in $O(dn\,2^{\mathrm{poly}(k\epsilon^{-1})})$ time. This was improved by Chen [6], whose algorithm runs in $O(ndk + d^2 n^{\sigma} 2^{\mathrm{poly}(k\epsilon^{-1})})$ time, for any $\sigma > 0$. Kanungo et al. [20] give a $(9+\epsilon)$-approximation for $k$-means in time $O(n^3/\epsilon^d)$. For Euclidean $k$-center, Feder and Greene [13] show that it is NP-hard to find an approximation factor better than $1.822$ for this problem.

KL-clustering: In this paper we are interested in KL-clustering on the probability simplex. We first note that algorithms that cluster distributions with respect to either $\ell_1$ or $\ell_2^2$ may give arbitrarily bad solutions for the KL-divergence. The following example shows this for $MTC_{KL}$.

Example 1 Consider the following three distributions:
$$p = \left(\tfrac{1}{2},\ \tfrac{1-\epsilon}{2},\ \tfrac{\epsilon}{2}\right), \qquad q = \left(\tfrac{1}{2},\ \tfrac{1}{2},\ 0\right), \qquad r = \left(\tfrac{1}{2}+\epsilon,\ \tfrac{1}{2}-\epsilon,\ 0\right).$$
We consider the costs of all possible partitions of $\{p, q, r\}$ into two groups.

Clustering            $\ell_2^2$-cost      $\ell_1$-cost      KL-cost
$\{p,q\},\{r\}$       $\epsilon^2/4$       $\epsilon$         $\epsilon/2 + O(\epsilon^2)$
$\{p\},\{q,r\}$       $\epsilon^2$         $2\epsilon$        $O(\epsilon^2)$
$\{p,r\},\{q\}$       $3\epsilon^2/4$      $2\epsilon$        $\epsilon/2 + O(\epsilon^2)$

Note that the clustering $\{\{p,q\},\{r\}\}$ minimizes the $\ell_2^2$ or $\ell_1$ cost, but that this clustering is a factor $\Omega(1/\epsilon)$ from optimal in terms of $MTC_{KL}$. Since $\epsilon$ may be made arbitrarily small, we conclude that clustering the distributions according to either $\ell_2^2$ or $\ell_1$ can lead to arbitrarily bad solutions.

1.2 Our Contributions

Our main contribution in this paper is to provide algorithms for clustering in the KL-divergence measure which achieve guaranteed approximations in the worst case. Our specific contributions are the following:

1. Minimizing Average Distortion: We provide the first guaranteed approximation algorithm for the problem of minimizing average distortion in the KL-divergence measure, when the input is a set of $n$ arbitrary distributions. To show our result, we first provide constant-factor approximation algorithms for the related divergences, Hellinger and Jensen-Shannon. These results exploit the fact that these divergences satisfy a relaxation of the triangle inequality and are closely related to the $k$-means problem on the sphere. We then show that although the KL-divergence between two distributions can be infinitely larger than the Jensen-Shannon or Hellinger divergence, one can relate the average clustering distortion in terms of the Hellinger cost to the average clustering distortion in terms of the KL-divergence. This yields an $O(\log n)$-approximation algorithm for $MTC_{KL}$. We note that while a guarantee of an $O(\log n)$ factor from optimality is weaker than we would like, this does not preclude the possibility that the algorithm achieves better results in practice. Furthermore, the clustering found could be used as a preprocessing step for an improvement heuristic for which no guarantees exist. The most important contribution of an $O(\log n)$-factor approximation is to our understanding of the structure of the problem.

2. Minimizing Maximum Distortion: We provide the first guaranteed approximation algorithm for minimizing the maximum distortion, when the input is a set of $n$ arbitrary distributions. To show our result, we relate the maximum clustering distortion in terms of the KL-divergence to the maximum diameter of a cluster measured in terms of the JS-divergence. We then show a constant-factor approximation to the problem of minimizing the JS diameter. This yields an $O(\min(\log n, \log d))$-approximation algorithm for $MMC_{KL}$.

3. Hardness Results: Finally, we provide hardness results for the above problems. First, we show that when we restrict the cluster centers to be in the set of input distributions, no polynomial-time approximation is possible, unless $P = NP$. In addition, when the centers are unrestricted, we show a hardness of approximation result for $k$-center by demonstrating that KL behaves like $\ell_2^2$ near the middle of the probability simplex.

Notation: We denote the probability simplex over $\mathbb{R}^d$ as $\Delta$. We write $a = b \pm c$ as shorthand for $a \in [b-c, b+c]$.

2 Information Geometry

In this section we review some known results about the geometry of KL and prove some new results. As we mentioned, $\mathrm{KL}(p,q)$ is asymmetric, does not satisfy a directed triangle inequality, and can be infinite even if $p$ and $q$ are on the probability simplex. (It is, however, always non-negative by Gibbs' inequality.) Furthermore, KL does not even satisfy a relaxed directed triangle inequality, that is,
$$\frac{\mathrm{KL}(p,r) + \mathrm{KL}(r,q)}{\mathrm{KL}(p,q)}$$
can be made arbitrarily small with $p, q, r \in \Delta$.³ The following example demonstrates this.

Example 2 KL is not a relaxed metric. Consider $p = (1/2, 1/2)$, $q = (e^{-c}, 1-e^{-c})$, $r = (\epsilon, 1-\epsilon)$ where $1/2 > \epsilon > e^{-c}$. Then
$$\mathrm{KL}(p,q) \ge c/2 - \ln 2, \qquad \mathrm{KL}(p,r) \le (\ln \epsilon^{-1} - \ln 2)/2, \qquad \mathrm{KL}(r,q) \le \epsilon(c-1) .$$
Hence, by increasing $c$ and decreasing $\epsilon$, the ratio $(\mathrm{KL}(p,r) + \mathrm{KL}(r,q))/\mathrm{KL}(p,q)$ can be made arbitrarily small.

Two other information divergences that will play an important role in our results are the Hellinger and Jensen-Shannon divergences. These are both divergences from the family of $f$-divergences [10].

Definition 1 The Hellinger and Jensen-Shannon divergences between $p, q \in \Delta$ are defined as
$$\mathrm{He}(p,q) = \sum_{i \in [d]} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2 \qquad \text{and} \qquad \mathrm{JS}(p,q) = \mathrm{KL}\!\left(p, \frac{p+q}{2}\right) + \mathrm{KL}\!\left(q, \frac{p+q}{2}\right).$$

Both $\mathrm{JS}(p,q)$ and $\mathrm{He}(p,q)$ are symmetric and bounded: it can easily be shown that $\mathrm{JS}(p,q) \le 2$ and $\mathrm{He}(p,q) \le 2$ for all $p, q \in \Delta$. Note that since $\mathrm{KL}(p,q)$ may be infinite, this rules out any multiplicative relationship with KL in general. Relationships between $\mathrm{JS}(p,q)$ and $\mathrm{He}(p,q)$ are given in the next lemma [18, 28].

Lemma 2 For all distributions $p$ and $q$,
$$\mathrm{He}(p,q)/2 \;\le\; \mathrm{JS}(p,q) \;\le\; 2\ln(2)\,\mathrm{He}(p,q) . \qquad (1)$$

Unfortunately, neither JS nor He is a metric, but we can show that they are "almost metrics" in the sense that they satisfy non-negativity, identity of indiscernibles, symmetry, and a relaxation of the triangle inequality. We say that a measure $D$ satisfies an $\alpha$-relaxed triangle inequality if, for all $p, q, r \in \Delta$,
$$D(p,r) + D(r,q) \ge D(p,q)/\alpha .$$
(When $\alpha = 1$, this is the usual triangle inequality.)

Lemma 3 He and JS obey the 2-relaxed triangle inequality.

Proof: We note that He and JS are both the square of metrics: this is obvious for He, and the result for JS was proved in [12]. Therefore, for all $p, q, r \in \Delta$,
$$\sqrt{\mathrm{He}(p,q)} + \sqrt{\mathrm{He}(q,r)} \ge \sqrt{\mathrm{He}(p,r)}$$
and hence
$$\mathrm{He}(p,q) + \mathrm{He}(q,r) + 2\sqrt{\mathrm{He}(p,q)\,\mathrm{He}(q,r)} \ge \mathrm{He}(p,r) .$$
By an application of the AM-GM inequality we deduce
$$2\bigl(\mathrm{He}(p,q) + \mathrm{He}(q,r)\bigr) \ge \mathrm{He}(p,r)$$
as required. The result for JS follows similarly.
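The properties established in this section are easy to sanity-check numerically. The sketch below is our own illustration (the helper names and tolerances are ours): it draws random distributions and checks the He/JS sandwich of Lemma 2 and the 2-relaxed triangle inequality of Lemma 3.

```python
import math
import random

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hellinger(p, q):
    """He(p, q) = sum_i (sqrt(p_i) - sqrt(q_i))^2 (Definition 1)."""
    return sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def js(p, q):
    """JS(p, q) = KL(p, m) + KL(q, m) with m = (p + q) / 2 (Definition 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)

def random_dist(d):
    w = [random.random() for _ in range(d)]
    s = sum(w)
    return [x / s for x in w]

random.seed(0)
for _ in range(1000):
    p, q, r = random_dist(5), random_dist(5), random_dist(5)
    # Lemma 2: He(p,q)/2 <= JS(p,q) <= 2 ln(2) He(p,q)
    assert hellinger(p, q) / 2 <= js(p, q) + 1e-9
    assert js(p, q) <= 2 * math.log(2) * hellinger(p, q) + 1e-9
    # Lemma 3: 2-relaxed triangle inequality for He and JS
    assert hellinger(p, q) + hellinger(q, r) >= hellinger(p, r) / 2 - 1e-9
    assert js(p, q) + js(q, r) >= js(p, r) / 2 - 1e-9
print("all checks passed")
```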
The next lemma is a well-known identity (see, e.g., [8]) that relates the KL and JS divergence. Lemma 4 For all p, q , c : KL(p, c) + KL(q , c) = JS(p, q ) + 2KL((p + q )/2, c) . This is referred to as the parallelogram property. Another useful property that we will exploit is that the He-balls are convex. Lemma 5 B (p) = {p : He(p, p ) } is convex for all 0 and p . Furthermore, for all p, q , r and (0, 1), He(p, q + (1 - )r) He(p, q ) + (1 - )He(p, r) . Proof: Consider any ball B (p) = {p : He(p, p ) } and let q , s B (p) and (0, 1). Then it suffices to show that q + (1 - )r B (p). Let = 1 - . Note that ( pi - qi )2 + ( pi - ri )2 1 ( pi - qi + ri )2 q i + ri q i + ri 2 2 qi + ri + 2 qi ri qi + ri 2 qi ri qi + ri 2 q i ri q i + ri and this is true by the AM-GM inequality. JS(p, q ) = KL(p, p+q p+q ) + KL(q , ). 2 2 3 We note that this ratio can be bounded below for some families of distributions in terms of the ratio of eigenvalues of a related Hessian matrix [9]. Properties of Cluster Centers: For the remaining result of this section we need to introduce some further notation. For any measure D : × R+ : SumCostD (p1 , . . . , pt ; c) = 1 t 3 Minimizing Average Distortion In this section, we address the problem of computing a clustering of the input distributions which approximately minimizes the average Kullback-Liebler divergence between an input distribution and the center of the cluster it belongs to. We provide an algorithm that computes a clustering in which the average KL-divergence between an input distribution, and the center of the cluster it belongs to is at most O(log n) times the optimal cost. The main theorem in this section is the following. Theorem 8 There exists a polynomial time O(log n)approximation algorithm for M T CKL . The main idea behind our algorithm is the observation that even though, in general, the KL-divergence between two distributions can be infinitely larger than the He-divergence between them, a clustering of the input distributions with low average distortion according to the He-divergence also has low average distortion by the KL-divergence. Therefore, our analysis proceeds in two steps. First, we show in Section 3.1 how to compute a clustering that approximately (within a factor of 2 + ) minimizes the average Hellinger divergence between input distributions and the closest cluster center. Then, we show in Section 3.2 how this leads to a clustering with low average distortion in the KL-divergence. 3.1 Hellinger Clustering In this section we present an algorithm for minimizing average He distortion. The main idea behind our algorithm is the simple observation that the Hellinger distance between two distributions p and q is e square of the th Euclidean distance between the points p and q where p is a shorthand for the vector in the positive quadrant of the unit sphere: p = ( p1 , . . . , pd ) . p i and then comTherefore, mapping each point pi to puting a clustering that minimizes the average 2 mea2 sure between each transformed point and the center of the cluster it belongs to, should give us a good clustering for minimizing average Hellinger distortion. However, there will be a slight issue that arises because we insist that the cluster centers lie on the probability simplex. Before we address the issue, we present the algorithm: p i 1. For each input distribution i [n], compute 2. Compute a (1 + )-approximation to p 1 , . . . , pn ) , MTC 2 ( 2 using any (1 + )-approximation algorithm for k means. Let the cluster centers be c1 , . . . , ck . 
Note ~ ~ that in general cj is not on the unit sphere. ~ 3. Let {pj1 , . . . , pjt } be the set of input distribution whose closest cluster center is cj . Let the final ~ center for this cluster be cent(pj1 , . . . , pjt ). j [t] D(pj , c) SumCostD (p , . . . , p ) = min SumCostD (p1 , . . . , pt ; c) c MaxCostD (p1 , . . . , pt ; c) = max D(pj , c) j [t] MaxCostD (p , . . . , p ) = min MaxCostD (p1 , . . . , pt ; c) c 1 t We denote the centroid of a set of distributions as p i cent(p1 , . . . , pt ) = t-1 . The next lemma (a special case of more general result for all Bregman divergences [5]) shows that the center that minimizes the average 2 or KL distortion is the 2 centroid of the distributions being clustered. Lemma 6 For any distributions p1 , . . . , pt , cent(p1 , . . . , pt ) = argminq SumCost 2 (p1 , . . . , pt ; q ) 2 = argminq SumCostKL (p1 , . . . , pt ; q ), i.e., the cluster centers for 2 and KL are at centroids. 2 The next lemma shows that when we are clustering distributions near the middle of the probability simplex, the centers that minimize either the maximum or average KL distortion also lie near the middle of the probability simplex. Define, 1 A(r) = {p : pj = ± r for all j [d]} . (2) d Lemma 7 Let p1 , . . . , pt A( /d) and 0 < Then, If MaxCostKL (p1 , . . . , pt ; c) 10 MaxCostKL (p1 , . . . , pt ) then c A(10 ). Proof: The first claim follows from Lemma 6 and the fact that cent(p1 , . . . , pt ) is a convex combination of p1 , . . . , pt . For the second claim note that for i [t], 1+ KL(pi ; p1 ) ln 3, - and h ce MaxCostKL (p1 , . . . , pt ) 3 . Consider q en / A(10 ). Then KL(pi ; q ) 1 (pi ; q ) (10 - /d)2 > 30 , 2 where the first inequality follows by Pinsker's inequality. Hence q does not satisfy, MaxCostKL (p1 , . . . , pt ; q ) 10·MaxCostKL (p1 , . . . , pt ) . < 1/10. argminc SumCostKL (p1 , . . . , pt ; c) A( /d) . The issue we need to address is that the cluster center c that minimizes SumCostHe (p1 , . . . , pt ; c) need not lie on : this can be seen as a consequence of the fact that cj is not on the unit sphere in general. Thus the actual ~ average Hellinger divergence for the same clustering may be much higher than the k -means cost of the transformed points. However, the following lemma establishes that setting c = cent(p1 , . . . , pt ) (which necessary lies on ) produces a clustering whose average Hellinger distortion is at most a factor 2 away from the k -means cost of the transformed points. Before we state the lemma, we define some notation. For a vector p = (p1 , . . . , pd ) over d dimensions, we use p2 to denote the vector p2 = ( p 2 , . . . , p 2 ) 1 d Lemma 9 For p1 , . . . , pt , for i [d], define p j j j ai = pj and bi = i i. [t] [t] mass of the probability distributions once mapped to the sphere is a factor 2 as shown in Lemma 9. We conclude this section by noting that our algorithm also leads to a good clustering for minimizing average distortion according to the Jensen-Shannon measure using Eq. 1. Lemma 11 There exists a polynomial-time (8 ln 2 + )approximation algorithm for M T CJS . 3.2 Kullback-Leibler Clustering The following lemma relates SumCostKL (p1 , . . . , pt ) and SumCostHe (p1 , . . . , pt ). We note that that a later result in Section 4 could be used (in conjunction with Lemma 2) to achieve a result with that shows the ratio scales with lg t in the worst case. However, the following proof establishes better constants and has the benefit that the proof is more geometric. Lemma 12 For any distributions p1 , . . . 
, pt , 1/2 SumCostKL (p1 , . . . , pt ) lg t (ln 16). SumCostHe (p1 , . . . , pt ) and let a = (a1 , . . . , ad ) and b = (b1 , . . . , bd ). SumCostHe (p1 , . . . , pt ; a/t) 2SumCostHe (p , . . . , p ; (b/t) ) Proof: p j j ( i- 1 t 2 a 2 i /t) = ai + j ai /t - 2 a i /tbi = 2ai - 2t-1/2 ai bi and 2 j ( p j i - bi /t)2 2ai - 2t-1 b2 . i Therefore it suffices to show that bi t1/2 ai but this follows because p j jk b2 = ai + i i pi ai + (t - 1)ai . =k Proof: The first inequality follows because for p, q , JS(p, q ) = minc (KL(p, c) + KL(q , c)) KL(p, q ) (this follows from e.g., Lemma 6) and Eq. 1. We now prove the second inequality. Without loss of generality assume that t is a power of 2 (otherwise consider adding (2 log t - t) new points at He center of p1 , . . . , pt ­ this can only increases the middle term of the equation.) Consider a balanced binary tree on the nodes of the cluster. For an internal node at height j , associate a multiset of distributions S (v ) consisting of 2j copies p(u), the center of mass of the 2j distributions at the leaves of the subtree rooted at v . Let Sj be the set of distributions at height j . Note that S0 = {p1 , . . . , pt }. The lemma follows from the next three claims. Claim 13 For all j , SumCostHe (Sj ) SumCostHe (S0 ). Proof: Let c be an arbitrary distribution. By Lemma 5, where the inequality follows by AM-GM inequality. Theorem 10 There exists a polynomial-time (2 + )approximation algorithm for M T CHe . Proof: The result for M T CHe is achieved as described above: first we map each distribution from the probability simplex to the positive quadrant of the unit sphere: f : {x Rd : 2(x) = 1, xi 1} (p1 , . . . , pd ) ( p1 , . . . , pd ) . We then run an algorithm for M T C 2 . For each cluster 2 formed, return the centroid of the original probability distributions. This is clearly a probability distribution. The cost of using this center rather than the center of 2j He(p, c) + 2j He(q , c) 2j +1 He((p + q )/2, c) and therefore SumCostHe (Sj ; c) decreases as j increases and the result follows. Claim 14 For all j , v :height(v )=j +1 SumCostKL (u:uch(v) S (u)) (ln 16)SumCostHe (Sj ) . where ch(v ) denotes the children of v . Proof: Let u and w be the children of a node v at height j + 1. Let c = (p(u) + p(w))/2.Then, SumCostKL (S (u), S (w)) = 2j JS(p(u), p(w)) 2j +1 (ln 2)He(p(u), p(w)) 2j +2 (ln 2)(He(p(u), c) + He(p(w), c)) (ln 16)SumCostHe (S (u), S (w)) 4 Minimizing Maximum Distortion In this section, we provide an algorithm for clustering the input distributions such that the maximum KullbackLiebler divergence between an input distribution and the center of the cluster it belongs to is approximately minimized. In particular, our algorithm produces a clustering in which the maximum KL-divergence between an input distribution, and the closest center is at most a min(O(log d), O(log n)) factor greater than optimal. Our algorithm is pleasantly simple: we use a variant of Gonzalez's algorithm [16] to cluster the input distributions such that the Jensen-Shannon divergence between any two points in the same cluster is minimized. We then show that although the KL-divergence between two distributions can be infinitely larger than their JS-divergence, this procedure still produces a good clustering according to the KL-divergence. The main theorem in this section can be stated as follows. Theorem 17 There exists a polynomial-time O(min(log d, log n)) approximation for M M CKL . 
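The algorithm behind Theorem 17, described in detail in the proof later in this section, is a variant of Gonzalez's procedure [16]: guess the optimal JS diameter $D$ (there are at most $n^2$ candidate values), then repeatedly pick an arbitrary unmarked distribution and place every unmarked distribution within JS distance $D$ of it into one cluster. The sketch below is our own hedged rendering of that marking procedure (the function names, the brute-force guessing loop and the return format are ours), not an optimized implementation.

```python
import itertools
import math

def js(p, q):
    """Unnormalized Jensen-Shannon divergence JS(p, q) = KL(p, m) + KL(q, m), m = (p+q)/2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return kl(p, m) + kl(q, m)

def js_diameter_clustering(dists, k):
    """Guess-and-mark procedure for minimizing the maximum JS diameter of a cluster.

    For each candidate diameter D (a pairwise JS value, tried in increasing order),
    repeatedly pick an arbitrary unmarked distribution and put every unmarked
    distribution within JS distance D of it into one cluster.  By the 2-relaxed
    triangle inequality each returned cluster has JS diameter at most 4*D.
    The largest candidate always yields a single cluster, so the loop terminates.
    """
    n = len(dists)
    if n <= k:                                   # trivial case: one point per cluster
        return 0.0, [(i, [i]) for i in range(n)]
    candidates = sorted(js(dists[i], dists[j])
                        for i, j in itertools.combinations(range(n), 2))
    for D in candidates:
        unmarked = set(range(n))
        clusters = []
        while unmarked:
            center = min(unmarked)               # an arbitrary unmarked distribution
            members = [j for j in unmarked if js(dists[j], dists[center]) <= D]
            clusters.append((center, members))
            unmarked -= set(members)
        if len(clusters) <= k:
            return D, clusters
```

If the guessed value $D$ is at least the optimal diameter, the procedure opens at most $k$ clusters; this is exactly the counting argument used in the proof of Theorem 17.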
Before proving the theorem, we show a lemma which establishes a general relationship between the KL-divergence and JS-divergence between two distributions, when the ratio of probability masses that the two distributions assign to any coordinate is bounded. This lemma may be of independent interest. Lemma 18 Let p, q such that, for all i, pi /qi t, where t e2 . Then, KL(p, q ) 2 ln t JS(p, q ) ln(6/5) Claim 15 jv :height(v )=j +1 SumCostKL (u:uch(v) S (u)) = SumCostKL (p1 , . . . , pt ) . Proof: Let v be at height j + 1. Let v have children u and w and grandchildren u1 , u2 , w1 , w2 . Then the result follows because SumCostKL (S (u1 ), S (u2 )) +SumCostKL (S (w1 ), S (w2 )) +SumCostKL (S (u), S (w)) = 2j -1 (KL(p(u1 ), p(u)) + KL(p(u2 ), p(u)) +KL(p(w1 ), p(w)) + KL(p(w2 ), p(w)) +2KL(p(u), p(v )) + 2KL(p(w), p(v ))) = 2j -1 (KL(p(u1 ), p(v )) + KL(p(u2 ), p(v )) +KL(p(w1 ), p(v )) + KL(p(w2 ), p(v ))) = SumCostKL (S (u1 ), S (u2 ), S (w1 ), S (w2 )) where the second inequality follows from the parallelogram property and the fact that p(u) = (p(u1 )+p(u2 ))/2 and p(w) = (p(w1 ) + p(w2 ))/2. We next show that the above lemma is nearly tight. Lemma 16 There exists (pi )i[t] on d t coordinates such that, SumCostKL (p1 , . . . , pt ) = (log t) . SumCostHe (p1 , . . . , pt ) Proof: Let (p )i[t] be t distributions where p takes value i with probability 1. Then SumCostKL (p1 , . . . , pt ) = t ln t whereas = ( 1 t-1 SumCostHe (p1 , . . . , pt ; c) = t 1 - )2 + t t 2t - 2 t , ii where c = t-1 p . Then appeal to Lemma 9. Then the proof of Theorem 8 follows immediately from Lemma 12 and Theorem 10. i i Proof: For each i, liet i = (pi - qi )/qi so that pi = i (1 + i )qi . Then, i qi = pi - qi = 0, i KL(p, q ) = qi (1 + i ) ln(1 + i ) , and JS(p, q ) = i qi ((1+i ) ln(1+i )-(2+i ) ln(1+ i )) 2 Since pi /qi t, and i t - 1, from Lemma 19, i KL(p, q ) · JS(p, q ) + i qi n where = l2 l6/t . The lemma follows from the fact that n5 i i qi = 0 and t 4. Lemma 19 For any x [-1, 2], (1 + x) ln(1 + x) 4((1 + x) ln(1 + x) -2(1 + x/2) ln(1 + x/2)) + x . For any x (2, x ], (1 + x) ln(1 + x) 2 ln x ((1 + x) ln(1 + x) ln(6/5) -2(1 + x/2) ln(1 + x/2)) + x . Moreover, there exists some c which is a convex combination of p1 , . . . , pt such that: MaxCostKL (p1 , . . . , pt ; c ) O(log d) . maxi,j JS(pi , pj ) Proof: To show the first inequality, we observe that for any i [d], j [t]: pj /ci t. Using this fact along i with Lemma 18, we conclude that: MaxCostKL (p1 , . . . , pt ; c) O(log t) · max JS(pi , c) i Proof: Let be a parameter and let Y (x) = (1 + x) ln(1 + x) - ((1 + x) ln(1 + x) . -2(1 + x/2) ln(1 + x/2)) - x Our goal is to show that Y (x) 0 for suitable values of the parameter . The first and second order derivatives of Y can be written as follows: Y (x) = ln(1 + x/2) - ( - 1) ln(1 + x) 2-+x . (1 + x)(2 + x) We first consider x [-1, 2) and = 4. If x < 2, then Y (x) < 0. Therefore, Y (x) is strictly decreasing in the range [-1, 2). We note that Y (-1) = and Y (0) = 0; therefore Y is a strictly increasing function from [-1, 0) and strictly decreasing from (0, 2]. As Y (0) = 0, Y (x) < 0 for x < 0 and Y (x) < 0 for x > 0, and the first part of the lemma follows. To prove the second part, we write the derivative Y (x) as follows: Y ( The rest of the inequality follows from the fact that JS is constant factor related to He (Lemma 2), followed by an application of Lemma 20. To show the second inequality, let q 1 , . . . , q d { p1 , . . . , pt } be distributions such that for any i [d] and any j [n], i qi pj . 
We define i c = cent(q 1 + . . . + q d ) . Observe that for any i [d], j [t]: pj /c d. Therei i fore, from Lemma 18, for any i, MaxCostKL (p1 , . . . , pt ; c ) O(log d)JS(pi , c ) From Lemma 2, JS(pi , c ) O(He(pi , c )). As c is a convex combination of p1 , . . . , pt , the rest of the lemma follows from an application of Lemma 20. Lemma 22 Let p1 , . . . , pt be t distributions over [d]. 1 MaxCostKL (p1 , . . . , pt ) 2 maxi,j JS(pi , pj ) Proof: Let (i, j ) = argmax JS(pi , pj ). Note that since we allow unrestricted centers, MaxCostKL (pi , pj ) MaxCostKL (p1 , . . . , pt ; c) , and let q minimize max{KL(pi , q ), KL(pj , q )}. But 2 max{KL(pi , q ), KL(pj , q )} KL(pi , q ) + KL(pj , q ) min KL(pi , q ) + KL(pj , q ) q and x) = Y (x) = · ln If x > 2, then ln l x = 2nn /5 , l6 1 + x/2 + ln(1 + x) 1+x < ln(5/6). By plugging in 1+x/2 1+x Y (x) < -2 ln x + ln(1 + x) < 0 for x in (2, x ], which means that Y is strictly decreasing in this interval. As t e2 , here 4. The previous part of the lemma implies that Y (2) < 0, for any > 4, and hence the lemma follows. Lemma 20 Consider t distributions p1 , . . . , pt such that He(pi , pj ) r for all i, j [t]. Then He(pi , c) r for all i [t] where c is any convex combination of p1 , . . . , p t . Proof: The result follows by Lemma 5: Consider distribution pi and the set of distributions in Br (pi ) = {q : He(pi , q ) r}. By Lemma 5, Br (pi ) is convex. Since pj Br (pi ) for all j [t] and c is a convex combination of {pj }j [t] we deduce that c Br (pi ). Hence He(pi , c) r as required. Since i was arbitrary the result follows. Lemma 21 Let p1 , . . . , pt be t distributions over [d] and let c = cent(p1 , . . . , pt ). Then, MaxCostKL (p1 , . . . , pt ; c) O(log t) . maxi,j JS(pi , pj ) = JS(pi , pj ) . from which the Lemma follows. We now are in a position to complete the proof of Theorem 17. Proof: From Lemmas 21 and 22, and the fact that for any c, MaxCostKL (p1 , . . . , pt ; c) MaxCostKL (p1 , . . . , pt ) (by definition), we know that if we -approximate the problem of minimizing the maximum JS diameter of a cluster, we get a min(O( log d), O( log n)) approximation for k -center KL clustering. In the rest of the proof we show that we may assume = 4. We use a variant of an algorithm by Gonzalez [16] that is applicable to divergences that satisfy a relaxed triangle inequality. Recall that JS satisfies, JS(p, q ) + JS(q , r) JS(p, r)/2 . for all p, q , r. The algorithm assumes knowledge of the optimum JS diameter (note that there are at most n2 possible values and thus we can check them all); let this value be D. Initially, let all pj be "unmarked." The algorithm proceeds by picking an arbitrary unmarked distribution pi , marking all pj such that JS(pj , pi ) D and repeating until all distributions are marked. Define the each cluster as the set of distributions marked in the same iteration and call pi the "center" of this cluster. This results in a clustering such that the maximum diameter is at most 2(D + D) = 4D. We need to show that the process does not determine more than k centers. Suppose we pick k + 1 centers. Note that each of these centers are strictly greater than (D + D)/2 = D apart and hence no two may be in the same cluster for the optimum clustering. This is a contradiction. Proof: We apply Taylor's Theorem to the terms of the KL divergence: i pi KL(p, q ) = pi log - pi + qi qi [d] - 1 i pi - q i pi + q i = -pi log - pi [d] = i [d] (pi - qi )2 3 + i pi pi for some i with |i | |pi - qi |/pi . 
Note that |i |3 pi and therefore KL(p, q ) = (1±3 ) i [d] pi - qi )2 (pi - qi )2 |pi - qi | · 3 ( pi pi pi (pi - qi )2 (1±5 )d 2 (p-q ) . 2 pi 5 Hardness Results In this final section of our paper, we prove hardness of approximation results for M M CKL and M T CKL , i.e., we show lower bounds of the approximation factors possible in polynomial time on the assumption that P = N P . We consider two variants of these problems. For all the algorithms we presented in the previous sections, we insisted that the centers c1 , . . . , ck lay in but other than this the centers were unrestricted. In some of the previous work on approximation algorithms for clustering a variant is considered in which it is required that c1 , . . . , ck are chosen from among the input distributions {p1 , . . . , pn }. We call this the restricted center version. When a metric is used rather than KL, the restricted and unrestricted versions of the problems are closely related: it can be shown that the restriction can at most double the clustering cost. However, for KL we show that, while we have presented approximation algorithms for the unrestricted case, no approximation to any multiplicative factor is possible in the restricted case. 5.1 Unrestricted Centers In this section, we prove an approximation hardness result for M M CKL . Our result is based on demonstrating that near the center of the simplex KL behaves similarly to 2 . We then use a result by Feder and Greene [13] that 2 showed an approximation hardness of 1.822 for k -center in the plane where distance are measured as 2. (Hence, this gives a 1.8222 < 3.320 approximation hardness result for 2 .) 2 Recall the definition, A(r) = {p : pj = 1/d ± r for all j [d]} . (3) Lemma 23 For p, q A( /d), KL(p, q ) = (1 ± 5 )d 2 (p, q ) . 2 Using the above lemma, the next theorem shows that if the distributions to be clustered are near the center of the simplex, then we can use an approximation algorithm for M T CKL or M M CKL to get a good approximation for M T C 2 or M M C 2 respectively. 2 2 Theorem 24 Let (1, 10) and let p1 , . . . , pn A( 2/(502 d3 )) . Then, a -approximation for M T CKL yields a ( + 5 )approximation for M T C 2 . Similarly, a -approximation 2 for M M CKL yields a ( + 5 )-approximation for M M C 2 . 2 Proof: We first consider M T CKL . Suppose we want to solve M T C 2 on the input p1 , . . . , pn and let 2 {c1 , . . . , ck } ~ ~ be a -approximation for M T CKL . Without loss of generality, we may assume that c1 , . . . , ck are in the con~ ~ vex hull of p1 , . . . , pn since if cj is the closest center to ~ {pi }iI then the objective function only decreases if we let cj = cent(pi : i I ). ~ Denote the convex hull of p1 , . . . , pn by H and note that q H implies that q A( 2/(502 d3 )) A( /(5d)). Hence, by appealing to Lemmas 7 and 23, we deduce, j min 2 (pj , ci ) ¯ 2 [n] i[k] = = j 1 2 d(1 ± ) d(1 ± )2 (1 ± )4 (1 ± )4 [n] i[k] min KL(pj , ci ) ¯ j [n] c1 ,...,ck H min i[k] min KL(pj , ci ) c1 ,...,ck H min j [n] i[k] min 2 (pj , ci ) 2 c1 ,...,ck min j [n] i[k] min 2 (pj , ci ) . 2 where the last line follows because the optimum centers for M T C 2 lie in convex(P ). 2 We now consider M M CKL and suppose {c1 , . . . , ck } ~ ~ is a -approximation for M M CKL . By appealing to Lemma 7, we may assume 2 c1 , . . . , ck A(10 ~ ~ /(502 d2 )) = A( /(5d)) . Hence, max min 2 (p , c ) ¯ 2 j [n] i[k] j i where U is the matrix: 1 U = 2 1 - 2 0 1 2 1 - 2 0 1 3 1 3 1 3 Lemma 26 If x A, (x) lies on the simplex. 
Proof: We first show that if x lies on the x1 - x2 plane, then U x lies on the plane x1 + x2 + x3 = 0. Let y1 = (1/ 2, -1/ 2, 0) and y2 = (0, 1/ 2, -1/ 2) . As y1 = U x1 and y2 = U x2 , if x = 1 x1 + 2 x2 , then, U x = 1 y1 + 2 y2 . Since y1 · (1, 1, 1) = y2 · (1, 1, 1) = 0 , U x · (1, 1, 1) = 0 as well, which means that U x lies on the plane x1 + x2 + x3 = 0. Since U x lies on the plane x1 + x2 + x3 = 0, we deduce that (x) lies on the plane x1 + x2 + x3 = 1. Since x [0, 1/4] × [0, 1/4], for any i {1, 2, 3}, 1 1 1 (U x)i - × - . 4 3 2 Therefore, for any i, ((x))i 0. Again, as x [0, 1/4] × [0, 1/4], for any i {1, 2, 3}, (U x)i 1 1 2 × . 4 3 2 = = 1 max min KL(pj , ci ) ¯ d(1 ± )2 j [n] i[k] m min 5 ax min KL(pj , ci ) d(1 ± )2 c1 ,...,ck A( d ) j [n] i[k] m min 5 ax min 2 (pj , ci ) (1 ± )4 c1 ,...,ck A( d ) j [n] i[k] 2 m . ji min ax min 2 (p , c ) (1 ± )4 c1 ,...,ck j [n] i[k] 2 where the last line follows because the optimum centers for M M C 2 lie in A( /(5d)). This can be shown using 2 ideas contained in Lemma 7: 2 (p 2 i , p1 ) d max(pi - p1 )2 4 4/(504 d5 ) j j j [d] while for q A( /(5d)), / 2 (p 2 i , q ) ( 2/(5d)2 - 2/(502 d3 ))2 4 4/(504 d5 ) . Therefore, ((x))i 1 for each i, from which the lemma follows. To map an instance I of M M C 2 on the x1 - x2 plane to the probability simplex, we simply apply the map o on each point of I . This produces another instance I f the problem on the simplex, which has the following property. Lemma 27 There is an approximation-preserving bijection between the solutions of I and the solutions of I . Proof: We observe that as U is a unitary matrix, the map is a bijection. Moreover, since consists of a translation and a rotation, it preserves the distance between every pair of points. Therefore, applying on a solution of I produces a solution of I of the same cost. The mapping is thus approximation preserving. Finally, by mapping each point x I to x+(1- )u where u = [1/3, 1/3, 1/3]T we generate a set of points that lie arbitrarily close to the center of (setting as small as necessary.) Again, this transformation can be seen to be approximation preserving. 5.2 Restricted Centers In this section, we consider the restricted version of M M CKL and M T C KL where we insist that the cluster centers c1 , . . . , cn {p1 , . . . , pn }. Our result is based on relating the problem to the S E T- C OV E R problem and appealing to a result of Feige [14]. To show a hardness result for unrestricted centers it is therefore sufficient to show a hardness result for M M C 2 2 when the points to be clustered lie near the middle of the probability simplex. We do this by taking Feder and Greene [13] result the showed the hardness of M M C 2 in the plane and demonstrating that the plane can be mapped into the middle of the probability simplex in a manner that preserves approximation factors. This will give the following theorem. Theorem 25 For any < 3.320, unless P = N P , no polynomial-time, -approximation algorithm exists for M M CKL . k -means on the middle of : Given an instance I of M M C 2 on a bounded domain A of the x1 - x2 plane, we show how to produce an instance I of M M C 2 on the three-dimensional simplex, such that, there is an approximation preserving bijection between the solutions to I and the solutions to I . To show this, we first assume without loss of generality that A [0, 1/4] × [0, 1/4]. We can assume this because translating and scaling scales down the distance between every pair of points by the same number. 
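Lemma 23 above is what drives this reduction: near the uniform distribution, KL behaves like a squared Euclidean distance scaled by roughly the dimension, which is what the embedding defined next exploits. The following small numerical illustration is our own sketch (the helper names are hypothetical, and the precise constant factor is immaterial to the reduction): it shows the ratio $\mathrm{KL}(p,q)/\ell_2^2(p,q)$ growing roughly linearly with $d$ for distributions close to uniform.

```python
import math
import random

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sq_l2(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def near_uniform(d, eps):
    """A distribution whose coordinates equal 1/d up to a relative perturbation of about eps."""
    w = [1.0 / d + random.uniform(-eps / d, eps / d) for _ in range(d)]
    s = sum(w)
    return [x / s for x in w]

random.seed(1)
for d in (10, 100, 1000):
    p, q = near_uniform(d, 0.01), near_uniform(d, 0.01)
    # Near the uniform distribution, KL / ||p - q||^2 grows roughly linearly with d.
    print(d, kl(p, q) / sq_l2(p, q))
```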
For any x A, we define a map (x) as follows: (x) = U x + [1/3, 1/3, 1/3]T Theorem 28 For any 1, unless P = N P , no polynomial-time, -approximation algorithm exists for either M T CKL or M M CKL . Proof: Consider a reduction from the problem S E TC OV E R : Consider S1 , . . . , Sn-d-1 [d - 1] and k d. It was shown by Feige [14] that it is NP-hard to deterS ine if there exists S = {Si1 , . . . , Sik-1 } such that m ij = [d - 1]. We first consider M M CKL . Let c1 , c2 > 1 such that (d - 1)e-c1 < 1 and (d - 1)e-c2 < e-c1 . Let q i be the probability distribution with mass e-c1 on each element in Si , and the remaining mass on {d}. Let pi = ei (i.e., the i-th vector of the standard basis) for i [d - 1]. Let r be the probability distribution with e-c2 mass on each element in [d - 1] and the remaining mass on {d}. Note that KL(pi , q j ) = c1 , KL(pi , r) = c2 , and KL(q j , r) = (1 - |Sj |e-c1 ) ln 1 - |Sj |e-c1 1 - (d - 1)e-c2 -c1 -c1 +|Sj |e ln(e /e-c2 ) References [1] M. R. Ackerman, J. Blomer, and C. Sohler. Clustering for metric and non-metric distance measures. In ACM-SIAM Symposium on Discrete Algorithms, 2008. [2] V. Arya, N. Garg, R. Khandekar, K. Munagala, and V. Pandit. Local search heuristic for k-median and facility location problems. In ACM Symposium on Theory of Computing, pages 21­29, 2001. [3] M. Badoiu, S. Har-Peled, and P. Indyk. Approximate clustering via core-sets. In ACM Symposium on Theory of Computing, pages 250­257, 2002. [4] L. D. Baker and A. K. McCallum. Distributional clustering of words for text classification. In W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, editors, Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, pages 96­103, Melbourne, AU, 1998. ACM Press, New York, US. [5] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with bregman divergences. Journal of Machine Learning Research, 6:1705­1749, 2005. [6] K. Chen. On k-median clustering in high dimensions. In Symposium on Discrete Algorithms, 2006. [7] J. Chuzhoy, S. Guha, E. Halperin, S. Khanna, G. Kortsarz, R. Krauthgamer, and J. Naor. Asymmetric k -center is log n-hard to approximate. J. ACM, 52(4):538­551, 2005. [8] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications. John Wiley & Sons, New York, NY, USA, 1991. [9] K. Crammer, M. Kearns, and J. Wortman. Learning from multiple sources. In NIPS, pages 321­328, 2006. ´ [10] I. Csiszar. Why least squares and maximum entropy? an axiomatic approach to inference for linear inverse problems. Ann. Statist., 19:2032­2056, 1991. [11] I. S. Dhillon, S. Mallela, and R. Kumar. A divisive information-theoretic feature clustering algorithm for text classification. JMLR, 3:1265­1287, 2003. [12] D. M. Endres and J. E. Schindelin. A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7):1858­1860, 2003. [13] T. Feder and D. Greene. Optimal algorithms for approximate clustering. In ACM Symposium on Theory of Computing, 1988. [14] U. Feige. A threshold of ln for approximating set cover. J. ACM, 45(4):634­652, 1998. [15] D. Feldman, M. Monemizadeh, and C. Sohler. A ptas for k-means clustering based on weak coresets. In Symposium on Computational Geometry, pages 11­18, 2007. [16] T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci., 38:293­306, 1985. [17] S. Guha, P. Indyk, and A. McGregor. 
Sketching information divergences. Submitted to Journal of Machine Learning, 2007.

Continuing the bound above, $\mathrm{KL}(q^j, r) \le |S_j|\,e^{-c_1}(c_2 - c_1) \le d\,e^{-c_1} c_2$. Hence, if there exists $S$, the clustering with centers $p^{i_1}, \ldots, p^{i_{k-1}}, r$ costs at most $\max\{c_1, d e^{-c_1} c_2\}$, whereas otherwise the cost is $\max\{c_1, c_2, d e^{-c_1} c_2\} \ge c_2$. Hence the ratio between the two costs is at least
$$\frac{c_2}{\max\{c_1, d e^{-c_1} c_2\}} ,$$
which we can make arbitrarily large. This also implies that no approximation is possible for $MTC_{KL}$, because any $\alpha$-approximate solution for $MTC_{KL}$ is also an $\alpha n$-approximate solution for $MMC_{KL}$ when centers must be original points.

Bi-criteria Approximation: We briefly mention an approximation algorithm for the related problem of finding the minimum number $k$ of centers $c^1, c^2, \ldots, c^k \in \{p^1, \ldots, p^n\}$ such that $\min_{j \in [k]} \mathrm{KL}(p^i, c^j) \le r$ for all $i \in [n]$, for some given $r$. This can be approximated up to a factor of $O(\log n)$ using a well-known approximation algorithm for SET-COVER. Specifically, for each $p^i$ we define a set $S_i = \{j \in [n] : \mathrm{KL}(p^j, p^i) \le r\}$. Then our problem becomes picking the smallest number of sets $S_{i_1}, S_{i_2}, \ldots$ such that $\bigcup_{j \ge 1} S_{i_j} = [n]$. An $O(\log n)$-approximation algorithm exists for this problem.

Acknowledgements: We would like to thank Sanjoy Dasgupta for helpful discussions.

[18] S. Guha, A. McGregor, and S. Venkatasubramanian. Streaming and sublinear approximation of entropy and information distances. In ACM-SIAM Symposium on Discrete Algorithms, pages 733-742, 2006.
[19] S. Har-Peled and S. Mazumdar. Coresets for k-means and k-median clustering and their applications. In ACM Symposium on Theory of Computing, 2004.
[20] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89-112, 2004.
[21] S. G. Kolliopoulos and S. Rao. A nearly linear-time approximation scheme for the Euclidean κ-median problem. In European Symposium on Algorithms, pages 378-389, 1999.
[22] A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time (1 + ε)-approximation algorithm for k-means clustering in any dimensions. In IEEE Symposium on Foundations of Computer Science, pages 454-462. IEEE Computer Society, 2004.
[23] R. Panigrahy and S. Vishwanathan. An O(log* n) approximation algorithm for the asymmetric p-center problem. J. Algorithms, 27(2):259-268, 1998.
[24] F. C. N. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In ACL, pages 183-190, 1993.
[25] N. Slonim, R. Somerville, N. Tishby, and O. Lahav. Objective classification of galaxy spectra using the information bottleneck method. Monthly Notices of the Royal Astronomical Society, 323:270-284, 2001.
[26] N. Slonim and N. Tishby. Agglomerative information bottleneck. In NIPS, pages 617-623, 1999.
[27] N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368-377, 1999.
[28] F. Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46(4):1602-1609, 2000.