Clustering processes

Daniil Ryabko
INRIA Lille, 40 avenue Halley, 59650 Villeneuve d'Ascq, France
daniil@ryabko.net

Abstract

The problem of clustering is considered for the case when each data point is a sample generated by a stationary ergodic process. We propose a very natural asymptotic notion of consistency, and show that simple consistent algorithms exist under the most general non-parametric assumptions. The notion of consistency is as follows: two samples should be put into the same cluster if and only if they were generated by the same distribution. With this notion of consistency, clustering generalizes such classical statistical problems as homogeneity testing and process classification. We show that, for the case of a known number of clusters, consistency can be achieved under the only assumption that the joint distribution of the data is stationary ergodic (no parametric or Markovian assumptions, no assumptions of independence, neither between nor within the samples). If the number of clusters is unknown, consistency can be achieved under appropriate assumptions on the mixing rates of the processes. In both cases we give examples of simple (at most quadratic in each argument) algorithms which are consistent.

1. Introduction

Given a finite set of objects, the problem is to "cluster" similar objects together. This intuitively simple goal is notoriously hard to formalize. Most of the work on clustering is concerned with particular parametric data-generating models, or particular algorithms, a given similarity measure, and (very often) a given number of clusters. It is clear that, as in almost all learning problems, finding the right similarity measure is an integral part of the clustering problem. However, even if one assumes the similarity measure known, it is hard to define what a good clustering is (Kleinberg, 2002; Zadeh & Ben-David, 2009). What is more, even if one assumes the similarity measure to be simply the Euclidean distance (on the plane), and the number of clusters k known, clustering may still appear intractable for computational reasons. Indeed, in this case finding k centres (points which minimize the cumulative distance from each point in the sample to one of the centres) seems to be a natural goal, but this problem is NP-hard (Mahajan et al., 2009).

In this work we concentrate on a subset of the clustering problem: clustering processes. That is, each data point is itself a sample generated by a certain discrete-time stochastic process. This version of the problem has numerous applications, such as clustering biological data, financial observations, or behavioural patterns, and as such it has gained tremendous attention in the literature.

The main observation that we make in this work is that, in the case of clustering processes, one can benefit from the notion of ergodicity to define what appears to be a very natural notion of consistency. This notion of consistency is shown to be satisfied by simple algorithms that we present, which are polynomial in all arguments. This can be achieved without any modeling assumptions on the data (e.g. Hidden Markov, Gaussian, etc.) and without assuming independence of any kind, either within or between the samples. The only assumption that we make is that the joint distribution of the data is stationary ergodic. The assumption of stationarity means, intuitively, that the time index itself bears no information: it does not matter whether we started recording observations at time 0 or at time 100. By virtue of the ergodic theorem, any stationary process can be represented as a mixture of stationary ergodic processes. In other words, a stationary process can be thought of as first selecting a stationary ergodic process (according to some prior distribution) and then observing its outcomes. Thus, the assumption that the data is stationary ergodic is both very natural and rather weak. At the same time, ergodicity means that, asymptotically, the properties of the process can be learned from observation.

This allows us to define the clustering problem as follows. $N$ samples are given: $x_1 = (X^1_1, \dots, X^1_{n_1}), \dots, x_N = (X^N_1, \dots, X^N_{n_N})$. Each sample is drawn by one out of $k$ different stationary ergodic distributions. The samples are not assumed to be drawn independently; rather, it is assumed that the joint distribution of the samples is stationary ergodic. The target clustering is as follows: those and only those samples are put into the same cluster that were generated by the same distribution. The number $k$ of target clusters can be either known or unknown (different consistency results can be obtained in these cases). A clustering algorithm is called asymptotically consistent if the probability that it outputs the target clustering converges to 1 as the lengths $(n_1, \dots, n_N)$ of the samples tend to infinity (a variant of this definition is to require the algorithm to stabilize on the correct answer with probability 1). Note the particular regime of asymptotics: not with respect to the number of samples $N$, but with respect to the lengths of the samples $n_1, \dots, n_N$.

Similar formulations have appeared in the literature before. Perhaps the closest approach is mixture models (Smyth, 1997; Zhong & Ghosh, 2003): it is assumed that there are k different distributions that have a particular known form (such as Gaussian, Hidden Markov models, or graphical models) and each one out of N samples is generated independently according to one of these k distributions (with some fixed probability). Since the model of the data is specified quite well, one can use likelihood-based distances (and then, for example, the k-means algorithm), or Bayesian inference, to cluster the data. Clearly, the main difference from our setting is that we do not assume any known model of the data; not even between-sample independence is assumed.

The problem of clustering in our formulation generalizes two classical problems of mathematical statistics. The first one is homogeneity testing, or the two-sample problem. Two samples $x_1 = (X^1_1, \dots, X^1_{n_1})$ and $x_2 = (X^2_1, \dots, X^2_{n_2})$ are given, and it is required to test whether they were generated by the same distribution or by different distributions. This corresponds to clustering just two data points ($N = 2$) with the number $k$ of clusters unknown: either $k = 1$ or $k = 2$. The second problem is process classification, or the three-sample problem. Three samples $x_1, x_2, x_3$ are given, and it is known that two of them were generated by the same distribution, while the third one was generated by a different distribution. It is required to find out which two were generated by the same distribution. This corresponds to clustering three data points, with the number of clusters known: $k = 2$.
The classical approach is, of course, to consider Gaussian i.i.d. data, but general non-parametric solutions exist not only for i.i.d. data (Lehmann, 1986), but also for Markov chains (Gutman, 1989), and under certain conditions on mixing rates. What is important for us here is that the three-sample problem is easier than the two-sample problem: the reason is that $k$ is known in the three-sample problem but not in the two-sample problem. Indeed, in (Ryabko, 2010b) it is shown that, in general, for stationary ergodic (binary-valued) processes there is no solution to the two-sample problem, even in the weakest asymptotic sense. However, a solution to the three-sample problem for (real-valued) stationary ergodic processes was given in (Ryabko & Ryabko, 2010).

In this work we demonstrate that, if the number $k$ of clusters is known, then there is an asymptotically consistent clustering algorithm, under the only assumption that the joint distribution of the data is stationary ergodic. If $k$ is unknown, then in this general case there is no consistent clustering algorithm (as follows from the mentioned result for the two-sample problem). However, if an upper bound $\alpha_n$ on the $\alpha$-mixing rates of the joint distribution of the processes is known, and $\alpha_n \to 0$, then there is a consistent clustering algorithm.

Both algorithms are rather simple, and are based on empirical estimates of the so-called distributional distance. For two processes $\rho_1, \rho_2$ the distributional distance $d$ is defined as $\sum_{k=1}^{\infty} w_k |\rho_1(B_k) - \rho_2(B_k)|$, where $w_k$ are positive summable real weights, e.g. $w_k = 2^{-k}$, and the sets $B_k$ range over a countable field that generates the sigma-algebra of the underlying probability space. For example, if we are talking about finite-alphabet processes with the binary alphabet $A = \{0, 1\}$, the sets $B_k$ would range over $A^* = \cup_{k \in \mathbb{N}} A^k$, that is, over all tuples $0, 1, 00, 01, 10, 11, 000, 001, \dots$ (of course, we could just as well omit, say, 1 and 11); the distributional distance in this case is therefore the weighted sum of the differences of the probabilities of all possible tuples. In this work we consider real-valued processes, so the sets $B_k$ have to range over a suitable sequence of intervals, all pairs of such intervals, triples, etc. (see the formal definitions below). This distance has proved a useful tool for solving various statistical problems concerning ergodic processes (Ryabko & Ryabko, 2010; Ryabko, 2010a). Although this distance involves infinite summation, we show that its empirical approximations can be easily calculated.

For the case of a known number of clusters, the proposed algorithm (which is shown to be consistent) is as follows. (The distance used in the algorithms is a suitable empirical estimate of $d$.) The first sample is assigned to the first cluster. For each $j = 2..k$, find a point that maximizes the minimal distance to those points already assigned to clusters, and assign it to cluster $j$. Thus we have one point in each of the $k$ clusters. Next, assign each of the remaining points to the cluster that contains the closest point among those $k$ already assigned. For the case of an unknown number of clusters $k$, the algorithm simply puts together those samples that are not farther away from each other than a certain threshold level, where the threshold is calculated based on the known bound on the mixing rates. In this case, besides the asymptotic result, finite-time bounds on the probability of outputting an incorrect clustering can be obtained. Each of the algorithms is shown to be at most quadratic in each argument.
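To make the finite-alphabet description above concrete, here is a minimal sketch (ours, not the paper's) of an empirical version of this distance for two finite-alphabet sequences. For simplicity it uses one weight $2^{-k}$ per word length $k$, shared by all words of that length, and truncates the sum at a maximum word length; the function names and the parameter `k_max` are illustrative assumptions.

```python
from collections import Counter

def word_frequencies(x, k):
    """Empirical frequencies of the words of length k occurring in the sequence x."""
    n = len(x) - k + 1
    if n <= 0:
        return Counter()
    counts = Counter(tuple(x[i:i + k]) for i in range(n))
    return Counter({w: c / n for w, c in counts.items()})

def empirical_distributional_distance(x1, x2, k_max=5):
    """Weighted sum, over word lengths k = 1..k_max, of the absolute differences
    between the empirical word frequencies of x1 and x2 (finite alphabet)."""
    total = 0.0
    for k in range(1, k_max + 1):
        f1, f2 = word_frequencies(x1, k), word_frequencies(x2, k)
        total += 2.0 ** (-k) * sum(abs(f1[w] - f2[w]) for w in set(f1) | set(f2))
    return total
```

For instance, `empirical_distributional_distance([0, 1, 1, 0, 1], [1, 1, 1, 0, 1])` compares the two binary strings through the frequencies of all their words of lengths 1 to 5.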
To summarize, we show that for the proposed notion of consistency there are simple algorithms that are consistent under the most general assumptions. While these algorithms can be easily implemented, we have left the problem of trying them out on particular applications, as well as optimizing the parameters, for future research. It may also be suggested that the empirical distributional distance can be replaced by other distances, for which similar theoretical results can be obtained. An interesting direction, which could preserve the theoretical generality, would be to use data compressors. These were used in (Ryabko & Astola, 2006) for the related problems of hypotheses testing, leading both to theoretical and practical results. As far as clustering is concerned, compression-based methods were used (without asymptotic consistency analysis) in (Cilibrasi & Vitanyi, 2005), and (in a different way) in (Bagnall et al., 2006). Combining our consistency framework with these compression-based methods is a promising direction for further research.

2. Preliminaries

Let $A$ be an alphabet, and denote $A^*$ the set of tuples $\cup_{i=1}^{\infty} A^i$. In this work we consider the case $A = \mathbb{R}$; extensions to the multidimensional case, as well as to more general spaces, are straightforward. Distributions, or (stochastic) processes, are measures on the space $(A^{\infty}, \mathcal{F}_{A^{\infty}})$, where $\mathcal{F}_{A^{\infty}}$ is the Borel sigma-algebra of $A^{\infty}$. When talking about joint distributions of $N$ samples, we mean distributions on the space $((A^N)^{\infty}, \mathcal{F}_{(A^N)^{\infty}})$.

For each $k, l \in \mathbb{N}$, let $B^{k,l}$ be the partition of the set $A^k$ into $k$-dimensional cubes with volume $h_l^k = (1/l)^k$ (the cubes start at 0). Moreover, define $B^k = \cup_{l \in \mathbb{N}} B^{k,l}$ and $B^* = \cup_{k=1}^{\infty} B^k$. The set $\{B \times A^{\infty} : B \in B^{k,l}, k, l \in \mathbb{N}\}$ generates the Borel $\sigma$-algebra on $\mathbb{R}^{\infty} = A^{\infty}$. For a set $B \in B^*$ let $|B|$ be the index $k$ of the set $B^k$ that $B$ comes from: $|B| = k : B \in B^k$.

We use the abbreviation $X_{1..k}$ for $X_1, \dots, X_k$. For a sequence $x \in A^n$ and a set $B \in B^*$ denote $\nu(x, B)$ the frequency with which the sequence $x$ falls in the set $B$:

\[
\nu(x, B) := \begin{cases} \dfrac{1}{n - |B| + 1} \sum_{i=1}^{n - |B| + 1} \mathbb{I}_{\{(X_i, \dots, X_{i + |B| - 1}) \in B\}} & \text{if } n \ge |B|, \\ 0 & \text{otherwise.} \end{cases}
\]

A process $\rho$ is stationary if $\rho(X_{1..|B|} = B) = \rho(X_{t..t+|B|-1} = B)$ for any $B \in A^*$ and $t \in \mathbb{N}$. We further abbreviate $\rho(B) := \rho(X_{1..|B|} = B)$. A stationary process $\rho$ is called (stationary) ergodic if the frequency of occurrence of each word $B$ in a sequence $X_1, X_2, \dots$ generated by $\rho$ tends to its a priori (or limiting) probability a.s.: $\rho(\lim_{n \to \infty} \nu(X_{1..n}, B) = \rho(B)) = 1$. Denote $\mathcal{E}$ the set of all stationary ergodic processes.

Definition 1 (distributional distance). The distributional distance is defined for a pair of processes $\rho_1, \rho_2$ as follows (e.g. (Gray, 1988)):

\[
d(\rho_1, \rho_2) = \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in B^{m,l}} |\rho_1(B) - \rho_2(B)|,
\]

where $w_j = 2^{-j}$. (The weights in the definition are fixed for the sake of concreteness only; we could take any other summable sequence of positive weights instead.)

In words, we take a sum over a series of partitions into cubes of decreasing volume (indexed by $l$) of all the sets $A^k$, $k \in \mathbb{N}$, and count the differences in probabilities of all cubes in all these partitions. These differences in probabilities are weighted: smaller weights are given to larger $k$ and to finer partitions. It is easy to see that $d$ is a metric. We refer to (Gray, 1988) for more information on this metric and its properties.

The clustering algorithms presented below are based on empirical estimates of the distance $d$:

\[
\hat d(X^1_{1..n_1}, X^2_{1..n_2}) = \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in B^{m,l}} |\nu(X^1_{1..n_1}, B) - \nu(X^2_{1..n_2}, B)|, \qquad (1)
\]

where $n_1, n_2 \in \mathbb{N}$ and $X^i_{1..n_i} \in A^{n_i}$, $i = 1, 2$.
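The following sketch shows one way the estimate (1) could be computed in practice for $[0,1]$-valued samples, truncating the outer sums at `m_max` and `l_max` (this corresponds to the truncated estimate introduced in Definition 3 below). It is only an illustration under our own choices: the helper names, the truncation parameters and the rescaling convention are ours, not the paper's.

```python
from collections import Counter

def cube_frequencies(x, m, l):
    """Empirical frequencies nu(x, B) over the partition B^{m,l}: each m-tuple of x
    is mapped to the index of the cube of side 1/l it falls in.
    Assumes the sample has been rescaled to lie in [0, 1]."""
    n = len(x) - m + 1
    if n <= 0:
        return Counter()
    cubes = Counter(
        tuple(min(int(x[i + j] * l), l - 1) for j in range(m))
        for i in range(n)
    )
    return Counter({c: v / n for c, v in cubes.items()})

def d_hat(x1, x2, m_max=4, l_max=4):
    """Truncated version of the empirical distributional distance (1):
    sums over m <= m_max and l <= l_max with weights w_m * w_l = 2^-m * 2^-l."""
    total = 0.0
    for m in range(1, m_max + 1):
        for l in range(1, l_max + 1):
            f1, f2 = cube_frequencies(x1, m, l), cube_frequencies(x2, m, l)
            cells = set(f1) | set(f2)
            total += 2.0 ** (-m) * 2.0 ** (-l) * sum(abs(f1[c] - f2[c]) for c in cells)
    return total
```

Cells of a partition that are visited by neither sample contribute zero to the inner sum of (1), so only the cells actually hit by the data need to be enumerated; this observation is what makes the formally infinite sums computable (see Proposition 1 below).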
Although the expression (1) involves three infinite sums, it will be shown below that it can be easily calculated.

Lemma 1 ($\hat d$ is consistent). Let $\rho_1, \rho_2 \in \mathcal{E}$ and let two samples $x_1 = X^1_{1..n_1}$ and $x_2 = X^2_{1..n_2}$ be generated by a distribution $\rho$ such that the marginal distribution of $X^i_{1..n_i}$ is $\rho_i$, $i = 1, 2$, and the joint distribution is stationary ergodic. Then

\[
\lim_{n_1, n_2 \to \infty} \hat d(X^1_{1..n_1}, X^2_{1..n_2}) = d(\rho_1, \rho_2) \quad \rho\text{-a.s.}
\]

Proof. The idea of the proof is simple: for each set $B \in B^*$, the frequency with which the sample $x_1$ falls into $B$ converges to the probability $\rho_1(B)$, and analogously for the second sample. As the sample sizes grow, there will be more and more sets $B \in B^*$ whose frequencies have already converged to the corresponding probabilities, so that the cumulative weight of those sets whose frequencies have not yet converged will tend to 0.

For any $\epsilon > 0$ we can find an index $J$ such that $\sum_{i,j=J}^{\infty} w_i w_j < \epsilon/3$. Moreover, for each $m, l$ we can find elements $B^{m,l}_1, \dots, B^{m,l}_{t_{m,l}}$ of the partition $B^{m,l}$, for some $t_{m,l} \in \mathbb{N}$, such that $\rho_i(\cup_{j=1}^{t_{m,l}} B^{m,l}_j) \ge 1 - \epsilon/(6 J w_m w_l)$, $i = 1, 2$. For each $B^{m,l}_j$, where $m, l \le J$ and $1 \le j \le t_{m,l}$, we have $\nu((X^1_1, \dots, X^1_{n_1}), B^{m,l}_j) \to \rho_1(B^{m,l}_j)$ a.s., so that

\[
|\nu((X^1_1, \dots, X^1_{n_1}), B^{m,l}_j) - \rho_1(B^{m,l}_j)| < \epsilon \rho_1(B^{m,l}_j) / (6 J w_m w_l)
\]

for all $n_1 \ge u$, for some $u \in \mathbb{N}$; define $U^{m,l}_j := u$. Let $U := \max_{m,l \le J,\, j \le t_{m,l}} U^{m,l}_j$ ($U$ depends on the realization $X^1_1, X^1_2, \dots$). Define analogously $V$ for the sequence $X^2_1, X^2_2, \dots$. Thus, for $n_1 > U$ and $n_2 > V$ we have

\[
|\hat d(x_1, x_2) - d(\rho_1, \rho_2)| = \Big| \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in B^{m,l}} \big( |\nu(x_1, B) - \nu(x_2, B)| - |\rho_1(B) - \rho_2(B)| \big) \Big|
\]
\[
\le \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in B^{m,l}} \big( |\nu(x_1, B) - \rho_1(B)| + |\nu(x_2, B) - \rho_2(B)| \big)
\]
\[
\le \sum_{m,l=1}^{J} w_m w_l \sum_{i=1}^{t_{m,l}} \big( |\nu(x_1, B^{m,l}_i) - \rho_1(B^{m,l}_i)| + |\nu(x_2, B^{m,l}_i) - \rho_2(B^{m,l}_i)| \big) + 2\epsilon/3
\]
\[
\le \sum_{m,l=1}^{J} w_m w_l \sum_{i=1}^{t_{m,l}} \big( \epsilon \rho_1(B^{m,l}_i)/(6 J w_m w_l) + \epsilon \rho_2(B^{m,l}_i)/(6 J w_m w_l) \big) + 2\epsilon/3,
\]

which proves the statement.

3. Main results

The clustering problem can be defined as follows. We are given $N$ samples $x_1, \dots, x_N$, where each sample $x_i$ is a string of length $n_i$ of symbols from $A$: $x_i = X^i_{1..n_i}$. Each sample is generated by one out of $k$ different unknown stationary ergodic distributions $\rho_1, \dots, \rho_k \in \mathcal{E}$. Thus, there is a partitioning $I = \{I_1, \dots, I_k\}$ of the set $\{1..N\}$ into $k$ disjoint subsets $I_j$, $j = 1..k$,

\[
\{1..N\} = \cup_{j=1}^k I_j,
\]

such that $x_j$, $1 \le j \le N$, is generated by $\rho_i$ if and only if $j \in I_i$. The partitioning $I$ is called the target clustering and the sets $I_i$, $1 \le i \le k$, are called the target clusters. Given samples $x_1, \dots, x_N$ and a target clustering $I$, let $I(x)$ denote the cluster that contains $x$. A clustering function $F$ takes a finite number of samples $x_1, \dots, x_N$ and an optional parameter $k$ (the target number of clusters) and outputs a partition $F(x_1, \dots, x_N, (k)) = \{T_1, \dots, T_k\}$ of the set $\{1..N\}$.

Definition 2 (asymptotic consistency). Let a finite number $N$ of samples be given, and let the target clustering partition be $I$. Define $n = \min\{n_1, \dots, n_N\}$. A clustering function $F$ is strongly asymptotically consistent if $F(x_1, \dots, x_N, (k)) = I$ from some $n$ on with probability 1. A clustering function is weakly asymptotically consistent if $P(F(x_1, \dots, x_N, (k)) = I) \to 1$.

Note that the consistency is asymptotic with respect to the minimal length of the samples, and not with respect to the number of samples.

3.1. Known number of clusters

Algorithm 1 is a simple clustering algorithm which, given the number $k$ of clusters, will be shown to be consistent under the most general assumptions. It works as follows. The point $x_1$ is assigned to the first cluster. Next, find the point that is farthest away from $x_1$ in the empirical distributional distance $\hat d$, and assign this point to the second cluster. For each $j = 3..k$, find a point that maximizes the minimal distance to those points already assigned to clusters, and assign it to cluster $j$. Thus we have one point in each of the $k$ clusters. Next, simply assign each of the remaining points to the cluster that contains the closest point among those $k$ already assigned. One can notice that Algorithm 1 is just one iteration of the k-means algorithm, with so-called farthest-point initialization (Katsavounidis et al., 1994) and a specially designed distance.

Algorithm 1: the case of a known number of clusters k
INPUT: the number of clusters $k$, samples $x_1, \dots, x_N$.
Initialize: $j := 1$, $c_1 := 1$, $T_1 := \{x_{c_1}\}$.
for $j := 2$ to $k$ do
  $c_j := \mathrm{argmax}_{i = 1, \dots, N} \min_{t=1}^{j-1} \hat d(x_i, x_{c_t})$
  $T_j := \{x_{c_j}\}$
end for
for $i = 1$ to $N$ do
  put $x_i$ into the set $T_{\mathrm{argmin}_{j=1..k} \hat d(x_i, x_{c_j})}$
end for
OUTPUT: the sets $T_j$, $j = 1..k$.
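A direct transcription of Algorithm 1 into Python might look as follows. This is a sketch for illustration only: the distance argument `dist` stands for $\hat d$ (or the truncated estimate $\bar d$ discussed below), and the exclusion of already-chosen centres from the argmax is an implementation convenience of ours, not part of the pseudocode above.

```python
def cluster_known_k(samples, k, dist):
    """Sketch of Algorithm 1: farthest-point initialization followed by
    nearest-centre assignment, with an arbitrary pairwise distance `dist`."""
    N = len(samples)
    centres = [0]                      # x_1 is assigned to the first cluster
    for _ in range(2, k + 1):          # pick the remaining k - 1 centres
        c = max((i for i in range(N) if i not in centres),
                key=lambda i: min(dist(samples[i], samples[t]) for t in centres))
        centres.append(c)
    clusters = [[] for _ in range(k)]  # assign every sample to its closest centre
    for i in range(N):
        j = min(range(k), key=lambda t: dist(samples[i], samples[centres[t]]))
        clusters[j].append(i)
    return clusters
```

In practice one would cache the pairwise distances to the chosen centres rather than recompute them; only $O(kN)$ distance evaluations are needed, as counted in the proof of Theorem 1 below.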
Proposition 1 (calculating $\hat d(x_1, x_2)$). For two samples $x_1 = X^1_{1..n_1}$ and $x_2 = X^2_{1..n_2}$ the computational complexity (time and space) of calculating the empirical distributional distance $\hat d(x_1, x_2)$ given by (1) is $O(n^2 \log s_{\min}^{-1})$, where $n = \max(n_1, n_2)$ and $s_{\min} = \min_{i=1..n_1,\, j=1..n_2,\, X^1_i \ne X^2_j} |X^1_i - X^2_j|$.

Proof. First, observe that for fixed $m$ and $l$, the sum

\[
T^{m,l} := \sum_{B \in B^{m,l}} |\nu(X^1_{1..n_1}, B) - \nu(X^2_{1..n_2}, B)| \qquad (2)
\]

has not more than $n_1 + n_2 - 2m + 2$ non-zero terms (assuming $m \le n_1, n_2$; the other case is obvious). Indeed, for each $i = 1, 2$, in the sample $x_i$ there are $n_i - m + 1$ tuples of size $m$: $X^i_{1..m}, X^i_{2..m+1}, \dots, X^i_{n_i - m + 1..n_i}$. Therefore, the complexity of calculating $T^{m,l}$ is $O(n_1 + n_2 - 2m + 2) = O(n)$. Furthermore, observe that for each $m$, for all $l > \log s_{\min}^{-1}$, the term $T^{m,l}$ is constant, so that it is enough to calculate $T^{m,1}, \dots, T^{m, \log s_{\min}^{-1}}$, since for fixed $m$

\[
\sum_{l=1}^{\infty} w_m w_l T^{m,l} = w_m w_{\log s_{\min}^{-1}} T^{m, \log s_{\min}^{-1}} + \sum_{l=1}^{\log s_{\min}^{-1}} w_m w_l T^{m,l}
\]

(that is, we double the weight of the last non-zero term). Thus, the complexity of calculating $\sum_{l=1}^{\infty} w_m w_l T^{m,l}$ is $O(n \log s_{\min}^{-1})$. Finally, for all $m > n$ we have $T^{m,l} = 0$. Since $\hat d(x_1, x_2) = \sum_{m,l=1}^{\infty} w_m w_l T^{m,l}$, summing over $m \le n$ gives the statement.

Theorem 1. Let $N \in \mathbb{N}$ and suppose that the samples $x_1, \dots, x_N$ are generated in such a way that the joint distribution is stationary ergodic. If the correct number of clusters $k$ is known, then Algorithm 1 is strongly asymptotically consistent. Algorithm 1 makes $O(kN)$ calculations of $\hat d(\cdot, \cdot)$, so that its computational complexity is $O(k N n_{\max}^2 \log s_{\min}^{-1})$, where $n_{\max} = \max_{i=1..N} n_i$ and

\[
s_{\min} = \min_{u,v = 1..N,\, u \ne v,\, i = 1..n_u,\, j = 1..n_v,\, X^u_i \ne X^v_j} |X^u_i - X^v_j|.
\]

Observe that the samples are not required to be generated independently. The only requirement on the distribution of the samples is that the joint distribution is stationary ergodic. This is perhaps one of the mildest possible probabilistic assumptions.

Proof. By Lemma 1, $\hat d(x_i, x_j)$, $i, j \in \{1..N\}$, converges to 0 if and only if $x_i$ and $x_j$ are in the same cluster. Since there are only finitely many samples $x_i$, there exists some $\epsilon > 0$ such that, from some $n$ on, we will have $\hat d(x_i, x_j) < \epsilon$ if $x_i, x_j$ belong to the same target cluster ($I(x_i) = I(x_j)$), and $\hat d(x_i, x_j) > \epsilon$ otherwise ($I(x_i) \ne I(x_j)$). Therefore, from some $n$ on, for every $j \le k$ we will have $\max_{i=1,\dots,N} \min_{t=1}^{j-1} \hat d(x_i, x_{c_t}) > \epsilon$, and the sample $x_{c_j}$, where $c_j = \mathrm{argmax}_{i=1,\dots,N} \min_{t=1}^{j-1} \hat d(x_i, x_{c_t})$, will be selected from a target cluster that does not contain any $x_{c_i}$, $i < j$. The consistency statement follows.

Next, let us find how many pairwise distance estimates $\hat d(x_i, x_j)$ the algorithm has to make. On the first iteration of the loop, it has to calculate $\hat d(x_i, x_{c_1})$ for all $i = 1..N$. On the second iteration, it needs again $\hat d(x_i, x_{c_1})$ for all $i = 1..N$, which are already calculated, and also $\hat d(x_i, x_{c_2})$ for all $i = 1..N$, and so on: on the $j$th iteration of the loop we need to calculate $\hat d(x_i, x_{c_j})$, $i = 1..N$, which gives at most $kN$ pairwise distance calculations in total. The statement about computational complexity follows from this and Proposition 1: indeed, apart from the calculation of $\hat d$, the rest of the computations are of order $O(kN)$.

Complexity-precision trade-off. The bound on the computational complexity of Algorithm 1 given in Theorem 1 is for the case of precisely calculated distance estimates $\hat d(\cdot, \cdot)$. However, precise estimates are not needed if we only want to have an asymptotically consistent algorithm. Indeed, following the proof of Lemma 1, it is easy to check that if we replace in (1) the infinite sums with sums over any numbers of terms $m_n, l_n$ that grow to infinity with $n = \min(n_1, n_2)$, and if we replace the partitions $B^{m,l}$ with their (finite) subsets $B^{m,l,n}$ which increase to $B^{m,l}$, then we still have a consistent estimate of $d(\cdot, \cdot)$.

Definition 3 ($\bar d$). Let $m_n, l_n$ be some sequences of numbers, let $B^{m,l,n} \subset B^{m,l}$ for all $m, l, n \in \mathbb{N}$, and denote $n := \min\{n_1, n_2\}$. Define

\[
\bar d(X^1_{1..n_1}, X^2_{1..n_2}) := \sum_{m=1}^{m_n} \sum_{l=1}^{l_n} w_m w_l \sum_{B \in B^{m,l,n}} |\nu(X^1_{1..n_1}, B) - \nu(X^2_{1..n_2}, B)|. \qquad (3)
\]

Lemma 2 ($\bar d$ is consistent). Assume the conditions of Lemma 1. Let $l_n$ and $m_n$ be any sequences of integers that go to infinity with $n$, and let, for each $m, l \in \mathbb{N}$, the sets $B^{m,l,n}$, $n \in \mathbb{N}$, be an increasing sequence of subsets of $B^{m,l}$ such that $\cup_{n \in \mathbb{N}} B^{m,l,n} = B^{m,l}$. Then

\[
\lim_{n_1, n_2 \to \infty} \bar d(X^1_{1..n_1}, X^2_{1..n_2}) = d(\rho_1, \rho_2) \quad \rho\text{-a.s.}
\]

The proof is analogous to that of Lemma 1.

If we use the estimate $\bar d(\cdot, \cdot)$ in Algorithm 1 (instead of $\hat d(\cdot, \cdot)$), then we still get an asymptotically consistent clustering function. Thus the following statement holds true.

Proposition 2. Assume the conditions of Theorem 1. For all sequences $m_n, l_n$ of numbers that increase to infinity with $n$, there is a strongly asymptotically consistent clustering algorithm whose computational complexity is $O(k N n_{\max} m_{n_{\max}} l_{n_{\max}})$.

On the one hand, Proposition 2 can be thought of as an artifact of the asymptotic definition of consistency; on the other hand, in practice precise calculation of $\hat d(\cdot, \cdot)$ is hardly necessary. What we get from Proposition 2 is the possibility to select an appropriate trade-off between the computational burden and the precision of clustering before the asymptotic. Note that the bound in Proposition 2 does not involve the sizes of the sets $B^{m,l,n}$; in particular, one can take $B^{m,l,n} = B^{m,l}$ for all $n$. This is because, for every two samples $X^1_{1..n}$ and $X^2_{1..n}$, the inner sum in (3) has no more than $2n$ non-zero terms, whatever $m$ and $l$ are. However, in the following section, where we are after clustering with an unknown number of clusters $k$, and thus after controlled rates of convergence, the sizes of the sets $B^{m,l,n}$ will appear in the bounds.
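As an end-to-end illustration of the known-$k$ setting, the toy run below reuses the `d_hat` and `cluster_known_k` sketches given earlier. The two-state Markov chain generator, all parameter values and the expected output are our own assumptions for the example, not part of the paper.

```python
import random

def markov_sample(p_stay, n, rng):
    """A toy stationary ergodic sample: a symmetric two-state Markov chain,
    observed as 0.0/1.0; p_stay is the probability of keeping the current state."""
    state = rng.random() < 0.5
    out = []
    for _ in range(n):
        if rng.random() > p_stay:
            state = not state
        out.append(1.0 if state else 0.0)
    return out

rng = random.Random(0)
# two samples from each of two different chains (so the target k is 2)
samples = [markov_sample(0.9, 2000, rng), markov_sample(0.9, 2000, rng),
           markov_sample(0.5, 2000, rng), markov_sample(0.5, 2000, rng)]
print(cluster_known_k(samples, k=2, dist=d_hat))
# expected (up to the ordering of the clusters): [[0, 1], [2, 3]]
```

The two chains have identical single-symbol frequencies, so it is the pair (and longer-block) frequencies entering $\hat d$ that separate them, which is exactly the point of using the distributional distance rather than a marginal statistic.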
3.2. Unknown number of clusters

So far we have shown that when the number of clusters is known in advance, consistent clustering is possible under the only assumption that the joint distribution of the samples is stationary ergodic. However, under this assumption alone, consistent clustering with an unknown number of clusters is in general impossible. Indeed, as was shown in (Ryabko, 2010b), when we have only two binary-valued samples, generated independently by two stationary ergodic distributions, it is impossible to decide whether they have been generated by the same or by different distributions, even in the sense of weak asymptotic consistency (this holds even if the distributions come from a smaller class: the set of all B-processes). Therefore, if the number of clusters is unknown, we have to settle for less, which means that we have to make stronger assumptions on the data. What we need is known rates of convergence of frequencies to their expectations. Such rates are provided by assumptions on the mixing rates of the distribution generating the data. Here we show that, under rather mild assumptions on the mixing rates (and, again, without any modeling assumptions or assumptions of independence), consistent clustering is possible when the number of clusters is unknown.

In this section we assume that all the samples are $[0,1]$-valued (that is, $X^j_i \in [0,1]$); the extension to arbitrary bounded (multidimensional) ranges is straightforward.

Next we introduce mixing coefficients, mainly following (Bosq, 1996) in formulations. Informally, mixing coefficients of a stochastic process measure how fast the process forgets about its past. Any one-way infinite stationary process $X_1, X_2, \dots$ can be extended backwards to make a two-way infinite process $\dots, X_{-1}, X_0, X_1, \dots$ with the same distribution. In the definition below we assume such an extension. Define the $\alpha$-mixing coefficients as

\[
\alpha(n) = \sup_{A \in \sigma(\dots, X_{-1}, X_0),\, B \in \sigma(X_n, X_{n+1}, \dots)} |P(A \cap B) - P(A) P(B)|, \qquad (4)
\]

where $\sigma(\cdot)$ stands for the sigma-algebra generated by the random variables in brackets. These coefficients are non-increasing. A process is called strongly $\alpha$-mixing if $\alpha(n) \to 0$.

Many important classes of processes satisfy mixing conditions. For example, if a process is a stationary irreducible aperiodic Hidden Markov process, then it is $\alpha$-mixing. If the underlying Markov chain is finite-state, then the coefficients decrease exponentially fast. Other probabilistic assumptions can be used to obtain bounds on the mixing coefficients; see e.g. (Bradley, 2005) and references therein.

Algorithm 2 is very simple. Its inputs are: the samples $x_1, \dots, x_N$; the threshold level $\delta \in (0,1)$; and the parameters $m, l \in \mathbb{N}$ and the sets $B^{m,l,n}$. The algorithm assigns to the same cluster all samples which are at most $\delta$-far from each other, as measured by $\bar d(\cdot, \cdot)$. The estimate $\bar d(\cdot, \cdot)$ can be calculated in the same way as $\hat d(\cdot, \cdot)$ (see Proposition 1 and its proof). We do not give a pseudocode implementation of this algorithm, since it is rather obvious. The idea is that the threshold level is selected according to the minimal length of a sample and the (known bounds on) mixing rates of the process generating the samples (see Theorem 2).
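Since the paper gives no pseudocode for Algorithm 2, the following is only our sketch of the thresholding rule described above: two samples are merged whenever their estimated distance is at most `delta` (with `dist` standing for $\bar d$), and the merging is closed transitively via a union-find structure, which is an implementation choice of ours.

```python
def cluster_unknown_k(samples, dist, delta):
    """Sketch of Algorithm 2: put two samples in the same cluster whenever their
    estimated distance is at most delta, closing the relation transitively."""
    N = len(samples)
    labels = list(range(N))            # union-find parents over sample indices

    def find(i):
        while labels[i] != i:          # path halving
            labels[i] = labels[labels[i]]
            i = labels[i]
        return i

    for i in range(N):
        for j in range(i + 1, N):
            if dist(samples[i], samples[j]) <= delta:
                labels[find(i)] = find(j)

    clusters = {}
    for i in range(N):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

The double loop performs the $O(N^2)$ distance evaluations reflected in the complexity bound stated in Theorem 2 below; the choice of `delta` is governed by the mixing-rate bound, as the theorem makes precise.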
The next theorem shows that, if the joint distribution of the samples satisfies $\alpha(n) \le \alpha_n \to 0$, where the $\alpha_n$ are known, then one can select (based on $\alpha_n$ only) the parameters of Algorithm 2 in such a way that it is weakly asymptotically consistent. Moreover, a bound on the probability of error before the asymptotic is provided.

Theorem 2 (Algorithm 2 is consistent, unknown k). Fix sequences $\delta_n \in (0,1)$ and $m_n, l_n \in \mathbb{N}$, and, for each $m, l \in \mathbb{N}$, let $B^{m,l,n} \subset B^{m,l}$, $n \in \mathbb{N}$, be an increasing sequence of finite sets. Set $b_n := \max_{l \le l_n,\, m \le m_n} |B^{m,l,n}|$. Let also $\alpha_n \in (0,1)$, $n \in \mathbb{N}$, be given. Let $N \in \mathbb{N}$ and suppose that the samples $x_1, \dots, x_N$ are generated in such a way that the (unknown) joint distribution is stationary ergodic and satisfies $\alpha(n) \le \alpha_n$ for all $n \in \mathbb{N}$. Then for every sequence $q_n \in [0..n/2]$, Algorithm 2 with the above parameters satisfies

\[
P(T \ne I) \le 2 N (N+1) \big( m_n l_n b_n \varphi_n(\delta_n) + \varphi_n(\delta') \big), \qquad (5)
\]

where $\varphi_n(\epsilon) := 2 e^{-q_n \epsilon^2 / 32} + 11 (1 + 4/\epsilon)^{1/2} q_n \alpha_{(n - 2 m_n)/2 q_n}$, $T$ is the partition output by the algorithm, $I$ is the target clustering, $\delta'$ is a constant that depends only on $\rho$, and $n = \min_{i=1..N} n_i$.

In particular, if $\alpha_n = o(1)$, then, selecting the parameters in such a way that $\delta_n = o(1)$, $m_n, l_n, b_n = o(n)$, $q_n, m_n, l_n \to \infty$, $\cup_{k \in \mathbb{N}} B^{m,l,k} = B^{m,l}$ for all $m, l \in \mathbb{N}$, and, finally,

\[
m_n l_n b_n \big( e^{-q_n \delta_n^2} + \delta_n^{-1/2} q_n \alpha_{(n - 2 m_n)/2 q_n} \big) = o(1),
\]

as is always possible, Algorithm 2 is weakly asymptotically consistent (with the number of clusters $k$ unknown). The computational complexity of Algorithm 2 is $O(N^2 m_{n_{\max}} l_{n_{\max}} b_{n_{\max}})$, and is bounded by $O(N^2 n_{\max}^2 \log s_{\min}^{-1})$, where $n_{\max}$ and $s_{\min}$ are defined as in Theorem 1.
Proof. We use the following bound from (Bosq, 1996): for any zero-mean random process $Y_1, Y_2, \dots$, every $n \in \mathbb{N}$, every $\epsilon > 0$ and every $q \in [1..n/2]$ we have

\[
P\Big( \Big| \sum_{i=1}^n Y_i \Big| > n\epsilon \Big) \le 4 \exp(-q \epsilon^2 / 8) + 22 (1 + 4/\epsilon)^{1/2} q\, \alpha(\lfloor n/2q \rfloor).
\]

For every $j = 1..N$, every $m < n$, $l \in \mathbb{N}$, and $B \in B^{m,l}$, define the process $Y^j_1, Y^j_2, \dots$, where $Y^j_t := \mathbb{I}_{\{(X^j_t, \dots, X^j_{t+m-1}) \in B\}} - \rho(X^j_{1..m} \in B)$. It is easy to see that the $\alpha$-mixing coefficients of this process satisfy $\alpha_{Y^j}(t) \le \alpha_{t - 2m}$. Thus,

\[
P\big( |\nu(X^j_{1..n_j}, B) - \rho(X^j_{1..m} \in B)| > \epsilon/2 \big) \le \varphi_n(\epsilon). \qquad (6)
\]

Then for every $i, j \in [1..N]$ such that $I(x_i) = I(x_j)$ (that is, $x_i$ and $x_j$ are in the same cluster) we have $P(|\nu(X^i_{1..n_i}, B) - \nu(X^j_{1..n_j}, B)| > \epsilon) \le 2\varphi_n(\epsilon)$. Using the union bound, and summing over $m$, $l$, and $B$, we obtain

\[
P\big( \bar d(x_i, x_j) > \epsilon \big) \le 2 m_n l_n b_n \varphi_n(\epsilon). \qquad (7)
\]

Next, let $i, j$ be such that $I(x_i) \ne I(x_j)$. Then, for some $m_{i,j}, l_{i,j} \in \mathbb{N}$, there is $B_{i,j} \in B^{m_{i,j}, l_{i,j}}$ such that $|\rho(X^i_{1..|B_{i,j}|} \in B_{i,j}) - \rho(X^j_{1..|B_{i,j}|} \in B_{i,j})| > 2\delta_{i,j}$ for some $\delta_{i,j} > 0$. Then for every $\delta < \delta_{i,j}/2$ we have

\[
P\big( |\nu(X^i_{1..n_i}, B_{i,j}) - \nu(X^j_{1..n_j}, B_{i,j})| < \delta \big) \le P\big( |\nu(X^i_{1..n_i}, B_{i,j}) - \rho(X^i_{1..|B_{i,j}|} \in B_{i,j})| > \delta_{i,j}/2 \big)
\]
\[
+\ P\big( |\nu(X^j_{1..n_j}, B_{i,j}) - \rho(X^j_{1..|B_{i,j}|} \in B_{i,j})| > \delta_{i,j}/2 \big) \le 2 \varphi_n(\delta_{i,j}). \qquad (8)
\]

Moreover, for $\delta < w_{m_{i,j}} w_{l_{i,j}} \delta_{i,j}/2$,

\[
P\big( \bar d(x_i, x_j) < \delta \big) \le 2 \varphi_n(\delta_{i,j}). \qquad (9)
\]

Define $\delta' := \min_{i,j = 1..N :\, I(x_i) \ne I(x_j)} w_{m_{i,j}} w_{l_{i,j}} \delta_{i,j}/2$. From this and (9), for every $\delta \le \delta'$ we obtain

\[
P\big( \bar d(x_i, x_j) < \delta \big) \le 2 \varphi_n(\delta'). \qquad (10)
\]

If, for every pair $i, j$ of samples, we have $\bar d(x_i, x_j) < \delta_n$ if and only if $I(x_i) = I(x_j)$, then Algorithm 2 outputs the correct clustering. Therefore, taking the bounds (7) (with $\epsilon = \delta_n$) and (10) together for each of the $N(N+1)/2$ pairs of samples, we obtain (5). The complexity statement can be established analogously to that of Theorem 1.

While Theorem 2 shows that $\alpha$-mixing with a known bound on the coefficients is sufficient to achieve asymptotic consistency, the bound (5) on the probability of error includes as multiplicative terms all the parameters $m_n$, $l_n$ and $b_n$ of the algorithm, which can make it large for practically useful choices of the parameters. The multiplicative factors are due to the fact that we take a bound on the divergence of each individual frequency of each cell of each partition from its expectation, and then take a union bound over all of these. To obtain a more realistic performance guarantee, we would like to have a bound on the divergence of all the frequencies of all cells of a given partition from their expectations. Such uniform divergence estimates are possible under stronger assumptions; namely, they can be established under some assumptions on the $\beta$-mixing coefficients, which are defined as follows:

\[
\beta(n) = \mathbb{E} \sup_{B \in \sigma(X_n, X_{n+1}, \dots)} |P(B) - P(B \mid \sigma(\dots, X_{-1}, X_0))|.
\]

These coefficients satisfy $2\alpha(n) \le \beta(n)$ (see e.g. (Bosq, 1996)), so assumptions on the speed of their decrease are stronger. Using the uniform bounds given in (Karandikar & Vidyasagar, 2002), one can obtain a statement similar to that of Theorem 2, with $\alpha$-mixing replaced by $\beta$-mixing, and without the multiplicative factor $b_n$.

4. Conclusion

We have proposed a framework for defining consistency of clustering algorithms when the data comes as a set of samples drawn from stationary processes. The main advantage of this framework is its generality: no assumptions have to be made on the distribution of the data beyond stationarity and ergodicity. The proposed notion of consistency is so simple and natural that it may be suggested as a basic sanity check for all clustering algorithms that are used on sequence-like data. For example, it is easy to see that the k-means algorithm will be consistent with some initializations (e.g. with the one used in Algorithm 1) but not with others (e.g. not with the random one).

While the algorithms that we presented to demonstrate the existence of consistent clustering methods are computationally efficient and easy to implement, the main value of the established results is theoretical. As mentioned in the introduction, it can be suggested that for practical applications empirical estimates of the distributional distance be replaced with distances based on data compression, in the spirit of (Ryabko & Astola, 2006; Cilibrasi & Vitanyi, 2005; Ryabko, 2009). Another direction for future research concerns optimal bounds on the speed of convergence: while we show that such bounds can be obtained (of course, only in the case of known mixing rates), finding practical and tight bounds, for different notions of mixing rates, remains open. Finally, here we have only considered the setting in which the number $N$ of samples is fixed, while the asymptotic is with respect to the lengths of the samples. For on-line clustering problems, it would be interesting to consider the formulation where both $N$ and the lengths of the samples grow.

References

Bagnall, A., Ratanamahatana, C., Keogh, E., Lonardi, S., and Janacek, G. A bit level representation for time series data mining with shape based similarity. Data Mining and Knowledge Discovery, 13(1):11-40, 2006.

Bosq, D. Nonparametric Statistics for Stochastic Processes. Estimation and Prediction. Springer, 1996.

Bradley, R.C. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107-144, 2005.

Cilibrasi, R. and Vitanyi, P.M.B. Clustering by compression. IEEE Trans. Inf. Th., 51:1523-1545, 2005.

Gray, R. Probability, Random Processes, and Ergodic Properties. Springer Verlag, 1988.

Gutman, M. Asymptotically optimal classification for multiple tests with empirically observed statistics. IEEE Trans. Inf. Th., 35(2):402-408, 1989.

Karandikar, R.L. and Vidyasagar, M. Rates of uniform convergence of empirical means with mixing processes. Stat. & Prob. Lett., 58:297-307, 2002.

Katsavounidis, I., Kuo, C.-C. Jay, and Zhang, Zhen. A new initialization technique for generalized Lloyd iteration. IEEE Signal Proc. Let., 1:144-146, 1994.

Kleinberg, J. An impossibility theorem for clustering. In NIPS, pp. 446-453, 2002.

Lehmann, E. Testing Statistical Hypotheses, 2nd ed. Wiley, New York, 1986.

Mahajan, M., Nimbhorkar, P., and Varadarajan, K. The planar k-means problem is NP-hard. In WALCOM, pp. 274-285, 2009.

Ryabko, B. Compression-based methods for nonparametric prediction and estimation of some characteristics of time series. IEEE Trans. Inf. Th., 55:4309-4315, 2009.

Ryabko, B. and Astola, J. Universal codes as a basis for time series testing. Stat. Methodology, 3:375-397, 2006.

Ryabko, D. Testing composite hypotheses about discrete-valued stationary processes. In ITW, pp. 291-295, 2010a.

Ryabko, D. Discrimination between B-processes is impossible. J. Theor. Prob., 23(2):565-575, 2010b.

Ryabko, D. and Ryabko, B. Nonparametric statistical inference for ergodic processes. IEEE Trans. Inf. Th., 56(3):1430-1435, 2010.

Smyth, P. Clustering sequences with hidden Markov models. In NIPS, pp. 648-654, 1997.

Zadeh, R. and Ben-David, S. A uniqueness theorem for clustering. In UAI, 2009.

Zhong, S. and Ghosh, J. A unified framework for model-based clustering. JMLR, 4:1001-1037, 2003.