Budgeted Nonparametric Learning from Data Streams

Ryan Gomes gomes@vision.caltech.edu Andreas Krause krausea@caltech.edu California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125 USA

Abstract
We consider the problem of extracting informative exemplars from a data stream. Examples of this problem include exemplar-based clustering and nonparametric inference such as Gaussian process regression on massive data sets. We show that these problems require maximization of a submodular function that captures the informativeness of a set of exemplars, over a data stream. We develop an efficient algorithm, StreamGreedy, which is guaranteed to obtain a constant fraction of the value achieved by the optimal solution to this NP-hard optimization problem. We extensively evaluate our algorithm on large real-world data sets.

1. Introduction
Modern machine learning is increasingly confronted with the challenge of very large data sets. The unprecedented growth in text, video, and image data demands techniques that can effectively learn from large amounts of data, while still remaining computationally tractable. Streaming algorithms (Gaber et al., 2005; Domingos & Hulten, 2000; Guha et al., 2003; Charikar et al., 2003) represent an attractive approach to handling the data deluge. In this model the learning system has access to a small fraction of the data set at any point in time, and cannot necessarily control the order in which the examples are visited. This is particularly useful when the data set is too large to fit in primary memory, or if it is generated in real time and predictions are needed in a timely fashion.

While computational tractability is critical, powerful methods are required in order to learn useful models of complex data. Nonparametric learning methods are promising because they can construct complex decision rules by allowing the data to "speak for itself". They may use complex similarity measures that capture domain knowledge while still providing more flexibility than parametric methods. However, nonparametric techniques are difficult to apply to large datasets because they typically associate a parameter with every data point, and thus depend on all the data. Therefore, most algorithms for nonparametric learning operate in batch mode. To overcome this difficulty, nonparametric learning methods may be approximated by specifying a budget: a fixed limit on the number of examples that are used to make predictions.

In this work, we develop a framework for budgeted nonparametric learning that can operate in a streaming data environment. In particular, we study sparse Gaussian process regression and exemplar-based clustering under complex, non-metric distance functions, which both meet the requirements of our framework. The unifying concept of our approach is submodularity, an intuitive diminishing returns property. When a nonparametric problem's objective function satisfies this property, we show that a simple algorithm, StreamGreedy, may be used to choose examples from a data stream. We use submodularity to prove strong theoretical guarantees for our algorithm. We demonstrate our approach with experiments involving sparse Gaussian process regression and large-scale exemplar-based clustering of 1.5 million images.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

2. Problem statement
We consider the problem of extracting a subset $A \subseteq V$ of $k$ representative items from a large data set $V$ (which can, e.g., consist of vectors in $\mathbb{R}^d$ or other objects such as graphs, lists, etc.). Our goal is to maximize a set function $F$ that quantifies the utility $F(A)$ of any possible subset $A \subseteq V$. We give examples of such utility functions in Sec. 3. Intuitively, in the clustering example, $F(A)$ measures, e.g., the reduction in quantization error when selecting exemplars $A$ as cluster centers. In Gaussian process (GP) regression, $F(A)$ measures the prediction performance when selecting the active set $A$.

As we show below, many utility functions, such as those arising in clustering and GP regression, satisfy submodularity, an intuitive diminishing returns property: Adding a cluster center helps more if we have selected few exemplars so far, and less if we have already selected many exemplars. Formally, a set function $F$ is said to be submodular if, for all $A \subseteq B \subseteq V$ and $s \in V \setminus B$, it holds that $F(A \cup \{s\}) - F(A) \ge F(B \cup \{s\}) - F(B)$. An additional natural assumption is that $F$ is monotonic, i.e., $F(A) \le F(B)$ whenever $A \subseteq B \subseteq V$.

Since the data set $V$ is large, it is not possible to store it in memory, and we hence can only access a small number of items at any given time $t$. Let $B_1, \dots, B_T, \dots$ be a sequence of subsets of $V$, where $B_t$ is the set of elements in $V$ that are available to the algorithm at time $t$. Typically $|B_t| = m \ll n = |V|$. For example, hardware limitations may require us to read data from disk, one block $B_t$ of data points at a time. We only assume that there is a bound $\ell$ such that for each element $b \in V$, if $b \notin B_t \cup \dots \cup B_{t+\ell'}$, then $\ell' < \ell$, i.e., we have to wait at most $\ell$ steps until $b$ reappears. This assumption is satisfied, for example, if $B_t$ is a sliding window over the data set (in which case $\ell = n$), or $V$ is partitioned into blocks, and the $B_t$ cycle through these blocks (in which case $\ell$ is $n/(\min_i |B_i|)$).

Our goal is to select at each time $t$ a subset $A_t \subseteq A_{t-1} \cup B_t$, $|A_t| \le k$, in order to maximize $F(A_T)$ after some number of iterations $T$. Thus, at each time $t$ we are allowed to pick any combination of $k$ items from both the previous selection $A_{t-1}$ and the available items $B_t$, and we would like to maximize the final value $F(A_T)$. Our streaming assumptions mirror those of Charikar et al. (2003), in that we assume a finite data set in which data items may be revisited although the order is not under our control.
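As a small illustration of the monotonicity and diminishing-returns conditions defined above, the following sketch checks both by brute force on a toy coverage function. The coverage instance, the ground set, and all helper names are assumptions made for this example and do not come from the paper; the check enumerates all subsets and is only feasible for very small ground sets.

```python
from itertools import combinations

# Hypothetical toy instance: each item "covers" a set of points, and
# F(A) counts the points covered by at least one selected item.
COVERS = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
    "d": {1, 6},
}
V = frozenset(COVERS)


def coverage(A):
    """F(A) = number of points covered by the items in A."""
    covered = set()
    for s in A:
        covered |= COVERS[s]
    return len(covered)


def subsets(ground):
    """Yield all subsets of `ground` as frozensets."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_monotonic(F, ground):
    """Check F(A) <= F(B) whenever A is a subset of B."""
    return all(F(A) <= F(B) for B in subsets(ground) for A in subsets(B))


def is_submodular(F, ground):
    """Check F(A + s) - F(A) >= F(B + s) - F(B) for A in B, s outside B."""
    for B in subsets(ground):
        for A in subsets(B):
            for s in ground - B:
                if F(A | {s}) - F(A) < F(B | {s}) - F(B):
                    return False
    return True


if __name__ == "__main__":
    print(is_monotonic(coverage, V), is_submodular(coverage, V))
```

A function with increasing returns, such as $F(A) = |A|^2$, fails the second check, which makes it a convenient negative control.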
For certain submodular objectives ($F_V$ and $F_C$ but not $F_H$, see Section 3) we require the additional assumption that we may access data items uniformly at random (see Section 4).

Note that even if $B_1 = \dots = B_T = V$, i.e., access to the entire data set is always available, the problem of choosing a set

$$A^* = \operatorname{argmax}_{|A| \le k} F(A)$$

maximizing a submodular function $F$ is an NP-hard optimization problem (Feige, 1998). Hence, we cannot expect to efficiently find the optimal solution in general. The setting where $B_t \subsetneq V$ is strictly more general and thus harder. In this paper, we will develop an efficient approximation algorithm with strong theoretical guarantees for this problem.

3. Examples of online budgeted learning
In this section, we discuss concrete problem instances of the streaming budgeted learning problem, and the corresponding submodular objective functions $F$.

Active set selection in GPs. Gaussian processes have been widely used as a powerful tool for nonparametric regression (Rasmussen & Williams, 2006; Cressie, 1991). Formally, a Gaussian process (GP) is a joint probability distribution $P(X_V)$ over a (possibly infinite) set of random variables $X_V$ indexed by a set $V$, with the property that every finite subset $X_A$ for $A = \{s_1, \dots, s_k\}$, $A \subseteq V$ is distributed according to a multivariate normal distribution, $P(X_A = x_A) = \mathcal{N}(x_A; \mu_A, \Sigma_{AA})$, where $\mu_A = (M(s_1), \dots, M(s_k))$ is the prior mean and

$$\Sigma_{AA} = \begin{pmatrix} K(s_1, s_1) & \dots & K(s_1, s_k) \\ \vdots & & \vdots \\ K(s_k, s_1) & \dots & K(s_k, s_k) \end{pmatrix}$$

is the prior covariance, parameterized through the positive definite kernel function $K$. In GP regression, each data point $s \in V$ is interpreted as a random variable in a GP. Based on observations $X_A = x_A$ of a subset $A$ of variables, the predictive distribution of a new data point $s \in V$ is a normal distribution $P(X_s \mid X_A = x_A) = \mathcal{N}(\mu_{s|A}; \sigma^2_{s|A})$, where

$$\mu_{s|A} = \mu_s + \Sigma_{sA} \Sigma_{AA}^{-1} (x_A - \mu_A) \qquad (3.1)$$

$$\sigma^2_{s|A} = \sigma^2_s - \Sigma_{sA} \Sigma_{AA}^{-1} \Sigma_{As}, \qquad (3.2)$$

and $\Sigma_{sA} = (K(s, s_1), \dots, K(s, s_k))$ and $\Sigma_{As} = \Sigma_{sA}^T$. Computing the predictive distributions according to (3.1) is expensive, as it requires "inverting" (finding the Cholesky decomposition of) the kernel matrix $\Sigma_{AA}$, which in general requires $\Theta(|A|^3)$ floating point operations. Reducing this computational complexity (and thereby enabling GP methods for large data sets) has been the subject of much research (see Rasmussen & Williams 2006).

Most approaches for efficient inference in GPs rely on choosing a small active set $A$ of data points for making predictions. For example, the informative vector machine (IVM) uses the set $A$ that maximizes the information gain

$$F_H(A) = H(X_V) - H(X_V \mid X_A), \qquad (3.3)$$

or, equivalently, the entropy $H(X_A)$ of the random variables associated with the selected data points $A$. It can be shown that this criterion is monotonic and submodular (Seeger, 2004). While efficiently computable, the IVM criterion $F_H$ only depends on the selected data points, and does not explicitly optimize the prediction error of the non-selected examples $V \setminus A$. An alternative is to choose data points which minimize the prediction error on the non-selected data: $L(A) = \sum_{s \in V \setminus A} (x_s - \mu_{s|A})^2$. If the data points $V$ are drawn from some distribution $P(s)$, then this criterion can be seen as a sample approximation to the expected variance reduction,

$$L(A) \approx \int P(s)\, P(x_s \mid x_A)\, (x_s - \mu_{s|A})^2 \, ds\, dx_s = \int P(s)\, \sigma^2_{s|A} \, ds. \qquad (3.4)$$

It can be shown that, under certain assumptions on the kernel function, the expected variance reduction $F_V(A) = L(\emptyset) - L(A)$ is a monotonic submodular function.

Exemplar based clustering with complex distance functions on data streams. In exemplar clustering problems, the goal is to select a set of examples from the data set that are representative of the data set as a whole. Exemplar clustering is particularly relevant in cases where choosing cluster centers that are averages of training examples (as in the k-means algorithm) is inappropriate or impossible (see Dueck & Frey 2007 for examples). The k-medoid (Kaufman & Rousseeuw, 1990) approach seeks to choose exemplars that minimize the average dissimilarity of the data items to their nearest exemplar:

$$L(A) = \frac{1}{|V|} \sum_{s \in V} \min_{c \in A} d(x_s, x_c). \qquad (3.5)$$

This loss function can be transformed to a monotonic submodular utility function by introducing a phantom exemplar $x_0$ which may not be removed from the active set, and defining the utility function

$$F_C(A) = L(\{x_0\}) - L(A \cup \{x_0\}). \qquad (3.6)$$

This measures the decrease in the loss associated with the active set versus the loss associated with just the phantom exemplar, and maximizing this function is equivalent to minimizing (3.5). The dissimilarity function $d(x, x')$ need only be a positive function of $x$ and $x'$, making this approach potentially very powerful.

4. StreamGreedy for budgeted learning from data streams
If, at every time, full access to the entire data set $V$ is available, a simple approach to selecting the subset $A_T$ would be to start with the empty set, $A_0 = \emptyset$, and, at iteration $t$, greedily select the element

$$s_t = \operatorname{argmax}_{s \in V} F(A_{t-1} \cup \{s\}) \qquad (4.1)$$

for $t \le k$, and $A_t = A_{t-1}$ for $t > k$. Perhaps surprisingly, this simple greedy algorithm is guaranteed to obtain a near-optimal solution: Nemhauser et al. (1978) prove that for the solution $A_T$, for any $T \ge k$, obtained by the greedy algorithm it holds that $F(A_T) \ge (1 - 1/e) \max_{|A| \le k} F(A)$, i.e., it achieves at least a constant fraction of $(1 - 1/e)$ of the optimal value. In fact, no efficient algorithm can provide better approximation guarantees unless P=NP (Feige, 1998).

Unfortunately, the greedy selection rule (4.1) requires access to all elements of $V$, and hence cannot be applied in the streaming setting. A natural extension to the streaming setting is the following algorithm: Initialize $A_0 = \emptyset$. For $t \le k$, set $A_t \leftarrow A_{t-1} \cup \{s_t\}$, where

$$s_t = \operatorname{argmax}_{s \in B_t} F(A_{t-1} \cup \{s\}). \qquad (4.2)$$

For $t > k$, let

$$(s', s) = \operatorname{argmax}_{s' \in A_{t-1},\ s \in A_{t-1} \cup B_t} F(A_{t-1} \setminus \{s'\} \cup \{s\}), \qquad (4.3)$$

and set $A_t = A_{t-1} \setminus \{s'\} \cup \{s\}$, i.e., replace item $s'$ by item $s$ in order to greedily maximize the utility. Stop after no significant improvement (at least $\eta$ for some small value $\eta > 0$) is observed after a specified number $\ell$ of iterations. StreamGreedy is summarized in Algorithm 1.

Algorithm 1 StreamGreedy
  Initialize active set $A_0 = \emptyset$; bound $\ell$ on wait time
  for $t = 1 : k$ do
    Set $s_t = \operatorname{argmax}_{s \in B_t} F(A_{t-1} \cup \{s\})$
    Set $A_t \leftarrow A_{t-1} \cup \{s_t\}$
  end for
  Set $NI = 0$
  while $NI \le \ell$ do
    Set $(s', s) = \operatorname{argmax}_{s' \in A_{t-1},\ s \in A_{t-1} \cup B_t} F(A_{t-1} \setminus \{s'\} \cup \{s\})$
    Set $t \leftarrow t + 1$; $A_t = A_{t-1} \setminus \{s'\} \cup \{s\}$
    if $F(A_t) > F(A_{t-1}) + \eta$ then
      Set $NI = 0$
    else
      Set $NI = NI + 1$
    end if
  end while

Dealing with limited access to the stream. So far, we have assumed that StreamGreedy can evaluate the objective function $F$ for any candidate set $A$.
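To make the two phases of Algorithm 1 concrete, here is a minimal sketch under simplifying assumptions: the objective is evaluated exactly on the full data set, items are integer indices, and ties are resolved by keeping the current active set (a choice the paper does not specify). The names `stream_greedy`, `make_clustering_utility`, and `blocks` are hypothetical, and the toy data are invented for illustration.

```python
from itertools import combinations


def stream_greedy(F, blocks, k, ell, eta=0.0):
    """Sketch of StreamGreedy (Algorithm 1).

    F      -- set function taking a frozenset of items
    blocks -- function t -> iterable of items available at step t (B_t)
    k      -- budget on the active set size
    ell    -- number of consecutive non-improving steps tolerated
    eta    -- minimum improvement that counts as significant
    """
    A = frozenset()
    t = 0
    # Initial phase: grow the active set greedily from the first k blocks.
    for _ in range(k):
        t += 1
        best = max(blocks(t), key=lambda s: F(A | {s}))
        A = A | {best}
    # Swap phase: exchange one selected item for one candidate per step.
    no_improvement = 0
    while no_improvement <= ell:
        t += 1
        current = F(A)
        candidates = A | frozenset(blocks(t))
        best_set, best_val = A, current
        for s_out in A:
            for s_in in candidates:
                trial = (A - {s_out}) | {s_in}
                val = F(trial)
                if val > best_val:  # strict: ties keep the current set
                    best_set, best_val = trial, val
        A = best_set
        no_improvement = 0 if best_val > current + eta else no_improvement + 1
    return A


def make_clustering_utility(points, phantom, d):
    """F_C(A) = L({x0}) - L(A u {x0}) as in Eq. (3.6); A holds indices."""
    def loss(A):
        return sum(min([d(x, phantom)] + [d(x, points[c]) for c in A])
                   for x in points) / len(points)
    base = loss(frozenset())
    return lambda A: base - loss(A)


# Toy usage: three well-separated pairs on the real line (invented data).
points = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]
F = make_clustering_utility(points, phantom=100.0,
                            d=lambda x, y: (x - y) ** 2)
n = len(points)
selected = stream_greedy(F, blocks=lambda t: [(t - 1) % n], k=3, ell=n)
optimum = max(F(frozenset(c)) for c in combinations(range(n), 3))
print(sorted(selected), F(selected), optimum)
```

Here each block holds a single item and the stream cycles through the data, so `ell = n` matches the wait bound for that access pattern. On this toy instance the swap phase ends with one exemplar per pair.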

While the IVM objective $F_H(A)$ for active set selection in GPs (see Section 3) only requires access to the selected data points $A$, evaluating the objectives $F_C$ and $F_V$ requires access to the entire data set $V$. However, these objective functions share a key property: They additively decompose over the data set. Hence, they can be written in the form

$$F(A) = \frac{1}{|V|} \sum_{s \in V} f(A, x_s)$$

for a suitable function $f$ such that $f(\cdot, x_s)$ is submodular for each input $x_s$. If we assume that data points $x_s$ are generated i.i.d. from a distribution and $f$ is a measurable function of $x_s$, then $f(A, x_s)$ are themselves a series of i.i.d. outcomes of a random variable. Moreover, the range of random variables $f(A, x_s)$ is bounded by some constant $B$ (for clustering, $B$ is the diameter of the data set; for GP regression, $B$ is the maximum prior marginal variance). We can construct a sample approximation $\hat{F}(A) = \frac{1}{|W|} \sum_{s \in W} f(A, x_s)$ by choosing a validation set $W$ uniformly at random from the stream $V$. The following corollary of Hoeffding's inequality adapted from Smola et al. (1999) bounds the deviation between $\hat{F}(A)$ and $F(A)$:

Corollary 1 (Smola et al. 1999). Let $c = \frac{2|V|\varepsilon^2}{B^2 \log(2/\delta)}$ and $\varepsilon > 0$. Then, with probability $1 - \delta$ for $|W| = \frac{c}{1+c}|V|$:

$$\left| \frac{1}{|W|} \hat{F}(A) - \frac{1}{|V|} F(A) \right| < \varepsilon.$$

The result relates the level of approximation to the fraction of the data set that is needed for validation. As the number of elements in the stream $|V|$ increases, smaller fractions are needed to reach a given accuracy. Because this result holds for any (bounded) data distribution, it is usually pessimistic; in practice, smaller validation sets often suffice. Furthermore, this sample based approximation only requires a constant amount of memory: When $x_s$ arrives from the stream, $f(A, x_s)$ may be added to a sufficient statistic and $x_s$ itself may be discarded.

5. Theoretical analysis
Clustering-consistent objectives. For clarity of notation, we will consider the setting where $B_t = \{b_t\}$ contains only a single element $b_t \in V$. The results generalize to sets $B_t$ containing more elements. We first show that for an interesting class of submodular functions, the algorithm actually converges to the optimal solution. Suppose the data set $V$ can be partitioned into a set of clusters, i.e., $V = C_1 \cup \dots \cup C_L$, where $C_i \cap C_j = \emptyset$. We call a monotonic submodular function $F$ clustering-consistent for a particular clustering $C_1, \dots, C_L$ if the following conditions hold:

1. $F(A) = \sum_{i=1}^{L} F(A \cap C_i)$, i.e., $F$ decomposes additively across clusters.
2. Whenever two sets $A, B \subseteq V$ satisfy $B = A \cup \{s\} \setminus \{s'\}$ with $s \in C_i$, $s' \in C_j$, $i \ne j$, it holds that if $|A \cap C_j| > 1$ and $A \cap C_i = \emptyset$, then $F(A) \le F(B)$.

Intuitively, a submodular function $F$ is clustering-consistent if it is always preferable to select a representative from a new cluster than to have two representatives of the same cluster.

Proposition 2. Suppose $F$ is clustering-consistent for $V$ and $k \le L$. Then, for $T = 2\ell$ it holds for all sets $A_t$, $t \ge T$, returned by StreamGreedy (for $\eta = 0$) that

$$F(A_t) = \max_{|A| \le k} F(A).$$

The proofs can be found in the longer version of this paper (Gomes & Krause, 2010). Thus, for clustering-consistent objectives $F$, if the data set really consists of $L$ clusters, and we use StreamGreedy to select a set of $k \le L$ exemplars, then StreamGreedy converges to the optimal solution after at most two passes through the data set $V$.

A key question is which classes of objective functions are clustering-consistent. In the following, suppose that the elements in $V$ are endowed with a metric $d$. The following proposition gives interesting examples:

Proposition 3. Suppose $V = C_1 \cup \dots \cup C_L$, $|C_i| < \alpha |C_j|$ for all $i, j$. Further suppose that

$$\max_i \operatorname{diam}(C_i) < \beta \min_{i,j} d(C_i, C_j)$$

for suitable constants $\alpha$ and $\beta$, where $d(C_i, C_j) = \min_{r \in C_i, s \in C_j} d(r, s)$ and $\operatorname{diam}(C_i) = \max_{r, s \in C_i} d(r, s)$. Then the following objectives from Sec. 3 are clustering-consistent with $V = C_1 \cup \dots \cup C_L$:

• The clustering objective $F_C$, whenever $\max_{x \in C_i} d(x, x_0) \le \min_j d(C_i, C_j)$ for all $i, j$, where $x_0$ is the phantom exemplar.
• The entropy $F_H$ and variance reduction¹ $F_V$ for Gaussian process regression with squared exponential kernel functions with appropriate bandwidth $\sigma^2$, and where $d$ is the Euclidean metric in $\mathbb{R}^d$.

Intuitively, Propositions 2 and 3 suggest that in situations where the data actually exhibits a well-separated, balanced clustering structure, and we are interested in selecting a number of exemplars $k$ consistent with the number of clusters $L$ in the data, we expect StreamGreedy to perform near-optimally.

¹ Under the condition of conditional suppressor-freeness (Das & Kempe, 2008).
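The separation quantities that appear in Proposition 3 are easy to compute for a given partition, which can help judge how close a data set is to the well-separated, balanced regime. The sketch below is illustrative only: the helper names and toy clusters are assumptions, the minimum separation is taken over pairs of distinct clusters, and since no numeric values for the constants are stated here, it reports the raw quantities rather than deciding whether the proposition applies.

```python
from itertools import combinations


def diameter(cluster, d):
    """diam(C) = max over pairs r, s in C of d(r, s); 0 for a singleton."""
    return max((d(r, s) for r, s in combinations(cluster, 2)), default=0.0)


def separation(c1, c2, d):
    """d(C_i, C_j) = min over r in C_i, s in C_j of d(r, s)."""
    return min(d(r, s) for r in c1 for s in c2)


def separation_summary(clusters, d):
    """Return (max_i diam(C_i), min over i != j of d(C_i, C_j), size ratio).

    The size ratio is the largest cluster size divided by the smallest,
    a simple measure of how balanced the partition is.
    """
    max_diam = max(diameter(c, d) for c in clusters)
    min_sep = min(separation(a, b, d) for a, b in combinations(clusters, 2))
    sizes = [len(c) for c in clusters]
    return max_diam, min_sep, max(sizes) / min(sizes)


# Toy usage with the absolute difference as the metric (invented data).
clusters = [[0.0, 1.0], [10.0, 11.0], [20.0, 21.0]]
print(separation_summary(clusters, d=lambda r, s: abs(r - s)))
```

A small maximum diameter relative to the minimum separation, together with a size ratio near one, corresponds to the regime the proposition describes.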

[Figure 1 plots. Left and center panels: utility versus points processed for StreamGreedy, online k-means, and batch k-means (center panel: k-means centers replaced by the nearest medoid). Right panel: test utility relative to full data set versus validation set size (percentage of full data set), for K = 20, 50, 100, 200.]

Figure 1. Left and Center: Convergence rates on MNIST data set. The y-axis represents the clustering utility evaluated on the training set. The x-axis shows the number of data items processed by StreamGreedy and online k-means. K-means' unconstrained centers yield better quantization performance. When k-means' centers are replaced with the nearest training set example, the advantage disappears (center). Right: Test performance versus validation set size. It is possible to obtain good generalization performance even using relatively small validation sets. The validation set size is varied along the x-axis. The y-axis shows test utility divided by the test utility achieved with the entire data set used for validation. As K increases, more validation data is needed to achieve full performance.

General submodular objectives. However, the assumptions made by Propositions 2 and 3 are fairly strong, and likely violated by the existence of outliers, overlapping and imbalanced clusters, etc. Furthermore, when using criteria such as $F_C$ and $F_V$ (Sec. 3), it is not possible to evaluate $F(A)$ exactly, but only up to additive error $\varepsilon$. Perhaps surprisingly, even in such more challenging settings, the algorithm is still guaranteed to converge to a near-optimal solution:

Theorem 4. Let $\eta > 0$. Suppose $F$ is monotonic submodular on $V$, and we have access to a function $\hat{F}$ such that for all $A \subseteq V$, $|A| \le 2k$ it holds that $|\hat{F}(A) - F(A)| \le \varepsilon$. Furthermore suppose $F$ is bounded by $B$. Then, for $T = \ell B/\eta$ it holds for all sets $A_t$, $t \ge T$, selected by StreamGreedy applied to $\hat{F}$ that

$$F(A_t) \ge \frac{1}{2} \max_{|A| \le k} F(A) - k(\varepsilon + \eta).$$

Thus, e.g., in the case where $b_t = s_{t \bmod n}$, i.e., if StreamGreedy sequentially cycles through the data set $V$, at most $B/\eta$ passes (typically it will stop far earlier) through the data set will suffice to produce a solution that obtains almost half the optimal value. The proof relies on properties of the pairwise exchange heuristic for submodular functions (Nemhauser et al., 1978). See the long version of this paper (Gomes & Krause, 2010) for details.

6. Experimental results
Exemplar based streaming clustering. Our exemplar based clustering experiments involve StreamGreedy applied to the clustering utility $F_C$ (Eq. (3.6)) with $d(x, x') = \|x - x'\|^2$. The implementation can be made efficient by exploiting the fact that only a subset of the validation points (cf. Sec. 4) change cluster membership for each candidate swap. We have also implemented an adaptive stopping rule that is useful when determining an appropriate size of the validation set. Please see the long version (Gomes & Krause, 2010) for details.

Our first set of experiments uses MNIST handwritten digits with 60,000 training images and 10,000 test images.² The MNIST digits were preprocessed as follows: The 28 by 28 pixel images were initially represented as 784-dimensional vectors, the mean of the training image vectors was subtracted from each image, and the resulting vectors were normalized to unit norm. PCA was performed on the normalized training vectors and the first 50 principal component coefficients were used to form feature vectors. The same normalization procedure was performed on the test images and their dimensionality was also reduced using the training PCA basis.

² MNIST was downloaded from http://yann.lecun.com/exdb/mnist/.

Fig. 1 compares the performance of our approach against batch k-means and online k-means (Dasgupta, 2009) with the number of exemplars set to K = 100. We chose the origin as the phantom exemplar in this experiment, since this yielded better overall quantization performance than choosing a random exemplar. To unambiguously assess convergence speed we use the entire training set of 60,000 points as the validation set. We assess convergence by plotting (3.6) against the number of swap candidates ($\sum_{t=1}^{T} |B_t|$) considered. We find that our algorithm converges to a solution after examining nearly the same number of data points as online k-means, and is near its final value after a single pass through the training data. Similar convergence was observed for smaller validation sizes. The left plot in Fig. 1 shows that k-means performs better in terms of quantization loss. This is probably because StreamGreedy must choose exemplar centers from the training data, while k-means center locations are unconstrained. When the k-means centers are replaced with the nearest training example (center plot), the advantage disappears.

The right plot in Fig. 1 examines the impact of validation set size on quantization performance on the held out test set, measured as test set utility ((3.6) where $V$ is the test set). It is possible to obtain good generalization performance even when using a small validation set. The y-axis indicates test performance relative to the performance attained with the full data set at the specified value of K (1.0 indicates equal performance, values less than one indicate worse performance than the full set), and the x-axis is plotted as the relative size of the validation set versus the full set. We find that as the number of centers K increases, a larger fraction of the data set is needed to approach the performance with the full set. This appears to be because as K increases, the numerical differences between $F_C(A_{t-1} \setminus \{s'\} \cup \{s\})$ for alternative candidate swaps $(s, s')$ decrease, and more samples are needed in order to stably rank the swap alternatives.

Our second set of experiments involves approximately 1.5 million Tiny Images³ (Torralba et al., 2008), and is designed to test our algorithm on a large scale data set. Each image in the data set was downloaded by Torralba et al. from an Internet search engine and is associated with an English noun query term. The 32 by 32 RGB pixel images are represented as 3,072 dimensional vectors. Following Torralba et al. (2008), we subtract from each vector its mean value (average of all components), then normalize it to unit norm. No dimensionality reduction is performed. We generate a random center to serve as the phantom exemplar for this experiment, since we find that this leads to qualitatively more interesting clusters than using the origin.⁴ Fig. 2 (left) shows K = 200 exemplars discovered by our algorithm. Clusters are organized primarily according to non-semantic visual characteristics such as color and basic shape owing to the simple sum of squared differences similarity measure employed (Fig. 3).

³ http://people.csail.mit.edu/torralba/tinyimages/

⁴ We find that a random phantom exemplar is unlikely to be chosen as a prototype, while one near the origin is the prototype for a significant fraction of the data.

We set the validation size to one-fifth of the data set. This was determined by examining the stability of $\operatorname{argmax}_{s' \in A_{t-1},\ s \in A_{t-1} \cup B_t} F_C(A_{t-1} \setminus \{s'\} \cup \{s\})$ as validation data was progressively added to the sums in $F_C$, which tends to stabilize well before this amount of data is considered. The algorithm was halted after 600 iterations (each considering $|B_t| = 1{,}000$ candidate centers). This was determined based on inspection of the utility function, which converged before a single pass through the data. We compare against the online k-means algorithm with 200 centers initialized to randomly chosen images, and run through a single pass over the data. We find that online k-means converges to a suboptimal solution in which many of the clusters are empty or contain only a single member (see Fig. 2).

[Figure 2 plots. Bottom panels: cluster size (#members) versus cluster rank.]

Figure 2. Tiny Image data set. Top Left: Cluster exemplars discovered by StreamGreedy, sorted according to descending size. Top Right: Cluster centers from online k-means (singleton clusters omitted). Bottom Left: Cluster sizes (number of members) for our algorithm. Bottom Right: Cluster sizes for online k-means. Online k-means finds a poor local minimum with many of the 200 clusters containing only a single member.

Figure 3. Examples from Tiny Image cluster 26. Left: 100 examples nearest to exemplar 26. Right: 100 randomly sampled images from cluster 26.

In Fig. 4 (left) we assess the tradeoff between run time and performance by varying the parameter $|B_t| \in \{500, 1000, 2000\}$ and the validation set size as {10%, 20%, 40%} of the data set. The number of centers and iterations are fixed at 200 and 600, respectively. Our Matlab StreamGreedy implementation was run on a quad-core Intel Xeon server. Performance for each parameter setting is visualized as a point in the test utility versus run time plane, and only the Pareto optimal points are displayed for clarity. Online k-means is also shown for comparison. We find that StreamGreedy achieves higher utility at less running time, and a clear saturation in performance occurs as run time increases.

[Figure 4 plots. Left panel: utility versus run time (hours) for StreamGreedy and online k-means. Right panel: test MSE versus K (active set size) for |W| = 2000, 6000, 10000.]

Figure 4. Left: Utility score versus run time on the Tiny Images data set. Right: Gaussian Process regression. The y-axis is test set mean squared prediction error. The x-axis is the size of the active set.

Online active set selection for GP regression. Our Gaussian Process regression experiments involve specialization of StreamGreedy for the objective function $F_V$ in Sec. 3. The implementation can be made more efficient by using Cholesky factorization on the covariance matrix combined with rank one updates and downdates. (Please see the longer version, Gomes & Krause (2010), for details.) We used the KIN40K dataset⁵ which consists of 9 attributes generated by a robotic arm simulator. We divide the dataset into 10,000 training and 30,000 test instances. We follow the preprocessing steps outlined by Seeger et al. (2003) in order to compare our approach to the results in that study. We used the squared exponential kernel with automatic relevance determination (ARD) weights and learn the hyperparameters using marginal likelihood maximization (Rasmussen & Williams, 2006) on a subset of 2,000 training points, again following Seeger et al. (2003).

⁵ Downloaded from http://ida.first.fraunhofer.de/ anton/data.html.

Fig. 4 (right) shows the mean squared error predictive performance (the average over test points $s$ of $(y_s - \mu_s)^2$) on the test set as a function of the size of the active set. Comparing our results to the experiments of Seeger et al. (2003), we find that our approach outperforms the info-gain criterion for active set size $K \in \{200, 400, 600\}$ at all values of the validation set size $|W| \in \{2000, 6000, 10000\}$. At values $K \in \{800, 1000\}$ our approach outperforms info-gain for $|W| \in \{6000, 10000\}$. Our performance matches Smola & Bartlett (2000) at $K \in \{200, 400\}$ but slightly underperforms their approach at larger values of K. We find that even for $|W| = 2{,}000$, the algorithm is able to gain predictive ability by choosing more active examples from the data stream. The performance gap between $|W| = 6{,}000$ and $|W| = 10{,}000$ is quite small.

7. Related Work
Specialization of StreamGreedy to the clustering objective $F_C$ (3.6) yields an algorithm which is similar to the Partitioning Around Medoids (PAM, Kaufman & Rousseeuw 1990) algorithm for k-medoids, and related algorithms CLARA (Kaufman & Rousseeuw, 1990) and CLARANS (Ng & Han, 2002). Like our approach, these algorithms are based on repeatedly exchanging centers for non-center data points if the swap improves the objective function. Unlike our approach, however, no performance guarantees are known for these approaches. PAM requires access to the entire data set, and every data point is exhaustively examined at each iteration, leading to an approach unsuitable for large databases. CLARA runs PAM repeatedly on subsamples of the data set, but then makes use of the entire data set when comparing the results of each PAM run. Like our algorithm, CLARANS evaluates a random subset of candidate centers at each iteration, but then makes use of the entire data set to evaluate candidate swaps. Our approach takes advantage of the i.i.d. concentration behavior of the clustering objective in order to eliminate the need for accessing the entire data set, while still yielding a performance guarantee. Domingos & Hulten (2001) exploit the concentration behavior of the (non-exemplar) k-means objective in a similar way. While there exist online algorithms for k-medoids with strong theoretical guarantees (Charikar et al., 2003), these algorithms require the distance function $d$ to be a metric, and the memory to grow (logarithmically) in $|V|$. In contrast, our approach uses arbitrary dissimilarity functions and the memory requirements are independent of the data set size.

Specialization of StreamGreedy to sparse GP inference is an example of the subset of datapoints class of sparse Gaussian Process approximations (Rasmussen & Williams, 2006), in which the GP predictive distribution is conditioned on only the datapoints in the active set. Seeger et al. (2003) also use a subset of datapoints approach that makes use of a submodular (Seeger, 2004) utility function (the entropy of the Gaussian distribution of each site in the active set). This approach is computationally cheaper than ours in that the evaluation criterion does not require a validation set, but depends only on the current active set. Seeger et al.'s approach also fits the framework proposed by this paper, and our approach could be used to optimize this objective over data streams. Smola & Bartlett (2000) use a subset of regressors approach. Their criterion for greedy selection of regressors has the same complexity as our approach if we use the entire data set for validation. Our approach is cheaper when we make use of a limited validation set. Csató & Opper (2002) develop an approach for online sparse GP inference based on projected process approximation that also involves swapping candidate examples into an active set, but without performance guarantees. See Rasmussen & Williams (2006) for a survey of other methods for sparse Gaussian Process approximation.

StreamGreedy's structure is similar to the algorithm by Weston et al. (2005) for online learning of kernel perceptron classifiers, in that both approaches make use of a fixed budget of training examples (the active set) that are selected by evaluating a loss function defined over a limited validation set.

Nemhauser et al. (1978) analyzed the greedy algorithm and a pairwise exchange algorithm for maximizing submodular functions. As argued in Sec. 4, these algorithms do not apply to the streaming setting. Streeter & Golovin (2008) develop an online algorithm for maximizing a sequence of submodular functions over a fixed set (that needs to be accessed every iteration). Our approach, in contrast, maximizes a single submodular function on a sequence of sets, using bounded memory.

References
Charikar, M., O'Callaghan, L., and Panigrahy, R. Better streaming algorithms for clustering problems. In STOC, pp. 30–39, 2003.
Cressie, N. A. C. Statistics for Spatial Data. Wiley, 1991.
Csató, Lehel and Opper, Manfred. Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668, 2002.
Das, A. and Kempe, D. Algorithms for subset selection in linear regression. In STOC, 2008.
Dasgupta, S. Lecture notes on online clustering. Technical report, http://wwwcse.ucsd.edu/dasgupta/291/lec6.pdf, 2009.
Domingos, P. and Hulten, G. Mining high-speed data streams. In KDD, 2000.
Domingos, P. and Hulten, G. A general method for scaling up machine learning algorithms and its application to clustering. In ICML, 2001.
Dueck, D. and Frey, B. J. Non-metric affinity propagation for unsupervised image categorization. In ICCV, pp. 1–8, 2007.
Feige, U. A threshold of ln n for approximating set cover. Journal of the ACM, 45(4):634–652, July 1998.
Gaber, Mohamed Medhat, Zaslavsky, Arkady, and Krishnaswamy, Shonali. Mining data streams: a review. SIGMOD Record, 34(2):18–26, June 2005.
Gomes, Ryan and Krause, Andreas. Budgeted nonparametric learning from data streams (tech report). Technical report, California Institute of Technology, 2010. http://www.cs.caltech.edu/krausea/files/icmlbudget-long.pdf.
Guha, Meyerson, Mishra, Motwani, and O'Callaghan. Clustering data streams: Theory and practice. IEEE TKDE, 15, 2003.
Kaufman, L. and Rousseeuw, P. J. Finding Groups in Data: an Introduction to Cluster Analysis. Wiley, 1990.
Nemhauser, G. L., Wolsey, L. A., and Fisher, M. L. An analysis of approximations for maximizing submodular set functions. Math. Programming, 14(1):265–294, December 1978.
Ng, R. T. and Han, J. CLARANS: A method for clustering objects for spatial data mining. IEEE Trans. Knowl. Data Eng, 14(5):1003–1016, 2002.
Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. The MIT Press, 2006.
Seeger, M. Greedy forward selection in the informative vector machine. Technical report, UC Berkeley, 2004.
Seeger, M., Williams, C. K. I., and Lawrence, N. D. Fast forward selection to speed up sparse Gaussian process regression. In AISTATS, 2003.
Smola, Alex J. and Bartlett, Peter L. Sparse greedy Gaussian process regression. In NIPS, pp. 619–625, 2000.
Smola, Alex J., Mangasarian, Olvi L., and Schölkopf, Bernhard. Sparse kernel feature analysis, 1999.
Streeter, Matthew and Golovin, Daniel. An online algorithm for maximizing submodular functions. In NIPS, pp. 1577–1584, 2008.
Torralba, A., Fergus, R., and Freeman, W. T. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE PAMI, 30(11):1958–1970, November 2008.
Weston, Jason, Bordes, Antoine, and Bottou, Léon. Online (and offline) on an even tighter budget. In AISTATS, pp. 413–420, 2005.

8. Conclusions
We have developed a theoretical framework for extracting informative exemplars from data streams that led to StreamGreedy, an effective algorithm with strong theoretical guarantees. We have shown that this framework can be successfully specialized to exemplar-based problems and nonparametric regression with Gaussian processes. In the case of clustering, our experiments demonstrate that our approach is capable of discovering meaningful clusters in large high-dimensional data sets, while remaining computationally tractable. Our sparse Gaussian process regression algorithm is competitive with other approaches and is capable of operating in a streaming data environment. Future work involves discovering other machine learning problems that fit the framework (including classification) and exploring alternative ways to approximately evaluate submodular functions without full access to a large data set.

Acknowledgements
We thank Pietro Perona, Piotr Dollar, Kristin Branson, and the anonymous reviewers for their helpful comments. This research was partially supported by ONR grants N00014-09-1-1044 and N00014-06-1-0734, a gift from Microsoft Corporation, and an Okawa Foundation Research Grant.