Ultrafast Monte Carlo for Statistical Summations

Michael P. Holmes, Alexander G. Gray, and Charles Lee Isbell, Jr.
College of Computing, Georgia Institute of Technology, Atlanta, GA 30327
{mph, agray, isbell}@cc.gatech.edu

Abstract

Machine learning contains many computational bottlenecks in the form of nested summations over datasets. Computation of these summations is typically O(n^2) or higher, which severely limits application to large datasets. We present a multi-stage stratified Monte Carlo method for approximating such summations with probabilistic relative error control. The essential idea is fast approximation by sampling in trees. This method differs from many previous scalability techniques (such as multi-tree methods) in that its error is stochastic, but we derive conditions for error control and demonstrate that they work. Further, we give a theoretical sample complexity for the method that is independent of dataset size, and show that this appears to hold in experiments, where speedups reach as high as 10^14, many orders of magnitude beyond the previous state of the art.

1 Introduction

Many machine learning methods have computational bottlenecks in the form of nested summations that become intractable for large datasets. A variety of objective functions and gradients involve computations of this sort, including the cross-validated objective functions of kernel methods and leave-one-out cross-validation in general. We formalize the general class of nested summations and present a new multi-stage Monte Carlo method for approximating any problem in the class with probabilistic relative error control. Key to the efficiency of this method is the use of tree-based data stratification, i.e. sampling in trees.
We give a theoretical analysis of the error guarantees and sample complexity of the method, with the intriguing result that its runtime depends not on the size of the datasets on which it operates, but on statistical features such as variance, which can be controlled through techniques such as stratification. We present experiments validating these theoretical results.

Previous approaches to scalable approximations of this kind fall into roughly two groups: 1) methods that run non-accelerated algorithms on subsets of the data, typically without error bounds, and 2) multi-tree methods with deterministic error bounds. The former are of less interest due to the lack of error control, while the latter are good when exact error control is required, but have a built-in overconservatism that limits speedup, and are difficult to extend to new problems. Multi-stage Monte Carlo offers much larger speedup, generality that makes it simple to adapt to new problems, and rigorous probabilistic error control. While there are non-summative problems to which the standard multi-tree methodology is applicable and our Monte Carlo method is not, our method appears to give orders of magnitude greater speedup on problems where both methods can be used.

The main contributions of this work are: formulation of the general class of nested summations; multi-stage application of Monte Carlo principles to this problem class; derivation of error guarantees for iterative multi-stage Monte Carlo; sample complexity bounds independent of dataset size; a stratification approach based on spatial partitioning trees; application to kernel regression and kernel conditional density estimation; and the demonstration of speedups as high as 10^14 on datasets with points numbering in the millions (many orders of magnitude faster than the previous state of the art).
2 Problem definition and previous work

Good examples of the type of nested summations we address are the least-squares cross-validation scores used for bandwidth optimization in nonparametric kernel methods such as kernel regression (KR), kernel density estimation (KDE), and kernel conditional density estimation (KCDE):

S_{KR} = \frac{1}{n} \sum_i \left( y_i - \frac{\sum_{j \neq i} K_h(\|x_i - x_j\|)\, y_j}{\sum_{j \neq i} K_h(\|x_i - x_j\|)} \right)^2

S_{KDE} = \frac{1}{n} \sum_i \left[ \frac{1}{(n-1)^2} \sum_{j \neq i} \sum_{k \neq i} \int K_h(\|x - x_j\|) K_h(\|x - x_k\|)\,dx \;-\; \frac{2}{n-1} \sum_{j \neq i} K_h(\|x_i - x_j\|) \right]

S_{KCDE} = \frac{1}{n} \sum_i \left[ \frac{\sum_{j \neq i} \sum_{k \neq i} K_{h_2}(\|x_i - x_j\|) K_{h_2}(\|x_i - x_k\|) \int K_{h_1}(y - y_j) K_{h_1}(y - y_k)\,dy}{\sum_{j \neq i} \sum_{k \neq i} K_{h_2}(\|x_i - x_j\|) K_{h_2}(\|x_i - x_k\|)} \;-\; \frac{2 \sum_{j \neq i} K_{h_2}(\|x_i - x_j\|) K_{h_1}(y_i - y_j)}{\sum_{j \neq i} K_{h_2}(\|x_i - x_j\|)} \right]

These nested sums have quadratic and cubic computation times that are intractable for large datasets. We would like a method for quickly approximating these and similar computations in a simple and general way. We begin by formulating an inductive generalization of the problem class:

B(X_c) = \sum_{i \in I(X_c)} f(X_c, X_i)    (1)

G(X_c) = B(X_c) \;\big|\; \sum_{i \in I(X_c)} f\big(X_c, G_1(X_c, X_i), G_2(X_c, X_i), \ldots\big).    (2)

B represents the base case, in which a tuple of constant arguments X_c may be specified and the tuple of variable arguments X_i is indexed by a set I, which may be a function of X_c. For instance, in each of the innermost leave-one-out summations of S_{KR}, X_c is the single point x_i, while I(X_c) indexes all single points other than x_i. Note that |I| is the number of terms in a summation of type B, and therefore represents the base time complexity. Whenever I consists of all k-tuples or leave-one-out k-tuples, the base complexity is O(n^k), where n is the size of the dataset. The inductive case G is either 1) the base case B, or 2) a sum whose summand function takes as arguments X_c and a series of nested instances of type G. The outermost summation of S_{KR} is an example of the latter.
The base complexity here is |I| multiplied by the maximum base complexity among the nested instances; e.g. if, as in S_{KR}, I is all single points and the most expensive inner G is O(n), then the overall base complexity is O(n^2).

Previous work. Past efforts at scaling this class of computation have fallen into roughly two groups. First are methods in which the data is simply subsampled before running a non-accelerated algorithm. Stochastic gradient descent and its variants (e.g. [1]) are prototypical here. While these approaches can have asymptotic convergence, they offer no error guarantees for finite sample sizes. This is not show-stopping in practice, but the lack of quality assurance is a critical shortcoming. Our approach also exploits the speedup that comes from sampling, but provides rigorous relative error guarantees and automatically determines the sample sizes needed to provide them.

The other main class of acceleration methods consists of those employing "higher-order divide and conquer," or multi-tree, techniques that give either exact answers or hard error bounds (e.g. [2, 3, 4]). These approaches are applicable to a broad class of "generalized n-body problems" (GNPs), and use multiple spatial partitioning structures such as kd-trees or ball trees to decompose and reuse portions of the computational work. While the class of GNPs has yet to be formally defined, the generalized summations we address are clearly related and have at least partial overlap. The standard multi-tree methodology has three significant drawbacks. First, although it gives deterministic error bounds, the bounds are usually quite loose, resulting in an overconservatism that prevents the aggressive approximation needed for greater speed. Second, creating a new multi-tree method to accelerate a given algorithm requires complex custom derivation of error bounds and pruning rules.
Third, the standard multi-tree approach is conjectured to reduce O(n^p) computations at best to O(n^{log p}). This still leaves an intractable computation for p as small as 4.

In [5] the first of these concerns began to be addressed by employing sample-based bounds within a multi-tree error propagation framework. The present work builds on that idea by moving to a fully Monte Carlo scheme in which multiple trees are used for variance-reducing stratification. Error is rigorously controlled and driven by sample variance, allowing the Monte Carlo approach to make aggressive approximations that avoid the overconservatism of deterministic multi-tree methods and yield speedups greater by many orders of magnitude. Our Monte Carlo approach handles the class of recursive summations in full generality, making it easy to specialize to new problems. Lastly, the computational complexity of our method does not depend directly on dataset size, which means it can address high degrees of nesting that would make the standard multi-tree approach intractable. The main tradeoff is that Monte Carlo error bounds are probabilistic, though the bound probability is a parameter of the algorithm. We therefore believe the Monte Carlo approach is superior in all situations that can tolerate minor stochasticity in the approximated output.

In summary, this work makes the following contributions: formulation of the class of generalized nested data summations; derivation of recursive Monte Carlo algorithms with rigorous error guarantees for this class of computation; derivation of sample complexity bounds showing no explicit dependence on dataset size; variance-driven tree-based stratified sampling of datasets, which allows Monte Carlo approximation to be effective with small sample sizes; and empirical demonstration of speedups as high as 10^14 on datasets numbering in the millions. It is the combination of all these elements that enables our method to perform so far beyond the previous state of the art.
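To make the cost being attacked concrete, the score S_{KR} from Section 2 can be evaluated exhaustively as a two-level nested sum. The sketch below is ours, not the paper's implementation; the Gaussian kernel is an illustrative choice, and every name in it is hypothetical:

```python
import math

def K(h, d):
    # Gaussian kernel: an illustrative choice of K_h; any kernel works.
    return math.exp(-d * d / (2.0 * h * h))

def skr_exact(x, y, h):
    """Naive leave-one-out kernel-regression score S_KR.

    One inner leave-one-out sum per data point: O(n^2) kernel
    evaluations, which is the cost the Monte Carlo method avoids.
    """
    n = len(x)
    total = 0.0
    for i in range(n):
        num = den = 0.0
        for j in range(n):
            if j == i:
                continue  # I(X_c) excludes x_i itself (leave-one-out)
            w = K(h, math.dist(x[i], x[j]))
            num += w * y[j]  # weighted sum of targets
            den += w         # normalizer
        total += (y[i] - num / den) ** 2
    return total / n
```

For constant targets the leave-one-out prediction is exact and the score is zero, which makes a convenient sanity check of the implementation.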
3 Single-stage Monte Carlo

We first derive a Monte Carlo approximation for the base case of a single-stage, flat summation, i.e. Equation 1. The basic results for this simple case (up to and including Algorithm 1 and Theorem 1) mirror the standard development of Monte Carlo as in [6] or [7], with some modification to accommodate our particular problem setup. We then move beyond these to present novel sample complexity bounds and extensions of the single-stage results to the multi-stage and multi-stage stratified cases. These extensions allow us to efficiently bring Monte Carlo principles to bear on the entire class of generalized summations, while yielding insight into the dependence of computational complexity on sample statistics and into how tree-based methods can improve those statistics.

To begin, note that the summation B(X_c) can be written as nE[f_i] = n\mu_f, where n = |I| and the expectation is taken over a discrete distribution P_f that puts mass 1/n on each term f_i = f(X_c, X_i). Our goal is to produce an estimate \hat{B} that has low relative error with high probability. More precisely, for a specified \epsilon and \alpha, we want |\hat{B} - B| \leq \epsilon|B| with probability at least 1 - \alpha. This is equivalent to estimating \mu_f by \hat{\mu}_f such that |\hat{\mu}_f - \mu_f| \leq \epsilon|\mu_f|. Let \hat{\mu}_f be the sample mean of m samples taken from P_f. From the Central Limit Theorem, we have asymptotically \hat{\mu}_f \sim N(\mu_f, \hat{\sigma}_f^2/m), where \hat{\sigma}_f^2 is the sample variance, from which we can construct the standard confidence interval: |\hat{\mu}_f - \mu_f| \leq z_{\alpha/2}\hat{\sigma}_f/\sqrt{m} with probability 1 - \alpha. In the 1 - \alpha fraction of cases where \mu_f lies in this bound, our relative error condition is implied by z_{\alpha/2}\hat{\sigma}_f/\sqrt{m} \leq \epsilon|\mu_f|. We also have |\mu_f| \geq |\hat{\mu}_f| - z_{\alpha/2}\hat{\sigma}_f/\sqrt{m}. Combining these, we can ensure our target relative error by requiring that z_{\alpha/2}\hat{\sigma}_f/\sqrt{m} \leq \epsilon\left(|\hat{\mu}_f| - z_{\alpha/2}\hat{\sigma}_f/\sqrt{m}\right), which rearranges to:

m \geq \frac{z_{\alpha/2}^2 (1+\epsilon)^2}{\epsilon^2} \cdot \frac{\hat{\sigma}_f^2}{\hat{\mu}_f^2}.    (3)

Equation 3 gives an empirically testable condition that guarantees the target relative error level with probability 1 - \alpha, given that \hat{\mu}_f has reached its asymptotic distribution N(\mu_f, \hat{\sigma}_f^2/m). This suggests an iterative sampling procedure in which m starts at a value m_min chosen to make the normal approximation valid, and is then increased until the condition of Equation 3 is met. This procedure is summarized in Algorithm 1, and we state its error guarantee as a theorem.

Theorem 1. Given m_min large enough to put \hat{\mu}_f in its asymptotic normal regime, with probability at least 1 - \alpha Algorithm 1 approximates the summation S with relative error no greater than \epsilon.

Proof. We have already established that Equation 3 is a sufficient condition for relative error \epsilon with probability 1 - \alpha. Algorithm 1 simply increases the sample size until this condition is met.

Algorithm 1 Iterative Monte Carlo approximation for flat summations.

MC-Approx(S, X_c, \epsilon, \alpha, m_min):
    samples \leftarrow \emptyset;  m_needed \leftarrow m_min
    repeat
        addSamples(samples, m_needed, S, X_c)
        m, \hat{\mu}_f, \hat{\sigma}_f^2 \leftarrow calcStats(samples)
        m_thresh \leftarrow z_{\alpha/2}^2 (1+\epsilon)^2 \hat{\sigma}_f^2 / (\epsilon^2 \hat{\mu}_f^2)
        m_needed \leftarrow m_thresh - m
    until m \geq m_thresh
    return |S.I| \hat{\mu}_f

addSamples(samples, m_needed, S, X_c):
    for i = 1 to m_needed
        X_i \leftarrow rand(S.I)
        samples \leftarrow samples \cup S.f(X_c, X_i)

calcStats(samples):
    m \leftarrow count(samples);  \hat{\mu}_f \leftarrow avg(samples);  \hat{\sigma}_f^2 \leftarrow var(samples)
    return m, \hat{\mu}_f, \hat{\sigma}_f^2

Sample Complexity. Because we are interested in fast approximations, Algorithm 1 is only useful if it terminates with m significantly smaller than the number of terms in the full summation. Equation 3 gives an empirical test indicating when m is large enough for sampling to terminate; we now provide an upper bound, in terms of the distributional properties of the full set of f_i, on the value of m at which Equation 3 will be satisfied.

Theorem 2. Given m_min large enough to put \hat{\mu}_f and \hat{\sigma}_f in their asymptotic normal regimes, with probability at least 1 - 2\alpha Algorithm 1 terminates with m \in O\!\left( \frac{\sigma_f^2}{\mu_f^2} + \frac{\sigma_f}{|\mu_f|}\sqrt{\frac{\mu_{4f}}{\sigma_f^4} - 1} \right).

Proof.
The termination condition is driven by \hat{\sigma}_f^2/\hat{\mu}_f^2, so we proceed by bounding this ratio. First, with probability 1 - \alpha we have a lower bound on the absolute value of the sample mean: |\hat{\mu}_f| \geq |\mu_f| - z_{\alpha/2}\hat{\sigma}_f/\sqrt{m}. Next, because the sample variance is asymptotically distributed as N(\sigma_f^2, (\mu_{4f} - \sigma_f^4)/m), where \mu_{4f} is the fourth central moment, we can apply the delta method to infer that \hat{\sigma}_f converges in distribution to N\!\left(\sigma_f, (\mu_{4f} - \sigma_f^4)/(4\sigma_f^2 m)\right). Using the normal-based confidence interval, this gives the following 1 - \alpha upper bound on the sample standard deviation: \hat{\sigma}_f \leq \sigma_f + z_{\alpha/2}\sqrt{\mu_{4f} - \sigma_f^4}/(2\sigma_f\sqrt{m}). We now combine these bounds, but since we only know that each bound individually covers at least a 1 - \alpha fraction of outcomes, we can only guarantee that they hold jointly with probability at least 1 - 2\alpha, giving the following 1 - 2\alpha bound:

\frac{\hat{\sigma}_f}{|\hat{\mu}_f|} \;\leq\; \frac{\sigma_f + z_{\alpha/2}\sqrt{\mu_{4f} - \sigma_f^4}\,/\,(2\sigma_f\sqrt{m})}{|\mu_f| - z_{\alpha/2}\hat{\sigma}_f/\sqrt{m}}.

Combining this with Equation 3 and solving the resulting quadratic for \sqrt{m} shows that, with probability at least 1 - 2\alpha, the algorithm will terminate with m no larger than:

m \;\leq\; \frac{z_{\alpha/2}^2(1+2\epsilon)^2}{\epsilon^2}\,\frac{\sigma_f^2}{\mu_f^2} \;+\; \frac{z_{\alpha/2}^2(1+2\epsilon)}{\epsilon}\,\frac{\sigma_f}{|\mu_f|}\sqrt{\frac{\mu_{4f}}{\sigma_f^4} - 1}.    (4)

Three aspects of this bound are salient. First, computation time is liberated from dataset size: the sample complexity depends only on the distributional features (\sigma_f^2, \mu_f, and \mu_{4f}) of the summation terms, and not on the number of terms. For i.i.d. datasets in particular, these distributional features are convergent, which means the sample (and hence computational) complexity converges to a constant while speedup becomes unbounded as the dataset size goes to infinity. Second, the bound has a sensible dependence on \sigma_f/|\mu_f| and \mu_{4f}/\sigma_f^4. The former is a standard dispersion measure known as the coefficient of variation, and the latter is the kurtosis. Algorithm 1 therefore gives the greatest speedup for summations whose terms have low dispersion and low kurtosis.
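The iterative procedure of Algorithm 1 amounts to a short loop around the stopping rule of Equation 3. The sketch below is a simplified rendering under assumptions of ours: the z-value is fixed for \alpha = 0.05, sampling is uniform with replacement, and all names are hypothetical:

```python
import math
import random

def mc_approx_flat(f, index, eps, z=1.96, m_min=100, batch=100):
    """Sketch of Algorithm 1: estimate sum_{i in index} f(i).

    Draws terms uniformly with replacement, and stops once the
    empirical condition m >= z^2 (1+eps)^2 var / (eps^2 mean^2)
    (Equation 3) holds; returns |index| times the sample mean.
    """
    n = len(index)
    samples = [f(random.choice(index)) for _ in range(m_min)]
    while True:
        m = len(samples)
        mu = sum(samples) / m
        var = sum((s - mu) ** 2 for s in samples) / (m - 1)
        m_thresh = ((z * (1 + eps) / eps) ** 2 * var / (mu * mu)
                    if mu != 0 else float('inf'))
        if m >= m_thresh:
            return n * mu
        # Algorithm 1 requests the estimated shortfall m_thresh - m;
        # here the request is capped to keep batches bounded.
        if math.isinf(m_thresh):
            need = batch
        else:
            need = min(max(batch, int(math.ceil(m_thresh)) - m), 10 * batch)
        samples += [f(random.choice(index)) for _ in range(need)]
```

Zero-variance terms terminate immediately with the exact answer, and low-dispersion terms terminate after m_min samples, illustrating the coefficient-of-variation dependence of Theorem 2.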
The intuition is that sampling is most efficient when values are concentrated tightly in a few clusters, making it easy to obtain a representative sample set. This motivates the additional speedup we later gain by stratifying the dataset into low-variance regions.

Finally, the sample complexity bound indicates whether Algorithm 1 will actually give speedup on any particular problem. For a given summation, define the speedup as the total number of terms n divided by the number of terms evaluated by the approximation. For a desired speedup \gamma, we need n \geq \gamma\, m_{bound}, where m_{bound} is the expression in Equation 4. This is the fundamental characterization of whether speedup will be attained.

4 Multi-stage Monte Carlo

We now turn to the inductive case of nested summations, i.e. Equation 2. The approach is to apply the single-stage Monte Carlo algorithm over the terms f_i as before, but with recursive invocation to obtain approximations for the arguments G_j. Algorithm 2 specifies this procedure.

Algorithm 2 Iterative Monte Carlo approximation for nested summations.

MC-Approx: as in Algorithm 1
calcStats: as in Algorithm 1

addSamples(samples, m_needed, S, X_c):
    for i = 1 to m_needed
        X_i \leftarrow rand(S.I)
        approxArgs \leftarrow map(MC-Approx(\cdot, X_c \cup X_i, \epsilon, \alpha), S.G_j)
        samples \leftarrow samples \cup S.f(X_c, approxArgs)

Theorem 3. Given m_min large enough to put \hat{\mu}_f in its asymptotic normal regime, with probability at least 1 - \alpha Algorithm 2 approximates the summation S with relative error no greater than \epsilon.

Proof. We begin by noting that the proof of correctness for Algorithm 1 rests on 1) the ability to sample from a distribution P_f whose expectation is \mu_f = \frac{1}{n}\sum f_i, and 2) the ability to invoke the CLT on the sample mean \hat{\mu}_f in terms of its variance \hat{\sigma}_f^2/m. Given these properties, Equation 3 follows as a sufficient condition for relative error no greater than \epsilon with probability at least 1 - \alpha. We therefore need only establish that Algorithm 2 samples from a distribution having these properties.
For each sampled f_i, let \hat{G}_j be the approximation of each argument G_j. We assume each \hat{G}_j has been approximated from sufficiently many samples to be in an asymptotic CLT regime. Because the \hat{G}_j are recursively approximated by Algorithm 2, this is an inductive hypothesis, with the remainder of the proof showing that if the hypothesis holds for the recursive invocations, it also holds for the outer invocation. The base case, where all invocations must bottom out, is the type-B summation already handled in Theorem 1. Let \hat{G}_m = (\hat{G}_{m_1}, \hat{G}_{m_2}, \ldots) be the vector of \hat{G}_j values after each \hat{G}_j has been estimated from m_j samples, and let G^* be the vector of true G_j values. Since each component \hat{G}_j converges in distribution to N(G_j, \sigma_j^2/m_j), \hat{G}_m satisfies \hat{G}_m \sim N(G^*, \Sigma_m). We leave the detailed entries of the covariance matrix \Sigma_m unspecified, except to note that its jjth element is \sigma_j^2/m_j, and that its off-diagonal elements may be non-zero if the \hat{G}_j are generated in a correlated way (this can be used as a variance reduction technique).

Given the asymptotic normality of \hat{G}_m, the same arguments used to derive the multivariate delta method can be used, with some modification, to show that f_i(\hat{G}_m) \sim N\!\left(f_i(G^*), \nabla f(G^*)^T \Sigma_m \nabla f(G^*)\right). Thus, asymptotically, f_i(\hat{G}_m) is normally distributed around its true value with a variance that depends on both the gradient of f and the covariance matrix of the argument vector \hat{G}_m.

This being the case, uniform sampling of the f_i is equivalent to sampling from a probability distribution \tilde{P}_f that gives weight 1/n to the normal distribution of each f_i. The expectation of f_i over \tilde{P}_f is \mu_f, and since the algorithm uses a simple sample mean, the CLT does apply. Given enough samples to put the sample mean in its CLT regime, Equation 3 holds as an error criterion, and Algorithm 2 simply increases the sample size until the criterion is met.
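In code, the only change from the flat case is that each sampled term first approximates its nested arguments recursively. The sketch below uses a hypothetical summation-spec interface of our own devising (a dict with an index set, a summand, and child specs) and a z-value fixed for \alpha = 0.05:

```python
import random

def mc_approx(spec, X_c, eps, z=1.96, m_min=50, batch=50):
    """Recursive Monte Carlo for nested summations (Algorithm 2 sketch).

    `spec` is a dict with:
      'index':  function X_c -> list of candidate X_i (the set I(X_c))
      'f':      summand f(X_c, X_i, *approximated nested sums)
      'nested': list of child specs G_j (empty for a base-case sum)
    """
    idx = spec['index'](X_c)
    n = len(idx)
    samples = []

    def draw(count):
        for _ in range(count):
            X_i = random.choice(idx)
            # Recursively approximate each nested argument G_j(X_c, X_i).
            args = [mc_approx(g, (X_c, X_i), eps, z, m_min, batch)
                    for g in spec['nested']]
            samples.append(spec['f'](X_c, X_i, *args))

    draw(m_min)
    while True:
        m = len(samples)
        mu = sum(samples) / m
        var = sum((s - mu) ** 2 for s in samples) / (m - 1)
        # Stopping rule of Equation 3, applied to the outer sample mean.
        if mu != 0 and m >= (z * (1 + eps) / eps) ** 2 * var / (mu * mu):
            return n * mu
        draw(batch)
```

A base-case spec has an empty 'nested' list and reduces to the flat algorithm; a two-level spec nests one child spec, mirroring the outer summation of S_{KR}.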
Note that the variance over \tilde{P}_f works out to \tilde{\sigma}_f^2 = \sigma_f^2 + \frac{1}{n}\sum_{i \in I} \tau_i^2, where \tau_i^2 = \nabla f(G^*)^T \Sigma_m \nabla f(G^*) is the delta-method variance of the approximated f_i. In other words, the variance under recursive approximation is the exact variance \sigma_f^2 plus the average of the variances \tau_i^2 of the approximated f_i. Likewise one could write an expression for the kurtosis \tilde{\mu}_{4f}. Because we are still dealing with a sample mean, Theorem 2 still holds in the nested case.

Corollary 2.1. Given m_min large enough to put \hat{\mu}_f and \hat{\tilde{\sigma}}_f in their asymptotic normal regimes, with probability at least 1 - 2\alpha Algorithm 2 terminates with m \in O\!\left( \frac{\tilde{\sigma}_f^2}{\mu_f^2} + \frac{\tilde{\sigma}_f}{|\mu_f|}\sqrt{\frac{\tilde{\mu}_{4f}}{\tilde{\sigma}_f^4} - 1} \right).

Note that the 1 - \alpha confidences and relative error bounds of the interior approximations \hat{G}_j do not pass through or compound in the overall estimator \hat{\mu}_f: their influence appears in the variance \tau_i^2 of each sampled f_i, which in turn contributes to the overall variance \tilde{\sigma}_f^2, and the error arising from \tilde{\sigma}_f^2 is independently controlled by the outermost sampling procedure.

Algorithm 3 Iterative Monte Carlo approximation for nested summations with stratification.

MC-Approx: as in Algorithm 1

calcStats(strata, samples):
    m \leftarrow count(samples)
    \hat{\mu}_f \leftarrow stratAvg(strata, samples)
    \hat{\sigma}_f^2 \leftarrow stratVar(strata, samples)
    return m, \hat{\mu}_f, \hat{\sigma}_f^2

addSamples(strata, samples, m_needed, S, X_c):
    needPerStrat \leftarrow optAlloc(samples, strata, m_needed)
    for s = 1 to strata.count
        m_s \leftarrow needPerStrat[s]
        for i = 1 to m_s
            X_i \leftarrow rand(strata[s])
            approxArgs \leftarrow map(MC-Approx(\cdot, X_c \cup X_i, \epsilon, \alpha), S.G_j)
            samples[s] \leftarrow samples[s] \cup S.f(X_c, approxArgs)

5 Variance Reduction

With Algorithm 2 we have coverage of the entire generalized summation problem class, and our focus turns to maximizing efficiency. As noted above, Theorem 2 implies that we need fewer samples when the summation terms are tightly concentrated in a few clusters. We formalize this by spatially partitioning the data to enable a stratified sampling scheme.
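A minimal illustration of this partitioning idea, splitting along the dimension of highest variance and then estimating with a stratified sample mean, might look like the following. This is a simplification of ours (two-way median splits, proportional rather than optimal allocation), and the helper names are hypothetical:

```python
import numpy as np

def split_highest_variance(data, depth=3):
    """Recursively split along the dimension of highest variance,
    the modified kd-tree rule used for stratification."""
    if depth == 0 or len(data) < 4:
        return [data]
    dim = int(np.argmax(data.var(axis=0)))
    med = np.median(data[:, dim])
    left, right = data[data[:, dim] <= med], data[data[:, dim] > med]
    if len(left) == 0 or len(right) == 0:
        return [data]  # degenerate split: stop here
    return (split_highest_variance(left, depth - 1)
            + split_highest_variance(right, depth - 1))

def stratified_mean(strata, f, m):
    """Stratified estimate of the mean of f over all points
    (Equation-5-style combination, proportional allocation q_j = p_j)."""
    n = sum(len(s) for s in strata)
    est = 0.0
    for s in strata:
        p = len(s) / n                      # fraction of terms in stratum
        m_j = max(1, int(round(p * m)))     # samples allotted to stratum
        idx = np.random.randint(0, len(s), size=m_j)
        est += p * np.mean([f(s[i]) for i in idx])
    return est
```

When the data fall into tight clusters, the splits isolate near-zero-variance strata, and a handful of samples per stratum recovers the mean almost exactly, which is precisely the regime Theorem 2 rewards.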
Additionally, by use of correlated sampling we induce covariance between recursively estimated summations whenever the overall variance can be reduced by doing so. Combining these techniques with recursive Monte Carlo makes for an extremely fast, accurate, and general approximation scheme.

Stratification. Stratification is a standard Monte Carlo principle whereby the values being sampled are partitioned into subsets (strata) whose contributions are separately estimated and then combined. The idea is that strata with higher variance can be sampled more heavily than those with lower variance, making more efficient use of samples than uniform sampling does. Applying this principle requires an effective partitioning scheme for each new domain of interest. In the case of generalized summations, the values being sampled are the f_i, which are not known a priori and cannot be directly stratified. However, since f is generally a function with some degree of continuity, its output is similar for similar values of its arguments. We therefore stratify the argument space, i.e. the input datasets, by means of spatial partitioning structures. Though any spatial partitioning could be used, in this work we use kd-trees modified to split along the dimension of highest variance, since we seek variance reduction. The approximation procedure runs as before, except that the sampling and sample statistics are modified to make use of the trees. Trees are expanded greedily up to a user-specified number of nodes, guided by the heuristic of next expanding the node with the largest size times average per-dimension variance. This heuristic is justified below by the variance expression for the stratified sample mean. The procedure is summarized in Algorithm 3, and we now establish its error guarantee.

Theorem 4.
Given m_min large enough to put \hat{\mu}_f in its asymptotic normal regime, with probability at least 1 - \alpha Algorithm 3 approximates the summation S with relative error no greater than \epsilon.

Proof. Identical to Theorem 3, except that we must establish that 1) the sample mean remains unbiased under stratification, and 2) the CLT still holds under stratification. These turn out to be standard properties of the stratified sample mean and variance (see [7]):

\hat{\mu}_{fs} = \sum_j p_j \hat{\mu}_j,    (5)

\hat{\sigma}_{fs}^2 = \sum_j \frac{p_j^2 \hat{\sigma}_j^2}{m_j} = \frac{1}{m}\sum_j \frac{p_j^2 \hat{\sigma}_j^2}{q_j},    (6)

where j indexes the strata, \hat{\mu}_j and \hat{\sigma}_j^2 are the sample mean and variance of stratum j, p_j is the fraction of summation terms in stratum j, and q_j = m_j/m is the fraction of samples drawn from stratum j. Algorithm 3 modifies the addSamples subroutine to sample in stratified fashion, and computes the stratified sample mean and variance in calcStats. Since these estimators satisfy the two conditions necessary for the error guarantee, the theorem is established.

In [7] it is shown that \sigma_{fs}^2 \leq \sigma_f^2, i.e. stratification never increases variance, and that any refinement of a stratification can only reduce \sigma_{fs}^2. Although the sample allocation fractions q_j can be chosen arbitrarily, \sigma_{fs}^2 is minimized when q_j \propto p_j \sigma_j. With this optimal allocation, \sigma_{fs}^2 reduces to \frac{1}{m}\left(\sum_j p_j \sigma_j\right)^2. This motivates our kd-tree expansion heuristic, described above, which first splits the nodes with the highest p_j \sigma_j, i.e. the nodes contributing the most to the variance under optimal allocation. While we never know the \sigma_j exactly, Algorithm 3 uses the sample estimates \hat{\sigma}_j at each stage to approximate the optimal allocation (this is the optAlloc routine). Finally, the Theorem 2 sample complexity still holds for the CLT-governed stratified sample mean.

Corollary 2.2. Given m_min large enough to put \hat{\mu}_{fs} and \hat{\sigma}_{fs} in their asymptotic normal regimes, with probability at least 1 - 2\alpha Algorithm 3 terminates with m \in O\!\left( \frac{\sigma_{fs}^2}{\mu_f^2} + \frac{\sigma_{fs}}{|\mu_f|}\sqrt{\frac{\mu_{4fs}}{\sigma_{fs}^4} - 1} \right).

Correlated Sampling.
The variance of a recursively estimated f_i, as expressed by \nabla f(G^*)^T \Sigma_m \nabla f(G^*), depends on the full covariance matrix of the estimated arguments. If the gradient of f is such that the variance of f_i depends negatively (positively) on a covariance \Sigma_{jk}, we can reduce the variance by inducing positive (negative) covariance between \hat{G}_j and \hat{G}_k. Covariance can be induced by sharing sampled points across the estimates of G_j and G_k, assuming they both draw on the same datasets. In some cases the expression for f_i's variance is such that the effect of correlated sampling is data-dependent; when this happens, it is easy to try it and check whether it helps. All experiments presented here benefited from correlated sampling on top of stratification.

6 Experiments

We present experimental results in two phases. First, we compare stratified multi-stage Monte Carlo approximations to exact evaluations on tractable datasets, and show that the error distributions conform closely to our asymptotic theory. Second, having verified accuracy to the extent possible, we run our method on datasets containing millions of points in order to show 1) validation of the theoretical prediction that runtime is roughly independent of dataset size, and 2) many orders of magnitude of speedup (as high as 10^14) relative to exact computation. These results are presented for two dataset-method pairs: kernel regression on a dataset containing 2 million 4-dimensional redshift measurements used for quasar identification, and kernel conditional density estimation on an n-body galaxy simulation dataset containing 3.5 million 3-dimensional locations. In the KR case, the fourth dimension is regressed against the other three, while in KCDE the distribution of the third dimension is predicted as a function of the first two. In both cases we evaluate the cross-validated score functions used for bandwidth optimization, i.e. S_{KR} and S_{KCDE} as described in Section 2.

Error Control.
The objective of the first set of experiments is to validate the guarantee that relative error will be at most \epsilon with probability 1 - \alpha. We measured the distribution of error on a series of random data subsets up to the largest size for which the exact computation was tractable. For the O(n^2) S_{KR} the limit was n = 10K, while for the O(n^3) S_{KCDE} it was n = 250. For each dataset we randomly chose and evaluated 100 bandwidths with 1 - \alpha = 0.95 and \epsilon = 0.1. Figure 1 shows the full quantile spreads of the relative errors. The most salient feature is the relationship of the 95% quantile line (dashed) to the threshold line (solid) at \epsilon = 0.1. Full compliance with the asymptotic theory would require the dashed line never to rise above the solid one. This is essentially the case for KCDE,¹ while the KR line never exceeds 0.134. The approximation is therefore quite good, and could be improved if desired by increasing m_min or the number of strata, but in this case we chose to trade a slight increase in error for an increase in speed.

Speedup. Given the validation of the error guarantees, we now turn to computational performance. As before, we ran on a series of random subsets of the data, this time with n ranging into the millions. At each value of n, we randomly chose and evaluated 100 bandwidths, measuring the time for each evaluation. Figure 2 presents the average evaluation time versus dataset size for both methods. The most striking feature of these graphs is their flatness as n increases by orders of magnitude. This accords with Theorem 2 and its corollaries, which predict sample and computational complexity independent of dataset size. Speedups² for KR range from 1.8 thousand at n = 50K to 2.8 million at n = 2M. KCDE speedups range from 70 million at n = 50K to 10^14 at n = 3.5M. These results are many orders of magnitude better than any previous state of the art.

¹The spike in the max quantile is due to a single outlier point.
²All speedups are relative to extrapolated runtimes based on the O(\cdot) order of the exact computation.

Figure 1: Error distribution vs. dataset size for KR (left) and KCDE (right).

Figure 2: Runtime vs. dataset size for KR (left) and KCDE (right). Error bars are one standard deviation.

7 Conclusion

We have presented a multi-stage stratified Monte Carlo method for approximating a broad class of generalized summations. These summations are related to generalized n-body problems, and are common in machine learning. We developed theory for this Monte Carlo method that predicts 1) relative error no greater than \epsilon with probability at least 1 - \alpha, for user-specified \epsilon and \alpha, and 2) sample and computational complexity independent of dataset size. We presented experiments validating these theoretical guarantees on real datasets, where we accelerated intractable kernel cross-validation scores by factors up to 10^14. This performance goes many orders of magnitude beyond the previous state of the art. In addition to applications, future work will likely include automatic selection of stratification granularity, additional variance reduction techniques, and further generalization to other computational bottlenecks such as linear algebraic operations.

References

[1] Nicol N. Schraudolph and Thore Graepel. Combining conjugate direction methods with stochastic approximation of gradients.
In Workshop on Artificial Intelligence and Statistics (AISTATS), 2003.
[2] Alexander G. Gray and Andrew W. Moore. N-body problems in statistical learning. In Advances in Neural Information Processing Systems (NIPS) 13, 2000.
[3] Mike Klaas, Mark Briers, Nando de Freitas, and Arnaud Doucet. Fast particle smoothing: If I had a million particles. In International Conference on Machine Learning (ICML), 2006.
[4] Ping Wang, Dongryeol Lee, Alexander Gray, and James M. Rehg. Fast mean shift with accurate and stable convergence. In Workshop on Artificial Intelligence and Statistics (AISTATS), 2007.
[5] Michael P. Holmes, Alexander G. Gray, and Charles Lee Isbell, Jr. Fast nonparametric conditional density estimation. In Uncertainty in Artificial Intelligence (UAI), 2007.
[6] Reuven Y. Rubinstein. Simulation and the Monte Carlo Method. John Wiley & Sons, 1981.
[7] Paul Glasserman. Monte Carlo Methods in Financial Engineering. Springer-Verlag, 2004.