Fast High-dimensional Kernel Summations Using the Monte Carlo Multipole Method

Dongryeol Lee
Computational Science and Engineering
Georgia Institute of Technology
Atlanta, GA 30332
dongryel@cc.gatech.edu

Alexander Gray
Computational Science and Engineering
Georgia Institute of Technology
Atlanta, GA 30332
agray@cc.gatech.edu

Abstract

We propose a new fast Gaussian summation algorithm for high-dimensional datasets with high accuracy. First, we extend the original fast multipole-type methods to use approximation schemes with both hard and probabilistic error. Second, we utilize a new data structure called the subspace tree, which maps each data point in a node to its lower-dimensional mapping as determined by any linear dimension reduction method such as PCA. This new data structure is suitable for reducing the cost of each pairwise distance computation, the dominant cost in many kernel methods. Our algorithm guarantees probabilistic relative error on each kernel sum, and can be applied to high-dimensional Gaussian summations, which are ubiquitous inside many kernel methods as the key computational bottleneck. We provide empirical speedup results on low- to high-dimensional datasets of up to 89 dimensions.

1 Fast Gaussian Kernel Summation

In this paper, we propose new computational techniques for efficiently approximating the following sum for each query point q_i ∈ Q:

    Φ(q_i, R) = Σ_{r_j ∈ R} exp(−||q_i − r_j||² / (2h²))    (1)

where R is the reference set; each reference point is associated with a Gaussian function with a smoothing parameter h (the 'bandwidth'). This form of summation is ubiquitous in many statistical learning methods, including kernel density estimation, kernel regression, Gaussian process regression, radial basis function networks, spectral clustering, support vector machines, and kernel PCA [1, 4]. Cross-validation in all of these methods requires evaluating Equation 1 for multiple values of h. Kernel density estimation, for example, requires |R| density estimates, each based on |R| − 1 points, yielding a brute-force computational cost that scales quadratically, that is O(|R|²).

Error bounds. Due to its expensive computational cost, many algorithms approximate the Gaussian kernel sums at the expense of reduced precision. Therefore, it is natural to discuss error bound criteria which measure the quality of the approximations with respect to their corresponding true values. The following error bound criteria are common in the literature:

Definition 1.1. An algorithm guarantees ε absolute error bound, if for each exact value Φ(q_i, R) for q_i ∈ Q, it computes Φ̃(q_i, R) such that |Φ̃(q_i, R) − Φ(q_i, R)| ≤ ε.

Definition 1.2. An algorithm guarantees ε relative error bound, if for each exact value Φ(q_i, R) for q_i ∈ Q, it computes Φ̃(q_i, R) ∈ ℝ such that |Φ̃(q_i, R) − Φ(q_i, R)| ≤ ε |Φ(q_i, R)|.
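For reference, the following is a minimal NumPy sketch of the brute-force evaluation of Equation 1, whose O(|Q| |R| D) cost is what the algorithms discussed below are designed to avoid. The function name and structure are illustrative; the paper's actual implementation is in C/C++.

import numpy as np

def naive_gaussian_sum(Q, R, h):
    """Brute-force evaluation of Equation 1: Phi(q_i, R) for every query q_i.

    Q : (n_q, D) array of query points; R : (n_r, D) array of reference points;
    h : Gaussian bandwidth.  Cost is O(n_q * n_r * D).
    """
    sums = np.empty(len(Q))
    for i, q in enumerate(Q):
        sq_dists = np.sum((R - q) ** 2, axis=1)          # ||q - r_j||^2 for all r_j
        sums[i] = np.exp(-sq_dists / (2.0 * h * h)).sum()
    return sums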
Previous work. The most successful class of acceleration methods employs "higher-order divide and conquer" or generalized N-body algorithms (GNA) [4]. This approach can use any spatial partitioning tree, such as kd-trees or ball-trees, for both the query set Q and the reference set R, and performs a simultaneous recursive descent on both trees. GNA with relative error bounds (Definition 1.2) [5, 6, 11, 10] utilized bounding boxes and additional cached sufficient statistics, such as the higher-order moments needed for series expansion. [5, 6] utilized bounding-box based error bounds, which tend to be very loose and resulted in slow empirical performance around suboptimally small and large bandwidths. [11, 10] extended GNA-based Gaussian summations with series expansion, which provided tighter bounds; this showed enormous performance improvements, but only in low-dimensional settings (up to D = 5), since the number of terms required in the series expansion increases exponentially with D. [9] introduces an iterative sampling-based GNA for accelerating the computation of nested sums (a related, easier problem). Its speedup is achieved by replacing the pessimistic error bounds provided by bounding boxes with normal-based confidence intervals from Monte Carlo sampling. [9] demonstrates speedups of many orders of magnitude over the previous state of the art in the context of computing aggregates over the queries (such as the LSCV score for selecting the optimal bandwidth). However, the authors did not discuss the sampling-based approach for computations that require per-query estimates, such as those required for kernel density estimation.

Bounding the relative error (e.g., the percentage deviation) is much harder because the error bound criterion is stated in terms of the initially unknown exact quantity. As a result, many previous methods [7] have focused on bounding the absolute error. The relative error bound criterion is preferred to the absolute error bound criterion in statistical applications in which high accuracy is desired. Our new algorithm will enforce the following "relaxed" form of the relative error bound criterion, whose motivation will be discussed shortly.

Definition 1.3. An algorithm guarantees (1 − α) probabilistic ε relative error bound, if for each exact value Φ(q_i, R) for q_i ∈ Q, it computes Φ̃(q_i, R) ∈ ℝ such that |Φ̃(q_i, R) − Φ(q_i, R)| ≤ ε |Φ(q_i, R)| with probability at least 1 − α, where 0 < 1 − α < 1.

None of the previous approaches for kernel summations addresses the issue of reducing the computational cost of each distance computation, which incurs O(D) cost. However, the intrinsic dimensionality d of most high-dimensional datasets is much smaller than the explicit dimension D (that is, d ≪ D). [12] proposed tree structures using a global dimension reduction method, such as random projection, as a preprocessing step for efficient (1 + ε) approximate nearest neighbor search. Similarly, we develop a new data structure for kernel summations; our new data structure is constructed in a top-down fashion to perform the initial spatial partitioning in the original input space ℝ^D, and performs a local dimension reduction on a localized subset of the data in a bottom-up fashion.

This paper. We propose a new fast Gaussian summation algorithm that enables speedup in higher dimensions. Our approach utilizes: 1) probabilistic relative error bounds (Definition 1.3) on kernel sums provided by Monte Carlo estimates, and 2) a new tree structure called the subspace tree for reducing the computational cost of each distance computation. The former can be seen as relaxing the strict requirement of guaranteeing a hard relative bound on very small quantities, as done in [5, 6, 11, 10]. The latter was mentioned as a possible way of ameliorating the effects of the curse of dimensionality in [14], a pioneering paper in this area.

Notations. Each query point and reference point (a D-dimensional vector) is indexed by natural numbers i, j ∈ ℕ, and denoted q_i and r_j respectively. For any set S, |S| denotes the number of elements in S. The entities related to the left and the right child are denoted with superscripts L and R; an internal node N has the child nodes N^L and N^R.
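The experiments in Section 4 report, for each run, the fraction of query points whose approximation violates the ε relative error criterion of Definitions 1.2 and 1.3. A small sketch of that check is given below; the names are illustrative and not taken from the paper's code.

import numpy as np

def relative_error_violation_rate(exact_sums, approx_sums, eps):
    """Fraction of queries violating |Phi_tilde - Phi| <= eps * |Phi|.

    Under a (1 - alpha) probabilistic guarantee, this fraction should stay
    around alpha or below (e.g. at most about 0.1 when p = 1 - alpha = 0.9).
    """
    rel_err = np.abs(approx_sums - exact_sums) / np.abs(exact_sums)
    return np.mean(rel_err > eps)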
2 Gaussian Summation by Monte Carlo Sampling

Here we describe the extension needed for probabilistic computation of the kernel summation satisfying Definition 1.3. The main routine for the probabilistic kernel summation is shown in Algorithm 1. The function MCMM takes the query node Q and the reference node R (initially the roots of the query tree and the reference tree, Q^root and R^root) and β (initially the value α), which controls the probability guarantee that each kernel sum is within ε relative error.

Algorithm 1 The core dual-tree routine for probabilistic Gaussian kernel summation.

MCMM(Q, R, β)
 1: if CANSUMMARIZEEXACT(Q, R, ε) then
 2:   SUMMARIZEEXACT(Q, R)
 3: else
 4:   if CANSUMMARIZEMC(Q, R, ε, β) then
 5:     SUMMARIZEMC(Q, R, ε, β)
 6:   else
 7:     if Q is a leaf node then
 8:       if R is a leaf node then
 9:         MCMMBASE(Q, R)
10:       else
11:         MCMM(Q, R^L, β/2), MCMM(Q, R^R, β/2)
12:     else
13:       if R is a leaf node then
14:         MCMM(Q^L, R, β), MCMM(Q^R, R, β)
15:       else
16:         MCMM(Q^L, R^L, β/2), MCMM(Q^L, R^R, β/2)
17:         MCMM(Q^R, R^L, β/2), MCMM(Q^R, R^R, β/2)

The idea of Monte Carlo sampling used in the new algorithm is similar to the one in [9], except that the sampling is done per query and we also use approximations that provide hard error bounds (i.e. finite difference, and the exhaustive base case MCMMBASE). This means that the approximation has less variance than the pure Monte Carlo approach used in [9]. Algorithm 1 first attempts approximations with hard error bounds, which are computationally cheaper than sampling-based approximations. For example, the finite-difference scheme [5, 6] can be used for the CANSUMMARIZEEXACT and SUMMARIZEEXACT functions in any general dimension.

The CANSUMMARIZEMC function takes two parameters that specify the accuracy: the relative error ε and its probability guarantee β, and decides whether to use Monte Carlo sampling for the given pair of nodes. If the reference node R contains too few points, it may be more efficient to process it using exact methods that use error bounds based on bounding primitives on the node pair, or exhaustive pairwise evaluations; this is determined by the condition:

    τ · m_initial ≤ |R|

where τ > 1 controls the minimum number of reference points needed for Monte Carlo sampling to proceed. If the reference node does contain enough points, then for each query point q ∈ Q, the SAMPLE routine samples m_initial terms of the summation Φ(q, R) = Σ_{r_{j_n} ∈ R} K_h(||q − r_{j_n}||), where Φ(q, R) denotes the exact contribution of R to q's kernel sum. Basically, we are interested in estimating Φ(q, R) by Φ̃(q, R) = |R| μ_S, where μ_S is the sample mean of S. From the Central Limit Theorem, given enough samples m, μ_S ∼ N(μ, σ²_S/m), where Φ(q, R) = |R| μ (i.e. μ is the average kernel value between q and a reference point r ∈ R); this implies that |μ_S − μ| ≤ z_{β/2} σ_S/√m with probability 1 − β. The pruning rule we have to enforce for each query point for the contribution of R is:

    z_{β/2} (σ_S / √m) ≤ ε Φ(q, R^root) / |R^root|

where σ_S is the sample standard deviation of S. Since Φ(q, R^root) is one of the unknown quantities we want to compute, we instead enforce the following:

    z_{β/2} (σ_S / √m) ≤ ε ( Φ^l(q, R^root) + |R| ( μ_S − z_{β/2} σ_S / √m ) ) / |R^root|    (2)

where Φ^l(q, R^root) is the currently running lower bound on the sum computed using exact methods, and |R| ( μ_S − z_{β/2} σ_S / √m ) is the probabilistic component contributed by R. Denoting Φ^{l,new}(q, R^root) = Φ^l(q, R^root) + |R| ( μ_S − z_{β/2} σ_S / √|S| ), the minimum number of samples for q needed to achieve the target error (the right side of the inequality in Equation 2) with probability at least 1 − β is:

    m ≥ z²_{β/2} σ²_S ( ε |R| + |R^root| )² / ( ε² ( Φ^l(q, R^root) + |R| μ_S )² )
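As an illustration of the per-query Monte Carlo step just described (the SUMMARIZEMC / SAMPLE logic, shown later in Algorithm 2), the following NumPy sketch repeatedly samples kernel values until the sample-size condition implied by Equation 2 is met. The fixed normal quantile z, the running lower bound phi_lower, and all names are assumptions made for illustration; the paper's actual implementation is in C/C++.

import numpy as np

def summarize_mc(q, R_points, h, eps, z, phi_lower, n_root, m_initial=25, rng=None):
    """Monte Carlo estimate of q's kernel sum contribution from one reference node.

    R_points  : (|R|, D) points of the reference node; n_root : |R^root|;
    phi_lower : running lower bound on the kernel sum from exact prunes;
    z         : normal quantile z_{beta/2} (e.g. 1.96 for beta = 0.05).
    Returns |R| * mean(S), sampling until the bound of Equation 2 is satisfied.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_R = len(R_points)
    S = np.empty(0)
    m = m_initial
    while m > 0:
        idx = rng.integers(0, n_R, size=m)                    # sample reference points
        d2 = np.sum((R_points[idx] - q) ** 2, axis=1)
        S = np.concatenate([S, np.exp(-d2 / (2 * h * h))])    # sampled kernel values
        mu_S, sigma_S = S.mean(), S.std(ddof=1)
        if sigma_S == 0.0:
            break                                             # no sample variance: stop
        # minimum sample size implied by Equation 2
        m_thresh = (z * sigma_S * (eps * n_R + n_root)) ** 2 / \
                   (eps * (phi_lower + n_R * mu_S)) ** 2
        m = int(np.ceil(m_thresh)) - len(S)                   # draw more if still short
    return n_R * S.mean()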
If the given query node and reference node pair cannot be pruned using either the non-probabilistic or the probabilistic approximations, then we recurse on smaller subsets of the two sets. In particular, when dividing over the reference node R, we recurse with half of the β value (we could also divide β so that the node that may be harder to approximate gets a lower value). We now state the probabilistic error guarantee of our algorithm as a theorem.

Theorem 2.1. After calling MCMM with Q = Q^root, R = R^root, and β = α, Algorithm 1 approximates each Φ(q, R^root) with Φ̃(q, R^root) such that Definition 1.3 holds.

Proof. We induct on |Q ∪ R|. For a query/reference pair (Q, R) and 0 < β < 1, MCMMBASE and SUMMARIZEEXACT compute estimates Φ̃(q, R) for q ∈ Q such that |Φ̃(q, R) − Φ(q, R)| ≤ ε Φ(q, R^root) |R| / |R^root| with probability 1 > 1 − β. By Equation 2, SUMMARIZEMC computes estimates Φ̃(q, R) for q ∈ Q such that |Φ̃(q, R) − Φ(q, R)| ≤ ε Φ(q, R^root) |R| / |R^root| with probability at least 1 − β. Line 11 of Algorithm 1 divides over the reference: its subcalls compute estimates satisfying |Φ̃(q, R^L) − Φ(q, R^L)| ≤ ε Φ(q, R^root) |R^L| / |R^root| and |Φ̃(q, R^R) − Φ(q, R^R)| ≤ ε Φ(q, R^root) |R^R| / |R^root|, each with probability at least 1 − β/2 by the induction hypothesis, so both hold simultaneously with probability at least 1 − β. For q ∈ Q, Φ̃(q, R) = Φ̃(q, R^L) + Φ̃(q, R^R), which means |Φ̃(q, R) − Φ(q, R)| ≤ ε Φ(q, R^root) |R| / |R^root| with probability at least 1 − β. Line 14 divides over the query, and each subcall computes estimates that hold with probability at least 1 − β for q ∈ Q^L and q ∈ Q^R. Lines 16 and 17 divide over both the query and the reference, and the correctness can be proven similarly. Therefore, MCMM(Q^root, R^root, α) computes estimates satisfying Definition 1.3.

"Reclaiming" probability. We note that the probability assigned to a query/reference pair that is computed with exact bounds (SUMMARIZEEXACT and MCMMBASE) is not used. This portion of the probability can be "reclaimed" in a similar fashion as done in [10] and re-used to prune more aggressively in the later stages of the algorithm. All experiments presented in this paper benefited from this simple modification.

3 Subspace Tree

A subspace tree is basically a space-partitioning tree with a set of orthogonal bases associated with each node N: N = (μ, U, Λ, d), where μ is the mean, U is a D × d matrix whose columns consist of d eigenvectors, and Λ the corresponding eigenvalues. The orthogonal basis set is constructed using a linear dimension reduction method such as PCA. The tree is built in a top-down manner using the PARTITIONSET function, which divides the given set of points into two (for a kd-tree, for example, PARTITIONSET divides along the dimension with the highest variance), while the subspace in each node is formed in a bottom-up manner. Algorithm 3 shows the building routine for a PCA tree (a subspace tree using PCA as the dimension reduction); Figure 1 (left) shows a PCA tree for a 3-D dataset. The subspace of each leaf node is computed using PCABASE, which can use the exact PCA [3] or a stochastic one [2].

Algorithm 3 PCA tree building routine.

BUILDPCATREE(P)
  if CANPARTITION(P) then
    {P^L, P^R} ← PARTITIONSET(P)
    N ← empty node
    N^L ← BUILDPCATREE(P^L)
    N^R ← BUILDPCATREE(P^R)
    N.S ← MERGESUBSPACES(N^L.S, N^R.S)
  else
    N ← BUILDPCATREEBASE(P)
    N.S ← PCABASE(P)
  N.P_proj ← PROJECT(P, N.S)
  return N
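The PCABASE leaf step can be illustrated with a short NumPy sketch that computes a leaf node's (μ, U, Λ, d) by exact PCA and projects the node's points onto the subspace. The rule used here for choosing d (keeping enough eigenvectors to explain a fixed fraction of the variance) is an illustrative assumption, not something specified by the paper.

import numpy as np

def pca_base(P, var_fraction=0.95):
    """Compute a leaf node's subspace (mu, U, Lambda, d) by exact PCA and
    project the node's points onto it, as in the subspace tree definition.

    P : (n, D) array of points owned by the node."""
    mu = P.mean(axis=0)
    X = P - mu
    cov = X.T @ X / max(len(P) - 1, 1)          # D x D sample covariance
    evals, evecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(evals)[::-1]             # sort descending
    evals, evecs = evals[order], evecs[:, order]
    # illustrative rule: keep the smallest d explaining var_fraction of variance
    ratio = np.cumsum(evals) / max(evals.sum(), 1e-12)
    d = int(np.searchsorted(ratio, var_fraction) + 1)
    U, Lam = evecs[:, :d], evals[:d]            # D x d basis, top-d eigenvalues
    P_proj = X @ U                              # x_proj = U^T (x - mu) for each x
    return mu, U, Lam, d, P_proj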
For an internal node, the subspaces of the child nodes, N^L = (μ^L, U^L, Λ^L, d^L) and N^R = (μ^R, U^R, Λ^R, d^R), are approximately merged using the MERGESUBSPACES function, which involves solving a (d^L + d^R + 1) × (d^L + d^R + 1) eigenvalue problem [8] that runs in O((d^L + d^R + 1)³) ≪ O(D³), given that the dataset is sparse. In addition, each data point x in each node N is mapped to its new lower-dimensional coordinate using the orthogonal basis set of N: x_proj = U^T (x − μ). The L2 norm reconstruction error is given by ||x_recon − x||²₂ = ||(U x_proj + μ) − x||²₂.

Algorithm 2 Monte Carlo sampling based approximation routines.

CANSUMMARIZEMC(Q, R, ε, β)
  return τ · m_initial ≤ |R|

SUMMARIZEMC(Q, R, ε, β)
  for q_i ∈ Q do
    S ← ∅, m ← m_initial
    repeat
      SAMPLE(q_i, R, ε, β, S, m)
    until m ≤ 0
    Φ̃(q_i, R) ← Φ̃(q_i, R) + |R| · MEAN(S)

SAMPLE(q, R, ε, β, S, m)
  for k = 1 to m do
    r ← random point in R
    S ← S ∪ {K_h(||q − r||)}
  μ_S ← MEAN(S), σ²_S ← VARIANCE(S)
  Φ^{l,new}(q, R^root) ← Φ^l(q, R^root) + |R| ( μ_S − z_{β/2} σ_S / √|S| )
  m_thresh ← z²_{β/2} σ²_S ( ε |R| + |R^root| )² / ( ε² ( Φ^l(q, R^root) + |R| μ_S )² )
  m ← m_thresh − |S|

Monte Carlo sampling using a subspace tree. Consider the CANSUMMARIZEMC function in Algorithm 2. The "outer loop" of this algorithm is over the query set Q, so it makes sense to project each query point q ∈ Q onto the subspace owned by the reference node R. Let U and μ be the orthogonal basis system of R, consisting of d basis vectors. For each q ∈ Q, consider the squared distance ||(q − μ)_proj − r_proj||² (where (q − μ)_proj is q's coordinates expressed in terms of the coordinate system of R), as shown in Figure 1. For the Gaussian kernel, each pairwise kernel value is approximated as:

    exp(−||q − r||² / (2h²)) ≈ exp(−||q − q_recon||² / (2h²)) · exp(−||q_proj − r_proj||² / (2h²))    (3)

where q_recon = U q_proj + μ and q_proj = U^T (q − μ). For a fixed query point q, the factor exp(−||q − q_recon||² / (2h²)) can be precomputed (which takes d dot products between two D-dimensional vectors) and re-used for every distance computation between q and any reference point r ∈ R, whose cost is now O(d) ≪ O(D). Therefore, we can take more samples efficiently. For a total of sufficiently many samples m, the computational cost is O(d(D + m)) ≪ O(D · m) for each query point.

The increased sampling efficiency comes at the cost of inexact distance computations, however. Each distance computation incurs at most a squared L2 norm error of ||r_recon − r||²₂; that is, | ||q − r_recon||² − ||q − r||² | ≤ ||r_recon − r||²₂. Nevertheless, the sample variance for each query point, including the inexactness due to dimension reduction, can be bounded for the Gaussian kernel as (where each s = exp(−||q − r_recon||² / (2h²))):

    σ̃²_S ≤ (1 / (m − 1)) [ ( max_{r ∈ R} exp(||r_recon − r||² / h²) ) Σ_{s ∈ S} s² − m ( μ_S min_{r ∈ R} exp(−||r_recon − r||² / (2h²)) )² ]

Exhaustive computations using a subspace tree. Now suppose we have built subspace trees for both the query and the reference sets. We can project either each query point onto the reference subspace, or each reference point onto the query subspace, depending on which subspace has the smaller dimension and on the number of points in each node. The subspaces formed in the leaf nodes are usually highly numerically accurate, since each leaf contains very few points compared to the extrinsic dimensionality D.
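To make Equation 3 concrete, here is a minimal NumPy sketch of the projected kernel evaluation with the query-dependent factor precomputed; it assumes a node's PCA basis (μ, U) is available (for instance from the pca_base sketch above), and the names are illustrative rather than taken from the paper's implementation.

import numpy as np

def projected_kernel_factory(q, mu, U, h):
    """Return a function approximating exp(-||q - r||^2 / (2 h^2)) via Equation 3,
    using only the d-dimensional projected coordinates r_proj = U^T (r - mu).

    The query-dependent factor exp(-||q - q_recon||^2 / (2 h^2)) is computed once,
    so each subsequent kernel evaluation costs O(d) instead of O(D)."""
    q_proj = U.T @ (q - mu)                     # d-dimensional query coordinates
    q_recon = U @ q_proj + mu                   # reconstruction of q in R^D
    q_factor = np.exp(-np.sum((q - q_recon) ** 2) / (2 * h * h))

    def approx_kernel(r_proj):
        d2 = np.sum((q_proj - r_proj) ** 2)     # in-subspace squared distance
        return q_factor * np.exp(-d2 / (2 * h * h))

    return approx_kernel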
4 Experimental Results

We empirically evaluated the runtime performance of our algorithm on seven real-world datasets, scaled to fit in the [0, 1]^D hypercube, approximating the Gaussian sum at every query point over a range of bandwidths. This experiment is motivated by the many kernel methods that require computing the Gaussian sum at several bandwidth values (for example, according to the standard least-squares cross-validation score [15]). Nevertheless, we emphasize that the acceleration results are applicable to other kernel methods that require efficient Gaussian summation. In this paper, the reference set equals the query set. All datasets have 50K points so that the exact exhaustive method can be tractably computed. All times are in seconds and include the time needed to build the trees. Codes are in C/C++ and run on a dual Intel Xeon 3GHz with 8 GB of main memory. The measurements in the second to eighth columns are obtained by running the algorithms at the bandwidth kh, where 10^-3 ≤ k ≤ 10^3 is the constant in the corresponding column header. The last column (total) denotes the total time needed to run on all seven bandwidth values. Each table has results for five algorithms: the naive algorithm and four tree-based algorithms. The algorithms with p = 1 denote the previous state of the art (finite-difference with error redistribution) [10], while those with p < 1 denote our probabilistic version. Each entry lists the running time and the percentage of the query points that did not satisfy the relative error ε.

Analysis. Readers should focus on the last column of each table, containing the total time needed to evaluate the Gaussian sum at all query points for the seven different bandwidth values. As expected, on low-dimensional datasets (below 6 dimensions), the algorithm using series-expansion based bounds gives a two to three times speedup compared to our approach that uses Monte Carlo sampling. Multipole moments are an effective form of compression in low dimensions, with analytical error bounds that can be evaluated; our Monte Carlo based method has an asymptotic error bound which must be "learned" through sampling. From 7 dimensions and beyond, series expansion cannot be done efficiently because of its slow convergence. Our probabilistic algorithm (p = 0.9) using Monte Carlo consistently performs better than the algorithm using exact bounds (p = 1) by at least a factor of two. Compared to the naive algorithm, it achieves a maximum speedup of about nine times on a 16-dimensional dataset; on an 89-dimensional dataset, it is at least three times as fast as the naive method. Note that all the datasets contain only 50K points, and the speedup will be more dramatic as we increase the number of points.

5 Conclusion

We presented an extension of fast multipole methods that uses approximation methods with both hard and probabilistic bounds. Our experimental results show speedups over the previous state of the art on high-dimensional datasets. Our future work will include possible improvements inspired by recent work in the FMM community using a matrix-factorization formulation [13].

Figure 1: Left: A PCA-tree for a 3-D dataset. Right: The squared Euclidean distance between a given query point and a reference point projected onto a subspace can be decomposed into two components: the orthogonal component and the component in the subspace.
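The tables that follow were produced by this kind of bandwidth sweep; a minimal sketch of the protocol (timing a summation method at bandwidths k·h for k from 10^-3 to 10^3 and recording the fraction of queries violating ε) is given below, where approx_method and exact_method are placeholder callables, not APIs from the paper's C/C++ code.

import time
import numpy as np

def bandwidth_sweep(Q, R, h, eps, approx_method, exact_method):
    """Time an approximate Gaussian summation at bandwidths k*h, k in {1e-3,...,1e3},
    and report the fraction of queries violating the eps relative error criterion.

    approx_method(Q, R, bw) and exact_method(Q, R, bw) each return per-query sums;
    exact_method stands in for the brute-force reference computation."""
    results = []
    for k in [1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3]:
        bw = k * h
        start = time.time()
        approx = approx_method(Q, R, bw)
        elapsed = time.time() - start                     # timing for this scale
        exact = exact_method(Q, R, bw)
        rel_err = np.abs(approx - exact) / np.abs(exact)
        results.append((k, elapsed, float(np.mean(rel_err > eps))))
    return results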
mockgalaxy-D-1M-rnd (cosmology: positions), D = 3, N = 50000, h = 0.000768201
Algorithm \ scale          0.001  0.01   0.1    1      10     100    1000   total
Naive                      182    182    182    182    182    182    182    1274
MCMM (ε = 0.1, p = 0.9)    3      3      5      10     26     48     2      97
  % queries violating ε    1%     1%     1%     1%     1%     1%     5%
DFGT (ε = 0.1, p = 1)      2      2      2      2      6      19     3      36
  % queries violating ε    0%     0%     0%     0%     0%     0%     0%
MCMM (ε = 0.01, p = 0.9)   3      3      4      11     27     58     21     127
  % queries violating ε    0%     0%     1%     1%     1%     1%     7%
DFGT (ε = 0.01, p = 1)     2      2      2      2      7      30     5      50
  % queries violating ε    0%     0%     0%     0%     0%     0%     0%

bio5-rnd (biology: drug activity), D = 5, N = 50000, h = 0.000567161
Algorithm \ scale          0.001  0.01   0.1    1      10     100    1000   total
Naive                      214    214    214    214    214    214    214    1498
MCMM (ε = 0.1, p = 0.9)    4      4      6      144    149    65     1      373
  % queries violating ε    0%     0%     0%     0%     1%     0%     1%
DFGT (ε = 0.1, p = 1)      4      4      5      24     96     65     2      200
  % queries violating ε    0%     0%     0%     0%     0%     0%     0%
MCMM (ε = 0.01, p = 0.9)   4      4      6      148    165    126    1      454
  % queries violating ε    0%     0%     0%     0%     1%     0%     1%
DFGT (ε = 0.01, p = 1)     4      4      5      25     139    126    4      307
  % queries violating ε    0%     0%     0%     0%     0%     0%     0%

pall7-rnd, D = 7, N = 50000, h = 0.00131865
Algorithm \ scale          0.001  0.01   0.1    1      10     100    1000   total
Naive                      327    327    327    327    327    327    327    2289
MCMM (ε = 0.1, p = 0.9)    3      3      3      3      63     224    <1     300
  % queries violating ε    0%     0%     0%     1%     1%     12%    0%
DFGT (ε = 0.1, p = 1)      10     10     11     14     84     263    223    615
  % queries violating ε    0%     0%     0%     0%     0%     0%     0%
MCMM (ε = 0.01, p = 0.9)   3      3      3      3      70     265    5      352
  % queries violating ε    0%     0%     0%     1%     2%     1%     8%
DFGT (ε = 0.01, p = 1)     10     10     11     14     85     299    374    803
  % queries violating ε    0%     0%     0%     0%     0%     0%     0%

covtype-rnd, D = 10, N = 50000, h = 0.0154758
Algorithm \ scale          0.001  0.01   0.1    1      10     100    1000   total
Naive                      380    380    380    380    380    380    380    2660
MCMM (ε = 0.1, p = 0.9)    11     11     13     39     318    <1     <1     381
  % queries violating ε    0%     0%     0%     1%     0%     0%     0%
DFGT (ε = 0.1, p = 1)      26     27     38     177    390    244    <1     903
  % queries violating ε    0%     0%     0%     0%     0%     0%     0%
MCMM (ε = 0.01, p = 0.9)   11     11     13     77     362    2      <1     477
  % queries violating ε    0%     0%     0%     1%     1%     10%    0%
DFGT (ε = 0.01, p = 1)     26     27     38     180    427    416    <1     1115
  % queries violating ε    0%     0%     0%     0%     0%     0%     0%

CoocTexture-rnd, D = 16, N = 50000, h = 0.0263958
Algorithm \ scale          0.001  0.01   0.1    1      10     100    1000   total
Naive                      472    472    472    472    472    472    472    3304
MCMM (ε = 0.1, p = 0.9)    10     11     22     189    109    <1     <1     343
  % queries violating ε    0%     0%     0%     1%     8%     0%     0%
DFGT (ε = 0.1, p = 1)      22     26     82     240    452    66     <1     889
  % queries violating ε    0%     0%     0%     0%     0%     0%     0%
MCMM (ε = 0.01, p = 0.9)   10     11     22     204    285    <1     <1     534
  % queries violating ε    0%     0%     1%     1%     10%    4%     0%
DFGT (ε = 0.01, p = 1)     22     26     83     254    543    230    <1     1159
  % queries violating ε    0%     0%     0%     0%     0%     0%     0%

LayoutHistogram-rnd, D = 32, N = 50000, h = 0.0609892
Algorithm \ scale          0.001  0.01   0.1    1      10     100    1000   total
Naive                      757    757    757    757    757    757    757    5299
MCMM (ε = 0.1, p = 0.9)    32     32     54     168    583    8      8      885
  % queries violating ε    0%     0%     1%     1%     1%     0%     0%
DFGT (ε = 0.1, p = 1)      153    159    221    492    849    212    <1     2087
  % queries violating ε    0%     0%     0%     0%     0%     0%     0%
MCMM (ε = 0.01, p = 0.9)   32     45     60     183    858    8      8      1246
  % queries violating ε    0%     0%     1%     6%     1%     0%     0%
DFGT (ε = 0.01, p = 1)     153    159    222    503    888    659    <1     2585
  % queries violating ε    0%     0%     0%     0%     0%     0%     0%

CorelCombined-rnd, D = 89, N = 50000, h = 0.0512583
Algorithm \ scale          0.001  0.01   0.1    1      10     100    1000   total
Naive                      1716   1716   1716   1716   1716   1716   1716   12012
MCMM (ε = 0.1, p = 0.9)    384    418    575    428    1679   17     17     3518
  % queries violating ε    0%     0%     0%     1%     10%    0%     0%
DFGT (ε = 0.1, p = 1)      659    677    864    1397   1772   836    17     6205
  % queries violating ε    0%     0%     0%     0%     0%     0%     0%
MCMM (ε = 0.01, p = 0.9)   401    419    575    437    1905   17     17     3771
  % queries violating ε    0%     0%     0%     1%     2%     0%     0%
DFGT (ε = 0.01, p = 1)     659    677    865    1425   1794   1649   17     7086
  % queries violating ε    0%     0%     0%     0%     0%     0%     0%

References

[1] Nando de Freitas, Yang Wang, Maryam Mahdaviani, and Dustin Lang. Fast Krylov methods for N-body learning. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 251–258. MIT Press, Cambridge, MA, 2006.
[2] P. Drineas, R. Kannan, and M. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition, 2004.
[3] G. Golub. Matrix Computations, Third Edition. The Johns Hopkins University Press, 1996.
[4] A. Gray and A. W. Moore. N-body problems in statistical learning. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13 (December 2000). MIT Press, 2001.
[5] Alexander G. Gray and Andrew W. Moore. Nonparametric density estimation: Toward computational tractability. In SIAM International Conference on Data Mining 2003, 2003.
[6] Alexander G. Gray and Andrew W. Moore. Very fast multivariate kernel density estimation via computational geometry. In Joint Statistical Meeting 2003, 2003. To be submitted to JASA.
[7] L. Greengard and J. Strain. The fast Gauss transform. SIAM Journal of Scientific and Statistical Computing, 12(1):79–94, 1991.
[8] Peter Hall, David Marshall, and Ralph Martin. Merging and splitting eigenspace models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9):1042–1049, 2000.
[9] Michael Holmes, Alexander Gray, and Charles Isbell. Ultrafast Monte Carlo for statistical summations. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 673–680. MIT Press, Cambridge, MA, 2008.
[10] Dongryeol Lee and Alexander Gray. Faster Gaussian summation: Theory and experiment. In Proceedings of the Twenty-second Conference on Uncertainty in Artificial Intelligence, 2006.
[11] Dongryeol Lee, Alexander Gray, and Andrew Moore. Dual-tree fast Gauss transforms. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 747–754. MIT Press, Cambridge, MA, 2006.
[12] Ting Liu, Andrew W. Moore, and Alexander Gray. Efficient exact k-NN and nonparametric classification in high dimensions. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
[13] P. G. Martinsson and Vladimir Rokhlin. An accelerated kernel-independent fast multipole method in one dimension. SIAM J. Scientific Computing, 29(3):1160–1178, 2007.
[14] A. W. Moore, J. Schneider, and K. Deng. Efficient locally weighted polynomial regression predictions. In D. Fisher, editor, Proceedings of the Fourteenth International Conference on Machine Learning, pages 196–204, San Francisco, 1997. Morgan Kaufmann.
[15] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall/CRC, 1986.