Fast Gaussian Process Methods for Point Process Intensity Estimation

John P. Cunningham    JCUNNIN@STANFORD.EDU
Department of Electrical Engineering, Stanford University, Stanford, CA, USA 94305

Krishna V. Shenoy    SHENOY@STANFORD.EDU
Department of Electrical Engineering and Neurosciences Program, Stanford University, Stanford, CA, USA 94305

Maneesh Sahani    MANEESH@GATSBY.UCL.AC.UK
Gatsby Computational Neuroscience Unit, UCL, London, WC1N 3AR, UK

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

Point processes are difficult to analyze because they provide only a sparse and noisy observation of the intensity function driving the process. Gaussian Processes offer an attractive framework within which to infer underlying intensity functions. The result of this inference is a continuous function defined across time that is typically more amenable to analytical efforts. However, a naive implementation will become computationally infeasible in any problem of reasonable size, both in memory and run time requirements. We demonstrate problem-specific methods for a class of renewal processes that eliminate the memory burden and reduce the solve time by orders of magnitude.

1. Introduction

Point processes with temporally or spatially varying intensity functions arise naturally in many fields of study. When the intensity function is itself a random process (often a Gaussian Process), the process is called a doubly-stochastic or Cox point process. Application domains include economics and finance (e.g. Basu & Dassios, 2002), neuroscience (e.g. Cunningham et al., 2008), ecology (e.g. Moller et al., 1998), and others. Given observed point process data, one can use a Gaussian Process (GP) framework to infer an optimal estimate of the underlying intensity. In this paper we consider GP prior intensity functions coupled with point process observation models. The problem of intensity estimation then becomes a modification of GP regression and inherits the computational complexity inherent in GP methods (e.g. Rasmussen & Williams, 2006). The data size n grows with the length (e.g. total time) of the point process. Naive methods are O(n^2) in memory requirements (storing Hessian matrices) and O(n^3) in run time (matrix inversions and determinants). At one thousand data points (such as one second of millisecond-resolution data), a naive solution to this problem is already quite burdensome on a common workstation. At ten thousand or more, the problem is for all practical purposes intractable.

While applications of doubly-stochastic point processes are numerous, there is little work proposing solutions to the serious computational issues inherent in these methods. Thus, the development of efficient methods for intensity estimation would be of broad appeal. In this paper, we do not address the appropriateness of doubly-stochastic point process models for particular applications; rather, we focus on the significant steps required to make such modelling computationally tractable. We build on previous work from both GP regression and large-scale optimization to create a considerably faster and less memory-intensive algorithm for doubly-stochastic point-process intensity estimation.

As part of the GP intensity estimation problem we optimize model hyperparameters using a Laplace approximation to the marginal likelihood or evidence. This requires an iterative approach which divides into two major parts. First, at each iteration we must find a modal (MAP) estimate of the intensity function. Second, we must calculate the approximate model evidence and its gradients with respect to GP hyperparameters. Both aspects of this problem present computational and memory problems. We develop methods to reduce the costs of both drastically.

We show that for certain classes of renewal process observation models, MAP estimation may be framed as a tractable convex program. To ensure nonnegativity in the intensity function we use a log barrier Newton method (Boyd & Vandenberghe, 2004), which we solve efficiently by deriving decompositions of matrices with known structure. By exploiting a recursion embedded in the algorithm, we avoid many costly matrix inversions. We combine these advances with large-scale optimization techniques, such as conjugate gradients (CG, as used by Gibbs & MacKay, 1997) and fast Fourier transform (FFT) matrix multiplication methods. To evaluate the model evidence, as well as its gradients with respect to hyperparameters, we again exploit the structure imposed by the renewal process framework to find an exact but considerably less burdensome representation. We then show that a further approximation loses little in accuracy, but makes the cost of this computation insignificant. Combining these advances, we reduce a problem that is effectively computationally infeasible to one with minimal memory load and very fast solution time: O(n^2) memory requirements are eliminated, and O(n^3) computation is reduced to modestly superlinear.
2. Problem Overview

Define x ∈ R^n to be the intensity function (the high-dimensional signal of interest); x is indexed by input¹ time points t ∈ R^n. Let the observed data y = {y_0, ..., y_N} ∈ R^{N+1} be a set of N + 1 time indices into the vector x; that is, the ith point event occurs at time y_i, and the intensity at that time is x_{y_i}. Denote all hyperparameters by θ. In general, the prior and observation models are both functions of θ. The GP framework implies a normal prior on the intensity p(x | θ) = N(μ1, Σ), where the nonzero mean is a sensible choice because the intensity function is constrained to be nonnegative. Thus we treat μ as a hyperparameter (μ ∈ θ). The positive definite covariance matrix Σ (also a function of θ) is defined by an appropriate kernel such as a squared exponential or Ornstein-Uhlenbeck kernel (see Rasmussen & Williams, 2006, for a discussion of GP kernels). The point-process observation model gives the likelihood p(y | x, θ).

¹ In this work we restrict ourselves to a single input dimension (which we call time), as it aligns with the family of renewal processes in one dimension. Some ideas here can be extended to multiple dimensions (e.g. if using a spatial Poisson process).

In this work, we consider renewal processes (i.e. one-dimensional point processes with independent event interarrival times), a family of point processes that has both been well studied theoretically and applied in many domains (Daley & Vere-Jones, 2002). The GP prior is log concave in x, and the nonnegativity constraint on intensity (x ≥ 0) is convex (constraining x to be nonnegative is equivalent to solving an unconstrained problem where the prior on the vector x is a truncated multivariate normal distribution, but this is not the same as truncating the GP prior in the continuous, infinite-dimensional function space; see Horrace, 2005). Thus, if the observation model is also log concave in x, the MAP estimate x* is unique and can be readily found using a log barrier Newton method (Boyd & Vandenberghe, 2004; Paninski, 2004).

Renewal processes are simply defined by their interarrival distribution f_z(z). A common construction for a renewal process with an inhomogeneous underlying intensity is to use the intensity rescaling m(t_i | t_{i-1}) = \int_{t_{i-1}}^{t_i} x(u)\,du (in practice, a discretized sum of x) (Barbieri et al., 2001; Daley & Vere-Jones, 2002). Accordingly, the density for an observation of event times y can be defined

p(y) = \prod_{i=1}^{N} p(y_i \mid y_{i-1}) = \prod_{i=1}^{N} \left| m'(y_i \mid y_{i-1}) \right| \, f_z\big( m(y_i \mid y_{i-1}) \big)    (1)

by a change of variables for the interarrival distribution (Papoulis & Pillai, 2002). Since m(t) is a linear transformation of the intensity function (our variables of interest), the observation model obeys log concavity as long as the distribution primitive f_z(z) is log concave. Examples of suitable renewal processes include the inhomogeneous Poisson, gamma interval, Weibull interval, inverse Gaussian (Wald) interval, Rayleigh interval, and other processes (Papoulis & Pillai, 2002). For this paper, we choose one of these distributions and focus on its details. However, for processes of the form above, the implementation details are identical up to the forms of the actual distributions.
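To make the time-rescaling construction concrete, the sketch below (illustrative Python/NumPy, not the MATLAB/C-MEX implementation used in this paper; function and variable names are our own) evaluates the log of Eq. 1 on a discretized intensity, using a unit-mean gamma interarrival density as one log-concave choice of f_z.

```python
import numpy as np
from scipy.stats import gamma as gamma_dist

def renewal_log_density(x, event_idx, dt, gamma_shape=2.0):
    """Log of Eq. (1) for a discretized intensity x on a grid with spacing dt.

    event_idx holds the grid indices y_0, ..., y_N of the observed events; the
    interarrival density f_z is taken here to be a unit-mean gamma density (one
    log-concave choice, whose shape parameter plays the role of gamma in Eq. 3).
    """
    logp = 0.0
    for i in range(1, len(event_idx)):
        lo, hi = event_idx[i - 1], event_idx[i]
        m = np.sum(x[lo:hi]) * dt          # m(y_i | y_{i-1}): discretized integral of x
        dm = x[hi]                         # |m'(y_i | y_{i-1})| = intensity at the event
        # change of variables: p(y_i | y_{i-1}) = |m'| f_z(m)
        logp += np.log(dm) + gamma_dist.logpdf(m, a=gamma_shape, scale=1.0 / gamma_shape)
    return logp

# Example: 2 s of millisecond-resolution intensity at 50 events/s, three events.
x = np.full(2000, 50.0)
print(renewal_log_density(x, event_idx=[100, 120, 141], dt=1e-3))
```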
To solve the GP intensity estimation, we first find a MAP estimate x* given fixed hyperparameters θ, and then we approximate the model evidence p(y | θ) (for which we need x*) and its gradients in θ. Iterating these two steps, we can find the optimal model θ̂ (we do not integrate over hyperparameters). Finally, MAP estimation under these optimal hyperparameters θ̂ gives an optimal estimate x̂* of the underlying intensity. This iterative solution for θ̂ can be written

\hat{\theta} = \arg\max_{\theta}\; p(\theta)\, p(y \mid \theta) \;\approx\; \arg\max_{\theta}\; p(\theta)\, p(y \mid x^*, \theta)\, p(x^* \mid \theta)\, (2\pi)^{n/2}\, \big| \Lambda + \Sigma^{-1} \big|^{-1/2},    (2)

where the last term is a Laplace approximation to the intractable form of p(y | θ), x* is the mode of p(y | x)p(x) (the MAP estimate), and Λ = −∇²_x log p(y | x, θ)|_{x=x*}. The log concavity of our problem in x supports the choice of a Laplace approximation. Each of the two major steps in this algorithm (MAP estimation and model selection) involves computational and memory challenges. We address these challenges in Sections 4 and 5.

The computational problems inherent in GP methods have been well studied, and much progress has been made in sparsification (e.g. Quinonero-Candela & Rasmussen, 2005). Unfortunately, these methods do not apply directly to point process estimation, as there are no distinct training and test sets. The reader might wonder if a coarser grid would be adequate, thereby obviating the detailed methods developed here. We have found in experiments (not shown) that the sacrifice in accuracy required to allow reasonable computational tractability is large, and thus we do not consider the coarse grid a viable option. One could also consider re-expressing the problem in terms of the integrals m(y_i | y_{i-1}) appearing in Eq. 1. While this is possible in certain cases, it requires additional approximation. Finally, we note that the Laplace approximation is often inferior to Expectation Propagation (EP) (Kuss & Rasmussen, 2005) for GP methods. While many of the same techniques used here could also be used with EP, EP requires additional approximations and computational overhead. We find in experiments (not shown) that EP yields similar accuracy to the Laplace approximation in this domain, but EP incurs increased complexity and computational load.

3. Model Construction

To demonstrate our fast method, we choose the specific observation model of an inhomogeneous gamma interval process (Barbieri et al., 2001), with hyperparameter γ ∈ θ, γ ≥ 1. If time has been discretized with precision Δ, this can be written

p(y \mid x, \theta) = \prod_{i=1}^{N} \frac{\gamma x_{y_i}}{\Gamma(\gamma)} \left( \gamma \sum_{k=y_{i-1}}^{y_i - 1} x_k \Delta \right)^{\gamma-1} \exp\left( -\gamma \sum_{k=y_{i-1}}^{y_i - 1} x_k \Delta \right)    (3)

(where we have ignored terms that scale with Δ). Let f(x) = −log p(y | x, θ)p(x | θ). Our MAP estimation problem is to minimize f(x) subject to the constraint x ≥ 0 (nonnegativity). In the log barrier method, we consider the above problem as a sequence of convex problems where we seek to minimize, at increasing values of τ, the (unconstrained) objective function

f_\tau(x) = f(x) - \frac{1}{\tau} \sum_{k=1}^{n} \log(x_k),    (4)

which has Hessian (positive definite by our log concave construction)

H = \nabla_x^2 f_\tau(x) = \Sigma^{-1} + \Lambda, \quad \text{where } \Lambda = B + D,    (5)

with D = \mathrm{diag}\big( x_{y_0}^{-2}, \ldots, 0, \ldots, x_{y_i}^{-2}, \ldots, 0, \ldots, x_{y_N}^{-2} \big) + \frac{1}{\tau}\,\mathrm{diag}\big( x_1^{-2}, \ldots, x_n^{-2} \big) being positive definite and diagonal. B is block diagonal with N blocks B̂_i:

\hat{B}_i = b_i b_i^T \quad \text{where} \quad b_i = \frac{\sqrt{\gamma - 1}}{\sum_{k=y_{i-1}}^{y_i - 1} x_k}\, \mathbf{1}.    (6)

B is thus block rank 1 (with the positive eigenvalue in each block corresponding to the eigenvector b_i). This matrix is key, as we exploit its structure to achieve improvements in computational performance.
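The block structure of Λ is what the rest of the method exploits: D is stored as a length-n diagonal and B as one vector b_i per inter-event block. The sketch below (Python/NumPy, our own names; gamma_shape stands for γ, tau for the barrier weight τ, and whether y_0 contributes to D is a convention we gloss over) assembles these components and multiplies Λ by a vector without ever forming an n × n matrix.

```python
import numpy as np

def lambda_components(x, event_idx, gamma_shape, tau):
    """Pieces of Lambda = B + D from Eqs. (5)-(6), stored without any n x n matrix."""
    idx = np.asarray(event_idx)
    d = (1.0 / tau) / x**2                      # log-barrier part of D
    d[idx] += 1.0 / x[idx]**2                   # event-time entries of D
    b_blocks = []
    for i in range(1, len(event_idx)):
        lo, hi = event_idx[i - 1], event_idx[i]                               # i-th inter-event block
        b = np.full(hi - lo, np.sqrt(gamma_shape - 1.0) / np.sum(x[lo:hi]))   # b_i of Eq. (6)
        b_blocks.append((lo, hi, b))
    return d, b_blocks

def lambda_matvec(v, d, b_blocks):
    """Lambda v = (B + D) v in O(n), using only the diagonal d and the rank-1 blocks."""
    out = d * v
    for lo, hi, b in b_blocks:
        out[lo:hi] += b * (b @ v[lo:hi])        # rank-1 block contribution B_i v
    return out
```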
4. MAP Estimation Problem

As outlined in Section 2, we first find the MAP estimate x* for any model defined by hyperparameters θ. The log barrier method has the intensive requirements of calculating the objective Eq. 4, its gradient g (in x), and the Newton step x_nt = −H^{-1}g. Each of these calculations is O(n^3) in run time and O(n^2) in memory. We show an approach that alleviates these burdens.

4.1. Finding the Newton Step x_nt

First we consider the Hessian, H = Σ^{-1} + Λ, which itself contains the costly inverse Σ^{-1}. We would like to avoid this inversion of Σ entirely, using the matrix inversion lemma (Sherman-Morrison-Woodbury formula):

-H^{-1} = -(\Sigma^{-1} + \Lambda)^{-1} = -\Sigma + \Sigma R \left( I + R^T \Sigma R \right)^{-1} R^T \Sigma,    (7)

where R is any valid factorization such that RR^T = Λ. This decomposition preserves symmetry in the remaining matrix inverse (required for CG) and has advantageous numerical properties. With this form, instead of calculating x_nt = −H^{-1}g directly, we need only multiply the rightmost expression in Eq. 7 with the gradient g. Doing so requires the inversion (I + R^TΣR)^{-1}v, where v = R^TΣg. CG allows us to avoid directly calculating matrix inverses and instead achieve the desired inversion by iteratively multiplying (I + R^TΣR)z for different vectors z (Gibbs & MacKay, 1997). It is common to precondition the CG method to reduce the number of iterations required for convergence. However, our experience with preconditioning (using both classic preconditioners and some of our own design) was that it actually degraded run-time performance. Preconditioners typically aim to improve the condition number of the Hessian, which indeed they do in this problem. However, the rapidity of CG convergence here is facilitated more by spectral concentration (many eigenvalues being equal or close to 1) than by overall conditioning. Thus, we found it more effective to use CG inversion directly on (I + R^TΣR).

In general, however, finding the decomposition Λ = RR^T is an O(n^3) operation, which would remove any computational benefit from this approach. For log concave renewal processes, we can derive a valid decomposition in closed form and linear computation time. Since Λ is block diagonal, we consider only one block without loss of generality. Calling this block Λ̂, we know Λ̂ = bb^T + D̂, where D̂ is a diagonal block of the larger diagonal matrix D, and b is defined in Eq. 6. D̂ is positive definite, so T = D̂^{-1/2} satisfies T D̂ T = I (a similarity transform). Then, calling b̃ = Tb, we have TΛ̂T = b̃b̃^T + I. With this form, we see that the general structure of TΛ̂T is preserved under the desired matrix decomposition, up to scaling of the components:

\left( \alpha \tilde{b}\tilde{b}^T + I \right)\left( \alpha \tilde{b}\tilde{b}^T + I \right)^T = \left( \alpha^2 \|\tilde{b}\|^2 + 2\alpha \right) \tilde{b}\tilde{b}^T + I,    (8)

and we want to choose α such that (Eq. 8) equals b̃b̃^T + I. Using the quadratic formula to find this α, we see then that

\tilde{R} = \frac{\sqrt{1 + \|\tilde{b}\|^2} - 1}{\|\tilde{b}\|^2}\, \tilde{b}\tilde{b}^T + I    (9)

satisfies TΛ̂T = R̃R̃^T. Since T is diagonal, it easily inverts to T^{-1} = D̂^{1/2}. Then:

\hat{\Lambda} = T^{-1} \tilde{R}\tilde{R}^T T^{-1} = \left( T^{-1}\tilde{R} \right)\left( T^{-1}\tilde{R} \right)^T = \hat{R}\hat{R}^T.    (10)

To be explicit, we have found that

\hat{R} = \frac{\sqrt{1 + \|\hat{D}^{-1/2} b\|^2} - 1}{\|\hat{D}^{-1/2} b\|^2}\; b\, b^T \hat{D}^{-1/2} + \hat{D}^{1/2}    (11)

is a valid decomposition R̂R̂^T = Λ̂. This decomposition can be seen as a partial rank-one (blockwise) update to a Cholesky factorization (Gill et al., 1974), in that D̂ can trivially be factorized to D̂^{1/2}. The final form is not, however, a Cholesky factorization, since R̂ is not triangular (making a triangular factor would require additional computation and the explicit representation of the Cholesky matrix).

Since all of the products needed to construct R̂ can be formed in O(m) time (where m is the size of the block), and since the larger matrix R can be formed by tiling the blocks R̂, we have a total complexity for this decomposition of O(n). We can then use CG to find the solution to (I + R^TΣR)^{-1}(R^TΣg). With this inversion calculated, we can perform the remaining forward multiplications in Eq. 7; this completes calculation of a Newton step. In fact, we need not form the matrix R in memory. Instead, we retain each of its component elements (in Eq. 11), and reduce multiplication of a vector by R to a sequence of inner products and multiplications by diagonal matrices, all of which can be stored and calculated in O(n) time. Thus, we eliminate the need for O(n^2) storage, and we perform the relevant matrix multiplications in O(n) time.
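As a concrete illustration, the sketch below (Python/NumPy, our naming, building on the hypothetical lambda_components helper above) forms the per-block quantities of Eq. 11 and applies R and R^T to a vector in O(n) using only inner products and diagonal scalings; outside the event blocks Λ is purely diagonal, so R there reduces to D^{1/2}.

```python
import numpy as np

def make_R_blocks(d, b_blocks):
    """Per-block pieces of R_hat in Eq. (11): R_hat = alpha b b^T D^{-1/2} + D^{1/2}."""
    R_blocks = []
    for lo, hi, b in b_blocks:
        b_tilde = b / np.sqrt(d[lo:hi])              # b~ = D^{-1/2} b
        s = b_tilde @ b_tilde                        # ||b~||^2
        alpha = (np.sqrt(1.0 + s) - 1.0) / s         # scaling from Eqs. (9) and (11)
        R_blocks.append((lo, hi, b, b_tilde, alpha))
    return R_blocks

def R_matvec(v, d, R_blocks):
    """R v in O(n); the global D^{1/2} term covers entries outside any event block."""
    out = np.sqrt(d) * v
    for lo, hi, b, b_tilde, alpha in R_blocks:
        out[lo:hi] += alpha * b * (b_tilde @ v[lo:hi])
    return out

def Rt_matvec(v, d, R_blocks):
    """R^T v in O(n) (the transpose swaps the roles of b and b~)."""
    out = np.sqrt(d) * v
    for lo, hi, b, b_tilde, alpha in R_blocks:
        out[lo:hi] += alpha * b_tilde * (b @ v[lo:hi])
    return out
```

A quick self-check under these assumptions is that R_matvec(Rt_matvec(v, d, blocks), d, blocks) agrees with lambda_matvec(v, d, b_blocks), reflecting RR^T = Λ.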
Since R can be multiplied in linear time, the complexity of multiplying vectors by (I + R^TΣR) depends on multiplying vectors by the covariance matrix Σ. Since we have evenly spaced resolution of our data x in time indices t_i, Σ is Toeplitz. This matrix can be embedded in a larger circulant matrix, multiplication by which is simply a convolution operation of the argument vector with a row of this circulant matrix. Thus, the operation can be done quickly in O(n log n) using frequency-domain multiplications (Silverman, 1982). Further, we need never represent the matrix Σ; we only store the first row of the circulant matrix. Again we have eliminated O(n^2) memory needs. Other methods for fast kernel matrix multiplications include Fast Gauss Transforms (FGT) (Raykar et al., 2005) and kd-trees (Shen et al., 2006; Gray & Moore, 2003). We note that the single input dimension (time) enables this Toeplitz structure, and thus an extension to multiple dimensions should use FGT or similar. The regular structure of the data points in any dimension makes multiplications very fast with such a method. Further, these methods avoid explicit representation of Σ. Here, the simple FFT approach for this one-dimensional problem significantly outperforms other (more general) methods in both speed and accuracy; a sketch of this multiplication appears below.
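A minimal sketch of that FFT multiplication (Python/NumPy, assuming a symmetric Toeplitz Σ specified by its first row; the particular circulant padding chosen is one standard option):

```python
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_matvec(first_row, v):
    """Multiply the symmetric Toeplitz matrix with the given first row by v in O(n log n).

    The Toeplitz matrix is embedded in a circulant matrix of size 2n - 2, whose action
    is a circular convolution computed with the FFT (Silverman, 1982). Only the first
    row is ever stored, so no n x n matrix is formed.
    """
    n = len(first_row)
    c = np.concatenate([first_row, first_row[-2:0:-1]])   # first column of the circulant
    v_pad = np.concatenate([v, np.zeros(len(c) - n)])
    prod = np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(v_pad), len(c))
    return prod[:n]

# Example with a squared exponential kernel on a millisecond grid (values are arbitrary).
n, dt, ell, sig2 = 2000, 1e-3, 0.1, 100.0
first_row = sig2 * np.exp(-0.5 * (dt * np.arange(n) / ell) ** 2)
v = np.random.randn(n)
print(np.allclose(toeplitz_matvec(first_row, v), toeplitz(first_row) @ v))  # sanity check
```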
Finally, we note that the matrix (I + R^TΣR) is particularly well suited to CG. Although R^TΣR is full rank by definition, in practice its spectrum has very few large eigenvalues (typically fewer than N, the number of events). Loosely, the matrix looks like identity plus low rank. In practice, the CG method converges with high accuracy almost always in fewer than 50 steps (very often under 30). This is drastically fewer than the worst case of n steps (n of 10^3 to 10^4).

Instead of decomposing Λ = RR^T, one might have considered using the matrix inversion lemma to write (Σ^{-1} + Λ)^{-1} = Σ − Σ(Σ + Λ^{-1})^{-1}Σ. Indeed, this valid form enables all of the CG and fast multiplication methods previously discussed. While it may seem that this form's ease of derivation (compared to the matrix decomposition in Eq. 11) warrants its use in general, the matrix to be inverted is poorly conditioned compared to (I + R^TΣR), and thus the inversion requires more CG steps. We have found in testing that the number of CG steps can roughly double. Thus, the decomposition of Eq. 11 is computationally worthwhile.

In this section, we have constructed a fast method for calculating the Newton step that costs O(n log n) per CG step and incurs a very small number of CG steps. Also, we have avoided explicit representation of any matrix, so that memory requirements are only linear in the data size n, allowing problem sizes of potentially millions of time steps. These two factors stand in contrast to the cubic run time and quadratic storage needs of a naive method.

4.2. Evaluating the Gradient and Objective

Calculating the objective f_τ(x) (Eq. 4) and its gradient (both required for the log barrier method) requires finding Σ^{-1}(x − μ1). Note that the kth iterate x^{(k)} (of the log barrier method) has the form

x^{(k)} - \mu\mathbf{1} = x^{(k-1)} + t^{(k-1)} x_{\mathrm{nt}}^{(k-1)} - \mu\mathbf{1} = \sum_{j=1}^{k-1} t^{(j)} x_{\mathrm{nt}}^{(j)} + \left( x^{(0)} - \mu\mathbf{1} \right),    (12)

where t^{(j)} and x_nt^{(j)} represent the jth iterates of the Newton step size t and the step x_nt, and x^{(0)} is the algorithm initial point. The most logical starting point x^{(0)} is μ1, in which case the rightmost term in Eq. 12 drops out. Thus, letting x^{(0)} = μ1 and using the form of x_nt = −H^{-1}g with −H^{-1} defined as in Eq. 7, we write:

\Sigma^{-1}\left( x^{(k)} - \mu\mathbf{1} \right) = \sum_{j=1}^{k-1} t^{(j)} \left[ -g^{(j)} + R^{(j)} \left( I + R^{(j)T} \Sigma R^{(j)} \right)^{-1} R^{(j)T} \Sigma\, g^{(j)} \right].    (13)

In the earlier calculation of x_nt (Section 4.1), both of the right-hand-side arguments in Eq. 13 have already been found. As such, we have a recurrence that obviates the inversion of Σ, with no additional memory demands (Rasmussen & Williams, 2006). The above steps reduce a naive MAP estimation (of any log concave renewal process) that requires cubic effort and quadratic storage to an algorithm that is modestly superlinear in run time and linear in memory requirements.
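To show how these pieces fit together, here is an illustrative sketch (Python/SciPy; Sigma_mv, R_mv and Rt_mv stand for the FFT and blockwise multiplications sketched earlier, and the function names are ours, not the paper's) of one Newton step via Eq. 7, returning the vector whose weighted running sum maintains Σ^{-1}(x − μ1) as in Eq. 13.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_step(g, n, Sigma_mv, R_mv, Rt_mv, max_cg=200):
    """One Newton step x_nt = -H^{-1} g via Eq. (7), without forming any n x n matrix.

    Returns x_nt and u = Sigma^{-1} x_nt = -g + R (I + R^T Sigma R)^{-1} R^T Sigma g;
    accumulating sum_j t_j u_j across iterations (starting from x^(0) = mu 1) maintains
    Sigma^{-1}(x - mu 1) through the recursion of Eq. (13).
    """
    A = LinearOperator((n, n), matvec=lambda z: z + Rt_mv(Sigma_mv(R_mv(z))))
    v = Rt_mv(Sigma_mv(g))
    z, info = cg(A, v, maxiter=max_cg)   # in practice converges in well under 50 iterations
    u = -g + R_mv(z)                     # Sigma^{-1} x_nt
    return Sigma_mv(u), u

# Inside the log barrier loop one would do roughly (backtracking line search omitted):
#   x_nt, u = newton_step(grad_f_tau(x), n, Sigma_mv, R_mv, Rt_mv)
#   x = x + t * x_nt
#   sigma_inv_dev = sigma_inv_dev + t * u   # running Sigma^{-1}(x - mu 1), Eq. (13)
```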
5. Model Selection Problem

Having now found x* for any hyperparameters θ, the second major part of the problem is to find the negative logarithm of our approximation to the evidence p(y | θ) in Eq. 2, and its gradients with respect to θ. The approximated log evidence can be written as

-\log p(y \mid \theta) \approx -\log p(y \mid x^*) + \frac{1}{2}\left( x^* - \mu\mathbf{1} \right)^T \Sigma^{-1} \left( x^* - \mu\mathbf{1} \right) + \frac{1}{2}\log\left| I + \Sigma\Lambda \right|    (14)

(ignoring constants). Each of these terms has an explicit and an implicit gradient with respect to θ, where the latter result from the dependence of the MAP estimate x* on the hyperparameters (such implicit gradients are typical for the use of the Laplace approximation in GP learning; see Rasmussen & Williams, 2006, Section 5.5.1). The implicit gradients in this problem are extremely computationally burdensome to calculate (requiring the trace of matrix inversions and matrix-matrix products for each element of x). In empirical tests, we find implicit gradients to be quite small relative to the explicit gradients (often by several orders of magnitude). Ignoring these gradients is undesirable but essential to make this problem computationally feasible. Thus we consider only explicit gradients. This is a common approach for GP methods; see Rasmussen and Williams (2006).

Efficient computation of the first two terms of Eq. 14, as well as their gradients with respect to θ, can be achieved by the fast multiplication method and the recursion derived in Section 4. Specifically, the values of the first and second terms of Eq. 14 are calculated during the MAP estimation, so no additional memory or computation is necessary for them. The gradient of the first term is nonzero only with respect to γ and is linear in x (no matrix multiplications are required). Thus it can be quickly calculated with no additional memory demands. Computation of the gradient of the second term (the prior) can exploit the fact that we calculated Σ^{-1}(x* − μ1) in the final step of the MAP estimation. The gradient of this term with respect to μ is a simple inner product 1^T(Σ^{-1}(x* − μ1)); since we have already calculated the right side of this inner product, this computation is O(n) in run time and requires no additional memory. The gradient of this term with respect to a kernel hyperparameter θ_i (e.g. a lengthscale or variance) is

\frac{d}{d\theta_i}\, \frac{1}{2}\left( x^* - \mu\mathbf{1} \right)^T \Sigma^{-1} \left( x^* - \mu\mathbf{1} \right) = -\frac{1}{2}\left( \Sigma^{-1}(x^* - \mu\mathbf{1}) \right)^T \frac{d\Sigma}{d\theta_i} \left( \Sigma^{-1}(x^* - \mu\mathbf{1}) \right).    (15)

Since we have Σ^{-1}(x* − μ1), this gradient only requires one matrix-vector multiplication. dΣ/dθ_i has the same Toeplitz structure as Σ and can thus be quickly multiplied. Thus, calculating the first two terms of Eq. 14 and their gradients adds no complexity to the method developed so far.
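Concretely, both gradients of the prior term reduce to O(n) work given the vector a = Σ^{-1}(x* − μ1) maintained by the recursion. A hedged sketch (our names; each dSigma_matvecs entry is assumed to apply dΣ/dθ_i, e.g. via the Toeplitz/FFT routine above, and signs follow the negative log evidence of Eq. 14 with x* held fixed):

```python
import numpy as np

def prior_term_gradients(a, dSigma_matvecs):
    """Explicit gradients of 0.5 (x* - mu 1)^T Sigma^{-1} (x* - mu 1).

    a: the precomputed vector Sigma^{-1}(x* - mu 1) from the MAP recursion (Eq. 13).
    dSigma_matvecs: callables multiplying a vector by dSigma/dtheta_i for each kernel
    hyperparameter (each derivative shares Sigma's Toeplitz structure).
    """
    grad_mu = -np.sum(a)                                            # -1^T a
    grad_kernel = [-0.5 * (a @ dmv(a)) for dmv in dSigma_matvecs]   # Eq. (15)
    return grad_mu, grad_kernel
```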
Only the term ½ log|I + ΣΛ| presents difficulty. Determinants in general require O(n^2) memory and O(n^3) solve time using a Cholesky or PLU factorization, so we must consider the problem more carefully. We examine the eigenstructure of (I + ΣΛ). Since we are not trying to find a MAP estimate, there is no log barrier term (i.e. let τ → ∞); thus D (from Eq. 5) is rank N only. This means that Λ = B + D (Eqs. 5-6) is block outer product plus sub-rank diagonal, so it is also rank deficient with block rank 2. Thus, it has 2N nonzero eigenvalues (two corresponding to each of the N events, one in each block from B and one in each block from D). Using the eigenvalue decomposition Λ = USU^T, we see

\log\left| I + \Sigma\Lambda \right| = \log\left| I + \Sigma U S U^T \right| = \log\left| U^T \right|\left| I + \Sigma U S U^T \right|\left| U \right| = \log\left| I + U^T \Sigma U S \right|,    (16)

since the orthogonal matrix U has determinant 1 and U^TU = I by definition. Since Λ has rank 2N, we know that S is diagonal with zeros on the last n − 2N entries. By construction, the number of events N is much smaller than the total data size n. Since the determinant of a matrix is the product of its eigenvalues, the unit-eigenvalue dimensions of I + U^TΣUS can be ignored. We define S̄ as the 2N × 2N submatrix of S that is made up of the diagonal block with nonzero diagonal entries. Further define Ū as the corresponding 2N eigenvectors. Then, since the other dimensions of U^TΣUS contribute nothing to the determinant, we have

\log\left| I + \Sigma\Lambda \right| = \log\left| I + U^T \Sigma U S \right| = \log\left| I + \bar{U}^T \Sigma \bar{U} \bar{S} \right| = \log\left| I + \bar{\Sigma}\bar{S} \right|,    (17)

where I is now the 2N × 2N identity, and we have further defined Σ̄ = Ū^TΣŪ. Computationally, Σ̄ is formed by multiplying Σ with the columns of Ū. Since Λ is block rank 2, both matrices S̄ and Ū can be found in closed form (N rank-2 eigendecompositions, one per block). This calculation of Eq. 17 requires 2N matrix multiplications, each with a run time cost of O(n log n).

We can make a small approximation that simplifies this problem even further. Typically, N of these 2N eigenvalues are substantially larger than the other N. Each block of Λ contributes two nonzero eigenvalues. The larger is due to the diagonal entry x_{y_i}^{-2} (from the matrix D) and is nearly axis aligned. The smaller eigenvalue is due to the outer product vector from block B̂_i. Examination of the denominators in the definitions of B̂_i and D in Eqs. 5 and 6 explains the difference in magnitude, since x_{y_i}^2 is much smaller than the square-of-sums denominator in B̂_i. We approximate the eigenvector as the y_i axis and approximate its eigenvalue as the corresponding value in Λ. Then S̄ is size N × N. This savings is small, but importantly we can form Σ̄ = Ū^TΣŪ simply by picking out the N rows and columns of Σ corresponding to the event times y_i. In this formulation, we are left with matrices of size N × N only, so we have some modest number of O(N^3) operations; this approach is considerably faster and scales better than the exact method above. We have also reduced O(n^2) storage to O(N^2). The following section elucidates the quality of this approximation.

To calculate the gradients of this log determinant term with respect to θ, we also use the approximation of Eq. 17: we take as our approximate gradient the gradient of the approximation in Eq. 17. This approximation can readily be differentiated with respect to the hyperparameters (again, typical for GP methods; see Rasmussen & Williams, 2006). Since these approximations are matrices in the event space N (not the time space n), these gradients are quickly calculated with a handful of O(N^3) operations and with storage of O(N^2).
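Under this approximation the log determinant collapses to an N × N computation: pick out the rows and columns of Σ at the event times and the corresponding diagonal entries of Λ. A sketch under our assumptions (Python/NumPy; Σ is given by its Toeplitz first row, and we take the diagonal entries x_{y_i}^{-2} as the retained eigenvalues; which events are included is our assumption):

```python
import numpy as np

def approx_log_det(first_row, event_idx, x):
    """Approximate log |I + Sigma Lambda| via the N x N reduction of Eq. (17).

    Keeps only the N dominant, nearly axis-aligned eigenvalues of Lambda, approximated
    by its diagonal entries x_{y_i}^{-2} at the event times; Sigma_bar is the submatrix
    of the Toeplitz Sigma at those times.
    """
    y = np.asarray(event_idx)
    s_bar = 1.0 / x[y] ** 2                                   # retained eigenvalues of Lambda
    sigma_bar = first_row[np.abs(y[:, None] - y[None, :])]    # N x N block of Sigma
    M = np.eye(len(y)) + sigma_bar * s_bar[None, :]           # I + Sigma_bar S_bar
    return np.linalg.slogdet(M)[1]
```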
6. Results and Discussion

The methods developed here maintain computational accuracy while achieving massive speed-up and the elimination of memory burden. First, we have shown a fast method that achieves an accurate approximation of the MAP estimate x* in much less time than a naive method. We have made all matrix multiplications implicit, thereby eliminating the memory burden of representing full matrices. We call this piece the "MAP estimation." Second, we found the approximate model evidence, as well as its gradients, so as to perform model selection on the hyperparameters θ. These calculations, which involved the calculation of a log determinant and its gradients (Eq. 17), were achieved with matrices of significantly reduced dimension, again removing the storage demands of the naive method. We call this piece the "log determinant approximation." These two pieces must be iterated (as described before Eq. 2) to find both the optimal model θ̂ and the optimal intensity x̂*. We call this iterative method (combining the two pieces above) the "full GP intensity estimation." We show here that each piece is fast and accurate, and finally that they combine to make an overall method that is considerably faster than a standard implementation, with minimal sacrifice to accuracy.

To demonstrate results, we pick six representative intensity functions, consisting of sinusoids of various amplitudes (5-100 events/second), means (15-150 events/second), frequencies (1-2 Hz), and lengths (0.5-10 seconds of millisecond-resolution data, implying data sizes n of 500 to 10000). This set is by no means exhaustive, but it does indicate how this method outperforms a naive implementation in a range of scenarios. Our testing over many different intensity functions (including those in Cunningham et al., 2008) agrees with the results shown here. We simulate point process data y from these intensities, and we implement both the naive and the fast method on these process realizations. All results are given for 2006-era Linux (FC4) 64-bit workstations with 2-4GB of RAM running MATLAB (R14sp3, BLAS ATLAS 3.2.1 on AMD processors). The naive method was implemented in MATLAB. The fast method was similarly implemented in MATLAB, with some use of the C-MEX interface for linear operations such as multiplication of a vector by the (implicitly represented) matrix R.

Table 1. Performance for fast and naive methods. Results averaged over 10 independent trials.

Data Set                         | 1      | 2      | 3       | 4      | 5      | 6
Data Size (n)                    | 500    | 1000   | 1000    | 2000   | 4000   | 10000
Num. Events (N)¹                 | 20-30  | 30-40  | 140-160 | 55-70  | 55-70  | 140-160
MAP Estimation
  Fast Solve Time (s)            | 0.12   | 0.17   | 0.46    | 0.32   | 7.6    | 37.9
  Naive Solve Time (s)           | 7.04   | 40.5   | 39.5    | 333    | 3704   | 1 day³
  Speed Up                       | 58×    | 232×   | 86×     | 1043×  | 493×   | 2000׳
  MS Error (Fast vs. Naive)²     | 4.3e-4 | 4.2e-4 | 2.1e-4  | 5.2e-6 | 6.1e-6 | -
  Avg. CG Iters.                 | 6.4    | 5.5    | 16.2    | 8.1    | 29.9   | 49.7
Log Determinant Approximation
  Fast Solve Time (s)            | 6.5e-4 | 1.8e-3 | 1.9e-2  | 2.8e-3 | 2.8e-3 | 2.5e-2
  Naive Solve Time (s)           | 0.24   | 1.02   | 0.97    | 5.7    | 34.7   | 5403
  Speed Up                       | 375×   | 566×   | 52×     | 2058×  | 1.3e4× | 2.2e4׳
  Avg. Acc. of Fast Approx.      | 99.1%  | 98.8%  | 99.8%   | 98.9%  | 99.7%  | -
Full GP Intensity Estimation (Iterative Model Selection and MAP Estimation)
  Avg. Model Selection Iters.    | 54.3   | 54.6   | 89.1    | 68.1   | 39.4   | 40.7
  Fast Solve Time (s)            | 4.4    | 7.1    | 30.3    | 18.7   | 128    | 423
  Naive Solve Time (s)           | 443    | 3094   | 4548    | 2.4e4  | 1.5e5  | 1 month³
  Speed Up                       | 105×   | 451×   | 150×    | 1512×  | 1166×  | 1e4׳
  MS Error (Fast vs. Naive)²     | 0.10   | 0.03   | 10.8    | 0.01   | 0.01   | -

¹ Entries show a range of data used.
² Squared norm of x(t) is roughly 10^3 to 10^5, so these errors are insignificant.
³ Unable to complete naive method; numbers estimated from cubic scaling.

First we demonstrate the utility of our fast MAP estimation method on problems of several different sizes and with different x. We compare the fast MAP estimation to a naive implementation, demonstrating the average mean squared (MS) error (between the fast and naive estimates) and the average solve time. These results are found in the first part of Table 1. The squared norm of x is roughly 10^3 to 10^5, so the errors shown (the difference between the naive and fast methods) are vanishingly small. Thus, the fast MAP estimation gives an extremely accurate approximation of the naive MAP estimate. For all practical purposes, the fast MAP estimation method is exact. The naive method scales in run time as the cube of the data size n, as expected. The fast method and the speed-up factor do not appear to scale linearly in the data size. Indeed, run time depends heavily on the number of CG iterations required to solve the MAP estimation. This number of CG steps depends on problem size n, number of events N, and hyperparameters such as the lengthscale of the covariance matrix. Even so, major gains are achieved.

Second, we demonstrate our model selection accuracy and speed-up (the log determinant approximation). We run the full iterative fast method with both MAP estimation and evidence model selection. At each iterate of θ, we calculate the evidence and its gradients using both the fast and naive methods. In the second section of Table 1, we show average solution times for calculating the log determinant in both naive and fast methods, and we compare their accuracy. For the sake of brevity, we demonstrate only the calculation of log|I + ΣΛ|, not its gradients with respect to the hyperparameters. Those calculations show very similar speed-ups and are as well approximated. The log determinant is calculated to 99-100% accuracy relative to the naive method, so we have a highly accurate approximation.
The fast method and the speed-up factor do not appear to scale linearly in the data size. Indeed, run time depends heavily on the number of CG iterations required to solve the MAP estimation. This number of CG steps depends on problem size n, number of events N , and hyperparameters such as the lengthscale of the covariance matrix. Even so, major gains are achieved. Second, we demonstrate our model selection accuracy and speed-up (the log determinant approximation). We run the full iterative fast method with both MAP estimation and evidence model selection. At each iterate of , we calculate evidence and its gradients using both the fast and naive methods. In the second section of Table 1, we show average solution times for calculating the log determinant in both naive and fast methods, and we compare their accuracy. For the sake of brevity, we demonstrate only the calculation of log |I + |, not its gradients with respect to the hyperparameters. Those calculations show very similar speed-ups and are as well approximated. Thus, the log determinant is calculated to 99-100% accuracy with the naive method, and we have a highly accurate approximation. Finally, the full intensity estimation problem requires iterative evidence calculations and MAP estimations, so we must also demonstrate the accuracy of the full fast method versus the full naive method. The last part of Table 1 shows this result (Full GP Intensity Estimation). We see that all data sets converge to quite similar results in both the fast and the naive methods, and the fast method enjoys significant speed-up. The MS errors shown compare the result Fast Gaussian Process Methods for Point Process Intensity Estimation of the fast method to the result of the naive method and are very small compared to the squared norm of x (103 to 105 ). We have demonstrated a method for inferring optimal intensity estimates from an observation of renewal process data, and we have exploited problem structure to make this method computationally attractive. As an extension, we also developed this fast GP technique for multiple observations y(i) of the same underlying x. It uses the same approach with comparable performance improvements. As such, we do not report it here. Since we avoid all explicit representations of n × n matrices, our memory requirements are very minor for a problem of this size. The major run time improvements in Table 1 require effectively no loss of accuracy from an exact naive approach, and thus the additional technical complexity of this approach is well justified. Having fast, scalable methods for point process intensity estimation problems may mean the difference between theoretically interesting approaches and methods that become well used in practice. Gray, A., & Moore, A. (2003). Nonparametric density estimation: Toward computationsl tractability. SIAM Int'l Conference on Data Mining.. Horrace, W. (2005). Some results on the multivariate truncated normal distribution. J Multivariate Analysis, 94, 209­221. Kuss, M., & Rasmussen, C. (2005). Assessing approximate inference for binary gaussian process classification. Journal of Machine Learning Res., 6, 1679­1704. Moller, J., Syversveen, A., & Waagepetersen, R. (1998). Log Gaussian Cox processes. Scandanavian J. of Stats., 25, 451­482. Paninski, L. (2004). Log-concavity results on Gaussian process methods for supervised and unsupervised learning. Advances in NIPS, 16. Papoulis, A., & Pillai, S. (2002). Probability, random variables, and stochastic processes. Boston: McGraw Hill. 
References

Barbieri, R., Quirk, M., Frank, L., Wilson, M., & Brown, E. (2001). Construction and analysis of non-Poisson stimulus-response models of neural spiking activity. Journal of Neuroscience Methods, 105, 25-37.

Basu, S., & Dassios, A. (2002). A Cox process with log-normal intensity. Insurance: Mathematics and Economics, 31, 297-302.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.

Cunningham, J. P., Yu, B. M., Shenoy, K. V., & Sahani, M. (2008). Inferring neural firing rates from spike trains using Gaussian processes. Advances in NIPS, 20.

Daley, D., & Vere-Jones, D. (2002). An introduction to the theory of point processes. New York: Springer.

Gibbs, M., & MacKay, D. (1997). Efficient implementation of Gaussian processes. Preprint.

Gill, P., Golub, G., Murray, W., & Saunders, M. (1974). Methods for modifying matrix factorizations. Mathematics of Computation, 28, 505-535.

Gray, A., & Moore, A. (2003). Nonparametric density estimation: Toward computational tractability. SIAM International Conference on Data Mining.

Horrace, W. (2005). Some results on the multivariate truncated normal distribution. Journal of Multivariate Analysis, 94, 209-221.

Kuss, M., & Rasmussen, C. (2005). Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6, 1679-1704.

Moller, J., Syversveen, A., & Waagepetersen, R. (1998). Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25, 451-482.

Paninski, L. (2004). Log-concavity results on Gaussian process methods for supervised and unsupervised learning. Advances in NIPS, 16.

Papoulis, A., & Pillai, S. (2002). Probability, random variables, and stochastic processes. Boston: McGraw Hill.

Quinonero-Candela, J., & Rasmussen, C. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6, 1939-1959.

Rasmussen, C., & Williams, C. (2006). Gaussian processes for machine learning. Cambridge: MIT Press.

Raykar, V., Yang, C., Duraiswami, R., & Gumerov, N. (2005). Fast computation of sums of Gaussians in high dimensions. University of Maryland Tech. Report CS-TR-4767/UMIACS-TR-2005-69.

Shen, Y., Ng, A., & Seeger, M. (2006). Fast Gaussian process regression using KD-trees. Advances in NIPS, 18.

Silverman, B. (1982). Kernel density estimation using the fast Fourier transform. Journal of the Royal Statistical Society, Series C: Applied Statistics, 33.