Designing neurophysiology experiments to optimally constrain receptive field models along parametric submanifolds

Jeremy Lewi, School of Bioengineering, Georgia Institute of Technology, jeremy@lewi.us
Robert Butera, School of Electrical and Computer Engineering, Georgia Institute of Technology, rbutera@ece.gatech.edu
Sarah M. N. Woolley, Department of Psychology, Columbia University, sw2277@columbia.edu
David M. Schneider, Departments of Neurobiology and Psychology, Columbia University, dms2159@columbia.edu
Liam Paninski, Department of Statistics and Center for Theoretical Neuroscience, Columbia University, liam@stat.columbia.edu
http://www.lewilab.org    http://www.stat.columbia.edu/liam/

Abstract

Sequential optimal design methods hold great promise for improving the efficiency of neurophysiology experiments. However, previous methods for optimal experimental design have incorporated only weak prior information about the underlying neural system (e.g., the sparseness or smoothness of the receptive field). Here we describe how to use stronger prior information, in the form of parametric models of the receptive field, in order to construct optimal stimuli and further improve the efficiency of our experiments. For example, if we believe that the receptive field is well-approximated by a Gabor function, then our method constructs stimuli that optimally constrain the Gabor parameters (orientation, spatial frequency, etc.) using as few experimental trials as possible. More generally, we may believe a priori that the receptive field lies near a known sub-manifold of the full parameter space; in this case, our method chooses stimuli in order to reduce the uncertainty along the tangent space of this sub-manifold as rapidly as possible. Applications to simulated and real data indicate that these methods may in many cases improve the experimental efficiency.

1 Introduction

A long-standing problem in neuroscience has been collecting enough data to robustly estimate the response function of a neuron. One approach to this problem is to sequentially optimize a series of experiments as data is collected [1, 2, 3, 4, 5, 6]. To make optimizing the design tractable, we typically need to assume our knowledge has some nice mathematical representation, e.g., a log-concave distribution [7]. This restriction often makes it difficult to include the types of prior beliefs held by neurophysiologists, for example that the receptive field has some parametric form such as a Gabor function.
While this prior information can simply be ignored, doing so is counter-productive. Here we consider the problem of incorporating prior knowledge into an existing algorithm for optimizing neurophysiology experiments [7]. We start by assuming that a neuron can be modeled as a generalized linear model (GLM). Our prior knowledge defines a subset of all GLMs in which we expect to find the best model of the neuron. We represent this class as a sub-manifold in the parameter space of the GLM. We use the manifold to design an experiment which will provide the largest reduction in our uncertainty about the unknown parameters. To make the computations tractable we approximate the manifold using the tangent space evaluated at the maximum a posteriori (MAP) estimate of the parameters projected onto the manifold. While the tangent space only provides a local approximation of the manifold, our simulations show that our methods can still produce more informative experiments compared to optimizing the design without taking the manifold into account. Furthermore, our methods will work robustly even if the best model does not lie on the sub-manifold.

Figure 1: A schematic illustrating how we use the manifold to improve stimulus design. Our method begins with a Gaussian approximation of the posterior on the full model space after t trials. The leftmost plot shows an example of this Gaussian distribution when dim(θ) = 2. The next step is constructing the tangent space approximation of the manifold, which is illustrated in the middle plot. Here the manifold M, shown in blue, is a cubic polynomial. The MAP estimate µ_t, the blue dot, is projected onto the manifold. The projection µ_{M,t}, the green dot, is close to µ_t but also on M. We then compute the tangent space, shown in red, by taking the derivative of the manifold at µ_{M,t}. The tangent space is the space spanned by a vector in the direction of the derivative. Clearly, in the neighborhood of µ_{M,t}, moving along the manifold is roughly equivalent to moving along the tangent space. Thus, the tangent space provides a good local approximation of M. In the final column we compute p(θ|µ_{b,t}, C_{b,t}) by evaluating p(θ|µ_t, C_t) on the tangent space. The resulting distribution concentrates its mass on models which are probable under p(θ|µ_t, C_t) and close to the manifold.

2 Methods

We begin by summarizing the three key elements of an existing algorithm for optimizing neurophysiology experiments. A more thorough discussion is available in [7]. We model the neuron's response function as a mapping between the neuron's input at time t, s_t, and its response, r_t. We define the input rather generally as a vector which may consist of terms corresponding to a stimulus, e.g., an image or a sound, or the past activity of the neuron itself, {r_{t-1}, r_{t-2}, ...}. The response, r_t, is typically a non-negative integer corresponding to the number of spikes observed in a small time window. Since neural responses are typically noisy, we represent the response function as a conditional distribution, p(r_t|s_t, θ). In this context, optimizing the experimental design means picking the input for which observing the response will provide the most information about the parameters defining the conditional response function.

The first important component of this algorithm is the assumption that p(r_t|s_t, θ) can be adequately approximated by a generalized linear model (GLM) [8, 9]. The likelihood of the response depends on the firing rate, λ_t, which is a function of the input,

    λ_t = E(r_t) = f(θ^T s_t),    (1)

where f(·) is some nonlinear function which is assumed known¹. To identify the response function, we need to estimate the filter coefficients, θ. One important property of the GLM is that we can easily derive sufficient conditions to ensure the log-likelihood is concave [10].

The second key component of the algorithm is that we may reasonably approximate the posterior on θ as Gaussian. This approximation is justified by the log-concavity of the likelihood function and the asymptotic normality of the posterior distribution [11]. As a result, we can recursively compute a Gaussian approximation of the full posterior, p(θ|r_{1:t}, s_{1:t}) ≈ p(θ|µ_t, C_t) [7]. Here (µ_t, C_t) denote the mean and covariance matrix of our Gaussian approximation; µ_t is set to the maximum a posteriori (MAP) estimate of θ.
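To make these first two components concrete, the sketch below shows the likelihood of Eqn. 1 for a Poisson GLM with an exponential nonlinearity (the choice used in our simulations) together with a plain Newton/Laplace update of the Gaussian posterior approximation after one trial. The function names and this particular update scheme are ours, added only for illustration; the algorithm in [7] uses a more efficient recursive update, so this is a minimal sketch rather than that implementation.

```python
import numpy as np

def poisson_glm_loglik(theta, s, r, dt=1.0):
    """Log-likelihood of a single response r to input s under a Poisson GLM
    with an exponential nonlinearity, lambda_t = exp(theta . s_t) * dt."""
    lam = np.exp(theta @ s) * dt
    return r * np.log(lam) - lam

def update_posterior(mu, C, s, r, dt=1.0, n_newton=10):
    """One update of the Gaussian approximation (mu, C) of the posterior after
    observing response r to input s: find the new MAP by Newton's method and
    set the new covariance from the Hessian (Laplace approximation)."""
    C_inv = np.linalg.inv(C)
    theta = mu.copy()
    for _ in range(n_newton):
        lam = np.exp(theta @ s) * dt
        grad = (r - lam) * s - C_inv @ (theta - mu)    # gradient of log posterior
        hess = -lam * np.outer(s, s) - C_inv           # Hessian of log posterior
        theta = theta - np.linalg.solve(hess, grad)    # Newton step
    lam = np.exp(theta @ s) * dt
    C_new = np.linalg.inv(C_inv + lam * np.outer(s, s))
    return theta, C_new
```

Because the exponential nonlinearity keeps the log-likelihood concave in θ, the Newton iteration above is well behaved; the point here is only the form of the recursion (µ_t, C_t) → (µ_{t+1}, C_{t+1}), not the efficient implementation described in [7].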
The final component is an efficient method for picking the optimal input on the next trial, s_{t+1}. Since the purpose of an experiment is to identify the best model, we optimize the design by maximizing the conditional mutual information between r_{t+1} and θ given s_{t+1}, I(θ; r_{t+1}|s_{t+1}). The mutual information measures how much we expect observing the response to s_{t+1} to reduce our uncertainty about θ. We pick the optimal input by maximizing the mutual information with respect to s_{t+1}; as discussed in [7], this step, along with the updating of the posterior mean and covariance (µ_t, C_t), may be computed efficiently enough for real-time implementation in many cases.

2.1 Optimizing experiments to reduce uncertainty along parameter sub-manifolds

For the computation of the mutual information to be tractable, the space of candidate models, Θ, must have some convenient form so that we can derive a suitable expression for the mutual information. Intuitively, to select the optimal design, we need to consider how much information an experiment provides about each possible model. Evaluating the mutual information entails an integral over model space, Θ. The problem with incorporating prior knowledge is that if we restrict the model to some complicated subset of model space, we will no longer be able to efficiently integrate over the set of candidate models. We address this problem by showing how local geometric approximations to the parameter sub-manifold can be used to guide optimal sampling while still maintaining a flexible, tractable representation of the posterior distribution on the full model space.

In many experiments, neurophysiologists expect a priori that the receptive field of a neuron will have some low-dimensional parametric structure; e.g., the receptive field might be well-approximated by a Gabor function [12], or by a difference of Gaussians [13], or by a low-rank spatiotemporal matrix [14, 12]. We can think of this structure as defining a sub-manifold, M, of the full model space, Θ,

    M = {θ : θ = φ(α), α ∈ A}.    (2)

The vector α essentially enumerates the points on the manifold and φ(α) is a function which maps these points into Θ space². A natural example is the case where we wish to enforce the constraint that θ has some parametric form, e.g., a Gabor function. The basic idea is that, rather than conduct an experiment which tests whether the model is on the manifold, we want to run experiments which can identify exactly where on the manifold the optimal model lies.

Since M can have some arbitrary nonlinear shape, computing the informativeness of a stimulus using just the models on the manifold is not easy. Furthermore, if we completely restrict our attention to models in M then we ignore the possibility that our prior knowledge is incorrect. Hence, we do not force the posterior distribution of θ to have support only on the manifold. Rather, we maintain a Gaussian approximation of the posterior on the full space, Θ. However, when optimizing our stimuli we combine our posterior with our knowledge of M in order to do a better job of maximizing the informativeness of each experiment.

¹ It is worth noting that this simple GLM can be generalized in a number of directions; we may include spike-history effects, nonlinear input terms, and so on [9].
² We can relax the condition that this parameter subset M is a manifold.
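As an illustration of the mapping φ(α) in Eqn. 2, the sketch below builds a Gabor-parameterized receptive field from a handful of manifold coordinates. The particular parameterization (center, width, spatial frequency, orientation, phase) and the function name are our own choices, meant only as an example of a differentiable map from a small parameter vector into the full model space.

```python
import numpy as np

def gabor_phi(alpha, size=16):
    """Map manifold coordinates alpha to a full receptive field theta.

    alpha = (x0, y0, sigma, spatial_freq, orientation, phase); this particular
    parameterization is illustrative only. Any differentiable map from a small
    parameter vector to the filter defines a sub-manifold M of the model space."""
    x0, y0, sigma, freq, ori, phase = alpha
    ys, xs = np.mgrid[0:size, 0:size]
    xr = (xs - x0) * np.cos(ori) + (ys - y0) * np.sin(ori)
    yr = -(xs - x0) * np.sin(ori) + (ys - y0) * np.cos(ori)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * freq * xr + phase)
    return (envelope * carrier).ravel()   # theta lives in the full model space

# Example: a point on the Gabor sub-manifold of a 256-dimensional model space.
alpha = np.array([8.0, 8.0, 3.0, 0.15, np.pi / 4, 0.0])
theta = gabor_phi(alpha)
```

Any other differentiable parametric form (a difference of Gaussians, a low-rank STRF) would play the same role: a handful of coordinates α picks out a point θ = φ(α) in the full model space.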
Computing the mutual information I(r_{t+1}; θ|s_{t+1}, s_{1:t}, r_{1:t}) entails an integral over model space weighted by the posterior probability of each model. We integrate over model space because the informativeness of an experiment clearly depends on what we already know, i.e., the likelihood we assign to each model given the data and our prior knowledge. Furthermore, the informativeness of an experiment will depend on the outcome. Hence, we use what we know about the neuron to make predictions about the experimental outcome.

Unfortunately, since the manifold in general has some arbitrary nonlinear shape, we cannot easily compute integrals over the manifold. Furthermore, we do not want to continue to restrict ourselves to models on the manifold if the data indicates our prior knowledge is wrong. We can solve both problems by making use of the tangent space of the manifold, as illustrated in Figure 1. The tangent space is a linear space which provides a local approximation of the manifold. Since the tangent space is a linear subspace of Θ, integrating over θ in the tangent space is much easier than integrating over all θ on the manifold; in fact, the methods introduced in [7] may be applied directly to this case.

The tangent space is a local linear approximation evaluated at a particular point, µ_{M,t}, on the manifold. For µ_{M,t} we use the projection of µ_t onto the manifold (i.e., µ_{M,t} is the closest point in M to µ_t). Depending on the manifold, computing µ_{M,t} can be nontrivial; the examples considered in this paper, however, all have tractable numerical solutions to this problem.

The challenge is representing the set of models close to µ_{M,t} in a way that makes integrating over the models tractable. To find models on the manifold close to µ_{M,t} we want to perturb the parameters α about the values corresponding to µ_{M,t}. Since φ is in general nonlinear, there is no simple expression for the combination of all such perturbations. However, we can easily approximate the set of θ resulting from these perturbations by taking linear combinations of the partial derivatives of φ with respect to α. The partial derivative ∂φ/∂α_i is the direction in Θ in which θ moves if we perturb one of the manifold's parameters. If we take a linear combination of the partial derivatives, the result is a θ which approximates the θ corresponding to perturbing the manifold's parameters. Thus, the subspace formed by linear combinations of the partial derivatives approximates the set of models on the manifold close to µ_{M,t}. This subspace is the tangent space,

    T_{µ_{M,t}}M = {θ : θ = µ_{M,t} + Bb, b ∈ R^{dim(M)}},    B = orth([∂φ/∂α_1, ..., ∂φ/∂α_d]),    (3)

where orth returns an orthonormal basis for the column space of its argument. Here T_x M denotes the tangent space at the point x. The columns of B denote the directions in which θ changes if we perturb the manifold's parameters. (In general, the directions corresponding to changes in different parameters are not independent; to avoid this redundancy we compute a set of basis vectors for the space spanned by the partial derivatives.)

We now use our Gaussian posterior on the full parameter space to compute the posterior likelihood of the models in the tangent space. Since the tangent space is a subspace of Θ, restricting our Gaussian approximation, p(θ|µ_t, C_t), to the tangent space means we are taking a slice through our Gaussian approximation of the posterior. Mathematically, we are conditioning on θ ∈ T_{µ_{M,t}}M. The result is a Gaussian distribution on the tangent space whose parameters may be obtained using the standard Gaussian conditioning formula:

    p_tan(θ|µ_{b,t}, C_{b,t}) = N(b; µ_{b,t}, C_{b,t})   if there exists b such that θ = µ_{M,t} + Bb,
                              = 0                         if θ ∉ T_{µ_{M,t}}M,    (4)

    µ_{b,t} = -C_{b,t} B^T C_t^{-1} (µ_{M,t} - µ_t),    C_{b,t} = (B^T C_t^{-1} B)^{-1}.    (5)
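A minimal sketch of the tangent-space construction of Eqn. 3 and the conditioning step of Eqns. 4-5 is given below, with the partial derivatives approximated by finite differences. The function names are ours, and any parameterization φ could be plugged in; the closed-form derivatives available for specific manifolds (such as the low-rank STRFs considered later) would normally be used instead.

```python
import numpy as np

def tangent_basis(phi, alpha, eps=1e-6):
    """Eqn. 3: B = orth([dphi/dalpha_1, ..., dphi/dalpha_d]).  The partial
    derivatives are approximated by central finite differences, and orth is
    computed from a thin SVD; dropping near-zero singular values removes the
    redundancy among the directions."""
    cols = []
    for i in range(len(alpha)):
        step = np.zeros_like(alpha, dtype=float)
        step[i] = eps
        cols.append((phi(alpha + step) - phi(alpha - step)) / (2.0 * eps))
    J = np.column_stack(cols)
    U, sv, _ = np.linalg.svd(J, full_matrices=False)
    return U[:, sv > 1e-10 * sv[0]]

def condition_on_tangent_space(mu_t, C_t, mu_M, B):
    """Eqns. 4-5: the Gaussian p(theta | mu_t, C_t) restricted to the affine
    subspace {theta = mu_M + B b}."""
    C_t_inv = np.linalg.inv(C_t)
    C_b = np.linalg.inv(B.T @ C_t_inv @ B)
    mu_b = -C_b @ B.T @ C_t_inv @ (mu_M - mu_t)
    return mu_b, C_b
```

For example, tangent_basis(gabor_phi, alpha) with the Gabor map sketched earlier yields a basis for the (at most six-dimensional) tangent space at φ(α), and condition_on_tangent_space then gives the low-dimensional Gaussian that the stimulus optimization works with.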
Now, rather than optimizing the stimulus by trying to squeeze down the uncertainty of p(θ|r_{1:t}, s_{1:t}, M) as much as possible (a very difficult task in general), we pick the stimulus which reduces the uncertainty of p_tan(θ|µ_{b,t}, C_{b,t}) as much as possible. We can solve this latter problem directly using the methods presented in [7]. Finally, to handle the possibility that θ ∉ M, every so often we optimize the stimulus using the full posterior p(θ|µ_t, C_t). This simple modification ensures that asymptotically we do not ignore directions orthogonal to the manifold, i.e., that we do not get stuck obsessively sampling along the incorrect manifold. As a result, µ_t will always converge asymptotically to the true parameters, even when θ ∉ M.

Figure 2: The MAP estimates on several trials (columns: t = 250, 500, 750, 1000), obtained using an info. max. design that uses the tangent space, an i.i.d. design, and an info. max. design which did not use the assumption that θ corresponds to a low-rank STRF. The results clearly show that using the tangent space to optimize the design leads to much faster convergence to the true parameters. Since the true STRF was not in fact a rank-2 matrix, the results also show that our knowledge of M does not need to be exact in order to improve the experimental design. Optimizing the design without taking M into account does better than a random design but not nearly as well as our method. The true STRF (fit to real zebra finch auditory responses) used to generate the synthetic data is shown in the last column.

To summarize, our method proceeds as follows:

0. Initial conditions: start with a log-concave (approximately Gaussian) posterior given t previous trials, summarized by the posterior mean, µ_t, and covariance, C_t.
1. Compute µ_{M,t}, the projection of µ_t onto the manifold. (The procedure for computing µ_{M,t} depends on the manifold.)
2. Compute the tangent space of M at µ_{M,t} using Eqn. 3.
3. Compute the posterior restricted to the tangent space, p_tan(θ|µ_{b,t}, C_{b,t}), using the standard Gaussian conditioning formula.
4. Apply the methods in [7] to find the optimal stimulus for trial t + 1.
5. Observe the response r_{t+1}.
6. Update the posterior by recursively updating the posterior mean and covariance: µ_t → µ_{t+1} and C_t → C_{t+1}.
7. Go to step 1.
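For concreteness, the following sketch strings steps 0-7 together, reusing the helper functions sketched above. The candidate-pool search and the variance-based stimulus score are crude stand-ins for the mutual-information optimization of [7], and project_to_manifold is a generic gradient-descent projection; in practice both would be replaced by the manifold-specific projection and the information-theoretic machinery described in the text.

```python
import numpy as np

def project_to_manifold(phi, alpha_init, mu, n_steps=200, lr=1e-2, eps=1e-6):
    """Generic projection of mu onto M = {phi(alpha)} by gradient descent on
    0.5 * ||phi(alpha) - mu||^2.  Real projections are manifold specific
    (e.g., an SVD truncation for the low-rank STRFs of Section 3.1)."""
    alpha = np.array(alpha_init, dtype=float)
    for _ in range(n_steps):
        J = np.column_stack(
            [(phi(alpha + eps * e) - phi(alpha - eps * e)) / (2.0 * eps)
             for e in np.eye(len(alpha))])
        alpha -= lr * (J.T @ (phi(alpha) - mu))   # gradient of the squared error
    return alpha

def run_sequential_design(phi, alpha0, theta_true, candidates, n_trials,
                          full_space_every=20, seed=0):
    """Schematic of steps 0-7.  The stimulus 'score' (posterior variance of the
    candidate's projection) is only a stand-in for the criterion of [7]."""
    rng = np.random.default_rng(seed)
    dim = candidates.shape[1]
    mu, C = np.zeros(dim), np.eye(dim)                          # step 0: prior
    alpha = np.array(alpha0, dtype=float)
    for t in range(n_trials):
        alpha = project_to_manifold(phi, alpha, mu)             # step 1
        mu_M = phi(alpha)
        B = tangent_basis(phi, alpha)                           # step 2 (Eqn. 3)
        mu_b, C_b = condition_on_tangent_space(mu, C, mu_M, B)  # step 3 (Eqns. 4-5)
        if (t + 1) % full_space_every == 0:
            # occasionally score stimuli under the full posterior, so that
            # directions orthogonal to the manifold are not ignored
            scores = np.einsum('ij,jk,ik->i', candidates, C, candidates)
        else:                                                   # step 4 (stand-in)
            proj = candidates @ B
            scores = np.einsum('ij,jk,ik->i', proj, C_b, proj)
        s = candidates[np.argmax(scores)]
        r = rng.poisson(np.exp(theta_true @ s))                 # step 5: observe
        mu, C = update_posterior(mu, C, s, r)                   # step 6: update
    return mu, C                                                # step 7: loop
```

With gabor_phi as φ and a pool of candidate stimuli, this loop reproduces the structure of the summary above, though not the efficiency or the exact information criterion of [7].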
3 Results

3.1 Low rank models

To test our methods in a realistic, high-dimensional setting, we simulated a typical auditory neurophysiology experiment [15, 14, 16]. Here, the objective is to identify the spectro-temporal receptive field (STRF) of the neuron. The input and receptive field of the neuron are usually represented in the frequency domain because the cochlea is known to perform a frequency decomposition of sound. The STRF, θ(ν, τ), is a 2-d filter which relates the firing rate at time t to the amount of energy at frequency ν and time t − τ in the stimulus. To incorporate this spectrotemporal model in the standard GLM setting, we simply vectorize the matrix θ(ν, τ).

Figure 3: Plots comparing the performance of info. max. designs, info. max. designs which use the tangent space, and a shuffled design. The manifold was the set of rank-2 matrices. The plot shows the expected log-likelihood of the spike trains, E[log p(r|s_t, θ)], averaged across inputs in the test set, as a function of the trial number. Using a rank-2 manifold to constrain the model leads to the best fits of the data.

Estimating the STRF can be quite difficult because it is often very high dimensional, e.g., 10^4 parameters. Several researchers, however, have shown that low-rank assumptions can be used to produce accurate approximations of the receptive field while significantly reducing the number of unknown parameters [17, 12, 14, 18]. A low-rank assumption is a more general version of the space-time separable assumption that is often used when studying visual receptive fields. Mathematically, a low-rank assumption means that the matrix corresponding to the STRF can be written as a sum of rank-one matrices,

    θ_Mat = U V^T,    (6)

where θ_Mat indicates the matrix formed by reshaping the vector θ to form the STRF, and U, V are orthogonal matrices.

We simulated an auditory experiment using an STRF fitted to the actual response of a neuron in the Mesencephalicus lateralis pars dorsalis (MLd) of the zebra finch [16]. To reduce the dimensionality we sub-sampled the STRF in the frequency domain and shortened it in the time domain to yield a 20 × 21 STRF. We generated synthetic data by sampling a Poisson process whose instantaneous firing rate was set to the output of a GLM with exponential nonlinearity and θ proportional to the actual STRF from Schneider and Woolley. For the manifold we used the set of θ corresponding to rank-2 matrices. For the STRF we used, the rank-2 assumption turns out to be rather accurate. We also considered manifolds of rank-1 and rank-5 matrices (data not shown), but rank-2 did slightly better. The manifold of rank-r matrices is convenient because we can easily project any θ onto M by reshaping θ as a matrix and then computing its singular value decomposition (SVD), which returns the principal components of the matrix. µ_{M,t} is the matrix formed by the first r principal components of µ_t. To compute the tangent space, we compute the derivative of θ with respect to each component of the matrices U and V.

In Figure 2 we compare the effectiveness of different experimental designs by plotting the MAP, µ_t, on several trials. The results clearly show that using the tangent space to design the experiments leads to much faster convergence to the true parameters. Furthermore, using the assumption that the STRF is rank-2 is beneficial even though the true STRF is not in fact rank-2.
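The projection and tangent-space computations for this rank-r manifold are simple enough to sketch directly. The 20 × 21 size and rank 2 below mirror the setup above, while the function names are ours and the derivative expressions are the standard ones for a product of two factor matrices.

```python
import numpy as np

def project_rank_r(theta, shape, r=2):
    """Project theta onto the manifold of rank-r STRFs: reshape to a matrix,
    truncate its SVD, and return the projected filter plus the factors U, V
    (singular values folded into U), so that theta_Mat ~= U @ V.T."""
    M = theta.reshape(shape)
    U, sv, Vt = np.linalg.svd(M, full_matrices=False)
    U_r = U[:, :r] * sv[:r]
    V_r = Vt[:r, :].T
    return (U_r @ V_r.T).ravel(), U_r, V_r

def rank_r_tangent_basis(U_r, V_r):
    """Tangent space at U_r V_r^T: perturbing entry (i, j) of U moves theta_Mat
    by e_i v_j^T, and perturbing entry (i, j) of V moves it by u_j e_i^T; an SVD
    of the stacked directions removes the redundancy among them."""
    nf, r = U_r.shape
    nt = V_r.shape[0]
    cols = []
    for j in range(r):
        for i in range(nf):
            dM = np.zeros((nf, nt))
            dM[i, :] = V_r[:, j]
            cols.append(dM.ravel())
        for i in range(nt):
            dM = np.zeros((nf, nt))
            dM[:, i] = U_r[:, j]
            cols.append(dM.ravel())
    J = np.column_stack(cols)
    U, sv, _ = np.linalg.svd(J, full_matrices=False)
    return U[:, sv > 1e-10 * sv[0]]

# Example on a filter the size of the 20 x 21 STRF used above:
theta = np.random.randn(20 * 21)
theta_M, U_r, V_r = project_rank_r(theta, (20, 21), r=2)
B = rank_r_tangent_basis(U_r, V_r)   # basis for the tangent space at theta_M
```

Because the projection is just a truncated SVD and the tangent directions are available in closed form, steps 1 and 2 of the summary are particularly cheap for this manifold.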
3.2 Real birdsong data

We also tested our method by using it to reshuffle the data collected during an actual experiment, to find an ordering which provided a faster decrease in the error of the fitted model. During the experiments, we recorded the responses of an auditory neuron in the Mesencephalicus lateralis pars dorsalis of the zebra finch when the songs of other birds and ripple noise were presented to the bird [16]. We compared a design which randomly shuffled the trials to a design which used our info. max. algorithm to select the order in which the trials are processed. We then evaluated the fitted model by computing the expected log-likelihood of the spike trains in response to inputs in a test set which was distinct from our training set. The expectation is taken with respect to our posterior on θ.

To further constrain the models we imposed a smoothing constraint by transforming the STRF into the Fourier domain,

    θ_Mat = Σ_{i,j} α_{i,j} G_{:,i} D_{:,j}^T,    (7)

where G and D are matrices whose columns correspond to sine and cosine functions evaluated at frequencies {0, f_{o,f}, ..., f_{o,f} m_f} and {0, f_{o,t}, ..., f_{o,t} m_t}, respectively. f_{o,f} and f_{o,t} are the fundamental frequencies and are set so that one period corresponds to the dimensions of the STRF. m_f and m_t are the largest integers such that f_{o,f} m_f and f_{o,t} m_t are less than the Nyquist frequency. The unknown parameters are the amplitudes α = {..., α_{i,j}, ...}. To impose a smoothing constraint we forced α_{i,j} = 0 for i > 10 or j > 2.

In addition to the smoothing constraint, we want to impose the low-rank constraint introduced in the previous section. Combining the low-rank constraint with our representation of θ in the Fourier domain means that the columns of U and V should themselves be linear combinations of 1-d sinusoidal functions. To apply our methods, we need to compute the projection of the parameters onto the manifold. If α are the parameters of our manifold then we can rewrite the STRF as

    θ_Mat = G Ũ Σ Ṽ^T D^T,    (8)

where Ũ and Ṽ are orthonormal matrices. Each column of the matrices G Ũ and D Ṽ is one of the principal components of our smoothed representation of the STRF; these components are formed by taking linear combinations of sine and cosine functions, i.e., of the columns of G and D. Σ is the projection of the STRF onto each principal component. Thus, to compute Ũ, Σ, and Ṽ, we simply compute the SVD of G^T θ_Mat D.

The results, Figure 3, show that both info. max. designs did better than a shuffled design. Furthermore, incorporating the low-rank assumption using the tangent space further improved the info. max. design. A plot of the STRFs, Figure 4, clearly shows that the info. max. designs converge more rapidly to the final estimate of the MAP. These results show that by optimizing our experiment, we can increase the amount of information gained about the neuron. In an actual experiment, we would expect to see a much larger improvement when we use an info. max. design because during the experiment we would be free to pick any input. Thus, the different designs could end up picking vastly different inputs to train with. In contrast, when re-analyzing the data offline, all we can do is reshuffle the trials.

4 Conclusion

We have provided a method for incorporating detailed prior information into existing algorithms for the information-theoretic optimal design of neurophysiology experiments. These methods use realistic assumptions about the neuron's response function and choose significantly more informative stimuli, leading to faster convergence to the true response function using fewer experimental trials. We expect that the inclusion of this strong prior information will help experimentalists contend with the high dimensionality of neural response functions.

5 Acknowledgments

We thank Vincent Vu and Bin Yu for helpful conversations. JL is supported by the Computational Science Graduate Fellowship Program administered by the DOE under contract DE-FG02-97ER25308 and by the NSF IGERT Program in Hybrid Neural Microsystems at Georgia Tech via grant number DGE-0333411. LP is supported by an NSF CAREER award and a Gatsby Initiative in Brain Circuitry Pilot Grant.
Figure 4: A comparison of the STRF estimated from the bird song data using different designs (rows: shuffled, info. max., info. max. rank-2; columns: trials 500, 1000, 1250, 1500, 1750, and 10000; each panel plots frequency against time). We chose the trials by picking the interval over which the expected log-likelihood of the different designs differed the most in Fig. 3.

References

[1] P. Foldiak, Neurocomputing 38-40, 1217 (2001).
[2] R. C. deCharms, et al., Science 280, 1439 (1998).
[3] T. Gollisch, et al., Journal of Neuroscience 22, 10434 (2002).
[4] F. Edin, et al., Journal of Computational Neuroscience 17, 47 (2004).
[5] C. Machens, et al., Neuron 47, 447 (2005).
[6] K. N. O'Connor, et al., Journal of Neurophysiology 94, 4051 (2005).
[7] J. Lewi, et al., accepted to Neural Computation (2008).
[8] E. Simoncelli, et al., The Cognitive Neurosciences, M. Gazzaniga, ed. (MIT Press, 2004), third edn.
[9] L. Paninski, et al., Computational Neuroscience: Theoretical Insights into Brain Function (Elsevier, 2007), chap. Statistical models for neural encoding, decoding, and optimal stimulus design.
[10] L. Paninski, Network: Computation in Neural Systems 15, 243 (2004).
[11] L. Paninski, Neural Computation 17, 1480 (2005).
[12] A. Qiu, et al., J Neurophysiol 90, 456 (2003).
[13] C. Enroth-Cugell, et al., Journal of Physiology 187, 517 (1966).
[14] J. F. Linden, et al., Journal of Neurophysiology 90, 2660 (2003).
[15] F. E. Theunissen, et al., Journal of Neuroscience 20, 2315 (2000).
[16] S. M. Woolley, et al., The Journal of Neuroscience 26, 2499 (2006).
[17] D. A. Depireux, et al., Journal of Neurophysiology 85, 1220 (2001).
[18] M. B. Ahrens, et al., Network 19, 35 (2008).