An improved estimator of Variance Explained in the presence of noise

Ralf M. Haefner*
Laboratory for Sensorimotor Research
National Eye Institute, NIH
Bethesda, MD 20892
ralf.haefner@gmail.com

Bruce G. Cumming
Laboratory for Sensorimotor Research
National Eye Institute, NIH
Bethesda, MD 20892
bgc@lsr.nei.nih.gov

*Corresponding author (ralf.haefner@gmail.com)

Abstract

A crucial part of developing mathematical models of information processing in the brain is the quantification of their success. One of the most widely used metrics yields the percentage of the variance in the data that is explained by the model. Unfortunately, this metric is biased due to the intrinsic variability in the data. We derive a simple analytical modification of the traditional formula that significantly improves its accuracy (as measured by bias) with similar or better precision (as measured by mean-square error) in estimating the true underlying Variance Explained by the model class. Our estimator advances on previous work by a) accounting for overfitting due to free model parameters, mitigating the need for a separate validation data set, b) adjusting for the uncertainty in the noise estimate, and c) adding a conditioning term. We apply our new estimator to binocular disparity tuning curves of a set of macaque V1 neurons and find that on a population level almost all of the variance unexplained by Gabor functions is attributable to noise.

1 Introduction

Constructing models of biological systems, e.g. in systems neuroscience, mostly aims at providing functional descriptions, not fundamental physical laws. It seems likely that any parametric model of signal processing in single neurons can be ruled out given a sufficient amount of data. Rather than only testing the statistical validity of a particular mathematical formulation against data, e.g. by using a $\chi^2$-test, it is equally important to know how much of the signal, or variance, in the data is explained by the model. This is commonly measured by Variance Explained (VE), the coefficient of determination or $r^2$ statistic.

A fundamental problem of the traditional estimator for VE is its bias in the presence of noise in the data. This noise may be due to measurement error or sampling noise owing to the high intrinsic variability in the underlying data. This is especially important when trying to model cortical neurons, where variability is ubiquitous. Either kind of noise is in principle unexplainable by the model and hence needs to be accounted for when evaluating the quality of the model. Since the total variance in the data consists of the true underlying variance plus that due to noise, the traditional estimator yields a systematic underestimation of the true VE that the model would attain in the absence of noise [1][2][3]. This has been noted by several authors before us: David & Gallant compute the traditional measure at several noise levels and extrapolate it to the noise-free condition [1]. This method relies on many repeats of the same stimulus and is therefore often impractical. Sahani & Linden add an analytical correction to the traditional formula in order to reduce its bias [2]. A number of subsequent studies have used their correction to evaluate their models (e.g. [4][5][6]).
We further improve on Sahani & Linden's formula in three ways: 1) most importantly, by accounting for the number of parameters in the model, 2) by adding a correction term for the uncertainty in the noise estimation, and 3) by including a conditioning term to improve the performance in the presence of excessive noise. We propose a principled method to choose the conditioning term in order to selectively minimize either the bias or the mean-square error (MSE) of the estimator. In numerical simulations we find that the analytical correction alone is capable of drastically reducing the bias at moderate and high noise levels while maintaining a mean-square error about as good as that of the traditional formula. Only for very high levels of noise is it advantageous to make use of the conditioning term. We test the effect of our improved formula on a data set of disparity-selective macaque V1 neurons and find that for many cells noise accounts for most of the unexplained variance. On a population level we find that, after adjusting for the noise, Gabor functions can explain about 98% of the underlying response variance.

2 Derivation of an improved estimator

2.1 Traditional Variance Explained

Given a set of $N$ measurements $d_i$ of process $D$ and given the model predictions $m_i$, the traditional Variance Explained is computed as the difference of the total variance $\mathrm{var}(d_i)$ and the variance of the residuals of the model $\mathrm{var}(d_i - m_i)$. It is usually reported as a fraction of total variance:

$$\hat{v} = \frac{\mathrm{var}(d_i) - \mathrm{var}(d_i - m_i)}{\mathrm{var}(d_i)} = 1 - \frac{\mathrm{var}(d_i - m_i)}{\mathrm{var}(d_i)} = 1 - \frac{\sum_{i=1}^{N} (d_i - m_i)^2}{\sum_{i=1}^{N} (d_i - \bar{d})^2}. \quad (1)$$

In most cases the $d_i$ are themselves averages of individual measurements and subject to a sampling error. Since the variances of independent random variables add, this measurement noise leads to additive terms in both numerator and denominator of equation (1): essentially the two terms that we subtract in equation (8) (see derivation below). From (8) one immediately sees that as the noise level increases, $\hat{v} \to (n-1)/(N-1)$, with $n$ being the number of model parameters. The consequence is a systematic misestimation of the true Variance Explained (typically an underestimation, since $(n-1)/(N-1)$ is usually smaller than the true VE). The effect of this can be seen in Figure 1 for two example simulations. In each simulation we fit a model to simulated noisy data sampled from a different but known underlying function. This allows us to compare the estimated VE to the true one in the absence of noise. The average bias (estimated VE minus true VE) of the traditional Variance Explained is shown for 2000 instantiations of each simulation (blue lines). As we simulate an increase in sampling noise, the Variance Explained decreases significantly, underestimating the true VE by up to 30% in our examples.
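As a concrete illustration of equation (1) and the bias just described, here is a minimal sketch in Python/NumPy; the function name and the toy tuning curve are ours, for illustration only:

```python
import numpy as np

def traditional_ve(d, m):
    """Traditional Variance Explained, eq. (1): one minus the residual
    variance of the model as a fraction of the total variance in the data."""
    d, m = np.asarray(d, float), np.asarray(m, float)
    return 1.0 - np.sum((d - m) ** 2) / np.sum((d - d.mean()) ** 2)

# Toy demonstration of the noise bias: the same underlying responses and
# model predictions, evaluated without and with sampling noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
D = np.sin(3.0 * x)                     # true underlying responses D_i
M = 0.9 * np.sin(3.0 * x)               # model predictions M_i
print(traditional_ve(D, M))             # true VE in the noise-free condition
d = D + rng.normal(0.0, 0.5, D.shape)   # noisy condition means d_i
print(traditional_ve(d, M))             # systematically lower, on average
```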
2.2 Noise bias

Let $\bar{d}_i = \frac{1}{R_i} \sum_{j=1}^{R_i} d_{ij}$, where the $R_i$ are the number of observations for each variable $i$. We further assume that the measured $d_{ij}$ are drawn from a Gaussian distribution around the true means $D_i$ with a variance of $R_i \sigma_i^2$. Then the $\bar{d}_i$ are drawn from $\mathcal{N}[D_i; \sigma_i^2]$. To simplify the presentation we assume that the variables have been transformed to equalize all $\sigma_i$ and that $R \equiv R_i$. It follows that

$$\hat{\sigma}^2 = \frac{1}{R N (R-1)} \sum_{i=1}^{N} \sum_{j=1}^{R} (d_{ij} - \bar{d}_i)^2$$

is the estimate of $\sigma^2$ based on measurements with $N_\sigma = N(R-1)$ degrees of freedom. Let $M_i$ be the best-fitting model to $D_i$ of a given model class with $n$ parameters. Then the Variance Explained in the absence of noise becomes:

$$v_0 = 1 - \frac{\mathrm{var}(M_i - D_i)}{\mathrm{var}(D_i)} = 1 - \frac{\sum_{i=1}^{N} (D_i - M_i)^2}{\sum_{i=1}^{N} (D_i - \bar{D})^2} \quad (2)$$

where $\bar{D} = \frac{1}{N} \sum_{i=1}^{N} D_i$. This $v_0$ is the true value for the Variance Explained that one would like to know: based on the best fit of the model class to the underlying data in the absence of any measurement or sampling noise. $v_0$ is of course unknown, and the values obtained by (1) are drawn from a probability distribution around the true Variance Explained.

Normalizing both denominator and numerator of formula (1) by $\hat{\sigma}^2$ leaves $\hat{v}$ unchanged. However, it becomes clear that the resulting denominator is drawn from a noncentral $F$-distribution:

$$\frac{1}{\hat{\sigma}^2} \frac{1}{N-1} \sum_{i=1}^{N} (\bar{d}_i - \bar{d})^2 = \frac{\frac{1}{N-1} \sum_{i=1}^{N} (\bar{d}_i - \bar{d})^2 / \sigma^2}{\frac{1}{N_\sigma} \sum_{i=1}^{N} \sum_{j=1}^{R} (d_{ij} - \bar{d}_i)^2 / (R \sigma^2)} \sim \frac{\chi^2_{N-1}(\lambda_{DD}) / (N-1)}{\chi^2_{N_\sigma} / N_\sigma} \quad (3)$$

with $N-1$ and $N_\sigma = N(R-1)$ degrees of freedom, the noncentrality parameter $\lambda_{DD} = \sum_{i=1}^{N} (D_i - \bar{D})^2 / \sigma^2$, and $\bar{d} = \frac{1}{N} \sum_{i=1}^{N} \bar{d}_i$. For $N_\sigma > 2$ the mean of this distribution is given by

$$E\left[ \frac{1}{\hat{\sigma}^2} \frac{1}{N-1} \sum_{i=1}^{N} (\bar{d}_i - \bar{d})^2 \right] = \frac{N_\sigma (N - 1 + \lambda_{DD})}{(N-1)(N_\sigma - 2)}. \quad (4)$$

Hence, an unbiased estimator of $\sum_{i=1}^{N} (D_i - \bar{D})^2 / \sigma^2 = \lambda_{DD}$ is given by

$$\hat{\lambda}_{DD} = \frac{N_\sigma - 2}{N_\sigma} \sum_{i=1}^{N} \frac{(\bar{d}_i - \bar{d})^2}{\hat{\sigma}^2} - (N-1). \quad (5)$$

With the same reasoning we find that the numerator of equation (1),

$$\frac{1}{\hat{\sigma}^2} \frac{1}{N-n} \sum_{i=1}^{N} (\bar{d}_i - m_i)^2 \sim \frac{\chi^2_{N-n}(\lambda_{DM}) / (N-n)}{\chi^2_{N_\sigma} / N_\sigma}, \quad (6)$$

follows a noncentral $F$-distribution with $N-n$ and $N_\sigma$ degrees of freedom and the noncentrality parameter $\lambda_{DM} = \sum_{i=1}^{N} (D_i - M_i)^2 / \sigma^2$. Hence, an unbiased estimator of $\sum_{i=1}^{N} (D_i - M_i)^2 / \sigma^2 = \lambda_{DM}$ is given by

$$\hat{\lambda}_{DM} = \frac{N_\sigma - 2}{N_\sigma} \sum_{i=1}^{N} \frac{(\bar{d}_i - m_i)^2}{\hat{\sigma}^2} - (N-n). \quad (7)$$

Combining (5) and (7) yields an estimator for $v_0$ whose numerator and denominator are individually unbiased:

$$\widehat{[v_0]} = 1 - \frac{\sum_{i=1}^{N} \left( \frac{\bar{d}_i - m_i}{\hat{\sigma}} \right)^2 - \frac{N_\sigma}{N_\sigma - 2}(N-n)}{\sum_{i=1}^{N} \left( \frac{\bar{d}_i - \bar{d}}{\hat{\sigma}} \right)^2 - \frac{N_\sigma}{N_\sigma - 2}(N-1)}. \quad (8)$$

Note that the estimator proposed by Sahani & Linden is contained in ours as a special case, becoming identical when there is no uncertainty in the noise estimate ($N_\sigma \to \infty$) and when testing a model with no free parameters ($n = 0$). $N_\sigma \to \infty$ is an excellent approximation in their case of fitting receptive fields to long series of data, but less so in the case of fitting tuning curves with a limited number of data points. Moreover, the fact that their noise term does not account for overfitting due to free parameters in the model means that their formula overestimates the true Variance Explained. Hence, it requires a separate validation data set, which might be costly to obtain.

At this point we wish to note that (5), (7) and (8) readily generalize to cases where the noise level $\sigma_i$ and the number of observations $R_i$ on which the means $\bar{d}_i$ are based (and therefore $N_{\sigma_i}$) differ between data points.
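The following sketch implements equation (8) in Python/NumPy (the function and variable names are ours). It takes the raw trial matrix $d_{ij}$, the model predictions $m_i$, and the number of free parameters $n$, and applies both the noise correction and the overfitting correction:

```python
import numpy as np

def unbiased_ve(trials, m, n_params):
    """Noise-corrected Variance Explained, eq. (8).

    trials   : (N, R) array of raw measurements d_ij, R repeats per condition
    m        : (N,) array of model predictions m_i for the condition means
    n_params : number n of free parameters of the fitted model
    """
    trials = np.asarray(trials, float)
    m = np.asarray(m, float)
    N, R = trials.shape
    d_bar = trials.mean(axis=1)            # condition means d_bar_i
    N_sigma = N * (R - 1)                  # d.o.f. of the noise estimate
    # Estimate of sigma^2, the variance of the condition means (Sec. 2.2)
    sigma2 = np.sum((trials - d_bar[:, None]) ** 2) / (R * N_sigma)
    corr = N_sigma / (N_sigma - 2.0)       # uncertainty of the noise estimate
    num = np.sum((d_bar - m) ** 2) / sigma2 - corr * (N - n_params)
    den = np.sum((d_bar - d_bar.mean()) ** 2) / sigma2 - corr * (N - 1)
    # For very noisy data 'den' can approach zero or turn negative;
    # this is what the conditioning term of Section 2.3 addresses.
    return 1.0 - num / den
```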
2.3 Conditioning term

First it is important to note that while both numerator and denominator in formula (8) are now unbiased, the ratio generally is not. In fact, the ratio is not even well-defined for arbitrary measurements, since the denominator can become zero or negative. In practice this is avoided by implicit or explicit selection criteria imposed by the experimenter, requiring a minimum degree of signal-to-noise in the data before further analysis and fitting of a model. For example, one could define a criterion based on the significance level $p_{\mathrm{ANOVA}}$ of the modulation in the data as assessed by a 1-way ANOVA test. (Any criterion can be used in the context of the framework described here, as long as it is used consistently.) The effect of such a criterion is to cut off the lower tail of the distribution from which the denominator is drawn, to exclude zero. This introduces a bias to the denominator, the size of which depends on the amount of noise and the strictness of the criterion used. We recognize that both biases are strongest when the data is such that the ratio is close to singular and therefore propose an additive conditioning term $C$ in the denominator of (8):

$$\hat{v}(C) = 1 - \frac{\sum_{i=1}^{N} \left( \frac{\bar{d}_i - m_i}{\hat{\sigma}} \right)^2 - \frac{N_\sigma}{N_\sigma - 2}(N-n)}{\sum_{i=1}^{N} \left( \frac{\bar{d}_i - \bar{d}}{\hat{\sigma}} \right)^2 - \frac{N_\sigma}{N_\sigma - 2}(N-1) + C}. \quad (9)$$

Depending on the application, the optimal $C$ can be chosen to minimize either the mean-square error (MSE) $E[(\hat{v}(C) - v_0)^2]$ or the bias $|E[\hat{v}(C)] - v_0|$ of the estimator. Generally, the optimal levels of conditioning for the two scenarios are different, i.e. unbiasedness comes at the expense of an increased MSE and vice versa. For individual estimates it makes sense to accept a small bias in order to improve accuracy (and hence minimize MSE). However, when averaging over a large number of estimates, e.g. from a population of neurons, it becomes important that the estimator be unbiased. $C = C(N, n, N_\sigma, \lambda_{DM}, \lambda_{DD}; p_{\mathrm{ANOVA}})$ is itself a function of a number of variables, only two of which, $\lambda_{DM}$ and $\lambda_{DD}$, are unknown a priori. We approximate them by our estimates from equations (5) and (7). The optimal $C$ can then be determined in each case by a simple minimization across a large number of random samples drawn from the appropriate distributions (compare equations (3) and (6)):

$$C_{\mathrm{bias}}: \quad \min_C \left| E[\hat{v}(C)] - \left(1 - \hat{\lambda}_{DM} / \hat{\lambda}_{DD}\right) \right| \quad (10)$$

and therefore

$$C_{\mathrm{bias}}: \quad \min_C \left| E\left[ \frac{\chi^2_{N-n}(\hat{\lambda}_{DM}) / \chi^2_{N_\sigma} - (N-n)/(N_\sigma - 2)}{\chi^2_{N-1}(\hat{\lambda}_{DD}) / \chi^2_{N_\sigma} - (N-1)/(N_\sigma - 2) + C/N_\sigma} \right] - \frac{\hat{\lambda}_{DM}}{\hat{\lambda}_{DD}} \right| \quad (11)$$

$$C_{\mathrm{MSE}}: \quad \min_C E\left[ \left( \frac{\chi^2_{N-n}(\hat{\lambda}_{DM}) / \chi^2_{N_\sigma} - (N-n)/(N_\sigma - 2)}{\chi^2_{N-1}(\hat{\lambda}_{DD}) / \chi^2_{N_\sigma} - (N-1)/(N_\sigma - 2) + C/N_\sigma} - \frac{\hat{\lambda}_{DM}}{\hat{\lambda}_{DD}} \right)^2 \right] \quad (12)$$

Note that the $\chi^2_{N_\sigma}$ variables in numerator and denominator, sampling over varying estimates of the underlying noise $\sigma^2$, are shared in both formulas since the estimate $\hat{\sigma}^2$ is shared. These two minimization problems can easily be solved by Monte-Carlo sampling of the probability distributions and subsequently finding the minimum of MSE or bias, respectively, across all samples.
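A minimal sketch of this Monte-Carlo search over $C$ (Python/NumPy; the candidate grid and sample count are our illustrative choices). The noncentral and central $\chi^2$ samples correspond term by term to equations (11) and (12), with the $\chi^2_{N_\sigma}$ sample shared between numerator and denominator:

```python
import numpy as np

def optimal_C(N, n, N_sigma, lam_DM, lam_DD, criterion="mse",
              n_samples=100_000, C_grid=None, seed=0):
    """Monte-Carlo choice of the conditioning term C, eqs. (10)-(12).

    lam_DM, lam_DD : plug-in estimates of the noncentrality parameters,
                     taken from eqs. (7) and (5).
    criterion      : "bias" minimizes |E[v(C)] - v0|, "mse" the MSE.
    """
    rng = np.random.default_rng(seed)
    # Plug-in estimates can be slightly negative; clamp for sampling.
    lam_DM, lam_DD = max(lam_DM, 0.0), max(lam_DD, 1e-9)
    target = lam_DM / lam_DD           # corresponds to v0 = 1 - lam_DM/lam_DD
    # Shared noise sample: numerator and denominator use the same sigma-hat^2.
    chi2_noise = rng.chisquare(N_sigma, n_samples)
    num = (rng.noncentral_chisquare(N - n, lam_DM, n_samples) / chi2_noise
           - (N - n) / (N_sigma - 2.0))
    den = (rng.noncentral_chisquare(N - 1, lam_DD, n_samples) / chi2_noise
           - (N - 1) / (N_sigma - 2.0))
    if C_grid is None:                 # illustrative search range
        C_grid = np.linspace(0.0, 10.0 * (N - 1) / N_sigma, 200)

    def loss(C):
        ratio = num / (den + C / N_sigma)
        if criterion == "bias":
            return abs(np.mean(ratio) - target)
        return np.mean((ratio - target) ** 2)

    return min(C_grid, key=loss)
```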
2.4 Application to simulated data

Figure 1 demonstrates the performance of various estimators of VE for three synthetic examples. In the left column we show the results of testing a model that consists of a 3rd-degree polynomial fit to noisy data sampled from a Gaussian distribution around an underlying sine function. Over the domain studied here, the true VE of the model as fit to the data in the noiseless condition would be 77%. The center and right columns show the case of a Gabor function fit to noisy data sampled around a difference-of-Gaussians "reality". Here the true VE is 90%. The center column simulates Gaussian noise and the right column Gamma noise (Fano factor of 2). We confirm that the traditional VE measure (blue) has an increasingly negative bias with increasing noise level $\sigma$. Applying the Sahani-Linden correction (green), this negative bias is turned into a positive one, since the overfitting of noise due to the free parameters in the model is not taken into consideration. This leads to an overestimation of the true VE when applied to the fitting data instead of a separate set of validation data. Accounting for the number of parameters greatly reduces the bias to close to zero across a large range of noise levels (red curves). Only at the highest noise levels (at which a large number of data samples does not pass the ANOVA test for significant modulation) does the bias increase significantly, while still remaining smaller than that of the traditional estimator.

Figure 1: Simulation results. Left column: a 3rd-degree polynomial is fit to noisy data drawn from an underlying sine function. Center & right columns: a Gabor function is fit to noisy data around a linear combination of three Gaussians, two 'excitatory' and one 'inhibitory'. Left & center: Gaussian noise; right: Gamma-distributed noise (Fano factor of 2). First row: data (blue) and model (red) are shown in the noise-free condition. Their true VE is 77% and 90%, respectively. Rows 2-5: bias (defined as estimated minus true VE) and RMSE are shown as a function of noise $\sigma$. The traditional estimator is shown in blue, the Sahani-Linden correction in green, our estimator from eq. (8) in red. Rows 4 & 5: VE measures were restricted to the range 0 to 1. Estimators with conditioning (eq. 9) optimized for bias are shown in cyan and for MSE in magenta. Restricting VE to $0 \leq \hat{v} \leq 1$ is the reason for the plateau in the bias of the Sahani-Linden estimator. In all panels, data samples with insignificant variation in the data ($p_{\mathrm{ANOVA}} > 0.05$) were excluded from the analysis. Note the different scales in each panel.

Figure 2: Tradeoff between number of conditions $N$ and number of repetitions $R$ at each condition. Traditional measure in blue and unbiased estimate in red. The total number of measurements was fixed at $N \cdot R = 120$, while the number of different conditions $N$ is varied along the abscissa.

The reason for the decreasing bias of the Sahani-Linden estimator at very high noise levels is the coincidental cancellation of two bias terms: the negative bias at high noise levels, also seen in our estimator for Gabor fits to differences of Gaussians, and their general positive bias due to not taking the overfitting of parameters into account. Comparing the MSE (shown as root-mean-square error, RMSE) of the different estimators shows that they are similar in the case of fitting a polynomial (left column) and significantly improved in the case of fitting a Gabor function (center & right columns; note the different y-axis scales among the columns).¹

The bottom two rows simulate the situation where prior knowledge about the possible range for VE is explicitly enforced. Since the numerator of our unbiased estimator (eq. 8) yields values around its noiseless value that can be positive or negative, VE can become negative or greater than one. Clipping VE to the range between zero and one of course violates its unbiasedness. We test whether a conditioning term can improve the performance of our estimator and find that this is the case for the Gabor fit, but not the polynomial fit. In the case of the Gabor fit, the improvement due to the conditioning term is greatest at the highest noise levels, as expected. The bias is decreased at the highest three noise levels tested, and the MSE is slightly decreased (at the highest noise level) or the same as without conditioning. Where the purely analytical formula outperforms the one with conditioning, this is because the approximations we have to make in determining the optimal $C$ are greater than the inaccuracy of the analytical formula at those noise levels.
This is especially true in the 3rd column, where the strongly non-Gaussian noise is incompatible with the Gaussian assumption in our computation of $C$. The conclusion is that, unless one has to estimate VE in the presence of extremely high noise and has confirmed that conditioning provides an improvement for the particular situation under consideration, our analytical estimator is preferable. (Note the different y-axis scales when comparing with the 2nd and 4th rows.)

Using an estimator that accounts for the amount of noise has another major benefit. The fact that the total number of measurements $N \cdot R$ one can make is usually limited poses a tradeoff between the number of conditions $N$ and the number of repeats $R$. Everything else being equal, the result of the traditional estimator for VE will unfortunately depend strongly on that choice: the more conditions and the fewer repeats, the higher the standard error of the means (noise) and hence the lower the estimated VE will be, regardless of the model. Figure 2 demonstrates this behavior in the case of fitting a Gabor to a difference of Gaussians, exactly as in Figure 1. Keeping the total number of measurements constant, the traditional VE (blue line) decreases drastically as the number of conditions $N$ is increased. The new unbiased estimator, in comparison, has a much reduced bias and depends only weakly on $R$. This means that relatively few repeats (but at least 2) are necessary, allowing many more conditions to be tested than previously, hence increasing resolution.

¹ It is not surprising that the precise behavior of the respective estimators varies between examples. Two approximations were made in the analytical derivation: (1) that the model parameters are independent, and (2) unbiasing the denominator is not the same as unbiasing the ratio. Both approximations are accurate in the small-noise regime. However, as noise levels increase, they introduce biases that interact depending on the situation.
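For concreteness, the following condensed sketch (Python with NumPy/SciPy) shows the kind of simulation behind Figures 1 and 2, reusing the traditional_ve and unbiased_ve sketches from above. The difference-of-Gaussians parameters, noise level, and fit initialization are illustrative choices of ours, not the exact values used in our simulations:

```python
import numpy as np
from scipy.optimize import curve_fit

def gabor(x, a, b, f, phi, s, x0):
    """Six-parameter Gabor: offset + amplitude * Gaussian envelope * carrier."""
    return a + b * np.exp(-(x - x0) ** 2 / (2 * s ** 2)) * np.cos(2 * np.pi * f * (x - x0) + phi)

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 20)                 # N = 20 conditions
# Difference-of-Gaussians "reality": two excitatory and one inhibitory Gaussian
D = (np.exp(-(x - 0.3) ** 2 / 0.05) + np.exp(-(x + 0.3) ** 2 / 0.05)
     - 1.5 * np.exp(-x ** 2 / 0.02))
R, sigma = 6, 0.2                              # repeats; sd of the condition means
# Trial noise has sd sigma*sqrt(R), so each mean of R trials has sd sigma (Sec. 2.2)
trials = D[:, None] + rng.normal(0.0, sigma * np.sqrt(R), (x.size, R))
d_bar = trials.mean(axis=1)

p0 = [d_bar.mean(), np.ptp(d_bar), 1.0, 0.0, 0.3, 0.0]   # crude starting point
popt, _ = curve_fit(gabor, x, d_bar, p0=p0, maxfev=20000)
m = gabor(x, *popt)

print("traditional VE:", traditional_ve(d_bar, m))   # biased downward by noise
print("corrected VE  :", unbiased_ve(trials, m, n_params=6))
```

Repeating this over many noise draws, and over the grid of $(N, R)$ pairs with $N \cdot R$ fixed, reproduces the qualitative behavior shown in Figures 1 and 2.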
Figure 3: Disparity tuning curves of V1 neurons fit with a Gabor function. A: Data from an example neuron shown with standard-error-of-the-mean (SEM) errorbars. The estimate of VE by the Gabor fit (solid line) changes from 85% to 93% when noise is adjusted for. B: Data from a second example neuron. The VE of the Gabor fit changes from 94% to 95%. $\chi^2$-test on the compatibility of data with model: $p_{\chi^2} = 4 \cdot 10^{-4}$. C: Unbiased VE as a function of signal-to-noise power. One outlier at (0.93; 4.0) not shown. D: Traditional VE estimate vs. unbiased VE with conditioning to minimize MSE. VE values are limited to the range 0 to 1. C & D: Filled symbols denote cells whose responses are incompatible with the Gabor model, as evaluated by a $\chi^2$-test ($p_{\chi^2} < 0.05$).

3 Application to experimental data

3.1 Methods

We recorded extracellularly from isolated V1 neurons in two awake, fixating rhesus macaque monkeys. Details of the experimental protocol are described elsewhere [7]. Stimulus presentations lasted 420 ms and were separated by 100 ms blank periods. Responses were measured as the mean spike rate over the entire stimulus presentation, beginning 50 ms after stimulus onset. The stimulus consisted of dynamic random-dot stereograms (RDS). Binocular disparity was applied perpendicular to the preferred orientation. We only included neurons in the analysis that were significantly modulated by binocular disparity, as evaluated by a one-way ANOVA test; 109 neurons passed the test with $p_{\mathrm{ANOVA}} < 0.05$. We perform all subsequent analyses using the square root of the spike rates to approximately equalize variances across different spike rates. We fit a Gabor function with six parameters to the spike rates of each cell and perform a $\chi^2$-test on the residuals. The minimum number of different conditions was $N_{\min} = 13$ and the median number of repeats was $\mathrm{median}(R) = 15$.

3.2 Results

Most disparity tuning curves in V1 are reasonably well described by Gabor functions, which explain more than 90% of the variance in two thirds of the neurons [7]. Whether the remaining third reflects a failure of the model or is merely a consequence of noise in the data has been an open question. Panels A & B in Figure 3 show the responses of two example cells together with their best-fitting Gabor functions. The traditional VE in panel A is only 82% even though the data is not significantly different from the model ($p_{\chi^2} = 0.64$). After adjusting for noise, the unbiased VE becomes 92%, i.e. more than half of the unexplained variance can be attributed to the response variability of each measurement. Panel B shows the opposite situation: 94% of the variance is explained according to the traditional measure and only an additional 1% can be attributed to noise. However, despite this high VE, since the measurement error is relatively small, the model is rejected with high significance ($p_{\chi^2} = 4 \cdot 10^{-4}$).

Panel C shows the unbiased estimate of the VE for the entire population of neurons as a function of their noise power relative to signal power. At high relative noise levels there is a wide spread of values, and for decreasing noise the VE values asymptote near 1. In fact, the overall population mean for the unbiased VE is 98%, compared with the traditional estimate of 82%. This means that for the entire population, most of the variance previously deemed unexplained by the model can in fact be accounted for by our uncertainty about the data. For 22 out of 109 cells, or 20%, the model is rejected ($p_{\chi^2} < 0.05$); these are denoted by filled circles. Panel D demonstrates the effect of the new measure on each individual cell. For the estimation of the true VE of each neuron individually, we incorporate our knowledge about the bounds $0 \leq v_0 \leq 1$ and optimize the conditioning term for minimum MSE. With the exception of two neurons, the new estimate of the true VE is greater than the traditional one. On average, 40% of the unexplained variance in each individual neuron can be accounted for by noise.

4 Conclusions

We have derived a new estimator of the variance explained by models describing noisy data. This estimator improves on previous work in three ways: 1) by accounting for overfitting due to free model parameters, 2) by adjusting for the uncertainty in our estimate of the noise, and 3) by describing a way to add an appropriate level of conditioning in cases of very low signal-to-noise in the data or other imposed constraints. Furthermore, our estimator does not rely on a large number of repetitions of the same stimulus in order to perform an extrapolation to zero noise. In numerical simulations with Gaussian and strongly skewed noise we have confirmed that our correction is capable of accounting for most noise levels and provides an estimate with greatly improved bias compared to previous estimators.
We note that where the results from the two simulations differ, it is the more realistic simulation in which the new estimator performs best. Another important benefit of our new estimator is that it addresses the classical experimenter's dilemma of a tradeoff between the number of conditions $N$ and the number of repeats $R$ at each condition. While the results from the traditional estimator quickly deteriorate with increasing $N$ and decreasing $R$, the new estimator is much closer to invariant with respect to both, allowing the experimenter to choose a greater $N$ for higher resolution.

When applying the new VE estimator to a data set of macaque V1 disparity tuning curves, we find that almost all of the variance previously unaccounted for by Gabor fits can be attributed to sampling noise. For our population of 109 neurons we find that 98% of the variance can be explained by a Gabor model. This is much higher than previous estimates precisely because they did not account for the variability in their data, illustrating the importance of this correction especially in cases where the model is good. The improvement we present is not limited to neuronal tuning curves but will be valuable for any model testing where noise is an important factor.

Acknowledgments

We thank Christian Quaia and Stephen David for helpful discussions.

References

[1] S.V. David and J.L. Gallant, Network 16, 239 (2005).
[2] M. Sahani and J.F. Linden, NIPS (2003).
[3] A. Hsu, A. Borst, and F.E. Theunissen, Network 15, 91 (2004).
[4] C.K. Machens, M.S. Wehr, and A.M. Zador, J Neurosci 24, 1089 (2004).
[5] I. Nauhaus, A. Benucci, M. Carandini, and D.L. Ringach, Neuron 57, 673 (2008).
[6] V. Mante, V. Bonin, and M. Carandini, Neuron 58, 625 (2008).
[7] S.J. Prince, A.D. Pointon, B.G. Cumming, and A.J. Parker, J Neurophysiol 87, 191 (2002).