Reducing statistical dependencies in natural signals using radial Gaussianization

Siwei Lyu
Computer Science Department
University at Albany, SUNY
Albany, NY 12222
lsw@cs.albany.edu

Eero P. Simoncelli
Center for Neural Science
New York University
New York, NY 10003
eero@cns.nyu.edu

Abstract

We consider the problem of transforming a signal to a representation in which the components are statistically independent. When the signal is generated as a linear transformation of independent Gaussian or non-Gaussian sources, the solution may be computed using a linear transformation (PCA or ICA, respectively). Here, we examine a complementary case, in which the source is non-Gaussian but elliptically symmetric. In this situation, the source cannot be decomposed into independent components using a linear transform, but we show that a simple nonlinear transformation, which we call radial Gaussianization (RG), is able to remove all dependencies. We apply this methodology to natural signals, demonstrating that the joint distributions of bandpass filter responses, for both sounds and images, are better described as elliptical than as linearly transformed independent sources. Consistent with this, we demonstrate that the reduction in dependency achieved by applying RG to either pairs or blocks of bandpass filter responses is significantly greater than that achieved by PCA or ICA.

1 Introduction

Signals may be manipulated, transmitted or stored more efficiently if they are transformed to a representation in which there is no statistical redundancy between the individual components. The principle of reducing redundancies in natural signals (known as the efficient coding hypothesis [1, 2]) has been used to explain various properties of biological perceptual systems. Given a source model, the problem of deriving an appropriate transformation to remove statistical dependencies, based on the statistics of observed samples, has been studied for more than a century.
The most well-known example is principal components analysis (PCA), a linear transformation derived from the second-order signal statistics (i.e., the covariance structure) that can fully eliminate dependencies for Gaussian sources. Over the past two decades, a more general method, known as independent component analysis (ICA), has been developed to handle the case when the signal is formed as a linear transformation of independent non-Gaussian sources. ICA has shown success in many applications, especially in deriving bases for natural signals [3, 4, 5, 6]. Although PCA and ICA bases may be computed for nearly any source, they are only guaranteed to eliminate dependencies when the assumed source model is correct. And even in cases where these methodologies seem to produce an interesting solution, the components of the resulting representation may be far from independent. A case in point is that of natural images, for which derived ICA transformations consist of localized oriented basis functions that appear similar to the receptive field descriptions of neurons in mammalian visual cortex [3, 5, 4]. Although dependency between the responses of such linear basis functions is reduced compared to that of the original pixels, this reduction is only slightly more than would be achieved with PCA [7]. Furthermore, ICA filter responses still exhibit striking higher-order dependencies [8, 9].

Fig. 1. Venn diagram of the relationship between density models. The two circles represent the linearly transformed factorial densities assumed by the ICA methods, and the elliptically symmetric densities (ESDs). The intersection of these two classes is the set of all Gaussian densities. The factorial densities form a subset of the linearly transformed factorial densities, and the spherically symmetric densities form a subset of the ESDs.
Here, we consider the dependency elimination problem for the class of source models known as elliptically symmetric densities (ESDs) [10]. For ESDs, linear transforms have no effect on the dependencies beyond second order, and thus ICA decompositions offer no advantage over PCA. We introduce an alternative nonlinear procedure, which we call radial Gaussianization (RG). In RG, the norms of whitened signal vectors are nonlinearly adjusted to ensure that the resulting output density is a spherical Gaussian, whose components are statistically independent. We first show that the joint statistics of spatially proximal bandpass filter responses, for both natural sounds and natural images, are better described as an ESD than as linearly transformed independent sources. Consistent with this, we demonstrate that the reduction in dependency achieved by applying RG to either pairs or blocks of bandpass filter responses is significantly greater than that achieved by PCA or ICA.

2 Elliptically Symmetric Densities

The density of a random vector x ∈ R^d with zero mean is elliptically symmetric if it is of the form:

    p(x) = (1 / (α |Σ|^{1/2})) f(−(1/2) x^T Σ^{−1} x),        (1)

where Σ is a positive definite symmetric matrix, f(·) is a positive-valued generating function satisfying ∫₀^∞ f(−r²/2) r^{d−1} dr < ∞ [10], and the normalizing constant α is chosen so that the density integrates to one. The density p(x) is completely determined by the generating function f(·) and the matrix Σ. Assuming x has finite second-order statistics, Σ is a multiple of the covariance matrix. In the special case when Σ is a multiple of the identity matrix, p(x) is known as a spherically symmetric density (SSD). Note that an ESD can be transformed into an SSD by a whitening operation. The definitive characteristic of an ESD is that its level sets (surfaces of constant probability) are ellipsoids determined by Σ (for an SSD, the level sets are hyper-spheres).
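The whitening claim above can be checked numerically. The sketch below draws samples from a non-Gaussian ESD (a Gaussian scale mixture with an exponential scale variable, our own choice for illustration, not from the paper) and whitens them: the covariance becomes the identity, yet the squared components remain correlated, so the whitened variable is spherically symmetric but not independent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from a (non-Gaussian) elliptically symmetric density:
# a Gaussian scale mixture x = sqrt(z) * A u, with u standard normal
# and z a random scale shared across all dimensions.
d, n = 4, 100_000
A = rng.standard_normal((d, d))          # mixing matrix, so Sigma ∝ A A^T
u = rng.standard_normal((d, n))
z = rng.exponential(size=n)              # shared scale -> heavy tails
x = np.sqrt(z) * (A @ u)

# Whitening: multiply by the inverse matrix square root of the covariance.
C = np.cov(x)
evals, evecs = np.linalg.eigh(C)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T
x_wht = W @ x

# The whitened variable has identity covariance (spherically symmetric) ...
print(np.allclose(np.cov(x_wht), np.eye(d), atol=1e-6))
# ... but is still dependent: squared components stay correlated
# (about 0.2 in expectation for this mixer), because of the shared scale z.
c = np.corrcoef(x_wht[0] ** 2, x_wht[1] ** 2)[0, 1]
print(c > 0.1)
```

The residual correlation between squared components is exactly the kind of higher-order dependency that no further linear (orthogonal) transform can remove.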
When the generating function is an exponential, the resulting ESD is a zero-mean multivariate Gaussian with covariance matrix Σ. In this case, x can also be regarded as a linear transformation of a vector s containing independent unit-variance Gaussian components: x = Σ^{1/2} s. In fact, the Gaussian densities are the only densities that are both elliptically symmetric and linearly decomposable into independent components [11]. In other words, they correspond to the intersection of the class of ESDs and the class assumed by the ICA methods, i.e., the linearly transformed factorial densities. As a special case, a spherical Gaussian is the only density that is both spherically symmetric and factorial (i.e., with independent components). These relationships are illustrated in the Venn diagram of Fig. 1.

Apart from the special case of Gaussians, linear transformations such as PCA or ICA cannot completely eliminate dependencies in ESDs. PCA is only effective in removing second-order dependencies, and thus has no effect on the higher-order dependencies of a non-Gaussian ESD. ICA can be viewed as a cascade of a whitening operation and an orthogonal linear transform V. The second-order dependencies in an ESD are removed by the whitening operation, with the resulting variable x_wht having an SSD. After that, however, no orthogonal matrix V can further remove dependencies in x_wht. The reason is simple: p(x_wht) is spherically symmetric (it is a function only of the vector norm ||x|| = √(x^T x)), which is invariant under orthogonal linear transformations; thus applying any V will leave p(x_wht) unchanged.

Fig. 2. Radial Gaussianization procedure for 2D data. (a,e): 2D joint densities of a spherical Gaussian and a non-Gaussian SSD. The plots are arranged such that a spherical Gaussian has equally spaced contours. (b,f): radial marginal densities of the Gaussian in (a) and the SSD in (e). Shaded regions correspond to shaded annuli in (a) and (e), respectively. (c): the radial map g(r) obtained through the RG transform. (d): marginal densities, in the log domain, of the Gaussian in (a) and the SSD in (e), as red dashed and blue solid lines, respectively. See text.

3 Radial Gaussianization

Given that linear transforms are ineffective in removing dependencies from a spherically symmetric x_wht (and hence the original ESD variable x), we need to consider nonlinear mappings. As described previously, a spherical Gaussian is the only SSD with independent components. Thus, given a non-Gaussian SSD for x_wht, a natural solution for eliminating dependencies is to transform it to a spherical Gaussian. Selecting such a nonlinear mapping without any further constraint is a highly ill-posed problem. However, in this case it is natural to restrict to nonlinear mappings that act radially, preserving the spherical symmetry. Specifically, one can show that the generating function of p(x_wht) is completely determined by its radial marginal distribution:

    p_r(r) = (2 π^{d/2} / (α Γ(d/2))) r^{d−1} f(−r²/2),

where r = ||x_wht||, Γ(·) is the standard Gamma function, and α is the normalizing constant that ensures the density integrates to one. Thus, a radial transform that rescales the norm of each vector x_wht to match the radial marginal distribution of a spherical Gaussian of unit variance — which is a chi density with d degrees of freedom,

    p_χ(r) = (r^{d−1} / (2^{d/2−1} Γ(d/2))) exp(−r²/2)

— suffices to transform x_wht into a spherical Gaussian variable, and thus completely eliminates its statistical dependencies. We can therefore define the radial Gaussianization (RG) transformation as

    x_rg = g(||x_wht||) · x_wht / ||x_wht||,

where the nonlinear function g(·) matches the radial marginal density of x_wht to that of a spherical Gaussian.
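Since the transform acts only on vector norms, it reduces in practice to a one-dimensional mapping of radii. The sketch below implements it with numpy/scipy (the function name and the synthetic test source are our own, for illustration): the empirical CDF of the radii is mapped through the inverse CDF of the chi distribution with d degrees of freedom.

```python
import numpy as np
from scipy.stats import chi

def radial_gaussianize(x_wht):
    """Rescale each column of x_wht (d x n) so the radial marginal is chi_d.

    Implements g(r) = F_chi^{-1}(F_r(r)), with F_r the empirical CDF of the radii.
    """
    d, n = x_wht.shape
    r = np.linalg.norm(x_wht, axis=0)
    # Empirical CDF via mid-ranks (avoids F_r(r) hitting exactly 0 or 1).
    F_r = np.empty(n)
    F_r[np.argsort(r)] = (np.arange(n) + 0.5) / n
    r_new = chi.ppf(F_r, df=d)           # inverse chi CDF, d degrees of freedom
    return x_wht * (r_new / r)

rng = np.random.default_rng(1)
d, n = 2, 50_000
# Non-Gaussian SSD: a Gaussian scale mixture with a shared random scale.
x_wht = np.sqrt(rng.exponential(size=n)) * rng.standard_normal((d, n))
x_rg = radial_gaussianize(x_wht)

# After RG the squared radii follow a chi-squared_d law, so their mean is ~d.
print(np.mean(np.sum(x_rg ** 2, axis=0)))  # ~2.0
```

In applications the mapping would be estimated from training data and stored as an interpolated look-up table, as described next; the rank-based construction here is the same idea applied directly to the sample being transformed.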
Selecting g(·) is a one-dimensional density-mapping problem, and the classical solution (commonly known as "histogram equalization") is the continuous monotonic mapping given by composing the inverse cumulative distribution function (CDF) of p_χ with the CDF of p_r:

    g(r) = F_χ^{−1}(F_r(r)).

A graphical illustration of the procedure is provided in Fig. 2. In practice, we estimate F_r(r) and F_χ(r) using histograms collected from training data, and from the estimates F̂_χ(r) and F̂_r(r) a look-up table can be constructed, with proper interpolation, as ĝ(r) = F̂_χ^{−1}(F̂_r(r)), to approximate the continuous function g(r). It is also possible, though not necessary, to fit ĝ with piecewise smooth functions (e.g., splines). Unlike PCA or ICA, in which data requirements grow with the dimensionality of the space (since errors in eigenvector estimates scale with the ratio of the dimensionality to the number of data samples), estimation of the RG transformation depends only on the number of data samples (since it effectively operates on a one-dimensional space).

4 Application to Natural Signals

An understanding of the statistical behavior of source signals is beneficial for many problems in signal processing, and can also provide insights into the design and functionality of biological sensory systems. Gaussian signal models are widely used, because they are easily characterized and often lead to clean and efficient solutions. But many naturally occurring signals exhibit striking non-Gaussian statistics, and much recent literature focuses on the problem of characterizing and exploiting these behaviors. Specifically, ICA methodologies have been used to derive linear representations for sound and image signals whose coefficients are maximally sparse or independent [3, 5, 6]. These analyses generally produced basis sets containing bandpass filters resembling those used to model the early auditory or visual systems.

Fig. 3. (a,c): Contour plots of joint histograms of pairs of bandpass filter responses of a natural sound clip, at temporal separations of 0.1 msec (4 samples), 2.0 msec (84 samples), and 3.5 msec (154 samples). Each row corresponds to pairs with a different temporal separation, and levels are chosen so that a spherical Gaussian density will have equally spaced contours. (a) Original responses. (c) Responses after whitening and the RG transformation. (b,d): Conditional histograms of the same data shown in (a,c), computed by independently normalizing each column of the joint histogram. Image intensities are proportional to probability, except that each column of pixels is independently rescaled so that the largest probability value is displayed as white.

Despite the success of ICA methods in providing a fundamental motivation for sensory receptive fields, there are a number of simple observations that indicate inconsistencies in the interpretation. First, the responses of ICA or other bandpass filters exhibit striking dependencies, in which the variance of one filter response can be predicted from the amplitude of another nearby filter response [12, 13]. This suggests that although the histograms of individual filter responses are heavy-tailed, the joint histograms of pairs of responses are not consistent with the factorial source model assumed by ICA. Furthermore, the marginal distributions of a wide variety of bandpass filters (even a "filter" with randomly selected zero-mean weights) are all highly kurtotic [14]. This would not be expected under the ICA source model: projecting the local data onto a random direction should result in a density that becomes more Gaussian as the neighborhood size increases, in accordance with a generalized version of the Central Limit Theorem [15].
A recent quantitative study [7] further showed that the oriented bandpass filters obtained through ICA optimization on images lead to a surprisingly small improvement in reduction of multi-information (MI) relative to decorrelation methods such as PCA. Taken together, these observations suggest that the filters obtained through ICA optimization represent a "shallow" optimum, and are perhaps not as uniquely suited for image or sound representation as initially believed. Consistent with this, recently developed models for local image statistics suggest that the statistics of local groups of image bandpass filter responses may be modeled with non-Gaussian elliptically symmetric densities [e.g., 14, 16, 17, 18, 19, 20]. This suggests that RG might provide an appropriate means of eliminating dependencies in these responses. Below, we test this empirically.

4.1 MI Reduction in Natural Sounds

We first apply RG to bandpass coefficients of natural sound signals. We used natural sound clips from commercial CDs, which have a sampling frequency of 44100 Hz and typical lengths of 15–20 seconds. We used the so-called "gammatone" bandpass filters that are commonly used to model the peripheral auditory system [21]. Shown in the top row of panel (a) in Fig. 3 are contour plots of the joint histograms obtained from pairs of coefficients of a bandpass-filtered natural sound, separated by different time intervals.

Fig. 4. Left: Multi-information (in bits) for pairs of bandpass filter responses, as a function of temporal separation. Shown are the MI of the raw filter response pairs (solid blue curve), as well as the MI of the pairs transformed with PCA, ICA, and RG. Results are averaged over 10 natural sound signals. Right: the same data, re-plotted as a proportion of the MI of the raw bandpass-filtered signals.
Similar to empirical observations for natural images [14, 17], the joint densities are non-Gaussian, and have roughly elliptical contours for temporally proximal pairs. Shown in the top row of panel (b) in Fig. 3 are the conditional histograms corresponding to the same pairs of signals. The "bow-tie" shaped conditional distribution, which has been previously observed in natural images [9] and sounds [13], indicates that the conditional variance of one signal depends on the value of the other. This is a highly non-Gaussian behavior, since the conditional variances of a jointly Gaussian density are always constant, independent of the value of the conditioning variable. For pairs that are distant, both the second-order correlation and the higher-order dependency become weaker. As a result, the corresponding joint histograms show more resemblance to the factorial product of two one-dimensional super-Gaussian densities (bottom row of panel (a) in Fig. 3), and the shape of the corresponding conditional histograms (panel (b)) is more constant, all as would be expected for two independent random variables.

As described in previous sections, the statistical dependencies in an elliptically symmetric random variable can be effectively removed by a linear whitening operation followed by a nonlinear radial Gaussianization, the latter implemented as a histogram transform of the radial marginal density of the whitened signal. Shown in panels (c) and (d) of Fig. 3 are the joint and conditional histograms of the transformed data. First, note that when the two signals are adjacent, RG is highly effective, as suggested by the roughly Gaussian joint density (equally spaced circular contours) and by the removal of dependencies in the conditional histogram. However, as the temporal separation between the two signals increases, the effects of RG become weaker (middle row, Fig. 3).
When the two signals are close to independent (bottom row, Fig. 3), applying RG can actually increase dependency, as suggested by the irregular shape of the conditional densities (bottom row, (d)).

To quantify more precisely the dependency reduction achieved by RG, we measure the statistical dependency of our multivariate sources using the multi-information (MI) [22], which is defined as the Kullback-Leibler divergence [23] between the joint distribution and the product of its marginals:

    I(x) = D_KL( p(x) || ∏_{k=1}^d p(x_k) ) = ∑_{k=1}^d H(x_k) − H(x),

where H(x) = −∫ p(x) log(p(x)) dx is the differential entropy of x, and H(x_k) denotes the differential entropy of the k-th component of x. As a measure of statistical dependency among the elements of x, MI is non-negative, and is zero if and only if the components of x are mutually independent. Furthermore, MI is invariant to any operation that acts on individual components of x (e.g., element-wise rescaling). To compare the dependency reduction achieved by different transforms, we estimated the MIs of pairs of temporally shifted bandpass filter responses of natural sounds, using a non-parametric "binless" method based on order statistics [24], which alleviates the strong bias and variance intrinsic to the more traditional binning (i.e., "plug-in") estimators. It is especially effective in this case, where the data dimensionality is two. We computed the MI for each pair of raw signals, as well as for the PCA-, ICA- and RG-transformed (after whitening) signals. The ICA transformation was obtained using RADICAL [25], an algorithm that directly optimizes the MI using a smoothed grid search over a non-parametric estimate of entropy. Analysis was performed on 10 different 44100 Hz natural sound clips, with lengths ranging from 15 to 20 seconds and contents including animal vocalizations, human speech, and recordings in natural environments. These sound clips were filtered with a gammatone filter with center frequency of 3078 Hz.

Fig. 5. Reduction of MI (bits/pixel) achieved with the ICA and RG transforms, compared to that achieved with PCA, for pixel blocks of size 3×3, 7×7, and 15×15. The x-axis corresponds to ΔI_pca. Pluses denote ΔI_rg, and circles denote ΔI_ica. Each plotted symbol corresponds to the result from one image in our test set.

The results, averaged over all 10 sounds, are plotted in Fig. 4. First, we note that PCA produces a relatively modest reduction in MI: roughly 20% for small separations, decreasing gradually for larger separations. More surprisingly, ICA offers no additional reduction for small separations, and only a modest improvement for separations between 0.2 and 1.0 msec. This is consistent with the histograms shown in Fig. 3, which suggest that the joint densities of adjacent pairs have roughly elliptical contours. As such, we should not expect ICA to provide much improvement beyond what is obtained with PCA. The behavior for distant pairs is also consistent with the results shown in Fig. 3. These densities are roughly factorial, and thus require no further transformation to reduce MI, so ICA again provides very little reduction, as seen in the plots of Fig. 4 for separations beyond 2 msec. The behavior at intermediate separations is likely due to the fact that during the transition from a spherically symmetric density to a more factorial density, there is a range where ICA can produce a reduction in MI (e.g., second row in Fig. 3). In comparison to PCA and ICA, the nonlinear RG transformation achieves an impressive reduction (nearly 100%) in MI for pairs separated by less than 0.5 msec.
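This near-complete MI reduction can be reproduced on synthetic elliptical data. The sketch below uses a crude plug-in histogram MI estimator (the paper instead uses the binless estimator of [24]), applied to a hypothetical Gaussian scale mixture standing in for a whitened pair of filter responses:

```python
import numpy as np
from scipy.stats import chi

def mi_bits(x, y, bins=40, lim=5.0):
    """Plug-in histogram estimate of mutual information (bits) for a 2D sample."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins, range=[[-lim, lim], [-lim, lim]])
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(2)
n = 200_000
# Whitened but dependent pair: spherically symmetric Gaussian scale mixture.
x = np.sqrt(rng.exponential(size=n)) * rng.standard_normal((2, n))

# Radial Gaussianization: map the empirical radial CDF onto chi_2 quantiles.
r = np.linalg.norm(x, axis=0)
q = np.empty(n)
q[np.argsort(r)] = (np.arange(n) + 0.5) / n
x_rg = x * (chi.ppf(q, df=2) / r)

mi_raw, mi_rg = mi_bits(x[0], x[1]), mi_bits(x_rg[0], x_rg[1])
print(mi_raw, mi_rg)  # MI after RG drops to a small fraction of the raw MI
```

The small residual MI after RG is essentially the bias of the plug-in estimator, which is why the paper relies on the binless method for its quantitative comparisons.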
However, as the temporal separation increases, especially beyond 2.5 msec, the joint densities are closer to factorial, and RG can actually make the pairs more dependent, as indicated by an increase in MI above that of the original pairs. Similar results were also obtained for pairs of bandpass filter responses of natural images, and are provided in a recently submitted longer version of this paper [26].

4.2 MI Reduction in Natural Images

We have also examined the effects of RG in reducing dependencies within patches of natural images. The image sets are the central 1024 × 1024 crops of eight images from the van Hateren database [27], which have linearized intensity values and contents of natural scenes such as woods and greenery. Each image was first subjected to a log nonlinearity, log I(x) − C_0, as in [7], where C_0 is a constant chosen so that the adjusted log intensities of an image have zero mean. In practice, experimental results obtained with the log intensity were quite similar to those obtained with the linear intensity. After the log transform, we removed the local mean by convolving with an isotropic bandpass filter that captures an annulus of frequencies in the Fourier domain ranging from π/4 to π radians/pixel. We collected patches of various sizes from the bandpass-filtered image as our basic data set, denoted x_raw. For comparison, we also transformed these data with PCA (x_pca), ICA (x_ica) and RG (x_rg, preceded by a whitening operation), and then compared their MIs. The ICA transformation was implemented using FastICA [28] with contrast function g(u) = 1 − exp(−u²) and the symmetric approach for optimization. However, direct estimation of MI for pixel blocks becomes increasingly difficult (and less accurate) as the block size grows. This problem may be partially alleviated by evaluating and comparing differences in MI between different transforms.
Specifically, we use ΔI_pca = I(x_raw) − I(x_pca) as a reference value, and compare it with ΔI_ica = I(x_raw) − I(x_ica) and ΔI_rg = I(x_raw) − I(x_rg). Details of this computation are provided in the longer version of this paper [26].

Shown in Fig. 5 are scatter plots of ΔI_pca versus ΔI_ica (red circles) and ΔI_rg (blue pluses) for various block sizes. Each point corresponds to the MI computation over blocks from one of the eight bandpass-filtered test images. As the figure shows, RG achieves a significant reduction in MI for many images, which seems to hold over a range of block sizes, whereas ICA shows only a very small improvement relative to PCA. This again suggests that the ICA algorithm does not offer much advantage over second-order decorrelation algorithms such as PCA, which may be attributed to the fact that the joint densities of local pixel blocks tend to be close to elliptically symmetric [8, 9]. Similar results for the comparison of ICA to PCA were obtained with a slightly different method of removing the mean values of each block [7].

5 Conclusion

We have introduced a new form of statistically adaptive signal transformation known as radial Gaussianization (RG), which can remove dependencies of sources with elliptically symmetric densities (ESDs). The RG methodology is complementary to that of ICA, which is effective in removing statistical dependencies for linear transformations of independent sources, but ineffective for ESDs. An important aspect of our development of this methodology is the emphasis on source models. The RG transformation may be applied to data from any source, but it is only guaranteed to produce independent responses when the source is elliptically symmetric. We have shown that the RG transform is highly effective at removing dependencies between pairs of samples in bandpass-filtered sounds, and within local blocks of bandpass-filtered images.
The reductions in multi-information are much more substantial than those obtained with ICA, which are, in turn, only slightly better than those of PCA.

Several recent works in the literature are particularly relevant to RG. An iterative scheme that alternates between linear ICA transformations and nonlinear histogram matching, transforming the marginal density of each component to a Gaussian (known as iterative Gaussianization), was proposed in [29]. Although guaranteed to transform any source model into a spherical Gaussian, the overall transformation in iterative Gaussianization is difficult to interpret and will generally result in a substantial distortion of the original source space. This is especially true for ESDs, where RG is a one-step procedure while iterative Gaussianization usually takes many steps. Another nonlinear transform that has been shown to reduce higher-order dependencies in natural signals is divisive normalization [30, 13]. In the extended version of this paper [26], we show that, though not optimal, divisive normalization provides a close approximation to RG for elliptically symmetric sources. Thus, RG provides a more principled justification, in terms of a specific source model, for the previous empirical observations on divisive normalization.

There are a number of extensions of RG that are worth considering in the context of signal representation. First, we find that RG substantially reduces dependency for nearby coefficients, but that performance worsens as the coefficients become more distant. This is true for both sound and image signals. A natural means of extending a local statistical model is through Markov random fields, and recent models have begun to explore this direction [31, 32].
Second, the RG methodology provides a solution to the efficient coding problem for ESD signals in the noise-free case, and it is worthwhile to consider how the solution would be affected by the incorporation of sensor and/or channel noise into the problem. Third, we are interested in specific sub-families of ESDs for which the nonlinear mapping of signal amplitudes in RG may be expressed in closed form. And finally, we are currently examining the statistics of signals after local RG transformations, with the expectation that remaining statistical regularities (e.g., orientation and phase dependencies in images) can be studied, modeled and removed with additional transformations.

References

[1] F. Attneave. Some informational aspects of visual perception. Psych. Rev., 61:183–193, 1954.
[2] H. B. Barlow. Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith, editor, Sensory Communication, pages 217–234. MIT Press, Cambridge, MA, 1961.
[3] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.
[4] A. van der Schaaf and J. H. van Hateren. Modelling the power spectra of natural images: Statistics and information. Vision Research, 36(17):2759–2770, 1996.
[5] A. J. Bell and T. J. Sejnowski. The 'independent components' of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.
[6] M. S. Lewicki. Efficient coding of natural sounds. Nature Neuroscience, 5(4):356–363, 2002.
[7] M. Bethge. Factorial coding of natural images: how effective are linear models in removing higher-order dependencies? J. Opt. Soc. Am. A, 23(6):1253–1268, 2006.
[8] B. Wegmann and C. Zetzsche. Statistical dependence between orientation filter outputs used in a human-vision-based image code. In Proc. Visual Comm. and Image Processing, volume 1360, pages 909–922, Lausanne, Switzerland, 1990.
[9] E. P. Simoncelli.
Statistical models for images: Compression, restoration and synthesis. In Proc. 31st Asilomar Conf. on Signals, Systems and Computers, volume 1, pages 673–678, Pacific Grove, CA, November 2–5, 1997. IEEE Computer Society.
[10] K. T. Fang, S. Kotz, and K. W. Ng. Symmetric Multivariate and Related Distributions. Chapman and Hall, London, 1990.
[11] D. Nash and M. S. Klamkin. A spherical characterization of the normal distribution. Journal of Multivariate Analysis, 55:56–158, 1976.
[12] E. P. Simoncelli and R. W. Buccigrossi. Embedded wavelet image compression based on a joint probability model. In Proc. 4th IEEE Int'l Conf. on Image Proc., volume I, pages 640–643, Santa Barbara, October 26–29, 1997. IEEE Sig. Proc. Society.
[13] O. Schwartz and E. P. Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, August 2001.
[14] C. Zetzsche and G. Krieger. The atoms of vision: Cartesian or polar? J. Opt. Soc. Am. A, 16(7), July 1999.
[15] W. Feller. An Introduction to Probability Theory and Its Applications, volume 1. Wiley, January 1968.
[16] J. Huang and D. Mumford. Statistics of natural images and models. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 1999.
[17] M. J. Wainwright and E. P. Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems (NIPS*99), volume 12, pages 855–861, Cambridge, MA, May 2000. MIT Press.
[18] L. Parra, C. Spence, and P. Sajda. Higher-order statistical properties arising from the non-stationarity of natural signals. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Adv. Neural Information Processing Systems (NIPS*00), volume 13, pages 786–792, Cambridge, MA, May 2001. MIT Press.
[19] A. Hyvärinen, P. O. Hoyer, and M. Inki. Topographic ICA as a model of natural image statistics. In First IEEE Int'l Workshop on Bio. Motivated Comp. Vis., London, UK, 2000.
[20] A. Srivastava, X. Liu, and U. Grenander. Universal analytical forms for modeling image probability. IEEE Trans. Pat. Anal. Mach. Intell., 24(9):1200–1214, Sep 2002.
[21] P. I. M. Johannesma. The pre-response stimulus ensemble of neurons in the cochlear nucleus. In Symposium on Hearing Theory (IPO), pages 58–69, Eindhoven, Holland, 1972.
[22] M. Studeny and J. Vejnarova. The multiinformation function as a tool for measuring stochastic dependence. In M. I. Jordan, editor, Learning in Graphical Models, pages 261–297. Kluwer, Dordrecht, 1998.
[23] T. Cover and J. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006.
[24] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Phys. Rev. E, 69(6):66–82, Jun 2004.
[25] E. G. Learned-Miller and J. W. Fisher. ICA using spacings estimates of entropy. Journal of Machine Learning Research, 4(1):1271–1295, 2000.
[26] S. Lyu and E. P. Simoncelli. Nonlinear extraction of "independent components" of elliptically symmetric densities using radial Gaussianization. Technical Report TR2008-911, Courant Inst. of Mathematical Sciences, New York University, April 2008.
[27] J. H. van Hateren and A. van der Schaaf. Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. R. Soc. Lond. B, 265:359–366, 1998.
[28] A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.
[29] S. S. Chen and R. A. Gopinath. Gaussianization. In Advances in Neural Information Processing Systems (NIPS), pages 423–429, 2000.
[30] D. J. Heeger. Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9:181–198, 1992.
[31] S. Roth and M. Black. Fields of experts: A framework for learning image priors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 860–867, 2005.
[32] S. Lyu and E. P. Simoncelli.
Statistical modeling of images with fields of Gaussian scale mixtures. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Adv. Neural Information Processing Systems, volume 19, Cambridge, MA, May 2007. MIT Press.