Learning Horizontal Connections in a Sparse Coding Model of Natural Images

Anonymous Author(s)
Affiliation
Address
email

Abstract

It has been shown that adapting a dictionary of basis functions to the statistics of natural images so as to maximize sparsity in the coefficients results in a set of dictionary elements whose spatial properties resemble those of V1 (primary visual cortex) receptive fields. However, the resulting sparse coefficients still exhibit pronounced statistical dependencies, thus violating the independence assumption of the sparse coding model. Here, we propose a model that attempts to capture the dependencies among the basis function coefficients by including a pairwise coupling term in the prior over the coefficient activity states. When adapted to the statistics of natural images, the coupling terms learn a combination of facilitatory and inhibitory interactions among neighboring basis functions. These learned interactions may offer an explanation for the function of horizontal connections in V1, and we discuss the implications of our findings for physiological experiments.

1 Introduction

Over the last decade, mathematical explorations into the statistics of natural scenes have led to the observation that these scenes, as complex and varied as they appear, have an underlying structure that is sparse [1]. That is, one can learn a possibly overcomplete basis set such that only a small fraction of the basis functions is necessary to describe a given image, where the operation to infer this sparse representation is non-linear. This approach is known as sparse coding. Exploiting this structure has led to advances in our understanding of how information is represented in the visual cortex, since the learned basis set is a collection of oriented Gabor filters that resemble the receptive fields in primary visual cortex (V1).

The approach of using sparse coding to infer sparse representations of unlabeled data is useful for classification, as shown in the framework of self-taught learning [2]. Note that classification performance relies on finding "hard-sparse" representations in which a few coefficients are nonzero while all the others are exactly zero.

An assumption of the sparse coding model is that the coefficients of the representation are independent. In the case of natural images, however, this assumption does not hold. For example, the coefficients corresponding to quadrature-pair or colinear Gabor filters are not independent. These dependencies have been shown and modeled in the early work of [3], in the responses of model complex cells [4], in the feedforward responses of wavelet coefficients [5, 6, 7], and in basis functions learned using independent component analysis [8, 9]. The dependencies are informative, and exploiting them leads to improvements in denoising performance [5, 7].

We develop here a generative model of image patches that does not make the independence assumption. The prior over the coefficients is a mixture of a Gaussian when the corresponding basis function is active and a delta function at zero when it is silent, as in [10]. We model the binary variables or "spins" that control the activation of the basis functions with an Ising model, whose coupling weights model the dependencies among the basis functions. The representations inferred by this model are also "hard-sparse", which is a desirable feature [2].

Our model is motivated in part by the architecture of the visual cortex, namely the extensive network of horizontal connections among neurons in V1 [11].
It has been hypothesized that these connections facilitate contour integration [12] and that they allow the computation of border ownership [13]. In both of these models the connections are set a priori based on geometrical properties of the receptive fields. We propose here to learn the connection weights in an unsupervised fashion. With our model we hope to gain insight into the computations performed by this extensive collateral system and to compare our findings with known physiological properties of these horizontal connections. Furthermore, a recent trend in neuroscience is to model networks of neurons using Ising models, which have been shown to predict remarkably well the statistics of groups of neurons in the retina [14]. Our model gives a prediction for what is expected if one fits an Ising model to future multi-unit recordings in V1.

2 A non-factorial sparse coding model

Let $x \in \mathbb{R}^n$ be an image patch, where the $x_i$'s are the pixel values. We propose the following generative model:
\[
x = \Phi a + \nu = \sum_{i=1}^{m} a_i \varphi_i + \nu,
\]
where $\Phi = [\varphi_1 \ldots \varphi_m] \in \mathbb{R}^{n \times m}$ is an overcomplete transform or basis set, and the columns $\varphi_i$ are its basis functions. $\nu \sim \mathcal{N}(0, \sigma^2 I_n)$ is small Gaussian noise. Each coefficient $a_i = \frac{s_i + 1}{2} u_i$ is a Gaussian scale mixture (GSM). We model the multiplier $s$ with an Ising model, i.e. $s \in \{-1, 1\}^m$ has a Boltzmann-Gibbs distribution $p(s) = \frac{1}{Z} e^{\frac{1}{2} s^T W s + b^T s}$, where $Z$ is the normalization constant. If the spin $s_i$ is down ($s_i = -1$), then $a_i = 0$ and the basis function $\varphi_i$ is silent. If the spin $s_i$ is up ($s_i = 1$), then the basis function is active and the analog value of the coefficient $a_i$ is drawn from a Gaussian distribution with $u_i \sim \mathcal{N}(0, \sigma_i^2)$. The prior on $a$ can thus be described as a "hard-sparse" prior, as it is a mixture of a point mass at zero and a Gaussian.

The corresponding graphical model is shown in Figure 1. It is a chain graph since it contains both undirected and directed edges. It bears similarities to [15], which however does not have the intermediate layer $a$ and is not a sparse coding model. To sample from this generative model, one first obtains a sample $s$ from the Ising model, then samples coefficients $a$ according to $p(a \mid s)$, and then $x$ according to $p(x \mid a) = \mathcal{N}(\Phi a, \sigma^2 I_n)$.

Figure 1: Proposed graphical model

The parameters of the model to be learned from data are $\theta = (\Phi, (\sigma_i)_{i=1..m}, W, b)$. This model does not make any assumption about which linear code should be used, or about which units should exhibit dependencies. The matrix $W$ of the interaction weights in the Ising model describes these dependencies: $W_{ij} > 0$ favors positive correlations and thus corresponds to an excitatory connection, whereas $W_{ij} < 0$ corresponds to an inhibitory connection. A local magnetic field $b_i < 0$ favors the spin $s_i$ to be down, which in turn makes the basis function $\varphi_i$ mostly silent.

3 Inference and learning

3.1 Coefficient estimation

We describe here how to infer the representation $a$ of an image patch $x$ in our model. To do so, we first compute the maximum a posteriori (MAP) multiplier $\hat{s}$ (see Section 3.2). Indeed, a GSM model reduces to a linear-Gaussian model conditioned on the multiplier $s$, and therefore the estimation of $a$ is easy once $s$ is known.

Given $\hat{s} = s$, let $\Gamma = \{i : s_i = 1\}$ be the set of active basis functions. We know that $\forall i \notin \Gamma$, $\hat{a}_i = 0$. Hence, we have $x = \Phi_\Gamma a_\Gamma + \nu$, where $a_\Gamma = (a_i)_{i \in \Gamma}$ and $\Phi_\Gamma = [(\varphi_i)_{i \in \Gamma}]$. The model thus reduces to linear-Gaussian, where $a_\Gamma \sim \mathcal{N}(0, H = \mathrm{diag}((\sigma_i^2)_{i \in \Gamma}))$. We have $a_\Gamma \mid x, s \sim \mathcal{N}(\mu, K)$, where $K = (\sigma^{-2} \Phi_\Gamma^T \Phi_\Gamma + H^{-1})^{-1}$ and $\mu = \sigma^{-2} K \Phi_\Gamma^T x$. Hence, conditioned on $x$ and $s$, the Bayes least-squares (BLS) and maximum a posteriori (MAP) estimators of $a$ coincide and are given by $\mu$.
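As a concrete illustration of the conditional estimate above, the following is a minimal NumPy sketch that computes $\mu$ and $K$ under the definitions of this section. It assumes the active set is already known; the function and variable names (`map_coefficients`, `sigma_noise`, `sigma_u`) are our own and not from the paper.

```python
import numpy as np

def map_coefficients(x, Phi, s, sigma_noise, sigma_u):
    """MAP/BLS estimate of the coefficients a given the spins s.

    x           : (n,) image patch
    Phi         : (n, m) basis set
    s           : (m,) spins in {-1, +1}
    sigma_noise : scalar noise standard deviation
    sigma_u     : (m,) per-coefficient standard deviations sigma_i
    Returns the full (m,) coefficient vector, zero on inactive units.
    """
    a = np.zeros(Phi.shape[1])
    active = np.flatnonzero(s == 1)          # Gamma = {i : s_i = 1}
    if active.size == 0:
        return a
    Phi_g = Phi[:, active]
    H_inv = np.diag(1.0 / sigma_u[active] ** 2)
    # K = (sigma^-2 Phi_G^T Phi_G + H^-1)^-1,  mu = sigma^-2 K Phi_G^T x
    K = np.linalg.inv(Phi_g.T @ Phi_g / sigma_noise**2 + H_inv)
    mu = K @ Phi_g.T @ x / sigma_noise**2
    a[active] = mu
    return a
```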
3.2 Multiplier estimation

The MAP estimate of $s$ given $x$ is $\hat{s} = \arg\max_s p(s \mid x)$. Given $s$, $x$ has a Gaussian distribution $\mathcal{N}(0, \Lambda)$, where $\Lambda = \sigma^2 I_n + \sum_{i : s_i = 1} \sigma_i^2 \varphi_i \varphi_i^T$. Using Bayes' rule, we can write $p(s \mid x) \propto p(x \mid s)\, p(s) \propto e^{-E_x(s)}$, where
\[
E_x(s) = \frac{1}{2} x^T \Lambda^{-1} x + \frac{1}{2} \log\det\Lambda - \frac{1}{2} s^T W s - b^T s.
\]
We can thus compute the MAP estimate using Gibbs sampling and simulated annealing. In the Gibbs sampling procedure, the probability that node $i$ changes its value from $s_i$ to $\bar{s}_i$, given $x$, all the other nodes $s_{\neg i}$, and temperature $T$, is given by
\[
p(s_i \to \bar{s}_i \mid s_{\neg i}, x) = \left(1 + \exp\!\left(\frac{\Delta E_x}{T}\right)\right)^{-1},
\]
where $\Delta E_x = E_x(\bar{s}_i, s_{\neg i}) - E_x(s_i, s_{\neg i})$. Note that computing $\Delta E_x$ requires the inverse and the determinant of $\Lambda$, which is expensive. Let $\bar{\Lambda}$ and $\Lambda$ be the covariance matrices corresponding to the proposed state $(\bar{s}_i, s_{\neg i})$ and the current state $(s_i, s_{\neg i})$ respectively. They differ only by a rank-1 matrix, i.e. $\bar{\Lambda} = \Lambda + \alpha \varphi_i \varphi_i^T$, where $\alpha = \frac{1}{2}(\bar{s}_i - s_i)\,\sigma_i^2$. Therefore, to compute $\Delta E_x$ we can take advantage of the Sherman-Morrison formula
\[
\bar{\Lambda}^{-1} = \Lambda^{-1} - \alpha\, \Lambda^{-1} \varphi_i \left(1 + \alpha\, \varphi_i^T \Lambda^{-1} \varphi_i\right)^{-1} \varphi_i^T \Lambda^{-1} \quad (1)
\]
and of a similar formula for the log-determinant term,
\[
\log\det\bar{\Lambda} = \log\det\Lambda + \log\!\left(1 + \alpha\, \varphi_i^T \Lambda^{-1} \varphi_i\right). \quad (2)
\]
Using (1) and (2), $\Delta E_x$ can be written as
\[
\Delta E_x = -\frac{\alpha}{2}\, \frac{\left(x^T \Lambda^{-1} \varphi_i\right)^2}{1 + \alpha\, \varphi_i^T \Lambda^{-1} \varphi_i} + \frac{1}{2} \log\!\left(1 + \alpha\, \varphi_i^T \Lambda^{-1} \varphi_i\right) + (s_i - \bar{s}_i)\left(\sum_{j \neq i} W_{ij} s_j + b_i\right).
\]
The transition probabilities can thus be computed efficiently, and if a new state is accepted we update $\Lambda$ and $\Lambda^{-1}$ using (1).
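To make the rank-one bookkeeping concrete, here is a minimal NumPy sketch of a single Gibbs sweep that follows the update equations above, maintaining $\Lambda^{-1}$ and $\log\det\Lambda$ rather than $\Lambda$ itself. It is a sketch under our reading of (1) and (2): the function and variable names are ours, and simulated annealing is reduced to a fixed temperature for brevity.

```python
import numpy as np

def gibbs_sweep(s, x, Phi, W, b, sigma_u, Lam_inv, logdet_Lam, T=1.0, rng=None):
    """One Gibbs sweep over the spins s in {-1, +1}^m at temperature T.

    Lam_inv, logdet_Lam : inverse and log-determinant of
        Lam = sigma^2 I + sum_{i: s_i=1} sigma_u[i]^2 phi_i phi_i^T,
        kept consistent with s via rank-one (Sherman-Morrison) updates.
    Returns the updated (s, Lam_inv, logdet_Lam).
    """
    rng = np.random.default_rng() if rng is None else rng
    for i in rng.permutation(s.size):
        phi = Phi[:, i]
        s_bar = -s[i]                                   # proposed spin flip
        alpha = 0.5 * (s_bar - s[i]) * sigma_u[i] ** 2  # rank-one coefficient
        v = Lam_inv @ phi                               # Lam^-1 phi_i
        c = 1.0 + alpha * (phi @ v)                     # 1 + alpha phi^T Lam^-1 phi
        # Energy difference between the proposed and current states
        dE = (-0.5 * alpha * (x @ v) ** 2 / c
              + 0.5 * np.log(c)
              + (s[i] - s_bar) * (W[i] @ s - W[i, i] * s[i] + b[i]))
        if rng.random() < 1.0 / (1.0 + np.exp(dE / T)):
            s[i] = s_bar
            # Sherman-Morrison updates of Lam^-1 and log det Lam
            Lam_inv -= alpha * np.outer(v, v) / c
            logdet_Lam += np.log(c)
    return s, Lam_inv, logdet_Lam
```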
3.3 Model estimation

Given a dataset $D = \{x^{(1)}, \ldots, x^{(N)}\}$ of image patches, we want to learn the parameters $\theta = (\Phi, \sigma, W, b)$ that offer the best explanation of the data. Let $\tilde{p}(x) = \frac{1}{N}\sum_{i=1}^{N} \delta(x - x^{(i)})$ be the empirical distribution. Since in our model the variables $a$ and $s$ are latent, we use a variational expectation-maximization algorithm [16] to optimize $\theta$, which amounts to maximizing a lower bound on the log-likelihood derived using Jensen's inequality:
\[
\log p(x \mid \theta) \;\ge\; \sum_{s} \int_{a} q(a, s \mid x) \log \frac{p(x, a, s \mid \theta)}{q(a, s \mid x)}\, da,
\]
where $q(a, s \mid x)$ is a probability distribution. We restrict ourselves to the family of point-mass distributions $Q = \{q(a, s \mid x) = \delta(a - \hat{a})\,\delta(s - \hat{s})\}$, and with this choice the lower bound on the log-likelihood of $D$ can be written as
\[
\mathcal{L}(\theta, q) = E_{\tilde{p}}[\log p(x, \hat{a}, \hat{s} \mid \theta)]
= \underbrace{E_{\tilde{p}}[\log p(x \mid \hat{a}, \Phi)]}_{\mathcal{L}_\Phi}
+ \underbrace{E_{\tilde{p}}[\log p(\hat{a} \mid \hat{s}, (\sigma_i)_{i=1..m})]}_{\mathcal{L}_\sigma}
+ \underbrace{E_{\tilde{p}}[\log p(\hat{s} \mid W, b)]}_{\mathcal{L}_{W,b}}. \quad (3)
\]
We perform coordinate ascent in the objective function $\mathcal{L}(\theta, q)$.

3.3.1 Maximization with respect to $q$

We want to solve $\max_{q \in Q} \mathcal{L}(\theta, q)$, which amounts to finding $\arg\max_{a,s} \log p(x, a, s)$ for every $x \in D$. This is computationally expensive since $s$ is discrete. Hence, we introduce two phases in the algorithm. In the first phase, we infer the coefficients in the usual sparse coding model, where the prior over $a$ is factorial, i.e. $p(a) \propto \prod_i \exp\{-\lambda S(a_i)\}$. In this setting, we have
\[
\hat{a} = \arg\max_{a}\; p(x \mid a) \prod_i e^{-\lambda S(a_i)} = \arg\min_{a}\; \frac{1}{2\sigma^2}\|x - \Phi a\|_2^2 + \lambda \sum_i S(a_i). \quad (4)
\]
With $S(a_i) = |a_i|$, (4) is known as basis pursuit denoising (BPDN), whose solution has been shown to be such that many coefficients of $\hat{a}$ are exactly zero [17]. This allows us to recover the sparsity pattern $\hat{s}$, where $\hat{s}_i = 2 \cdot \mathbf{1}[\hat{a}_i \neq 0] - 1$ for all $i$. BPDN can be solved efficiently using a competitive algorithm [18]. Another possible choice is $S(a_i) = \mathbf{1}[a_i \neq 0]$ (although $p(a_i)$ is then not a proper prior), in which case (4) is combinatorial and can be solved approximately using orthogonal matching pursuit (OMP) [19].

After several iterations of coordinate ascent and convergence of $\theta$ using the above approximation, we enter the second phase of the algorithm and refine $\theta$ by using the GSM inference described in Section 3.1, where $\hat{s} = \arg\max_s p(s \mid x)$ and $\hat{a} = E[a \mid \hat{s}, x]$.

3.3.2 Maximization with respect to $\theta$

We want to solve $\max_\theta \mathcal{L}(\theta, q)$. Our choice of variational posterior allowed us to write the objective function as the sum of the three terms $\mathcal{L}_\Phi$, $\mathcal{L}_\sigma$ and $\mathcal{L}_{W,b}$ (3), and hence to decouple the variables $\Phi$, $(\sigma_i)_{i=1..m}$ and $(W, b)$ of our optimization problem.

Maximization of $\mathcal{L}_\Phi$. Note that $\mathcal{L}_\Phi$ is the same objective function as in the standard sparse coding problem when the coefficients $a$ are fixed. Let $\{\hat{a}^{(i)}, \hat{s}^{(i)}\}$ be the coefficients and multipliers corresponding to $x^{(i)}$. We have
\[
\mathcal{L}_\Phi = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \|x^{(i)} - \Phi \hat{a}^{(i)}\|_2^2 - \frac{Nn}{2} \log 2\pi\sigma^2.
\]
We add the constraint $\|\varphi_i\|_2 \le 1$ to avoid the spurious solution where the norm of the basis functions grows while the coefficients tend to 0. We solve this $\ell_2$-constrained least-squares problem using the Lagrange dual as in [20].

Maximization of $\mathcal{L}_\sigma$. The problem of estimating $\sigma_i^2$ is a standard variance-estimation problem for a zero-mean Gaussian random variable, where we only consider the samples $\hat{a}_i^{(k)}$ such that the spin $\hat{s}_i^{(k)}$ equals 1, i.e.
\[
\sigma_i^2 = \frac{1}{\mathrm{card}\{k : \hat{s}_i^{(k)} = 1\}} \sum_{k :\, \hat{s}_i^{(k)} = 1} \left(\hat{a}_i^{(k)}\right)^2.
\]

Maximization of $\mathcal{L}_{W,b}$. This problem is tantamount to estimating the parameters of a fully visible Boltzmann machine [21], which is a convex optimization problem. We do gradient ascent in $\mathcal{L}_{W,b}$, where the gradients are given by $\frac{\partial \mathcal{L}_{W,b}}{\partial W_{ij}} = E_{\tilde{p}}[s_i s_j] - E_p[s_i s_j]$ and $\frac{\partial \mathcal{L}_{W,b}}{\partial b_i} = E_{\tilde{p}}[s_i] - E_p[s_i]$. We use Gibbs sampling to obtain estimates of the model expectations $E_p[s_i s_j]$ and $E_p[s_i]$.

Note that since computing the parameters $(\hat{a}, \hat{s})$ of the variational posterior in phase 1 depends only on $\Phi$, we first perform several steps of coordinate ascent in $(\Phi, q)$ until $\Phi$ has converged, which is the same as in the usual sparse coding algorithm. We then maximize $\mathcal{L}_\sigma$ and $\mathcal{L}_{W,b}$, and after that we enter the second phase of the algorithm.

4 Recovery of the model parameters

Although the learning algorithm relies on a method where the family of variational posteriors $q(a, s \mid x)$ is quite limited, we argue here that if the data $D = \{x^{(1)}, \ldots, x^{(N)}\}$ are sampled according to parameters $\theta_0$ that obey the conditions described below, then our proposed learning algorithm is able to recover $\theta_0$ with good accuracy using phase 1 only.

Let $\mu$ be the coherence parameter of the basis set, which equals the maximum absolute inner product between two distinct basis functions. It has been shown that, given a signal that is a sparse linear combination of $p$ basis functions, BP and OMP will identify the optimal basis functions and their coefficients provided that $p < \frac{1}{2}(\mu^{-1} + 1)$, and the sparsest representation of the signal is then unique [19]. Similar results can be derived when noise is present ($\sigma > 0$) [22], but we restrict ourselves to the noiseless case for simplicity. We require $(W_0, b_0)$ to be such that the number of up spins is, with probability close to 1, smaller than $\frac{1}{2}(\mu^{-1} + 1)$, which can be enforced by imposing strong negative biases. A data point $x^{(i)} \in D$ thus has, with high probability, a unique sparse representation in the basis set $\Phi_0$, and provided that we have a good estimate of $\Phi_0$ we can recover its sparse representation using OMP or BP, and therefore identify the $s^{(i)}$ that was used to originally sample $x^{(i)}$. That is, we recover with high probability all the samples from the Ising model used to generate $D$, which allows us to recover $(W_0, b_0)$.
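The argument above can be checked numerically. The sketch below is our own illustration (not code from the paper): it computes the coherence of a basis set, the corresponding sparsity bound $\frac{1}{2}(\mu^{-1}+1)$, and recovers the support of a noiseless sample with a minimal OMP implementation. The dictionary used here, an orthonormal basis plus one extra unit-norm atom, is chosen so that the bound is informative; it generalizes the illustration that follows.

```python
import numpy as np

def coherence(Phi):
    """Maximum absolute inner product between two distinct (unit-norm) basis functions."""
    G = Phi.T @ Phi
    np.fill_diagonal(G, 0.0)
    return np.abs(G).max()

def omp_support(x, Phi, n_active, tol=1e-10):
    """Minimal orthogonal matching pursuit: greedily select up to n_active atoms."""
    residual, support = x.copy(), []
    for _ in range(n_active):
        if np.linalg.norm(residual) < tol:
            break
        i = int(np.argmax(np.abs(Phi.T @ residual)))   # most correlated atom
        support.append(i)
        coeffs, *_ = np.linalg.lstsq(Phi[:, support], x, rcond=None)
        residual = x - Phi[:, support] @ coeffs
    return sorted(support)

# Orthonormal basis of R^n plus one extra normalized atom: coherence mu = 1/sqrt(n)
rng = np.random.default_rng(0)
n, p = 64, 3
Phi = np.hstack([np.eye(n), np.full((n, 1), 1.0 / np.sqrt(n))])
m = Phi.shape[1]
mu = coherence(Phi)
print("coherence:", mu, "recovery bound:", 0.5 * (1.0 / mu + 1.0))   # p = 3 is below the bound

true_support = sorted(rng.choice(m, size=p, replace=False).tolist())
a = np.zeros(m)
a[true_support] = rng.normal(size=p)
x = Phi @ a                                                          # noiseless sample
print("recovered support matches:", omp_support(x, Phi, p) == true_support)
```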
We provide for illustration a simple example of model recovery where $n = 7$ and $m = 8$. Let $(e_1, \ldots, e_7)$ be an orthonormal basis of $\mathbb{R}^7$. We let $\Phi_0 = [e_1, \ldots, e_7, \frac{1}{\sqrt{7}} \sum_i e_i]$. We fix the biases $b_0$ at $-1.2$ such that the model is sufficiently sparse, as shown by the histogram of $s$ in Figure 2, and the weights $W_0$ are sampled according to a Gaussian distribution. The variance parameters $\sigma_0$ are fixed to 1. We then generate synthetic data by sampling 100,000 data points from this model using $\theta_0$, and estimate $\theta$ from this synthetic data using the variational method described in Section 3, with OMP and phase 1 only. We find that the basis functions are recovered exactly (not shown), and that the parameters of the Ising model are recovered with high accuracy, as shown in Figure 2.

Figure 2: Recovery of the model

5 Results for natural images

We build our training set by randomly selecting 12 × 12 image patches from a standard set of ten 512 × 512 whitened images, as in [1]. It has been shown that changes of luminance or contrast have little influence on the structure of natural scenes [23]. As our goal is to uncover this structure, we subtract from each patch its own mean and divide it by its standard deviation so that our dataset is contrast normalized (we discard patches whose variance is below a small threshold). We fix the number of basis functions to 200. In the second phase of the algorithm we only update $W$, as we have found that the basis functions do not change dramatically after the first phase.
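For concreteness, here is a minimal sketch of the patch extraction and contrast normalization described above. The function name, the value of the variance threshold, and the image-loading step are our own assumptions, not specifications from the paper.

```python
import numpy as np

def extract_patches(images, n_patches, size=12, var_threshold=1e-3, rng=None):
    """Randomly crop patches from whitened images and contrast-normalize them.

    images : list of 2-D arrays (e.g. ten 512 x 512 whitened images)
    Returns an array of shape (n_patches, size * size).
    """
    rng = np.random.default_rng() if rng is None else rng
    patches = []
    while len(patches) < n_patches:
        img = images[rng.integers(len(images))]
        r = rng.integers(img.shape[0] - size)
        c = rng.integers(img.shape[1] - size)
        patch = img[r:r + size, c:c + size].astype(float).ravel()
        if patch.var() < var_threshold:      # skip nearly uniform patches
            continue
        patch -= patch.mean()                # remove the patch's own mean
        patch /= patch.std()                 # divide by its standard deviation
        patches.append(patch)
    return np.stack(patches)
```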
Figure 3 shows the learned parameters. The basis functions resemble Gabor filters at a variety of orientations, positions and scales.

Figure 3: Parameters of the model learned from natural images

It is hard to visualize the connections among the basis functions by displaying $W$ directly as in Figure 3. We therefore show in Figure 4 these connections according to the spatial properties (position, orientation, length) of the basis functions that they link together. Each basis function is denoted by a bar that matches its position, orientation, and length within the 12 × 12 patch.

Figure 4: a, Visualization of the weights of 169 basis functions. Each subplot shows the horizontal connections $(W_{ij})_{j \neq i}$ corresponding to the basis function $i$ displayed as the black bar. The basis functions $j$ are displayed in the same subplot, and the color of their corresponding bar is coded according to $W_{ij}$ (see b for an illustration), where positive (resp. negative) weights are red (resp. blue). c (resp. d), Ordered from left to right are the basis function pairs that share the strongest positive (resp. negative) weights.

We observe that the connections are mainly local and connect basis functions at a variety of orientations. We can see in Figure 4c,d that the 10 most "positive" pairs have similar orientations, and that there is only one of the 10 most "negative" pairs in which the orientations differ by $\frac{\pi}{2}$. In Figure 5c we show, for a basis function, the average number of basis functions sharing with it a weight larger than 0.03 in absolute value, as a function of their orientation difference in four bins. The error bars denote one standard deviation. We see that positive and negative weights have a similar profile, and that the shape is consistent with what has been observed in physiological experiments [24, 25].

The histogram of the weights (see Figure 5a) shows a long negative tail, and plotting them in Figure 5b against the elements of the Gram matrix $\Phi^T \Phi$ shows that they correspond to basis function pairs that are strongly correlated. This is due in part to "explaining away" in the variational inference using the Laplacian prior. In a control experiment where the horizontal connections are learned from synthetic data generated using a factorial prior, we found that the weights are very small compared to the weights learned from natural scenes, indicating that these inhibitory terms are derived from the statistics of natural images and are not explaining away per se.

We also show in Figure 5d the tradeoff curves between the signal-to-noise ratio (SNR) of an image patch $x$ and its reconstruction $\Phi\hat{a}$, and the $\ell_0$ norm of the representation $\|\hat{a}\|_0$. We consider $\hat{a}$ inferred using both the Laplacian prior and our proposed prior. We vary $\lambda$ (see Equation (4)) and the corresponding parameter of our model, respectively, and average over 1000 patches to obtain the two tradeoff curves. We see that at similar SNR the representations inferred by our model are more sparse, by about a factor of 2. We have also compared our prior on tasks such as denoising and filling-in, and have found its performance to be similar to the Laplacian prior even though it does not exploit the dependencies of the code. It is possible that the greater sparsity of our inferred representations could make them less robust to noise, and we are currently investigating whether this property may instead be advantageous in the self-taught learning setting by improving classification performance.

Figure 5: a, b, c, Properties of the weight matrix $W$. d, Comparison of the tradeoff curve between SNR and $\ell_0$ norm for a Laplacian prior over the coefficients and for our prior.

6 Discussion

In this paper, we proposed a new sparse coding model in which pairwise coupling terms among the coefficients capture their dependencies. We derived a new learning algorithm to adapt the parameters of the model given a dataset of natural images, and were able to discover the dependencies among the basis functions. We showed that the learned connection weights are consistent with physiological data. Furthermore, the representations inferred in our model have greater sparsity than when they are inferred using the Laplacian prior, as in the standard sparse coding model. Note, however, that we have not found evidence for these horizontal connections facilitating contour integration, as they do not primarily connect colinear basis functions, contrary to many previous models in the literature that simply assume these weights according to prior intuitions about the function of horizontal connections. It is of great interest to develop new models and unsupervised learning schemes, possibly involving attention, that will help us understand the computational principles underlying contour integration in the visual cortex.

References

[1] B.A. Olshausen and D.J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, June 1996.

[2] R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng. Self-taught learning: Transfer learning from unlabeled data. Proceedings of the Twenty-fourth International Conference on Machine Learning, 2007.

[3] B.A. Olshausen. A functional model of V1 horizontal connectivity based on the statistical structure of natural images. Society for Neuroscience Abstracts, 23(2363), 1997.

[4] P. Hoyer and A. Hyvärinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42:1593–1605, 2002.
[5] M.J. Wainwright, E.P. Simoncelli, and A.S. Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1):89–123, July 2001.

[6] O. Schwartz, T.J. Sejnowski, and P. Dayan. Soft mixer assignment in a hierarchical generative model of natural scene statistics. Neural Computation, 18(11):2680–2718, November 2006.

[7] S. Lyu and E.P. Simoncelli. Statistical modeling of images with fields of Gaussian scale mixtures. In Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada, 2006.

[8] A. Hyvärinen, P.O. Hoyer, J. Hurri, and M. Gutmann. Statistical models of images and early vision. Proceedings of the Int. Symposium on Adaptive Knowledge Representation and Reasoning (AKRR2005), Espoo, Finland, 2005.

[9] Y. Karklin and M.S. Lewicki. A hierarchical Bayesian model for learning non-linear statistical regularities in non-stationary natural signals. Neural Computation, 17(2):397–423, 2005.

[10] B.A. Olshausen and K.J. Millman. Learning sparse codes with a mixture-of-Gaussians prior. Advances in Neural Information Processing Systems, 12, 2000.

[11] D. Fitzpatrick. The functional organization of local circuits in visual cortex: insights from the study of tree shrew striate cortex. Cerebral Cortex, 6:329–341, 1996.

[12] O. Ben-Shahar and S. Zucker. Geometrical computations explain projection patterns of long-range horizontal connections in visual cortex. Neural Computation, 16(3):445–476, March 2004.

[13] L. Zhaoping. Border ownership from intracortical interactions in visual area V2. Neuron, 47:143–153, 2005.

[14] E. Schneidman, M.J. Berry, R. Segev, and W. Bialek. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, April 2006.

[15] G. Hinton, S. Osindero, and K. Bao. Learning causally linked Markov random fields. Artificial Intelligence and Statistics, Barbados, 2005.

[16] M.I. Jordan, Z. Ghahramani, T. Jaakkola, and L.K. Saul. An introduction to variational methods for graphical models. Learning in Graphical Models, Cambridge, MA: MIT Press, 1999.

[17] S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.

[18] C.J. Rozell, D.H. Johnson, R.G. Baraniuk, and B.A. Olshausen. Neurally plausible sparse coding via competitive algorithms. In Proceedings of the Computational and Systems Neuroscience (Cosyne) meeting, Salt Lake City, UT, February 2007.

[19] J.A. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.

[20] H. Lee, A. Battle, R. Raina, and A.Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems 19, pages 801–808. MIT Press, Cambridge, MA, 2007.

[21] D.H. Ackley, G.E. Hinton, and T.J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.

[22] J.A. Tropp. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051, 2006.

[23] Z. Wang, A.C. Bovik, and E.P. Simoncelli. Structural approaches to image quality assessment. In Alan Bovik, editor, Handbook of Image and Video Processing, 2nd edition, chapter 8.3, pages 961–974. Academic Press, May 2005.
[24] R. Malach, Y. Amir, M. Harel, and A. Grinvald. Relationship between intrinsic connections and functional architecture revealed by optical imaging and in vivo targeted biocytin injections in primate striate cortex. Proc. Natl. Acad. Sci. U.S.A., 82:935–939, 1993.

[25] W. Bosking, Y. Zhang, B. Schofield, and D. Fitzpatrick. Orientation selectivity and the arrangement of horizontal connections in the tree shrew striate cortex. J. Neuroscience, 17(6):2112–2127, 1997.