Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

Lars Buesing, Wolfgang Maass
Institute for Theoretical Computer Science, Graz University of Technology, A-8010 Graz, Austria
{lars,maass}@igi.tu-graz.at

Abstract

We show that under suitable assumptions (primarily linearization) a simple and perspicuous online learning rule for Information Bottleneck optimization with spiking neurons can be derived. This rule performs on common benchmark tasks as well as a rather complex rule that has previously been proposed [1]. Furthermore, the transparency of this new learning rule makes a theoretical analysis of its convergence properties feasible. A variation of this learning rule (with sign changes) provides a theoretically founded method for performing Principal Component Analysis (PCA) with spiking neurons. By applying this rule to an ensemble of neurons, different principal components of the input can be extracted. In addition, it is possible to preferentially extract those principal components from incoming signals X that are related, or that are not related, to some additional target signal Y_T. In a biological interpretation, this target signal Y_T (also called relevance variable) could represent proprioceptive feedback, input from other sensory modalities, or top-down signals.

1 Introduction

The Information Bottleneck (IB) approach [2] allows the investigation of learning algorithms for unsupervised and semi-supervised learning on the basis of clear optimality principles from information theory. Two types of time-varying inputs X and Y_T are considered. The learning goal is to learn a transformation from X into another signal Y that extracts only those components from X that are related to the relevance signal Y_T. In a more global biological interpretation, X might represent for example some sensory input, and Y the output of the first processing stage for X in the cortex.
In this article, Y will simply be the spike output of a neuron that receives the spike trains X as inputs. The starting point for our analysis is the first learning rule for IB optimization for this setup, which has recently been proposed in [1], [3]. Unfortunately, this learning rule is complicated, restricted to discrete time, and no theoretical analysis of its behavior is feasible. Any online learning rule for IB optimization has to make a number of simplifying assumptions, since true IB optimization can only be carried out in an offline setting. We show here that, with a slightly different set of assumptions than those made in [1] and [3], one arrives at a drastically simpler and intuitively perspicuous online learning rule for IB optimization with spiking neurons.

The learning rule in [1] was derived by maximizing the objective function L_0:

    L_0 = -I(X, Y) + \beta\, I(Y, Y_T) - \gamma\, D_{KL}(P(Y) \,\|\, \tilde{P}(Y)),    (1)

where I(.,.) denotes the mutual information between its arguments and \beta is a positive trade-off factor. The term D_{KL}(P(Y) \| \tilde{P}(Y)) denotes the Kullback-Leibler divergence between the distribution P(Y) and a target distribution \tilde{P}(Y); this term ensures that the weights remain bounded, and it is briefly discussed in [4]. The target signal Y_T was assumed to be given by a spike train. The learning rule from [1] (see [3] for a detailed interpretation) is quite involved and requires numerous auxiliary definitions (hence we cannot repeat it in this abstract). Furthermore, it can only be formulated in discrete time (step size \Delta t), for reasons we want to outline briefly: in the limit \Delta t \to 0, the essential contribution to the learning rule, which stems from maximizing the mutual information I(Y, Y_T) between output and target signal, vanishes.
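The information-theoretic quantities entering (1) can be illustrated with a small discrete toy example (this is only a sketch of the definitions of I(.,.) and D_KL for finite distributions, not the spiking estimators used later; the function names are our own):

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits for a joint distribution given as a 2-D array."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])))

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two discrete distributions."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# A perfectly correlated binary pair carries 1 bit of mutual information,
# an independent pair carries none.
p_corr  = np.array([[0.5, 0.0], [0.0, 0.5]])
p_indep = np.array([[0.25, 0.25], [0.25, 0.25]])
print(mutual_information(p_corr))    # -> 1.0
print(mutual_information(p_indep))   # -> 0.0
```

Maximizing (1) thus rewards output distributions that share information with Y_T while penalizing both information kept about X and deviation from the target distribution.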
This difficulty is rooted in a rather technical assumption, made in appendix A.4 in [3], concerning the expectation value, at time step k, of the neural firing probability, given the information about the postsynaptic spikes and the target signal spikes up to the preceding time step k-1 (see our detailed discussion in [4]). The remedy proposed in section 3.1 of [3], namely replacing the mutual information I(Y, Y_T) in L_0 by an information rate I(Y, Y_T)/\Delta t, does not solve this problem, as the term I(Y, Y_T)/\Delta t diverges in the continuous-time limit. The restriction to discrete time prevents the application of powerful analytical methods, like the Fokker-Planck equation, which require continuous time for analyzing the dynamics of the learning rule.

In section 2 of this paper, we propose a much simpler learning rule for IB optimization with spiking neurons, which can also be formulated in continuous time. In contrast to [3], we approximate the critical term with a linear estimator, under the assumption that X and Y_T are positively correlated. Further simplifications in comparison to [3] are achieved by considering a simpler neuron model (the linear Poisson neuron, see [5]). However, we show through computer simulations in [4] that the resulting simple learning rule performs equally well for the more complex neuron model with refractoriness from [1] and [5]. The learning rule presented here can be analyzed by means of the drift function of the corresponding Fokker-Planck equation. The theoretical results are outlined in section 3, followed by the consideration of a concrete IB optimization task in section 4. A link between the presented learning rule and Principal Component Analysis (PCA) is established in section 5. A more detailed comparison of the learning rule presented here and the one of [3], as well as results of extensive computer tests on common benchmark tasks, can be found in [4].

2 Neuron model and learning rule for IB optimization

We consider a linear Poisson neuron with N synapses of weights w = (w_1, ..., w_N). It is driven by the input X, consisting of N spike trains X_j(t) = \sum_i \delta(t - t_j^i), j \in \{1, ..., N\}, where t_j^i denotes the time of the i-th spike at synapse j. The membrane potential u(t) of the neuron at time t is given by the weighted sum of the presynaptic activities \nu(t) = (\nu_1(t), ..., \nu_N(t)):

    u(t) = \sum_{j=1}^{N} w_j \nu_j(t),   \quad \nu_j(t) = \int_{-\infty}^{t} \epsilon(t - s)\, X_j(s)\, ds.    (2)

The kernel \epsilon(.) models the EPSP of a single spike (in simulations, \epsilon(t) was chosen to be a decaying exponential with a time constant of \tau_m = 10 ms). The postsynaptic neuron spikes at time t with the probability density g(t):

    g(t) = u(t) / u_0,

with u_0 being a normalization constant. The postsynaptic spike train is denoted as Y(t) = \sum_f \delta(t - t^f), with the firing times t^f.

We now consider the IB task described in general in [2], which consists of maximizing the objective function L_IB, in the context of spiking neurons. As in [6], we introduce a further term L_3 into the objective function that reflects the higher metabolic costs for the neuron of maintaining strong synapses, a natural, simple choice being L_3 = -\lambda \sum_j w_j^2. Thus the complete objective function L to maximize is:

    L = L_{IB} + L_3 = -I(X, Y) + \beta\, I(Y_T, Y) - \lambda \sum_{j=1}^{N} w_j^2.    (3)

The objective function L differs slightly from L_0 given in (1), which was optimized in [3]; this change turned out to be advantageous for the PCA learning rule given in section 5, without significantly changing the characteristics of the IB learning rule. The online learning rule governing the change of the weights w_j(t) at time t is obtained by a gradient ascent of the objective function L:

    \frac{d}{dt} w_j(t) = \eta \frac{\partial L}{\partial w_j}.

For small learning rates \eta and under the assumption that the presynaptic input X and the target signal Y_T are stationary processes, the following learning rule can be derived:

    \frac{d}{dt} w_j(t) = \eta\, \frac{Y(t)\, \nu_j(t)}{u(t)\, \bar{u}(t)} \left[ -\big(u(t) - \bar{u}(t)\big) + \beta \big( F[Y_T](t) - \bar{F}[Y_T](t) \big) \right] - \eta \lambda\, w_j(t),    (4)

where the operator \bar{(.)} denotes the low-pass filter with a time constant \tau_C (in simulations \tau_C = 3 s), i.e. for a function f:

    \bar{f}(t) = \frac{1}{\tau_C} \int_{-\infty}^{t} \exp\!\left( -\frac{t-s}{\tau_C} \right) f(s)\, ds.    (5)

The operator F[Y_T](t) appearing in (4) is equal to the expectation value of the membrane potential \langle u(t) \rangle_{X|Y_T} = E[u(t)|Y_T], given the observations (Y_T(\tau) \,|\, \tau \in \mathbb{R}) of the relevance signal; F is thus closely linked to estimation and filtering theory. For a known joint distribution of the processes X and Y_T, the operator F could in principle be calculated exactly, but it is not clear how this quantity could be estimated in an online process; thus we look for a simple approximation to F. Under the above assumptions, F is time-invariant and can be approximated by a Volterra series (for details see [4]):

    \langle u(t) \rangle_{X|Y_T} = F[Y_T](t) = \sum_{n=0}^{\infty} \int_{\mathbb{R}} \cdots \int_{\mathbb{R}} \kappa_n(t - t_1, ..., t - t_n) \prod_{i=1}^{n} Y_T(t_i)\, dt_i.    (6)

In this article, we concentrate on the situation where F can be well approximated by its linearization F_1[Y_T](t), corresponding to a linear estimator of \langle u(t) \rangle_{X|Y_T}. For F_1[Y_T](t) we make the following ansatz:

    F[Y_T](t) \approx F_1[Y_T](t) = c \cdot u_T(t) = c \int_{\mathbb{R}} \kappa_1(t - t_1)\, Y_T(t_1)\, dt_1.    (7)

According to (7), F is approximated by a convolution u_T(t) of the relevance signal Y_T and a suitable prefactor c. Assuming positively correlated X and Y_T, \kappa_1(t) is chosen to be a non-anticipating decaying exponential \exp(-t/\tau_0)\Theta(t) with a time constant \tau_0 (in simulations \tau_0 = 100 ms), where \Theta(t) is the Heaviside step function. This choice is motivated by the standard models for the impact of neuromodulators (see [7]); such a kernel may thus be implemented by a realistic biological mechanism.
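To make the model ingredients concrete, the linear Poisson neuron of equation (2) and the filtered trace u_T of (7) can be sketched in a few lines of discrete-time code (an Euler-style sketch; only \tau_m = 10 ms and \tau_0 = 100 ms are taken from the text, all other parameter values and names are our own toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)

dt, T = 1e-3, 5.0            # 1 ms time step, 5 s of simulated time (toy choices)
steps = int(T / dt)
N     = 10                   # number of synapses (toy value)
tau_m = 10e-3                # EPSP time constant, as in the text
tau_0 = 100e-3               # kernel kappa_1 time constant, as in the text
u0    = 2.0                  # normalization constant of g(t) (toy value)
w     = np.ones(N)           # synaptic weights, held fixed in this sketch

rate = 20.0                  # presynaptic Poisson rate in Hz
X  = rng.random((steps, N)) < rate * dt   # input spike trains X_j
YT = rng.random(steps) < rate * dt        # a Poisson relevance spike train

nu, uT = np.zeros(N), 0.0
u_trace, out_spikes = [], 0
for k in range(steps):
    nu = nu * np.exp(-dt / tau_m) + X[k]   # eq. (2): EPSP-filtered inputs nu_j
    uT = uT * np.exp(-dt / tau_0) + YT[k]  # eq. (7): filtered trace u_T of Y_T
    u  = float(w @ nu)                     # membrane potential u(t)
    u_trace.append(u)
    if rng.random() < (u / u0) * dt:       # fire with density g(t) = u(t)/u0
        out_spikes += 1

# The stationary mean of u is approximately N * rate * tau_m = 2.0 here.
print(np.mean(u_trace))
```

The traces nu_j and u_T are the only state the later learning rules need, which is what makes an online formulation possible.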
It turned out that the choice of \tau_0 was not critical; it could be varied over a decade, ranging from 10 ms to 100 ms. The prefactor c appearing in (7) can be determined from the fact that F_1 is the optimal linear estimator of the form given in (7), leading to:

    c = \frac{\langle u_T(t), u(t) \rangle}{\langle u_T(t), u_T(t) \rangle}.

The quantity c can be estimated online in the following way:

    \frac{d}{dt} c(t) = \eta_c \big(u_T(t) - \bar{u}_T(t)\big) \left[ \big(u(t) - \bar{u}(t)\big) - c(t)\big(u_T(t) - \bar{u}_T(t)\big) \right].

Using the above definitions, the resulting learning rule is given by (in vector notation):

    \frac{d}{dt} w(t) = \eta\, \frac{Y(t)\, \nu(t)}{u(t)\, \bar{u}(t)} \left[ -\big(u(t) - \bar{u}(t)\big) + \beta\, c(t) \big(u_T(t) - \bar{u}_T(t)\big) \right] - \eta \lambda\, w(t).    (8)

Equation (8) will be called the spike-based learning rule, as the postsynaptic spike train Y(t) explicitly appears in it. An accompanying rate-based learning rule can also be derived:

    \frac{d}{dt} w(t) = \eta\, \frac{\nu(t)}{u_0\, \bar{u}(t)} \left[ -\big(u(t) - \bar{u}(t)\big) + \beta\, c(t) \big(u_T(t) - \bar{u}_T(t)\big) \right] - \eta \lambda\, w(t).    (9)

3 Analytical results

The learning rules (8) and (9) are stochastic differential equations for the weights w_j, driven by the processes Y(.), \nu_j(.) and u_T(.), of which the last two are assumed to be stationary with the means \bar{\nu}_j(t) = \nu_0 and \bar{u}_T(t) = u_{T,0}, respectively. The evolution of the solutions w(t) to (8) and (9) may be studied via a Master equation for the probability distribution of the weights p(w, t) (see [8]). For small learning rates \eta, the stationary distribution p(w) sharply peaks at the roots of the drift function A(w) of the corresponding Fokker-Planck equation (the detailed derivation is given in [4]); it can be shown that the diffusion term in the Fokker-Planck equation scales like O(\eta), i.e. for small learning rates the fluctuations tend to zero and the dynamics can be approximated by the differential equation (10). Thus, for \eta \ll 1, the temporal evolution of the learning rules (8) and (9) may be studied via the deterministic differential equation:

    \frac{d}{dt} \hat{w} = A(\hat{w}) = \frac{1}{u_0 \nu_0 z} \left( -C^0 + \beta c\, C^1 \right) \hat{w} - \lambda \hat{w},    (10)

    z = \sum_{j=1}^{N} \hat{w}_j,    (11)

where z is the total weight. The matrix C = -C^0 + \beta c\, C^1 (with elements C_{ij}) has two contributions: C^0 is the covariance matrix of the input, and the matrix C^1 quantifies the covariance between the activities \nu_j and the trace u_T:

    C^0_{ij} = \langle \nu_i(t), \nu_j(t) \rangle,
    \quad C^1_{ij} = \frac{\langle \nu_i(t), u_T(t) \rangle\, \langle u_T(t), \nu_j(t) \rangle}{\langle u_T(t), u_T(t) \rangle}.

Now the critical points w^* of the dynamics (10) are investigated. These critical points, if asymptotically stable, determine the peaks of the stationary distribution p(w) of the weights w; we therefore expect the solutions of the stochastic equations to fluctuate around these fixed points w^*. If \beta and \lambda are much larger than one, the term containing the matrix C^0 can be neglected, and equation (10) has a unique stable fixed point w^* that is parallel to the covariance vector C^T with components C^T_i = \langle \nu_i(t), u_T(t) \rangle. Under this assumption, the maximal mutual information between the target signal Y_T(t) and the output of the neuron Y(t) is obtained by a weight vector w = w^* parallel to C^T. In general, the critical points of equation (10) depend on the eigenvalue spectrum of the symmetric matrix C: if all eigenvalues are negative, the weight vector \hat{w} decays to the lower hard bound 0. In case of at least one positive eigenvalue (which exists if \beta is chosen large enough), there is a unique stable fixed point w^*:

    w^* = \frac{\mu}{\lambda u_0 \nu_0 \tilde{b}}\, b,   \quad \tilde{b} := \sum_{i=1}^{N} b_i.    (12)

The vector b appearing in (12) is the eigenvector of C corresponding to the largest eigenvalue \mu. Thus, a stationary unimodal distribution p(w) of the weights w is predicted, which is centered around the value w^*. (Note that p(w) denotes the distribution of the weight vector, not the distribution of a single weight p(w_j).)

4 A concrete example for IB optimization

A special scenario of interest, which often appears in the literature (see for example [1], [9] and [10]), is the following: the synapses, and subsequently the input spike trains, form M different subgroups

Figure 1: A The basic setup for the Information Bottleneck optimization.
B-D Numerical and analytical results for the IB optimization task described in section 4. The temporal evolution of the average weights \tilde{w}_l = \frac{1}{N/M} \sum_{j \in G_l} w_j of the four different synaptic subgroups G_l is shown. B The performance of the spike-based rule (8). The highest trajectory corresponds to \tilde{w}_1; it stays close to its analytically predicted fixed-point value obtained from (12), which is visualized by the upper dashed line. The trajectory just below belongs to \tilde{w}_3, for which the fixed-point value is also plotted as a dashed line. The other two trajectories, \tilde{w}_2 and \tilde{w}_4, decay and eventually fluctuate above the predicted value of zero. C The performance of the rate-based rule (9); the results are analogous to the ones of the spike-based rule. D Simulation of the deterministic equation (10).

G_l, l \in \{1, ..., M\}, of the same size N/M \in \mathbb{N}. The spike trains X_j and X_k, j \neq k, are statistically independent if they belong to different subgroups; within a subgroup there is a homogeneous covariance term C^0_{jk} = c_l for j \neq k, j, k \in G_l, which can be due either to spike-spike correlations or to correlations in rate modulations. The covariance between the target signal Y_T and the spike trains X_j is homogeneous within a subgroup.

As a numerical example, we consider in figure 1 a modification of the IB task presented in figure 2 of [1]. The N = 100 synapses form M = 4 subgroups G_l = \{25(l-1)+1, ..., 25l\}, l \in \{1, ..., 4\}. Synapses in G_1 receive Poisson spike trains of constant rate \nu_0 = 20 Hz, which are mutually spike-spike correlated with a correlation coefficient of 0.5. The same holds for the spike trains of G_2. Spike trains for G_3 and G_4 are uncorrelated Poisson trains with a common rate modulation, which is equal to low-pass filtered white noise (cut-off frequency 5 Hz) with mean \nu_0 and standard deviation (SD) \sigma = \nu_0 / 2. The rate modulations for G_3 and G_4 are, however, independent (though identically distributed).
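The rate-based rule (9), together with the online estimate of c, can be sketched on a strongly simplified version of such a task (a rate-level sketch with abstract Gaussian signals standing in for the filtered spike trains; all parameter values and variable names are our own toy choices, not the ones used in the paper's simulations):

```python
import numpy as np

rng = np.random.default_rng(1)
dt, steps = 1e-3, 50_000         # 50 s of simulated time
N = 4                            # two "relevant" and two "irrelevant" synapses
eta, eta_c = 2.0, 20.0           # learning rates (toy values)
beta, lam  = 3.0, 0.05           # trade-off factors (toy values)
u0, nu0    = 1.0, 1.0
tau_s, tau_C = 0.1, 3.0          # signal and low-pass time constants

w, c = np.full(N, 0.5), 0.0
s_rel = s_irr = 0.0              # slow Ornstein-Uhlenbeck signals
u_bar, uT_bar = 2.0, 1.0
for _ in range(steps):
    s_rel += dt * (-s_rel / tau_s) + np.sqrt(dt) * rng.normal()
    s_irr += dt * (-s_irr / tau_s) + np.sqrt(dt) * rng.normal()
    nu = nu0 + np.array([s_rel, s_rel, s_irr, s_irr])  # presynaptic activities
    uT = nu0 + s_rel                                   # relevance trace u_T
    u  = float(w @ nu)                                 # membrane potential
    u_bar  += dt / tau_C * (u - u_bar)                 # low-pass filters, eq. (5)
    uT_bar += dt / tau_C * (uT - uT_bar)
    # online estimate of the prefactor c
    c += dt * eta_c * (uT - uT_bar) * ((u - u_bar) - c * (uT - uT_bar))
    # rate-based IB rule, eq. (9), with a hard lower bound at 0
    dw = nu / (u0 * u_bar) * (-(u - u_bar) + beta * c * (uT - uT_bar)) - lam * w
    w = np.maximum(w + dt * eta * dw, 0.0)

# The synapses sharing the relevant signal end up stronger than the others.
print(w.round(2))
```

As in the full task, the synapses whose input covaries with the relevance signal are potentiated and the remaining ones are depressed toward the lower bound.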
Two spike trains from different synapse subgroups are statistically independent. The target signal Y_T was chosen to be the sum of two Poisson trains. The first is of constant rate \nu_0 and has spike-spike correlations of coefficient 0.5 with G_1 (spike-spike correlated Poisson spike trains were generated according to the method outlined in [9]); the second is a Poisson spike train with the same rate modulation as the spike trains of G_3, superimposed by additional white noise of SD 2 Hz. Furthermore, the target signal was turned off during random intervals; these intervals of silence were modeled as random telegraph noise with a time constant of 200 ms and an overall probability of silence of 0.5. The resulting evolution of the weights is shown in figure 1, illustrating the performance of the spike-based rule (8) as well as of the rate-based rule (9). As expected, the weights of G_1 and G_3 are potentiated, as Y_T has mutual information with the corresponding parts of the input. The synapses of G_2 and G_4 are depressed. The analytical result for the stable fixed point w^* obtained from (12) is shown as dashed lines and is in good agreement with the numerical results. Furthermore, the trajectory of the solution \hat{w}(t) to the deterministic equation (10) is plotted. The presented concrete IB task was slightly changed from the one presented in [1], because for the setting used here, the largest eigenvalue \mu of C and its corresponding eigenvector b can be calculated analytically. The simulation results for the original setting in [1] can also be reproduced with the simpler rules (8) and (9) (not shown).

5 Relevance-modulated PCA with spiking neurons

The presented learning rules (8) and (9) exhibit a close relation to Principal Component Analysis (PCA). A learning rule which enables the linear Poisson neuron to extract principal components from the input X(.) can be derived by maximizing the following objective function:

    L_{PCA} = -L_{IB} - \lambda \sum_{j=1}^{N} w_j^2 = +I(X, Y) - \beta\, I(Y_T, Y) - \lambda \sum_{j=1}^{N} w_j^2,    (13)

which differs from (3) just by a change of sign in front of L_IB. The resulting learning rule is in close analogy to (8):

    \frac{d}{dt} w(t) = \eta\, \frac{Y(t)\, \nu(t)}{u(t)\, \bar{u}(t)} \left[ \big(u(t) - \bar{u}(t)\big) - \beta\, c(t) \big(u_T(t) - \bar{u}_T(t)\big) \right] - \eta \lambda\, w(t).    (14)

The corresponding rate-based version can also be derived. Without the trace u_T(.) of the target signal, it can be seen that the solution \hat{w}(t) of the deterministic equation corresponding to (14) (which is of the same form as (10), with the obvious sign changes) converges to an eigenvector of the covariance matrix C^0. Thus, for \beta = 0, we expect the learning rule (14) to perform PCA for small learning rates \eta. The rule (14) without the relevance signal is comparable to other PCA rules, e.g. the covariance rule (see [11]) for non-spiking neurons. The side information given by the relevance signal Y_T(.) can be used to extract specific principal components from the input; we therefore call this paradigm relevance-modulated PCA.

Before we consider a concrete example for relevance-modulated PCA, we want to point out a further application of the learning rule (14). The target signal Y_T can also be used to extract different components from the input with different neurons (see figure 2). Consider m neurons receiving the same input X. These neurons have the outputs Y_1(.), ..., Y_m(.), target signals Y_T^1(.), ..., Y_T^m(.) and weight vectors w^1(t), ..., w^m(t), the latter evolving according to (14). In order to prevent all weight vectors from converging towards the same eigenvector of C^0 (the principal component), the target signal Y_T^i for neuron i is chosen to be the sum of all output spike trains except Y_i:

    Y_T^i(t) = \sum_{j=1,\, j \neq i}^{m} Y_j(t).    (15)

If one weight vector w^i(t) is already close to the eigenvector e_k of C^0, then by means of (15) the basins of attraction of e_k for the other weight vectors w^j(t), j \neq i, are reduced (or even vanish, depending on the value of \beta). It is therefore less likely (or impossible) that they also converge to e_k. In practice, this setup is sufficiently robust if only a small number (\leq 4) of different components is to be extracted and if the differences between the eigenvalues of these principal components are not too big. (Note that the input X may well exhibit a much larger number of principal components; however, only a limited number of them can be extracted by different neurons at the same time.) For the PCA learning rule, the time constant \tau_0 of the kernel \kappa_1 (see (7)) had to be chosen smaller than for the IB tasks in order to obtain good performance; we used \tau_0 = 10 ms in simulations. This is in the range of time constants for IPSPs. Hence, the signals Y_T^i could probably be implemented via lateral inhibition.

The learning rule considered in [3] displayed a close relation to Independent Component Analysis (ICA). Because of the linear neuron model used here and the linearization of further terms in the derivation, the resulting learning rule (14) performs PCA instead of ICA.

The results of a numerical example are shown in figure 2. The m = 3 neurons for the regular PCA experiment receive the same input X, and their weights change according to (14). The weights and input spike trains are grouped into four subgroups G_1, ..., G_4, as for the IB optimization discussed

Figure 2: A The basic setup for the PCA task: the m different neurons receive the same input X and are expected to extract different principal components of it.
B-F The temporal evolution of the average subgroup weights \tilde{w}_l = \frac{1}{25} \sum_{j \in G_l} w_j for the groups G_1 (black solid line), G_2 (light gray solid line) and G_3 (dotted line). B-C Results for the relevance-modulated PCA task: neuron 1 (fig. B) specializes on G_2 and neuron 2 (fig. C) on subgroup G_3. D-F Results for the regular PCA task: neuron 1 (fig. D) specializes on G_1, neuron 2 (fig. E) on G_2 and neuron 3 (fig. F) on G_3.

in section 4. The only difference is that all groups (except for G_4) receive spike-spike correlated Poisson spike trains, with correlation coefficients of 0.5, 0.45 and 0.4 for the groups G_1, G_2 and G_3, respectively. Group G_4 receives uncorrelated Poisson spike trains. As can be seen in figure 2 D to F, the different neurons specialize on different principal components, corresponding to the potentiated synaptic subgroups G_1, G_2 and G_3, respectively. Without the relevance signals Y_T^i(.), all neurons tend to specialize on the principal component corresponding to G_1 (not shown).

As a concrete example for relevance-modulated PCA, we consider the above setup with slight modifications: now we want m = 2 neurons to extract the components G_2 and G_3 from the input X, and not the principal component G_1. This is achieved with an additional relevance signal Y_T^0, which is the same for both neurons and has spike-spike correlations of 0.45 and 0.4 with G_2 and G_3, respectively. We add the term \beta_0 I(Y, Y_T^0) to the objective function (13), where \beta_0 is a positive trade-off factor. The resulting learning rule has exactly the same structure as (14), with an additional term due to Y_T^0. The numerical results are presented in figure 2 B and C, showing that in this setup it is possible to explicitly select the principal components that are extracted (or not extracted) by the neurons.

6 Discussion

We have introduced and analyzed a simple and perspicuous rule that enables spiking neurons to perform IB optimization in an online manner.
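As a numerical cross-check of the analysis in section 3, the fixed-point prediction (12) can be verified directly for the PCA case, where the relevance term is absent (\beta = 0) and the drift is governed by +C^0. The following sketch uses a toy block covariance loosely mirroring the group structure of section 5; the sizes, variance and constants are our own choices, and the sign convention of the drift follows our reading of "(10) with the obvious sign changes":

```python
import numpy as np

# Block-structured input covariance C0: three correlated groups of five
# synapses (correlation coefficients 0.5, 0.45, 0.4) plus one uncorrelated
# group (toy sizes and unit variance).
def block_cov(sizes, rhos, var=1.0):
    n = sum(sizes)
    C = np.zeros((n, n))
    start = 0
    for size, rho in zip(sizes, rhos):
        C[start:start + size, start:start + size] = rho * var
        start += size
    np.fill_diagonal(C, var)
    return C

C0 = block_cov([5, 5, 5, 5], [0.5, 0.45, 0.4, 0.0])

# The largest eigenpair of C0 gives the predicted fixed point, eq. (12).
evals, evecs = np.linalg.eigh(C0)
mu, b = evals[-1], evecs[:, -1]
b = b if b.sum() > 0 else -b              # pick the non-negative orientation
lam, u0, nu0 = 0.05, 1.0, 1.0             # toy constants
w_star = mu * b / (lam * u0 * nu0 * b.sum())

def drift(w):
    """Deterministic drift of eq. (10) with the PCA sign and beta = 0."""
    z = w.sum()
    return C0 @ w / (u0 * nu0 * z) - lam * w

print(np.allclose(drift(w_star), 0.0))    # -> True: w* is indeed a fixed point
```

The eigenvector with the largest eigenvalue here is supported on the most strongly correlated group, matching the observation that, without relevance signals, all neurons tend toward the principal component of G_1.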
Our simulations show that this rule works as well as the substantially more complex learning rule that had previously been proposed in [3]. It also performs well for more realistic neuron models, as indicated in [4]. We have shown that the convergence properties of our simplified IB rule can be analyzed with the help of the Fokker-Planck equation (alternatively, one may also use the theoretical framework described in A.2 in [12] for its analysis). The investigation of the weight vectors to which this rule converges reveals interesting relationships to PCA. Apparently, very little is known about learning rules that enable spiking neurons to extract multiple principal components from an input stream (a discussion of a basic learning rule performing PCA is given in chapter 11.2.4 of [5]). We have demonstrated both analytically and through simulations that a slight variation of our new learning rule performs PCA. Our derivation of this rule within the IB framework opens the door to new variations of PCA where preferentially those components are extracted from a high-dimensional input stream that are, or are not, related to some external relevance variable. We expect that a further investigation of such methods will shed light on the unknown principles of unsupervised and semi-supervised learning that might shape and constantly retune the output of lower cortical areas to intermediate and higher cortical areas. The learning rule that we have proposed might in principle be able to extract from high-dimensional sensory input streams X those components that are related to other sensory modalities or to internal expectations and goals. Quantitative biological data on the precise way in which relevance signals Y_T (such as, for example, dopamine) might reach neurons in the cortex and modulate their synaptic plasticity are still missing.
But it is fair to assume that these signals reach the synapse in a low-pass filtered form of the type u_T that we have assumed for our learning rules. From that perspective, one can view the learning rules that we have derived (in contrast to the rules proposed in [3]) as local learning rules.

Acknowledgments

Written under partial support by the Austrian Science Fund FWF, projects # P17229 and # S9102, and project # FP6-015879 (FACETS) of the European Union.

References

[1] S. Klampfl, R. A. Legenstein, and W. Maass. Information bottleneck optimization and independent component extraction with spiking neurons. In Proc. of NIPS 2006, Advances in Neural Information Processing Systems, volume 19. MIT Press, 2007.
[2] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368-377, 1999.
[3] S. Klampfl, R. Legenstein, and W. Maass. Spiking neurons can learn to solve information bottleneck problems and to extract independent components. Neural Computation, 2007. In press.
[4] L. Buesing and W. Maass. Journal version. 2007. In preparation.
[5] W. Gerstner and W. M. Kistler. Spiking Neuron Models. Cambridge University Press, Cambridge, 2002.
[6] T. Toyoizumi, J.-P. Pfister, K. Aihara, and W. Gerstner. Optimality model of unsupervised spike-timing dependent plasticity: synaptic memory and weight distribution. Neural Computation, 19(3):639-671, 2007.
[7] E. M. Izhikevich. Solving the distal reward problem through linkage of STDP and dopamine signaling. Cereb. Cortex, page bhl152, 2007.
[8] H. Risken. The Fokker-Planck Equation. Springer, 3rd edition, 1996.
[9] R. Gütig, R. Aharonov, S. Rotter, and H. Sompolinsky. Learning input correlations through non-linear temporally asymmetric Hebbian plasticity. Journal of Neuroscience, 23:3697-3714, 2003.
[10] H. Meffin, J. Besson, A. N. Burkitt, and D. B. Grayden. Learning the structure of correlated synaptic subgroups using stable and competitive spike-timing-dependent plasticity. Physical Review E, 73, 2006.
[11] T. J. Sejnowski and G. Tesauro. The Hebb rule for synaptic plasticity: algorithms and implementations. In J. H. Byrne and W. O. Berry, editors, Neural Models of Plasticity, pages 94-103. Academic Press, 1989.
[12] N. Intrator and L. N. Cooper. Objective function formulation of the BCM theory of visual cortical plasticity: statistical connections, stability conditions. Neural Networks, 5:3-17, 1992.