An online Hebbian learning rule that performs Independent Component Analysis

Claudia Clopath
School of Computer Science and Brain Mind Institute
Ecole polytechnique federale de Lausanne, 1015 Lausanne EPFL
claudia.clopath@epfl.ch

Andre Longtin
Center for Neural Dynamics, University of Ottawa
150 Louis Pasteur, Ottawa
alongtin@uottawa.ca

Wulfram Gerstner
School of Computer Science and Brain Mind Institute
Ecole polytechnique federale de Lausanne, 1015 Lausanne EPFL
wulfram.gerstner@epfl.ch

Abstract

Independent component analysis (ICA) is a powerful method to decouple signals. Most algorithms performing ICA do not consider the temporal correlations of the signal, but only higher moments of its amplitude distribution. Moreover, they require some preprocessing of the data (whitening) so as to remove second-order correlations. In this paper, we are interested in understanding the neural mechanism responsible for solving ICA. We present an online learning rule that exploits delayed correlations in the input. This rule performs ICA by detecting joint variations in the firing rates of pre- and postsynaptic neurons, similar to a local rate-based Hebbian learning rule.

1 Introduction

The so-called cocktail party problem refers to a situation where several sound sources are simultaneously active, e.g. several persons talking at the same time. The goal is to recover the original sound sources from measurements of the mixed signals. A standard method of solving the cocktail party problem is independent component analysis (ICA), which can be performed by a class of powerful algorithms. However, classical algorithms based on higher moments of the signal distribution [1] do not consider temporal correlations, i.e. data points corresponding to different time slices could be shuffled without changing the results. But time order is important, since most natural signal sources have intrinsic temporal correlations that could potentially be exploited. Therefore, some algorithms have been developed that take these temporal correlations into account, e.g. algorithms based on delayed correlations [2, 3, 4, 5], potentially combined with higher-order statistics [6], algorithms based on innovation processes [7], or complexity pursuit [8]. However, those methods are rather algorithmic and most of them are difficult to interpret biologically, e.g. they are not online, not local, or require a preprocessing of the data.

Biological learning algorithms are usually implemented as an online Hebbian learning rule that triggers changes of synaptic efficacy based on the correlations between pre- and postsynaptic neurons. A Hebbian learning rule such as Oja's rule [9], combined with a linear neuron model, has been shown to perform principal component analysis (PCA). Simply using a nonlinear neuron combined with Oja's learning rule allows one to compute higher moments of the distributions, which yields ICA if the signals have been preprocessed (whitened) at an earlier stage [1]. In this paper, we are interested in exploiting the correlations of the signals at different time delays, i.e. a generalization of the theory of Molgedey and Schuster [4]. We will show that a linear neuron model combined with a Hebbian learning rule based on the joint firing rates of pre- and postsynaptic neurons at different time delays performs ICA by exploiting the temporal correlations of the presynaptic inputs.

Figure 1: The sources $\mathbf{s}$ are mixed with a matrix $C$, $\mathbf{x} = C\mathbf{s}$; $\mathbf{x}$ are the presynaptic signals. Using a linear neuron $\mathbf{y} = W\mathbf{x}$, we want to find the matrix $W$ which allows the postsynaptic signals $\mathbf{y}$ to recover the sources, $\mathbf{y} = P\mathbf{s}$, where $P$ is a permutation matrix with different multiplicative constants.
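To make the setup of Figure 1 concrete, the following minimal sketch (ours, not part of the paper; it assumes NumPy and uses illustrative source choices) builds two temporally correlated sources, mixes them with a random matrix $C$, and states the goal of the learning rule as a comment.

```python
# Minimal sketch of the mixing model in Figure 1 (assumes NumPy; names are illustrative).
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(0, 10, 0.001)                      # 10 s sampled at 1 kHz

# Two statistically independent, temporally correlated sources with zero mean.
s = np.vstack([np.sin(2 * np.pi * 1.0 * t),      # sine wave
               2 * ((0.5 * t) % 1.0) - 1.0])     # saw-tooth ramp

C = rng.normal(size=(2, 2))                      # unknown mixing matrix
x = C @ s                                        # presynaptic (mixed) signals, x = C s

# The goal of the learning rule is a weight matrix W such that y = W x = P s,
# i.e. the sources up to permutation and scaling; ideally W is close to C^{-1}.
```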
2 Mathematical derivation of the learning rule

2.1 The problem

We assume statistically independent, autocorrelated source signals $s_i$ with mean $\langle s_i \rangle = 0$ (where $\langle \cdot \rangle$ denotes averaging over time) and correlations $\langle s_i(t)\,s_j(t') \rangle = K_i(|t-t'|)\,\delta_{ij}$. The sources $\mathbf{s}$ are mixed by a matrix $C$,

    $\mathbf{x} = C\,\mathbf{s}$,    (1)

where $\mathbf{x}$ are the mixed signals recorded by a finite number of receptors (bold notation refers to a vector). We think of the receptors as presynaptic neurons that are connected via a weight matrix $W$ to postsynaptic neurons. We consider linear neurons [9], so that the postsynaptic signals $\mathbf{y}$ can be written as

    $\mathbf{y} = W\,\mathbf{x}$.    (2)

The aim is to find a learning rule that adjusts the weight matrix $W$ towards $W^*$ (* denotes the value at the solution) so that the postsynaptic signals $\mathbf{y}$ recover the independent sources $\mathbf{s}$ (Fig. 1), i.e. $\mathbf{y} = P\mathbf{s}$, where $P$ is a permutation matrix with different multiplicative constants (the sources are recovered in a different order and each only up to a multiplicative constant). Neglecting $P$, this means

    $W^* = C^{-1}$.    (3)

To solve this problem we extend the theory of Molgedey and Schuster [4] in order to derive an online, biologically plausible Hebbian rule.

2.2 Theory of Molgedey and Schuster and generalization

The paper of Molgedey and Schuster [4] considers not only the instantaneous correlation matrix but also the time-delayed correlations $M_{ij} = \langle x_i(t)\,x_j(t+\tau) \rangle$ of the incoming signals. Since the correlation matrix $M_{ij}$ is symmetric, it has at most $n(n+1)/2$ independent elements. However, the unknown mixing matrix $C$ has potentially $n^2$ elements (for $n$ sources and $n$ detectors). Therefore, we need to evaluate two delayed correlation matrices $M$ and $\bar{M}$ with two different time delays, defined as

    $M_{ij} = \langle x_i(t)\,x_j(t+\tau_1) \rangle, \qquad \bar{M}_{ij} = \langle x_i(t)\,x_j(t+\tau_2) \rangle$,    (4)

to get enough information about the mixing process [10]. From equation (1), we obtain the relation $M_{ij} = \sum_l C_{il} C_{jl} \Lambda_{ll}$ and similarly $\bar{M}_{ij} = \sum_l C_{il} C_{jl} \bar{\Lambda}_{ll}$, where $\Lambda_{ij} = \delta_{ij} K_i(\tau_1)$ and $\bar{\Lambda}_{ij} = \delta_{ij} K_i(\tau_2)$ are diagonal matrices. Since $M = C \Lambda C^T$ and $\bar{M} = C \bar{\Lambda} C^T$, we have

    $(\bar{M} M^{-1})\,C = C\,(\bar{\Lambda} \Lambda^{-1})$.    (5)

It follows that $C$ can be found from an eigenvalue problem. Since $C$ is the mixing matrix, a simple matrix inversion then allows Molgedey and Schuster to recover the original sources [4].
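As a concrete illustration of the eigenvalue problem (5), the following sketch (assuming NumPy; the function and variable names are ours, not from the paper) estimates the two delayed correlation matrices from the mixed signals and recovers an unmixing matrix from the eigenvectors of $\bar{M} M^{-1}$.

```python
# A sketch of the batch Molgedey-Schuster solution of the eigenvalue problem (5),
# assuming NumPy; names are illustrative.
import numpy as np

def delayed_correlation(x, tau):
    """Estimate M_ij(tau) = < x_i(t) x_j(t + tau) >, symmetrized for numerical robustness."""
    T = x.shape[1] - tau
    M = x[:, :T] @ x[:, tau:tau + T].T / T
    return 0.5 * (M + M.T)

def molgedey_schuster_unmixing(x, tau1, tau2=0):
    """Return an estimate of the unmixing matrix W = C^{-1} (rows up to order and scale)."""
    M = delayed_correlation(x, tau1)          # M     with delay tau_1
    M_bar = delayed_correlation(x, tau2)      # M_bar with delay tau_2
    # Equation (5): (M_bar M^{-1}) C = C (Lambda_bar Lambda^{-1}),
    # so the columns of C are the eigenvectors of M_bar M^{-1}.
    _, C_est = np.linalg.eig(M_bar @ np.linalg.inv(M))
    C_est = np.real(C_est)                    # discard small imaginary parts due to estimation noise
    return np.linalg.inv(C_est)

# Usage with mixed signals x of shape (n_receptors, n_samples):
#   W = molgedey_schuster_unmixing(x, tau1=30)
#   y = W @ x    # recovered sources, up to permutation and scaling
```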
2.3 Our learning rule

In order to understand the putative neural mechanism performing ICA derived from the formalism developed above, we need to find an online learning rule describing changes of the synapses as a function of pre- and postsynaptic activity. Taking the inverse of (5), we have $C^{-1} \bar{M} M^{-1} = \bar{\Lambda} \Lambda^{-1} C^{-1}$. Therefore, because of (3), we expect for weights that solve the ICA problem that

    $W \bar{M} = \bar{\Lambda} \Lambda^{-1}\, W M$,    (6)

which defines the weight matrix $W^*$ at the solution. For the sake of simplicity, consider only one linear postsynaptic neuron; the generalization to many postsynaptic neurons is straightforward (see section 4). The output signal $y$ of the neuron can be written as $y = \mathbf{w}^T \mathbf{x}$, where $\mathbf{w}^T$ is a row of the matrix $W$. Then equation (6) can be written as

    $\mathbf{w}^T \bar{M} = \lambda\, \mathbf{w}^T M$,    (7)

where $\lambda$ is one element of the diagonal matrix $\bar{\Lambda} \Lambda^{-1}$. In order to solve this equation, we can use the following iterative update rule with update parameter $\gamma$:

    $\Delta \mathbf{w} = \gamma\,[\,\mathbf{w}^T \bar{M} - \lambda\, \mathbf{w}^T M\,]$.    (8)

The fixed point of this update rule is given by (7), i.e. $\mathbf{w} = \mathbf{w}^*$. Furthermore, multiplication of (7) with $\mathbf{w}$ yields $\lambda = \mathbf{w}^T \bar{M} \mathbf{w} / \mathbf{w}^T M \mathbf{w}$. If we insert the definitions of $M$ and $\bar{M}$, use (2), and absorb a factor $-1/\lambda$ into the learning rate, we obtain the following rule:

    $\Delta \mathbf{w} = \gamma\,[\,\langle y(t)\,\mathbf{x}(t+\tau_1)\rangle - \alpha\,\langle y(t)\,\mathbf{x}(t+\tau_2)\rangle\,]$,    (9)

with a parameter $\alpha$ given by

    $\alpha = \dfrac{\langle y(t)\,y(t+\tau_1)\rangle}{\langle y(t)\,y(t+\tau_2)\rangle}$.

It is possible to show that $\Delta \mathbf{w}$ is orthogonal to $\mathbf{w}$. This implies that, to first order (in $|\Delta\mathbf{w}|/|\mathbf{w}|$), $\mathbf{w}$ will keep the same norm during iterations of (9).

The rule (9) we derived is a batch rule, i.e. it averages over all sample signals. We convert this rule into an online learning rule by taking a small learning rate and using an online estimate of $\alpha$:

    $\Delta \mathbf{w} = \gamma\,[\,y(t)\,\mathbf{x}(t+\tau_1) - (\alpha_1/\alpha_2)\, y(t)\,\mathbf{x}(t+\tau_2)\,]$,
    $\Delta \alpha_1 = \eta\,[\,-\alpha_1 + y(t)\,y(t+\tau_1)\,]$,
    $\Delta \alpha_2 = \eta\,[\,-\alpha_2 + y(t)\,y(t+\tau_2)\,]$.    (10)

Note that the rule defined in (10) uses information on the correlated activity $xy$ of pre- and postsynaptic neurons as well as an estimate of the autocorrelation $\langle y y \rangle$ of the postsynaptic neuron. The averaging time constant $1/\eta$ is taken sufficiently long so as to average over a representative sample of the signals, and $|\gamma| \ll 1$ is a small learning rate. Stability properties of updates under rule (10) are discussed in section 4.
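A minimal sketch of the online rule (10) for a single linear neuron is given below (assuming NumPy; the parameter values and the names gamma, eta, a1, a2 are illustrative, not values from the paper).

```python
# A sketch of the online learning rule (10) for a single linear neuron, assuming NumPy.
import numpy as np

def online_ica_neuron(x, tau1, tau2=0, gamma=1e-4, eta=1e-3, seed=0):
    """Learn one weight vector w from mixed signals x (shape: n_inputs x n_samples)."""
    rng = np.random.default_rng(seed)
    n, T = x.shape
    w = rng.normal(scale=1.0 / np.sqrt(n), size=n)    # random initial weights
    a1, a2 = 1.0, 1.0                                 # online estimates of <y(t) y(t+tau_1,2)>
    for t in range(T - max(tau1, tau2)):
        y_t = w @ x[:, t]                             # y(t)       = w^T x(t)
        y_1 = w @ x[:, t + tau1]                      # y(t+tau_1) = w^T x(t+tau_1)
        y_2 = w @ x[:, t + tau2]                      # y(t+tau_2) = w^T x(t+tau_2)
        # Delta w  = gamma [ y(t) x(t+tau_1) - (a1/a2) y(t) x(t+tau_2) ]
        w = w + gamma * (y_t * x[:, t + tau1] - (a1 / a2) * y_t * x[:, t + tau2])
        # Delta a1 = eta [ -a1 + y(t) y(t+tau_1) ],  Delta a2 = eta [ -a2 + y(t) y(t+tau_2) ]
        a1 += eta * (-a1 + y_t * y_1)
        a2 += eta * (-a2 + y_t * y_2)
    return w

# Usage: w = online_ica_neuron(x, tau1=30); y = w @ x  gives one source, up to a scale factor.
```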
3 Performance of the learning rule

A simple example of a cocktail party problem is shown in Fig. 2, where two signals, a sine wave and a ramp (saw-tooth signal), have been mixed. The learning rule converges to a correct set of synaptic weights, so that the postsynaptic signal correctly recovers one of the sources. Postsynaptic neurons with different combinations of $\tau_1$ and $\tau_2$ are able to recover different signals (see section 4 on stability). In the simulations, we find that convergence is fast and the performance is accurate and stable. Here we show only a two-source problem for the sake of visual clarity; however, the rule can easily recover several mixed sources that have different temporal characteristics.

Figure 2: A. Two periodic source signals, a sine wave (thick solid line) and a ramp (thin solid line), are mixed into the presynaptic signals (dotted lines). B. The autocorrelation functions $K_i(t-t')$ of the two source signals (the sine wave in thick solid line, the ramp in thin solid line). The sources are normalized so that $\Lambda(0) = 1$ for both. C. The learning rule with $\tau_1 = 3$ and $\tau_2 = 0$ extracts the sinusoidal output signal (dashed), compared to the two input signals. In agreement with the stability calculation, for $\gamma > 0$ the output recovers the sine source because $\Lambda_{\sin}(3) > \Lambda_{\mathrm{ramp}}(3)$. D. The learning rule with $\tau_1 = 10$, $\tau_2 = 0$ converges to the other signal (dashed line), i.e. the ramp, because $\Lambda_{\mathrm{ramp}}(10) > \Lambda_{\sin}(10)$. Note that the signals have been rescaled, since the learning rule recovers them only up to a multiplicative factor.

Fig. 3 shows an ICA problem with sources $s(t)$ generated by an Ornstein-Uhlenbeck process of the form $\tau_{s_i}\,\dot{s}_i = -s_i + \xi$, where $\xi$ is Gaussian noise. The different sources are characterized by different time constants. The learning rule is able to decouple these colored-noise signals with Gaussian amplitude distribution because they have different temporal correlations.

Finally, Fig. 4 shows an application with nine different sounds. We used 60 postsynaptic neurons with time delays $\tau_1$ chosen uniformly in the interval [1, 30 ms] and $\tau_2 = 0$. Globally, 52 of the 60 neurons recovered exactly one source (A, B) and the remaining 8 recovered mixtures of two sources (E). Which source a given postsynaptic neuron recovers depends on the sources' autocorrelations at delays $\tau_1$ and $\tau_2$ (i.e. the source with the biggest autocorrelation at delay $\tau_1$, since $\tau_2 = 0$ for all neurons; see section 4 on stability). A histogram (C) shows how many postsynaptic neurons recover each source. However, as will become clear from the stability analysis below, the few postsynaptic neurons tuned to time delays at which the autocorrelation functions intersect (D, at $\tau_1 = 3$ ms and $\tau_2 = 0$) cannot recover one of the sources precisely (E).

Figure 3: A. The three source signals (solid lines, generated with the equation $\tau_{s_i}\,\dot{s}_i = -s_i + \xi$ with different time constants, where $\xi$ is Gaussian noise) are plotted together with the output signal (dashed). The learning rule converges to one of the sources. B. Same as before, but only the one signal (solid) that was recovered is shown together with the neuronal output (dashed).

Figure 4: Nine different sound sources from [11] were mixed with a random matrix. 60 postsynaptic neurons tuned to different $\tau_1$ and $\tau_2$ were used in order to recover the sources, i.e. $\tau_1$ varies from 1 ms to 30 ms in steps of 0.5 ms and $\tau_2 = 0$ for all neurons. A. One source signal (below) is recovered by one of the postsynaptic neurons (above; for clarity, the output is shifted upward). B. Zoom on one source (solid line) and one output (dashed line). C. Histogram of the number of postsynaptic neurons recovering each source. D. Autocorrelation of the different sources; several sources have the biggest autocorrelation at a delay of 3 ms. E. The postsynaptic neuron tuned to $\tau_1 = 3$ ms and $\tau_2 = 0$ (above) is not able to recover one of the sources precisely, even though it still performs well except for the low-amplitude parts of the signal (below).

4 Stability of the learning rule

In principle, our online learning rule (10) could lead to several solutions corresponding to different fixed points of the dynamics. Fixed points will be denoted by $\mathbf{w}^* = \mathbf{e}_k$; by construction, these are the row vectors of the decoupling matrix $W^*$ (see (5) and (7)). The rule (10) has two parameters, the delays $\tau_1$ and $\tau_2$ (the learning rate is considered fixed). We assume that in our architecture these delays characterize different properties of the postsynaptic neuron. Neurons with different choices of $\tau_1$ and $\tau_2$ will potentially recover different signals from the same mixture. The stability analysis will show which fixed point is stable, depending on the autocorrelation functions of the signals and the delays $\tau_1$ and $\tau_2$.

We analyze the stability assuming a small perturbation of the weights, i.e. $\mathbf{w} = \mathbf{e}_i + \epsilon\,\mathbf{e}_j$, where the row vectors $\{\mathbf{e}_k\}$ of $C^{-1}$ form a basis and are the fixed points. We obtain the expression (see Appendix for calculation details)

    $\Delta \epsilon = \gamma\,\epsilon\,\dfrac{\Lambda_{jj}(\tau_1)\Lambda_{ii}(\tau_2) - \Lambda_{ii}(\tau_1)\Lambda_{jj}(\tau_2)}{\Lambda_{ii}(\tau_2)}$,    (11)

where $\Lambda(\tau)_{ij} = \langle s_i(t)\,s_j(t+\tau) \rangle$ is the diagonal source correlation matrix. To illustrate the stability equation (11), let us take $\tau_2 = 0$ and assume that $\Lambda_{ii}(0) = \Lambda_{jj}(0)$, i.e. all signals have the same zero-time-lag autocorrelation. In this case (11) reduces to $\Delta \epsilon = \gamma\,\epsilon\,[\Lambda_{jj}(\tau_1) - \Lambda_{ii}(\tau_1)]$. That is, for $\gamma > 0$ the solution $\mathbf{e}_i$ is stable if $\Lambda_{jj}(\tau_1) < \Lambda_{ii}(\tau_1)$ for all other directions $\mathbf{e}_j$, i.e. if source $i$ has the biggest autocorrelation at delay $\tau_1$. For $\gamma < 0$, the solution $\mathbf{e}_i$ is stable if $\Lambda_{jj}(\tau_1) > \Lambda_{ii}(\tau_1)$. This stability relation is verified in the simulations. Fig. 2 shows two signals with different autocorrelation functions; in this example, we chose $\tau_2 = 0$ and $\Lambda(0) = I$, i.e. the signals are normalized. The learning rule recovers the signal with the biggest autocorrelation at delay $\tau_1$, $\Lambda_{kk}(\tau_1)$, for a positive learning rate.
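Equation (11) predicts which source a neuron with delays ($\tau_1$, $\tau_2$) will recover: the fixed point $\mathbf{e}_i$ is stable when the perturbation growth factor is negative in every other direction. The sketch below computes this prediction directly from the source autocorrelations (assuming NumPy; names are ours, not from the paper).

```python
# Predicting which source a neuron with delays (tau1, tau2) recovers, using equation (11).
# A sketch assuming NumPy; s holds the source signals, shape (n_sources, n_samples).
import numpy as np

def autocorr(s_i, tau):
    """Lambda_ii(tau) = < s_i(t) s_i(t + tau) >."""
    T = s_i.size - tau
    return np.dot(s_i[:T], s_i[tau:tau + T]) / T

def stable_source(s, tau1, tau2=0, gamma=1e-4):
    """Index i of the fixed point e_i for which perturbations in every direction j decay."""
    n = s.shape[0]
    L1 = np.array([autocorr(s[i], tau1) for i in range(n)])
    L2 = np.array([autocorr(s[i], tau2) for i in range(n)])
    for i in range(n):
        # Growth factor per unit epsilon: gamma (L1[j] L2[i] - L1[i] L2[j]) / L2[i] for each j
        growth = gamma * (L1 * L2[i] - L1[i] * L2) / L2[i]
        if all(growth[j] < 0 for j in range(n) if j != i):
            return i
    return None  # no strictly stable fixed point (e.g. autocorrelations intersect at tau1)

# With tau2 = 0 and normalized sources, this reduces to picking the source with the
# largest autocorrelation at lag tau1 (for gamma > 0), as in Fig. 2.
```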
5 Comparison between spatial ICA and temporal ICA

One of the most widely used algorithms for ICA is FastICA [1]. It is based on an approximation of negentropy and is purely spatial, i.e. it takes into account only the amplitude distribution of the signal, but not its temporal structure. We therefore show an example (Fig. 5) where signals generated by Ornstein-Uhlenbeck processes have the same amplitude distribution but different time constants of the autocorrelation. With a spatial algorithm, data points corresponding to different time slices can be shuffled without any change in the results; therefore such an algorithm cannot solve this example. We tested our example with FastICA downloaded from [11], and it failed to recover the original sources (Fig. 5). To our surprise, however, FastICA did solve this problem in a very few trials, even though the convergence was not stable. Indeed, since FastICA is an iterative algorithm, it processes the signals in the temporal order in which they arrive; temporal correlations can therefore in some cases be taken into account, even though this is not part of the theory of FastICA.

Figure 5: Two signals generated by an Ornstein-Uhlenbeck process are mixed. A. The signals have the same amplitude distributions. B. The time constants of the autocorrelations are different. C. Our learning rule converges to an output (dashed line) recovering one of the source signals (solid line). D. FastICA (dashed line) does not succeed in recovering the sources (solid line).

6 Discussion and conclusions

We presented a powerful online learning rule that performs ICA by computing joint variations in the firing rates of pre- and postsynaptic neurons at different time delays. This is very similar to a standard Hebbian rule, with the exception of an additional factor, which is an online estimate of the output correlations at different time delays. The different delay times $\tau_1$, $\tau_2$ are necessary to recover different sources. Therefore, properties varying from one postsynaptic neuron to the next could lead to different time delays used in the learning rule. We could assume that the time delays are intrinsic properties of each postsynaptic neuron, due for example to the distance on the dendrite at which the synapse is formed [12], i.e. to different signal propagation times. The stability calculation shows that a postsynaptic neuron will recover the signal with the biggest autocorrelation at the considered delay, or the smallest, depending on the sign of the learning rate. We assume that for biological signals the autocorrelation functions cross, so that it is possible to recover all the signals with different postsynaptic neurons.

The algorithm assumes centered signals. However, for a complete mapping of those signals to neural rates, we have to consider positive signals. Nevertheless, we can easily compute an online estimate of the mean firing rate and remove this mean from the original rates. In this way the algorithm still holds when taking neural rates as input.
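As a concrete reading of the previous paragraph, a running average can serve as the online estimate of the mean rate; the sketch below (assuming NumPy, with an illustrative averaging rate) subtracts it from positive firing rates before they enter the learning rule.

```python
# A sketch of online mean removal for positive firing rates, assuming NumPy;
# eta_mean is an illustrative averaging rate, not a value from the paper.
import numpy as np

def center_rates_online(rates, eta_mean=1e-3):
    """Subtract a running estimate of the mean firing rate from each input channel."""
    mean_est = np.zeros(rates.shape[0])
    centered = np.empty_like(rates, dtype=float)
    for t in range(rates.shape[1]):
        mean_est += eta_mean * (rates[:, t] - mean_est)   # leaky online average of the rate
        centered[:, t] = rates[:, t] - mean_est           # centered signal fed to the rule as x(t)
    return centered
```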
Hyvärinen proposed an ICA algorithm based on complexity pursuit [8]. It uses the non-Gaussianity of the residuals once the part of the signals that is predictable from the temporal correlations has been removed. The update step of this algorithm has some similarities with our learning rule, even though the approach is completely different, since we want to exploit the temporal correlations directly rather than formally removing them with a "predictor". We also do not assume pre-whitened data and do not rely on non-Gaussianity.

Our learning rule considers smooth signals that are assumed to be rates. However, it is commonly accepted that synaptic plasticity depends on the spike trains of pre- and postsynaptic neurons and, in particular, on the precise timing of the spikes, i.e. Spike-Timing-Dependent Plasticity (STDP) [13, 14, 15]. A spike-based description of our algorithm is therefore currently under study.

Appendix: Stability calculation

By construction, the row vectors $\{\mathbf{e}_k,\ k = 1, \dots, n\}$ of $W^* = C^{-1}$, the inverse of the mixing matrix, are solutions of the batch learning rule (9) ($n$ is the number of sources). Take one of these row vectors, $\mathbf{e}_i^T$ (i.e. a fixed point of the dynamics), and consider $\mathbf{w} = \mathbf{e}_i + \epsilon\,\mathbf{e}_j$, a small perturbation in the direction $\mathbf{e}_j^T$. Note that $\{\mathbf{e}_k\}$ is a basis because $\det(C) \neq 0$ (the matrix must be invertible). The rule (9) becomes

    $\Delta \mathbf{w} = \gamma \left[ \langle \mathbf{x}(t+\tau_1)\,(\mathbf{e}_i + \epsilon\mathbf{e}_j)^T \mathbf{x}(t) \rangle - \dfrac{\langle (\mathbf{e}_i + \epsilon\mathbf{e}_j)^T \mathbf{x}(t)\,(\mathbf{e}_i + \epsilon\mathbf{e}_j)^T \mathbf{x}(t+\tau_1) \rangle}{\langle (\mathbf{e}_i + \epsilon\mathbf{e}_j)^T \mathbf{x}(t)\,(\mathbf{e}_i + \epsilon\mathbf{e}_j)^T \mathbf{x}(t+\tau_2) \rangle}\, \langle \mathbf{x}(t+\tau_2)\,(\mathbf{e}_i + \epsilon\mathbf{e}_j)^T \mathbf{x}(t) \rangle \right]$.    (12)

We expand the terms on the right-hand side to first order in $\epsilon$. Multiplying the resulting expression by $\mathbf{e}_j^T$ (here we can assume $\mathbf{e}_j^T \mathbf{e}_j = 1$, since the sources are recovered only up to a multiplicative constant), we find

    $\Delta \epsilon = \gamma\,\epsilon\,\dfrac{[\mathbf{e}_j^T C \Lambda(\tau_1) C^T \mathbf{e}_j][\mathbf{e}_i^T C \Lambda(\tau_2) C^T \mathbf{e}_i] - [\mathbf{e}_i^T C \Lambda(\tau_1) C^T \mathbf{e}_i][\mathbf{e}_j^T C \Lambda(\tau_2) C^T \mathbf{e}_j]}{\mathbf{e}_i^T C \Lambda(\tau_2) C^T \mathbf{e}_i}$,    (13)

where $\Lambda(\tau)_{ij} = \langle s_i(t)\,s_j(t+\tau) \rangle$ is the diagonal source correlation matrix. This expression can be simplified because $\mathbf{e}_i^T$ is a row of $W^* = C^{-1}$, so that $\mathbf{e}_i^T C$ is a unit vector of the form $(0, 0, \dots, 1, 0, \dots)$, where the position of the "1" indicates the solution number $i$. Therefore, we have $\mathbf{e}_i^T C \Lambda(\tau) C^T \mathbf{e}_k = \Lambda(\tau)_{ik}$, and the stability expression becomes

    $\Delta \epsilon = \gamma\,\epsilon\,\dfrac{\Lambda_{jj}(\tau_1)\Lambda_{ii}(\tau_2) - \Lambda_{ii}(\tau_1)\Lambda_{jj}(\tau_2)}{\Lambda_{ii}(\tau_2)}$.    (14)

References

[1] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley-Interscience, 2001.
[2] L. Tong, R. Liu, V. C. Soon, and Y. F. Huang. Indeterminacy and identifiability of blind identification. IEEE Trans. on Circuits and Systems, 1991.
[3] A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, and E. Moulines. A blind source separation technique based on second order statistics. IEEE Trans. on Sig. Proc., 1997.
[4] L. Molgedey and H. G. Schuster. Separation of a mixture of independent signals using time delayed correlations. Phys. Rev. Lett., 72:3634-3637, 1994.
[5] A. Ziehe and K.-R. Müller. TDSEP - an efficient algorithm for blind separation using time structure.
[6] K.-R. Müller, P. Philips, and A. Ziehe. JADE-TD: Combining higher-order statistics and temporal information for blind source separation (with noise). Proc. Int. Workshop on ICA, 1999.
[7] A. Hyvärinen. Independent component analysis for time-dependent stochastic processes. Proc. Int. Conf. on Art. Neur. Net., 1998.
[8] A. Hyvärinen. Complexity pursuit: Separating interesting components from time-series. Neural Computation, 13:883-898, 2001.
[9] E. Oja. A simplified neuron model as a principal component analyzer. J. Math. Biol., 15:267-273, 1982.
[10] J. J. Hopfield. Olfactory computation and object perception. PNAS, 88:6462-6466, 1991.
[11] H. Gävert, J. Hurri, J. Särelä, and A. Hyvärinen. FastICA and cocktail party demo. http://www.cis.hut.fi/projects/ica/.
[12] R. C. Froemke, M. Poo, and Y. Dan. Spike-timing-dependent synaptic plasticity depends on dendritic location. Nature, 434:221-225, 2005.
[13] G. Bi and M. Poo. Synaptic modification by correlated activity: Hebb's postulate revisited. Annual Review of Neuroscience, 2001.
[14] H. Markram, J. Lübke, M. Frotscher, and B. Sakmann. Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275:213-215, 1997.
[15] W. Gerstner, R. Kempter, J. L. van Hemmen, and H. Wagner. A neuronal learning rule for sub-millisecond temporal coding. Nature, 383:76-78, 1996.