Theoretical Analysis of Learning with Reward-Modulated Spike-Timing-Dependent Plasticity

Robert Legenstein, Dejan Pecevski, Wolfgang Maass
Institute for Theoretical Computer Science
Graz University of Technology
A-8010 Graz, Austria
{legi,dejan,maass}@igi.tugraz.at

Abstract

Reward-modulated spike-timing-dependent plasticity (STDP) has recently emerged as a candidate for a learning rule that could explain how local learning rules at single synapses support behaviorally relevant adaptive changes in complex networks of spiking neurons. However, the potential and limitations of this learning rule could so far only be tested through computer simulations. This article provides tools for an analytic treatment of reward-modulated STDP, which allow us to predict under which conditions reward-modulated STDP will be able to achieve a desired learning effect. In particular, we can produce in this way a theoretical explanation and a computer model for a fundamental experimental finding on biofeedback in monkeys (reported in [1]).

1 Introduction

A major puzzle for understanding learning in biological organisms is the relationship between experimentally well-established learning rules for synapses (such as STDP) on the microscopic level and adaptive changes of the behavior of biological organisms on the macroscopic level. Neuromodulatory systems, which send diffuse signals related to reinforcements (rewards) and behavioral state to several large networks of neurons in the brain, have been identified as likely intermediaries that relate these two levels of learning. It is well known that the consolidation of changes of synaptic weights in response to pre- and postsynaptic neuronal activity requires the presence of such third signals [2]. Corresponding spike-based learning rules of the form

$$\frac{d}{dt} w_{ji}(t) = c_{ji}(t)\, d(t) \qquad (1)$$

have been proposed in [3], where $w_{ji}$ is the weight of a synapse from neuron $i$ to neuron $j$, $c_{ji}(t)$ is an eligibility trace of this synapse which collects proposed weight changes resulting from a learning rule such as STDP, and $d(t) = h(t) - \bar{h}$ is a neuromodulatory signal with mean $\bar{h}$ (where $h(t)$ might for example represent reward prediction errors, encoded through the concentration of dopamine in the extra-cellular fluid). We will consider in this article only cases where the reward prediction error is equal to the current reward, and we will refer to $d(t)$ simply as the reward signal. Obviously such a learning scheme (1) faces a large credit-assignment problem, since not only those synapses for which weight changes would increase the chances of future reward receive the top-down signal $d(t)$, but billions of other synapses too. Nevertheless the brain is able to solve this credit-assignment problem, as has been shown in one of the earliest (but still among the most amazing) demonstrations of biofeedback in monkeys [1]. The spiking activity of single neurons (in area 4 of the precentral gyrus) was recorded, the current firing rate of this neuron was made visible to the monkey in the form of an illuminated meter, and the monkey received food rewards for increases (or, in alternating trials, for decreases) of the firing rate of this neuron from its average level. The monkeys learnt quite reliably (on the time scale of tens of minutes) to change the firing rate of this neuron in the currently rewarded direction.¹

¹ Adjacent neurons tended to change their firing rate in the same direction, but differential changes of the firing rates of pairs of neurons are also reported in [1] (when these differential changes were rewarded).
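To make the update rule (1) concrete, the following minimal Python sketch applies it in discrete time. The eligibility-trace dynamics and the reward signal are placeholders here (the specific forms used in this paper are defined in Section 2 and Sections 4-5), and all parameter values are illustrative assumptions rather than the ones used in our simulations.

```python
import numpy as np

def reward_modulated_step(w, c, d, dt, w_max):
    """One Euler step of dw_ji/dt = c_ji(t) * d(t), with weights clipped to [0, w_max]."""
    w = w + dt * c * d
    return np.clip(w, 0.0, w_max)

# Illustrative usage with placeholder signals (not the kernels of Section 2).
rng = np.random.default_rng(0)
dt = 1e-3                          # 1 ms time step
w = rng.uniform(0.0, 0.01, 100)    # synaptic weights
c = np.zeros(100)                  # eligibility traces, one per synapse
for t in range(10000):
    c = c * np.exp(-dt / 0.5)                 # assumed: traces decay with a 0.5 s time constant
    c += 1e-6 * rng.standard_normal(100)      # placeholder for STDP-driven increments
    d = 1e-3 * rng.standard_normal()          # placeholder zero-mean reward signal
    w = reward_modulated_step(w, c, d, dt, w_max=0.02)
```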
Obviously the existence of learning mechanisms in the brain which are able to solve this difficult credit-assignment problem is fundamental for understanding and modeling many other learning capabilities of the brain. We present in Sections 3 and 4 of this abstract a learning theory for (1), where the eligibility trace $c_{ji}(t)$ results from standard forms of STDP, which is able to explain the success of the experiment in [1]. This theoretical model is confirmed by computer simulations (see Section 4.1). In Section 5 we leave this concrete learning experiment and investigate under what conditions neurons can learn, through trial and error (via reward-modulated STDP), associations of specific firing patterns to specific patterns of input spikes. The resulting theory leads to predictions of specific parameter ranges for STDP that support this general form of learning. These predictions were tested through computer experiments, see Section 5.1. Other interesting results of computer simulations of reward-modulated STDP in the context of neural circuits were recently reported in [3] and [4] (we also refer to these articles for reviews of preceding work by Seung and others).

2 Models for neurons and synaptic plasticity

The spike train of a neuron $i$ which fires action potentials at times $t_i^{(1)}, t_i^{(2)}, t_i^{(3)}, \ldots$ is formalized by a sum of Dirac delta functions $S_i(t) = \sum_n \delta(t - t_i^{(n)})$. We assume that positive and negative weight changes suggested by STDP for all pairs of pre- and postsynaptic spikes (according to the two integrals in (2)) are collected in an eligibility trace $c_{ji}(t)$, where the impact of a spike pairing with the second spike at time $t-s$ on the eligibility trace at time $t$ is given by some function $f_c(s)$ for $s \geq 0$:

$$c_{ji}(t) = \int_0^\infty ds\, f_c(s) \left[ \int_0^\infty dr\, W(r)\, S_j^{post}(t-s)\, S_i^{pre}(t-s-r) + \int_0^\infty dr\, W(-r)\, S_j^{post}(t-s-r)\, S_i^{pre}(t-s) \right]. \qquad (2)$$

In our simulations, $f_c(s)$ is a function of the form $f_c(s) = \frac{s}{\tau_e} e^{-s/\tau_e}$ if $s \geq 0$ and $0$ otherwise, with time constant $\tau_e = 0.5$ s. $W(r)$ denotes the standard exponential STDP learning window

$$W(r) = \begin{cases} A_+\, e^{-r/\tau_+}, & \text{if } r \geq 0 \\ -A_-\, e^{r/\tau_-}, & \text{if } r < 0 \end{cases} \qquad (3)$$

where the positive constants $A_+$ and $A_-$ scale the strength of potentiation and depression, $\tau_+$ and $\tau_-$ are positive time constants defining the width of the positive and negative learning window, and $S_i^{pre}$, $S_j^{post}$ are the spike trains of the presynaptic and postsynaptic neuron, respectively. The actual weight change is the product of the eligibility trace with the reward signal, as defined by equation (1). We assume that weights are clipped at the lower boundary value $0$ and an upper boundary $w_{max}$.

We use a linear Poisson neuron model whose output spike train $S_j^{post}(t)$ is a realization of a Poisson process with the underlying instantaneous firing rate $R_j(t)$. The effect of a spike of presynaptic neuron $i$ at time $t'$ on the membrane potential of neuron $j$ is modeled by an increase in the instantaneous firing rate by an amount $w_{ji}(t')\, \epsilon(t - t')$, where $\epsilon$ is a response kernel which models the time course of a postsynaptic potential (PSP) elicited by an input spike. Since STDP has (according to [3]) been experimentally confirmed only for excitatory synapses, we will consider plasticity only for excitatory connections and assume that $w_{ji} \geq 0$ for all $i$ and $\epsilon(s) \geq 0$ for all $s$. Because the synaptic response is scaled by the synaptic weights, we can assume without loss of generality that the response kernel is normalized to $\int_0^\infty ds\, \epsilon(s) = 1$.
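The following short Python sketch evaluates the learning window (3) and the eligibility kernel $f_c$. The parameter values are illustrative (roughly in the range used later in Table 1), not the exact ones of our simulations.

```python
import numpy as np

def stdp_window(r, A_plus=1e-5, A_minus=1.05e-5, tau_plus=0.020, tau_minus=0.020):
    """Exponential STDP window W(r) of eq. (3); r = t_post - t_pre in seconds."""
    return np.where(r >= 0, A_plus * np.exp(-r / tau_plus),
                    -A_minus * np.exp(r / tau_minus))

def eligibility_kernel(s, tau_e=0.5):
    """Alpha-shaped kernel f_c(s) = (s / tau_e) * exp(-s / tau_e) for s >= 0."""
    return np.where(s >= 0, (s / tau_e) * np.exp(-s / tau_e), 0.0)

# Example: proposed weight change for a pre spike at 10 ms and a post spike at 25 ms,
# as it appears in the eligibility trace 0.5 s after the (later) post spike.
delta = stdp_window(0.025 - 0.010) * eligibility_kernel(0.5)
```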
In this linear model the contributions of all inputs are summed up linearly:

$$R_j(t) = \sum_{i=1}^{n} \int_0^\infty ds\, w_{ji}(t-s)\, \epsilon(s)\, S_i(t-s), \qquad (4)$$

where $S_1, \ldots, S_n$ are the $n$ presynaptic spike trains.

3 Theoretical analysis of the resulting weight changes

We are interested in the expected weight change over some time interval $T$ (see [5]), where the expectation is over realizations of the stochastic input and output spike trains as well as over a stochastic realization of the reward signal, denoted by the ensemble average $\langle \cdot \rangle_E$:

$$\frac{1}{T}\left\langle w_{ji}(t+T) - w_{ji}(t) \right\rangle_E = \frac{1}{T}\int_t^{t+T} \left\langle \frac{d}{dt'}\, w_{ji}(t') \right\rangle_E dt' = \left\langle \left\langle \frac{d}{dt}\, w_{ji}(t) \right\rangle_E \right\rangle_T, \qquad (5)$$

where we used the abbreviation $\langle f(t) \rangle_T = T^{-1} \int_t^{t+T} f(t')\, dt'$. Using equation (1), this yields

$$\frac{\left\langle w_{ji}(t+T) - w_{ji}(t) \right\rangle_E}{T} = \int_0^\infty dr\, W(r) \int_0^\infty ds\, f_c(s)\, \left\langle D_{ji}(t,s,r)\, \nu_{ji}(t-s,r) \right\rangle_T + \int_{-\infty}^0 dr\, W(r) \int_{|r|}^\infty ds\, f_c(s+r)\, \left\langle D_{ji}(t,s,r)\, \nu_{ji}(t-s,r) \right\rangle_T, \qquad (6)$$

where $D_{ji}(t,s,r) = \langle d(t) \mid \text{neuron } j \text{ spikes at } t-s \text{, and neuron } i \text{ spikes at } t-s-r \rangle_E$ is the average reward at time $t$ given a presynaptic spike at time $t-s-r$ and a postsynaptic spike at time $t-s$, and $\nu_{ji}(t,r) = \langle S_j(t)\, S_i(t-r) \rangle_E$ describes correlations between pre- and postsynaptic spike timings (see [6] for the derivation). We see that the expected weight change depends on how the correlations between the pre- and postsynaptic neurons correlate with the reward signal. If these correlations vary slowly with time, we can exploit the self-averaging property of the weight vector. Analogously to [5], we can drop the ensemble average on the left-hand side and obtain

$$\frac{d\, \langle w_{ji}(t) \rangle_T}{dt} = \int_0^\infty dr\, W(r) \int_0^\infty ds\, f_c(s)\, \left\langle D_{ji}(t,s,r)\, \nu_{ji}(t-s,r) \right\rangle_T + \int_{-\infty}^0 dr\, W(r) \int_{|r|}^\infty ds\, f_c(s+r)\, \left\langle D_{ji}(t,s,r)\, \nu_{ji}(t-s,r) \right\rangle_T. \qquad (7)$$

In the following we will always use the smoothed, time-averaged vector $\langle w_{ji}(t) \rangle_T$, but for brevity we will drop the angular brackets. If one assumes for simplicity that the impact of a pre-post spike pair on the eligibility trace is always triggered by the postsynaptic spike, one gets (see [6] for details):

$$\frac{d\, w_{ji}(t)}{dt} = \int_0^\infty ds\, f_c(s) \int_{-\infty}^\infty dr\, W(r)\, \left\langle D_{ji}(t,s,r)\, \nu_{ji}(t-s,r) \right\rangle_T. \qquad (8)$$

This assumption (which is common in STDP analysis) introduces a small error for post-before-pre spike pairs, since if a reward signal arrives at some time $d_r$ after the pairing, the weight update will be proportional to $f_c(d_r)$ instead of $f_c(d_r + r)$. For the analyses presented in this article, the simplified equation (8) is a good approximation of the learning dynamics (see [6]). Equation (8) shows that if the reward signal does not depend on the pre- and postsynaptic spike statistics, the weight will change according to standard STDP, scaled by a constant proportional to the mean reward.

4 Application to biofeedback experiments

We now apply our theoretical approach to the biofeedback experiments by Fetz and Baker [1] that we have sketched in the introduction. The authors showed that it is possible to increase and decrease the firing rate of a randomly chosen neuron by rewarding the monkey for its high (respectively low) firing rates. We assume in our model that a reward is delivered to all neurons in the simulated recurrent network with some delay $d_r$ every time a specific neuron $k$ in the network produces an action potential:

$$d(t) = \int_0^\infty dr\, \epsilon_r(r)\, S_k^{post}(t - d_r - r), \qquad (9)$$

where $\epsilon_r(r)$ is the shape of the reward pulse corresponding to one postsynaptic spike of the reinforced neuron. We assume that the reward kernel $\epsilon_r$ has zero mass, i.e., $\bar{\epsilon}_r = \int_0^\infty dr\, \epsilon_r(r) = 0$.
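A minimal sketch of how the global reward signal (9) can be computed in a discrete-time simulation, assuming a generic zero-mass reward kernel (the concrete kernel we use is given in equation (15) below). The kernel shape and all parameter values here are illustrative assumptions.

```python
import numpy as np

dt = 1e-3            # simulation time step (s)
d_r = 0.5            # reward delay (s), as in our simulations
t_kernel = np.arange(0, 4.0, dt)

# Assumed zero-mass kernel: positive bump followed by a long-tailed negative bump.
eps_r = (t_kernel / 0.2) * np.exp(-t_kernel / 0.2) - 0.1 * (t_kernel / 1.0) * np.exp(-t_kernel / 1.0)
eps_r -= eps_r.mean()        # enforce zero mass numerically on the truncated support

def reward_signal(spikes_k, eps_r, d_r, dt):
    """d(t) of eq. (9): the reinforced neuron's spike train filtered by eps_r, delayed by d_r."""
    d = np.convolve(spikes_k, eps_r)[:len(spikes_k)] * dt
    shift = int(round(d_r / dt))
    return np.concatenate([np.zeros(shift), d[:len(d) - shift]])

# Example: a 4 Hz Poisson spike train for the reinforced neuron k over 60 s,
# represented as a train of 1/dt pulses approximating Dirac deltas.
rng = np.random.default_rng(1)
spikes_k = (rng.random(int(60 / dt)) < 4.0 * dt).astype(float) / dt
d = reward_signal(spikes_k, eps_r, d_r, dt)
```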
In our simulations, this reward kernel has a positive bump in the first few hundred milliseconds and a long-tailed negative bump afterwards. With the linear Poisson neuron model (see Section 2), the correlation of the reward with pre-post spike pairs of the reinforced neuron is (see [6])

$$D_{ki}(t,s,r) = w_{ki} \int_0^\infty dr'\, \epsilon_r(r')\, \epsilon(s + r - d_r - r') + \epsilon_r(s - d_r) \approx \epsilon_r(s - d_r). \qquad (10)$$

The last approximation holds if the impact of a single input spike on the membrane potential is small. The correlation of the reward with pre-post spike pairs of non-reinforced neurons is

$$D_{ji}(t,s,r) = \int_0^\infty dr'\, \epsilon_r(r')\, \frac{\nu_{kj}(t - d_r - r',\, s - d_r - r') + w_{ki}\, w_{ji}\, \epsilon(s + r - d_r - r')\, \epsilon(r)}{\nu_j(t-s) + w_{ji}\, \epsilon(r)}. \qquad (11)$$

If the contribution of a single postsynaptic potential to the membrane potential is small, we can neglect the impact of the presynaptic spike and write

$$D_{ji}(t,s,r) \approx \int_0^\infty dr'\, \epsilon_r(r')\, \frac{\nu_{kj}(t - d_r - r',\, s - d_r - r')}{\nu_j(t-s)}. \qquad (12)$$

Hence, the reward-spike correlation of a non-reinforced neuron depends on the correlation of this neuron with the reinforced neuron. The mean weight change for weights to the reinforced neuron is given by

$$\frac{d\, w_{ki}(t)}{dt} = \int_0^\infty ds\, f_c(s + d_r)\, \epsilon_r(s) \int_{-\infty}^\infty dr\, W(r)\, \left\langle \nu_{ki}(t - d_r - s,\, r) \right\rangle_T. \qquad (13)$$

This equation essentially describes STDP with a learning rate proportional to the eligibility function around the reward delay. The mean weight change of neurons $j \neq k$ is given by

$$\frac{d\, w_{ji}(t)}{dt} = \int_0^\infty ds\, f_c(s) \int_{-\infty}^\infty dr\, W(r) \int_0^\infty dr'\, \epsilon_r(r')\, \left\langle \frac{\nu_{kj}(t - d_r - r',\, s - d_r - r')}{\nu_j(t-s)}\, \nu_{ji}(t-s,\, r) \right\rangle_T. \qquad (14)$$

If the outputs of neurons $j$ and $k$ are uncorrelated, this evaluates to approximately zero (see [6]). The result can be summarized as follows. The reinforced neuron is trained by STDP. Other neurons are trained by STDP with a learning rate proportional to their correlation with the reinforced neuron. If a neuron is uncorrelated with the reinforced neuron, its learning rate is approximately zero.

4.1 Computer simulations

In order to test the theoretical predictions for the experiment described in the previous section, we have performed a computer simulation with a generic neural microcircuit receiving a global reward signal. This global reward signal increases its value every time a specific neuron (the reinforced neuron) in the circuit fires. The circuit consists of 1000 leaky integrate-and-fire (LIF) neurons (80% excitatory and 20% inhibitory), which are interconnected by conductance-based synapses. The short-term dynamics of synapses was modeled in accordance with experimental data (see [6]). Neurons within the recurrent circuit were randomly connected with probabilities $p_{ee} = 0.08$, $p_{ei} = 0.08$, $p_{ie} = 0.096$ and $p_{ii} = 0.064$, where the indices $ee$, $ei$, $ie$, $ii$ designate the types of the presynaptic and postsynaptic neurons (excitatory or inhibitory). To reproduce the synaptic background activity of neocortical neurons in vivo, an Ornstein-Uhlenbeck (OU) conductance noise process modeled according to [7] was injected into the neurons, which also elicited spontaneous firing of the neurons in the circuit at an average rate of 4 Hz.
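A short sketch of how such a random connectivity matrix can be drawn from the connection probabilities stated above. The neuron counts and the excitatory/inhibitory split follow the text; the data layout and seeding are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_exc, n_inh = 800, 200              # 80% excitatory, 20% inhibitory
n = n_exc + n_inh
is_exc = np.zeros(n, dtype=bool)
is_exc[:n_exc] = True
pre_type = np.where(is_exc, 0, 1)    # 0 = excitatory, 1 = inhibitory

# Connection probabilities from the text: rows index the presynaptic type (e, i),
# columns index the postsynaptic type (e, i).
prob_table = np.array([[0.08, 0.08],
                       [0.096, 0.064]])

probs = prob_table[pre_type][:, pre_type]    # probs[pre, post]
C = rng.random((n, n)) < probs               # boolean connectivity matrix C[pre, post]
np.fill_diagonal(C, False)                   # no self-connections
```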
In half of the neurons, part of the noise was substituted with random synaptic connections from the circuit, in order to observe how the learning mechanism works when most of the input conductance of a neuron comes from a larger number of plastic input synapses instead of a static noise process. The function $f_c(t)$ from equation (2) had the form $f_c(t) = \frac{t}{\tau_e} e^{-t/\tau_e}$ if $t \geq 0$ and $0$ otherwise, with time constant $\tau_e = 0.5$ s. The reward signal during the simulation was computed according to equation (9), with the following shape for $\epsilon_r(t)$:

$$\epsilon_r(t) = A_r^+ \frac{t}{\tau_r^+}\, e^{-t/\tau_r^+} - A_r^- \frac{t}{\tau_r^-}\, e^{-t/\tau_r^-}. \qquad (15)$$

The parameter values for $\epsilon_r(t)$ were chosen so as to produce a positive reward pulse with a peak delayed by 0.5 s from the spike that caused it, and a long-tailed negative bump such that $\int_0^\infty dt\, \epsilon_r(t) = 0$.

Figure 1: Computer simulation of the experiment by Fetz and Baker [1]. A) The firing rate of the reinforced neuron (solid line) increases while the average firing rate of 20 other randomly chosen neurons in the circuit (dashed line) remains unchanged. B) Evolution of the average synaptic weight of excitatory synapses connecting to the reinforced neuron (solid line) and to other neurons (dashed line). C) Spike trains of the reinforced neuron at the beginning and at the end of the simulation.

For the values of other model parameters see [6]. The learning rule (1) was applied to all synapses in the circuit with excitatory presynaptic and postsynaptic neurons. The simulation was performed for 20 min of simulated biological time with a simulation time step of 0.1 ms. Fig. 1 shows that the firing rate and synaptic weights of the reinforced neuron increase within a few minutes of simulated biological time, while those of the other neurons remain largely unchanged. Note that this reinforcement learning task is more difficult than that of the first computer experiment of [3], where postsynaptic firing within 10 ms after presynaptic firing at a randomly chosen synapse was rewarded, since the relationship between synaptic activity (and hence with STDP) and reward is less direct in this setup. Whereas a very low spontaneous firing rate of 1 Hz was required in [3], this simulation shows that reinforcement learning is also feasible at rate levels which correspond to those reported in [1].

5 Rewarding spike-timings

In order to explore the limits of reward-modulated STDP, we have also investigated a substantially more demanding reinforcement learning scenario. The reward signal $d(t)$ was given in dependence on how well the output spike train $S_j^{post}$ of the neuron $j$ matched some rather arbitrary spike train $S^*$ that was produced by a target neuron which received the same $n$ input spike trains as the trained neuron, with arbitrary weights $w^* = (w_1^*, \ldots, w_n^*)^T$, $w_i^* \in \{0, w_{max}\}$, but in addition $n' - n$ further spike trains $S_{n+1}, \ldots, S_{n'}$ with weights $w_i^* = w_{max}$. This setup provides a generic reinforcement learning scenario, where a quite arbitrary (and not perfectly realizable) spike output is reinforced, but at the same time the performance of the learner can be evaluated quite clearly according to how well its weights $w_1, \ldots, w_n$ match those of the target neuron for those $n$ input spike trains which both of them receive.
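Since performance in this task is judged by how closely the learned weight vector approaches $w^*$ on the shared inputs, a simple normalized error such as the following can be used to track learning. The specific metric and the example numbers are our illustrative choices, not prescribed by the analysis.

```python
import numpy as np

def weight_match_error(w, w_star, w_max):
    """Mean absolute deviation between learned and target weights, normalized to [0, 1]."""
    return np.mean(np.abs(w - w_star)) / w_max

# Example: a target pattern with half of the shared weights at 0 and half at w_max.
w_max = 0.02
w_star = np.concatenate([np.zeros(50), np.full(50, w_max)])
w = np.full(100, w_max / 2)                    # e.g. weights initialized at w_max / 2
print(weight_match_error(w, w_star, w_max))    # 0.5 before learning
```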
The reward $d(t)$ at time $t$ is given by

$$d(t) = \int_{-\infty}^\infty dr\, \kappa(r)\, S_j^{post}(t - d_r)\, S^*(t - d_r - r), \qquad (16)$$

where the function $\kappa(r)$ with $\bar\kappa = \int_{-\infty}^\infty ds\, \kappa(s) > 0$ describes how the reward signal depends on the time difference between a postsynaptic spike and a target spike, and $d_r > 0$ is the delay of the reward. Our theoretical analysis below suggests that this reinforcement learning task can in principle be solved by reward-modulated STDP if some constraints are fulfilled. The analysis also reveals which reward kernels are suitable for this learning setup. The reward correlation for synapse $i$ is (see [6])

$$D_{ji}(t,s,r) = \int_{-\infty}^\infty dr'\, \kappa(r') \left[ \nu_j^{post}(t - d_r) + \delta(s - d_r) + w_{ji}\, \epsilon(s + r - d_r) \right] \left[ \nu^*(t - d_r - r') + w_i^*\, \epsilon(s + r - d_r - r') \right], \qquad (17)$$

where $\nu_j^{post}(t) = \langle S_j^{post}(t) \rangle_E$ denotes the mean rate of the trained neuron at time $t$, and $\nu^*(t) = \langle S^*(t) \rangle_E$ denotes the mean rate of the target spike train. Since weights are changing very slowly, we have $w_{ji}(t - s - r) \approx w_{ji}(t)$; in the following, we will drop the dependence of $w_{ji}$ on $t$ for brevity. For simplicity, we assume that input rates are stationary and uncorrelated. In this case (since the weights are changing slowly), also the correlations between inputs and outputs can be assumed stationary, $\nu_{ji}(t,r) = \nu_{ji}(r)$. We further assume that the eligibility function satisfies $f_c(d_r) \approx f_c(d_r + r)$ if $|r|$ is on the time scale of a PSP, the learning window, or the reward kernel, and that $d_r$ is large compared to these time scales. Then, for uncorrelated Poisson input spike trains of rate $\nu_i^{pre}$ and the linear Poisson neuron model, the weight change at synapse $ji$ is given by

$$\begin{aligned} \frac{d\, w_{ji}(t)}{dt} \approx\; & \bar f_c\, \nu_i^{pre}\, \nu_j^{post}\, \bar\kappa\, \bar\nu^* \left( \nu_j^{post} \bar W + w_{ji} \hat W \right) \\ & + \left( \bar\nu^* + w_{ji}\, \bar\nu^* + w_i^*\, \nu_j^{post} \right) f_c(d_r)\, \bar\kappa\, \nu_i^{pre} \left( \nu_j^{post} \bar W + w_{ji} \hat W \right) \\ & + f_c(d_r)\, w_i^*\, \nu_i^{pre} \left[ \nu_j^{post} \int_{-\infty}^\infty dr\, W(r)\, \hat\kappa(r) + w_{ji} \int_{-\infty}^\infty dr\, W(r)\, \epsilon(r)\, \hat\kappa(r) \right] \\ & + f_c(d_r)\, w_i^*\, w_{ji}\, \nu_i^{pre} \left( \nu_j^{post} \bar W + w_{ji} \hat W \right) \int_0^\infty dr\, \epsilon(r)\, \hat\kappa(r), \end{aligned} \qquad (18)$$

where $\bar f_c = \int_0^\infty dr\, f_c(r)$, $\bar W = \int_{-\infty}^\infty dr\, W(r)$ is the integral over the STDP learning window, $\hat\kappa(r) = \int_{-\infty}^\infty dr'\, \kappa(r')\, \epsilon(r - r')$ is the convolution of the reward kernel with the PSP, and $\hat W = \int_{-\infty}^\infty dr\, \epsilon(r)\, W(r)$.

We will now bound the expected weight change for synapses $ji$ with $w_i^* = w_{max}$ and for synapses $jk$ with $w_k^* = 0$. In this way we can derive conditions under which the expected weight change for the former synapses is positive and that for the latter type is negative. First, we assume that the integral over the reward kernel is positive. In this case, the weight change is negative for synapses $i$ with $w_i^* = 0$ if and only if $\nu_i^{pre} > 0$ and $-\nu_j^{post} \bar W > w_{ji} \hat W$. In the worst case, $w_{ji}$ equals $w_{max}$ and $\nu_j^{post}$ is small. We have to guarantee some minimal output rate $\nu_{min}^{post}$ such that even for $w_{ji} = w_{max}$ this inequality is fulfilled; this could be guaranteed by some noise current. For synapses $i$ with $w_i^* = w_{max}$, we obtain two more conditions (see [6] for a derivation). The conditions are summarized in inequalities (19)-(21). If these inequalities are fulfilled and input rates are positive, then the weight vector converges on average from any initial weight vector to $w^*$:

$$-\nu_{min}^{post}\, \bar W > w_{max}\, \hat W \qquad (19)$$

$$\int_{-\infty}^\infty dr\, W(r)\, \epsilon(r)\, \hat\kappa(r) \;\geq\; -\nu_{max}^{post}\, \bar W \int_0^\infty dr\, \epsilon(r)\, \hat\kappa(r) \qquad (20)$$

$$\int_{-\infty}^\infty dr\, W(r)\, \hat\kappa(r) \;>\; -\bar W\, \bar\kappa \left[ \frac{\bar\nu^*\, \nu_{max}^{post}\, \bar f_c}{w_{max}\, f_c(d_r)} + \frac{\bar\nu^*}{w_{max}} + \nu_{max}^{post} \right], \qquad (21)$$

where $\nu_{max}^{post}$ is the maximal output rate. The second condition is less severe and should be easily fulfilled in most setups. If this is the case, the first condition (19) ensures that weights with $w_i^* = 0$ are depressed, while the third condition (21) ensures that weights with $w_i^* = w_{max}$ are potentiated.
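The quantities appearing in (18)-(21) are one-dimensional integrals of the kernels, so for a given choice of PSP, STDP window, and reward kernel the conditions can be checked numerically. Below is a minimal sketch of such a check, assuming simple exponential PSP and STDP kernels and a double-exponential reward kernel; all parameter values (rates, amplitudes, time constants) are illustrative assumptions.

```python
import numpy as np

dt = 1e-4
r = np.arange(-1.0, 1.0, dt)             # time axis in seconds
n = len(r)

# Assumed kernels with illustrative parameters.
tau_m = 0.010
eps = np.where(r >= 0, np.exp(-r / tau_m) / tau_m, 0.0)            # PSP kernel, integral 1
W = np.where(r >= 0, 1e-5 * np.exp(-r / 0.020),
             -1.05e-5 * np.exp(r / 0.020))                          # STDP window of eq. (3)
t_bar = -0.010                                                      # zero-crossing offset of kappa
x = r - t_bar
kappa = np.where(x >= 0, np.exp(-x / 0.020) - np.exp(-x / 0.005),
                 -0.5 * (np.exp(x / 0.020) - np.exp(x / 0.005)))    # positive net mass

# Convolution of the reward kernel with the PSP, aligned to the r axis.
kappa_hat = np.convolve(kappa, eps)[n // 2: 3 * n // 2] * dt

W_bar = np.sum(W) * dt
W_hat = np.sum(W * eps) * dt
kappa_bar = np.sum(kappa) * dt

nu_min, nu_max, nu_star = 3.0, 25.0, 19.0    # example rates in Hz
w_max = 0.02
fc_bar, fc_dr = 0.5, np.exp(-1.0)            # integral of f_c and f_c(d_r) for tau_e = d_r = 0.5 s

cond19 = -nu_min * W_bar > w_max * W_hat
cond20 = np.sum(W * eps * kappa_hat) * dt >= -nu_max * W_bar * np.sum(eps * kappa_hat) * dt
cond21 = (np.sum(W * kappa_hat) * dt
          > -W_bar * kappa_bar * (nu_star * nu_max * fc_bar / (w_max * fc_dr)
                                  + nu_star / w_max + nu_max))
print(cond19, cond20, cond21)
```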
Optimal reward kernels: From condition (21) we can deduce optimal reward kernels $\kappa$. The kernel should be such that the integral $\int_{-\infty}^\infty dr\, W(r)\, \hat\kappa(r)$ is large, while the integral over $\kappa$ is small (but positive). Hence, $\hat\kappa(r)$ should be positive for $r > 0$ and negative for $r < 0$. In the following experiments, we use a simple kernel which satisfies the aforementioned constraints:

$$\kappa(r) = \begin{cases} A_\kappa^+ \left( e^{-(r - \bar t)/\tau_1} - e^{-(r - \bar t)/\tau_2} \right), & \text{if } r - \bar t \geq 0 \\ -A_\kappa^- \left( e^{(r - \bar t)/\tau_1} - e^{(r - \bar t)/\tau_2} \right), & \text{otherwise,} \end{cases}$$

where $A_\kappa^+$ and $A_\kappa^-$ are positive scaling constants, $\tau_1$ and $\tau_2$ define the shape of the two double-exponential functions the kernel is composed of, and $\bar t$ defines the offset of the zero-crossing from the origin. The optimal offset from the origin is negative and on the order of tens of milliseconds for usual PSP shapes. Hence, reward is positive if the neuron spikes around the target spike or somewhat later, and negative if the neuron spikes much too early.

5.1 Computer simulations

In the computer simulations we explored the learning rule in a more biologically realistic setting, where we used a leaky integrate-and-fire (LIF) neuron with input synaptic connections coming from a generic neural microcircuit composed of 1000 LIF neurons. The synapses were conductance-based, exhibiting short-term facilitation and depression. The trained neuron and the arbitrarily chosen neuron which produced the target spike train $S^*$ (the "target neuron") were both connected to the same randomly chosen 100 excitatory and 10 inhibitory neurons from the circuit. The target neuron had 10 additional excitatory input connections (these weights were set to $w_{max}$), not accessible to the trained neuron. Only the synapses of the trained neuron connecting from excitatory neurons were set to be plastic. The target neuron had a weight vector $w^*$ with $w_i^* = 0$ for $0 \leq i < 50$ and $w_i^* = w_{max}$ for $50 \leq i < 110$. The generic neural microcircuit from which the trained and the target neurons receive their input had 80% excitatory and 20% inhibitory neurons, interconnected randomly with a probability of 0.1.

Figure 2: Reinforcement learning of spike times. A) Synaptic weight changes of the trained LIF neuron, for 5 different runs of the experiment. The curves show the average of the synaptic weights that should converge to $w_i^* = 0$ (dashed lines) and the average of the synaptic weights that should converge to $w_i^* = w_{max}$ (solid lines), with different colors for each simulation run. B) Comparison of the output of the trained neuron before (upper trace) and after learning (lower trace; the same input spike trains and the same noise inputs were used before and after training for 2 hours). The second trace from the top shows those spike times which are rewarded, the third trace shows the target spike train without the additional noise inputs.
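A minimal sketch of the trained/target input wiring described in the setup above: both neurons share 110 inputs from the circuit, the target neuron receives 10 extra excitatory inputs, and the target weights over the shared excitatory inputs are half 0 and half $w_{max}$. Variable names, the circuit representation, and the initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
w_max = 0.02                         # illustrative value

exc_ids = np.arange(800)             # excitatory neurons of the circuit
inh_ids = np.arange(800, 1000)       # inhibitory neurons

# Shared inputs: the same 100 excitatory and 10 inhibitory neurons project to both
# the trained neuron and the target neuron.
shared_exc = rng.choice(exc_ids, size=100, replace=False)
shared_inh = rng.choice(inh_ids, size=10, replace=False)

# 10 additional excitatory inputs seen only by the target neuron, with weights fixed at w_max.
extra_exc = rng.choice(np.setdiff1d(exc_ids, shared_exc), size=10, replace=False)

# Target weight vector: w*_i = 0 for the first 50 shared excitatory inputs,
# w*_i = w_max for the remaining 50 shared and the 10 extra excitatory inputs.
w_star = np.concatenate([np.zeros(50), np.full(60, w_max)])

# Plastic weights of the trained neuron (only the 100 synapses from excitatory
# neurons are plastic); uniform initialization is an illustrative choice.
w_trained = rng.uniform(0.0, w_max, size=100)
```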
The neurons received background synaptic noise as modeled in [7], which caused spontaneous activity of the neurons with an average firing rate of 6.9 Hz. During the simulations, we observed a firing rate of 10.6 Hz for the trained neuron and 19 Hz for the target neuron. The reward was delayed by 0.5 s, and we used the same eligibility trace function $f_c(t)$ as in the simulations for the biofeedback experiment (see [6] for details). The simulations were run for two hours of simulated biological time, with a simulation time step of 0.1 ms. We performed 5 repetitions of the experiment, each time with a different randomly generated circuit and different initial weight values for the trained neuron. In each of the 5 runs, the average synaptic weights of synapses with $w_i^* = w_{max}$ and with $w_i^* = 0$ approach their target values, as shown in Fig. 2A. In order to test how closely the learning neuron reproduces the target spike train $S^*$ after learning, we have performed additional simulations where the same spiking input is applied to the learning neuron before and after the learning experiment (results are reported in Fig. 2B).

The equations in Section 5 define a parameter space for which the trained neuron can learn the target synapse pattern $w^*$. We have chosen 6 different parameter settings, encompassing cases with satisfied and non-satisfied constraints, and performed experiments where we compare the predicted average weight change from equation (18) with the actual average weight change produced by simulations. Figure 3 summarizes the results. In all 6 experiments, the predictions of the sufficient conditions (19)-(21) turned out to be correct. In those cases where these conditions were not met, the weights moved in the opposite direction, suggesting that the theoretically sufficient conditions (19)-(21) might also be necessary.

Figure 3: Predicted average weight change (black bars) calculated from equation (18), and the estimated average weight change (gray bars) from simulations, presented for 6 different experiments with different parameter settings (see Table 1).² A) Weight change values for synapses with $w_i^* = w_{max}$. B) Weight change values for synapses with $w_i^* = 0$. Cases where the constraints are not fulfilled are shaded in gray.

² The values in the figure are calculated as $\Delta w = \frac{\bar w(t_{sim}) - \bar w(0)}{w_{max}/2}$ for the simulations, and as $\Delta w = \frac{\langle dw/dt \rangle\, t_{sim}}{w_{max}/2}$ for the predicted values, where $\bar w(t)$ is the average weight over synapses with the same value of $w_i^*$.

Table 1: Parameter values used for the simulations in Figure 3. Both cases where the constraints are satisfied and cases where they are not satisfied were covered. PSPs were modeled as $\epsilon(s) = e^{-s/\tau_m}/\tau_m$.

Ex. | $\bar t$ [ms] | $w_{max}$ | $\nu_{min}^{post}$ [Hz] | $A_+ \cdot 10^6$ | $A_-/A_+$ | $\tau_+, \tau_-$ [ms] | $A_\kappa^+/A_\kappa^-$ | $t_{sim}$ [h]
1 | 10 | 0.012 | 10 | 16.62 | 1.05 | 20, 20 | 3.34 | 5
2 | 7 | 0.020 | 5 | 11.08 | 1.02 | 15, 16 | 4.58 | 10
3 | 20 | 0.010 | 6 | 5.54 | 1.10 | 25, 40 | 1.46 | 16
4 | 7 | 0.020 | 5 | 11.08 | 1.07 | 25, 16 | 4.67 | 13
5 | 10 | 0.015 | 6 | 20.77 | 1.10 | 25, 20 | 3.75 | 3
6 | 25 | 0.005 | 3 | 13.85 | 1.01 | 25, 20 | 3.34 | 13

6 Discussion

We have developed in this paper a theory of reward-modulated STDP. This theory predicts that reinforcement learning through reward-modulated STDP is also possible at biologically more realistic spontaneous firing rates than the average rate of 1 Hz that was used (and argued to be needed) in the extensive computer experiments of [3]. We have also shown, both analytically and through computer experiments, that the result of the fundamental biofeedback experiment in monkeys from [1] can be explained on the basis of reward-modulated STDP. The resulting theory of reward-modulated STDP makes concrete predictions regarding the shape of various functions (e.g., reward kernels) that would optimally support the speed of reward-modulated learning for the generic (but rather difficult) learning tasks where a neuron is supposed to respond to input spikes with specific patterns of output spikes, and only spikes at the right times are rewarded.
Further work (see [6]) shows that reward-modulated STDP can in some cases replace supervised training of readout neurons from generic cortical microcircuit models.

Acknowledgment: We would like to thank Gordon Pipa and Matthias Munk for helpful discussions. Written under partial support by the Austrian Science Fund FWF, projects #P17229 and #S9102, and project #FP6-015879 (FACETS) of the European Union.

References

[1] E. E. Fetz and M. A. Baker. Operantly conditioned patterns of precentral unit activity and correlated responses in adjacent cells and contralateral muscles. J. Neurophysiol., 36(2):179-204, Mar 1973.

[2] C. H. Bailey, M. Giustetto, Y.-Y. Huang, R. D. Hawkins, and E. R. Kandel. Is heterosynaptic modulation essential for stabilizing Hebbian plasticity and memory? Nature Reviews Neuroscience, 1:11-20, 2000.

[3] E. M. Izhikevich. Solving the distal reward problem through linkage of STDP and dopamine signaling. Cerebral Cortex, Advance Access, January 13, 2007.

[4] R. V. Florian. Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity. Neural Computation, 19(6):1468-1502, 2007.

[5] W. Gerstner and W. M. Kistler. Spiking Neuron Models. Cambridge University Press, Cambridge, 2002.

[6] R. Legenstein, D. Pecevski, and W. Maass. Theory and applications of reward-modulated spike-timing-dependent plasticity. In preparation, 2007.

[7] A. Destexhe, M. Rudolph, J. M. Fellous, and T. J. Sejnowski. Fluctuating synaptic conductances recreate in vivo-like activity in neocortical neurons. Neuroscience, 107(1):13-24, 2001.