Hebbian Learning of Bayes Optimal Decisions

Bernhard Nessler, Michael Pfeiffer, and Wolfgang Maass
Institute for Theoretical Computer Science, Graz University of Technology
A-8010 Graz, Austria
{nessler,pfeiffer,maass}@igi.tugraz.at
(Both authors contributed equally to this work.)

Abstract

Uncertainty is omnipresent when we perceive or interact with our environment, and the Bayesian framework provides computational methods for dealing with it. Mathematical models for Bayesian decision making typically require data structures that are hard to implement in neural networks. This article shows that even the simplest and experimentally best supported type of synaptic plasticity, Hebbian learning, in combination with a sparse, redundant neural code, can in principle learn to infer optimal Bayesian decisions. We present a concrete Hebbian learning rule operating on log-probability ratios. Modulated by reward signals, this Hebbian plasticity rule also provides a new perspective for understanding how Bayesian inference could support fast reinforcement learning in the brain. In particular we show that recent experimental results by Yang and Shadlen [1] on reinforcement learning of probabilistic inference in primates can be modeled in this way.

1 Introduction

Evolution is likely to favor those biological organisms which are able to maximize the chance of achieving correct decisions in response to multiple unreliable sources of evidence. Hence one may argue that probabilistic inference, rather than logical inference, is the "mathematics of the mind", and that this perspective may help us to understand the principles of computation and learning in the brain [2]. Bayesian inference, or equivalently inference in Bayesian networks [3], is the most commonly considered framework for probabilistic inference, and a mathematical theory for learning in Bayesian networks has been developed.

Various attempts have been made to relate these theoretically optimal models to experimentally supported models for computation and plasticity in networks of neurons in the brain. [2] models Bayesian inference through an approximate implementation of the Belief Propagation algorithm (see [3]) in a network of spiking neurons. For reduced classes of probability distributions, [4] proposed a method for spiking network models to learn Bayesian inference with an online approximation to an EM algorithm. The approach of [5] interprets the weight $w_{ji}$ of a synaptic connection between neurons representing the random variables $x_i$ and $x_j$ as $\log \frac{p(x_i, x_j)}{p(x_i)\cdot p(x_j)}$, and presents algorithms for learning these weights.

Neural correlates of variables that are important for decision making under uncertainty have been presented, e.g., in the recent experimental study by Yang and Shadlen [1]. In their study they found that firing rates of neurons in area LIP of macaque monkeys reflect the log-likelihood ratio (or log-odd) of the outcome of a binary decision, given visual evidence. The learning of such log-odds for Bayesian decision making can be reduced to learning weights for a linear classifier, given an appropriate but fixed transformation from the input to possibly nonlinear features [6]. We show that the optimal weights for the linear decision function are actually log-odds themselves, and that the definition of the features determines the assumptions of the learner about statistical dependencies among inputs.
In this work we show that simple Hebbian learning [7] is sufficient to implement learning of Bayes optimal decisions for arbitrarily complex probability distributions. We present and analyze a concrete learning rule, which we call the Bayesian Hebb rule, and show that it provably converges towards correct log-odds. In combination with appropriate preprocessing networks, this implements learning of different probabilistic decision making processes, like e.g. Naive Bayesian classification. Finally we show that a reward-modulated version of this Hebbian learning rule can solve simple reinforcement learning tasks, and also provides a model for the experimental results of [1].

2 A Hebbian rule for learning log-odds

We consider the model of a linear threshold neuron with output $y_0$, where $y_0 = 1$ means that the neuron is firing and $y_0 = 0$ means non-firing. The neuron's current decision $\hat{y}_0$ whether to fire or not is given by a linear decision function $\hat{y}_0 = \mathrm{sign}\bigl(w_0 \cdot \text{const} + \sum_{i=1}^{n} w_i y_i\bigr)$, where the $y_i$ are the current firing states of all presynaptic neurons and the $w_i$ are the weights of the corresponding synapses. We propose the following learning rule, which we call the Bayesian Hebb rule:

$\Delta w_i = \begin{cases} \eta\,(1 + e^{-w_i}), & \text{if } y_0 = 1 \text{ and } y_i = 1 \\ -\eta\,(1 + e^{w_i}), & \text{if } y_0 = 0 \text{ and } y_i = 1 \\ 0, & \text{if } y_i = 0. \end{cases}$   (1)

This learning rule is purely local, i.e. it depends only on the binary firing states $y_i$ and $y_0$ of the pre- and postsynaptic neuron, the current weight $w_i$, and a learning rate $\eta$. Under the assumption of a stationary joint probability distribution of the pre- and postsynaptic firing states $y_0, y_1, \ldots, y_n$, the Bayesian Hebb rule learns log-probability ratios of the postsynaptic firing state $y_0$, conditioned on a corresponding presynaptic firing state $y_i$. We consider in this article the use of the rule in a supervised, teacher-forced mode (see Section 3), and also in a reinforcement learning mode (see Section 4). We will prove that the rule converges globally to the target weight value $w_i^*$, given by

$w_i^* = \log \dfrac{p(y_0 = 1 \mid y_i = 1)}{p(y_0 = 0 \mid y_i = 1)}.$   (2)

We first show that the expected update $E[\Delta w_i]$ under (1) vanishes at the target value $w_i^*$:

$E[\Delta w_i] = 0$
$\Longleftrightarrow\; p(y_0{=}1, y_i{=}1)\,\eta\,(1 + e^{-w_i}) - p(y_0{=}0, y_i{=}1)\,\eta\,(1 + e^{w_i}) = 0$
$\Longleftrightarrow\; \dfrac{p(y_0{=}1, y_i{=}1)}{p(y_0{=}0, y_i{=}1)} = \dfrac{1 + e^{w_i}}{1 + e^{-w_i}} = e^{w_i}$
$\Longleftrightarrow\; w_i = \log \dfrac{p(y_0{=}1 \mid y_i{=}1)}{p(y_0{=}0 \mid y_i{=}1)}.$   (3)

Since the above is a chain of equivalence transformations, this proves that $w_i^*$ is the only equilibrium value of the rule. The weight vector $\mathbf{w}^*$ is thus a global point attractor with regard to expected weight changes of the Bayesian Hebb rule (1) in the $n$-dimensional weight space $\mathbb{R}^n$.

Furthermore we show, using the result from (3), that the expected weight change at any current value of $w_i$ points in the direction of $w_i^*$. Consider some arbitrary intermediate weight value $w_i = w_i^* + 2\varepsilon$:

$E[\Delta w_i]\big|_{w_i^* + 2\varepsilon} = E[\Delta w_i]\big|_{w_i^* + 2\varepsilon} - E[\Delta w_i]\big|_{w_i^*} = \eta\left[p(y_0{=}1, y_i{=}1)\,e^{-w_i^*}\bigl(e^{-2\varepsilon} - 1\bigr) - p(y_0{=}0, y_i{=}1)\,e^{w_i^*}\bigl(e^{2\varepsilon} - 1\bigr)\right] = \eta\left(p(y_0{=}0, y_i{=}1)\,e^{-\varepsilon} + p(y_0{=}1, y_i{=}1)\,e^{\varepsilon}\right)\bigl(e^{-\varepsilon} - e^{\varepsilon}\bigr).$   (4)

The first factor in (4) is always non-negative, hence $\varepsilon < 0$ implies $E[\Delta w_i] > 0$, and $\varepsilon > 0$ implies $E[\Delta w_i] < 0$. The Bayesian Hebb rule is therefore always expected to perform updates in the right direction, and initial weight values or perturbations of the weights decay exponentially fast. Already after having seen a finite set of examples $\langle y_0, \ldots, y_n \rangle \in \{0,1\}^{n+1}$, the Bayesian Hebb rule closely approximates the optimal weight vector $\mathbf{w}^*$ that can be inferred from the data.
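As an illustration, the following minimal simulation sketch applies rule (1) to samples from a fixed toy joint distribution over a single presynaptic/postsynaptic pair and compares the learned weight with the target log-odd (2). The distribution, the constant learning rate, and all variable names are our own illustrative choices, not taken from the paper's experiments.

```python
# Minimal simulation sketch of the Bayesian Hebb rule (1) for a single synapse,
# assuming a stationary toy joint distribution p(y0, y1). The distribution and
# the constant learning rate are illustrative choices, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution p(y0, y1): rows index y0, columns index y1.
p_joint = np.array([[0.30, 0.10],   # y0 = 0
                    [0.15, 0.45]])  # y0 = 1

# Target weight from Eq. (2): log p(y0=1 | y1=1) / p(y0=0 | y1=1)
w_target = np.log(p_joint[1, 1] / p_joint[0, 1])

w = 0.0
eta = 0.01
for _ in range(20000):
    y0, y1 = divmod(rng.choice(4, p=p_joint.ravel()), 2)  # sample a pair (y0, y1)
    if y1 == 1:                      # rule (1): update only if the presynaptic neuron fired
        if y0 == 1:
            w += eta * (1.0 + np.exp(-w))
        else:
            w -= eta * (1.0 + np.exp(w))

print(f"learned w = {w:.3f}, target log-odd w* = {w_target:.3f}")
```

With this toy distribution the target log-odd is $\log(0.45/0.10) \approx 1.5$, and the learned weight should fluctuate around that value, with a spread controlled by the size of the constant learning rate.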
A traditional frequentist's approach would use counters $a_i = \#[y_0{=}1 \wedge y_i{=}1]$ and $b_i = \#[y_0{=}0 \wedge y_i{=}1]$ to estimate every $w_i$ by

$\hat{w}_i = \log \dfrac{a_i}{b_i}.$   (5)

A Bayesian approach would model $p(y_0 \mid y_i)$ with an (initially flat) Beta-distribution, and use the counters $a_i$ and $b_i$ to update this belief [3], leading to the same MAP estimate $\hat{w}_i$. Consequently, in both approaches a new example with $y_0 = 1$ and $y_i = 1$ leads to the update

$\hat{w}_i^{new} = \log \dfrac{a_i + 1}{b_i} = \hat{w}_i + \log\Bigl(1 + \dfrac{1}{a_i}\Bigr) = \hat{w}_i + \log\Bigl(1 + \dfrac{1}{N_i}\bigl(1 + e^{-\hat{w}_i}\bigr)\Bigr),$   (6)

where $N_i := a_i + b_i$ is the number of previously processed examples with $y_i = 1$, and thus $\frac{1}{a_i} = \frac{1}{N_i}\bigl(1 + \frac{b_i}{a_i}\bigr) = \frac{1}{N_i}\bigl(1 + e^{-\hat{w}_i}\bigr)$. Analogously, a new example with $y_0 = 0$ and $y_i = 1$ gives rise to the update

$\hat{w}_i^{new} = \log \dfrac{a_i}{b_i + 1} = \hat{w}_i - \log\Bigl(1 + \dfrac{1}{b_i}\Bigr) = \hat{w}_i - \log\Bigl(1 + \dfrac{1}{N_i}\bigl(1 + e^{\hat{w}_i}\bigr)\Bigr).$   (7)

Furthermore, $\hat{w}_i^{new} = \hat{w}_i$ for a new example with $y_i = 0$. Using the approximation $\log(1 + \alpha) \approx \alpha$, the update rules (6) and (7) yield the Bayesian Hebb rule (1) with an adaptive learning rate $\eta_i = \frac{1}{N_i}$ for each synapse.

In fact, a result of Robbins and Monro (see [8] for a review) implies that the updating of weight estimates $\hat{w}_i$ according to (6) and (7) converges to the target values $w_i^*$ not only for the particular choice $\eta_i^{(N_i)} = \frac{1}{N_i}$, but for any sequence $\eta_i^{(N_i)}$ that satisfies $\sum_{N_i=1}^{\infty} \eta_i^{(N_i)} = \infty$ and $\sum_{N_i=1}^{\infty} \bigl(\eta_i^{(N_i)}\bigr)^2 < \infty$. Moreover, the Supermartingale Convergence Theorem (see [8]) guarantees convergence in distribution even for a sufficiently small constant learning rate.

Learning rate adaptation

One can see from the above considerations that the Bayesian Hebb rule with a constant learning rate converges globally to the desired log-odds. A too small constant learning rate, however, tends to slow down the initial convergence of the weight vector, whereas a too large constant learning rate produces larger fluctuations once the steady state is reached. (6) and (7) suggest a decaying learning rate $\eta_i = \frac{1}{N_i}$, where $N_i$ is the number of preceding examples with $y_i = 1$. We now present a learning rate adaptation mechanism that avoids biologically implausible counters and is robust enough to deal even with non-stationary distributions.

Since the Bayesian Hebb rule and the Bayesian approach of updating Beta-distributions for conditional probabilities are closely related, it is reasonable to expect that the distribution of weights $w_i$ over longer time periods with a non-vanishing learning rate will resemble a $Beta(a_i, b_i)$-distribution transformed to the log-odd domain. The parameters $a_i$ and $b_i$ in this case are no longer exact counters but correspond to virtual sample sizes, which depend on the current learning rate. We formalize this statistical model of $w_i$ by

$w_i \sim p(w_i) = \dfrac{\Gamma(a_i + b_i)}{\Gamma(a_i)\,\Gamma(b_i)}\; \sigma(w_i)^{a_i}\, \sigma(-w_i)^{b_i}, \qquad \text{with } \sigma(w) = \dfrac{1}{1 + e^{-w}}.$

In practice this model turned out to capture quite well the actually observed quasi-stationary distribution of $w_i$. In [9] we show analytically that $E[w_i] \approx \log\frac{a_i}{b_i}$ and $\mathrm{Var}[w_i] \approx \frac{1}{a_i} + \frac{1}{b_i}$. A learning rate adaptation mechanism at the synapse that keeps track of the observed mean and variance of the synaptic weight can therefore recover estimates of the virtual sample sizes $a_i$ and $b_i$. The following mechanism, which we call variance tracking, implements this by computing running averages of the weights and of the squares of the weights in $\bar{w}_i$ and $\bar{q}_i$:

$\bar{w}_i^{new} = (1 - \eta_i)\,\bar{w}_i + \eta_i\, w_i, \qquad \bar{q}_i^{new} = (1 - \eta_i)\,\bar{q}_i + \eta_i\, w_i^2, \qquad \eta_i^{new} = \dfrac{\bar{q}_i - \bar{w}_i^2}{1 + \cosh \bar{w}_i}.$   (8)

In practice this mechanism decays like $\frac{1}{N_i}$ under stationary conditions, but is also able to handle changing input distributions. This variance tracking mechanism was used for the Bayesian Hebb rule in all experiments presented in this article.
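The following sketch shows one possible per-synapse implementation of the variance tracking mechanism (8) combined with rule (1), under the statistical model described above; the initialization, the clipping of the learning rate, and the function name are our own choices, not part of the paper's specification.

```python
# Sketch of a per-synapse "variance tracking" step (Eq. (8)) combined with
# rule (1), assuming the statistical model described above. Initialization,
# the clipping of the learning rate, and all names are our own choices.
import numpy as np

def bayesian_hebb_step(w, w_bar, q_bar, y0, yi):
    """One update of weight w and of its variance-tracking statistics."""
    if yi == 0:                       # rule (1): no update if the presynaptic neuron is silent
        return w, w_bar, q_bar
    # current learning rate estimated from the running mean and second moment (Eq. (8))
    eta = (q_bar - w_bar ** 2) / (1.0 + np.cosh(w_bar))
    eta = float(np.clip(eta, 1e-5, 1.0))   # guard against degenerate estimates (our addition)
    if y0 == 1:
        w += eta * (1.0 + np.exp(-w))
    else:
        w -= eta * (1.0 + np.exp(w))
    # running averages of the weight and of its square
    w_bar = (1.0 - eta) * w_bar + eta * w
    q_bar = (1.0 - eta) * q_bar + eta * w ** 2
    return w, w_bar, q_bar

# A synapse could for instance be initialized with w = w_bar = 0 and q_bar = 1,
# which corresponds to a large initial learning rate of 0.5.
```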
3 Hebbian learning of Bayesian decisions

We now show how the Bayesian Hebb rule can be used to learn Bayes optimal decisions. The first application is the Naive Bayesian classifier, where a binary target variable $x_0$ should be inferred from a vector of multinomial variables $\mathbf{x} = \langle x_1, \ldots, x_m \rangle$, under the assumption that the $x_k$'s are conditionally independent given $x_0$, thus $p(x_0, \mathbf{x}) = p(x_0) \prod_{k=1}^{m} p(x_k \mid x_0)$.

Let the $m$ input variables $x_1, \ldots, x_m$ be represented through the binary firing states $y_1, \ldots, y_n \in \{0,1\}$ of the $n$ presynaptic neurons in a population coding manner. More precisely, let each input variable $x_k \in \{1, \ldots, m_k\}$ be represented by $m_k$ neurons, where each neuron fires only for one of the $m_k$ possible values of $x_k$. Formally we define the simple preprocessing (SP)

$\mathbf{y} = \bigl[\phi(x_1)^T, \ldots, \phi(x_m)^T\bigr]^T \quad \text{with} \quad \phi(x_k)^T = \bigl[I(x_k = 1), \ldots, I(x_k = m_k)\bigr],$   (10)

where $m_k$ is the number of different possible values of the input variable $x_k$, and the indicator function $I$ is defined as $I(true) = 1$ and $I(false) = 0$. The binary target variable $x_0$ is represented directly by the binary state $y_0$ of the postsynaptic neuron.

Using basic rules of probability theory, the posterior probability ratio for $x_0 = 1$ and $x_0 = 0$ can be derived:

$\dfrac{p(x_0{=}1 \mid \mathbf{x})}{p(x_0{=}0 \mid \mathbf{x})} = \dfrac{p(x_0{=}1)}{p(x_0{=}0)} \prod_{k=1}^{m} \dfrac{p(x_k \mid x_0{=}1)}{p(x_k \mid x_0{=}0)} = \left(\dfrac{p(x_0{=}1)}{p(x_0{=}0)}\right)^{(1-m)} \prod_{k=1}^{m} \dfrac{p(x_0{=}1 \mid x_k)}{p(x_0{=}0 \mid x_k)} = \left(\dfrac{p(x_0{=}1)}{p(x_0{=}0)}\right)^{(1-m)} \prod_{k=1}^{m} \prod_{j=1}^{m_k} \left(\dfrac{p(x_0{=}1 \mid x_k{=}j)}{p(x_0{=}0 \mid x_k{=}j)}\right)^{I(x_k = j)}.$   (9)

Substituting the state variables $y_0, y_1, \ldots, y_n$ in (9) and taking the logarithm leads to

$\log \dfrac{p(y_0{=}1 \mid \mathbf{y})}{p(y_0{=}0 \mid \mathbf{y})} = (1 - m) \log \dfrac{p(y_0{=}1)}{p(y_0{=}0)} + \sum_{i=1}^{n} y_i \log \dfrac{p(y_0{=}1 \mid y_i{=}1)}{p(y_0{=}0 \mid y_i{=}1)}.$

Hence the optimal decision under the Naive Bayes assumption is

$\hat{y}_0 = \mathrm{sign}\Bigl((1 - m)\, w_0 + \sum_{i=1}^{n} w_i y_i\Bigr).$

The optimal weights $w_0^*$ and $w_i^*$,

$w_0^* = \log \dfrac{p(y_0{=}1)}{p(y_0{=}0)} \qquad \text{and} \qquad w_i^* = \log \dfrac{p(y_0{=}1 \mid y_i{=}1)}{p(y_0{=}0 \mid y_i{=}1)} \quad \text{for } i = 1, \ldots, n,$

are obviously log-odds which can be learned by the Bayesian Hebb rule (the bias weight $w_0$ is simply learned as an unconditional log-odd).
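A minimal sketch of the simple preprocessing (10) and of the resulting Naive Bayes decision function is given below; the toy variable sizes and helper names are ours and only illustrate the encoding and the decision, not a neural implementation.

```python
# Sketch of the simple preprocessing (SP) of Eq. (10) and of the Naive Bayes
# decision; the toy dimensions and helper names are illustrative only.
import numpy as np

def simple_preprocessing(x, sizes):
    """One-hot encode each multinomial variable x[k] in {1, ..., sizes[k]} (Eq. (10))."""
    parts = []
    for xk, mk in zip(x, sizes):
        group = np.zeros(mk)
        group[xk - 1] = 1.0          # exactly one neuron of the group fires
        parts.append(group)
    return np.concatenate(parts)

def naive_bayes_decision(y, w0, w, m):
    """y0_hat = sign((1 - m) * w0 + sum_i w_i * y_i), with m input variables."""
    return int((1.0 - m) * w0 + float(w @ y) >= 0.0)

# Example: three input variables with 3, 2 and 4 possible values give n = 9
# presynaptic neurons; w0 and the w_i would be learned with the Bayesian Hebb rule.
sizes = [3, 2, 4]
y = simple_preprocessing([2, 1, 4], sizes)       # -> length-9 binary vector
print(y, naive_bayes_decision(y, w0=0.0, w=np.zeros(9), m=len(sizes)))
```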
3.1 Learning Bayesian decisions for arbitrary distributions

We now address the more general case, where conditional independence of the input variables $x_1, \ldots, x_m$ cannot be assumed. In this case the dependency structure of the underlying distribution is given in terms of an arbitrary Bayesian network BN for discrete variables (see e.g. Figure 1A). Without loss of generality we choose a numbering scheme of the nodes of the BN such that the node to be learned is $x_0$ and its direct children are $x_1, \ldots, x_{m'}$. This implies that the BN can be described by $m + 1$ (possibly empty) parent sets defined by

$\mathcal{P}_k = \{\, i \mid \text{a directed edge } x_i \rightarrow x_k \text{ exists in BN and } i \geq 1 \,\}.$

The joint probability distribution of the variables $x_0, \ldots, x_m$ in BN can then be factored and evaluated for $x_0 = 1$ and $x_0 = 0$ in order to obtain the probability ratio

$\dfrac{p(x_0{=}1 \mid \mathbf{x})}{p(x_0{=}0 \mid \mathbf{x})} = \dfrac{p(x_0{=}1, \mathbf{x})}{p(x_0{=}0, \mathbf{x})} = \dfrac{p(x_0{=}1 \mid \mathbf{x}_{\mathcal{P}_0})}{p(x_0{=}0 \mid \mathbf{x}_{\mathcal{P}_0})}\; \prod_{k=1}^{m'} \dfrac{p(x_k \mid \mathbf{x}_{\mathcal{P}_k}, x_0{=}1)}{p(x_k \mid \mathbf{x}_{\mathcal{P}_k}, x_0{=}0)}\; \prod_{k=m'+1}^{m} \dfrac{p(x_k \mid \mathbf{x}_{\mathcal{P}_k})}{p(x_k \mid \mathbf{x}_{\mathcal{P}_k})}.$

Obviously, the last term cancels out, and by applying Bayes' rule and taking the logarithm, the target log-odd can be expressed as a sum of conditional log-odds only:

$\log \dfrac{p(x_0{=}1 \mid \mathbf{x})}{p(x_0{=}0 \mid \mathbf{x})} = \log \dfrac{p(x_0{=}1 \mid \mathbf{x}_{\mathcal{P}_0})}{p(x_0{=}0 \mid \mathbf{x}_{\mathcal{P}_0})} + \sum_{k=1}^{m'} \left[\log \dfrac{p(x_0{=}1 \mid x_k, \mathbf{x}_{\mathcal{P}_k})}{p(x_0{=}0 \mid x_k, \mathbf{x}_{\mathcal{P}_k})} - \log \dfrac{p(x_0{=}1 \mid \mathbf{x}_{\mathcal{P}_k})}{p(x_0{=}0 \mid \mathbf{x}_{\mathcal{P}_k})}\right].$   (11)

We now develop a suitable sparse encoding of $x_1, \ldots, x_m$ into binary variables $y_1, \ldots, y_n$ (with $n \gg m$) such that the decision function (11) can be written as a weighted sum, and the weights correspond to conditional log-odds of the $y_i$'s. Figure 1B illustrates such a sparse code: one binary variable is created for every possible value assignment to a variable and all its parents, and one additional binary variable is created for every possible value assignment to the parent nodes only. Formally, the previously introduced population coding operator $\phi$ is generalized such that $\phi(x_{i_1}, x_{i_2}, \ldots, x_{i_l})$ creates a vector of length $\prod_{j=1}^{l} m_{i_j}$ that equals zero in all entries except for one 1-entry, which identifies by its position in the vector the present assignment of the input variables $x_{i_1}, \ldots, x_{i_l}$. The concatenation of all these population coded groups is collected in the vector $\mathbf{y}$ of length $n$:

$\mathbf{y} = \bigl[\phi(\mathbf{x}_{\mathcal{P}_0})^T,\; \phi(x_1, \mathbf{x}_{\mathcal{P}_1})^T,\; -\phi(\mathbf{x}_{\mathcal{P}_1})^T,\; \ldots,\; \phi(x_{m'}, \mathbf{x}_{\mathcal{P}_{m'}})^T,\; -\phi(\mathbf{x}_{\mathcal{P}_{m'}})^T\bigr]^T.$   (12)

We refer to the resulting preprocessing circuit as generalized preprocessing (GP). The negated vector parts in (12) correspond to the negative coefficients in the sum in (11). Inserting the sparse coding (12) into (11) allows writing the Bayes optimal decision function (11) as a pure sum of log-odds of the target variable:

$\hat{x}_0 = \hat{y}_0 = \mathrm{sign}\Bigl(\sum_{i=1}^{n} w_i y_i\Bigr), \qquad \text{with} \quad w_i^* = \log \dfrac{p(y_0{=}1 \mid y_i \neq 0)}{p(y_0{=}0 \mid y_i \neq 0)}.$

Every synaptic weight $w_i$ can be learned efficiently by the Bayesian Hebb rule (1), with the formal modification that the update is not only triggered by $y_i = 1$ but in general whenever $y_i \neq 0$ (which obviously does not change the behavior of the learning process). A neuron that learns with the Bayesian Hebb rule on inputs that are generated by the generalized preprocessing (GP) defined in (12) therefore approximates the Bayes optimal decision function (11), and converges quite fast to the best performance that any probabilistic inference could possibly achieve (see Figure 2B).

Figure 1: A) An example Bayesian network with general connectivity. B) Population coding applied to the Bayesian network shown in panel A. For each combination of values of the variables $\{x_k, \mathbf{x}_{\mathcal{P}_k}\}$ of a factor there is exactly one neuron (indicated by a black circle) associated with the factor that outputs the value 1. In addition, OR's of these values are computed (black squares).
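The following sketch illustrates how the generalized preprocessing (12) could be computed for a given value assignment, assuming the parent sets of $x_0$ and of its children are known. The toy dependency structure, variable sizes, and function names are our own illustrative assumptions.

```python
# Sketch of the generalized preprocessing (GP) of Eq. (12) for a small,
# hypothetical dependency structure. The parent sets, variable sizes, and all
# function names are illustrative assumptions.
import numpy as np

def phi(values, sizes):
    """Population code for a joint assignment: a vector of length prod(sizes)
    with a single 1 whose position encodes the assignment (values are 1-based)."""
    size = int(np.prod(sizes)) if sizes else 1
    idx = 0
    for v, s in zip(values, sizes):
        idx = idx * s + (v - 1)
    out = np.zeros(size)
    out[idx] = 1.0
    return out

def generalized_preprocessing(x, sizes, parents):
    """Build y = [phi(x_P0), phi(x_1, x_P1), -phi(x_P1), ..., phi(x_m', x_Pm'), -phi(x_Pm')]
    as in Eq. (12). `x` and `sizes` list the values and ranges of x_1, ..., x_m (stored
    0-based); `parents[k]` lists the (1-based) indices of P_k, for k = 0, ..., m'."""
    def assignment(idxs):
        return [x[i - 1] for i in idxs], [sizes[i - 1] for i in idxs]
    parts = [phi(*assignment(parents[0]))]               # group for P_0
    for k in range(1, len(parents)):                     # children x_1, ..., x_m' of x_0
        vals, szs = assignment(parents[k])
        parts.append(phi([x[k - 1]] + vals, [sizes[k - 1]] + szs))   # phi(x_k, x_Pk)
        parts.append(-phi(vals, szs))                                # -phi(x_Pk)
    return np.concatenate(parts)

# Toy example: P_0 = {x_3}, child x_1 has P_1 = {x_2}, child x_2 has P_2 = {x_3};
# the variables x_1, x_2, x_3 take 2, 3, and 2 possible values.
y = generalized_preprocessing(x=[2, 3, 1], sizes=[2, 3, 2], parents=[[3], [2], [3]])
print(y)     # length 2 + 6 + 3 + 6 + 2 = 19, with +1, -1 and 0 entries
```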
4 The Bayesian Hebb rule in reinforcement learning

We show in this section that a reward-modulated version of the Bayesian Hebb rule enables a learning agent to solve simple reinforcement learning tasks. We consider the standard operant conditioning scenario, where the learner receives at each trial an input $\mathbf{x} = \langle x_1, \ldots, x_m \rangle$, chooses an action $\hat{a}$ out of a set of possible actions $\mathcal{A}$, and receives a binary reward signal $r \in \{0,1\}$ with probability $p(r \mid \mathbf{x}, \hat{a})$. The learner's goal is to learn (as fast as possible) a policy $\pi(\mathbf{x}, a)$ such that action selection according to this policy maximizes the average reward. In contrast to the previous learning tasks, the learner has to explore different actions for the same input in order to learn the reward probabilities of all possible actions. The agent might for example choose actions stochastically with $\pi(\mathbf{x}, a{=}\hat{a}) = p(r{=}1 \mid \mathbf{x}, a{=}\hat{a})$, which corresponds to the matching behavior phenomenon often observed in biology [10]. This policy was used during training in our computer experiments.

The goal is to infer the probability of binary reward, so it suffices to learn the log-odd $\log \frac{p(r=1 \mid \mathbf{x}, a)}{p(r=0 \mid \mathbf{x}, a)}$ for every action, and to choose the action that is most likely to yield reward (e.g. by a Winner-Take-All structure). If the reward probability for an action $a = \hat{a}$ is defined by some Bayesian network BN, one can rewrite this log-odd as

$\log \dfrac{p(r{=}1 \mid \mathbf{x}, a{=}\hat{a})}{p(r{=}0 \mid \mathbf{x}, a{=}\hat{a})} = \log \dfrac{p(r{=}1 \mid a{=}\hat{a})}{p(r{=}0 \mid a{=}\hat{a})} + \sum_{k=1}^{m} \log \dfrac{p(x_k \mid \mathbf{x}_{\mathcal{P}_k}, r{=}1, a{=}\hat{a})}{p(x_k \mid \mathbf{x}_{\mathcal{P}_k}, r{=}0, a{=}\hat{a})}.$   (13)

In order to use the Bayesian Hebb rule, the input vector $\mathbf{x}$ is preprocessed to obtain a binary vector $\mathbf{y}$. Either a simple population code such as (10) or the generalized preprocessing (12) illustrated in Figure 1B can be used, depending on the assumed dependency structure. The reward log-odd (13) for the preprocessed input vector $\mathbf{y}$ can then be written as a linear sum

$\log \dfrac{p(r{=}1 \mid \mathbf{y}, a{=}\hat{a})}{p(r{=}0 \mid \mathbf{y}, a{=}\hat{a})} = w_{\hat{a},0} + \sum_{i=1}^{n} w_{\hat{a},i}\, y_i,$

where the optimal weights are $w_{\hat{a},0}^* = \log \frac{p(r=1 \mid a=\hat{a})}{p(r=0 \mid a=\hat{a})}$ and $w_{\hat{a},i}^* = \log \frac{p(r=1 \mid y_i \neq 0,\, a=\hat{a})}{p(r=0 \mid y_i \neq 0,\, a=\hat{a})}$. These log-odds can be learned for each possible action with a reward-modulated version of the Bayesian Hebb rule (1):

$\Delta w_{\hat{a},i} = \begin{cases} \eta\,(1 + e^{-w_{\hat{a},i}}), & \text{if } r = 1,\; y_i \neq 0,\; a = \hat{a} \\ -\eta\,(1 + e^{w_{\hat{a},i}}), & \text{if } r = 0,\; y_i \neq 0,\; a = \hat{a} \\ 0, & \text{otherwise.} \end{cases}$   (14)

The attractive theoretical properties of the Bayesian Hebb rule for the prediction case carry over to the case of reinforcement learning: the weights corresponding to the optimal policy are the only equilibria under the reward-modulated Bayesian Hebb rule, and they are also global attractors in weight space, independently of the exploration policy (see [9]).
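Below is a minimal sketch of the reward-modulated rule (14) combined with a matching-behavior-like exploration policy, applied to a toy environment. The environment, its reward model, and the normalization used when sampling actions are our own assumptions and do not reproduce the experimental setup of Section 5.

```python
# Minimal sketch of the reward-modulated Bayesian Hebb rule (14) with a
# matching-behavior-like exploration policy on a toy task. The toy environment,
# its reward model, and the normalization used to sample actions are our own
# assumptions, not the distributions used in the paper's experiments.
import numpy as np

rng = np.random.default_rng(1)
n_actions, n_inputs = 4, 6
W = np.zeros((n_actions, n_inputs + 1))        # per-action weights; column 0 is the bias w_{a,0}
eta = 0.02

def log_odds(W, y):
    """Estimated reward log-odds, one entry per action (the linear sum of Section 4)."""
    return W[:, 0] + W[:, 1:] @ y

def matching_policy(W, y):
    """Choose actions with probability proportional to the estimated reward probability."""
    p_hat = 1.0 / (1.0 + np.exp(-log_odds(W, y)))
    return int(rng.choice(n_actions, p=p_hat / p_hat.sum()))

def reward_modulated_update(W, y, a, r, eta):
    """Rule (14): update only the weights of the chosen action, for active inputs."""
    active = np.concatenate(([1.0], y)) != 0   # the bias is treated as an always-active input
    w = W[a]
    if r == 1:
        w[active] += eta * (1.0 + np.exp(-w[active]))
    else:
        w[active] -= eta * (1.0 + np.exp(w[active]))

# Toy environment: the true reward log-odds are linear in the binary inputs (an assumption).
true_logit = rng.normal(size=(n_actions, n_inputs))
for trial in range(5000):
    y = rng.integers(0, 2, size=n_inputs).astype(float)
    a = matching_policy(W, y)
    p_reward = 1.0 / (1.0 + np.exp(-(true_logit[a] @ y - 1.0)))
    r = int(rng.random() < p_reward)
    reward_modulated_update(W, y, a, r, eta)
```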
5 Experimental Results

5.1 Results for prediction tasks

We have tested the Bayesian Hebb rule on 400 different prediction tasks, each of them defined by a general (non-Naive) Bayesian network of 7 binary variables. The networks were randomly generated by the algorithm of [11]. From each network we sampled 2000 training and 5000 test examples, and we measured the percentage of correct predictions after every update step. The performance of the predictor was compared to the Bayes optimal predictor, and to online logistic regression, which fits a linear model by gradient descent on the cross-entropy error function. This non-Hebbian learning approach is in general the best performing online learning approach for linear discriminators [3]. Figure 2A shows that the Bayesian Hebb rule with the simple preprocessing (10) generalizes better from a few training examples, but is outperformed by logistic regression in the long run, since the Naive Bayes assumption is not met. With the generalized preprocessing (12), the Bayesian Hebb rule learns fast and converges to the Bayes optimum (see Figure 2B).

In Figure 2C we show that the Bayesian Hebb rule is robust to noisy updates, a condition very likely to occur in biological systems. We modified the weight update $\Delta w_i$ such that the actual update was uniformly distributed in the interval $\Delta w_i \pm \nu\%$, for noise levels $\nu$ of up to 150%. Even such imprecise implementations of the Bayesian Hebb rule perform very well. Similar results can be obtained if the exp-function in (1) is replaced by a low-order Taylor approximation.

Figure 2: Performance comparison for prediction tasks. A) The Bayesian Hebb rule with simple preprocessing (SP) learns as fast as Naive Bayes, and faster than logistic regression (with optimized constant learning rate). B) The Bayesian Hebb rule with generalized preprocessing (GP) learns fast and converges to the Bayes optimal prediction performance. C) Even a very imprecise implementation of the Bayesian Hebb rule (noisy updates, uniformly distributed in $\Delta w_i \pm \nu\%$) yields almost the same learning performance. (All panels plot correctness against the number of training examples; panel A compares Bayesian Hebb SP, logistic regression with $\eta = 0.2$, Naive Bayes, and the Bayes optimum; panel B compares Bayesian Hebb GP, Bayesian Hebb SP, and the Bayes optimum; panel C compares noise levels of 0%, 50%, 100%, and 150%.)

5.2 Results for action selection tasks

The reward-modulated version (14) of the Bayesian Hebb rule was tested on 250 random action selection tasks with m = 6 binary input attributes and 4 possible actions. For every action $\hat{a}$ a random Bayesian network [11] was drawn to model the input and reward distributions (see [9] for details). The agent received stochastic binary rewards for every chosen action, updated the weights $w_{\hat{a},i}$ according to (14), and the average reward was measured on 500 independent test trials. In Figure 3A we compare the reward-modulated Bayesian Hebb rule with the simple population coding (10) (Bayesian Hebb SP) and with generalized preprocessing (12) (Bayesian Hebb GP) to the standard learning model for simple conditioning tasks, the non-Hebbian Rescorla-Wagner rule [12]. The reward-modulated Bayesian Hebb rule learns as fast as the Rescorla-Wagner rule, and, in combination with generalized preprocessing, achieves a higher performance level. The widely used tabular Q-learning algorithm, in comparison, is slower than the other algorithms, since it does not generalize, but it converges to the optimal policy in the long run.

5.3 A model for the experiment of Yang and Shadlen

In the experiment by Yang and Shadlen [1], a monkey had to choose between gazing towards a red target R or a green target G. The probability that a reward was received at either choice depended on four visual input stimuli that had been shown at the beginning of the trial. Every stimulus was one shape out of a set of ten possibilities and had an associated weight, which had been defined by the experimenter. The sum of the four weights yielded the log-odd of obtaining a reward at the red target, and a reward for each trial was assigned accordingly to one of the targets. The monkey thus had to combine the evidence from the four visual stimuli to optimize its action selection behavior.

In the model of this task it is sufficient to learn weights only for the action a = R, and to select this action whenever the log-odd computed with the current weights is positive, and G otherwise. A simple population code as in (10) encoded the 4-dimensional visual stimulus into a 40-dimensional binary vector $\mathbf{y}$. In our experiments, the reward-modulated Bayesian Hebb rule learns this task as fast and with similar quality as the non-Hebbian Rescorla-Wagner rule. Furthermore, Figures 3B and 3C show that after learning it produces behavior similar to that reported for the two monkeys in [1].

Figure 3: A) On 250 4-action conditioning tasks with stochastic rewards, the reward-modulated Bayesian Hebb rule with simple preprocessing (SP) learns similarly fast as the Rescorla-Wagner rule, and substantially faster than Q-learning. With generalized preprocessing (GP), the rule converges to the optimal action-selection policy. B, C) Action selection policies learned by the reward-modulated Bayesian Hebb rule in the task of Yang and Shadlen [1] after 100 (B) and 1000 (C) trials are qualitatively similar to the policies adopted by monkeys H and J in [1] after learning. (Panel A plots the average reward against the number of trials for Bayesian Hebb SP, Bayesian Hebb GP, Rescorla-Wagner, Q-learning, and an optimal selector; panels B and C plot the percentage of red choices against the evidence for red (logLR).)
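The following sketch illustrates, under our own simplifying assumptions, how the Yang and Shadlen task of Section 5.3 could be simulated with the reward-modulated Bayesian Hebb rule and the 40-dimensional simple population code; the shape weights, the stochastic exploration used during learning, and the constant learning rate are made up for illustration and differ from the original experiment and from the paper's simulations.

```python
# Schematic sketch of the Yang & Shadlen task model of Section 5.3, under our
# own simplifying assumptions: the ten shape weights are drawn at random here
# (in the experiment they were fixed by the experimenters), a stochastic
# matching-like policy is used during learning, and the learning rate is constant.
import numpy as np

rng = np.random.default_rng(2)
shape_weights = rng.uniform(-0.9, 0.9, size=10)   # hypothetical log-odd contribution per shape
w_R = np.zeros(40)                                # weights for action R only (see Section 5.3)
eta = 0.05

for trial in range(2000):
    shapes = rng.integers(0, 10, size=4)          # four stimuli, each one of ten shapes
    log_lr = shape_weights[shapes].sum()          # true log-odd of reward at the red target
    y = np.zeros(40)                              # simple population code (10): 4 groups of 10
    y[np.arange(4) * 10 + shapes] = 1.0
    p_red_hat = 1.0 / (1.0 + np.exp(-(w_R @ y)))  # estimated reward probability for choosing R
    choose_red = rng.random() < p_red_hat         # stochastic exploration during learning
    red_rewarded = rng.random() < 1.0 / (1.0 + np.exp(-log_lr))
    if choose_red:                                # reward-modulated update (14) for a = R
        r = int(red_rewarded)
        idx = y != 0
        if r == 1:
            w_R[idx] += eta * (1.0 + np.exp(-w_R[idx]))
        else:
            w_R[idx] -= eta * (1.0 + np.exp(w_R[idx]))

# After learning, choosing R whenever w_R @ y > 0 implements the decision rule of
# Section 5.3; as a rough qualitative check, the learned per-shape weights should be
# ordered like the underlying shape weights:
print(np.corrcoef(np.tile(shape_weights, 4), w_R)[0, 1])
```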
6 Discussion

We have shown that the simplest and experimentally best supported local learning mechanism, Hebbian learning, is sufficient to learn Bayes optimal decisions. We have introduced and analyzed the Bayesian Hebb rule, a training method for synaptic weights which converges fast and robustly to optimal log-probability ratios, without requiring any communication between the plasticity mechanisms of different synapses. We have shown how the same plasticity mechanism can learn Bayes optimal decisions under different statistical independence assumptions, if it is provided with an appropriately preprocessed input. We have demonstrated on a variety of prediction tasks that the Bayesian Hebb rule learns very fast, and that with an appropriate sparse preprocessing mechanism for groups of statistically dependent features its performance converges to the Bayes optimum. Our approach therefore suggests that sparse, redundant codes of input features may simplify synaptic learning processes in spite of strong statistical dependencies. Finally, we have shown that Hebbian learning also suffices for simple instances of reinforcement learning. The Bayesian Hebb rule, modulated by a signal related to rewards, enables fast learning of optimal action selection. The experimental results of [1] on reinforcement learning of probabilistic inference in primates can be partially modeled in this way with regard to the resulting behavior.

An attractive feature of the Bayesian Hebb rule is its ability to deal with the addition or removal of input features through the creation or deletion of synaptic connections, since no relearning of the weights of the other synapses is required. In contrast to discriminative neural learning rules, our approach is generative, which according to [13] leads to faster generalization. The learning rule may therefore be viewed as a potential building block for models of the brain as a self-organizing and fast adapting probabilistic inference machine.

Acknowledgments

We would like to thank Martin Bachler, Sophie Deneve, Rodney Douglas, Konrad Koerding, Rajesh Rao, and especially Dan Roth for inspiring discussions. Written under partial support by the Austrian Science Fund FWF, project # P17229-N04 and project # S9102-N04, as well as project # FP6-015879 (FACETS) and # FP7-216593 (SECO) of the European Union.

References

[1] T. Yang and M. N. Shadlen. Probabilistic reasoning by neurons. Nature, 447:1075–1080, 2007.
[2] R. P. N. Rao. Neural models of Bayesian belief propagation. In K. Doya, S. Ishii, A. Pouget, and R. P. N. Rao, editors, Bayesian Brain, pages 239–267. MIT Press, 2007.
[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer (New York), 2006.
[4] S. Deneve. Bayesian spiking neurons I, II. Neural Computation, 20(1):91–145, 2008.
[5] A. Sandberg, A. Lansner, K. M. Petersson, and Ö. Ekeberg. A Bayesian attractor network with incremental learning. Network: Computation in Neural Systems, 13:179–194, 2002.
[6] D. Roth. Learning in natural language. In Proc. of IJCAI, pages 898–904, 1999.
[7] D. O. Hebb. The Organization of Behavior. Wiley, New York, 1949.
[8] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[9] A. Anonymous. Journal version. In preparation, 2008.
[10] L. P. Sugrue, G. S. Corrado, and W. T. Newsome. Matching behavior and the representation of value in the parietal cortex. Science, 304:1782–1787, 2004.
[11] J. S. Ide and F. G. Cozman. Random generation of Bayesian networks. In Proceedings of the 16th Brazilian Symposium on Artificial Intelligence, pages 366–375, 2002.
[12] R. A. Rescorla and A. R. Wagner. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black and W. F. Prokasy, editors, Classical Conditioning II, pages 64–99. Appleton-Century-Crofts, 1972.
[13] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers. NIPS, 14:841–848, 2002.