Goal-directed decision making in prefrontal cortex: A computational framework

Matthew Botvinick
Princeton Neuroscience Institute and Department of Psychology, Princeton University, Princeton, NJ 08540
matthewb@princeton.edu

James An
Computer Science Department, Princeton University, Princeton, NJ 08540
an@princeton.edu

Abstract

Research in animal learning and behavioral neuroscience has distinguished between two forms of action control: a habit-based form, which relies on stored action values, and a goal-directed form, which forecasts and compares action outcomes based on a model of the environment. While habit-based control has been the subject of extensive computational research, the computational principles underlying goal-directed control in animals have so far received less attention. In the present paper, we advance a computational framework for goal-directed control in animals and humans. We take three empirically motivated points as founding premises: (1) Neurons in dorsolateral prefrontal cortex represent action policies, (2) Neurons in orbitofrontal cortex represent rewards, and (3) Neural computation, across domains, can be appropriately understood as performing structured probabilistic inference. On a purely computational level, the resulting account relates closely to previous work using Bayesian inference to solve Markov decision problems, but extends this work by introducing a new algorithm, which provably converges on optimal plans. On a cognitive and neuroscientific level, the theory provides a unifying framework for several different forms of goal-directed action selection, placing emphasis on a novel form, within which orbitofrontal reward representations directly drive policy selection.

1 Goal-directed action control

In the study of human and animal behavior, it is a long-standing idea that reward-based decision making may rely on two qualitatively different mechanisms. In habit-based decision making, stimuli elicit reflex-like responses, shaped by past reinforcement [1]. In goal-directed or purposive decision making, on the other hand, actions are selected based on a prospective consideration of possible outcomes and future lines of action [2]. Over the past twenty years or so, the attention of cognitive neuroscientists and computationally minded psychologists has tended to focus on habit-based control, due in large part to interest in potential links between dopaminergic function and temporal-difference algorithms for reinforcement learning. However, a resurgence of interest in purposive action selection is now being driven by innovations in animal behavior research, which have yielded powerful new behavioral assays [3], and revealed specific effects of focal neural damage on goal-directed behavior [4].

In discussing some of the relevant data, Daw, Niv and Dayan [5] recently pointed out the close relationship between purposive decision making, as understood in the behavioral sciences, and model-based methods for the solution of Markov decision problems (MDPs), where action policies are derived from a joint analysis of a transition function (a mapping from states and actions to outcomes) and a reward function (a mapping from states to rewards). Beyond this important insight, little work has yet been done to characterize the computations underlying goal-directed action selection (though see [6, 7]). As discussed below, a great deal of evidence indicates that purposive action selection depends critically on a particular region of the brain, the prefrontal cortex. However, it is currently a critical, and quite open, question what the relevant computations within this part of the brain might be.
Of course, the basic computational problem of formulating an optimal policy given a model of an MDP has been extensively studied, and there is no shortage of algorithms one might consider as potentially relevant to prefrontal function (e.g., value iteration, policy iteration, backward induction, linear programming, and others). However, from a cognitive and neuroscientific perspective, there is one approach to solving MDPs that seems particularly appealing to consider. In particular, several researchers have suggested methods for solving MDPs through probabilistic inference [8-12]. The interest of this idea, in the present context, derives from a recent movement toward framing human and animal information processing, as well as the underlying neural computations, in terms of structured probabilistic inference [13, 14]. Given this perspective, it is inviting to consider whether goal-directed action selection, and the neural mechanisms that underlie it, might be understood in those same terms.

One challenge in investigating this possibility is that previous research furnishes no 'off-the-shelf' algorithm for solving MDPs through probabilistic inference that both provably yields optimal policies and aligns with what is known about action selection in the brain. We endeavor here to start filling in that gap. In the following section, we introduce an account of how goal-directed action selection can be performed based on probabilistic inference, within a network whose components map grossly onto specific brain structures. As part of this account, we introduce a new algorithm for solving MDPs through Bayesian inference, along with a convergence proof. We then present results from a set of simulations illustrating how the framework would account for a variety of behavioral phenomena that are thought to involve purposive action selection.

2 Computational model

As noted earlier, the prefrontal cortex (PFC) is believed to play a pivotal role in purposive behavior. This is indicated by a broad association between prefrontal lesions and impairments in goal-directed action in both humans (see [15]) and animals [4]. Single-unit recording and other data suggest that different sectors of PFC make distinct contributions. In particular, neurons in dorsolateral prefrontal cortex (DLPFC) appear to encode task-specific mappings from stimuli to responses (e.g., [16]): "task representations," in the language of psychology, or "policies" in the language of dynamic programming. Although there is some understanding of how policy representations in DLPFC may guide action execution [15], little is yet known about how these representations are themselves selected.

Our most basic proposal is that DLPFC policy representations are selected in a prospective, model-based fashion, leveraging information about action-outcome contingencies (i.e., the transition function) and about the incentive value associated with specific outcomes or states (the reward function). There is extensive evidence to suggest that state-reward associations are represented in another area of the PFC, the orbitofrontal cortex (OFC) [17, 18]. As for the transition function, although it is clear that the brain contains detailed representations of action-outcome associations [19], their anatomical localization is not yet entirely clear. However, some evidence suggests that the environmental effects of simple actions may be represented in inferior fronto-parietal cortex [20], and there is also evidence suggesting that medial temporal structures may be important in forecasting action outcomes [21].
As detailed in the next section, our model assumes that policy representations in DLPFC, reward representations in OFC, and representations of states and actions in other brain regions are coordinated within a network structure that represents their causal or statistical interdependencies, and that policy selection occurs, within this network, through a process of probabilistic inference.

2.1 Architecture

The implementation takes the form of a directed graphical model [22], with the layout shown in Figure 1. Each node represents a discrete random variable. State variables (s), representing the set of m possible world states, serve the role played by parietal and medial temporal cortices in representing action outcomes. Action variables (a), representing the set of available actions, play the role of high-level cortical motor areas involved in the programming of action sequences. Policy variables (π), each representing the set of all deterministic policies associated with a specific state, capture the representational role of DLPFC. Local and global utility variables, described further below, capture the role of OFC in representing incentive value. A separate set of nodes is included for each discrete time-step up to the planning horizon.

Fig 1. Left: Single-step decision. Right: Sequential decision. Each time-slice includes a set of m policy nodes.

The conditional probabilities associated with each variable are represented in tabular form. State probabilities are based on the state and action variables in the preceding time-step, and thus encode the transition function. Action probabilities depend on the current state and its associated policy variable. Utilities depend only on the current state. Rather than representing reward magnitude as a continuous variable, we adopt an approach introduced by [23], representing reward through the posterior probability of a binary variable (u). States associated with large positive rewards raise p(u) (i.e., p(u=1|s)) to near one; states associated with large negative rewards reduce p(u) to near zero. In the simulations reported below, we used a simple linear transformation to map from scalar reward values to p(u):

p(u \mid s_i) = \frac{1}{2}\left(\frac{R(s_i)}{r_{max}} + 1\right), \qquad r_{max} = \max_j |R(s_j)|    (1)

In situations involving sequential actions, expected returns from different time-steps must be integrated into a global representation of expected value. In order to accomplish this, we employ a technique proposed by [8], introducing a "global" utility variable (u_G). Like u, this is a binary random variable, but associated with a posterior probability determined as:

p(u_G) = \frac{1}{N} \sum_i p(u_i)    (2)

where N is the number of u nodes. (Note that temporal discounting can be incorporated into the framework through minimal modifications to Equation 2.) The network as a whole embodies a generative model for instrumental action. The basic idea is to use this model as a substrate for probabilistic inference, in order to arrive at optimal policies.
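To make the reward encoding concrete, the following minimal Python sketch implements Equations 1 and 2. The function names and the example reward values are ours, chosen purely for illustration; they are not part of the original model specification.

```python
import numpy as np

def reward_to_p_u(R):
    """Equation 1: map scalar rewards R(s) to p(u = 1 | s) in [0, 1]."""
    r_max = max(abs(r) for r in R.values())
    return {s: 0.5 * (r / r_max + 1) for s, r in R.items()}

def p_u_global(p_u_values):
    """Equation 2: the posterior for the global utility node u_G is the
    average of the individual p(u_i = 1) values."""
    return float(np.mean(p_u_values))

# Illustrative rewards: a preferred food (r = 2) and a less preferred one (r = 1).
p_u = reward_to_p_u({"s_left": 2.0, "s_right": 1.0})
print(p_u)                               # {'s_left': 1.0, 's_right': 0.75}
print(p_u_global(list(p_u.values())))    # 0.875
```

Under this mapping a reward of zero corresponds to p(u|s) = 0.5, while rewards of +r_max and -r_max map to 1 and 0, respectively.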
There are three general methods for accomplishing this, corresponding to three forms of query. First, a desired outcome state can be identified, by treating one of the state variables (as well as the initial state variable) as observed (see [9] for an application of this approach). Second, the expected return for specific plans can be evaluated and compared by conditioning on specific sets of values over the policy nodes (see [5, 21]). However, our focus here is on a less obvious possibility, which is to condition directly on the utility variable u_G, as explained next.

2.2 Policy selection by probabilistic inference: an iterative algorithm

Cooper [23] introduced the idea of inferring optimal decisions in influence diagrams by treating utility nodes as binary random variables and then conditioning on these variables. Although this technique has been adopted in some more recent work [9, 12], we are aware of no application that guarantees optimal decisions, in the expected-reward sense, in multi-step tasks. We introduce here a simple algorithm that does furnish such a guarantee. The procedure is as follows: (1) Initialize the policy nodes with any set of non-deterministic priors. (2) Treating the initial state and u_G as observed variables (u_G = 1), use standard belief propagation (or a comparable algorithm) to infer the posterior distributions over all policy nodes. (In the single-action situation, where there is only one u node, it is this variable that is treated as observed, u = 1.) (3) Set the prior distributions over the policy nodes to the values (posteriors) obtained in step 2. (4) Go to step 2. The next two sections present proofs of monotonicity and convergence for this algorithm.

2.2.1 Monotonicity

We show first that, at each policy node, the probability associated with the optimal policy will rise on every iteration. Define π* as follows:

p(u_G \mid \pi^*, \pi^+) > p(u_G \mid \pi, \pi^+), \quad \forall \pi \neq \pi^*    (3)

where π+ is the current set of probability distributions at all policy nodes on subsequent time-steps. (Note that we assume here, for simplicity, that there is a unique optimal policy.) The objective is to establish that

p_t(\pi^*) > p_{t-1}(\pi^*)    (4)

where t indexes processing iterations. The dynamics of the network entail that

p_t(\pi) = p_{t-1}(\pi \mid u_G)    (5)

where π represents any value (i.e., policy) of the decision node being considered. Substituting this into (4) gives

p_{t-1}(\pi^* \mid u_G) > p_{t-1}(\pi^*)    (6)

From this point on the focus is on a single iteration, which permits us to omit the relevant subscripts. Applying Bayes' law to (6) yields

\frac{p(u_G \mid \pi^*)\, p(\pi^*)}{\sum_\pi p(u_G \mid \pi)\, p(\pi)} > p(\pi^*)    (7)

Canceling, and bringing the denominator up, this becomes

p(u_G \mid \pi^*) > \sum_\pi p(u_G \mid \pi)\, p(\pi)    (8)

Rewriting the left-hand side, we obtain

\sum_\pi p(u_G \mid \pi^*)\, p(\pi) > \sum_\pi p(u_G \mid \pi)\, p(\pi)    (9)

Subtracting and further rearranging:

\sum_\pi \left[ p(u_G \mid \pi^*) - p(u_G \mid \pi) \right] p(\pi) > 0    (10)

\left[ p(u_G \mid \pi^*) - p(u_G \mid \pi^*) \right] p(\pi^*) + \sum_{\pi \neq \pi^*} \left[ p(u_G \mid \pi^*) - p(u_G \mid \pi) \right] p(\pi) > 0    (11)

\sum_{\pi \neq \pi^*} \left[ p(u_G \mid \pi^*) - p(u_G \mid \pi) \right] p(\pi) > 0    (12)

Note that this last inequality (12) follows from the definition of π*.

Remark: Of course, the identity of π* depends on π+. In particular, the policy π* will only be part of a globally optimal plan if the set of choices π+ is optimal. Fortunately, this requirement is guaranteed to be met, as long as no upper bound is placed on the number of processing cycles. Recalling that we are considering only finite-horizon problems, note that for policies leading to states with no successors, π+ is empty. Thus π* at the relevant policy nodes is fixed, and is guaranteed to be part of the optimal policy. The proof above shows that p(π*) at these nodes will continuously rise. Once it reaches a maximum, π* at immediately preceding decisions will perforce fit with the globally optimal policy. The process works backward, in the fashion of backward induction.
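As a purely illustrative numerical check on this property, the short Python sketch below applies steps 2 through 4 of the procedure to a single decision node. The likelihood values p(u_G = 1 | π) are hypothetical; for any such values, the probability assigned to the policy with the largest likelihood rises on every iteration, as the proof requires.

```python
# Minimal numerical sketch of the update p_t(pi) = p_{t-1}(pi | u_G) at a single
# policy node. The likelihood values p(u_G = 1 | pi) are hypothetical.
likelihood = {"pi_a": 0.8, "pi_b": 0.6, "pi_c": 0.3}
prior = {pi: 1.0 / len(likelihood) for pi in likelihood}     # step 1: uniform priors

for t in range(12):
    p_uG = sum(likelihood[pi] * prior[pi] for pi in prior)               # normalizer
    posterior = {pi: likelihood[pi] * prior[pi] / p_uG for pi in prior}  # Bayes, u_G = 1 observed
    print(t, round(posterior["pi_a"], 4))   # probability of the best policy rises every iteration
    prior = posterior                       # steps 3-4: feed the posterior back in as the prior
```

After a dozen iterations the distribution is effectively deterministic, concentrated on the policy with the highest p(u_G | π).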
2.2.2 Convergence

Continuing with the same notation, we show now that

\lim_{t \to \infty} p_t(\pi^* \mid u_G) = 1    (13)

Note that, if we apply Bayes' law recursively,

p_t(\pi \mid u_G) = \frac{p(u_G \mid \pi)\, p_t(\pi)}{p_t(u_G)} = \frac{p(u_G \mid \pi)^2\, p_{t-1}(\pi)}{p_t(u_G)\, p_{t-1}(u_G)} = \frac{p(u_G \mid \pi)^3\, p_{t-2}(\pi)}{p_t(u_G)\, p_{t-1}(u_G)\, p_{t-2}(u_G)} = \dots    (14)

Thus,

p_1(\pi \mid u_G) = \frac{p(u_G \mid \pi)\, p_1(\pi)}{p_1(u_G)}, \quad p_2(\pi \mid u_G) = \frac{p(u_G \mid \pi)^2\, p_1(\pi)}{p_2(u_G)\, p_1(u_G)}, \quad p_3(\pi \mid u_G) = \frac{p(u_G \mid \pi)^3\, p_1(\pi)}{p_3(u_G)\, p_2(u_G)\, p_1(u_G)},    (15)

and so forth. Thus, what we wish to prove is

\lim_{t \to \infty} \frac{p(u_G \mid \pi^*)^t\, p_1(\pi^*)}{\prod_{\tau=1}^{t} p_\tau(u_G)} = 1    (16)

or, rearranging,

\lim_{t \to \infty} \prod_{\tau=1}^{t} p_\tau(u_G) = \lim_{t \to \infty} p(u_G \mid \pi^*)^t\, p_1(\pi^*).    (17)

Note that, given the stipulated relationship between p(π) on each processing iteration and p(π | u_G) on the previous iteration,

p_t(u_G) = \sum_\pi p(u_G \mid \pi)\, p_t(\pi) = \sum_\pi \frac{p(u_G \mid \pi)^2\, p_{t-1}(\pi)}{p_{t-1}(u_G)} = \sum_\pi \frac{p(u_G \mid \pi)^3\, p_{t-2}(\pi)}{p_{t-1}(u_G)\, p_{t-2}(u_G)} = \sum_\pi \frac{p(u_G \mid \pi)^4\, p_{t-3}(\pi)}{p_{t-1}(u_G)\, p_{t-2}(u_G)\, p_{t-3}(u_G)} = \dots    (18)

With this in mind, we can rewrite the product on the left-hand side of (17) as

\left[ \sum_\pi p(u_G \mid \pi)\, p_1(\pi) \right] \left[ \sum_\pi \frac{p(u_G \mid \pi)^2\, p_1(\pi)}{p_1(u_G)} \right] \left[ \sum_\pi \frac{p(u_G \mid \pi)^3\, p_1(\pi)}{p_2(u_G)\, p_1(u_G)} \right] \left[ \sum_\pi \frac{p(u_G \mid \pi)^4\, p_1(\pi)}{p_3(u_G)\, p_2(u_G)\, p_1(u_G)} \right] \dots    (19)

Note that, given (18), the numerator in each factor of (19) cancels with the denominator in the subsequent factor. The product of the first t factors therefore telescopes to \sum_\pi p(u_G \mid \pi)^t\, p_1(\pi), which can be rewritten as

p(u_G \mid \pi^*)^t \sum_\pi \left( \frac{p(u_G \mid \pi)}{p(u_G \mid \pi^*)} \right)^t p_1(\pi).    (20)

The objective is then to show that the sum in (20) approaches p_1(π*). It proceeds directly from the definition of π* that, for all π other than π*,

\frac{p(u_G \mid \pi)}{p(u_G \mid \pi^*)} < 1    (21)

Thus, all but one of the terms in the sum approach zero as t grows, and the remaining term equals p_1(π*). Thus,

\lim_{t \to \infty} \sum_\pi \left( \frac{p(u_G \mid \pi)}{p(u_G \mid \pi^*)} \right)^t p_1(\pi) = p_1(\pi^*),    (22)

which establishes (16), and with it (13).

3 Simulations

3.1 Binary choice

We begin with a simulation of a simple incentive choice situation. Here, an animal faces two levers. Pressing the left lever reliably yields a preferred food (r = 2), the right a less preferred food (r = 1). Representing these contingencies in a network structured as in Fig. 1 (left) and employing the iterative algorithm described in section 2.2 yields the results in Figure 2A. Shown here are the posterior probabilities for the policies press left and press right, along with the marginal value of p(u = 1) under these posteriors (labeled EV for expected value). The dashed horizontal line indicates the expected value for the optimal plan, to which the model obviously converges.

A key empirical assay for purposive behavior involves outcome devaluation. Here, actions yielding a previously valued outcome are abandoned after the incentive value of the outcome is reduced, for example by pairing with an aversive event (e.g., [4]). To simulate this within the binary choice scenario just described, we reduced to zero the reward value of the food yielded by the left lever (f_L), by making the appropriate change to p(u|f_L). This yielded a reversal in lever choice (Fig. 2B).
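The binary-choice and devaluation simulations just described can be sketched by combining Equation 1 with the iterative update of section 2.2. The sketch below is ours and simplifies freely: the lever-outcome contingencies are deterministic and there is a single u node, so exact Bayesian updating stands in for belief propagation, and devaluation is implemented by recomputing p(u | f_L) from a reward of zero.

```python
def choose(rewards, iters=30):
    """Single-step lever choice via the iterative algorithm of section 2.2.
    `rewards` maps each food outcome to its scalar reward R(s)."""
    r_max = max(abs(r) for r in rewards.values())
    p_u = {s: 0.5 * (r / r_max + 1) for s, r in rewards.items()}  # Equation 1
    outcome = {"press_left": "f_L", "press_right": "f_R"}         # deterministic levers
    prior = {pi: 0.5 for pi in outcome}                           # non-deterministic priors
    for _ in range(iters):
        ev = sum(p_u[outcome[pi]] * prior[pi] for pi in prior)    # p(u = 1), the "EV" curve
        prior = {pi: p_u[outcome[pi]] * prior[pi] / ev for pi in prior}  # condition on u = 1
    return prior, ev

print(choose({"f_L": 2.0, "f_R": 1.0}))  # press_left dominates, as in Fig. 2A
print(choose({"f_L": 0.0, "f_R": 1.0}))  # f_L devalued: choice reverses, as in Fig. 2B
```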
Another signature of purposive actions is that they are abandoned when their causal connection with rewarding outcomes is removed (contingency degradation, see [4]). We simulated this by starting with the model from Fig. 2A and changing the conditional probabilities at s for t = 2 to reflect a decoupling of the left action from the f_L outcome. The resulting behavior is shown in Fig. 2C.

Fig 2. Simulation results, binary choice.

3.2 Stochastic outcomes

A critical aspect of the present modeling paradigm is that it yields reward-maximizing choices in stochastic domains, a property that distinguishes it from some other recent approaches using graphical models to do planning (e.g., [9]). To illustrate, we used the architecture in Figure 1 (left) to simulate a choice between two fair coins. A 'left' coin yields $1 for heads, $0 for tails; a 'right' coin $2 for heads but a $3 loss for tails. As illustrated in Fig. 2D, the model maximizes expected value by opting for the left coin.

3.3 Sequential decision

Here, we adopt the two-step T-maze scenario used by [24] (Fig. 3A). Representing the task contingencies in a graphical model based on the template from Fig. 1 (right), and using the reward values indicated in Fig. 3A, yields the choice behavior shown in Figure 3B. Following [24], a shift in motivational state from hunger to thirst can be represented in the graphical model by changing the reward function (R(cheese) = 2, R(X) = 0, R(water) = 4, R(carrots) = 1). Imposing this change at the level of the u variables yields the choice behavior shown in Fig. 3C. The model can also be used to simulate effort-based decision making. Starting with the scenario in Fig. 3A, we simulated the insertion of an effort-demanding scalable barrier at S2 (R(S2) = -2) by making appropriate changes to p(u|s). The resulting behavior is shown in Fig. 3D.

Fig 3. Simulation results, two-step sequential choice.

A famous empirical demonstration of purposive control involves detour behavior. Using a maze like the one shown in Fig. 4A, with a food reward placed at s5, Tolman [2] found that rats reacted to a barrier at location A by taking the upper route, but to a barrier at B by taking the longer lower route. We simulated this experiment by representing the corresponding transition and reward functions in a graphical model of the form shown in Fig. 1 (right), representing the insertion of barriers by appropriate changes to the transition function. (In this simulation and the next, the set of states associated with each state node was limited to the set of reachable states for the relevant time-step, assuming an initial state of s1.) The resulting choice behavior at the critical juncture s2 is shown in Fig. 4.

Fig 4. Simulation results, detour behavior. B: No barrier. C: Barrier at A. D: Barrier at B.

Another classic empirical demonstration involves latent learning. Blodgett [25] allowed rats to explore the maze shown in Fig. 5. Later insertion of a food reward at s13 was followed immediately by dramatic reductions in running time, reflecting a reduction in entries into blind alleys. We simulated this effect in a model based on the template in Fig. 1 (right), representing the maze layout via an appropriate transition function. In the absence of a reward at s13, random choices occurred at each intersection. However, setting R(s13) = 1 resulted in the set of choices indicated by the heavier arrows in Fig. 5.

Fig 5. Latent learning.
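To give a sense of how the sequential template of Fig. 1 (right) behaves, the sketch below applies the iterative algorithm to a small two-step maze. It is ours and makes several assumptions: the maze topology is hypothetical (Fig. 3A is not reproduced in the text), the rewards are the "thirst" values quoted in section 3.3, and exhaustive enumeration over joint policy assignments stands in for belief propagation, which is feasible only because the example is tiny.

```python
import itertools
import numpy as np

# Hypothetical two-step maze. The topology (which arm leads where) is an
# assumption of ours; the rewards are the "thirst" values quoted in section 3.3.
T = {"s1": {"left": "s2", "right": "s3"},          # deterministic transitions
     "s2": {"left": "cheese", "right": "X"},
     "s3": {"left": "water", "right": "carrots"}}
R = {"s1": 0.0, "s2": 0.0, "s3": 0.0,
     "cheese": 2.0, "X": 0.0, "water": 4.0, "carrots": 1.0}
r_max = max(abs(r) for r in R.values())
p_u = {s: 0.5 * (R[s] / r_max + 1) for s in R}     # Equation 1

decision_states, actions = ["s1", "s2", "s3"], ["left", "right"]

def p_uG(assign):
    """Equation 2 for one joint policy assignment: average p(u_i = 1)
    over the states in the resulting trajectory, starting from s1."""
    trajectory, s = ["s1"], "s1"
    for _ in range(2):                              # two decision steps
        s = T[s][assign[s]]
        trajectory.append(s)
    return float(np.mean([p_u[s] for s in trajectory]))

# One distribution per policy node, initialized to uniform (step 1).
q = {s: {a: 0.5 for a in actions} for s in decision_states}

for _ in range(40):                                 # steps 2-4, iterated
    new_q = {}
    for s in decision_states:
        others = [d for d in decision_states if d != s]
        post = {}
        for a in actions:
            # p(u_G = 1 | pi_s = a), marginalizing over the current
            # distributions at the other policy nodes.
            lik = 0.0
            for combo in itertools.product(actions, repeat=len(others)):
                weight = np.prod([q[d][c] for d, c in zip(others, combo)])
                lik += weight * p_uG({**dict(zip(others, combo)), s: a})
            post[a] = lik * q[s][a]                 # condition on u_G = 1
        z = sum(post.values())
        new_q[s] = {a: post[a] / z for a in actions}
    q = new_q                                       # posteriors become new priors

print({s: max(q[s], key=q[s].get) for s in decision_states})
# -> {'s1': 'right', 's2': 'left', 's3': 'left'}: under the "thirst" rewards,
#    the policy at s1 heads toward water, consistent with the shift in Fig. 3C.
```

For anything larger one would of course run belief propagation (or a comparable algorithm) over the network itself, as the text describes, rather than enumerating joint policy assignments.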
4 Relation to previous work

Initial proposals for how to solve decision problems through probabilistic inference in graphical models, including the idea of encoding reward as the posterior probability of a random utility variable, were put forth by Cooper [23]. Related ideas were presented by Shachter and Peot [12], including the use of nodes that integrate information from multiple utility nodes. More recently, Attias [11] and Verma and Rao [9] have used graphical models to solve shortest-path problems, leveraging probabilistic representations of rewards, though not in a way that guaranteed convergence on optimal (reward-maximizing) plans. More closely related to the present research is work by Toussaint and Storkey [10], employing the EM algorithm. The iterative approach we have introduced here has a certain resemblance to the EM procedure, which becomes evident if one views the policy variables in our models as parameters on the mapping from states to actions. It seems possible that there may be a formal equivalence between the algorithm we have proposed and the one reported by [10].

As a cognitive and neuroscientific proposal, the present work bears a close relation to recent work by Hasselmo [6], addressing the prefrontal computations underlying goal-directed action selection (see also [7]). The present efforts are tied more closely to normative principles of decision-making, whereas the work in [6] is tied more closely to the details of neural circuitry. In this respect, the two approaches may prove complementary, and it will be interesting to further consider their interrelations.

Acknowledgments

Thanks to Andrew Ledvina, David Blei, Yael Niv, Nathaniel Daw, and Francisco Pereira for useful comments.

References

[1] Hull, C.L., Principles of Behavior. 1943, New York: Appleton-Century.
[2] Tolman, E.C., Purposive Behavior in Animals and Men. 1932, New York: Century.
[3] Dickinson, A., Actions and habits: the development of behavioral autonomy. Philosophical Transactions of the Royal Society (London), Series B, 1985. 308: p. 67-78.
[4] Balleine, B.W. and A. Dickinson, Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology, 1998. 37: p. 407-419.
[5] Daw, N.D., Y. Niv, and P. Dayan, Uncertainty-based competition between prefrontal and striatal systems for behavioral control. Nature Neuroscience, 2005. 8: p. 1704-1711.
[6] Hasselmo, M.E., A model of prefrontal cortical mechanisms for goal-directed behavior. Journal of Cognitive Neuroscience, 2005. 17: p. 1115-1129.
[7] Schmajuk, N.A. and A.D. Thieme, Purposive behavior and cognitive mapping: a neural network model. Biological Cybernetics, 1992. 67: p. 165-174.
[8] Tatman, J.A. and R.D. Shachter, Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man and Cybernetics, 1990. 20: p. 365-379.
[9] Verma, D. and R.P.N. Rao, Planning and acting in uncertain environments using probabilistic inference. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006.
[10] Toussaint, M. and A. Storkey, Probabilistic inference for solving discrete and continuous state Markov decision processes. In Proceedings of the 23rd International Conference on Machine Learning, 2006, Pittsburgh, PA.
[11] Attias, H., Planning by probabilistic inference. In Proceedings of the 9th Int. Workshop on Artificial Intelligence and Statistics, 2003.
[12] Shachter, R.D. and M.A. Peot, Decision making using probabilistic inference methods. In Uncertainty in Artificial Intelligence: Proceedings of the Eighth Conference (1992), 1992, Stanford University: M. Kaufmann.
[13] Chater, N., J.B. Tenenbaum, and A. Yuille, Probabilistic models of cognition: conceptual foundations. Trends in Cognitive Sciences, 2006. 10(7): p. 287-291.
[14] Doya, K., et al., eds., The Bayesian Brain: Probabilistic Approaches to Neural Coding. 2006, Cambridge, MA: MIT Press.
[15] Miller, E.K. and J.D. Cohen, An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 2001. 24: p. 167-202.
[16] Asaad, W.F., G. Rainer, and E.K. Miller, Task-specific neural activity in the primate prefrontal cortex. Journal of Neurophysiology, 2000. 84: p. 451-459.
[17] Rolls, E.T., The functions of the orbitofrontal cortex. Brain and Cognition, 2004. 55: p. 11-29.
[18] Padoa-Schioppa, C. and J.A. Assad, Neurons in the orbitofrontal cortex encode economic value. Nature, 2006. 441: p. 223-226.
[19] Gopnik, A., et al., A theory of causal learning in children: causal maps and Bayes nets. Psychological Review, 2004. 111: p. 1-31.
[20] Hamilton, A.F.d.C. and S.T. Grafton, Action outcomes are represented in human inferior frontoparietal cortex. Cerebral Cortex, 2008. 18: p. 1160-1168.
[21] Johnson, A., M.A.A. van der Meer, and D.A. Redish, Integrating hippocampus and striatum in decision-making. Current Opinion in Neurobiology, 2008. 17: p. 692-697.
[22] Jensen, F.V., Bayesian Networks and Decision Graphs. 2001, New York: Springer Verlag.
[23] Cooper, G.F., A method for using belief networks as influence diagrams. In Fourth Workshop on Uncertainty in Artificial Intelligence, 1988, University of Minnesota, Minneapolis.
[24] Niv, Y., D. Joel, and P. Dayan, A normative perspective on motivation. Trends in Cognitive Sciences, 2006. 10: p. 375-381.
[25] Blodgett, H.C., The effect of the introduction of reward upon the maze performance of rats. University of California Publications in Psychology, 1929. 4: p. 113-134.