Modeling Interaction via the Principle of Maximum Causal Entropy

Brian D. Ziebart (bziebart@cs.cmu.edu)
J. Andrew Bagnell (dbagnell@ri.cmu.edu)
Anind K. Dey (anind@cs.cmu.edu)
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract

The principle of maximum entropy provides a powerful framework for statistical models of joint, conditional, and marginal distributions. However, there are many important distributions with elements of interaction and feedback where its applicability has not been established. This work presents the principle of maximum causal entropy, an approach based on causally conditioned probabilities that can appropriately model the availability and influence of sequentially revealed side information. Using this principle, we derive models for sequential data with revealed information, interaction, and feedback, and demonstrate their applicability for statistically framing inverse optimal control and decision prediction tasks.

1. Introduction

The principle of maximum entropy (Jaynes, 1957) serves a foundational role in the theory and practice of constructing statistical models, with applicability to statistical mechanics, natural language processing, econometrics, and ecology (Dudík & Schapire, 2006). Conditional extensions of the principle that consider a sequence of side information (i.e., additional variables that are not predicted, but are related to the variables that are predicted), and specifically conditional random fields (Lafferty et al., 2001), have been applied with remarkable success to recognition, segmentation, and classification tasks, and are a preferred tool in natural language processing and activity recognition.
This work extends the maximum entropy approach to conditional probability distributions in settings characterized by interaction with stochastic processes where side information is dynamic, i.e., revealed over time. For example, consider an agent interacting with a stochastic environment. The agent may have a model for the distribution of future states given its current state and possible actions, but, due to stochasticity, it does not know what value a future state will take until after selecting the sequence of actions temporally preceding it. Thus, future states have no causal influence over earlier actions. Conditional maximum entropy approaches are ill-suited for this setting because they assume that all side information is available a priori.

Building on the Marko-Massey theory of directed information (Massey, 1990), we present the principle of maximum causal entropy (MaxCausalEnt). It prescribes a probability distribution by maximizing the entropy of a sequence of variables conditioned on the side information available at each time step. This contribution extends the maximum entropy framework for statistical modeling to processes with information revelation, feedback, and interaction. We motivate and apply this approach on decision prediction tasks, where actions stochastically influence revealed side information (the state of the world), with examples from inverse optimal control, multi-player dynamic games, and interaction with partially observable systems. Though we focus on the connection to decision making and control in this work, the principle of maximum causal entropy is not specific to those domains. It is a general approach applicable to any sequential data for which it is reasonable to assume that future side information has no causal influence over earlier variables.

2. Maximum Causal Entropy

Motivated by the task of modeling decisions with elements of sequential interaction, we introduce the principle of maximum causal entropy, describe its core theoretical properties, and provide efficient algorithms for inference and learning.

2.1. Preliminaries

When faced with an ill-posed problem, the principle of maximum entropy (Jaynes, 1957) prescribes the use of "the least committed" probability distribution that is consistent with known problem constraints. This criterion is formally measured by Shannon's information entropy, $E_Y[-\log P(Y)]$, and many of the fundamental building blocks of statistics, including Gaussian and Markov random field distributions, maximize this entropy subject to moment constraints. In the presence of side information, $X$, that we do not desire to model, the standard prescription is to maximize the conditional entropy, $E_{Y,X}[-\log P(Y|X)]$, yielding, for example, the conditional random field (CRF) (Lafferty et al., 2001).

Though our intention is similarly to model conditional probability distributions, CRFs assume knowledge of future side information, $X_{t+1:T}$, for each $Y_t$, which does not match settings with dynamically revealed information. Despite not being originally formulated for such uses, marginalizing over the CRF's joint distribution is possible:

    P(Y_t | X_{1:t}, Y_{1:t-1}) \propto \sum_{X_{t+1:T}, Y_{t+1:T}} e^{\theta^\top F(X,Y)} \, P(X_{t+1:T} | X_{1:t}, Y_{1:t-1}),        (1)

in what we refer to as a latent CRF model. However, we argue that entropy-based approaches such as this, which do not address the causal influence of side information, are inadequate for interactive settings.

In the context of this paper, we focus on modeling the sequential actions of an agent interacting with a stochastic environment. Thus, we replace the predicted variables, $Y$, and the side information, $X$, with sequences of action variables, $A$, and state variables, $S$.

2.2. Directed Information and Causal Entropy

The causally conditioned probability (Kramer, 1998) from the Marko-Massey theory of directed information (Massey, 1990) is a natural extension of the conditional probability, $P(A|S)$, to the situation where each $A_t$ is conditioned on only a portion of the $S$ variables, $S_{1:t}$, rather than the entirety, $S_{1:T}$. Following the previously developed notation (Kramer, 1998), the probability of $A$ causally conditioned on $S$ is

    P(A^T || S^T) = \prod_{t=1}^{T} P(A_t | S_{1:t}, A_{1:t-1}).        (2)

The subtle but significant difference from the conditional probability, $P(A|S) = \prod_{t=1}^{T} P(A_t | S_{1:T}, A_{1:t-1})$, serves as the underlying basis for our approach.

Causal entropy (Kramer, 1998; Permuter et al., 2008),

    H(A^T || S^T) = E_{A,S}[-\log P(A^T || S^T)] = \sum_{t=1}^{T} H(A_t | S_{1:t}, A_{1:t-1}),        (3)

measures the uncertainty present in the causally conditioned distribution. It upper bounds the conditional entropy; intuitively, this reflects the fact that additionally conditioning on information from the future (i.e., acausally) can only decrease uncertainty. The causal entropy has previously found applicability in the analysis of communication channels with feedback (Kramer, 1998), decentralized control (Tatikonda & Mitter, 2004), and sequential investment and online compression with side information (Permuter et al., 2008).

Using this notation, any joint distribution can be expressed as $P(A, S) = P(A^T || S^T)\, P(S^T || A^{T-1})$. Our approach estimates a policy, i.e., the factors $P(A_t | S_{1:t}, A_{1:t-1})$ of $P(A^T || S^T)$, based on a provided (explicitly or implicitly) distribution of side information, $P(S^T || A^{T-1}) = \prod_t P(S_t | A_{1:t-1}, S_{1:t-1})$.
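As a concrete illustration of Equations 2 and 3 (our own sketch, not part of the original text), the following Python snippet evaluates the causally conditioned probability and the causal entropy of a Markovian policy under known Markovian dynamics; the two-state, two-action arrays are arbitrary placeholders.

    import numpy as np

    # Toy Markovian policy pi[s, a] = P(A_t = a | S_t = s) and dynamics
    # T_dyn[s, a, s'] = P(S_{t+1} = s' | S_t = s, A_t = a); all values are arbitrary.
    pi = np.array([[0.7, 0.3],
                   [0.4, 0.6]])
    T_dyn = np.array([[[0.9, 0.1], [0.2, 0.8]],
                      [[0.5, 0.5], [0.3, 0.7]]])
    p0 = np.array([1.0, 0.0])                  # initial state distribution
    horizon = 3

    def causally_conditioned_prob(states, actions):
        """P(A^T || S^T) = prod_t P(A_t | S_{1:t}, A_{1:t-1})  (Equation 2);
        for a Markovian policy this is simply prod_t pi[s_t, a_t]."""
        return np.prod([pi[s, a] for s, a in zip(states, actions)])

    def causal_entropy():
        """H(A^T || S^T) = E[-log P(A^T || S^T)]  (Equation 3), computed exactly
        by enumerating all state/action sequences up to the horizon."""
        total = 0.0

        def recurse(t, weight, s):
            nonlocal total
            if t == horizon:
                return
            for a in range(pi.shape[1]):
                w = weight * pi[s, a]          # probability of the prefix ending in (s, a)
                total -= w * np.log(pi[s, a])  # contribution to H(A_t | S_{1:t}, A_{1:t-1})
                for s_next in range(T_dyn.shape[2]):
                    recurse(t + 1, w * T_dyn[s, a, s_next], s_next)

        for s, w in enumerate(p0):
            if w > 0:
                recurse(0, w, s)
        return total

    print(causally_conditioned_prob([0, 1, 1], [1, 0, 1]), causal_entropy())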
2.3. Maximum Causal Entropy Optimization

With the causal entropy (Equation 3) as our objective function, we now pose and solve the maximum causal entropy optimization problem. We constrain our distribution to match expected feature functions, $F(S, A)$, with empirical expectations of those same functions, $\tilde{E}_{S,A}[F(S,A)]$, yielding the following optimization problem:

    argmax_{\{P(A_t | S_{1:t}, A_{1:t-1})\}}  H(A^T || S^T)        (4)
    such that:  E_{S,A}[F(S,A)] = \tilde{E}_{S,A}[F(S,A)]  and  \forall_{S_{1:t}, A_{1:t-1}}  \sum_{A_t} P(A_t | S_{1:t}, A_{1:t-1}) = 1,
    and given:  P(S^T || A^{T-1}).

We assume for explanatory simplicity that the feature functions factor as $F(S,A) = \sum_t F(S_t, A_t)$ and that the state dynamics are first-order Markovian, $P(S^T || A^{T-1}) = \prod_t P(S_t | A_{t-1}, S_{t-1})$.

Theorem 1. The distribution satisfying the maximum causal entropy constrained optimization (Equation 4) has a form defined recursively as:

    P_\theta(A_t | S_t) = Z_{A_t|S_t,\theta} / Z_{S_t,\theta},        (5)
    \log Z_{A_t|S_t,\theta} = \theta^\top F(S_t, A_t) + \sum_{S_{t+1}} P(S_{t+1} | S_t, A_t)\, \log Z_{S_{t+1},\theta},
    \log Z_{S_t,\theta} = \log \sum_{A_t} Z_{A_t|S_t,\theta} = softmax_{A_t} \log Z_{A_t|S_t,\theta},

where $softmax_x f(x) \triangleq \log \sum_x e^{f(x)}$.

Proof (sketch). The (negated) primal objective function (Equation 4) is convex in the variables $P(A||S)$ and subject to linear constraints on feature function expectation matching, valid probability distributions, and the non-causal influence of future side information. Differentiating the Lagrangian of the maximum causal entropy optimization (Equation 4) and equating to zero, we obtain the general form:

    P_\theta(A_t | S_t) \propto \exp\Big\{ \theta^\top E_{S,A}[F(S,A) \,|\, S_t, A_t] - \sum_{\tau > t} E_{S,A}[\log P(A_\tau | S_\tau) \,|\, S_t, A_t] \Big\}.        (6)

Substituting the more operational recurrence of Equation 5 into Equation 6 verifies the theorem.

We note that Theorem 1 relies on strong duality to identify the form of this probability distribution; the sharp version of Slater's condition (Boyd & Vandenberghe, 2004), using the existence of a feasible point in the relative interior, ensures this, but requires that (1) the prescribed feature constraints are achievable, and (2) the distribution has full support. The first follows naturally if both model and empirical expectations are taken with respect to the provided model of side information, $P(S^T || A^{T-1})$. For technical simplicity in this work, we further assume full support for the modeled distribution, although relatively simple modifications (e.g., requiring the constraints to hold only within a small deviation $\epsilon$) ensure the correctness of this form in all cases.

Theorem 2. The gradient of the dual with respect to $\theta$ is $\tilde{E}_{S,A}[F(S,A)] - E_{S,A}[F(S,A)]$, which is the difference between the empirical feature vector and the expected feature vector under the probabilistic model induced by the complete policy, $\{P_\theta(A_t|S_t)\}$.

In many instances, the statistics of interest for the gradient (Theorem 2), $\tilde{E}_{S,A}[F(S,A)]$, are only known approximately, as they are obtained from small sample sets. We note that this uncertainty can be rigorously addressed by extending the duality analysis of Dudík & Schapire (2006), leading to parameter regularization that may be naturally adopted in the causal setting as well.

Theorem 3. The maximum causal entropy distribution minimizes the worst case prediction log-loss, i.e.,

    \inf_{P(A||S)} \sup_{\tilde{P}(A||S)} \sum_{A,S} \tilde{P}(A,S) \big[ -\log P(A^T || S^T) \big],

given $\tilde{P}(A,S) = \tilde{P}(A^T || S^T)\, P(S^T || A^{T-1})$ and feature expectations $E_{\tilde{P}(S,A)}[F(S,A)]$, when $S$ is sequentially revealed from a known distribution and actions are sequentially predicted using only previously revealed variables.

Theorem 3 follows naturally from Grünwald & Dawid (2003) and extends their "robust Bayes" results to the interactive setting as one justification for the maximum causal entropy approach. The theorem can be understood by viewing maximum causal entropy as a maximin game in which nature chooses a distribution to maximize a predictor's perplexity while the predictor tries to minimize it. By duality, the minimax view of the theorem is equivalent. This strong result is not shared when maximizing alternate entropy measures (e.g., conditional or joint entropy) and marginalizing out future side information (as in Equation 1).
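To make the recursion of Theorem 1 concrete, the following worked unrolling (our own illustration, following directly from Equation 5 rather than appearing in the original statement) shows the base case at the final time step and one backward step:

    t = T:      \log Z_{A_T|S_T,\theta} = \theta^\top F(S_T, A_T),
                \log Z_{S_T,\theta} = softmax_{A_T} \theta^\top F(S_T, A_T),
                P_\theta(A_T | S_T) = \exp\big( \theta^\top F(S_T, A_T) - \log Z_{S_T,\theta} \big);

    t = T-1:    \log Z_{A_{T-1}|S_{T-1},\theta} = \theta^\top F(S_{T-1}, A_{T-1}) + \sum_{S_T} P(S_T | S_{T-1}, A_{T-1})\, \log Z_{S_T,\theta},
                P_\theta(A_{T-1} | S_{T-1}) = \exp\big( \log Z_{A_{T-1}|S_{T-1},\theta} - \log Z_{S_{T-1},\theta} \big).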
2.4. Inference and Learning Algorithms

The procedure for inferring decision probabilities in the MaxCausalEnt model based on Theorem 1 is illustrated by Algorithm 1.

Algorithm 1 MaxCausalEnt Inference Procedure
1: for t = T to 1 do
2:   if t = T then
3:     \forall_{A_t, S_t}:  \log Z_{A_t|S_t,\theta} \leftarrow \theta^\top F(A_t, S_t)
4:   else
5:     \forall_{A_t, S_t}:  \log Z_{A_t|S_t,\theta} \leftarrow \theta^\top F(A_t, S_t) + E_{S_{t+1}}[\log Z_{S_{t+1},\theta} \,|\, S_t, A_t]
6:   end if
7:   \forall_{S_t}:  \log Z_{S_t,\theta} \leftarrow softmax_{A_t} \log Z_{A_t|S_t,\theta}
8:   \forall_{A_t, S_t}:  P_\theta(A_t|S_t) \leftarrow Z_{A_t|S_t,\theta} / Z_{S_t,\theta}
9: end for

Using the resulting action distributions and the empirical feature expectations, $\tilde{E}[F]$, the gradient is obtained by employing Algorithm 2.

Algorithm 2 MaxCausalEnt Gradient Calculation
1: for t = 1 to T do
2:   if t = 1 then
3:     \forall_{S_t, A_t}:  D_{S_t,A_t} \leftarrow P(S_t)\, P_\theta(A_t|S_t)
4:   else
5:     \forall_{S_t, A_t}:  D_{S_t,A_t} \leftarrow \sum_{S_{t-1}, A_{t-1}} D_{S_{t-1},A_{t-1}}\, P(S_t | S_{t-1}, A_{t-1})\, P_\theta(A_t|S_t)
6:   end if
7:   E_\theta[F] \leftarrow E_\theta[F] + \sum_{S_t, A_t} D_{S_t,A_t}\, F(A_t, S_t)
8: end for
9: \nabla_\theta \log P_\theta(\tilde{A} || \tilde{S}) \leftarrow \tilde{E}[F] - E_\theta[F]

As a consequence of convexity, standard gradient-based optimization techniques converge to the maximum likelihood estimate of the feature weights, $\hat{\theta}$.
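The following minimal sketch (our own illustration, with arbitrary placeholder dynamics, features, and learning-rate settings) implements Algorithms 1 and 2 for a small tabular problem and couples them in a plain gradient-ascent loop per Theorem 2.

    import numpy as np
    from scipy.special import logsumexp

    # Placeholder tabular problem: n_s states, n_a actions, horizon T, k features.
    n_s, n_a, T, k = 3, 2, 5, 2
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] = P(S_{t+1}=s' | S_t=s, A_t=a)
    F = rng.normal(size=(n_s, n_a, k))                 # F[s, a] = feature vector F(S_t, A_t)
    p0 = np.full(n_s, 1.0 / n_s)                       # initial state distribution

    def backward_pass(theta):
        """Algorithm 1: compute the stochastic policy P_theta(A_t | S_t) for t = 1..T."""
        policy = np.zeros((T, n_s, n_a))
        log_Z_state = np.zeros(n_s)                    # log Z_{S_{t+1}}; zero past the horizon
        for t in reversed(range(T)):
            log_Z_sa = F @ theta                       # theta^T F(S_t, A_t)
            if t < T - 1:
                log_Z_sa += P @ log_Z_state            # + E_{S_{t+1}}[log Z_{S_{t+1}} | S_t, A_t]
            log_Z_state = logsumexp(log_Z_sa, axis=1)  # softmax over actions
            policy[t] = np.exp(log_Z_sa - log_Z_state[:, None])
        return policy

    def expected_features(policy):
        """Algorithm 2: expected feature vector E_theta[F] under the policy and dynamics."""
        total = np.zeros(k)
        D = p0[:, None] * policy[0]                    # D[s, a] = P(S_t=s, A_t=a)
        for t in range(T):
            total += np.einsum('sa,saf->f', D, F)
            if t < T - 1:
                D = np.einsum('sa,sap->p', D, P)[:, None] * policy[t + 1]
        return total

    # "Demonstrated" feature expectations from a hypothetical expert with weights theta_true,
    # so the constraints are feasible; in practice these come from observed trajectories.
    theta_true = rng.normal(size=k)
    F_empirical = expected_features(backward_pass(theta_true))

    theta = np.zeros(k)
    for step in range(200):                            # gradient ascent (Theorem 2)
        theta += 0.1 * (F_empirical - expected_features(backward_pass(theta)))

    # The learned weights match the demonstrated feature expectations (Equation 4's constraint).
    print(expected_features(backward_pass(theta)) - F_empirical)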
2.5. Graphical Representation

We extend the influence diagram graphical framework (Howard & Matheson, 1984) as a convenient representation for the MaxCausalEnt variables and their relationships. The structural elements of the representation are outlined in Table 1.

Table 1. Graphical representation structural elements.

    Type          Parent relationship
    Decision      Specifies the variables observed when A is selected
    Uncertainty   Specifies the conditional probability, P(S | par(S))
    Utility       Specifies the feature functions, F(par(U))

We constrain the graph to possess perfect recall.[1] Every graphical representation can then be reduced to the earlier causal entropy form (Equation 4) by marginalizing over each decision's non-parent uncertainty nodes to obtain side information transition dynamics and expected feature functions. Whereas in traditional influence diagrams the parent-dependent values for decision nodes that provide the highest expected utility are inferred, in the MaxCausalEnt setting utilities should be interpreted as potentials for inferring a probability distribution over decisions.

[1] Variables observed during previous decisions are either observed in future decisions or irrelevant (i.e., d-separated from future value nodes by observed variables).

3. Applications

We now present a series of applications with increasing complexity of interaction: (1) control with stochastic dynamics; (2) multiple agent interaction; and (3) interaction with a partially observable system.

3.1. Inverse Optimal Stochastic Control

Optimal control frameworks, such as Markov decision processes (MDPs) and linear-quadratic regulators (LQRs), provide rich representations of interactions with stochastic systems. Inverse optimal control (IOC) is the problem of recovering a cost function that makes a particular controller or policy (nearly) optimal (Kalman, 1964; Boyd et al., 1994). Recent work has demonstrated that IOC is a powerful technique for modeling the decision-making behavior of intelligent agents in problems as diverse as robotics (Ratliff et al., 2009), personal navigation (Ziebart et al., 2008), and cognitive science (Ullman et al., 2009).

Many IOC approaches (Abbeel & Ng, 2004; Ziebart et al., 2008) consider cost functions linear in a set of features and attempt to find behaviors that induce the same feature counts as the policy to be mimicked, $E[\sum_t f_{S_t}] = \tilde{E}[\sum_t f_{S_t}]$; by linearity, such behaviors achieve the same expected value. For settings with vectors of continuous states and actions, matching quadratic moments, e.g., $E[\sum_t s_t s_t^\top] = \tilde{E}[\sum_t s_t s_t^\top]$, guarantees equivalent performance under quadratic cost functions, e.g., $\sum_t s_t^\top Q s_t$ for unknown $Q$.

Unfortunately, matching feature counts is fundamentally ill-posed: usually no truly optimal policy will achieve those feature counts, but many stochastic policies (and policy mixtures) will satisfy the constraint. Ziebart et al. (2008) resolve this ambiguity by using the classical maximum entropy criterion to select a single policy from all the distributions over decisions that match feature counts. However, for inverse optimal stochastic control (IOSC), characterized by stochastic state dynamics, their proposed approach amounts to marginalizing over future state variables (Equation 1). For IOSC, the maximum causal entropy approach provides prediction guarantees (Theorem 3) and a softened interpretation of optimal decision theory, while the latent CRF approach provides neither.
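To spell out the linearity argument above (a small added derivation, not in the original text): for any cost weight vector $\theta$, matching feature expectations implies matching expected cost, since

    E\big[\sum_t \theta^\top f_{S_t}\big] = \theta^\top E\big[\sum_t f_{S_t}\big] = \theta^\top \tilde{E}\big[\sum_t f_{S_t}\big] = \tilde{E}\big[\sum_t \theta^\top f_{S_t}\big],

and the analogous identity with $s_t s_t^\top$ and $\mathrm{tr}(Q\, s_t s_t^\top) = s_t^\top Q s_t$ covers the quadratic case.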
3.1.1. MDP and LQR Formulations

In the IOSC problem, side information (states) and decisions (actions) are inter-dependent, with the distribution of side information provided by the known dynamics, $P(S^T || A^{T-1}) = \prod_t P(S_t | S_{t-1}, A_{t-1})$. We employ the MaxCausalEnt framework to model this setting, as shown in Figure 1.

Figure 1. The graphical representation for MaxCausalEnt inverse optimal control. For MDPs, $U_t(S_t, A_t) = \theta^\top f_{S_t}$; for LQRs, $U_t(s_t, a_t) = s_t^\top Q s_t + a_t^\top R a_t$.

Using the action-based cost-to-go ($Q^{soft}$) and state-based value ($V^{soft}$) notation, the inference procedure for MDP MaxCausalEnt IOC reduces to:

    Q^{soft}(A_t, S_t) = E_{S_{t+1}}[V^{soft}(S_{t+1}) \,|\, S_t, A_t]        (7)
    V^{soft}(S_t) = softmax_{A_t} Q^{soft}(A_t, S_t) + \theta^\top f_{S_t},

and, for the continuous, quadratic-reward setting, to:

    Q^{soft}(a_t, s_t) = E_{s_{t+1}}[V^{soft}(s_{t+1}) \,|\, s_t, a_t] + a_t^\top R a_t        (8)
    V^{soft}(s_t) = softmax_{a_t} Q^{soft}(a_t, s_t) + s_t^\top Q s_t.

Note that by replacing the softmax[2] function with the maximum, this algorithm becomes equivalent to the (stochastic) value iteration algorithm (Bellman, 1957) for finding the optimal control policy. The relative magnitudes of the action values in the MaxCausalEnt model control the amount of stochasticity in the resulting action policy, $\pi(a|s) \propto e^{Q^{soft}(a,s)}$.

[2] The continuous version of the softened maximum is defined as: $softmax_x f(x) \triangleq \log \int_x e^{f(x)}\, dx$.

For the special case where the dynamics are linear functions with Gaussian noise, the quadratic MaxCausalEnt model permits a closed-form solution: given dynamics $s_{t+1} \sim N(A s_t + B a_t, \Sigma)$, Equation 8 reduces to

    Q^{soft}(a_t, s_t) = [a_t; s_t]^\top \begin{bmatrix} B^\top D B + R & B^\top D A \\ A^\top D B & A^\top D A \end{bmatrix} [a_t; s_t]
    V^{soft}(s_t) = s_t^\top (C_{s,s} + Q - C_{a,s}^\top C_{a,a}^{-1} C_{a,s})\, s_t + \mathrm{const},

where $C$ and $D$ are recursively computed as: $C_{a,a} = B^\top D B + R$; $C_{a,s} = B^\top D A$; $C_{s,s} = A^\top D A$; and $D = C_{s,s} + Q - C_{a,s}^\top C_{a,a}^{-1} C_{a,s}$.
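A minimal sketch (our own, with arbitrary two-state, one-action placeholder matrices) of the backward C/D recursion above. It also forms the resulting stochastic policy; under these quadratic forms, $\pi(a|s) \propto e^{Q^{soft}(a,s)}$ works out to a Gaussian with mean equal to the usual LQR action and covariance $-\tfrac{1}{2} C_{a,a}^{-1}$, which is our own reading rather than a statement from the paper.

    import numpy as np

    # Placeholder linear-Gaussian dynamics s_{t+1} ~ N(A s + B a, Sigma) and
    # quadratic rewards s^T Q_r s + a^T R_r a (negative definite, i.e., costs).
    A = np.array([[1.0, 0.1],
                  [0.0, 1.0]])
    B = np.array([[0.0],
                  [0.1]])
    Q_r = -np.eye(2)
    R_r = -0.1 * np.eye(1)
    T = 50

    # Backward recursion for C and D (Section 3.1.1); record, per time step,
    # the policy mean gain K_t and the Gaussian policy covariance -0.5 * C_aa^{-1}.
    D = np.zeros((2, 2))
    gains, covs = [], []
    for _ in range(T):
        C_aa = B.T @ D @ B + R_r
        C_as = B.T @ D @ A
        C_ss = A.T @ D @ A
        K = np.linalg.solve(C_aa, C_as)           # mean action is a_t = -K s_t
        gains.append(K)
        covs.append(-0.5 * np.linalg.inv(C_aa))   # C_aa is negative definite here
        D = C_ss + Q_r - C_as.T @ K
    gains.reverse()
    covs.reverse()

    # Roll out the resulting stochastic policy from an initial state.
    rng = np.random.default_rng(0)
    s = np.array([1.0, 0.0])
    for t in range(T):
        a = rng.multivariate_normal(-gains[t] @ s, covs[t])
        s = A @ s + B @ a + 0.01 * rng.standard_normal(2)
    print(s)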
3.1.2. Inverse Helicopter Control

We demonstrate the MaxCausalEnt approach to IOSC on the problem of building a controller for a helicopter with linearized stochastic dynamics. Existing approaches to IOSC (Ratliff et al., 2006; Abbeel & Ng, 2004) have both practical and theoretical difficulties in the presence of imperfect demonstrated behavior, leading to unstable controllers due to large changes in cost weights (Abbeel et al., 2007) or to poor predictive accuracy (Ratliff et al., 2006). To test the robustness of our approach, we generated five 100-time-step sub-optimal training trajectories (Figure 2) by noisily sampling actions from an optimal LQR controller designed for hovering, using the linearized stochastic helicopter simulator of Abbeel et al. (2007).

Figure 2. Left: An example sub-optimal helicopter trajectory attempting to hover around the origin point. Right: The average cost under the original cost function of: (1) demonstrated trajectories; (2) the optimal controller using the inverse optimal control model; and (3) the optimal controller using the maximum causal entropy model.

We contrast the policies obtained from the maximum margin planning model (Ratliff et al., 2006) (labeled InvOpt in Figure 2) and the MaxCausalEnt model, each trained using the demonstrated trajectories. Using the true cost function, we measure the cost of trajectories sampled from the optimal policy under the cost function learned by each model. The InvOpt model performs poorly because there is no optimal trajectory for any cost function that matches the demonstrated features. In contrast, by design the MaxCausalEnt model is guaranteed to match the performance of the demonstrated behavior (Demo) in expectation, even if that behavior is sub-optimal. Additionally, because of the quadratic cost function, the optimal controller using the MaxCausalEnt cost function is always at least as good as the demonstrated behavior on the original, unknown cost function, and often better, as shown in Figure 2. In this sense, MaxCausalEnt provides a rigorous approach to learning a cost function for such stochastic optimal control problems: it is both predictive and can guarantee good performance of the learned controller.

3.2. Inverse Dynamic Games

Modeling the interactions of multiple agents is an important task for uncovering the motives of negotiating parties, planning a robot's movement in a crowded environment, and assessing the perceived roles of interacting agents (Ullman et al., 2009). While game and decision theory can prescribe equilibria or optimal policies when the utilities of agents are known, often the utilities are not known and only observed behavior is available for modeling tasks. We investigate recovering the agents' utilities from those observations.

3.2.1. Markov Game Formulation

We consider a Markov game where agents act in sequence, taking turns after observing the preceding agents' actions and the resulting stochastic outcome sampled from known state dynamics, $P(S_{t+1} | A_t, S_t)$. Agents are assumed to act based on a utility function that is linear in features associated with each state, $w_i^\top f_S$, and to know the other agents' utilities and policies. Learning a single agent's MaxCausalEnt policy, $\pi_i$, given the other agents' policies, $\pi_{-i}$, reduces to an IOSC problem in which the causal entropy of the agent's actions given the state is maximized while matching state features:

    argmax_{\pi_i(A|S)}  H(A^{(i)} || S^{(i)})        (9)
    such that:  E[\sum_t f_i(S_t)] = \tilde{E}[\sum_t f_i(S_t)],
    and given:  \pi_{-i}(A|S) and P(S||A).

The distribution of side information (i.e., the agent's next state, $S_{t+N}$, given the agent's previous state and action, $S_t$ and $A_t$) is obtained by marginalizing over the other agents' states and actions:

    P(S_{t+N} | A_t, S_t) = E_{S_{t+1:t+N-1}, A_{t+1:t+N-1}} \big[ P(S_{t+N} | S_{t+N-1}, A_{t+N-1}) \,\big|\, S_t, A_t \big].

Difficulties arise, however, because the policies of the other agents are not known in our setting. Instead, all of the agents' utilities and policies are learned from demonstrated trajectories. Maximizing the joint causal entropy of all agents' actions leads either to non-convexity (multiplicative functions of agent policies in the feature matching constraints) or to an assumption that agents share a common utility function. We settle for potentially non-optimal solution policies by employing a cyclic coordinate descent approach. It iteratively maximizes the causal entropy of agent $i$'s actions subject to constraints:

    \theta_i \leftarrow argmax_{\theta_i}  F_p(\pi_i \,|\, \hat{\pi}_{-i}),

where $F_p$ is the Lagrangian primal of Equation 9. However, instead of matching the observed feature counts, which may be infeasible given the estimate $\hat{\pi}_{-i}$, the expectation of agent $i$'s features given its empirical actions under the current estimate of the agent policies, $\tilde{F}(S,A) = E_{S,A}[\sum_t f_{S_t} \,|\, \hat{\pi}]$, is matched.
3.2.2. Pursuit-Evasion Modeling

We consider a generalization of the pursuit-evasion multi-agent setting (Parsons, 1976) with three agents operating in a four-by-four grid world (Figure 3). Each agent has a mobility, $m_i \in [0.2, 1]$, which corresponds to the probability of success when attempting to move in one of the four cardinal directions, and a utility, $w_{i,j} \in [-1, 1]$, for being co-located with each of the other agents. Unlike traditional pursuit-evasion, there is no "capture" event in this game; agents continue to act after being co-located. The mobilities of each agent and a time sequence of their actions and locations are provided, and the task is to predict their future actions.

Figure 3. The pursuit-evasion grid with three agents and their co-location utilities. Agent X has a mobility of $m_X$ and a utility of $w_{X,Y}$ when co-located with agent Y.

We generate data for this setting using the following procedure. First, mobilities and co-location utilities are sampled (uniformly from their domains). Next, optimal policies[3], $\pi_i^t(A|S)$, for a range of time horizons $t \in \{T_0, ..., T_0 + T\}$, are computed with complete knowledge of the other agents' utilities and future policies. A stochastic policy, $\tilde{\pi}_i(A|S)$, is obtained by first sampling a time horizon from $P(t) = \frac{1}{T}$ and then sampling an action from the optimal policy for that time horizon. Lastly, starting from random initial locations, trajectories of length 40 time steps (five for training and one for testing) are sampled from the policy and state dynamics for six different parameter settings. Despite its simplicity, this setting produces surprisingly rich behavior. For example, under certain optimal policies, a first evader will help its pursuer corner a more desirable second evader so that the first evader will be spared.

[3] "Ties" in action value lead to uniform distributions over the corresponding actions.

A comparison between the latent CRF model (Equation 1), trained to maximize data likelihood, and the maximum causal entropy model is shown in Figure 4, using the average per-action perplexity, $e^{-\frac{1}{T} \sum_t \log P(a_t|s_t)}$, as the evaluation metric of predictive performance.

Figure 4. The average per-action perplexities of the latent CRF model and the MaxCausalEnt model plotted against each other for three agents from six different pursuit-evasion settings. The MaxCausalEnt model outperforms the latent CRF model in the region below the dotted line.

MaxCausalEnt consistently outperforms the latent CRF model. The explanation for this is based on the observation that under the latent CRF model, the action-value of Equation 7 is instead: $Q(a_t, s_t) = softmax_{s_{t+1}} (V(s_{t+1}) + \log P(s_{t+1}|s_t, a_t))$. This has the disconcerting interpretation that the agent chooses the next state by "paying" an extra $\log P(s_{t+1}|s_t, a_t)$ penalty to ignore the true stochasticity of the state transition dynamics, resulting in an unrealistic probability distribution.
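For concreteness, the small helper below (our own illustration) computes the per-action perplexity used above from a list of predicted probabilities of the observed actions; values closer to 1 indicate more confident, correct predictions.

    import math

    def per_action_perplexity(action_probs):
        """Perplexity e^{-(1/T) * sum_t log P(a_t | s_t)} of the observed actions."""
        avg_log_prob = sum(math.log(p) for p in action_probs) / len(action_probs)
        return math.exp(-avg_log_prob)

    # A model that assigns probability 0.5 to every observed action has perplexity 2.
    print(per_action_perplexity([0.5, 0.5, 0.5, 0.5]))  # -> 2.0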
3.3. Inverse Diagnostics

Many important interaction tasks involve partial observability. In medical diagnosis, for example, sequences of symptoms, tests, and treatments are used to identify (and mediate) unknown illnesses. Motivated by the objective of learning diagnosis policies from experts, we investigate the inverse diagnostics problem of modeling interaction with partially observed systems.

3.3.1. Bayes Net Diagnosis Formulation

We consider a set of variables (distributed according to a known dynamic Bayesian network) that evolve over time based in part on the employed actions, $A_{1:T}$, as shown in Figure 5. Those actions are made with only partial knowledge of the Bayes net variables, as relayed through observation variables, $O_{1:T}$. Previous actions determine what information from the hidden variables is revealed in the next time step.

Figure 5. The MaxCausalEnt representation of the diagnostic problem, with an abstract dynamic Bayesian network represented as BN nodes. Perfect recall edges from all past observations and actions to future actions are suppressed.

We assume that the utility for the state of the Bayes net and the actions invoked is an unknown linear function of known feature vectors, $f_{BN_t, A_t}$. We formulate the modeling problem as a maximum causal entropy optimization by maximizing the causally conditioned entropy of actions given observations, $H(A^T || O^T)$. We marginalize over the latent Bayes net variables to obtain the side information dynamics, $P(O_{t+1} | O_{1:t}, A_{1:t}) = E_{BN_{1:t}}[P(O_{t+1} | par(O_{t+1})) \,|\, O_{1:t}, A_{1:t}]$, and the expected feature functions, $E[f_t | O_{1:t}, A_{1:t}] = E_{BN_{1:t}}[F_{U_t}(par(U_t)) \,|\, O_{1:t}, A_{1:t}]$, and then estimate the full-history policy, $P(A_t | O_{1:t}, A_{1:t-1})$, using Algorithm 1 and Algorithm 2.
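A minimal sketch (our own, with made-up binary-style variables collapsed into a single discrete latent state) of the kind of marginalization described above: maintaining a posterior over the latent variables given the observation and action history, from which the observation dynamics $P(O_{t+1} | O_{1:t}, A_{1:t})$ and expected features can be computed. This is a generic discrete filtering step under assumed transition and observation tables, not code from the paper.

    import numpy as np

    # Hypothetical discrete latent diagnosis state x, actions a, observations o.
    n_x, n_a, n_o = 4, 2, 3
    rng = np.random.default_rng(1)
    P_trans = rng.dirichlet(np.ones(n_x), size=(n_x, n_a))  # P(x' | x, a)
    P_obs = rng.dirichlet(np.ones(n_o), size=(n_x, n_a))    # P(o' | x', a)
    features = rng.normal(size=(n_x, n_a, 5))               # stand-in for f_{BN_t, A_t}

    def step(belief, a, o):
        """Update P(x_{t+1} | O_{1:t+1}, A_{1:t}) and return it together with the
        predictive observation probability P(o_{t+1} | O_{1:t}, A_{1:t})."""
        predicted = belief @ P_trans[:, a, :]                # P(x_{t+1} | O_{1:t}, A_{1:t})
        joint = predicted * P_obs[:, a, o]                   # P(x_{t+1}, o_{t+1} | ...)
        p_obs = joint.sum()
        return joint / p_obs, p_obs

    def expected_features(belief, a):
        """E[f_t | O_{1:t}, A_{1:t}] for a candidate action under the current belief."""
        return belief @ features[:, a, :]

    belief = np.full(n_x, 1.0 / n_x)
    belief, p_o = step(belief, a=0, o=2)
    print(p_o, expected_features(belief, a=1))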
3.3.2. Vehicle Diagnosis Experiments

We apply our inverse diagnostics approach to the vehicle fault detection Bayesian network (Heckerman et al., 1994) shown in Figure 6, with fully specified conditional probability distributions. Apart from the relationship between Battery Age and Battery (an exponentially increasing probability of failure with battery age), the remaining conditional probability distributions are deterministic-ors (i.e., a failure in any parent causes a failure in the child). A mechanic can either test a component of the vehicle (revealing its status) or repair a component (making it, and potentially its descendants, operational). Replacements and tests are both characterized by three action features: (1) a cost to the vehicle owner; (2) a profit for the mechanic; and (3) a time requirement.

Figure 6. The vehicle fault detection Bayesian network with replaceable variables denoted with an asterisk. All variables are binary (working or not) except Battery Age.

Ideally, the sequence of the mechanic's actions would minimize the expected cost to the vehicle owner, but an over-booked mechanic might instead choose to minimize the total repair time, and a less ethical mechanic might seek to optimize personal profit. To generate a dataset of observations and replacements, a stochastic policy is obtained by adding Gaussian noise, $\epsilon_{s,a}$, to each action's future expected value, $Q^*(s,a)$, under the optimal policy for a fixed set of feature weights; the highest noisy-valued action, $Q^*(s,a) + \epsilon_{s,a}$, is selected at each time step. Different vehicle failure samples are generated from the Bayesian network conditioned on the vehicle's engine failing to start, and the stochastic policy is sampled until the vehicle is operational.

In Figure 7, we evaluate the prediction error rate and perplexity of our model and of a Markov model that ignores the underlying mechanisms for decision making and simply predicts behavior in proportion to the frequency with which it has previously been observed (with small pseudo-count priors). The MaxCausalEnt approach consistently outperforms the Markov model, even with an order of magnitude less training data. The classification error rate quickly reaches the limit implied by the stochasticity of the data generation process.

Figure 7. Error rate and perplexity of the MaxCausalEnt model and the Markov model for diagnosis action prediction as the training set size (log-scale) increases.

4. Conclusion and Future Work

In this work, we extended the principle of maximum entropy to settings with sequentially revealed information. We demonstrated the applicability of the resulting principle of maximum causal entropy for learning policies in stochastic control, multi-agent interaction, and partially observable settings. In addition to further investigating modeling applications, our future work will investigate the applicability of MaxCausalEnt to non-modeling tasks in dynamic settings. For instance, we note that the proposed principle provides a natural criterion for efficiently identifying a correlated equilibrium in dynamic Markov games, generalizing the normal-form game approach of Ortiz et al. (2007).

Acknowledgments

The authors gratefully acknowledge the Richard King Mellon Foundation, the Quality of Life Technology Center, and the Office of Naval Research MURI N00014-09-1-1052 for support of this research. We also thank our reviewers for many useful suggestions.

References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proc. ICML, pp. 1-8, 2004.

Abbeel, P., Coates, A., Quigley, M., and Ng, A. Y. An application of reinforcement learning to aerobatic helicopter flight. In NIPS, pp. 1-8, 2007.

Bellman, R. A Markovian decision process. Journal of Mathematics and Mechanics, 6:679-684, 1957.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Boyd, S., El Ghaoui, L., Feron, E., and Balakrishnan, V. Linear Matrix Inequalities in System and Control Theory, volume 15 of Studies in Applied Mathematics. SIAM, 1994.

Dudík, M. and Schapire, R. E. Maximum entropy distribution estimation with generalized regularization. In Proc. COLT, pp. 123-138, 2006.

Grünwald, P. D. and Dawid, A. P. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32:1367-1433, 2003.

Heckerman, D., Breese, J. S., and Rommelse, K. Troubleshooting under uncertainty. Communications of the ACM, pp. 121-130, 1994.

Howard, R. A. and Matheson, J. E. Influence diagrams. In Readings on the Principles and Applications of Decision Analysis, pp. 721-762. Strategic Decisions Group, 1984.

Jaynes, E. T. Information theory and statistical mechanics. Physical Review, 106:620-630, 1957.

Kalman, R. When is a linear control system optimal? Trans. ASME, J. Basic Engrg., 86:51-60, 1964.

Kramer, G. Directed Information for Channels with Feedback. PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich, 1998.

Lafferty, J., McCallum, A., and Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, pp. 282-289, 2001.

Massey, J. L. Causality, feedback and directed information. In Proc. IEEE International Symposium on Information Theory and Its Applications, pp. 27-30, 1990.

Ortiz, L. E., Schapire, R. E., and Kakade, S. M. Maximum entropy correlated equilibria. In AISTATS, pp. 347-354, 2007.

Parsons, T. D. Pursuit-evasion in a graph. In Theory and Applications of Graphs, pp. 426-441. Springer-Verlag, 1976.

Permuter, H. H., Kim, Y.-H., and Weissman, T. On directed information and gambling. In Proc. IEEE International Symposium on Information Theory, pp. 1403-1407, 2008.

Ratliff, N., Bagnell, J. A., and Zinkevich, M. Maximum margin planning. In Proc. ICML, pp. 729-736, 2006.

Ratliff, N., Silver, D., and Bagnell, J. A. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1):25-53, 2009.

Tatikonda, S. and Mitter, S. Control under communication constraints. IEEE Transactions on Automatic Control, 49(7):1056-1068, July 2004.

Ullman, T., Baker, C., Macindoe, O., Evans, O., Goodman, N., and Tenenbaum, J. Help or hinder: Bayesian models of social goal inference. In Proc. NIPS, pp. 1874-1882, 2009.

Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In Proc. AAAI, pp. 1433-1438, 2008.