Exponential Family Predictive Representations of State

David Wingate
Computer Science and Engineering
University of Michigan
wingated@umich.edu

Satinder Singh
Computer Science and Engineering
University of Michigan
baveja@umich.edu

Abstract

In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. Predictive representations of state (PSRs) capture state as statistics of the future. We introduce a new model of such systems called the "Exponential family PSR," which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, and in particular, graphical models and maximum entropy models. We present a parameter learning algorithm based on maximum likelihood, and we show how a variety of current approximate inference methods apply. We evaluate the quality of our model with reinforcement learning by directly evaluating the control performance of the model.

1 Introduction

One of the basic problems in modeling controlled, partially observable, stochastic dynamical systems is representing and tracking state. In a reinforcement learning context, the state of the system is important because it can be used to make predictions about the future, or to control the system optimally. Often, state is viewed as an unobservable, latent variable, but models with predictive representations of state [5] propose an alternative: PSRs represent state as statistics about the future.

The original PSR models used the probability of specific, detailed futures called tests as the statistics of interest. Recent work has introduced the more general notion of using parameters that model the distribution of length n futures as the statistics of interest [8, 12]. To clarify this, consider an agent interacting with the system. It observes a series of observations o_1 ... o_t, which we call a history h_t (where subscripts denote time). Given any history, there is some distribution over the next n observations: p(O_{t+1} ... O_{t+n} | h_t) ≡ p(F_t | h_t) (where O_{t+i} is the random variable representing an observation i steps in the future, and F_t is a mnemonic for future). We emphasize that this distribution directly models observable quantities in the system.

Instead of capturing state with tests, the more general idea is to capture state by directly modeling the distribution p(F_t | h_t). Our central assumption is that the parameters describing p(F_t | h_t) are sufficient for history, and therefore constitute state (as the agent interacts with the system, p(F_t | h_t) changes because h_t changes; therefore the parameters, and hence state, change). As an example of this, the Predictive Linear-Gaussian (PLG) model [8] assumes that p(F_t | h_t) is jointly Gaussian; state therefore becomes its mean and covariance. Nothing is lost by defining state in terms of observable quantities: Rudary et al. [8] proved that the PLG is formally equivalent to the latent-variable approach in linear dynamical systems. In fact, because the parameters are grounded, statistically consistent parameter estimators are available for PLGs.

Thus, as part of capturing state in a dynamical system in our method, p(F_t | h_t) must be estimated. This is a density estimation problem.
In systems with rich observations (say, camera images), p(F_t | h_t) may have high dimensionality. As in all high-dimensional density estimation problems, structure must be exploited. It is therefore natural to connect to the large body of recent research dealing with high-dimensional density estimation, and in particular, graphical models. In this paper, we introduce the Exponential Family PSR (EFPSR), which assumes that p(F_t | h_t) is a standard exponential family distribution. By selecting the sufficient statistics of the distribution carefully, we can impose graphical structure on p(F_t | h_t), and therefore make explicit connections to graphical models, maximum entropy modeling, and Boltzmann machines. The EFPSR inherits both the advantages and disadvantages of graphical exponential family models: inference and parameter learning in the model are generally hard, but all existing research on exponential family distributions is applicable (in particular, work on approximate inference).

Selecting the form of p(F_t | h_t) and estimating its parameters to capture state is only half of the problem. We must also model the dynamical component, which describes the way that the parameters vary over time (that is, how the parameters of p(F_t | h_t) and p(F_{t+1} | h_{t+1}) are related). We describe a method called "extend-and-condition," which is a generalization of many state update mechanisms in PSRs.

Importantly, the EFPSR has no hidden variables, but can still capture state, which sets it apart from other graphical models of sequential data. It is not directly comparable to models such as HMMs, CRFs [3], or maximum-entropy Markov models (MEMMs) [6], for example. In particular, the EM-based procedures used for parameter learning in latent-variable models are unnecessary, and indeed, impossible. This is a consequence of the fact that the statistics of interest are always related directly to observable quantities.

This paper introduces the exponential family PSR (Section 2) and a complete algorithm for model learning, including structure learning and maximum likelihood parameter estimation (Section 3). We discuss inference in Section 4. We present two sets of experiments in Section 5: we first show that the model can learn several small, POMDP-like toy systems; second, we use the model in a larger domain, in which the resulting state vectors can be used for traditional RL planning.

2 The Exponential Family PSR

The exponential family PSR model captures state as the parameters of the distribution p(F_t | h_t). Recall that at each timestep, F_t is the random variable representing the next n observations, given history until time t: F_t | h_t = [O_{t+1} ... O_{t+n} | h_t], where each O_t ∈ R^d; thus, each F_t ∈ R^{nd}. We are now faced with two questions. First, what parametric form should we choose to represent p(F_t | h_t)? This is the density estimation component, for which we choose an exponential family distribution, as discussed next. Second, given the parameters of p(F_t | h_t), how can we incorporate a new observation to find the parameters of p(F_{t+1} | h_t, o_{t+1})? This is the dynamical component, for which our strategy is to extend and condition, as explained in Section 2.3.

2.1 Standard Exponential Family Distributions

We select the exponential form for p(F_t | h_t) because of its close connections to maximum entropy modeling.
We refer the reader to Jaynes [2] for detailed justification; briefly, he states that the maximum entropy distribution "agrees with everything that is known, but carefully avoids assuming anything that is not known," which "is the fundamental property which justifies its use for inference." The standard exponential family is the form of the maximum entropy distribution under certain constraints. For a random variable X, a member of the standard exponential family of probability distributions has the form

p(X = x; s) = exp{s^T φ(x) - Φ(s)},

where s is the canonical vector of parameters and φ(x) is a vector of features of the variable x; φ(x) also forms the sufficient statistics of the distribution. Φ(s) is known as the log-partition function, and is a normalizing constant which ensures that p(x; s) defines a valid distribution: Φ(s) = log ∫ exp{s^T φ(x)} dx.

Figure 1: An illustration of extending and conditioning the distribution. The distribution of the next n observations, p(F_t | h_t), defined over observation features at times t+1 ... t+n, is extended to p(F_t, O_{t+n+1} | h_t) over times t+1 ... t+n+1, and then conditioned on o_{t+1} to yield p(F_{t+1} | h_t, o_{t+1}).

By carefully selecting the features φ(x), graphical structure may be imposed on the resulting distribution. In our case, the random variable X will be replaced with F_t | h_t, and the features will be conjunctions of binary random variables, as explained in Section 2.2.

2.2 Binary Features

In general, each multidimensional observation O_t (and therefore, F_t | h_t) may be discrete or continuous. In this paper, we assume that a number of binary features have been extracted from the observations (although we use binary features, we emphasize that almost all of the concepts we develop apply equally well to the cases of real-valued and discrete-valued random variables). Hereafter, we will treat these binary features as if they were the observations, and ignore the underlying observations from which they are generated.

To impose graphical structure on the model, we construct the feature vector φ as follows. Let each O_t ∈ {0, 1}^d; therefore, each F_t ∈ {0, 1}^{nd}. Let F_t^i be the i'th random variable in F_t. We assume that we have an undirected graph G = (V, E), where V = {1, ..., nd} are the nodes in the graph (one for each F_t^i) and (i, j) ∈ E are the edges. We create one entry in φ for each node i ∈ V, and one entry for each edge. Specifically, if (i, j) ∈ E, then there will be some feature φ(f_t)_k = f_t^i f_t^j (there is no edge (i, j) ∈ E if and only if F_t^i and F_t^j are conditionally independent given all other variables in the graph). For ease of exposition, we only allow pairwise interactions between random variables, but the extension to higher-order features is straightforward. The graph G is not arbitrary; we will impose special structure on this graph (called the temporal invariance property) so that its form does not change over time, as explained later.

2.3 Extend and Condition

Having defined the form of p(F_t | h_t), we now address the second central question of our model: given the parameters of p(F_t | h_t), how can we incorporate a new observation to find the parameters of p(F_{t+1} | h_t, o_{t+1})? The rhythm of our model is to extend and condition; Figure 1 illustrates the process. As explained in Section 2.2, we assume that each observation is a vector of binary random variables, and that we have imposed some graphical structure on p(F_t | h_t).
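As a concrete illustration of Section 2.2, the following is a minimal sketch of how the node-plus-edge feature vector φ(·) might be constructed; the function names and the toy edge set are ours for illustration, not part of the model specification.

```python
import numpy as np

def make_feature_fn(num_vars, edges):
    """Build phi(.): one entry per node of G, plus one conjunction per edge.

    num_vars -- n*d, the number of binary variables in F_t
    edges    -- list of (i, j) pairs, the edges of the undirected graph G
    """
    def phi(f):
        f = np.asarray(f, dtype=float)                       # f in {0,1}^{n*d}
        pairs = np.array([f[i] * f[j] for (i, j) in edges])  # pairwise conjunctions
        return np.concatenate([f, pairs])
    return phi

# Toy example: n*d = 4 binary variables, edges linking temporally adjacent ones.
phi = make_feature_fn(4, edges=[(0, 1), (1, 2), (2, 3)])
print(phi([1, 0, 1, 1]))   # -> [1. 0. 1. 1. 0. 0. 1.]
```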
We assume that we have the parameters of p(F_t | h_t), which we will refer to as s_t | h_t, or simply s_t:

p(F_t = f_t | h_t; s_t) = exp{s_t^T φ(f_t) - Φ(s_t)}.   (1)

We extend the distribution to include a set of new parameters s_t^+ which describe the observations at time t+n+1:

s_t^+ = extend(s_t; Λ).

Together, s_t and s_t^+ describe p(F_t, O_{t+n+1} | h_t), which is a temporary distribution over (n+1)d random variables. Once we have extended the distribution to model the (t+n+1)'th observation in the future, we then condition on the actual (t+1)'th observation, which results in the parameters of a distribution over observations from t+2 through t+n+1:

s_{t+1} = condition(s_t^+, o_{t+1}),

which are precisely the statistics representing p(F_{t+1} | h_{t+1}), which in turn is our state at time t+1. We then repeat the process.

Note that although s_t and s_t^+ are the parameters defining the distribution p(F_t | h_t), they are not the model parameters; that is, we cannot freely select them. Instead, the model parameters are the Λ which govern the extension function. This is a significant difference from standard maximum entropy models, and stems from the fact that our overall problem is that of modeling a dynamical system, rather than just density estimation.

There is only one restriction on the graph defining the distribution: we must ensure that temporally shifted copies of each feature exist (we call this the temporal invariance property). For example, if there is a feature relating x at time t+5 and y at time t+6, then there must also be a feature relating x at time t+4 to y at time t+5. This ensures that the structure of the graph does not change between state updates (seen pictorially in Figure 1).

Because we have stipulated that all features are either atomic variables or conjunctions of variables, finding the parameters of the conditioned distribution is easy (this is true even if the random variables are discrete or real-valued). When we condition on an observation, we freeze the observed variables to their observed values, and then collect parameters. For example, suppose we have features φ(f_t)_l = f_t^j and φ(f_t)_k = f_t^i f_t^j, with associated parameters s_t^l and s_t^k. Upon observing f_t^i, the feature f_t^j has a new parameter equal to s_t^l + s_t^k f_t^i = [1; f_t^i]^T [s_t^l; s_t^k]. This operation is linear in s_t. We can apply the same logic to conditioning p(F_t, O_{t+n+1} | h_t) on o_{t+1} by defining the linear conditioning operator G(o_{t+1}) as a matrix which adds the appropriate parameters together.

The extension function extend can take any form. In the PLG family of work, for example, a linear extension allows the model to capture linear dynamics [8], while a non-linear extension allows the model to capture non-linear dynamics [11]. Here, we focus on linear extensions:

s_t^+ = Λ s_t,

where Λ becomes a matrix of parameters. The combination of a linear extension and a linear conditioning operator can be rolled together into a single operation. Without loss of generality, we can permute the indices in our state vector such that

s_{t+1} = G(o_{t+1}) [s_t; Λ s_t].

Note that given model parameters Λ, an initial state s_0, and a sequence of observations, the sequence of s_t's is completely determined. This is analogous to the belief state update in, say, a POMDP: the belief state update is a deterministic function of a prior belief state and an observation. No matter what the form of the extend function, the overall extend-and-condition operation does not involve any inference.
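To make the update concrete, here is a minimal sketch of one extend-and-condition step under the linear extension above; the function and argument names are ours, and G_fn stands in for whatever routine builds the conditioning matrix G(o) from the feature structure.

```python
import numpy as np

def update_state(s_t, o_next, Lam, G_fn):
    """One extend-and-condition step: s_{t+1} = G(o_{t+1}) [s_t; Lam s_t].

    s_t    -- current state: canonical parameters of p(F_t | h_t)
    o_next -- the observed o_{t+1}, a binary vector
    Lam    -- the extension matrix (the model parameters)
    G_fn   -- G_fn(o) returns the linear conditioning matrix G(o)
    """
    s_plus = Lam @ s_t                        # extend: parameters for O_{t+n+1}
    stacked = np.concatenate([s_t, s_plus])   # parameters of p(F_t, O_{t+n+1} | h_t)
    return G_fn(o_next) @ stacked             # condition on o_{t+1}: the new state

# Tracking state over a trajectory is then a simple fold over the observations:
# s = s0
# for o in observations:
#     s = update_state(s, o, Lam, G_fn)
```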
Thus, given an EFPSR model, tracking state is computationally efficient: it is only as hard as the extension, which in our case is a linear operation.

3 Model Learning

We have defined our concept of state, as well as our method for tracking that state. We now address the question of learning the model from data. There are two things which can be learned in our model: the structure of the graph, and the parameters governing the state update. We briefly address each in the next two subsections. We assume we are given a sequence of T observations, [o_1 ... o_T], which we stack to create a sequence of samples from the F_t | h_t's: f_t | h_t = [o_{t+1} ... o_{t+n} | h_t].

3.1 Structure Learning

There are two aspects to structure learning: first, selecting the value of n, and second, determining the graphical structure of the model, that is, deciding which edges to include in our graph (and therefore, which feature conjunctions to include in the feature vector φ(·)). For this part of learning, we make the approximation of ignoring the dynamical component of the model. That is, we treat each f_t as an observation, and try to estimate the density of the resulting unordered set, ignoring the t subscripts (we appeal to density estimation because many good algorithms have been developed for structure induction). We therefore ignore temporal relationships across samples, but we preserve temporal relationships within samples. For example, if observation a is always followed by observation b, this fact will be captured within the f_t's.

The problem therefore becomes one of inducing graphical structure for a non-sequential data set, which is a problem that has already received considerable attention. In all of our experiments, we used the method of Della Pietra et al. [7]. Their method iteratively evaluates a set of candidate features and adds the one with the highest expected gain in log-likelihood. To enforce the temporal invariance property, whenever we add a feature, we also add all of the temporally shifted copies of that feature, as well as the conditioned versions of that feature.

3.2 Maximum Likelihood Parameter Estimation

With the structure of the graph in place, we are left to learn the parameters of the state extension. It is now useful that our state is defined in terms of observable quantities, for two reasons. First, because everything in our model is observed, EM-style procedures for estimating the parameters of our model are not needed, simply because there are no unobserved variables over which to take expectations. Second, when trying to learn a sequence of states (s_t's) given a long trajectory of futures (f_t's), each f_t is a sample of information directly from the distribution we're trying to model. Finally, note that given a parameter estimate, an initial state s_0, and a sequence of observations, the sequence of s_t's is completely determined. This will be a key element of our proposed maximum-likelihood learning algorithm.

The likelihood of the training data is p(o_1, o_2, ..., o_T) = ∏_{t=1}^T p(o_t | h_t). We will find it more convenient to measure the likelihood of the corresponding f_t's: p(o_1, o_2, ..., o_T)^n ≈ ∏_{t=1}^T p(f_t | h_t) (the likelihoods are not the same because the likelihood of the f_t's counts a single observation n times; the approximate equality is because the first n and last n observations are counted fewer than n times). The expected regularized log-likelihood of the training f_t's under the model defined in Eq. 1 is

LL = (1/T) Σ_{t=1}^T [ s_t^T φ(f_t) - log(Z_t) - (β/2) ||s_t||^2 ],

where β is a regularization weight. Our goal is to maximize this quantity.
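For intuition, the sketch below evaluates this regularized log-likelihood by brute-force enumeration of all binary futures, which is feasible only for very small n*d; the state sequence is assumed to have been produced by the update sketched earlier, and the names and the default β are ours.

```python
import itertools
import numpy as np

def regularized_log_likelihood(states, futures, phi, beta=0.01):
    """Average of s_t' phi(f_t) - log Z_t - (beta/2)||s_t||^2 over the data.

    states  -- list of state vectors s_t (canonical parameters)
    futures -- list of observed future vectors f_t in {0,1}^{n*d}
    phi     -- feature function (node entries plus edge conjunctions)
    beta    -- regularization weight
    """
    num_vars = len(futures[0])
    all_f = [np.array(b) for b in itertools.product([0, 1], repeat=num_vars)]
    total = 0.0
    for s, f in zip(states, futures):
        log_z = np.log(sum(np.exp(s @ phi(g)) for g in all_f))  # brute-force Z_t
        total += s @ phi(f) - log_z - 0.5 * beta * (s @ s)
    return total / len(futures)
```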
Any optimization method can be used to maximize the log-likelihood. Two popular choices are gradient ascent and quasi-Newton methods, such as (L-)BFGS. We use both, for different problems (as discussed later). However, both methods require the gradient of the likelihood with respect to the parameters, which we now compute. Using the chain rule, the derivative with respect to the parameters Λ is

∂LL/∂Λ = Σ_{t=1}^T (∂LL/∂s_t)(∂s_t/∂Λ).

First, we compute the derivative of the log-likelihood with respect to each state:

∂LL/∂s_t = ∂/∂s_t [ s_t^T φ(f_t) - log(Z_t) - (β/2) ||s_t||^2 ] = φ(f_t) - E[φ(F_t | h_t)] - β s_t,   (2)

where E[φ(F_t | h_t)] is the vector of expected sufficient statistics at time t. Computing these values is a standard inference problem in exponential family models, and is discussed in Section 4. This gradient tells us that we wish to adjust each state to make the expected features of the next n observations closer to the observed features. This is similar to the result obtained in standard maximum entropy gradients, where the gradient attempts to move the expectation of features under the model such that it is equal to the empirical expectation. There are two differences: first, we only have one sample for each timestep, and so the empirical expectation is simply the observed sample at time t. Second, we cannot adjust s_t directly; instead, we must adjust it implicitly by adjusting Λ.

We now compute the gradient of the state with respect to the parameters. Since s_{t+1} = G(o_{t+1}) [s_t; Λ s_t], the product rule gives the recursion

∂s_{t+1}/∂Λ = G(o_{t+1}) [ ∂s_t/∂Λ ; Λ (∂s_t/∂Λ) ] + (s_t^T ⊗ D),

where ⊗ is the Kronecker product and D is a matrix we call the "derivative template": D_{ij} = 1 if i is the conditioned version of the j'th extended feature, and 0 otherwise (the second term accounts for the direct dependence of the extended parameters Λ s_t on Λ). Note that the gradients at time t are temporally recursive, and depend on gradients from all previous timesteps. It might seem prohibitive to compute them, but fortunately, the necessary statistics can be efficiently computed in a recursive fashion as the algorithm walks through the data.

Figure 2: Results on two-state POMDPs. The right panel (b) shows the generic two-state model used; by varying the transition and observation probabilities, three different POMDPs were generated. The left panels (a) show learning performance (log-likelihood versus iterations of optimization) on the three models. Likelihoods for naive predictions are shown as a dotted line near the bottom; likelihoods for optimal predictions are shown as a dash-dot line near the top.

Problem    # of states   # of obs.   # of actions   Naive LL   True LL   Training set LL (%)   Test set LL (%)
Paint           16            2            4           6.24       4.66       4.67 (99.7)           4.66 (99.9)
Network          7            2            4           6.24       4.49       4.50 (99.5)           4.52 (98.0)
Tiger            2            2            3           6.24       5.23       5.24 (92.4)           5.25 (86.0)

Figure 3: Results on standard POMDPs. See text for explanation.

4 Inference

In order to compute the gradients needed for model learning, the expected sufficient statistics E[φ(F_t | h_t)] at each timestep must be computed (see Eq. 2):

E[φ(F_t | h_t)] = ∫ φ(f_t) p(F_t = f_t | h_t) df_t = ∇Φ(s_t).

This quantity, also known as the mean parameters, is of central interest in standard exponential families, and has several interesting properties. For example, each possible set of canonical parameters s induces one set of mean parameters; assuming that the features are linearly independent, each set of valid mean parameters is uniquely determined by one set of canonical parameters [9]. Computing these marginals is an inference problem.
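For very small n*d, the mean parameters can be computed exactly by enumerating all binary futures; the sketch below (names are ours) does exactly that, and its exponential cost is what motivates the approximate methods discussed next.

```python
import itertools
import numpy as np

def expected_features(s, phi, num_vars):
    """Exact mean parameters E[phi(F_t | h_t)] by brute-force enumeration.

    s        -- canonical parameters s_t of p(F_t | h_t)
    phi      -- feature function (node entries plus edge conjunctions)
    num_vars -- n*d, number of binary variables; cost is O(2^num_vars)
    """
    feats = [phi(np.array(b)) for b in itertools.product([0, 1], repeat=num_vars)]
    weights = np.array([np.exp(s @ f) for f in feats])
    weights /= weights.sum()                  # normalize to get p(f; s)
    return sum(w * f for w, f in zip(weights, feats))
```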
This inference step is repeated T times (once per sample) in order to obtain a single gradient, which is then used in an outer optimization loop; because inference must be performed repeatedly in our model, computational efficiency is a more stringent requirement than accuracy. In terms of inference, our model inherits all of the properties of graphical models, for better and for worse. Exact inference in our model is generally intractable, except in the case of fully factorized or tree-structured graphs. However, many approximate algorithms exist: there are variational methods such as naive mean-field, tree-reweighted belief propagation, and log-determinant relaxations [10]; other methods include Bethe-Kikuchi approximations, expectation propagation, (loopy) belief propagation, MCMC methods, and contrastive divergence [1].

5 Experiments and Results

Two sets of experiments were conducted to evaluate the quality of our model and learning algorithm. The first set tested whether the model could capture exact state, given the correct features and exact inference. We evaluated the learned model using exact inference to compute the exact likelihood of the data, and compared it to the true likelihood. The second set tested larger models, for which exact inference is not possible. For the second set, bounds can be provided for the likelihoods, but they may be so loose as to be uninformative. How can we assess the quality of the final model? One objective gauge is control performance: if the model has a reward signal, reinforcement learning can be used to determine an optimal policy. The reward achieved then becomes an objective measure of model quality, even though approximate likelihood is the learning signal.

Figure 4: Control performance results on cheesemaze for different approximate inference methods. The three left panels plot average reward versus iterations of optimization using NMF, CD, and LDR gradients, respectively; the rightmost panel ("Algorithm Progression") plots the normalized CD value, NMF bound, and LDR bound versus iterations.

First set. We tested on three two-state problems, as well as three small, standard POMDPs. For each problem, training and test sets were generated (using a uniformly random policy for controlled systems). We used 10,000 samples, set n = 3, and used structure learning as explained in Section 3.1. We used exact inference to compute the E[φ(F_t | h_t)] term needed for the gradients, and optimized the likelihood using BFGS. For each dataset, we computed the log-likelihood of the data under the true model, as well as the log-likelihood of a "naive" model, which assigns uniform probability to every possible observation. We then learned the best model possible, and compared the final log-likelihood under the learned and true models.

Figure 2 (a) shows results for three two-state POMDPs with binary observations. The left panel of Fig. 2 (a) shows results for a two-state MDP. The likelihood of the learned model closely approaches the likelihood of the true model (although it does not quite reach it; this is because the model has trouble modeling deterministic observations, since the weights in the exponential need to be infinitely large [or small] to generate a probability of one [or zero]). The middle panel shows results for a moderately noisy POMDP; again, the learned model is almost perfect.
The third panel shows results for a very noisy POMDP, in which the naive and true LLs are very close; this indicates that prediction is difficult, even with a perfect model.

Figure 3 shows results for three standard POMDPs, named Paint, Network and Tiger (all standard POMDPs used here are from Tony Cassandra's POMDP repository at http://www.cs.brown.edu/research/ai/pomdp/index.html). The table conveys similar information to the graphs: naive and true log-likelihoods, as well as the log-likelihoods of the learned models (on both training and test sets). To help interpret the results, we also report a percentage, which indicates the amount of the likelihood gap (between the naive and true models) that was captured by the learned model. Higher is better; again we see that the learned models are quite accurate, and generalize well.

Second set. We also tested on a more complicated standard POMDP called cheesemaze, also from Cassandra's repository. For this problem, exact inference is intractable, and so we used approximate methods. We experimented with contrastive divergence (CD) [1], naive mean field (NMF), and log-determinant relaxations (LDR) [10] to compute the E[φ(F_t | h_t)] term needed for the gradients. Since the NMF and LDR bounds on the log-likelihood were so loose (and CD provides no bound), it was impossible to assess our model by an appeal to likelihood. Instead, we opted to evaluate the models based on control performance. To do this, we used Least-Squares Policy Iteration (LSPI) [4]. We generated a fixed sequence of actions and observations using a random policy. After each step of optimization, we used the parameter estimate to generate a corresponding set of states. We fed the states, rewards, and actions to LSPI to generate an approximate Q-function. We then ran the agent using a greedy policy and report the average reward over 1,000 steps. This model had 1,580 parameters to work with, while the true model has 561. We used n = 3 and 10,000 training points.

Figure 4 (the first three panels on the left) shows the results. Surprisingly, only a few iterations of optimization were necessary to generate a state representation that was conducive to good control performance; in fact, in all three experiments, performance began to degrade after about 3-4 steps of optimization. We compared against a perfect policy and a naive policy. The optimal policy generated a control performance of 0.1586, which is shown as the dash-dot line in the figure. We also evaluated a policy which does not use any sort of state, and ignores the observations entirely. This naive policy generates a control performance of 0.0246 (shown as the dotted line at the bottom).

The far right panel in Figure 4 shows the progress of each method over time. We report the lower bound provided by LDR, the upper bound from NMF, and a quantity derived from CD (which does not provide a bound), normalized to be between zero and one. In all three cases, the algorithms appear to be making steady progress in terms of likelihood, but this does not translate into control performance gains. The best learned model achieves an average reward of 0.142. This accounts for 87.47% of the gap between a naive and an optimal policy, an exciting result considering that the model was learned.

6 Conclusions

We have presented the Exponential Family PSR, a new model of controlled, stochastic dynamical systems. The model has several appealing properties, including its close connections to graphical models and maximum entropy modeling.
Based on our empirical results, we conclude that the EFPSR is capable of learning models of partially observable domains. We were able to learn almost perfect models of several small systems, both from a likelihood perspective and from a control perspective. The biggest drawback is computational: the repeated inference calls make the learning process very slow. However, the final models can be accurate, both in terms of likelihood and in terms of control performance. Interestingly, when average reward is the performance metric, only a few iterations of optimization are necessary. This is a boon, since such iterations are expensive, and it suggests that on-line learning may be possible. Approximating the gradients efficiently enough for on-line use is an interesting direction for future research.

Acknowledgments

David Wingate is supported under a National Science Foundation Graduate Research Fellowship. Satinder Singh is supported by NSF grant IIS-0413004. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

References

[1] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
[2] E. T. Jaynes. Notes on present status and future prospects. In W. Grandy and L. Schick, editors, Maximum Entropy and Bayesian Methods, pages 1-13, 1991.
[3] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), 2001.
[4] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003.
[5] M. L. Littman, R. S. Sutton, and S. Singh. Predictive representations of state. In Neural Information Processing Systems (NIPS), pages 1555-1561, 2002.
[6] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In International Conference on Machine Learning (ICML), pages 591-598, 2000.
[7] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393, 1997.
[8] M. Rudary, S. Singh, and D. Wingate. Predictive linear-Gaussian models of stochastic dynamical systems. In Uncertainty in Artificial Intelligence (UAI), pages 501-508, 2005.
[9] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, UC Berkeley, 2003.
[10] M. J. Wainwright and M. I. Jordan. Log-determinant relaxation for approximate inference in discrete Markov random fields. IEEE Transactions on Signal Processing, 54(6):2099-2109, 2006.
[11] D. Wingate and S. Singh. Mixtures of predictive linear Gaussian models for nonlinear stochastic dynamical systems. In National Conference on Artificial Intelligence (AAAI), 2006.
[12] D. Wingate and S. Singh. On discovery and learning of models with predictive representations of state for agents with continuous actions and observations. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2007.