Efficiently Learning Linear-Linear Exponential Family Predictive Representations of State

David Wingate (wingated@umich.edu)
Satinder Singh (baveja@umich.edu)
Computer Science and Engineering, University of Michigan, 2260 Hayward, Ann Arbor, MI 48109

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

Exponential Family PSR (EFPSR) models capture stochastic dynamical systems by representing state as the parameters of an exponential family distribution over a short-term window of future observations. They are appealing from a learning perspective because they are fully observed (meaning expressions for maximum likelihood do not involve hidden quantities), but are still expressive enough to both capture existing models and predict new models. While maximum-likelihood learning algorithms for EFPSRs exist, they are not computationally feasible. We present a new, computationally efficient learning algorithm based on an approximate likelihood function. The algorithm can be interpreted as attempting to induce stationary distributions of observations, features and states which match their empirically observed counterparts. The approximate likelihood, and the idea of matching stationary distributions, may apply to other models.

1. Introduction

One of the basic problems in modeling controlled, partially observable, stochastic dynamical systems is representing and tracking state.
In a reinforcement learning context, the state of the system is important because it can be used to make predictions about the future, or to control the system optimally. Often, state is viewed as an unobservable, latent variable, but models with predictive representations of state (Littman et al., 2002) propose an alternative: PSRs represent state as statistics about the future.

The original PSR models used the probability of specific, detailed futures called tests as the statistics of interest. Recent work has introduced the more general notion of using parameters that model the distribution of length-$n$ futures as the statistics of interest (Rudary et al., 2005; Wingate, 2008). To clarify this, consider an agent interacting with the system. It observes a series of observations $o_1 \ldots o_t$, which we call a history $h_t$ (where subscripts denote time). Given any history, there is some distribution over the next $n$ observations: $p(O_{t+1} \ldots O_{t+n} \mid h_t) \equiv p(F^n \mid h_t)$ (where $O_{t+i}$ is the random variable representing an observation $i$ steps in the future, and $F^n$ is a mnemonic for "future"). We emphasize that this distribution directly models observable quantities in the system.

The Exponential Family PSR is a new family of models of partially observable, stochastic dynamical systems. EFPSR models assume that the distribution $p(F^n \mid h_t)$ has an exponential family form, and that the parameters of that distribution are the state of the system (Wingate, 2008). This idea has been shown to unify a number of existing models of dynamical systems: for example, if $p(F^n \mid h_t)$ is assumed to be Gaussian (and certain other choices are made), the model can capture any domain modeled by a Kalman filter.

Existing algorithms for learning EFPSR models from data are based on maximizing exact likelihood, but the algorithms are slow. This paper presents an efficient algorithm for one particular EFPSR, named the Linear-Linear EFPSR. We begin by presenting an approximate likelihood function, and then show that the terms needed to maximize it can be efficiently computed by virtue of the linearity of the Linear-Linear EFPSR's state update. The resulting algorithm is computationally efficient, and can be interpreted in terms of stationary distributions of observations, features and states. It allows us to begin to learn models of domains which are too large (in terms of the amount of data required, and in terms of the complexity of the observation space) to tackle with any other EFPSR.

2. The Exponential Family PSR

We first review the EFPSR family of models, including how state is represented and how it is maintained.

State. The EFPSR defines state as the parameters of an exponential family distribution modeling $p(F^n \mid h_t)$, which is a window of $n$ future observations. To emphasize that these parameters represent state, we will refer to them as $s_t$. The form of the distribution is:

$$p(F^n = f^n \mid h_t; s_t) = \exp\left\{ -s_t^\top \phi(f^n) - \log Z(s_t) \right\} \qquad (1)$$

with both $\phi(f^n), s_t \in \mathbb{R}^{l \times 1}$. The vector $\phi(f^n)$ is a feature vector which controls the particular form of the distribution. For example, $\phi(X) = [X, X^2]$ yields a Gaussian, but $\phi(X) = [X, \log(X)]$ yields a gamma. Since the distribution is over the future, $\phi$ can be thought of as features of the future.
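To make Eq. (1) concrete, here is a minimal sketch of a tiny instance of the distribution, assuming a window of $n = 2$ binary observations of dimension $d = 2$ and a hand-picked feature map. The names (phi, log_partition, log_prob) are ours, not the paper's, and the partition function is brute-forced over all futures, which is exactly the exponential cost discussed later in Section 3.

```python
import itertools
import numpy as np

# A toy instance of Eq. (1): p(F^n = f^n | h_t; s_t) = exp{-s_t' phi(f^n) - log Z(s_t)}.
# The "future" is a flattened window of n=2 binary observations of dimension d=2,
# and phi contains all singleton features plus one illustrative conjunction.

def phi(f):
    """Feature vector of a flattened binary future f in {0,1}^4."""
    singletons = list(f)
    conjunction = f[1] * f[2]  # e.g. pixel 2 at time t+1 AND pixel 1 at time t+2
    return np.array(singletons + [conjunction], dtype=float)

def log_partition(s):
    """log Z(s) by summing over all 2^4 futures (exponential in nd in general)."""
    logs = [-s @ phi(np.array(f)) for f in itertools.product([0, 1], repeat=4)]
    return np.log(np.sum(np.exp(logs)))

def log_prob(f, s):
    """log p(F^n = f | h_t; s) under Eq. (1)."""
    return -s @ phi(f) - log_partition(s)

s_t = np.array([0.5, -0.2, 0.1, 0.3, -1.0])  # the state: one parameter per feature
f = np.array([1, 0, 1, 1])
print(log_prob(f, s_t))
```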
As the agent interacts with the system, $p(F^n \mid h_t)$ changes because $h_t$ changes; therefore the parameters $s_t$, and hence state, change. The feature vector $\phi(f^n)$ does not change over time.

Maintaining State. In addition to selecting the form of $p(F^n \mid h_t)$, there is a dynamical component: given the parameters of $p(F^n \mid h_t)$, how can we incorporate a new observation to find the parameters of $p(F^n \mid h_t, o_{t+1})$? That is, how can we update state? Our strategy is to extend and condition.

Extend. We assume that we have the parameters of $p(F^n \mid h_t)$, denoted $s_t$. We extend the distribution of $F^n \mid h_t$ to include $O_{t+n+1}$, which forms a new variable $F^{n+1} \mid h_t$, and we assume it has the distribution $p(F^n, O_{t+n+1} \mid h_t) = p(F^{n+1} \mid h_t)$. This is a temporary distribution over $(n+1)$ observations. To perform the extension, we define an extension function which maps the current state vector to the parameters of the extended distribution: $s_t^+ = \text{extend}(s_t; \theta)$, where $\theta$ is a vector of parameters controlling the extension function (and hence, the overall dynamics).

The extension function helps govern the kinds of dynamics that the model can capture. For example, in the PLG family of work, a linear extension allows the model to capture linear dynamics (Rudary et al., 2005), while a non-linear extension allows the model to capture non-linear dynamics (Wingate, 2008).

Condition. Once we have extended the distribution to model the $(n+1)$'st observation in the future, we then condition on the actual observation $o_{t+1}$, which results in the parameters of $p(F^n \mid h_{t+1})$: $s_{t+1} = \text{condition}(s_t^+, o_{t+1})$, which is our state at time $t+1$. The entire process of extending and conditioning is illustrated in Fig. 1. We have drawn graphs to suggest that there can be structure in the distributions, and to informally hint at the fact that the form of the distribution does not change over time. This, and other constraints on the features and extension function, are discussed in detail elsewhere (Wingate, 2008).

Figure 1. An illustration of extending and conditioning: the distribution of the next $n$ observations $p(F^n \mid h_t)$ is extended to $p(F^n, O_{t+n+1} \mid h_t)$, then conditioned on $o_{t+1}$ to yield $p(F^n \mid h_t, o_{t+1})$.

2.1. The Linear-Linear EFPSR

The EFPSR is a family of models. Specific members of the family are chosen by selecting two things: the features $\phi$, and an extension function. For example, if $p(F^n \mid h_t)$ is Gaussian, and a special extension function is chosen, the predictively defined version of a linear dynamical system (Kalman filter) is recovered. The Linear-Linear EFPSR chooses features and an extension function designed to make it both analytically tractable and efficiently approximable. The extension function is linear, and features are chosen such that conditioning is always a linear operation (hence the name, "Linear-Linear"). In this paper, we also assume the base observations are vectors of binary random variables. If they are not, we assume that binary features are extracted from the observations, and discard the original observations (we call these atomic features, to distinguish them from the higher-order features defined by $\phi$).

Features. Let each base observation $O_t$ be a vector in $\{0,1\}^d$; therefore, each $F^n \mid h_t \in \{0,1\}^{nd}$. We restrict all features comprising the feature vector $\phi$ to be conjunctions of the atomic binary variables in the base observations. For example, if each $O_t \in \{0,1\}^3$, there could be a feature $\phi(o_t)_k$ which is a conjunction of the second and third components of the observation: $\phi(o_t)_k = (o_t)_2 (o_t)_3$. By selecting features this way, the resulting distribution can be conditioned with an operator that is nonlinear in the observation $o_{t+1}$, but linear in the state $s_t$. We therefore define the linear conditioning operator $G(o_{t+1})$ to be a matrix which transforms $s_t^+$ into $s_{t+1}$: $s_{t+1} = G(o_{t+1}) s_t^+$. See (Wingate, 2008) for details.

Extension function. We choose a linear extension: $s_t^+ = A s_t + B$, where $A \in \mathbb{R}^{k \times l}$ and $B \in \mathbb{R}^{k \times 1}$ are our model parameters. The combination of a linear extension and a linear conditioning operator means that the entire extend-and-condition operation (i.e., state update) is a linear operation: $s_{t+1} = G(o_{t+1})(A s_t + B)$. This will be critical in the sequel.
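The following sketch illustrates the extend-and-condition update $s_{t+1} = G(o_{t+1})(A s_t + B)$. The conditioning operator G below is a made-up placeholder (in the real model its entries are determined by the conjunction features and the observed values); only the linear algebra of the update is meant to be faithful.

```python
import numpy as np

# A minimal sketch of the Linear-Linear state update s_{t+1} = G(o_{t+1})(A s_t + B).
# G here is a hypothetical stand-in for the sparse conditioning operator in the text.

l, k = 6, 8                       # features in p(F^n|h_t) and in the extended distribution
A = 0.01 * np.random.randn(k, l)  # model parameters (the extension is linear)
B = 0.01 * np.random.randn(k, 1)

def G(o):
    """Placeholder conditioning operator: an (l x k) matrix depending on o_{t+1}.
    In the real model its entries come from the conjunction features."""
    rng = np.random.default_rng(int(o.sum()))  # fake dependence on the observation
    return 0.1 * rng.standard_normal((l, k))

def update_state(s_t, o_next):
    """One extend-and-condition step; note the whole update is linear in s_t."""
    s_plus = A @ s_t + B        # extend: parameters of p(F^{n+1} | h_t)
    return G(o_next) @ s_plus   # condition on o_{t+1}: parameters of p(F^n | h_{t+1})

s = np.zeros((l, 1))
for _ in range(5):
    o = np.random.randint(0, 2, size=3)  # a binary observation vector
    s = update_state(s, o)
```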
3. Learning with Exact Likelihood

We now briefly sketch how to learn a Linear-Linear EFPSR model from data by maximizing exact likelihood. We do this to point out the two primary computational bottlenecks that motivate this paper.

We assume we are given a sequence of $T$ observations, $[o_1 \cdots o_T]$, which we stack to create a sequence of samples from the $F^n \mid h_t$'s: $f_t \mid h_t = [o_{t+1} \cdots o_{t+n} \mid h_t]$. The likelihood of the training data is $p(o_1, o_2, \ldots, o_T) = \prod_{t=1}^{T} p(o_t \mid h_t)$, but we will find it more convenient to measure the likelihood of the corresponding $f_t$'s: $p(o_1, o_2, \ldots, o_T)^n \approx \prod_{t=1}^{T} p(f_t \mid h_t)$ (maximizing this is equivalent to maximizing the standard likelihood). The expected log-likelihood of the training $f_t$'s under the model defined in Eq. 1 is

$$LL = \frac{1}{T} \sum_{t=1}^{T} \left( -s_t^\top \phi(f_t) - \log Z(s_t) \right) \qquad (2)$$

Our goal is to maximize this quantity. Any optimization method can be used to maximize the log-likelihood. Two popular choices are gradient ascent and quasi-Newton methods, such as (L-)BFGS, which require the gradient of the likelihood with respect to the parameters, which we will now compute. We can differentiate with respect to our parameters:

$$\frac{\partial LL}{\partial \{A, B\}} = \sum_{t=1}^{T} \frac{\partial LL}{\partial s_t} \frac{\partial s_t}{\partial \{A, B\}} \qquad (3)$$

and with respect to each state:

$$\frac{\partial LL}{\partial s_t} = \frac{\partial}{\partial s_t} \left( -s_t^\top \phi(f_t) - \log Z(s_t) \right) = E_{s_t}[\phi(F^n \mid h_t)] - \phi(f_t) \qquad (4)$$

where $E_{s_t}[\phi(F^n \mid h_t)] \in \mathbb{R}^{l \times 1}$ is the vector of expected sufficient statistics at time $t$. The gradient of the state with respect to $A$ is given by

$$\frac{\partial s_{t+1}}{\partial A} = G(o_{t+1}) \left( A \frac{\partial s_t}{\partial A} + s_t^\top \otimes I \right),$$

where $\otimes$ is the Kronecker product and $I$ is an appropriately sized identity matrix. The gradient of the state with respect to $B$ is

$$\frac{\partial s_{t+1}}{\partial B} = G(o_{t+1}) \left( A \frac{\partial s_t}{\partial B} + I \right).$$

Note that the gradients at time $t$ are temporally recursive: they depend on all previous gradients. There are two bottlenecks which motivate this paper:

1. Computing $E_{s_t}[\phi(F^n \mid h_t)]$ is a standard inference problem in exponential family models, and is computationally expensive because it scales exponentially with the number of atomic observation variables included in the domain of $p(F^n \mid h_t)$. Even approximate inference is NP-hard (Dagum & Luby, 1993), and it must be done $T$ times.

2. The gradients are temporally recursive, but can be computed in a single pass through the data. However, the process is expensive. For the discussion, assume that we have $l$ features in $\phi(f_t)$, and that we have $k$ features in the extended distribution. This means that the matrix $A \in \mathbb{R}^{k \times l}$, that the vector $s_t \in \mathbb{R}^l$, and that there are $kl$ total parameters describing $A$. The term $\partial s_t / \partial A$ is a matrix with $l$ rows and $kl$ columns. Given $\partial s_{t-1} / \partial A$, part of computing $\partial s_t / \partial A$ involves multiplying $\partial s_{t-1} / \partial A$ by $A$. This is an expensive matrix-matrix multiplication, which scales poorly as the number of features in the model grows, and it must be performed $T$ times to get the true gradient of the likelihood, which scales poorly as the size of the training set grows.

To summarize, the exact learning algorithm does not scale well with the number of training samples $T$, the dimension of the observations, the window size $n$, or the number of features $|\phi|$.

4. Approximate Likelihood

We now turn to the main contribution of this paper. In order to achieve an efficient learning algorithm, we will present an approximate expression for likelihood, named $\widetilde{LL}$, and show that its gradient can be efficiently computed. We will also examine what happens in the limit as $T \to \infty$. The quantity $\widetilde{LL}$ could be used with any model, not just the Linear-Linear EFPSR, but we will show that the Linear-Linear EFPSR allows us to compute the needed terms easily.

We now present our approximate log-likelihood $\widetilde{LL}$, which is an approximate lower bound on the exact likelihood. To begin, we will make one central assumption:

Assumption 4.1. We assume that $\mathrm{Cov}[s_t, \phi(f_t)] = 0$ and that $\mathrm{Cov}[s_t, o_t] = 0$, for all $t$.

This assumes that the state does not covary with observable quantities. It implies that $E[s_t^\top \phi(f_t)] = E[s_t]^\top E[\phi(f_t)]$, which will be repeatedly used in the following derivation. This is not as severe an assumption as it may appear to be; in particular, it does not imply that $s_t$ and $\phi(f_t)$ are independent.

We derive $\widetilde{LL}$ using Assumption 4.1 and a lower bound based on Jensen's inequality:

$$\begin{aligned}
LL &= \frac{1}{T} \sum_{t=1}^{T} \left( -s_t^\top \phi(f_t) - \log Z(s_t) \right) \\
&= E_T\left[ -s_t^\top \phi(f_t) - \log Z(s_t) \right] \\
&= E_T\left[ -s_t^\top \phi(f_t) \right] - E_T\left[ \log Z(s_t) \right] \\
&= E_T[-s_t]^\top E_T[\phi(f_t)] - E_T[\log Z(s_t)] \\
&\gtrsim E_T[-s_t]^\top E_T[\phi(f_t)] - \log Z(E_T[s_t]) \equiv \widetilde{LL}
\end{aligned}$$

where we have defined the operator $E_T[X] \equiv \frac{1}{T} \sum_{t=1}^{T} X$. The fourth line in the derivation follows because of Assumption 4.1. The fifth line is obtained by a double application of Jensen's inequality:

$$\begin{aligned}
E_T[-\log Z(s_t)] &= E_T\left[ -\log \int \exp\left( -s_t^\top \phi(F) \right) dF \right] \\
&\geq -\log E_T\left[ \int \exp\left( -s_t^\top \phi(F) \right) dF \right] \\
&\gtrsim -\log \int \exp\left( E_T\left[ -s_t^\top \phi(F) \right] \right) dF \\
&= -\log \int \exp\left( E_T[-s_t]^\top E_T[\phi(F)] \right) dF \\
&= -\log Z(E_T[s_t]).
\end{aligned}$$

The second and third lines follow because of the convexity of the functions $-\log$ and $\exp$, and the fourth line follows by Assumption 4.1.
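As a numerical sanity check (ours, not from the paper), the following snippet compares the exact average log-likelihood of Eq. (2) against the surrogate $\widetilde{LL}$ on synthetic states and futures, with $\phi(f) = f$ and a brute-forced partition function; the gap between the two numbers reflects Assumption 4.1 and the two Jensen steps.

```python
import itertools
import numpy as np

# Compare the exact average log-likelihood (Eq. 2) with the surrogate LL~ on
# synthetic data. The future is a window of 3 binary variables; phi(f) = f.

rng = np.random.default_rng(0)
futures = [np.array(f, float) for f in itertools.product([0, 1], repeat=3)]

def log_Z(s):
    """Brute-force log partition function for the tiny binary future."""
    return np.log(sum(np.exp(-s @ f) for f in futures))

T = 500
states = [rng.normal(0, 0.5, size=3) for _ in range(T)]          # stand-ins for s_t
samples = [futures[rng.integers(len(futures))] for _ in range(T)]  # stand-ins for f_t

exact_LL = np.mean([-s @ f - log_Z(s) for s, f in zip(states, samples)])

s_bar = np.mean(states, axis=0)               # E_T[s_t]
phi_bar = np.mean(samples, axis=0)            # E_T[phi(f_t)]
approx_LL = -s_bar @ phi_bar - log_Z(s_bar)   # LL~

print(exact_LL, approx_LL)
```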
The approximate log-likelihood involves several new terms, which we now explain. Consider $E_T[s_t]$. Because this is an unconditional expectation, as $T \to \infty$, this can be interpreted as the stationary distribution of states induced by a particular setting of the parameters of the model. At first glance, this term would appear to defeat the point of our approximations: it appears to depend on $T$ and on the model parameters, which means that we would have to recompute it, at cost $T$, every time the parameters change (as they would inside any sort of optimization loop). Fortunately, because it is the stationary distribution of states, it can be efficiently computed in the case of the Linear-Linear EFPSR as the solution to a linear system of equations, in a way that does not depend on $T$.

The other terms have similar interpretations. $E_T[\phi(f_t)]$ is the empirically observed stationary distribution of features of $n$-step windows of observations. Since it does not depend on the model parameters, it can be computed once at the beginning of learning in a single pass through the data. The quantity $\log Z(E_T[s_t])$ is the log partition function $Z$ computed using the vector $E_T[s_t]$, and can be computed in the same way as the partition function associated with any ordinary state $s_t$.

4.1. Computing the Approximate Likelihood

Can the approximate log-likelihood $\widetilde{LL}$ and its derivatives be computed efficiently? The answer is yes: Appendix A shows that in the case of the Linear-Linear EFPSR, both $\widetilde{LL}$ and the derivative of $\widetilde{LL}$ with respect to the model parameters can be computed efficiently. The computation does not depend on $T$ (the amount of training data), and only involves the solution to two sparse linear systems of equations. Inference must be performed on the graphical model only once. In addition, the expensive matrix-matrix multiplications are completely eliminated.

4.2. Algorithm Summary

Let us pause for a brief summary. The exact log-likelihood $LL$ in Eq. 2 is intractable to maximize. However, we have introduced $\widetilde{LL}$, and shown that it and its derivatives can be computed efficiently. Putting everything together, we see that this learning algorithm is attempting:

- to find a setting of the parameters $A$ and $B$
- which generate a stationary distribution of states $E_T[s_t]$,
- based on a transition operator defined using the stationary distribution of observations $E_T[o_t]$,
- which imply a stationary distribution of features of length-$n$ trajectories $E_{E_T[s_t]}[\phi(F^n \mid h_t)]$ as close as possible to the empirically observed stationary distribution of features of length-$n$ trajectories $E_T[\phi(f_t)]$.

With gradients in hand, any optimization method may be used to find the optimal settings for $A$ and $B$. The final gradient algorithm is shown in Algorithm 2 (in Appendix A), and a simple companion steepest descent optimizer is shown in Algorithm 1.

Algorithm 1 LEARN-EFPSRS-W-APPROX-LL
  Input: $E_T[o_t]$, $E_T[\phi(f_t)]$
  Initialize $A = 0$, $B = 0$.
  repeat
    $(\widetilde{LL}, \nabla_A \widetilde{LL}, \nabla_B \widetilde{LL})$ = GRADS-OF-APPROX-LL($E_T[o_t]$, $E_T[\phi(f_t)]$, $A$, $B$)
    // Use the gradients in an optimizer. Steepest descent would look like this:
    $A = A + \alpha \nabla_A \widetilde{LL}$
    $B = B + \alpha \nabla_B \widetilde{LL}$
  until $\widetilde{LL}$ is maximized
  Return $A$, $B$
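A direct Python transcription of Algorithm 1 might look like the following. Here grads_of_approx_ll stands for Algorithm 2 from Appendix A, and the fixed step size alpha and the convergence test are our additions; the pseudocode leaves the optimizer's details open, and a line search or (L-)BFGS would normally replace the fixed-step ascent.

```python
import numpy as np

# A sketch of Algorithm 1: steepest ascent on the approximate likelihood LL~.
# grads_of_approx_ll is assumed to implement Algorithm 2 (Appendix A).

def learn_efpsr_with_approx_ll(E_o, E_phi, grads_of_approx_ll,
                               k, l, alpha=0.1, iters=1000, tol=1e-6):
    A = np.zeros((k, l))
    B = np.zeros((k, 1))
    prev = -np.inf
    for _ in range(iters):
        LL, dA, dB = grads_of_approx_ll(E_o, E_phi, A, B)
        A = A + alpha * dA          # steepest ascent on LL~
        B = B + alpha * dB
        if abs(LL - prev) < tol:    # stop once LL~ has (approximately) converged
            break
        prev = LL
    return A, B
```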
5. A Low-Rank Parameterization

We briefly turn our attention to the parameter matrices $A$ and $B$. So far, we have implicitly assumed that the matrix $A$ is reasonably sized, but this assumption is false in the case of a large number of features. To clarify this, recall that our state $s_t$ is a vector in $\mathbb{R}^{l \times 1}$, where $l$ is the number of features of the future. When we extend and condition, we implicitly compute $s_t^+$, which is a vector of parameters describing $n+1$ observations: $s_t^+ = A s_t + B$. If we assume that there are $k$ extended features, the $A$ matrix is in $\mathbb{R}^{k \times l}$.

One of the goals of EFPSRs is to be able to use many features in order to capture state. If the number of features $l$ is very large (say, tens of thousands, or even millions), the number of extended features $k$ will be even larger, and the matrix $A$ will be too large to work with. For example, if there are 10,000 features, and if the extended distribution has 15,000 features, then $A \in \mathbb{R}^{15{,}000 \times 10{,}000}$, which is simply too large.

$\widetilde{LL}$ has another property which suggests a solution to this problem: the gradients $\nabla_A \widetilde{LL}$ have a natural rank-one form, and therefore mesh well with singular value decomposition (SVD) update algorithms (Brand, 2006). Instead of maintaining the full matrix $A$, we can maintain a low-rank SVD of $A$. Given the SVD of $A$ and a rank-one gradient update, the parameters of the updated SVD can be efficiently computed. The entire process can be meshed with a rank-aware line search. The advantage is that the full matrix $A$ is never computed, but exact line searches can be conducted. See (Wingate, 2008) for more details.
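The kind of update this enables can be sketched as follows. This is our own simplified rank-one SVD update in the spirit of Brand (2006), not the paper's implementation: $A$ is kept as $U \,\mathrm{diag}(\sigma)\, V^\top$, a rank-one term (the form of $\nabla_A \widetilde{LL}$) is folded into a small $(r+1) \times (r+1)$ core matrix, and the factorization is re-truncated to the rank budget.

```python
import numpy as np

# Simplified rank-one SVD update: fold rho * u v' into A ~= U diag(sig) Vt
# without ever materializing A. Brand (2006) gives a more careful version.

def svd_rank_one_update(U, sig, Vt, u, v, rho, rank):
    V = Vt.T
    m = U.T @ u;  p = u - U @ m;  pn = np.linalg.norm(p)
    n = V.T @ v;  q = v - V @ n;  qn = np.linalg.norm(q)
    P = p / pn if pn > 1e-12 else np.zeros_like(p)
    Q = q / qn if qn > 1e-12 else np.zeros_like(q)
    # Small (r+1)x(r+1) core whose SVD yields the updated factors:
    # A + rho*u v' = [U P] ([diag(sig) 0; 0 0] + rho*[m; pn][n; qn]') [V Q]'
    r = len(sig)
    K = np.zeros((r + 1, r + 1))
    K[:r, :r] = np.diag(sig)
    K += rho * np.outer(np.append(m, pn), np.append(n, qn))
    U2, sig2, V2t = np.linalg.svd(K)
    U_new = np.column_stack([U, P]) @ U2
    V_new = np.column_stack([V, Q]) @ V2t.T
    # Re-truncate so the rank never grows beyond the budget.
    return U_new[:, :rank], sig2[:rank], V_new[:, :rank].T

# Usage: A is never formed; products like A @ s cost O((k + l) * rank).
k, l, rank = 50, 40, 5
U, _ = np.linalg.qr(np.random.randn(k, rank))
Vt = np.linalg.qr(np.random.randn(l, rank))[0].T
sig = np.ones(rank)
u, v = np.random.randn(k), np.random.randn(l)
U, sig, Vt = svd_rank_one_update(U, sig, Vt, u, v, rho=0.1, rank=rank)
```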
6. Experiments and Results

We now evaluate the quality of our approximations. For large problems, we cannot compute the exact likelihoods to compare with, and since we are using approximate likelihoods, it is not clear what a comparison would mean. Instead, we use reinforcement learning to help measure the quality of the model: we use the states generated by the EFPSR as the input to a reinforcement learning planner. We conclude that our model is good if the RL agent is able to use it to generate performance comparable to that of the true model. For comparison, we also tested RL using the raw observations as state (called the "reactive," or first-order Markov policy), and a random policy.

6.1. Planning in the EFPSR

We used the Natural Actor Critic (or NAC) algorithm (Peters et al., 2005) to test our model. NAC requires two things: a stochastic, parameterized policy and the gradients of the log probability of that policy. We used a softmax function of a linear projection of the state: the probability of taking action $a_i$ from state $s_t$ given the policy parameters $\Theta$ is

$$p(a_i; s_t, \Theta) = \frac{\exp\left( \Theta_i^\top s_t \right)}{\sum_{j=1}^{|A|} \exp\left( \Theta_j^\top s_t \right)},$$

where $\Theta$ is to be learned. See (Wingate, 2008) for details.
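A minimal sketch of this policy and the gradient of its log probability that NAC needs, with our own variable names (theta holds one row of weights per action):

```python
import numpy as np

# Softmax policy over linear projections of the EFPSR state, plus grad log p,
# the two ingredients NAC requires.

def policy(theta, s):
    """theta: (num_actions, state_dim); s: (state_dim,). Returns action probs."""
    z = theta @ s
    z -= z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_policy(theta, s, a):
    """grad_theta log p(a; s, theta): for a softmax, row i is (1{i=a} - p_i) * s."""
    p = policy(theta, s)
    g = -np.outer(p, s)
    g[a] += s
    return g

theta = np.zeros((4, 10))     # e.g. 4 actions, 10-dimensional EFPSR state
s = np.random.randn(10)
a = np.random.choice(4, p=policy(theta, s))
g = grad_log_policy(theta, s, a)
```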
6.2. Bouncing Ball

The first test domain is called the Bouncing Ball domain. In this domain, the observations are factored in a way that is closely related to the dynamics of the system. This domain was hand-crafted to be compatible with the EFPSR: the domain has significant structure in the observations, and basically requires the use of a model which is able to capture that structure.

Figure 2. The setup of the bouncing ball problem.

Figure 2 describes the domain pictorially. The left figure shows the ball bouncing. At each timestep, the agent observes an 11x10 array of pixels which may be black or white. One of these pixels represents the "ball," which bounces diagonally around the box (shown as a gray trail in the figure). The agent has two actions: 0 means "do nothing," and 1 means "reverse the direction of the ball." The reward signal is shown in the middle. This domain is episodic: every 50 timesteps, the ball is reset to a random position.

We define three different versions of the domain. In the noiseless version, the agent sees the exact position of the ball. This domain is second-order Markov with $11 \times 10 = 110$ observations. The second version adds a $p = 1\%$ chance of flipping white pixels to black. This domain is no longer second-order Markov, and has $2^{110}$ possible observations. The third version uses $p = 10\%$.

Figure 2 shows the features we used, which are hand-coded to correspond with the known dynamics. We set $n = 2$ and added singleton features for each observation. Pairwise features were added for each variable to its diagonal neighbors in the next timestep (to capture the diagonal motion of the ball). The extended distribution $p(F^3 \mid h_t)$ used quartets consisting of an action and observation at time $t$, and diagonal observations at times $t+1$ and $t+2$. There were 584 features describing $p(F^2 \mid h_t)$ and 1,292 features describing $p(F^3 \mid h_t)$. We used the timeless gradients, the low-rank approximation of $A$, and 100,000 training samples.

Figure 3. Results on the bouncing ball domain.

Figure 3 collects the results. The EFPSR is able to consistently improve over the best reactive policy, generating a policy with 30% higher reward in the noiseless version, a policy with 25% higher reward when $p = 1\%$, and a policy with 13% higher reward when $p = 10\%$. It is an open question as to whether different feature sets would improve these results further.

6.3. Robot Vision Domain

Together, the combination of the Linear-Linear EFPSR, the approximate maximum likelihood objective function, and the low-rank decomposition of the parameter matrix allows experimentation on domains with hundreds of observation variables and tens of thousands of features, which is larger than any other model with a predictive representation of state. Here, we apply the entire suite of techniques to the task of visual navigation, where a robot must navigate a maze using only features of camera images as observations.

Figure 4. Setup of the vision domain.

Figure 4 explains the setup. The latent state space consists of a position $(x, y)$ and an orientation. The experiments used two different maps (bottom left). The agent has four actions: move forward, move backward, turn left and turn right. We tested two kinds of dynamics: in the "coarse" dynamics, the agent took large steps and turns, and in the "fine" dynamics, the agent took small steps and turns.

The initial observations are 64x64 color images, from which binary features are extracted (upper left). We tried two different sets of binary features. The first set consisted of 884 features like edges, corners and colors, and the second feature set was a post-processed version of the first. The idea of the second set was to create higher-order features which represented things like walls and hallways. To do this, images from Maze #1 were clustered according to the latent states, and then the binary features were averaged together to create a sort of filter. New images were tested against each filter, triggering if the response exceeded a threshold. There were 373 of these features. Note that while the images were all taken from Maze #1, they were also used in Maze #2, where the colors, hall geometry, etc. were all different.

We set $n = 3$. For the feature vector $\phi(\cdot)$, we used "streamer features." These connect each observation variable only to its temporal successors (Fig. 4, right). There were between 12,000 and 50,000 total features in the final feature set. We trained on 200,000 samples generated with a random policy. For the NAC parameters, we used a TD rate of 0.85, a stepsize of 10.0, a gradient termination test of 0.001 and a remembering factor of 0.0.

Figure 5. Results on the vision domain.

Figure 5 shows the results. The random policy performed the same in both domains, regardless of map or dynamics. Higher rewards were obtained in general with coarse dynamics, regardless of map, feature set, or learning algorithm (presumably because the agent can reach high-reward regions more quickly).

The difference between the two feature sets is the most interesting result. Using feature set #1, the EFPSR performs just under the performance of the reactive policy, regardless of map or dynamics. Perhaps this means that the EFPSR was unable to capture any meaningful dynamics, and instead learned to predict the identity function, with some noise. This would result in a policy equivalent to the reactive policy. The results are reversed for feature set #2. Here, the EFPSR consistently outperforms the reactive policy.

Together, these observations imply a coherent story. For both feature sets, we used the same set of streamer features.
One plausible explanation for the results is that low-order conjunctions of more abstract features give more modeling benefit than low-order conjunctions of granular, low-level features. It is easy to imagine that low-order conjunctions of granular features are insufficient to capture useful abstract structure in the domain. For example, to represent the corner of a wall, the agent might need a conjunction of 10 features, but we only had fourth-order conjunctions. This was part of the motivation for feature set #2: because the camera images were clustered according to latent states, they were typically images of the same thing, from slightly different positions and angles. Using this feature set, the highest-order conjunction was still four or five, but these conjunctions may represent more abstract knowledge: if one feature represents "pink wall" and another represents "pink corner," perhaps a low-order conjunction could express "I'm looking at a pink wall, but if I turn left, I'll see a pink corner." The idea that low-order conjunctions of more abstract features give more modeling benefit than low-order conjunctions of granular, low-level features suggests several directions for future improvement of these results.

Not reflected in the performance graphs is the computation required. Learning the model was relatively easy, taking only about 30 seconds. Because of the intensive rendering and the relatively large size of the domains, the NAC algorithm required about a day to generate the policies reported. Informal calculations indicated that it would take about a week to compute a single gradient with exact likelihood.

7. Conclusions and Future Work

We have presented a computationally efficient learning algorithm for the Linear-Linear EFPSR model and illustrated it on two domains. Our main contribution is an approximate likelihood, and the insight that maximizing it is equivalent to attempting to match stationary distributions. This idea may find traction in other learning problems. While evaluation of the model and learning algorithm is challenging, it is only by virtue of these approximations that we were able to attempt domains like the Bouncing Ball or the Robot Vision domain at all, which have continuous state spaces, rich observations, and tens of thousands of features. For both domains, we obtained better-than-reactive control policies, suggesting that information from history has successfully been incorporated into the state representation. This is a positive result considering the size of the data set and the number of features involved. Future work needs to address the problem of learning good atomic features and the graphical structure, since these appear to be key factors affecting performance.

A. Computing $\widetilde{LL}$ and Its Derivatives

To compute $\widetilde{LL}$ we must compute three terms: $E_T[s_t]$ (the stationary distribution of states), $E_T[\phi(f_t)]$ (which is computed once from data), and the log partition function $\log Z(E_T[s_t])$.

We begin with $E_T[s_t]$. Recall that our goal is to compute this term in a way that is independent of $T$. This will be possible using Assumption 4.1, the linearity of the state update, and an insight related to stationary distributions:

$$\begin{aligned}
E_T[s_t] &= E_T[G(o_t)(A s_{t-1} + B)] \\
&= E_T[G(o_t) A] \, E_T[s_{t-1}] + E_T[G(o_t)] \, B \\
&= E_T[G(o_t) A] \, E_T[s_t] + B_G \\
&= G(E_T[o_t]) A \, E_T[s_t] + B_G \\
\Rightarrow \; E_T[s_t] &= (I - G(E_T[o_t]) A)^{-1} B_G \qquad (5)
\end{aligned}$$

where $I$ is an appropriately sized identity matrix, and where $B_G = G(E_T[o_t]) B$. The second line follows by Assumption 4.1. The third and fourth lines are both interesting for different reasons. The fourth line follows by the linearity of the operator $G(\cdot)$. The matrix $G(E_T[o_t])$ can be interpreted as the expected transition operator, and is a simple function of the stationary distribution of observations $E_T[o_t]$. The third line follows by the limiting properties of our expectations: we assume that $E_T[s_t] = E_T[s_{t-1}]$ because as $T \to \infty$, both represent the stationary distribution of states.

The result is that $E_T[s_t]$ can be computed as the solution to a linear system of equations. Note that $G(E_T[o_t])$ will typically be very sparse, and a designer may force the $A$ part to be sparse or low-rank. If so, a matrix-vector product can be computed efficiently, and an iterative solver should be used to solve Eq. 5.
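A sketch of this solve, with placeholder matrices standing in for $G(E_T[o_t])$, $A$ and $B$: the system $(I - G(E_T[o_t])A)\, E_T[s_t] = B_G$ is handed to GMRES through a LinearOperator, so only sparse matrix-vector products are needed and $T$ never appears.

```python
import numpy as np
from scipy.sparse import random as sprandom
from scipy.sparse.linalg import gmres, LinearOperator

# Solve Eq. (5), (I - G A) E_T[s_t] = B_G, iteratively so that only
# matrix-vector products with the sparse G and (possibly low-rank) A are needed.

l, k = 200, 250
G = sprandom(l, k, density=0.01, format="csr")  # stands in for G(E_T[o_t])
A = 0.01 * np.random.randn(k, l)                # could also be kept in low-rank form
B = 0.01 * np.random.randn(k, 1)
B_G = G @ B

def matvec(x):
    return x - G @ (A @ x)                      # (I - G A) x without forming G A

op = LinearOperator((l, l), matvec=matvec)
E_s, info = gmres(op, B_G.ravel())
assert info == 0  # converged
```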
EET [st ] [(F )] can be interpreted as the expected features that would be obtained if inference were performed using ET [st ] as the state ­ in other words, it represents the stationary distribution of features under the model. Since ET [(ft )] represents the empirically observed stationary distribution, we see that the gradient wishes to match the two. If we use a variational method to compute the log partition function log Z (ET [st ]), which is needed to determine the value of the log-likelihood, then the expected features EET [st ] [(F )] are available as a byproduct of the optimization. This is a pleasing efficiency. However, we are not done. We still must find the transition parameters which allow us to move the expected sufficient statistics closer: E ET [st ] -1 = (I - G(E[ot ])A) G(E[ot ])A T [st ] A A We now find it convenient to remember that the full derivative also includes the term LL/ ET [st ] , which is a column vector. Let -1 (I - G(E[ot ])A) G(E[ot ]). Then: ET LL = EET [st ] [(F )] - ET [(ft )] ET [st ] The completed algorithm is shown in Algorithm 2. Acknowledgments Both authors were supported by NSF grant IIS0413004. Any opinions, findings, and conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views of the NSF. References Brand, M. (2006). Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications, 415, 20­30. Dagum, P., & Luby, M. (1993). Approximate probabilistic reasoning in bayesian belief networks is NP-hard. Artificial Intel ligence, 60, 141­153. Littman, M. L., Sutton, R. S., & Singh, S. (2002). Predictive representations of state. Neural Information Processing Systems (NIPS) (pp. 1555­1561). Peters, J., Vijayakumar, S., & Schaal, S. (2005). Natural actor-critic. European Conference on Machine Learning (ECML) (pp. 280­291). Rudary, M. R., Singh, S., & Wingate, D. (2005). Predictive linear-Gaussian models of stochastic dynamical systems. Uncertainty in Artificial Intel ligence (pp. 501­508). Wingate, D. (2008). Exponential family predictive representations of state. PhD thesis, University of Michigan. [st ] = AET [st ] = ET [st ] A A