Online Kernel Selection for Bayesian Reinforcement Learning

Joseph Reisinger (joeraii@cs.utexas.edu)
Peter Stone (pstone@cs.utexas.edu)
Risto Miikkulainen (risto@cs.utexas.edu)
Department of Computer Sciences, The University of Texas at Austin, Austin, TX 78712

Abstract

Kernel-based Bayesian methods for Reinforcement Learning (RL) such as Gaussian Process Temporal Difference (GPTD) are particularly promising because they rigorously treat uncertainty in the value function and make it easy to specify prior knowledge. However, the choice of prior distribution significantly affects the empirical performance of the learning agent, and little work has been done extending existing methods for prior model selection to the online setting. This paper develops Replacing-Kernel RL, an online model selection method for GPTD using sequential Monte-Carlo methods. Replacing-Kernel RL is compared to standard GPTD and tile-coding on several RL domains, and is shown to yield significantly better asymptotic performance for many different kernel families. Furthermore, the resulting kernels capture an intuitively useful notion of prior state covariance that may nevertheless be difficult to capture manually.

1. Introduction

Bayesian methods are a natural fit for Reinforcement Learning (RL) because they represent prior knowledge compactly and allow for rigorous treatment of value-function uncertainty. Modeling such uncertainty is important because it offers a principled solution for balancing exploration and exploitation in the environment. One particularly elegant Bayesian RL formulation is Gaussian Process Temporal Difference (GPTD) (Engel et al., 2005). GPTD is an efficient adaptation of Gaussian processes (GPs) to the problem of online value-function estimation. In GPs, prior knowledge in the form of value covariance across states is represented compactly by a Mercer kernel (Rasmussen & Williams, 2006), offering a conceptually simple method for biasing learning.

An important open question for Bayesian RL is how to perform model selection efficiently and online. In GPTD, model selection determines the particular form of the prior covariance function and the settings of any hyperparameters. This paper contributes towards answering this question in two ways: (1) it demonstrates empirically the importance of model selection in Bayesian RL, and (2) it outlines Replacing-Kernel Reinforcement Learning (RKRL), a simple and effective sequential Monte-Carlo procedure for selecting the model online. RKRL not only improves learning in several domains, but does so in a way that cannot be matched by any choice of standard kernels. Although conceptually similar to methods combining evolutionary algorithms and RL (Whiteson & Stone, 2006), RKRL is novel for two reasons: (1) the sequential Monte-Carlo technique employed is simpler and admits a clear empirical Bayesian interpretation (Bishop, 2006), and (2) since GPs are nonparametric, it is possible to replace kernels online during learning without discarding any previously acquired knowledge, simply by maintaining the dictionary of saved training examples between kernel swaps. This online replacing procedure significantly improves performance over previous methods, because learning does not need to start from scratch for each new kernel.

This paper is divided into seven main sections: Section 2 introduces GPTD, Section 3 describes RKRL, Section 4 details the experimental setup using Mountain Car, Ship Steering and Capture Go as example domains, and the last three sections give results, future work and conclusions.
2. Gaussian Process Reinforcement Learning

In RL domains with large or infinite state spaces, function approximation becomes necessary, as it is impractical or impossible to store a table of all state values (Sutton & Barto, 1998). Gaussian Processes (GPs) have emerged as a principled method for solving regression problems in Machine Learning (Rasmussen & Williams, 2006), and have recently been extended to performing function approximation in RL as well (Engel et al., 2005). In this section, we briefly review GPs and their application to temporal difference learning.

GPs are a class of statistical generative models for Bayesian nonparametric inference. Instead of relying on a fixed functional form as in a parametric model, GPs are defined directly in some (infinite-dimensional) function space (Rasmussen & Williams, 2006). Specifically, an indexed collection of random variables $V : \mathcal{X} \rightarrow \mathbb{R}$ over a common probability space is a GP if the distribution of any finite subset of $V$ is Gaussian. A GP is completely specified by its prior mean and covariance functions. In this paper, the mean function is assumed to be identically zero and the prior covariance function is specified as a Mercer kernel $k(\cdot, \cdot)$. Mechanistically, a GP is composed of the set of training data and a prior covariance function (kernel) that defines how to interpolate between those points.

Following the formulation of Engel (2005), consider a statistical generative model of the form
$$R(x) = HV(x) + N(x), \qquad (1)$$
where $V$ is the unknown function to be estimated, $N$ is a noise model, $H$ is a linear transformation, and $R$ is the observed regression function. Given a set of data $\mathcal{D} = \{(x_i, y_i)\}_{i=0}^{t}$, the model reduces to a system of linear equations $R_t = H_t V_t + N_t$, where $R_t = (R(x_0), \ldots, R(x_t))^\top$, $V_t = (V(x_0), \ldots, V(x_t))^\top$, and $N_t = (N(x_0), \ldots, N(x_t))^\top$. Assuming that $V \sim \mathcal{N}(0, K_t)$ is a zero-mean GP with $[K_t]_{i,j} \stackrel{\mathrm{def}}{=} k(x_i, x_j)$ for $x_i, x_j \in \mathcal{D}$, and $N \sim \mathcal{N}(0, \Sigma_t)$, the Gauss-Markov theorem gives the posterior distribution of $V$ conditional on the observed $R$:
$$\hat{V}_t(x) = k_t(x)^\top \alpha_t, \qquad (2)$$
$$P_t(x) = k(x, x) - k_t(x)^\top C_t\, k_t(x), \qquad (3)$$
where
$$\alpha_t \stackrel{\mathrm{def}}{=} H_t^\top (H_t K_t H_t^\top + \Sigma_t)^{-1} r_{t-1}, \qquad C_t \stackrel{\mathrm{def}}{=} H_t^\top (H_t K_t H_t^\top + \Sigma_t)^{-1} H_t,$$
and $k_t(x) = (k(x, x_1), \ldots, k(x, x_t))^\top$. This closed-form posterior can be used to calculate the predicted value of $V$ at a new test point. In RL, the sequence of observed reward values is assumed to be related by some (possibly stochastic) environment dynamics that are captured through the matrix $H$.

In order to adapt GPs to RL, the standard Markov Decision Process (MDP) framework needs to first be formalized as follows. Let $\mathcal{X}$ and $\mathcal{U}$ be the state and action spaces, respectively. Define $R : \mathcal{X} \rightarrow \mathbb{R}$ to be the reward function and let $p : \mathcal{X} \times \mathcal{U} \times \mathcal{X} \rightarrow [0, 1]$ be the state transition probabilities. A policy $\mu : \mathcal{X} \times \mathcal{U} \rightarrow [0, 1]$ is a mapping from states to action-selection probabilities. The discounted return for a state $x$ under policy $\mu$ is defined as $D(x) = \sum_{i=0}^{\infty} \gamma^i R(x_i) \mid x_0 = x$, where $x_{i+1} \sim p^{\mu}(\cdot \mid x_i)$, the policy-dependent state transition probability distribution, and $\gamma \in [0, 1]$ is the discount factor. The goal of RL is to compute a value function that estimates the discounted reward for each state under policy $\mu$, $V(x) = \mathbb{E}_{\mu}[D(x)]$.

GPs can be used to model the latent value function given a sequence of observed rewards and an appropriate noise model. Reward is related to value by $R(x) = V(x) - \gamma V(x') + N(x, x')$, where $x' \sim p^{\mu}(\cdot \mid x)$. Extending this model temporally to a series of states $x_0, x_1, \ldots, x_t$ yields the system of equations $R_{t-1} = H_t V_t + N_t$, where $R$ and $V$ are defined as before, $N_t = (N(x_0, x_1), \ldots, N(x_{t-1}, x_t))^\top$,
$$H_t = \begin{pmatrix} 1 & -\gamma & 0 & \cdots & 0 \\ 0 & 1 & -\gamma & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & -\gamma \end{pmatrix},$$
and $N_t \sim \mathcal{N}(0, \Sigma_t)$.
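To make this model concrete, the following is a minimal batch-form sketch of Equations 1-3: it builds $H_t$ for a single trajectory, assumes the deterministic-transition noise model $\Sigma_t = \sigma^2 I$ introduced immediately below, and inverts the full matrix directly. It is a naive $O(t^3)$ illustration rather than the incremental $O(|\tilde{\mathcal{D}}|^2)$ update GPTD uses in practice, and the function name and arguments are ours, not from the paper.

```python
import numpy as np

def gptd_posterior(states, rewards, k, gamma=1.0, sigma=1.0):
    """Batch GPTD posterior (Equations 1-3) for one trajectory.

    states  : list of state vectors x_0 ... x_t
    rewards : rewards r_0 ... r_{t-1} observed along the trajectory
    k       : kernel function k(x, x') defining the prior covariance
    Returns a function mapping a query state to (posterior mean, variance).
    """
    t = len(states)
    K = np.array([[k(xi, xj) for xj in states] for xi in states])   # [K_t]_ij = k(x_i, x_j)
    H = np.zeros((t - 1, t))
    for i in range(t - 1):                                           # rows (1, -gamma) encode r = V(x) - gamma V(x')
        H[i, i], H[i, i + 1] = 1.0, -gamma
    Sigma = sigma ** 2 * np.eye(t - 1)                               # deterministic-transition noise model
    G = np.linalg.inv(H @ K @ H.T + Sigma)
    alpha = H.T @ G @ np.asarray(rewards, dtype=float)               # coefficients in Equation 2
    C = H.T @ G @ H                                                  # matrix in Equation 3

    def predict(x):
        kx = np.array([k(x, xi) for xi in states])
        return kx @ alpha, k(x, x) - kx @ C @ kx                     # posterior mean and variance
    return predict
```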
If the environment dynamics are assumed to be deterministic, the covariance of the state-dependent noise can be modeled as $\Sigma_t = \sigma^2 I$. For stochastic environments, the noise model $\Sigma_t = \sigma^2 H_{t+1} H_{t+1}^\top$ is more suitable. See Engel (2005) for a complete derivation of these models.

Given $H_t$, $\Sigma_t$, and a sequence of states and reward values, the posterior moments $\hat{V}_t$ and $P_t$ can be computed to yield value-function estimates. Thus GPs fit naturally into the RL framework: learning is straightforward and does not require setting unintuitive parameters such as learning rates or eligibility traces; prior knowledge of the problem can be built in through the covariance function; and the full distribution of the posterior is available, making it possible to select actions in more compelling ways, e.g. via interval estimation. Furthermore, the value estimator $\hat{V}_t$ and covariance $P_t$ can be computed incrementally online as each new state-action pair is sampled, without having to invert a $t \times t$ matrix at each step. For details of this procedure, see (Engel et al., 2005).

One issue with using GPs for RL is that the sizes of $K_t$, $k_t(\cdot)$, $H_t$ and $r_t$ each grow linearly with the number of states visited, yielding a computational complexity of $O(|\mathcal{D}|^2)$ for each step. Since it is not practical to remember every single experience in online settings, the GP dictionary size must be limited in some way. To this end, Engel et al. derive a kernel sparsification procedure based on an approximate linear dependence (ALD) test. As the number of observed training examples tends to infinity, the fraction of examples that need to be saved tends to zero (Engel et al., 2005). A matrix $A_t$ contains approximation coefficients for the ALD test, and a parameter $\nu$ controls how "novel" a particular training example must be before it is remembered by the GP, making it possible to tune how compact and computationally efficient the value-function representation is.

Finally, note that GPTD can be extended to the case where no environment model is available, simply by defining the covariance function over state-action pairs $k(x, u, x', u')$. This procedure is termed GP-SARSA in this paper.

GPTD has been shown to be successful, but in practice its performance relies on a good choice of kernel. The next section focuses on a particular online method for performing such kernel selection.

3. Online Model Selection

A common requirement in RL is that learning take place online, i.e. the learner must maximize the total reward accrued during learning. However, traditional model selection techniques applied to GPs, such as cross-validation or Bayesian Model Averaging, are not designed to address this constraint. The main contribution of this paper is to introduce Replacing-Kernel Reinforcement Learning (RKRL), an online procedure for model selection in RL. In Section 3.1 an online sequential Monte-Carlo method is developed; Section 3.2 describes how it is used to implement RKRL.

3.1. Sequential Monte-Carlo Methods

Given a set of kernels $\{k_\theta(\cdot, \cdot) \mid \theta \in \mathcal{M}\}$ parameterized by a random variable $\theta$ and a prior $p(\theta)$ over these parameterizations, a fully Bayesian approach to learning involves integrating over all possible settings of $\theta$, yielding the posterior distribution
$$p(R \mid \mathcal{D}) = \int p(R \mid V, \mathcal{D}, \theta)\, p(V \mid \mathcal{D}, \theta)\, p(\theta)\, dV\, d\theta.$$
The integration over $V$ given $\theta$ is carried out implicitly when using GPs; however, the remaining integral over $\theta$ is generally intractable for all but the simplest cases. Instead of integrating over all possible model settings $\theta$, we can use the data distribution $\mathcal{D}$ to infer reasonable settings for $\theta$ via $p(\theta \mid \mathcal{D})$. Such evidence approximation is more computationally efficient and is an example of an empirical Bayes approach, where likelihood information is used to guide prior selection (Bishop, 2006). Monte-Carlo methods can be used to sample from $p(\theta \mid \mathcal{D})$; however, such methods assume that this distribution is stationary (Bishop, 2006). In the RL case, stationarity implies that when evaluating a new parameterization, previous data acquired while evaluating other parameterizations cannot be used.
To avoid this inefficiency, we instead employ a sequential Monte-Carlo (SMC) method adapted from Gordon et al. (1993) that relaxes the stationarity assumption. SMC approximates the posterior distribution $p(\theta \mid \mathcal{D})$ at time $t$ empirically via a set of $n$ samples and weights $\{(\theta_i^{(t)}, w_i^{(t)})\}_{i=1}^{n}$, where
$$w_i^{(t)} \stackrel{\mathrm{def}}{=} \frac{p(\mathcal{D} \mid \theta_i^{(t)})\, p(\theta_i^{(t)})}{\sum_m p(\mathcal{D} \mid \theta_m^{(t)})\, p(\theta_m^{(t)})}, \qquad (4)$$
where $p(\mathcal{D} \mid \theta_i^{(t)})$ is the likelihood of $\theta_i^{(t)}$ and $p(\theta_i^{(t)})$ is the prior. Inference proceeds sequentially, with samples for time $t + 1$ drawn from the empirical distribution
$$p(\theta^{(t+1)} \mid \mathcal{D}) = \sum_l w_l^{(t)}\, p(\theta^{(t+1)} \mid \theta_l^{(t)}), \qquad (5)$$
where $p(\theta^{(t+1)} \mid \theta^{(t)})$ is the transition kernel, defining how the hyperparameter space should be explored. In this paper, the prior over models $p(\theta)$ is the uniform distribution over $[0, 1]$ for each of the kernel hyperparameters (listed in Table 1), and the transition kernel is defined mechanistically as $\theta^{(t+1)} = \tilde{\theta}^{(t+1)} + (c_0 \epsilon_0, \ldots, c_k \epsilon_k)$, where $c_k \sim \mathrm{Bernoulli}(\mu)$ and $\epsilon_k \sim \mathcal{N}(0, 1)$. Pseudocode for this procedure is given in Algorithm 1.

Algorithm 1: Sequential Monte Carlo
  Parameters: n, µ
  1: Draw $\{\theta_i^{(0)}\}_{i=1}^{n} \sim p(\theta)$
  2: for $t = 0, 1, \ldots$ do
  3:   Calculate $\{w_i^{(t)}\}_{i=1}^{n}$ from Equation 4.
  4:   Draw $\{\tilde{\theta}_i^{(t+1)}\}_{i=1}^{n}$ by resampling $\{(\theta_i^{(t)}, w_i^{(t)})\}_{i=1}^{n}$.
  5:   $\theta_i^{(t+1)} \leftarrow \tilde{\theta}_i^{(t+1)} + (c_0 \epsilon_0, \ldots, c_k \epsilon_k)$, where $c_k \sim \mathrm{Bernoulli}(\mu)$ and $\epsilon_k \sim \mathcal{N}(0, 1)$.
  6: end for

3.2. Replacing-Kernel Reinforcement Learning

In Replacing-Kernel Reinforcement Learning (RKRL), SMC is used to select good kernel hyperparameter settings. Rather than calculating the true model likelihood $p(\mathcal{D} \mid \theta_i)$ in Equation 4, RKRL instead weights models based on their relative predictive likelihood $\tilde{p}(\mathcal{D} \mid \theta_i)$, where
$$\log \tilde{p}(\mathcal{D} \mid \theta_i) \stackrel{\mathrm{def}}{=} \beta^{-1} \sum_t r_t,$$
and $(r_0, r_1, \ldots)$ is the sequence of rewards obtained by evaluating the hyperparameter setting $\theta_i$ for $\ell$ episodes. The parameter $\beta$ is introduced to control how strongly model search should focus on hyperparameter settings that yield high reward. Maximizing predictive ability directly is preferable, as it is more closely related to the goal of learning than maximizing the fit to the observed data. In tabular methods these two approaches indeed coincide in the limit of large data; however, when using function approximation they may differ.

When using GPTD, the current value-function estimate is formed from the combination of the kernel parameterization determining the prior covariance function and the dictionary $\tilde{\mathcal{D}} \subseteq \mathcal{D}$ gathered incrementally from observing state transitions. In this paper we consider two variants of RKRL, Standard RKRL and Experience-Preserving RKRL (EP-RKRL), that differ in their treatment of the saved experience $\tilde{\mathcal{D}}$. In Standard RKRL, $\tilde{\mathcal{D}}$ is discarded at the start of each new kernel evaluation (making $p(\theta \mid \mathcal{D})$ stationary). In contrast, in EP-RKRL each kernel parameterization sample $\theta^{(t)}$ inherits $\tilde{\mathcal{D}}$ from the sample $\theta^{(t-1)}$ that generated it in Equation 5. (For efficiency, the sufficient statistics $\tilde{\alpha}$ and $\tilde{C}$ for sparsified GP-SARSA are also inherited, though they can be recalculated. The matrix of approximation coefficients $A_t$ is not recalculated, although doing so should lead to better performance in general.) RKRL naturally spends more time evaluating hyperparameter settings that correspond to areas with high predictive likelihood, i.e., it maximizes online reward. Each sampling step increases information about the predictive likelihood in the sample (exploitation), while sampling from the transition kernel reduces such information (exploration).
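As an illustration of how Algorithm 1 and the predictive likelihood fit together, the sketch below runs one possible outer loop of RKRL-style SMC. The `evaluate` callback (which would run GP-SARSA for $\ell$ episodes with the given hyperparameters and return its per-episode rewards), the epoch count, and the treatment of the prior terms as constant under the uniform prior are all assumptions made for illustration.

```python
import numpy as np

def rkrl_smc(evaluate, dim, n=25, mu=0.01, beta=0.5, epochs=50, rng=None):
    """Sketch of Algorithm 1 using RKRL's reward-based predictive likelihood.

    evaluate(theta) -> sequence of per-episode rewards obtained by running
                       the learner for one epoch with hyperparameters theta.
    dim             -> number of kernel hyperparameters.
    """
    rng = rng or np.random.default_rng()
    thetas = rng.uniform(0.0, 1.0, size=(n, dim))      # initial samples from the uniform prior
    for _ in range(epochs):
        returns = np.array([np.sum(evaluate(th)) for th in thetas])
        log_w = returns / beta                          # log p~(D|theta) = beta^{-1} * sum_t r_t
        w = np.exp(log_w - log_w.max())                 # normalized weights (Equation 4, priors treated as constant)
        w /= w.sum()
        idx = rng.choice(n, size=n, p=w)                # resample proportionally to the weights
        thetas = thetas[idx]
        mask = rng.random(thetas.shape) < mu            # perturb each hyperparameter with probability mu
        thetas = thetas + mask * rng.normal(size=thetas.shape)
    return thetas
```

In EP-RKRL, the resampled indices would also carry along each sample's saved dictionary and sufficient statistics rather than resetting them between evaluations.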
Table 1. Basic kernel functions and the corresponding extended parameterizations (each entry gives $k(x, x')$).

  KERNEL        BASIC                                      EXTENDED
  NORM          $1 - \|x - x'\|^2 / \sigma^2$              $1 - \sum_i w_i (x_i - x'_i)^2$
  GAUSSIAN      $\exp(-\|x - x'\|^2 / 2\sigma^2)$          $\exp(-\sum_i w_i (x_i - x'_i)^2)$
  POLYNOMIAL    $(\langle x, x' \rangle + 1)^d$            $(\sum_i w_i x_i x'_i + 1)^d$
  TANH NORM     $\tanh(v \|x - x'\|^2 - c)$                $\tanh(\sum_i w_i (x_i - x'_i)^2 - 1)$
  TANH DOT      $\tanh(v \langle x, x' \rangle - c)$       $\tanh(\sum_i w_i x_i x'_i - 1)$

4. Experimental Setup

Standard RKRL and EP-RKRL are compared against GP-SARSA on three domains. This section gives the parameter settings, kernel classes and domains used.

4.1. Parameters

In all experiments, the TD discount factor was fixed at $\gamma = 1.0$ and $\epsilon$-greedy action selection was employed with $\epsilon = 0.01$. The GP-SARSA parameters for prior noise variance ($\sigma^2$) and dictionary sparsity ($\nu$) were $\sigma^2 = 1.0$ and $\nu = 0.001$. For each RKRL evaluation, GP-SARSA is run for $\ell$ episodes using the specified kernel parameterization. The RKRL parameters were set to $n = 25$, $\mu = 0.01$ and $\beta = 0.5$. Performance of RKRL is insensitive to doubling $\nu$ or $\mu$. Higher settings of $n$ improve the initial performance, but reduce the total number of epochs possible given a fixed number of episodes. The setting of $\beta$ significantly impacts performance, although the main results of this paper are insensitive for $0.25 \leq \beta \leq 1.0$. All kernels are extended to functions of both the state and action, with actions treated as extra state variables.

4.2. Kernels

Although RKRL automates the choice of kernel hyperparameters, there is still a need to choose a set of kernels that represents the search space for RKRL. General kernel classes are derived from basic classes commonly found in the literature (Table 1) by replacing the standard inner products and norms with weighted variants (cf. automatic relevance determination), yielding kernel classes with significantly more hyperparameters. Setting these hyperparameters is the model selection task; as more hyperparameters are added, the model becomes more general, but the corresponding difficulty of inferring the model parameters increases as well. In order to give a fair baseline GP-SARSA comparison, the best hyperparameter setting for each basic kernel class was derived manually for each domain using grid search. Note that although they are common, the hyperbolic tangent kernels are not positive semi-definite; they nevertheless yield good performance in practice (Smola & Schölkopf, 2004).
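To make the extended parameterizations of Table 1 concrete, the sketch below implements the weighted GAUSSIAN and TANH NORM kernels over concatenated state-action vectors. The helper names and the example weights are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def extended_gaussian(w):
    """Weighted (ARD-style) Gaussian kernel from Table 1: exp(-sum_i w_i (x_i - x'_i)^2)."""
    w = np.asarray(w, dtype=float)
    return lambda x, xp: np.exp(-np.sum(w * (np.asarray(x) - np.asarray(xp)) ** 2))

def extended_tanh_norm(w):
    """Weighted TANH NORM kernel from Table 1: tanh(sum_i w_i (x_i - x'_i)^2 - 1)."""
    w = np.asarray(w, dtype=float)
    return lambda x, xp: np.tanh(np.sum(w * (np.asarray(x) - np.asarray(xp)) ** 2) - 1.0)

def state_action(x, u):
    """Treat the action as extra state variables, as described in Section 4.1."""
    return np.concatenate([np.asarray(x, dtype=float), np.atleast_1d(np.asarray(u, dtype=float))])

# Example: a kernel over Mountain Car state-action pairs with per-dimension weights.
k = extended_gaussian(w=[0.5, 0.8, 0.1])   # weights for (position, velocity, action) are hypothetical
value_cov = k(state_action([-0.5, 0.01], 1), state_action([-0.4, 0.02], 0))
```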
4.3. Test Domains

GP-SARSA, RKRL and EP-RKRL are compared across three domains: Mountain Car, Ship Steering and Capture Go. Each domain highlights a different aspect of complexity found in RL problems: Mountain Car and Ship Steering have continuous state spaces and thus require function approximation, Ship Steering also has a large (discrete) action space, and Capture Go is stochastic with a high-dimensional state space.

4.3.1. Mountain Car

In Mountain Car, the learning agent must drive an underpowered car up a steep hill (Sutton & Barto, 1998). The available actions are $a \in \{-1, 0, 1\}$, i.e., brake, neutral and accelerate. The state $x_t = (x_t, \dot{x}_t) \in \mathbb{R}^2$ is comprised of the position and velocity. The environment is deterministic, with state evolution governed by
$$\dot{x}_{t+1} = \dot{x}_t + 0.001\, a_t - 0.0025 \cos(3 x_t), \qquad x_{t+1} = x_t + \dot{x}_{t+1},$$
where $-1.2 \leq x \leq 0.5$ and $|\dot{x}| \leq 0.07$. Reward is $-1$ for each time step the car has not passed the goal at $x = 0.5$. In all RKRL experiments with Mountain Car, $\ell = 100$ (100 episodes per epoch) and each episode is limited to 1000 steps to reduce computation time.
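A minimal sketch of these transition equations, assuming simple clipping at the position and velocity bounds (the boundary handling is not specified above) and using our own function name and episode interface:

```python
import math

def mountain_car_step(x, x_dot, a):
    """One deterministic Mountain Car transition (Section 4.3.1)."""
    x_dot = x_dot + 0.001 * a - 0.0025 * math.cos(3.0 * x)
    x_dot = max(-0.07, min(0.07, x_dot))   # velocity bound |x_dot| <= 0.07
    x = max(-1.2, min(0.5, x + x_dot))     # position bound -1.2 <= x <= 0.5
    done = x >= 0.5                        # goal at x = 0.5
    return (x, x_dot), (0.0 if done else -1.0), done
```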
4.3.2. Ship Steering

In Ship Steering, the learning agent must properly orient a sailboat to a specific heading and travel as fast as possible (White, 2007). Actions are two-dimensional rudder position (degrees) and thrust (Newtons), $a_t = (r_t, T_t) \in [-90, 90] \times [-1, 2]$. Possible rudder settings are discretized at 3-degree increments and thrust increments are 0.5 Newtons, yielding 427 possible actions. The state is a 3-tuple consisting of the heading, angular velocity and velocity, $x_t = (\theta_t, \dot{\theta}_t, \dot{x}_t) \in \mathbb{R}^3$. State evolution is described by
$$\dot{x}_{t+1} = \dot{x}_t + \tfrac{1}{250}\left(30 T_t - 2\dot{x}_t - 0.03\, \dot{x}_t^2 (5\dot{\theta}_t + r_t)\right), \qquad \dot{\theta}_{t+1} = \dot{\theta}_t + \frac{\dot{x}_t r_t + \dot{x}_t}{1000}, \qquad \theta_{t+1} = \theta_t + 0.5(\dot{\theta}_{t+1} + \dot{\theta}_t).$$
Reward at time step $t$ is equal to $\dot{x}_t$ if $|\theta_t| < 5$ and zero otherwise. In all RKRL experiments with Ship Steering, $\ell = 1$. By comparing results in Ship Steering to Mountain Car, it is possible to elucidate how the learner's performance depends on the size of the action space.

4.3.3. Capture Go

The third domain used in this paper is Capture Go, a simplified version of Go in which the first player to make a capture wins (see http://www.usgo.org/teach/capturegame.html). The learner plays against a fixed random opponent on a 5 × 5 board, and receives a reward of $-1$ for a loss and $+1$ for a win. The board state $x_t \in \{-1, 0, 1\}^{25}$ is encoded as a vector where $-1$ entries correspond to opponent pieces, $0$ entries correspond to blank territory and $1$ entries correspond to the agent's pieces. The agent is given knowledge of afterstates, that is, knowledge of how its moves affect the state. In all RKRL experiments using Capture Go, $\ell = 1000$. This domain was chosen because it has a high-dimensional state vector and stochastic dynamics.

5. Results

GP-SARSA, RKRL and EP-RKRL were applied to three RL domains. Section 5.1 summarizes asymptotic performance in the three domains, Section 5.2 compares asymptotic dictionary sizes, Section 5.3 evaluates the learned kernel performance as a stand-alone static kernel, and Section 5.4 analyzes the learned kernel hyperparameter settings in Capture Go.

5.1. Asymptotic Reward

Asymptotic performance is evaluated across three domains: Mountain Car, Ship Steering and Capture Go. In each domain EP-RKRL significantly outperforms both GP-SARSA and RKRL over most kernel classes.

5.1.1. Mountain Car

In Mountain Car, learning trials are run for 125,000 episodes and asymptotic performance is measured as the average reward over the last 100 episodes. EP-RKRL significantly outperforms both GP-SARSA and RKRL across all kernel classes asymptotically (Table 2). GP-SARSA performance using the POLYNOMIAL kernel reaches a peak at -51.6 after 33 episodes, which is significantly better than EP-RKRL ($p < 10^{-5}$). However, performance degrades significantly with more episodes. This is a phenomenon common to neural-network based function approximators (Sutton & Barto, 1998).

Table 2. Asymptotic performance on Mountain Car. Bold numbers represent statistical significance.

  KERNEL        GP-SARSA        RKRL           EP-RKRL
  POLYNOMIAL    -67.6 ± 0.3     -149 ± 5.5     -63.7 ± 0.3
  GAUSSIAN      -230 ± 16       -521 ± 84      -66.9 ± 0.9
  TANH NORM     -638 ± 72       -569 ± 41      -130 ± 8.7
  TANH DOT      -482 ± 37       -532 ± 113     -97.0 ± 2.0

The Mountain Car problem has been studied extensively in the RL literature. The best asymptotic results from the Reinforcement Learning Library stand at -53.92 (White, 2007). NEAT+Q, a similar method for combining TD and evolutionary algorithms, achieves -52.0 (Whiteson & Stone, 2006). However, in the former case different parameter values are used, and in the latter case the learner is run for significantly more episodes, making direct comparison difficult. Running Mountain Car using a standard tile-coding function approximator (Sutton & Barto, 1998) with 8 tilings and the same RL parameter settings yields asymptotic performance of -108.9, significantly better than GP-SARSA across all kernels except POLYNOMIAL, but significantly worse than EP-RKRL under all kernels except TANH NORM.

Note that EP-RKRL significantly outperforms RKRL because it discards less experience over the course of learning.
Since kernels can only describe smoothness properties of the value function, the data points themselves become more important for learning; hence discarding them at each model selection step significantly reduces performance. This contrasts with Whiteson's NEAT+Q work precisely because neural networks are more expressive.

5.1.2. Ship Steering

In Ship Steering, each learner trains for 2500 episodes (1000 steps each) and asymptotic performance is measured as the average reward obtained in the last 10 episodes. EP-RKRL significantly outperforms GP-SARSA and RKRL in all kernel classes except GAUSSIAN (Table 3). In the remaining three cases, however, EP-RKRL outperforms both methods by several orders of magnitude. Tile coding with 8 tilings yields asymptotic performance of 0.17, significantly higher than GP-SARSA in all cases except for the POLYNOMIAL kernel class ($p < 10^{-9}$), but significantly worse than EP-RKRL in all cases.

Table 3. Asymptotic performance on Ship Steering (values are scaled by a factor of 100).

  KERNEL        GP-SARSA       RKRL          EP-RKRL
  POLYNOMIAL    34.0 ± 1.0     3.3 ± 0.7     171 ± 24
  GAUSSIAN      2.5 ± 0.3      5.0 ± 0.6     112.9 ± 9.1
  TANH NORM     2.1 ± 0.8      4.5 ± 0.7     662 ± 183
  TANH DOT      2.9 ± 0.6      3.0 ± 0.8     19.5 ± 15.3

5.1.3. Capture Go

In Capture Go, each learner trains for $3.75 \times 10^6$ episodes, and asymptotic performance is measured as the average number of wins over the last 1000 episodes. EP-RKRL outperforms GP-SARSA and RKRL across all kernel classes (Table 4). GP-SARSA's average reward peaks early and declines under the TANH DOT kernel, achieving a maximum of 78.7% wins after 10,000 episodes, still significantly lower than EP-RKRL ($p < 10^{-6}$).

Table 4. Asymptotic performance (% wins) on Capture Go.

  KERNEL        GP-SARSA       RKRL          EP-RKRL
  NORM          90.9 ± 0.2     76.1 ± 4.4    94.3 ± 0.5
  POLYNOMIAL    89.7 ± 0.4     69.5 ± 1.1    92.6 ± 1.3
  GAUSSIAN      90.3 ± 0.5     78.3 ± 0.7    93.3 ± 0.1
  TANH NORM     55.7 ± 0.2     78.7 ± 3.7    94.5 ± 0.6
  TANH DOT      62.4 ± 2.8     70.5 ± 1.5    89.1 ± 1.1

5.2. Dictionary Size

RKRL and GP-SARSA can be compared in terms of computational complexity by measuring the final dictionary sizes $|\tilde{\mathcal{D}}|$ of each learning agent. At each decision point, the computational complexity of GPTD is $O(|\tilde{\mathcal{D}}|^2)$, arising from matrix-vector multiplications and partitioned matrix inversion (Engel, 2005). Furthermore, in practice the $O(|\tilde{\mathcal{D}}|)$ cost of computing $k(x) \stackrel{\mathrm{def}}{=} (k(x, x_1), \ldots, k(x, x_t))^\top$ for $x_i \in \tilde{\mathcal{D}}$ can carry a high constant overhead for complex kernels. Thus, keeping $|\tilde{\mathcal{D}}|$ small is critical for online performance.

The dictionary sizes for each kernel and learning algorithm pair are given in Tables 5-7. In eight of the thirteen cases, EP-RKRL kernels generate significantly smaller dictionaries for $\nu = 0.001$ than GP-SARSA kernels, and likewise in ten of the thirteen cases RKRL generates significantly smaller dictionaries. Thus, in the majority of cases, employing model selection yields faster learning both in terms of episodes and in terms of computation. However, these dictionary sizes never differ by more than a single order of magnitude, with the largest difference being between RKRL and GP-SARSA under the TANH DOT kernel for Capture Go.

Table 5. Asymptotic dictionary size for Mountain Car. Bold numbers indicate statistical significance.

  KERNEL        GP-SARSA       RKRL          EP-RKRL
  POLYNOMIAL    21.6 ± 0.4     10.3 ± 0.5    13.6 ± 0.2
  GAUSSIAN      29.6 ± 0.5     7.2 ± 0.6     12.5 ± 0.2
  TANH NORM     2.5 ± 0.1      8.5 ± 0.8     12.6 ± 0.2
  TANH DOT      7.7 ± 0.7      5.1 ± 0.4     11.7 ± 0.4

Table 6. Asymptotic dictionary size for Ship Steering.

  KERNEL        GP-SARSA       RKRL          EP-RKRL
  POLYNOMIAL    15.3 ± 0.8     4.0 ± 0.1     6.2 ± 0.3
  GAUSSIAN      12.3 ± 0.5     15.5 ± 0.3    13.7 ± 0.1
  TANH NORM     3.8 ± 0.6      5.1 ± 0.2     12.3 ± 1.0
  TANH DOT      7.7 ± 0.8      3.2 ± 0.1     5.3 ± 0.2

Table 7. Asymptotic dictionary size for Capture Go.

  KERNEL        GP-SARSA        RKRL          EP-RKRL
  NORM          28.5 ± 0.2      5.9           3.0
  POLYNOMIAL    147.1 ± 3.3     25.2 ± 1.4    40.4 ± 1.4
  GAUSSIAN      66.1 ± 0.5      71.7 ± 3.4    91.9 ± 6.1
  TANH NORM     62.0 ± 1.1      19.6 ± 0.4    17.5 ± 0.7
  TANH DOT      329.6 ± 14.4    25.6 ± 1.5    28.7 ± 1.2
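For context on how these dictionary sizes arise, the following is a sketch of the approximate-linear-dependence (ALD) novelty test described in Section 2, the mechanism that governs dictionary growth in sparsified GPTD (Engel et al., 2005). The bookkeeping here (a caller-maintained dictionary list and inverse kernel matrix) is an illustrative simplification, not the exact GP-SARSA implementation.

```python
import numpy as np

def ald_admit(dictionary, K_inv, x, k, nu=0.001):
    """ALD test: admit x only if it is sufficiently novel relative to the
    current dictionary, with novelty threshold nu."""
    if not dictionary:
        return True, np.array([[1.0 / k(x, x)]])
    k_vec = np.array([k(x, xi) for xi in dictionary])
    a = K_inv @ k_vec                       # best linear reconstruction coefficients
    delta = k(x, x) - k_vec @ a             # residual reconstruction error
    if delta <= nu:
        return False, K_inv                 # x is well approximated; keep the dictionary fixed
    # Grow the inverse kernel matrix with the new point (partitioned inverse update).
    n = len(dictionary)
    K_new = np.zeros((n + 1, n + 1))
    K_new[:n, :n] = K_inv + np.outer(a, a) / delta
    K_new[:n, n] = K_new[n, :n] = -a / delta
    K_new[n, n] = 1.0 / delta
    return True, K_new
```

When the test admits a point, the caller appends it to the dictionary; in EP-RKRL it is exactly this dictionary (and the associated sufficient statistics) that is carried across kernel swaps.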
5.3. Generalization

How well a particular kernel hyperparameter setting found by EP-RKRL performs depends on what training examples it encounters during learning. Determining to what degree this performance depends on the exact set of saved training examples yields a notion of how general the kernel parameterization is. In order to evaluate this generalization in EP-RKRL, final kernel parameterizations for Capture Go were saved and all stored training examples were discarded. Learning was then restarted with the kernel parameterization fixed. Table 8 summarizes the results. Across all kernels, the asymptotic performance decreases significantly. Because most training examples are acquired in the first several hundred episodes, this result indicates that the kernel hyperparameter settings are perhaps overfitting to the particular dictionary of saved experience. In other words, RKRL is exploiting the acquired data to pick a highly biased covariance function that has low generalization error given that particular set of stored experience.

Table 8. Generalization performance of EP-RKRL kernel parameterizations for Capture Go. In all cases asymptotic performance declines after the original dictionary is discarded.

  KERNEL        BASELINE       RELEARNED
  NORM          94.3 ± 0.5     68.6 ± 1.7
  POLYNOMIAL    92.6 ± 1.3     72.5 ± 0.8
  GAUSSIAN      93.3 ± 0.1     81.1 ± 1.0
  TANH NORM     94.5 ± 0.6     90.7 ± 1.9
  TANH DOT      89.1 ± 1.1     74.6 ± 0.7

Although kernels learned through RKRL overfit the dictionary, this is not a serious problem, as it improves generalization performance. Overfitting in this nonparametric case simply means that the learned parameterizations are not transferable between agents with different experience. Thus the saved dictionary should be considered part of the model parameters being optimized.

5.4. Analysis of Learned Kernels

Because the hyperparameter space for kernels in Capture Go is high-dimensional, RKRL can exploit many kinds of symmetries and patterns in the value function. It is enlightening to analyze whether the learned kernel parameter settings correspond to intuitively meaningful covariance functions. Figure 1 plots the 25 hyperparameter values for the TANH NORM kernel averaged over the entire sample for both RKRL and EP-RKRL. The average pattern of weights at the final epoch differs significantly from the expected average over the prior $p(\theta)$. Furthermore, the pattern learned in EP-RKRL has a regular structure: each weight corresponding to a particular board location takes the opposite sign of its neighbors, yielding a regular "checkerboard" pattern of positive and negative weights. This pattern enforces a simple strategy whereby the learner plays to capture isolated stones. Such a strategy is effective against opponents that do not pay attention to stones in danger of being captured.

Figure 1. "Hinton diagram" of the average learned hyperparameters by board position for the TANH NORM kernel in Capture Go. Filled boxes correspond to positive weights and white boxes to negative; box area is proportional to the weight. The parameterization generated by EP-RKRL shows a significant amount of structure, biasing play towards moves that surround single stones.

To further elucidate this result, a second trial was run using the same kernel class, but with only two hyperparameters, corresponding to the positive and negative parameter settings observed above. This translation-invariant TANH NORM kernel, $k(x, x') = \tanh\left(\sum_{i=0}^{25} w_{i \bmod 2} (x_i - x'_i)^2 - 1\right)$, can express the same alternating positive and negative weights in a more compact form, thus trading off generality compared to the original parameterization. Under the same experimental setup, the translation-invariant kernel only reaches an asymptotic performance of 88.6% wins, compared to 94.5% wins with the more general parameterization ($p < 10^{-7}$). Furthermore, the resulting dictionary size of the translation-invariant kernel is 32.4, significantly larger than that of the more general parameterization (17.4; $p < 0.001$). The general TANH NORM kernel parameterization also outperforms a variant with built-in rotational symmetry: the rotationally symmetric TANH NORM kernel obtains asymptotic performance of 89.3% wins ($p < 10^{-4}$). Taken together, these results highlight some of the difficulties of manually building in prior knowledge.
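The two-hyperparameter, translation-invariant variant compared above can be written compactly as follows. The example weights are hypothetical rather than learned values, and the row-major flattening convention for the 5 × 5 board is an assumption.

```python
import numpy as np

def translation_invariant_tanh_norm(w_even, w_odd):
    """Two-hyperparameter TANH NORM kernel: tanh(sum_i w_{i mod 2} (x_i - x'_i)^2 - 1).

    Alternating weights approximate the learned "checkerboard" pattern with far
    fewer hyperparameters than the fully weighted kernel.
    """
    def k(x, xp):
        d2 = (np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)) ** 2
        w = np.where(np.arange(d2.size) % 2 == 0, w_even, w_odd)   # w_{i mod 2}
        return np.tanh(np.sum(w * d2) - 1.0)
    return k

# Hypothetical usage on two flattened 5x5 Capture Go board vectors (entries in {-1, 0, 1}).
k = translation_invariant_tanh_norm(w_even=0.7, w_odd=-0.3)        # example weights only
similarity = k(np.zeros(25), np.eye(5).flatten())
```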
6. Related and Future Work

This paper has presented an efficient and conceptually simple online method for selecting kernels in Bayesian RL. There is a growing body of model selection literature in machine learning and statistics, both on theory and applications, e.g. (Hastie et al., 2001; Seeger, 2001). In RL, model selection has been performed previously using regularization (Jung & Polani, 2006; Loth et al., 2007) and evolutionary methods (Whiteson & Stone, 2006). RKRL is most similar to the latter approach, differing in its use of GPs, its ability to save training data across lineages, and its simpler hyperparameter optimization procedure.

There are several interesting areas of future work. First, RKRL can be naturally applied to more general kernel classes, e.g. the Matérn kernels, that admit many basic kernels as special cases (Genton, 2002). In particular, the tradeoff between kernel class complexity and performance should be explored. Second, the learned kernel parameterizations acquired under one dictionary do not perform well under different dictionaries, indicating that the dictionaries themselves can be thought of as hyperparameters to be optimized. Developing such a theory of nonparametric sparsification in RL may lead to significantly better value-function approximations. Third, it is possible to derive a full kernel-replacing procedure where all covariance function evaluations share the same accumulated learning, updated after every fitness evaluation of every individual. Such an approach would further reduce the sample complexity of RKRL, and would also make possible the use of Bayesian Model Averaging techniques within a single learning agent, i.e. averaging over an ensemble of GP value functions weighted by their predictive likelihoods (Hastie et al., 2001). Finally, there is a deep connection between action kernels and the concept of Relocatable Action Models (Leffler et al., 2007). It may be possible to cast the latter in terms of state-independent prior covariance functions, yielding a powerful framework for model selection.
7. Conclusion

This paper developed RKRL, a simple online procedure for improving the performance of Gaussian process temporal difference learning by automatically selecting the prior covariance function. In several empirical trials, RKRL yielded significantly higher asymptotic reward than the best hand-picked parameterizations for common covariance functions, even in cases where a large number of hyperparameters must be adapted. Furthermore, the learned covariance functions exhibit highly structured knowledge of the task that would have been difficult to build in a priori without significant knowledge of the domain. Overall these initial results are promising, and suggest that leveraging work in statistical model selection will significantly improve online learning.

Acknowledgements

Thanks to the anonymous reviewers for providing very poignant feedback on the original draft. Thanks also to Yaakov Engel for help with some implementation details, and to Tobias Jung and Bryan Silverthorn for helpful discussions. This work was supported in part by an NSF Graduate Research Fellowship to the first author and NSF CAREER award IIS-0237699.

References

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Engel, Y. (2005). Algorithms and representations for reinforcement learning. Doctoral dissertation, Hebrew University.

Engel, Y., Mannor, S., & Meir, R. (2005). Reinforcement learning with Gaussian processes. Proceedings of ICML-05 (pp. 201-208). New York, NY, USA: ACM Press.

Genton, M. G. (2002). Classes of kernels for machine learning: A statistics perspective. Journal of Machine Learning Research, 2, 299-312.

Gordon, N. J., Salmond, D. J., & Smith, A. F. M. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F (Radar and Signal Processing), 140, 107-113.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning. Springer.

Jung, T., & Polani, D. (2006). Least squares SVM for least squares TD learning. ECAI (pp. 499-503). IOS Press.

Leffler, B. R., Littman, M. L., & Edmunds, T. (2007). Efficient reinforcement learning with relocatable action models. Proceedings of AAAI-07 (pp. 572-577). Menlo Park, CA, USA: The AAAI Press.

Loth, M., Davy, M., & Preux, P. (2007). Sparse temporal difference learning using LASSO. IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning. Hawaii, USA.

Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Adaptive Computation and Machine Learning. Cambridge, MA, USA: MIT Press.

Seeger, M. (2001). Covariance kernels from Bayesian generative models. NIPS (pp. 905-912). MIT Press.

Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14, 199-222.

Sutton, R. S., & Barto, A. G. (1998). Introduction to reinforcement learning. Cambridge, MA, USA: MIT Press.

White, A. (2007). The University of Alberta Reinforcement Learning Library. http://rlai.cs.ualberta.ca/RLR/. Edmonton, Alberta: University of Alberta.

Whiteson, S., & Stone, P. (2006). Evolutionary function approximation for reinforcement learning. Journal of Machine Learning Research, 7, 877-917.