Bayesian Multi-Task Reinforcement Learning

Alessandro Lazaric  alessandro.lazaric@inria.fr
Mohammad Ghavamzadeh  mohammad.ghavamzadeh@inria.fr
INRIA Lille - Nord Europe, Team SequeL, 40 avenue Halley, 59650 Villeneuve d'Ascq, France

Abstract

We consider the problem of multi-task reinforcement learning where the learner is provided with a set of tasks, for which only a small number of samples can be generated for any given policy. As the number of samples may not be enough to learn an accurate evaluation of the policy, it would be necessary to identify classes of tasks with similar structure and to learn them jointly. We consider the case where the tasks share structure in their value functions, and model this by assuming that the value functions are all sampled from a common prior. We adopt the Gaussian process temporal-difference value function model and use a hierarchical Bayesian approach to model the distribution over the value functions. We study two cases, where all the value functions belong to the same class and where they belong to an undefined number of classes. For each case, we present a hierarchical Bayesian model, and derive inference algorithms for (i) joint learning of the value functions, and (ii) efficient transfer of the information gained in (i) to assist learning the value function of a newly observed task.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).
1. Introduction

Multi-task learning (MTL) is an important learning paradigm and has recently been an area of active research in machine learning (e.g., Caruana 1997; Baxter 2000; Yu et al. 2005; Xue et al. 2007; Bonilla et al. 2008). A common setup is that there are multiple related tasks for which we are interested in improving the performance over individual learning by sharing information across the tasks. This transfer of information is particularly important when we are provided with only a limited amount of data for each task. Exploiting data from related problems provides more training samples for the learner and can improve the performance of the resulting solution.

Reinforcement learning (RL) algorithms (Sutton & Barto, 1998) often need a large number of samples to solve a problem and cannot directly take advantage of information coming from other, similar tasks. Nonetheless, recent work has shown that transfer and multi-task learning techniques can be employed in RL to reduce the number of samples needed to achieve nearly-optimal solutions. All approaches to multi-task RL (MTRL) assume that the tasks share similarity in some component of the problem, such as the dynamics, the reward structure, or the value function. While some methods explicitly assume that the shared components are drawn from a common generative model (Wilson et al., 2007; Mehta et al., 2008), this assumption is more implicit in others (Taylor et al., 2007; Lazaric et al., 2008). In Mehta et al. (2008), tasks share the same dynamics and reward features, and differ only in the weights of the reward function. The proposed method initializes the value function for a new task using the previously learned value functions as a prior. In Wilson et al. (2007), the distribution over the dynamics and the reward functions of the tasks is drawn from a hierarchical Bayesian model (HBM). Due to its similarity to our work, we discuss this method in more detail in Section 5. Lazaric et al. (2008) implicitly assume that the tasks are drawn from a common distribution. They propose a method to selectively transfer samples from source tasks to a target task, based on the likelihood of the target samples being generated by the models built for the source tasks. Finally, in Taylor et al. (2007), learning the value function of the target task is expedited using the solution learned in a source task with related, but different, state and action spaces.

In this paper, we study the MTRL scenario in which the learner is provided with a number of tasks with common state and action spaces. For any given policy, only a small number of samples can be generated in each task, which may not be enough to accurately evaluate the policy. In such an MTRL problem, it is necessary to identify classes of tasks with similar structure and to learn them jointly. In our work, we consider a particular class of MTRL problems in which the tasks share structure in their value functions. To allow the value functions to share a common structure, we assume that they are all sampled from a common prior. We adopt the Gaussian process temporal-difference (GPTD) value function model (Engel et al., 2005) for each task, model the distribution over the value functions using an HBM, and develop solutions to (i) joint learning of the value functions, and (ii) efficient transfer of the information acquired in (i) to facilitate learning the value function of a newly observed task. We refer to these problems as symmetric and asymmetric multi-task learning, respectively.

In Section 3, we present an HBM for the case in which all the value functions belong to the same class, and derive an EM algorithm to find MAP estimates of the value functions and the model's hyper-parameters. However, as pointed out in Caruana (1997) and Baxter (2000), if the functions do not belong to the same class, simply learning them together can be detrimental (negative transfer). It is therefore important to have models that generally benefit from related tasks and do not hurt performance when the tasks are unrelated. This is particularly important in RL, since changing the policy at each step of the policy iteration algorithm can change the way tasks cluster together: even if we start with value functions belonging to the same class, after one iteration the new value functions may be clustered into several classes. In Section 4, we introduce a Dirichlet process (DP) based HBM for the case in which the value functions belong to an undefined number of classes, and derive inference algorithms for both the symmetric and the asymmetric scenario. In Section 5, we discuss similarities and differences with closely related work. In Section 6, we report and analyze experimental results.
2. Preliminaries

The agent-environment interaction in RL is conventionally modelled as a Markov decision process (MDP). An MDP is a tuple $\mathcal{M} = \langle \mathcal{X}, \mathcal{A}, R, P \rangle$, where $\mathcal{X}$ and $\mathcal{A}$ are the state and action spaces, respectively; $R$ is the probability distribution over rewards; and $P$ is the transition probability distribution. A stationary policy $\pi : \mathcal{X} \times \mathcal{A} \to [0, 1]$ is a mapping from states to action selection probabilities. The MDP controlled by a policy $\pi$ induces a Markov chain with transition probability distribution $P^\pi(x'|x) = \int_{\mathcal{A}} P(x'|x, a)\,\pi(a|x)\,da$. Given a policy $\pi$, the (possibly discounted, $\gamma \in [0, 1)$) return for a state $x$, $D^\pi(x)$, is a random process defined by
$$D^\pi(x) = \sum_{t=0}^{\infty} \gamma^t R(x_t)\,\Big|\, x_0 = x, \qquad x_{t+1} \sim P^\pi(\cdot|x_t).$$
The value function $V^\pi(x)$ is the expected value of $D^\pi(x)$, where the expectation is over all possible trajectories and the rewards collected along them. A key problem in RL is to learn the value function of a given policy, which is called policy evaluation (Sutton & Barto, 1998). Loosely speaking, in policy evaluation the goal is to find a "close enough" approximation $\hat{V}$ of the value function $V^\pi$. Unlike in supervised learning, the target function $V^\pi$ is not known in advance and its values have to be inferred from the observed rewards. Therefore, it is necessary to define a stochastic generative model connecting the underlying hidden value function with the observed rewards. In this paper, we adopt the GPTD value function model proposed in Engel et al. (2005), in which the discounted return $D$ is decomposed into its mean $V$ and a random zero-mean residual $\Delta V$, i.e., $D(x) = V(x) + \Delta V(x)$. Combining this decomposition with the Bellman equation, we get
$$R(x) = V(x) - \gamma V(x') + \varepsilon(x, x'), \qquad x' \sim P^\pi(\cdot|x), \qquad (1)$$
where $\varepsilon(x, x') \stackrel{\mathrm{def}}{=} \Delta V(x) - \gamma \Delta V(x')$. Suppose we are provided with a set of samples $\mathcal{D} = \{(x_n, x'_n, r_n)\}_{n=1}^N$, where $r_n$ and $x'_n$ are the reward and the next state observed by following policy $\pi$ in state $x_n$, respectively. By writing the model of Eq. (1) w.r.t. these samples, we obtain $R = HV + E$, where $H \in \mathbb{R}^{N \times 2N}$ and
$$R = (r_n)_{n=1}^N; \qquad E = \big(\varepsilon(x_n, x'_n)\big)_{n=1}^N; \qquad V = \big(V(x_1), V(x'_1), \ldots, V(x_N), V(x'_N)\big)^\top;$$
$$H = \begin{pmatrix} 1 & -\gamma & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 1 & -\gamma & \cdots & 0 & 0 \\ \vdots & & & & \ddots & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 & -\gamma \end{pmatrix}.$$
Note that if the samples are generated from a single trajectory, then $x'_n = x_{n+1}$ and $H$ takes the form defined by Eq. (2.7) in Engel et al. (2005). In order to specify a complete probabilistic generative model connecting values and rewards, we need to define a prior distribution for the value function $V$ and the distribution of the noise $\varepsilon$. Similar to Engel et al. (2005), we model the value function as a Gaussian process (GP), and the noise vector as $E \sim \mathcal{N}(0, S)$, where $S$ is the noise covariance matrix modelling the correlation of the noise between different states. In the following we write $S = \sigma^2 P$, where $\sigma^2$ and $P$ are the variance and the correlation matrix of the noise, respectively. For a more extended discussion of different noise models, we refer the reader to Section 8.4 of Lazaric & Ghavamzadeh (2010). The value function $V$ may be represented in either parametric or non-parametric form. In this paper we use the parametric representation to make the formulation easier to follow, but all the results can be extended to the non-parametric case following steps similar to Section 5.2 of Yu et al. (2005). In the parametric form, the value function is represented by a finite set of $d$ features $\phi(\cdot) = \big(\phi_1(\cdot), \ldots, \phi_d(\cdot)\big)^\top$ and a weight vector $w = (w_1, \ldots, w_d)^\top$ as $V(\cdot) = \phi(\cdot)^\top w$. The randomness in $V$ is now due to $w$ being a random vector with Gaussian prior $w \sim \mathcal{N}(\mu, \Sigma)$. The model equation now becomes $R = H\Phi^\top w + E$, where $\Phi = \big[\phi(x_1), \phi(x'_1), \ldots, \phi(x_N), \phi(x'_N)\big]$ is a $d \times 2N$ matrix.
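To make the sample-based model concrete, the following sketch (not part of the original paper; all function and variable names are illustrative) builds $R$, $H$, and $\Phi$ from a set of transitions and a feature map, following the definitions above. It is a minimal Python/NumPy illustration under the assumption of one reward per transition.

```python
import numpy as np

def build_gptd_matrices(samples, phi, gamma):
    """Build R, H, and Phi for the sampled Bellman model R = H Phi^T w + E.

    samples : list of (x_n, x_next_n, r_n) transitions collected under the policy
    phi     : feature map, phi(x) -> ndarray of shape (d,)
    gamma   : discount factor in [0, 1)
    """
    N = len(samples)
    d = phi(samples[0][0]).shape[0]
    R = np.array([r for (_, _, r) in samples])     # rewards, shape (N,)
    H = np.zeros((N, 2 * N))                       # Bellman operator acting on the 2N value entries
    Phi = np.zeros((d, 2 * N))                     # features of x_n and x'_n, shape (d, 2N)
    for n, (x, x_next, _) in enumerate(samples):
        H[n, 2 * n] = 1.0
        H[n, 2 * n + 1] = -gamma
        Phi[:, 2 * n] = phi(x)
        Phi[:, 2 * n + 1] = phi(x_next)
    return R, H, Phi

# Example with polynomial features on X = [-1, 1], as in the regression experiments of Section 6.1
phi = lambda x: np.array([1.0, x, x**2, x**3, x**4, x**5])
rng = np.random.default_rng(0)
samples = [(x, x + 0.1, rng.normal()) for x in rng.uniform(-1, 1, size=10)]
R, H, Phi = build_gptd_matrices(samples, phi, gamma=0.95)
print(H.shape, Phi.shape)   # (10, 20) (6, 20)
```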
[Figure 1. Graphical representations of (a) the single-task model, an extension of GPTD obtained by defining a hyper-prior over the parameters, (b) the single-class multi-task model of Section 3, and (c) the multi-class multi-task model of Section 4.]

Fig. 1(a) shows the graphical representation of the model used for single-task learning (STL) in the experiments of this paper. It is an extension of the original GPTD model obtained by defining normal-inverse-Wishart and inverse-Gamma hyper-priors, parametrized by $\psi_0 = (\mu_0, k_0, \nu_0, \Sigma_0, \alpha_0, \beta_0)$, over the model parameters $(\mu, \Sigma, \sigma^2)$. This allows us to optimize the model parameters given the data.

In the MTRL setting of this paper, the learner is provided with $M$ tasks or MDPs with common state and action spaces, $\mathcal{M}_m = \langle \mathcal{X}, \mathcal{A}, R_m, P_m \rangle$, $m = 1, \ldots, M$. Given a fixed policy, $N$ samples are generated in each task, i.e., $\mathcal{D}_m = \{(x_{mn}, x'_{mn}, r_{mn})\}_{n=1}^N$, which may not be enough to obtain an accurate evaluation of the policy. We consider the case in which the tasks share structure in their value functions. In the parametric value function model discussed above, this can be interpreted as the value functions sharing the same feature space, with their weight vectors sampled independently from a common prior, i.e., $V_m(\cdot) = \phi(\cdot)^\top w_m$ with $w_m \sim \mathcal{N}(\mu, \Sigma)$. In the next two sections, we study two different scenarios: 1) all the tasks belong to the same class, i.e., they share the same prior, and 2) the tasks can be clustered into an undefined number of classes.

3. Single-class Multi-task Learning

In this section, we consider the case where all the tasks belong to the same class, i.e., they share the same distribution over their value functions, $w_m \sim \mathcal{N}(\mu, \Sigma)$, $m = 1, \ldots, M$, and the same observation noise $\sigma^2$. The goal in the symmetric form of this problem is to estimate $\{w_m\}_{m=1}^M$ from the data $\{\mathcal{D}_m\}_{m=1}^M$, whereas in the asymmetric case we are interested in estimating the parameters $\theta = (\mu, \Sigma, \sigma^2)$ from the data in order to use them as a prior for a newly observed task (e.g., task $\mathcal{M}_{M+1}$). We use a parametric HBM for this problem. HBMs allow us to model both the individuality of the tasks and the correlation between them. In HBMs, individual models with task-specific parameters are usually located at the bottom, and at the layer above, tasks are connected together via a common prior placed over those parameters. Learning the common prior is part of the training process, in which data from all the tasks contribute to learning, thus making it possible to share information between the tasks, usually via sufficient statistics. Given the learned prior, individual models are then learned independently. As a result, learning in each task is affected both by its own data and by the data from the other tasks, related through the common prior.

3.1. The Model

We assume normal-inverse-Wishart and inverse-Gamma hyper-priors for $(\mu, \Sigma)$ and $\sigma^2$, respectively:
$$p(\theta|\psi_0) = p(\mu, \Sigma) \times p(\sigma^2) = \mathcal{N}(\mu;\, \mu_0, \Sigma/k_0)\,\mathcal{IW}(\Sigma;\, \nu_0, \Sigma_0) \times \mathcal{IG}(\sigma^2;\, \alpha_0, \beta_0). \qquad (2)$$
These distributions are the conjugate priors for the multivariate Gaussian distributions $p(w_m|\mu, \Sigma)$ and $p(R_m|w_m, \sigma^2) = \mathcal{N}(H\Phi_m^\top w_m, \sigma^2 P)$, respectively. This leads to the following generative model for the data $\{\mathcal{D}_m\}$ (a code sketch of this process is given below the list). Fig. 1(b) shows the graphical representation of this model; its details can be found in Lazaric & Ghavamzadeh (2010).

Single-Class Model: Given the hyper-parameters $\psi_0 = (\mu_0, k_0, \nu_0, \Sigma_0, \alpha_0, \beta_0)$,
1. The parameters $\theta = (\mu, \Sigma, \sigma^2)$ are sampled once from the hyper-prior as in Eq. (2),
2. For each task $\mathcal{M}_m$ (value function $V_m$), the weight vector is sampled as $w_m \sim \mathcal{N}(\mu, \Sigma)$,
3. Given $\{(x_{mn}, x'_{mn})\}_{n=1}^N$, $R_m = H\Phi_m^\top w_m + E$, where $E \sim \mathcal{N}(0, \sigma^2 P)$, $m = 1, \ldots, M$.
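The generative process of the Single-Class Model can be illustrated with a short simulation. The sketch below is not from the paper; it assumes the hyper-parameters are passed as a tuple psi0 = (mu0, k0, nu0, Sigma0, alpha0, beta0) and, purely for brevity, that all tasks share the same H, Phi, and P. It relies on scipy.stats for the inverse-Wishart and inverse-Gamma draws.

```python
import numpy as np
from scipy.stats import invwishart, invgamma

def sample_single_class_tasks(M, psi0, H, Phi, P, rng):
    """Sample M task reward vectors from the Single-Class Model of Section 3.1.

    psi0 = (mu0, k0, nu0, Sigma0, alpha0, beta0); nu0 >= d and Sigma0 positive
    definite are assumed so that the sampled Sigma is a valid covariance.
    """
    mu0, k0, nu0, Sigma0, alpha0, beta0 = psi0
    # Step 1: draw theta = (mu, Sigma, sigma2) once from the NIW x IG hyper-prior (Eq. 2)
    Sigma = invwishart.rvs(df=nu0, scale=Sigma0, random_state=rng)
    mu = rng.multivariate_normal(mu0, Sigma / k0)
    sigma2 = invgamma.rvs(alpha0, scale=beta0, random_state=rng)
    tasks = []
    for _ in range(M):
        # Step 2: task-specific weight vector w_m ~ N(mu, Sigma)
        w = rng.multivariate_normal(mu, Sigma)
        # Step 3: rewards R_m = H Phi^T w + E, with E ~ N(0, sigma2 * P)
        E = rng.multivariate_normal(np.zeros(P.shape[0]), sigma2 * P)
        tasks.append(H @ Phi.T @ w + E)
    return tasks, (mu, Sigma, sigma2)
```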
3.2. Inference

This model can be learned by optimizing the penalized likelihood $p\big(\{R_m\}\,|\,\{(x_{mn}, x'_{mn})\}, \theta\big)\,p(\theta)$ w.r.t. the parameters $\theta = (\mu, \Sigma, \sigma^2)$ using an EM algorithm. In the rest of the paper, we refer to this algorithm as SCMTL, for single-class multi-task learning.

E-step: The posterior distribution of the latent variables, $p(\{w_m\}|\{\mathcal{D}_m\}, \theta)$, is a product of $M$ Gaussian posterior distributions $p(w_m|\mathcal{D}_m, \theta) = \mathcal{N}(\mu'_{0m}, \Sigma'_{0m})$, one per task, whose mean and covariance are
$$\mu'_{0m} = \Sigma'_{0m}\Big(\frac{1}{\sigma^2}\,\Phi_m H^\top P^{-1} R_m + \Sigma^{-1}\mu\Big), \qquad \Sigma'_{0m} = \Big(\frac{1}{\sigma^2}\,\Phi_m H^\top P^{-1} H \Phi_m^\top + \Sigma^{-1}\Big)^{-1}.$$

M-step: We optimize $\theta$ to maximize the penalized expected log-likelihood of the complete data, $\log p\big(\{\mathcal{D}_m\}, \{w_m\}\,|\,\theta\big)$, over the posterior distribution estimated in the E-step, and obtain the new parameters
$$\mu_{\mathrm{new}} = \frac{1}{M + k_0}\Big(k_0\mu_0 + \sum_{m=1}^M \mu'_{0m}\Big),$$
$$\Sigma_{\mathrm{new}} = \frac{1}{M + \nu_0 + d + 2}\Big[k_0(\mu_{\mathrm{new}} - \mu_0)(\mu_{\mathrm{new}} - \mu_0)^\top + \Sigma_0 + \sum_{m=1}^M \big((\mu'_{0m} - \mu_{\mathrm{new}})(\mu'_{0m} - \mu_{\mathrm{new}})^\top + \Sigma'_{0m}\big)\Big],$$
$$\sigma^2_{\mathrm{new}} = \frac{1}{MN + 2(1 + \alpha_0)}\Big[2\beta_0 + \sum_{m=1}^M \Big(\mathrm{tr}\big(P^{-1} H \Phi_m^\top \Sigma'_{0m} \Phi_m H^\top\big) + \big(R_m - H\Phi_m^\top\mu'_{0m}\big)^\top P^{-1}\big(R_m - H\Phi_m^\top\mu'_{0m}\big)\Big)\Big].$$
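A compact sketch of one SCMTL iteration is given below. It is an illustrative reading of the E-step and M-step equations above, not the authors' implementation; it assumes each task is given as a pair (Phi_m, R_m), that all tasks share H and P, and it omits numerical safeguards (e.g., regularizing the matrix inverses) for brevity.

```python
import numpy as np

def em_iteration(tasks, theta, psi0, H, P):
    """One SCMTL iteration: E-step posteriors N(mu'_0m, Sigma'_0m), then M-step updates.

    tasks : list of (Phi_m, R_m); theta = (mu, Sigma, sigma2);
    psi0  : (mu0, k0, nu0, Sigma0, alpha0, beta0).
    """
    mu, Sigma, sigma2 = theta
    mu0, k0, nu0, Sigma0, alpha0, beta0 = psi0
    d, M = mu.shape[0], len(tasks)
    Pinv, Sigma_inv = np.linalg.inv(P), np.linalg.inv(Sigma)

    # E-step: Gaussian posterior over each w_m given the current parameters
    posteriors = []
    for Phi_m, R_m in tasks:
        A = Phi_m @ H.T                                            # d x N
        Cov = np.linalg.inv(A @ Pinv @ A.T / sigma2 + Sigma_inv)
        mean = Cov @ (A @ Pinv @ R_m / sigma2 + Sigma_inv @ mu)
        posteriors.append((mean, Cov))

    # M-step: closed-form MAP updates of (mu, Sigma, sigma2)
    mu_new = (k0 * mu0 + sum(m for m, _ in posteriors)) / (M + k0)
    S = k0 * np.outer(mu_new - mu0, mu_new - mu0) + Sigma0
    S += sum(np.outer(m - mu_new, m - mu_new) + C for m, C in posteriors)
    Sigma_new = S / (M + nu0 + d + 2)
    resid = 0.0
    for (Phi_m, R_m), (m_post, C) in zip(tasks, posteriors):
        A = H @ Phi_m.T                                            # N x d
        r = R_m - A @ m_post
        resid += np.trace(Pinv @ A @ C @ A.T) + r @ Pinv @ r
    n_total = sum(R_m.shape[0] for _, R_m in tasks)                # equals M * N here
    sigma2_new = (2 * beta0 + resid) / (n_total + 2 * (1 + alpha0))
    return (mu_new, Sigma_new, sigma2_new), posteriors
```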
4. Multi-class Multi-task Learning

In this section, we consider the case where the tasks belong to an undefined number of classes. Tasks in the same class $\{\mathcal{M}_m \,|\, c_m = c\}$ share the same distribution over their value functions, $w_m \sim \mathcal{N}(\mu_c, \Sigma_c)$, and the same observation noise $\sigma^2_c$. We use a nonparametric HBM for this problem, in which the common prior is drawn from a Dirichlet process (DP). The DP is powerful enough to model the parameters of the individual classes, to fit them well without any assumption about the functional form of the prior, and to automatically learn the number of underlying classes.

4.1. The Model

We place a $DP(\alpha, G_0)$ prior over the class assignment and the class parameters. The concentration parameter $\alpha$ and the base distribution $G_0$ can be considered as priors over the number of classes and the class parameters $\theta_c = (\mu_c, \Sigma_c, \sigma^2_c)$, respectively. $G_0$ is the product of a $d$-dimensional normal-inverse-Wishart and a one-dimensional inverse-Gamma distribution, with parameters $\psi_0 = (\mu_0, k_0, \nu_0, \Sigma_0, \alpha_0, \beta_0)$ (see Eq. 2). We employ the stick-breaking representation of the DP prior (Ishwaran & James, 2001), and define a task-to-class assignment variable $(c_{m1}, \ldots, c_{m\infty})$ for each task $m$, whose elements are all zero except that the $c$-th element is equal to one if task $m$ belongs to class $c$. Given the above, the data $\{\mathcal{D}_m\}$ can be seen as drawn from the following generative model, whose graphical representation is shown in Fig. 1(c) (a code sketch of the first two steps follows the list).

Multi-Class Model: Given the hyper-parameters $(\alpha, \psi_0)$,
1. Stick-breaking view: Draw $v_c$ from the Beta distribution $Be(1, \alpha)$, set $\pi_c = v_c \prod_{i=1}^{c-1}(1 - v_i)$, and independently draw $\theta_c \sim G_0$, $c = 1, \ldots, \infty$,
2. Task-to-class assignment: Draw the indicator $(c_{m1}, \ldots, c_{m\infty})$ from a multinomial distribution $\mathcal{M}(1;\, \pi_1, \ldots, \pi_\infty)$, $m = 1, \ldots, M$,
3. The weight vector is sampled as $w_m \sim \mathcal{N}(\mu_{c_m}, \Sigma_{c_m})$, $m = 1, \ldots, M$,
4. Given $\{(x_{mn}, x'_{mn})\}_{n=1}^N$, $R_m = H\Phi_m^\top w_m + E$, where $E \sim \mathcal{N}(0, \sigma^2_{c_m} P)$, $m = 1, \ldots, M$.
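The first two steps of the Multi-Class Model (stick-breaking weights and task-to-class assignments) can be simulated with a finite truncation of the DP. The sketch below is illustrative and is not part of the paper, which works with the untruncated representation analytically; the truncation level C_max is an assumption of the example.

```python
import numpy as np
from scipy.stats import invwishart, invgamma

def sample_class_structure(M, alpha, psi0, rng, C_max=50):
    """Draw class parameters and task-to-class assignments from a truncated
    stick-breaking representation of DP(alpha, G0) (Section 4.1, steps 1-2)."""
    mu0, k0, nu0, Sigma0, alpha0, beta0 = psi0
    # Step 1: stick-breaking weights pi_c = v_c * prod_{i<c} (1 - v_i), with v_c ~ Be(1, alpha)
    v = rng.beta(1.0, alpha, size=C_max)
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    pi /= pi.sum()                      # renormalize the truncated weights
    # ... and class parameters theta_c ~ G0 (NIW x IG)
    thetas = []
    for _ in range(C_max):
        Sigma_c = invwishart.rvs(df=nu0, scale=Sigma0, random_state=rng)
        mu_c = rng.multivariate_normal(mu0, Sigma_c / k0)
        sigma2_c = invgamma.rvs(alpha0, scale=beta0, random_state=rng)
        thetas.append((mu_c, Sigma_c, sigma2_c))
    # Step 2: task-to-class indicators c_m ~ Multinomial(1; pi_1, ..., pi_C)
    c = rng.choice(C_max, size=M, p=pi)
    # Step 3 would then draw w_m ~ N(mu_{c_m}, Sigma_{c_m}) for each task, as in Section 3
    return c, thetas
```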
4.2. Inference

We are interested in the posterior distribution of the latent variables $Z = \big(\{w_m\}, \{c_m\}, \{\theta_c\}\big)$ given the observed data and the hyper-parameters $\alpha$ and $\psi_0$, i.e., $p\big(Z|\{\mathcal{D}_m\}, \alpha, \psi_0\big) \propto p\big(\{\mathcal{D}_m\}|Z, \alpha, \psi_0\big)\,p(Z|\alpha, \psi_0)$. In the following, we outline the main steps of the algorithm used to solve this inference problem, which we refer to as MCMTL, for multi-class multi-task learning (see Fig. 2). MCMTL combines the SCMTL algorithm of Section 3.2, used for class-parameter estimation, with a Gibbs sampling algorithm for learning the class assignments (Neal, 2000). The main advantage of this combination is that at each iteration, given the current estimate of the weights, we can take advantage of the conjugate priors to derive an efficient Gibbs sampling procedure. More formally, given an arbitrary initial class assignment $\{c_m\}$, a distinct EM algorithm is run on each class $c = 1, \ldots, C$ (with $C$ the current estimate of the number of classes) and returns $M$ distributions $\mathcal{N}(\mu'_{0m}, \Sigma'_{0m})$. Given the weights estimated at the previous step, $w_m = \mu'_{0m}$, the Gibbs sampler solves the DP inference by drawing samples from the posterior distribution $p\big(\{c_m\}|\{R_m\}, \{w_m\}, \alpha, \psi_0\big)$. In particular, the state of the Markov chain simulated by the Gibbs sampler is the class assignment $\{c_m\}$, i.e., the vector of the classes to which each task belongs. At each iteration, each component $c_m$ of the state is updated by sampling from the following distribution:

if $c = c_{m'}$ for some $m' \neq m$:
$$p\big(c_m = c \,|\, \{c_{m'}\}, R_m, w_m, \alpha, \psi_0\big) = b\,\frac{M_{-m,c}}{M - 1 + \alpha}\int p\big(R_m, w_m|\theta_c\big)\,p\big(\theta_c|\{c_{m'}\}, \psi_0\big)\,d\theta_c,$$
else:
$$p\big(c_m \neq c_{m'},\ \forall m' \neq m \,|\, \{c_{m'}\}, R_m, w_m, \alpha, \psi_0\big) = b\,\frac{\alpha}{M - 1 + \alpha}\int p\big(R_m, w_m|\theta\big)\,p(\theta|\psi_0)\,d\theta, \qquad (3)$$

where $M_{-m,c}$ is the number of tasks in class $c$ excluding task $m$, and $b$ is a normalizing constant. The first term in Eq. (3) is the probability of task $m$ belonging to an existing class $c$, while the second term is the probability of assigning task $m$ to a new class. Thanks to the conjugate base distribution $G_0$, the integrals in Eq. (3) can be solved analytically. In fact,
$$p\big(R_m, w_m|\theta\big)\,p(\theta|\psi_0) = p\big(R_m|w_m, \sigma^2\big)\,p(w_m|\mu, \Sigma)\,p\big(\mu, \Sigma|\mu_0, k_0, \nu_0, \Sigma_0\big)\,p\big(\sigma^2|\alpha_0, \beta_0\big) \propto \mathcal{N}(\mu;\, \mu'_0, \Sigma/k'_0)\,\mathcal{IW}(\Sigma;\, \nu'_0, \Sigma'_0) \times \mathcal{IG}(\sigma^2;\, \alpha'_0, \beta'_0), \qquad (4)$$
where $\psi'_0 = (\mu'_0, k'_0, \nu'_0, \Sigma'_0, \alpha'_0, \beta'_0)$ are the posterior parameters of $G_0$ given the weight $w_m$ and the rewards $R_m$ (see Lazaric & Ghavamzadeh 2010 for their definition). Using the posterior hyper-parameters, the second integral in Eq. (3) can be written as
$$\int p\big(R_m, w_m|\theta\big)\,p(\theta|\psi_0)\,d\theta = \pi^{-d/2}\Big(\frac{k_0}{k'_0}\Big)^{d/2}\frac{|\Sigma_0|^{\nu_0/2}}{|\Sigma'_0|^{\nu'_0/2}}\,\frac{\Gamma_d(\nu'_0/2)}{\Gamma_d(\nu_0/2)}\,(2\pi)^{-N/2}|P|^{-1/2}\,\frac{\beta_0^{\alpha_0}}{\beta_0'^{\,\alpha'_0}}\,\frac{\Gamma(\alpha'_0)}{\Gamma(\alpha_0)}. \qquad (5)$$
In the first integral of Eq. (3), the density $p\big(\theta_c|\{c_{m'}\}, \psi_0\big)$ is the posterior probability over the class parameters $\theta_c$ given the data from all the tasks belonging to class $c$ according to the current class assignment $\{c_{m'}\}$. Similar to Eq. (4), we compute the posterior hyper-parameters $\psi'_{0c}$ of the normal-inverse-Wishart and inverse-Gamma distributions given $\{w_{m'}\}$ and $\{R_{m'}\}$, with $m' \neq m$ and $c_{m'} = c$. Finally, the integral can be calculated analytically as in Eq. (5), where the hyper-parameters $\psi_0$ and the posterior hyper-parameters $\psi'_0$ are replaced by $\psi_{0c}$ and $\psi'_{0c}$, respectively.

[Figure 2. The inference algorithm for the multi-class multi-task learning (MCMTL) scenario.]
MCMTL({R_m}, alpha, psi_0)
  Initialize {c_m}
  repeat
    for c = 1, ..., C do
      Initialize theta_c
      repeat
        for m : c_m = c do
          p(w_m | R_m, theta_c) = N(mu'_0m, Sigma'_0m)   (E-step)
        end for
        Optimize theta_c                                 (M-step)
      until convergence
    end for
    Set w_m = mu'_0m, m = 1, ..., M
    Sample {c_m} from p({c_m} | {w_m}, {R_m}, alpha, psi_0)
  until convergence
  return {w_m} and {c_m}

4.3. Symmetric vs. Asymmetric Learning

The MCMTL algorithm returns both the distribution over the weights of each task and the learned hierarchical model (the task-class assignments). While the former can be used to evaluate the learning performance in the symmetric case, the latter provides a prior for learning a new task in the asymmetric scenario.

Symmetric Learning. According to the generative model of Section 4.1, the task weights are distributed according to the normal distribution $\mathcal{N}(\mu'_{0m}, \Sigma'_{0m})$, where $\mu'_{0m}$ and $\Sigma'_{0m}$ are the posterior mean and covariance of the weight vector $w_m$ returned by the MCMTL algorithm. Since $V_m(x) = \phi(x)^\top w_m$, the value of $V_m$ at a test state $x_*$ is distributed as
$$p\big(V_m(x_*)\,|\,x_*, \mu'_{0m}, \Sigma'_{0m}\big) = \mathcal{N}\big(\phi(x_*)^\top\mu'_{0m},\ \phi(x_*)^\top\Sigma'_{0m}\phi(x_*)\big).$$
If MCMTL successfully clusters the tasks, we expect the value function prediction to be more accurate than when learning each task independently.

Asymmetric Learning. In the asymmetric setting, the class of the new task is not known in advance. The inference problem is formalized as $p\big(w_{M+1}|R_{M+1}, \psi_0, \{c_m\}_{m=1}^M\big)$, where $w_{M+1}$ and $R_{M+1}$ are the weight vector and the rewards of the new task $\mathcal{M}_{M+1}$, respectively. Similar to Section 4.2, this inference problem cannot be solved in closed form, and thus we apply the MCMTL algorithm to the new task. The main difference from the symmetric setting is that the class assignments $\{c_m\}$ and weights $\{w_m\}$ of all the previous tasks are kept fixed and are used as a prior over the new task learned by the MCMTL algorithm. As a result, the Gibbs sampling reduces to a one-step sampling process assigning the new task either to one of the existing classes or to a new class. If $\mathcal{M}_{M+1}$ belongs to a new class, then the inference problem becomes $p(w_{M+1}|R_{M+1}, \psi_0)$, which is exactly the same as in STL. On the other hand, if $\mathcal{M}_{M+1}$ belongs to class $c$, the rewards and weights $\{R_m\}, \{w_m\}$ of the tasks in class $c$ are used to compute the posterior hyper-parameters $\psi'_{0c}$ as in Eq. (4), and to solve the inference problem $p(w_{M+1}|R_{M+1}, \psi'_{0c})$.
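The class-assignment step used by the Gibbs sampler of Section 4.2 (and, as a one-step special case, by the asymmetric procedure above) has the structure sketched below. This is an illustration rather than the authors' code: `log_marginal` is a hypothetical callback standing in for the closed-form integrals of Eqs. (3)-(5), returning the log marginal likelihood of (R_m, w_m) under the posterior of G_0 given a set of member tasks, with an empty set corresponding to the prior, i.e., a new class.

```python
import numpy as np

def gibbs_assignment_step(m, c, tasks, weights, alpha, log_marginal, rng):
    """Resample the class label of task m from the distribution in Eq. (3).

    c        : mutable sequence of current class labels (one per task)
    tasks    : list of reward vectors R_m; weights : list of weight vectors w_m
    log_marginal(R_m, w_m, members) : stand-in for the analytic integrals of Eq. (3)
    """
    M = len(tasks)
    R_m, w_m = tasks[m], weights[m]
    classes = sorted(set(c[i] for i in range(M) if i != m))
    log_p, labels = [], []
    for k in classes:
        members = tuple(i for i in range(M) if i != m and c[i] == k)
        # existing class: weight M_{-m,c} / (M - 1 + alpha) times the class-posterior marginal
        log_p.append(np.log(len(members) / (M - 1 + alpha)) + log_marginal(R_m, w_m, members))
        labels.append(k)
    # new class: weight alpha / (M - 1 + alpha) times the prior marginal
    log_p.append(np.log(alpha / (M - 1 + alpha)) + log_marginal(R_m, w_m, ()))
    labels.append(max(c) + 1)
    # normalize in log space and draw the new label
    p = np.exp(np.array(log_p) - np.max(log_p))
    c[m] = labels[rng.choice(len(labels), p=p / p.sum())]
    return c
```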
5. Related Work

In RL, the approach of this paper is mainly related to Wilson et al. (2007). Although both works use a DP-based HBM to model the distribution over the common structure of the tasks, in Wilson et al. (2007) the tasks share structure in their dynamics and reward function, while we consider the case in which the similarity is in the value functions. There are scenarios in which significantly different MDPs and policies lead to very similar value functions. In such scenarios, the method proposed in this paper is still able to leverage the commonality of the value functions, and thus to perform better than single-task learning. Moreover, the setting in Wilson et al. (2007) is incremental, i.e., the tasks are observed as a sequence, and there is no restriction on the number of samples generated by each task. The focus there is not on joint learning with a finite number of samples, but on using the information gained from previous tasks to facilitate learning in a new one. This setting is similar to the asymmetric learning considered in our work.

In supervised learning, our work is related to Yu et al. (2005) and Xue et al. (2007). Yu et al. (2005) present a single-class HBM for learning multiple related functions using GPs. Our single-class model of Section 3 is an adaptation of this work to RL using GPTD, and our multi-class model of Section 4 extends it to the case of an undefined number of classes. In Xue et al. (2007), a DP-based HBM is used to learn the extent of similarity between classification problems. In contrast, the problem considered in our paper is regression, the multi-class model of Section 4 is more complex than the one used in Xue et al. (2007), and the inference algorithms of Section 4 are based on Gibbs sampling, whereas a variational method is used for inference in Xue et al. (2007).

6. Experiments

In this section, we report empirical results obtained by applying the Bayesian multi-task learning (BMTL) algorithms presented in this paper to a regression problem and to a benchmark RL problem, the inverted pendulum. We compare the performance of single-task learning (STL) with single-class multi-task learning (SCMTL), in which all tasks are assumed to belong to the same class, and multi-class multi-task learning (MCMTL), in which tasks belong to a number of classes not known in advance. By STL, we refer to running the EM algorithm of Section 3.2 on each task separately.
The reason for using the regression problem in our experiments is that it allows us to evaluate our BMTL algorithms when the tasks are generated exactly according to the generative models of Sections 3 and 4.

6.1. A Regression Problem

In this problem, tasks are functions in the linear space spanned by the feature map $\phi(x) = (1, x, x^2, x^3, x^4, x^5)$ on the domain $\mathcal{X} = [-1, 1]$. The weights of the tasks are drawn from four different classes, i.e., four 6-dimensional multivariate Gaussian distributions, with the parameters shown in Fig. 3(a). The noise covariance matrix is $S = \mathrm{diag}(\sigma^2)$ for all the algorithms. We evaluate the performance of each BMTL algorithm by computing its relative mean squared error (MSE) improvement over STL: $(MSE_{STL} - MSE_{BMTL})/MSE_{STL}$. The MSEs are computed over N = 1000 test samples. All the reported results are averaged over 200 independent runs.

[Figure 3. (a) Class parameters:
  c1: mu_c = (50, 10, 5, 0, 0, 0),        sigma^2_c = 1.0,  Sigma_c = diag(20.0, 10.0, 5.0, 0.0, 0.0, 0.0)
  c2: mu_c = (0, 0, 0, 0.5, 0.1, 0.05),   sigma^2_c = 3.0,  Sigma_c = diag(0.0, 0.0, 0.0, 0.5, 0.1, 0.05)
  c3: mu_c = (0, -20, 0, -0.5, 0, -0.05), sigma^2_c = 5.0,  Sigma_c = diag(0.0, 5.0, 0.0, 0.5, 0.0, 0.01)
  c4: mu_c = (-50, 0, -5, 0, -0.1, 0),    sigma^2_c = 7.5,  Sigma_c = diag(20.0, 0.0, 5.0, 0.0, 0.1, 0.0)
(b) Minimum and maximum improvement of SCMTL over STL in each class:
  c1: min 37.35% ± 1.2, max 64.78% ± 0.9
  c2: min 29.86% ± 0.9, max 67.64% ± 0.8
  c3: min 66.69% ± 0.8, max 91.81% ± 0.6
  c4: min 56.47% ± 1.2, max 75.78% ± 0.3
(c) Relative MSE improvement of SCMTL over STL when all the tasks are drawn from class c2, for 20 to 100 tasks and 20 to 100 samples per task.]

In the first experiment, we draw all the tasks from class $c_2$. Fig. 3(c) shows the performance of SCMTL for different numbers of tasks ($M$) and samples per task ($N$). SCMTL achieves an improvement over STL that varies from 29.86% ± 0.9 for N = 100 and 20 tasks to 67.64% ± 0.8 for 100 tasks with only 20 samples each. The results indicate that SCMTL successfully takes advantage of the samples coming from all the tasks to build a more accurate prior than the one obtained by considering each task separately, as in STL. However, the advantage of SCMTL over STL declines as $N$ increases: as STL converges, SCMTL cannot make further improvements. We repeated the same experiment for the other classes. The minimum and maximum performance of SCMTL for each class (obtained for N = 100, M = 20 and N = 20, M = 100, respectively) are summarized in Fig. 3(b).

In the second experiment, we draw the tasks randomly from the four classes. We first apply SCMTL to this problem. Fig. 4(a) shows SCMTL's performance. As can be seen, the results are worse than those of the first experiment (Fig. 3(c)), varying from 30.15% ± 4.8 to 54.05% ± 1.2. By clustering all the tasks together, SCMTL takes advantage of all the available samples and thus performs better than STL. However, when the tasks are drawn from significantly different distributions, it learns a very general prior which does not allow a significant improvement over STL. We then apply MCMTL to this problem. MCMTL's performance (Fig. 4(b)) varies from 45.64% ± 5.6 to 77.65% ± 0.8 and is significantly better than SCMTL's (Fig. 4(a)). In order to evaluate how well MCMTL classifies the tasks, we also compare its performance to a version of MCMTL in which each task is assigned to the right class in advance. The difference between the two algorithms is statistically significant only for N = 20 (with a maximum of 5.08% for M = 20), where the noise on the samples makes it more difficult to discriminate between the distributions generating the tasks, and thus to classify them correctly.

Finally, we compare SCMTL and MCMTL in the asymmetric setting. At the end of each run, we draw 100 new test tasks at random from the same four classes used to generate the training tasks, and run the asymmetric algorithm described in Section 4.3 on each of the test tasks separately. Fig. 4(c) shows the performance of SCMTL and MCMTL for different numbers of training tasks and N fixed to 20. The results indicate that MCMTL performs increasingly better than SCMTL as the number of training tasks grows.

[Figure 4. Results for the case in which the tasks are drawn randomly from the four classes: (a) relative MSE improvement of SCMTL over STL, (b) relative MSE improvement of MCMTL over STL, (c) asymmetric performance (relative MSE improvement on the test tasks) of MCMTL and SCMTL for N = 20, as a function of the number of training tasks.]

6.2. Inverted Pendulum

The experiments of Section 6.1 indicate that when the tasks are generated exactly according to the generative models of Sections 3 and 4, the BMTL methods can significantly improve the performance of a regression problem w.r.t. STL. As discussed in Section 2, the policy evaluation step of policy iteration can be cast as a regression problem, so a similar improvement can be expected. In this section, we compare our BMTL algorithms with STL in the problem of learning a control policy for balancing an inverted pendulum. The dynamics, reward function, and basis functions are the same as in Lagoudakis & Parr (2003). Each task is generated by drawing the parameters of the dynamics (pole mass, pole length, cart mass, and noise on the actions) from Gaussian distributions whose means and variances are summarized in Fig. 5(a). The distribution over the two classes is uniform. It is worth noting that, unlike in the regression experiments, here there is no guarantee that the weights of the value functions follow the generative models assumed by the BMTL methods. We use policy iteration with 10 iterations and the noise correlation matrix $P^{-1} = \Phi_m^\top(\Phi_m\Phi_m^\top)^{-1}\Phi_m$ for all the algorithms (see Lazaric & Ghavamzadeh 2010 for details). In STL, each policy evaluation step is solved by running the EM algorithm of Section 3.2 on each task separately, whereas in BMTL it is solved by running SCMTL or MCMTL over all the tasks. All the results are averaged over 150 independent runs.

Fig. 5(b) shows the performance of the policies learned by STL, SCMTL, and MCMTL for M = 10 tasks and different numbers of samples per task (up to 500). Note that STL converges at about 1200 samples per task with an average performance of 2473 ± 61.9 balanced steps. As can be seen, both BMTL methods outperform STL, and MCMTL achieves better performance than SCMTL as the number of samples increases. Since SCMTL forces all the tasks to belong to a common distribution, it learns a very general prior, and thus cannot approximate the value functions as accurately as MCMTL, which is able to correctly discriminate between classes $c_1$ and $c_2$. In order to show how the performance changes with the number of tasks, we compute the area ratio (Taylor et al., 2007) on the first 500 samples as $r_{BMTL} = (A_{BMTL} - A_{STL})/A_{STL}$, where $A_{STL}$ ($A_{BMTL}$) is the area under the learning curve of STL (BMTL) from 100 to 500 samples. Fig. 5(c) shows that MCMTL has a significantly better area ratio than SCMTL for all values of $M$ except very small ones.
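For reference, the area-ratio metric reported in Fig. 5(c) can be computed as in the short sketch below; the learning curves in the example are hypothetical numbers, used only to illustrate the computation.

```python
import numpy as np

def area_under_curve(x, y):
    """Trapezoidal area under a learning curve y(x)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

def area_ratio(samples, steps_stl, steps_bmtl):
    """Area ratio of Taylor et al. (2007): (A_BMTL - A_STL) / A_STL over the sample range."""
    a_stl = area_under_curve(samples, steps_stl)
    a_bmtl = area_under_curve(samples, steps_bmtl)
    return (a_bmtl - a_stl) / a_stl

# Hypothetical curves of balanced steps vs. samples per task, from 100 to 500 samples
samples = [100, 200, 300, 400, 500]
print(area_ratio(samples, steps_stl=[200, 500, 800, 1000, 1200],
                 steps_bmtl=[400, 900, 1400, 1800, 2100]))
```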
[Figure 5. Results for the inverted pendulum problem: (a) distributions (mean, variance) of the parameters of the dynamics, c1: pole mass 1.0, 0.2; pole length 0.5, 0.0; cart mass 6.0, 0.5; noise 12, 0.1, and c2: pole mass 3.0, 0.0; pole length 2.0, 0.2; cart mass 10.0, 0.5; noise 12, 0.1; (b) performance of STL, SCMTL, and MCMTL in terms of the number of balanced steps (M = 10, 100 to 500 samples per task); (c) performance of SCMTL and MCMTL in terms of the area ratio on the first 500 samples, for 4 to 20 tasks.]

7. Conclusions

We presented hierarchical Bayesian models (HBMs) and inference algorithms for multi-task reinforcement learning (RL) in which the tasks share structure in their value functions. To the best of our knowledge, this is the first work that models value function similarity using HBMs. In particular, we considered two cases: one in which all the value functions belong to the same class, and one in which they belong to an undefined number of classes. In these cases, we modelled the distribution over the value functions using a parametric HBM and a Dirichlet process (DP) based non-parametric HBM, respectively. For each case, we derived inference algorithms for learning the value functions jointly and for transferring the knowledge acquired in the joint learning to improve the performance of learning the value function of a new task. We first applied our proposed Bayesian multi-task learning (BMTL) algorithms to a regression problem in which the tasks are drawn from the generative models used by the BMTL methods. The results indicate that the BMTL algorithms achieve significant improvement over single-task learning (STL) in both the symmetric and the asymmetric setting. We then applied our BMTL algorithms to a benchmark RL problem, the inverted pendulum. Although the tasks are no longer generated according to the models used by the BMTL algorithms, they still outperform STL.

Acknowledgments

Experiments presented in this paper were carried out using the Grid'5000 experimental testbed (https://www.grid5000.fr). This work was supported by the French National Research Agency (ANR) (project EXPLO-RA, no. ANR-08-COSI-004).
References

Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149-198, 2000.

Bonilla, E., Chai, K., and Williams, C. Multi-task Gaussian process prediction. In Proceedings of NIPS 20, pp. 153-160, 2008.

Caruana, R. Multitask learning. Machine Learning, 28(1):41-75, 1997.

Engel, Y., Mannor, S., and Meir, R. Reinforcement learning with Gaussian processes. In Proceedings of ICML 22, pp. 201-208, 2005.

Ishwaran, H. and James, L. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96:161-173, 2001.

Lagoudakis, M. and Parr, R. Least-squares policy iteration. JMLR, 4:1107-1149, 2003.

Lazaric, A. and Ghavamzadeh, M. Bayesian multi-task reinforcement learning. Technical Report inria-00475214, INRIA, 2010.

Lazaric, A., Restelli, M., and Bonarini, A. Transfer of samples in batch reinforcement learning. In Proceedings of ICML 25, pp. 544-551, 2008.

Mehta, N., Natarajan, S., Tadepalli, P., and Fern, A. Transfer in variable-reward hierarchical reinforcement learning. Machine Learning, 73(3):289-312, 2008.

Neal, R. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249-265, 2000.

Sutton, R. and Barto, A. An Introduction to Reinforcement Learning. MIT Press, 1998.

Taylor, M., Stone, P., and Liu, Y. Transfer learning via inter-task mappings for temporal difference learning. JMLR, 8:2125-2167, 2007.

Wilson, A., Fern, A., Ray, S., and Tadepalli, P. Multi-task reinforcement learning: A hierarchical Bayesian approach. In Proceedings of ICML 24, pp. 1015-1022, 2007.

Xue, Y., Liao, X., Carin, L., and Krishnapuram, B. Multi-task learning for classification with Dirichlet process priors. JMLR, 8:35-63, 2007.

Yu, K., Tresp, V., and Schwaighofer, A. Learning Gaussian processes from multiple tasks. In Proceedings of ICML 22, pp. 1012-1019, 2005.