Efficient Reinforcement Learning with Multiple Reward Functions for Randomized Controlled Trial Analysis

Daniel J. Lizotte (danjl@umich.edu), Susan A. Murphy (samurphy@umich.edu) - University of Michigan, Ann Arbor, MI 48109 USA
Michael Bowling (bowling@cs.ualberta.ca) - University of Alberta, Edmonton, AB T6G 2R3 Canada

Abstract

We introduce new, efficient algorithms for value iteration with multiple reward functions and continuous state. We also give an algorithm for finding the set of all non-dominated actions in the continuous state setting. This novel extension is appropriate for environments with continuous or finely discretized states where generalization is required, as is the case for data analysis of randomized controlled trials.

1. Introduction

We begin with a motivating example. Reinforcement learning methods (Pineau et al., 2007) have been used in evidence-based medicine to analyze multi-stage randomized controlled trials (Murphy et al., 2007; Zhao et al., 2009). These trials are designed to investigate the relative effectiveness of different sequences of treatments with respect to patient outcomes. A patient's progression through the trial is divided into stages, each of which consists of random assignment to a treatment followed by monitoring of the patient's condition. The patient observations collected during each stage are very rich and commonly include several continuous variables related to symptoms, side-effects, and treatment adherence, for example.
To analyze these data using reinforcement learning methods, we consider each treatment as an action, and use the patient observations to define the resulting state and reward. Thus for the ith patient we obtain a trajectory s^i_1, a^i_1, r^i_1, s^i_2, a^i_2, r^i_2, ..., s^i_T, a^i_T, r^i_T. These redefined data are treated as sample trajectories from the uniform random policy. We then apply batch off-policy reinforcement learning methods to learn an optimal policy that can be used to select treatments for future patients.

One difficulty with using trial data to formulate a reinforcement learning problem is that there is no obviously correct reward function. There are many possible reward functions one could define, since each patient record includes several different measurements of that patient's overall well-being. For example, data typically include a measure of the severity of the symptoms the patient is experiencing, as well as a measure of the severity of the side-effects caused by the current treatment. These different possible reward functions are typically at odds with one another to some degree, and will therefore induce different optimal policies. For example, a policy that minimizes the expected symptom measurement will tend to choose more aggressive drugs that are very effective but have a more severe side-effect profile. On the other hand, a policy that minimizes the expected side-effect measurement will choose drugs that are less effective but have a milder side-effect profile.

In clinical practice, patients, doctors, and families decide on a treatment based on preferences that are not known to us at the time of data analysis. Continuing our example, these preferences may lean more toward symptom reduction or side-effect reduction, depending on the situation. However, treatment decisions are not usually influenced exclusively by one consideration or the other. We are interested in efficient algorithms that can compute the optimal policy for a range of tradeoffs between reward functions, so that we can investigate how our tradeoff, and therefore our choice of reward function, influences the optimal policy.

Previous work on multiple-reward problems (Barrett & Narayanan, 2008) considered a small number of discrete states in a framework where the model is known. This setting does not match our application, since we have real-valued patient observations and a finite data set. In this work, we introduce a new algorithm that is asymptotically more time- and space-efficient than the Barrett & Narayanan approach, and we describe how it can be directly applied to batch data. We then show how our algorithm can be extended to the continuous-state setting using linear function approximation, a case where the previous approach is not applicable, and we give an algorithm for finding the set of all non-dominated actions in the continuous-state setting. Throughout this paper we focus on the case of two reward functions, parameterizing the tradeoff with a scalar α ∈ [0, 1]. While this is sufficient to model our symptoms-versus-side-effects example, we conclude by discussing how our approach can be generalized to more than two reward functions.

2. Background

We consider sets of MDPs that all have the same state space, action space, and state transition function, but whose expected reward functions are parameterized by a scalar α ∈ [0, 1]. The parameter α determines the expected reward function r as follows:

    r(s, a, α) = (1 - α) · r^(0)(s, a) + α · r^(1)(s, a).    (1)

Each fixed α identifies a single MDP. This parameterized expected reward function induces a corresponding parameterized optimal¹ state-action value function in the usual way via the Bellman equation:

    Q(s, a, α) = r(s, a, α) + E_{s'|s,a}[ max_{a'} Q(s', a', α) ].    (2)

We will also refer to the optimal state value function V(s, α) = max_a Q(s, a, α). For each patient, we take an action at each timepoint t = 1, 2, ..., T, after which they are no longer under our care.

¹ Throughout this work, all Q- and V-functions are either optimal or estimates of optimal. We omit the usual * superscript throughout, and mark estimates with a hat.
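To make the parameterization concrete, the following minimal Python sketch applies (1) and one step of the Bellman operator (2) for a fixed α on a made-up two-state, two-action MDP. All numbers are hypothetical and the sketch is illustrative only; the finite-horizon treatment used in the rest of the paper is introduced below.

    import numpy as np

    # Hypothetical two-state, two-action MDP. r0/r1 are the two basic rewards of Eq. (1),
    # P[s, a, s'] are transition probabilities. All values are made up for illustration.
    r0 = np.array([[1.0, 0.2], [0.5, 0.0]])          # e.g. a symptom-based reward
    r1 = np.array([[0.0, 0.8], [0.3, 1.0]])          # e.g. a side-effect-based reward
    P = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.2, 0.8], [0.5, 0.5]]])

    def reward(alpha):
        """Expected reward of Equation (1) for tradeoff alpha."""
        return (1 - alpha) * r0 + alpha * r1

    def backup(q_next, alpha):
        """One application of the Bellman operator of Equation (2) for a fixed alpha:
        Q(s, a) = r(s, a, alpha) + E_{s'|s,a}[ max_a' Q_next(s', a') ]."""
        v_next = q_next.max(axis=1)                  # V(s') = max_a' Q_next(s', a')
        return reward(alpha) + P @ v_next

    q = backup(np.zeros((2, 2)), alpha=0.3)          # one backup starting from Q = 0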
Because we are in this finite-horizon setting, we consider a separate r_t, Q_t, and V_t function for each timepoint (Bertsekas & Tsitsiklis, 1996), and we have Q_T(s, a, α) = r_T(s, a, α). In this framework, "value iteration" algorithms proceed by first determining the value function for time T, and then receding to the beginning of time using the recurrence

    Q_t(s, a, α) = r_t(s, a, α) + E_{s'|s,a}[ V_{t+1}(s', α) ].

3. Value Functions for All Tradeoffs: Discrete State Space

In the discrete state-space setting, the optimal state-action value function V_t(s, α) is piecewise linear in the tradeoff parameter α. We use a piecewise linear spline to exactly represent V_t(s, α) for each state and timepoint, which allows us to exactly compute value backups for all α more efficiently than the point-based representations of Barrett and Narayanan (2008). Our representation also allows identification of the set of dominated actions, i.e. the actions that are not optimal for any (s, α) pair. Value backups require two operations: maximization over actions, and expectation over future states. Although we are interested in settings with continuous states and linear value function approximators, the tools and concepts we need to achieve this are applicable and more easily presented in the discrete state setting. We begin by considering discrete states, and generalize our approach in Section 4.

3.1. Maximization

First, we describe how to take a function Q_T(s, a, ·) and produce an explicit spline-representation of V_T(s, ·) by maximizing over a. In Section 3.3, we show how this can be accomplished at earlier timepoints t < T using a divide-and-conquer approach.

[Figure 1. Computing V from Q for all α by convex hull. Panel (a), "Q-function: Line Representation", plots (1 - α)R0 + αR1 against α for each action; panel (b), "Q-function: Point Representation", plots the corresponding points (R0, R1) = (0.8, 0.2), (0.5, 0.6), (0.3, 0.4), (0.2, 0.7).]

The Q-function Q_T(s, a, α) for the last timepoint is equal to the terminal expected reward function r_T, which is linear in α for each action as defined in (1). To represent Q_T(s, a, α), we maintain a list of linear functions, one for each action. Figure 1(a) shows an example Q-function for a fixed state at time T. There are four actions, three of which are optimal for some α and one of which is not optimal for any α. Each of these linear functions can be represented by a list of tradeoffs (i.e. α ∈ [0, 1]) together with a list of their corresponding values (i.e. [r(s, a, 0), r(s, a, 1)]) at the tradeoff points. Each can also be represented by a point (r(s, a, 0), r(s, a, 1)) in the plane, as shown in Figure 1(b). These two equivalent representations offer an important conceptual and computational insight that is well-established in the multi-criterion optimization literature (Ehrgott, 2005): the actions that are optimal for some α ∈ [0, 1] are exactly those actions whose line-representations lie on the upper convex envelope of the Q-functions, and equivalently, whose point-representations lie on the upper-right convex hull of the set of Q-pairs. In general, we can recover the actions that are optimal on [α1, α2] by finding the upper-right convex hull of the points {(r(s, a_i, α1), r(s, a_i, α2)) : i ∈ {1, ..., |A|}}. This equivalence is important because the time complexity of the convex hull operation on n points in two dimensions is O(n log n), as fast as sorting. We make use of this equivalence to construct our spline-representation of V_T(s, ·).

Commonly-used convex hull routines produce output that is ordered, so it is easy to recover the list of actions that are optimal for some α, along with the values of α where the optimal action changes. These values are the "knots" in our spline-representation. We denote the list of knots of a piecewise linear function f(·) by κ(f(·)). The output of a convex hull algorithm is an ordered list of points of the form (r(s, a, 0), r(s, a, 1)), which in this case is [(0.8, 0.2), (0.5, 0.6), (0.2, 0.7)]. We know from the order of this list that the second knot in V_T(s, ·) (after α = 0) occurs where the lines represented by (0.8, 0.2) and (0.5, 0.6) intersect. Thus we can compute that the line represented by (0.8, 0.2) is maximal from α = 0 to α = 0.428.., where it intersects the line represented by (0.5, 0.6). The intersection point α× of two lines on the interval [α1, α2] with end-point function values [y1, y2] and [z1, z2] is

    α× = α1 + (α2 - α1) · (y1 - z1) / (y1 - y2 + z2 - z1).
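This construction can be sketched directly in code. The following illustrative Python function (not the implementation used in our experiments, which were run in Matlab) sweeps across α starting from the line that is maximal at the left end-point and repeatedly applies the intersection formula above to find the next knot. On the Figure 1 example it recovers the knots 0, 0.428.., 0.75, 1.

    def upper_envelope(lines, lo=0.0, hi=1.0, tol=1e-12):
        """Upper envelope over [lo, hi] of lines given by their end-point values (y_lo, y_hi).
        Returns (knots, values, actions): the knot-list, the envelope value at each knot,
        and the index of the line that is maximal on each interval between knots."""
        def val(i, a):
            y0, y1 = lines[i]
            w = (a - lo) / (hi - lo)
            return (1 - w) * y0 + w * y1

        alpha = lo
        cur = max(range(len(lines)), key=lambda i: (lines[i][0], lines[i][1]))  # maximal at lo
        knots, values, actions = [lo], [val(cur, lo)], [cur]
        while True:
            nxt, nxt_alpha = None, hi
            y0, y1 = lines[cur]
            for j, (z0, z1) in enumerate(lines):
                if j == cur or z1 <= y1:          # j cannot overtake cur to the right
                    continue
                denom = (y0 - y1) + (z1 - z0)
                if denom <= tol:
                    continue
                cross = lo + (hi - lo) * (y0 - z0) / denom   # intersection formula above
                if alpha + tol < cross < nxt_alpha - tol:
                    nxt, nxt_alpha = j, cross
            if nxt is None:
                break
            alpha, cur = nxt_alpha, nxt
            knots.append(alpha); values.append(val(cur, alpha)); actions.append(cur)
        knots.append(hi); values.append(val(cur, hi))
        return knots, values, actions

    # The four actions of Figure 1, as (r(s,a,0), r(s,a,1)) pairs.
    print(upper_envelope([(0.8, 0.2), (0.3, 0.4), (0.5, 0.6), (0.2, 0.7)]))
    # knots approximately [0, 0.4286, 0.75, 1]; the dominated line (0.3, 0.4) never appears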
After finding the knots, we represent the piecewise linear value function in Figure 1(a) by the knot-list κ(V_T(s, ·)) = [0.00, 0.428.., 0.75, 1.00] and value-list [0.80, 0.54.., 0.58.., 0.70], rather than by the list of points. This allows us to more efficiently evaluate V(s, α) = max_a Q(s, a, α): we use binary search to find the largest knot less than α, which tells us which line is maximal at this α. We then evaluate only the maximal line at α, so that evaluating V(s, α) takes O(log |κ(V(s, ·))|) time, i.e. the cost of the search, rather than the O(|κ(V(s, ·))|) time it would take to maximize over all lines at α.

3.2. Expectation

We now demonstrate how to efficiently compute a spline-representation of Q_{T-1}(s, a, α) = r_{T-1}(s, a, α) + E_{s'|s,a}[V_T(s', α)] using the spline-representation of V_T. To do so, we must evaluate expectations of V_T over sets of possible future states.

[Figure 2. Computing expectations using unions of spline-representations. Panels (a) and (b), "V-function: Max-Of-Lines Representation", show value functions in terms of α at two different states; panel (c), "Mean V-function: Max-Of-Lines Representation", shows the expected value function if the two states are each reached with probability 0.5.]

Consider a two-state example where we take the expectation E_{s'|s,a}[V_T(s', α)]. Suppose there are two reachable states s1 and s2, and that the probability of arriving in state s_i is given by p_i. Since each V_T(s_i, ·) is linear over the intervals between its own knots, these two functions are simultaneously linear over the intervals between consecutive elements of κ(V_T(s1, ·)) ∪ κ(V_T(s2, ·)), and their average is linear over the same intervals. Therefore this expectation is itself a piecewise linear function of α with knot-list κ(V_T(s1, ·)) ∪ κ(V_T(s2, ·)), and we can compute its spline-representation by constructing the value-list for E_{s'|s,a}[V_T(s', α)]. The new value-list is

    [ Σ_i p_i V_T(s_i, α) : α ∈ κ(V_T(s1, ·)) ∪ κ(V_T(s2, ·)) ].    (3)

Let d_i = |κ(V_T(s_i, ·))|. This construction uses O(d1 + d2) space and requires O(d1 + d2) evaluations of V_T.

We contrast this with the approach of Barrett and Narayanan (2008). The expectation can also be computed using the point-based representation in Figure 1(b): let Δ_i be the set of points in the point-based representation of V(s_i, ·). One can compute an expected value function by constructing the set of points

    { (p1 a1, p1 b1) + (p2 a2, p2 b2) : (a1, b1) ∈ Δ1, (a2, b2) ∈ Δ2 }    (4)

and then taking the upper-right portion of the convex hull of this set. Recovering the knot-list and value-list will result in exactly the same function as the method described above. Barrett and Narayanan (2008) advocate this procedure and prove its correctness; however, they note that the set given in (4) has |Δ1||Δ2| points that must be constructed and fed into the convex hull algorithm. Since d_i = |Δ_i| + 1, computing the expectation in this way takes O(d1 d2) space and O(d1 d2 log d1 d2) time, which is much less efficient than our O(d1 + d2) spline-representation based approach.
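As an illustrative sketch of this construction (the helper names are hypothetical, not part of the method description), each value function is stored as a (knot-list, value-list) pair, evaluation uses the binary search of Section 3.1, and the expectation is formed on the union of the knot-lists as in (3).

    import bisect

    def pl_eval(knots, values, alpha):
        """Evaluate a piecewise linear function stored as (knot-list, value-list) at alpha,
        locating the active interval by binary search as in Section 3.1."""
        j = min(max(bisect.bisect_right(knots, alpha) - 1, 0), len(knots) - 2)
        w = (alpha - knots[j]) / (knots[j + 1] - knots[j])
        return (1 - w) * values[j] + w * values[j + 1]

    def expected_spline(splines, probs):
        """Spline representation of sum_i probs[i] * V(s_i, .); each element of `splines`
        is a (knot-list, value-list) pair. The knot-list of the result is the union of the
        input knot-lists and the value-list is given by Equation (3)."""
        union = sorted(set(k for knots, _ in splines for k in knots))
        values = [sum(p * pl_eval(kn, va, a) for (kn, va), p in zip(splines, probs))
                  for a in union]
        return union, values

    # Two hypothetical successor-state value functions, each reached with probability 0.5.
    v1 = ([0.0, 0.5, 1.0], [0.75, 0.45, 0.7])
    v2 = ([0.0, 0.25, 1.0], [0.6, 0.55, 0.35])
    print(expected_spline([v1, v2], [0.5, 0.5]))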
3.3. Value Backups for t < T - 1

The maximization procedure in Section 3.1 relies on the linearity of Q_T(s, a, ·). However, for t < T, Q_t(s, a, ·) is non-linear in general. We now show how to compute V_t and Q_t from V_{t+1} by decomposing Q_t(s, a, ·) into linear pieces and applying the expectation and maximization operations to each piece. Recall

    Q_t(s, a, α) = r_t(s, a, α) + E_{s'|s,a}[ V_{t+1}(s', α) ].    (5)

Algorithm 1: Value Backup, Finite State Space
    /* A ∪← B means A ← A ∪ B */
    ∀(s, α): V_{T+1}(s, α) ← 0.   ∀s: κ(V_{T+1}(s, ·)) ← {0, 1}.
    for t = T downto 1 do
      for all s ∈ S do
        for all a ∈ A do
          κ(Q_t(s, a, ·)) ← {}
          for all s' ∈ S do
            κ(Q_t(s, a, ·)) ∪← κ(V_{t+1}(s', ·))
          end for
          for all α ∈ κ(Q_t(s, a, ·)) do
            Q_t(s, a, α) ← r(s, a, α) + Σ_{s'} P(s'|s, a) · V_{t+1}(s', α)
          end for
        end for
        Compute κ(V_t(s, ·)) by applying Algorithm 2 to Q_t(s, a, ·), a ∈ A
      end for
    end for

Algorithm 2: Max of Convex Piecewise Linear Functions
    /* A ∪← B means A ← A ∪ B */
    input: convex piecewise linear functions f_i(·), i = 1..k, defined on [0, 1]
    κ_all = ∪_{i=1}^k κ(f_i(·))   (sorted)
    κ_out = κ_all
    for i = 2 to |κ_all| do
      if argmax_j f_j(κ_all[i-1]) ≠ argmax_j f_j(κ_all[i]) then
        κ_out ∪← knots of max_j f_j(·) on (κ_all[i-1], κ_all[i]), found by the convex hull method of Section 3.1
      end if
    end for

For a fixed state s and action a at time T, we have shown by construction that E_{s'|s,a}[V_T(s', ·)] is convex and piecewise linear when we have discrete states. By definition, r_t(s, a, α) is linear in α for all s, a, t. Thus their positive-weighted sum, i.e. Q_{T-1}, is convex and piecewise linear in α. It follows by induction that Q_t(s, a, α) is convex piecewise linear in α for all t ∈ {1, ..., T}.

To compute Q_t from V_{t+1}, we first identify the knots in E_{s'|s,a}[V_{t+1}(s', α)] and store them; this is done in the same way as for t + 1 = T. We then compute the sum of the expected future value plus the immediate expected reward evaluated at each of the stored knots. (The immediate expected reward is linear in α and so does not contribute additional knots.) To compute V_t(s, ·) = max_j Q_t(s, a_j, ·), we take the maximum over actions of these piecewise linear Q-functions using Algorithm 2. First, we decompose the problem of finding max_j Q_t(s, a_j, ·) for α ∈ [0, 1] into sub-problems of finding max_j Q_t(s, a_j, ·) over intervals of α where we know the Q_t(s, a_j, ·) are simultaneously linear. The ends of these intervals are given by ∪_j κ(Q_t(s, a_j, ·)). Next, we apply the convex hull algorithm to each of these intervals to recover any additional knots in max_j Q_t(s, a_j, ·). The full backup procedure is described in Algorithm 1.

In the case where the state transition model of the MDP is not known, we can estimate P(s'|s, a) from data and the algorithm remains unchanged. Also, in practice, we can avoid running the convex hull algorithm over every interval by checking each interval's end points: if for some j we find that Q_t(s, a_j, ·) is maximal at both ends of an interval over which the Q_t(s, a_j, ·) are all linear, then max_j Q_t(s, a_j, ·) has no knots inside the interval.
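The interval-by-interval strategy of Algorithm 2 can be sketched as follows, reusing the pl_eval and upper_envelope helpers from the earlier illustrative sketches (this is not the experimental implementation). On each interval between consecutive knots of the union, every input is linear, so the line-envelope routine applies; the end-point check described above could be added to skip intervals where a single function is maximal at both ends.

    def max_of_piecewise_linear(fns):
        """Pointwise maximum of piecewise linear functions, each a (knot-list, value-list) pair.
        Assumes the pl_eval and upper_envelope helpers sketched earlier in this section."""
        union = sorted(set(k for knots, _ in fns for k in knots))
        out_knots, out_values = [], []
        for lo, hi in zip(union[:-1], union[1:]):
            # restrict every function to [lo, hi], where each one is a line
            segments = [(pl_eval(kn, va, lo), pl_eval(kn, va, hi)) for kn, va in fns]
            seg_knots, seg_values, _ = upper_envelope(segments, lo=lo, hi=hi)
            out_knots.extend(seg_knots[:-1])     # the right end-point re-appears in the next interval
            out_values.extend(seg_values[:-1])
        out_knots.append(union[-1])
        out_values.append(max(pl_eval(kn, va, union[-1]) for kn, va in fns))
        return out_knots, out_values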
3.4. Complexity of Q_t(s, a, ·) and V_t(s, ·)

Suppose there are |S| states and |A| actions. For any fixed s and a, the final Q-function Q_T(s, a, ·) has 2 knots, α = 0 and α = 1. Applying Algorithm 2 to these functions generates at most |A| - 1 new internal knots, and therefore V_T(s, ·) has at most (|A| - 1) + 2 knots. To compute Q_{T-1}(s, a, ·), we take the expectation of V_T(s', ·) over future states. Since V_T(s', ·) might have different internal knots for every s', Q_{T-1}(s, a, ·) may have as many as |S|(|A| - 1) + 2 knots. However, for a fixed s, these knots will be the same for all a. Thus, computing V_{T-1}(s, ·) using Algorithm 2 adds at most |A| - 1 new knots between each pair of existing knots, for a total of (|A| - 1)(|S|(|A| - 1) + 1) + 2 knots. In general, Q_t(s, a, ·) may have O(|S|^(T-t) |A|^(T-t)) knots, and V_t(s, ·) may have O(|S|^(T-t) |A|^((T-t)+1)) knots. To compute the expectation E_{s'|s,a}[V_{t+1}(s', α)] at time t, our approach requires O(|S|^(T-t) |A|^((T-t)+1)) time for each state, for a total of O(|S|^((T-t)+1) |A|^((T-t)+1)) time. In contrast, the Barrett & Narayanan approach requires O(|S|^(2(T-t)+1) |A|^(2(T-t)+1) log(|S|^(2(T-t)+1) |A|^(2(T-t)+1))) time for each of log2 |S| pairs of piecewise linear functions.

3.5. Identifying Non-Dominated Actions

We have shown how to exactly represent the V and Q functions for all s, a, and α at all time points when states are discrete, and we have shown how to identify all actions that are optimal for some α at a particular s. This representation allows us to identify which actions are non-dominated for any α by enumerating s.

4. Value Functions for All Tradeoffs: Linear Function Approximation

When analyzing multistage clinical trial data, assuming a small or finite state space is unreasonable. For example, patient observations will be many-valued if they come from a survey instrument that measures symptom severity, or real-valued if they come from a lab test that measures an enzyme level. Furthermore, we typically have a limited number of trajectories, and we therefore need to generalize across states in order to estimate state-action values without overfitting. Here, we demonstrate how our previously developed algorithms for value backups over all tradeoffs can be extended to the case where we have a continuous state variable and a value function estimated using linear value function approximation. We also show how to efficiently identify all non-dominated actions in this setting. Again, we begin by considering the value at time T, which has the simplest form, and later describe how to compute value functions at earlier timepoints.

Suppose that r_T(s, a, 0) and r_T(s, a, 1) are each linear functions of s. At time T, from (1), we have

    Q_T(s, a, α) = (1 - α) · (θ^aT_00 + θ^aT_01 s) + α · (θ^aT_10 + θ^aT_11 s)    (6)

where the coefficients θ^aT_00, θ^aT_01, θ^aT_10, θ^aT_11 define the state-action value function at time T. For each action, Q_T(·, a, ·) is bilinear in s and α, and V_T(·, ·) = max_a Q_T(·, a, ·) is piecewise bilinear in s and α. Thus we need to solve for the regression coefficients only at α = 0 and α = 1, after which we can produce Q̂_T(s, a, α) for any α by combining these coefficients. Therefore, for t = T, it is straightforward to exactly solve for the Q̂_T we would estimate for every state s and every tradeoff α simultaneously.

Recall we have a set of N trajectories of the form s^i_1, a^i_1, r^(0)i_1, r^(1)i_1, s^i_2, a^i_2, r^(0)i_2, r^(1)i_2, ..., s^i_T, a^i_T, r^(0)i_T, r^(1)i_T. To estimate Q_T(s, a, 0) and Q_T(s, a, 1) using ordinary least-squares regression, we find the N^a_T trajectories in our dataset where a_T = a, and we construct a design matrix and target vectors

    X^a_T = [ 1  s^1_T ; 1  s^2_T ; ... ; 1  s^(N^a_T)_T ],    r^(a,(j))_T = [ r^(j)1_T ; r^(j)2_T ; ... ; r^(j)N^a_T_T ]    (7)

(rows indexed by those trajectories) for j = 0 and j = 1. We then estimate parameters

    [θ̂^aT_00, θ̂^aT_01]ᵀ = ((X^a_T)ᵀ X^a_T)^(-1) (X^a_T)ᵀ r^(a,(0))_T    (8)
    [θ̂^aT_10, θ̂^aT_11]ᵀ = ((X^a_T)ᵀ X^a_T)^(-1) (X^a_T)ᵀ r^(a,(1))_T.    (9)

These estimated parameters are then substituted into definition (6), giving Q̂_T(s, a, 0) and Q̂_T(s, a, 1). To construct any other estimate Q̂_T(s, a, α), we could construct a scalar reward using r^(0)_T, r^(1)_T, and α, and solve

    [θ̂^a_0, θ̂^a_1]ᵀ = ((X^a_T)ᵀ X^a_T)^(-1) (X^a_T)ᵀ ((1 - α) r^(a,(0))_T + α r^(a,(1))_T)    (10)
                   = (1 - α) [θ̂^aT_00, θ̂^aT_01]ᵀ + α [θ̂^aT_10, θ̂^aT_11]ᵀ.    (11)
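The time-T computation in (7)-(11) amounts to two ordinary least-squares fits per action whose coefficient vectors are then mixed linearly in α. A minimal Python sketch (illustrative only; numpy's least-squares routine stands in for the normal equations):

    import numpy as np

    def fit_terminal_q(states, actions, r0, r1, a):
        """OLS estimates of Q_T(s, a, 0) and Q_T(s, a, 1) for one action, as in (7)-(9).
        Returns (theta0, theta1), each of the form [intercept, slope]."""
        idx = np.asarray(actions) == a
        X = np.column_stack([np.ones(idx.sum()), np.asarray(states, float)[idx]])   # Eq. (7)
        targets = np.column_stack([np.asarray(r0, float)[idx], np.asarray(r1, float)[idx]])
        theta = np.linalg.lstsq(X, targets, rcond=None)[0]
        return theta[:, 0], theta[:, 1]

    def q_hat_terminal(theta0, theta1, s, alpha):
        """Q_hat_T(s, a, alpha) via Equation (11): mix the two coefficient vectors,
        then evaluate the linear model of Equation (6)."""
        t = (1 - alpha) * theta0 + alpha * theta1
        return t[0] + t[1] * s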
4.1. Maximization

For any fixed value of s and any action a, Q̂_T(s, a, α) is a linear function of α, and we can use the convex hull method to identify the actions that maximize value, as well as to recover the knots in the piecewise linear V̂_T(s, ·).

[Figure 3. Diagram of the regions in (s, α) space where different actions are optimal at time T. In this example, the state space is {s ∈ [-6, 6]}.]

Figure 3 is an illustration of the pieces of a hypothetical V̂_T that is a maximization over 10 actions. Each number in the figure marks the region where that action is optimal at time T. For example, a vertical slice at s = -4 of the value function has three linear pieces where actions 10, 1, and 7 are optimal. Whereas in the discrete case we represent V_T(s, ·) separately for each nominal state s in the MDP, in the continuous case we represent V̂_T(s, ·) for each observed state s^1_T, ..., s^N_T, i.e. we represent a one-dimensional vertical slice of the value function for each of the s^i_T. For i ∈ 1, ..., N, we apply Algorithm 2 to construct a spline-representation for V̂_T(s^i_T, ·).

4.2. Regression of Value on s

At stage T - 1, the parameters of our estimate Q̂_{T-1}(s, a, α) are formed as follows:

    [θ̂^a(T-1)_0, θ̂^a(T-1)_1]ᵀ = ((X^a_{T-1})ᵀ X^a_{T-1})^(-1) (X^a_{T-1})ᵀ y^(a,(α))_{T-1},    (12)

where

    y^(a,(α))_t = ((1 - α) r^(a,(0))_t + α r^(a,(1))_t) + v̂^a_{t+1}(α)    (13)

and

    v̂^a_t(α) = [ V̂_t(s^1_t, α), V̂_t(s^2_t, α), ..., V̂_t(s^(N^a_{t-1})_t, α) ]ᵀ    (14)

for t ∈ {1, ..., T}. Here the s^i_T are taken from trajectories that match a^i_{T-1} = a. The components of the vector y^(a,(α))_{T-1} are not linear in α, so for t < T, solving the regression only for α = 0 and α = 1 does not completely determine Q̂_t(s, a, ·). However, the components of y^(a,(α))_{T-1} are each piecewise linear in α. We determine the intervals over which the components are simultaneously linear and then explicitly represent the state-value function at the endpoints of these intervals. This procedure is analogous to that of Section 3.2, but in the continuous case, expectation is replaced by regression. The output is a list of knots together with a list of parameter pairs (θ̂^a(T-1)_0, θ̂^a(T-1)_1), each given by ((X^a_{T-1})ᵀ X^a_{T-1})^(-1) (X^a_{T-1})ᵀ y^(a,(α))_{T-1}.
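A sketch of this step (illustrative, and reusing the hypothetical pl_eval helper from the Section 3 sketches): take the union of the knot-lists of the V̂_T(s^i_T, ·) splines, and at each knot regress the targets (13) on the stage-(T - 1) design matrix.

    import numpy as np

    def stage_coefficients(X, r0, r1, v_next_splines):
        """Coefficient pairs of Equation (12) at every knot where the targets (13) change slope.
        X              : design matrix (rows [1, s]) for the trajectories with a_{T-1} = a
        r0, r1         : their immediate rewards under the two basic reward functions
        v_next_splines : (knot-list, value-list) pairs for V_hat_T(s_T^i, .), same row order as X
        Returns a list of (alpha, theta) pairs; theta varies linearly between consecutive alphas."""
        knots = sorted(set(k for kn, _ in v_next_splines for k in kn))
        proj = np.linalg.pinv(np.asarray(X, float))          # (X'X)^{-1} X'
        r0, r1 = np.asarray(r0, float), np.asarray(r1, float)
        out = []
        for a in knots:
            y = (1 - a) * r0 + a * r1 \
                + np.array([pl_eval(kn, va, a) for kn, va in v_next_splines])   # targets of (13)
            out.append((a, proj @ y))                          # Eq. (12)
        return out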
4.3. Value Backups for t < T - 1

Section 4.2 relies on the linearity of Q̂_T(s, a, ·). However, for t < T, Q̂_t(s, a, ·) is non-linear in general. We now demonstrate how to compute V̂_t and Q̂_t from V̂_{t+1} when t < T - 1. To compute Q̂_t(s, a, α) from V̂_{t+1}, we first identify the intervals of α where the components of y^(a,(α))_t are simultaneously linear and store them; this is done in the same way as for t + 1 = T. We then compute the regression coefficients of y^(a,(α))_t on X^a_t as in (12) for each α in the stored knots, resulting in a piecewise bilinear function Q̂_t(s, a, α). To then compute V̂_t(s^i_t, ·) = max_j Q̂_t(s^i_t, a_j, ·) for each s^i_t in our dataset, we apply Algorithm 2 for each s^i_t. This entire procedure is described in Algorithm 3.

Algorithm 3: Value Backup, Infinite State Space
    /* A ∪← B means A ← A ∪ B */
    ∀(s, α): V̂_{T+1}(s, α) ← 0.   ∀s: κ(V̂_{T+1}(s, ·)) ← {0, 1}.
    for t = T downto 1 do
      for all a ∈ A do
        κ(Q̂^a_t) ← {}
        for all (s_t, a_t, s_{t+1}) ∈ D s.t. a_t = a do
          κ(Q̂^a_t) ∪← κ(V̂_{t+1}(s_{t+1}, ·))
        end for
        for all α ∈ κ(Q̂^a_t) do
          y^(a,(α))_t = ((1 - α) r^(a,(0))_t + α r^(a,(1))_t) + v̂^a_{t+1}(α)
          [θ̂^at_0, θ̂^at_1]ᵀ = ((X^a_t)ᵀ X^a_t)^(-1) (X^a_t)ᵀ y^(a,(α))_t
        end for
      end for
      for all s^i_t ∈ D do
        Compute κ(V̂_t(s^i_t, ·)) by Algorithm 2
      end for
    end for

4.4. Non-convexity of Q̂_t(·, a_j, ·)

Note that for t < T, the resulting Q̂_t(·, a_j, ·) are not necessarily convex in α. In the discrete case, we know by construction that the Q_T(s, a, ·) are linear and that therefore V_T(s, ·) is convex and piecewise linear. It follows that each Q_{T-1}(s, a, ·) is also convex because each is a positive-weighted sum of the convex V_T(s', ·). In the regression case, the Q̂_{T-1}(s, a, ·) are a weighted sum of the y^(a,(α))_{T-1}, which depend on the V̂_T(s^i_T, ·):

    Q̂_{T-1}(s, a, α) = [1, s] · ((X^a_{T-1})ᵀ X^a_{T-1})^(-1) (X^a_{T-1})ᵀ y^(a,(α))_{T-1}    (15)
                    = w(s)ᵀ · y^(a,(α))_{T-1}.    (16)

Here, w(s) is a vector that depends on s and on the data, but does not depend on α. Elements of w(s) can be negative. Therefore, although each element of y^(a,(α))_{T-1} is convex and piecewise linear in α, the Q̂_t(s, a, ·) may not be convex for t < T. This non-convexity means that both the algorithm of Barrett & Narayanan (2008) and important algorithms from the POMDP literature (e.g. Pineau et al., 2003) that operate on convex piecewise linear value functions are not directly applicable in our setting.
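The following small numeric check, with entirely made-up data, illustrates the point: w(s) in (16) is a fixed linear functional of the regression targets, and its entries can be negative, so Q̂_{T-1}(s, a, ·) is a signed rather than convex combination of convex piecewise linear functions.

    import numpy as np

    # Hypothetical stage-(T-1) design matrix with states -2, -1, 1, 2.
    X = np.column_stack([np.ones(4), np.array([-2.0, -1.0, 1.0, 2.0])])
    w = np.array([1.0, 1.5]) @ np.linalg.pinv(X)     # w(s) of Eq. (16) at s = 1.5
    print(w)                                          # [-0.05  0.1  0.4  0.55]: one weight is negative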
4.5. Complexity of Q̂_t(s, a, ·) and V̂_t(s, ·)

Suppose there are N trajectories and |A| actions. The terminal estimated Q-function Q̂_T(s, a, ·) has two knots, one at α = 0 and one at α = 1. The terminal value function V̂_T(s^i_T, ·) is constructed at N points by applying Algorithm 2 to the Q̂_T(s^i_T, a, ·) for each observed state s^1_T, s^2_T, ..., s^N_T. Each resulting V̂_T(s^i_T, ·) has at most |A| - 1 new internal knots, and therefore at most (|A| - 1) + 2 knots. To compute Q̂_{T-1}(·, a, ·), we use regression with targets constructed from the V̂_T(s^i_T, ·), where the observed states s^i_T come from tuples where a^i_{T-1} = a. There are N^a_{T-1} such observed states. Thus each Q̂_{T-1}(·, a, ·) has at most N^a_{T-1}(|A| - 1) + 2 knots. The union of their knot-lists has (Σ_{a∈A} N^a_{T-1}(|A| - 1)) + 2 = N(|A| - 1) + 2 knots. Computing V̂_{T-1}(s^i_{T-1}, ·) using Algorithm 2 adds at most |A| - 1 new knots between each pair of knots in the union, for a total of (|A| - 1)(N(|A| - 1) + 1) + 2 knots. In general, Q̂_t(s, a, ·) may have up to O(N^(T-t) |A|^(T-t)) knots, and V̂_t(s, ·) may have up to O(N^(T-t) |A|^((T-t)+1)) knots. To compute the expectation described in Section 4.2 at time t, our approach requires O(N^(T-t) |A|^((T-t)+1)) time for each trajectory, for a total of O(N^((T-t)+1) |A|^((T-t)+1)) time.

To show that our approach is computationally practical, we construct an example that is representative of the clinical trial data we encounter. We use 1290 trajectories across three timepoints, and three actions per time point. This mimics a simplified version of STAR*D, one of the largest randomized trial datasets currently available (Rush et al., 2004). Table 1 shows the number of knots and the computation time needed to represent Q̂(s, a, ·). Experiments were run on 1000 simulated datasets using Matlab 2009a on an 8-processor 3.0 GHz Xeon machine. In our example, the actual number of knots needed to represent a single Q̂_t(s, a, ·) was much lower than the worst-case bound. Computation time was on the order of seconds and thus feasible for our purposes.

Table 1. Knot counts and timings for computing Q̂(s, a, ·). Results are over 1000 randomly generated datasets using N = 1290, |A| = 3, T = 3.

             Knots in Q̂2   Knots in Q̂1   Time (s) for Q̂2   Time (s) for Q̂1
    Min          687           2814            3.17              5.46
    Med          790           3160            3.26              5.73
    Max          910           3916            3.44              6.55
    Bound       3870        1.5 · 10^7           -                 -

4.6. Identifying Non-Dominated Actions

We now show how to identify all of the non-dominated actions in the linear function approximation setting. In our Figure 3 example, only actions 1, 4, 6, 7, 8, 9, and 10 are represented in the diagram because it happens that these are the only actions that are non-dominated, i.e. that are estimated to be optimal for some (s, α) pair. Actions 2, 3, and 5 were not estimated to be optimal for any combination of s and α.

We begin by considering sets of dominated and non-dominated actions at time T. (To save space we will not always write the T subscript.) To analytically identify these sets of actions, we analyze the boundaries between the bilinear regions where one action has higher value than another. These boundaries occur where Q_T(·, a1, ·) = Q_T(·, a2, ·) for various actions a1 and a2, i.e. where

    (1 - α) · (θ^a1_00 + θ^a1_01 s) + α · (θ^a1_10 + θ^a1_11 s) = (1 - α) · (θ^a2_00 + θ^a2_01 s) + α · (θ^a2_10 + θ^a2_11 s),    (17)

which describes the hyperbola

    α = [(θ^a2_00 + θ^a2_01 s) - (θ^a1_00 + θ^a1_01 s)] / [(θ^a1_10 + θ^a1_11 s) - (θ^a1_00 + θ^a1_01 s) + (θ^a2_00 + θ^a2_01 s) - (θ^a2_10 + θ^a2_11 s)].    (18)

Along these boundaries, "triple-points" occur at (s, α) points where three or more actions are simultaneously optimal. Consider three actions a1, a2, a3, and write the boundary (18) between a1 and ak as α = N_k(s) / D_k(s), where N_k(s) and D_k(s) are the numerator and denominator of (18) with a2 replaced by ak; both are linear in s. The triple points for these actions occur at the s values given by the quadratic

    a s² + b s + c = N_2(s) D_3(s) - N_3(s) D_2(s) = 0,    (19)

whose coefficients a, b, and c are obtained by expanding these products of functions that are linear in s. After solving (19) for s, we can substitute the result into (18) to recover the corresponding α. At the solution points, either all of a1, a2, and a3 are optimal, or none of them are. Knowing the location of these triple-points is useful for determining if an action is dominated for all (s, α), as we now demonstrate.

Lemma 1. If action a is optimal at time T for some point (s, α) but is not optimal for any (s, α) on the boundary of the domain, then a is optimal for some (s, α) that is a triple-point.

Proof. Suppose a is optimal for some (s, α) in the domain but is not optimal for any (s, α) on the boundary of the domain. Further suppose that a is not optimal at any triple-point. Then the region where a is optimal must be completely enclosed by the region where a single other action a' is optimal. However, by Equation (18), the boundary between the regions where a is superior to a' and vice-versa has infinite extent in both s and α, and therefore must intersect the boundary of the domain of (s, α), since a is optimal for some (s, α) by assumption. Thus we have a contradiction.

From Lemma 1 we know that to find all actions that are optimal for some (s, α) we need only check the boundaries and the triple points. The boundaries can be checked using Algorithm 2. (Note that because V(s, ·) is bilinear in α and in s, we can also use Algorithm 2 to identify, for any fixed α, the actions that are optimal for some s.) We can then enumerate the (|A| choose 3) triple-points and check them to detect any regions that do not intersect the boundary of the domain, like that of action 1 in Figure 3, where we have identified the triple-points with white dots.
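Numerically, the triple points of a given action triple can be found by forming the boundary numerators and denominators as degree-one polynomials in s, expanding (19) with polynomial arithmetic, and reading α off (18). The sketch below is illustrative only; it assumes each action's coefficients are passed as a tuple (θ00, θ01, θ10, θ11).

    import numpy as np

    def triple_points(theta1, theta2, theta3, tol=1e-9):
        """Real (s, alpha) triple points of three actions at time T (Section 4.6).
        Each theta_k = (t00, t01, t10, t11), the bilinear coefficients of Equation (6)."""
        a00, a01, a10, a11 = theta1

        def boundary(theta_k):
            # N_k and D_k of Equation (18) as polynomials in s (highest degree first)
            t00, t01, t10, t11 = theta_k
            N = np.array([t01 - a01, t00 - a00])
            D = np.array([(a11 - a01) + (t01 - t11), (a10 - a00) + (t00 - t10)])
            return N, D

        N2, D2 = boundary(theta2)
        N3, D3 = boundary(theta3)
        quad = np.polysub(np.polymul(N2, D3), np.polymul(N3, D2))   # Equation (19)
        points = []
        for s in np.roots(quad):
            if abs(s.imag) > tol:
                continue
            s = s.real
            den = np.polyval(D2, s)
            if abs(den) > tol:
                points.append((s, np.polyval(N2, s) / den))          # alpha from Equation (18)
        return points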
This procedure reveals all actions that are optimal for some (s, α), and thereby identifies any actions that are not optimal for any (s, α). Checking the (10 choose 3) = 120 intersections in the example takes approximately 0.06 seconds using Matlab 2009b on a 3.0 GHz Xeon processor. While we have described the process for bilinear functions, we can immediately extend this algorithm to piecewise bilinear functions by applying it between pairs of knots.

5. Extensions to Higher Dimensions

Although the presented algorithms are sufficient for handling the quality and quantity of data often found in the analysis of randomized clinical trials, it is interesting to consider extending the approaches to higher dimensions. Extending Algorithm 3 to accommodate more than one state variable is trivial; each regression simply results in more coefficients at each knot. Accommodating more reward functions will be more challenging. In our setting, Q̂_t(s, a, ·) is piecewise linear but not convex in α. For d reward functions and tradeoffs α = (α_1, ..., α_d) on the (d - 1)-simplex, we will represent the (d - 1)-dimensional regions over which Q̂_t(s, a, ·) is linear, along with its value within each region. For d = 3 we anticipate using triangulation algorithms that require O(n log n) time for n knots. We conjecture that an analogue of Lemma 1 holds in higher dimensions, and that identifying all non-dominated actions for more state variables and/or reward functions will require computing intersections of hyperboloids in higher dimensions. For p state variables and d reward functions, we would need to find all points where p + d actions are simultaneously optimal. This will not have a closed-form solution, and will require numerical methods for finding zeros of polynomials.

6. Conclusion

Data analysis of multi-stage randomized clinical trials is challenging because the priorities of future users of the analysis are unknown. A policy learned for patients who desire very low symptom levels may not be useful for patients who prefer to have few side-effects. We have shown how uncertainty about these future priorities can be expressed as a set of possible MDPs that are defined by a convex set of reward functions. We discussed algorithms for simultaneously solving all MDPs in this set. We introduced a new algorithm that is asymptotically more time- and space-efficient than previous approaches in the discrete-state case, and showed how it can be extended to the continuous-state setting using linear function approximation. We also gave an algorithm for finding the set of all non-dominated actions in the continuous-state setting. This novel extension is appropriate for environments where we must estimate state-action value functions from data, as when analyzing randomized controlled trials.

References

Barrett, L. and Narayanan, S. Learning all optimal policies with multiple criteria. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), 2008.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming, chapter 2.1, pp. 12. Athena Scientific, 1996.

Ehrgott, M. Multicriteria Optimization, chapter 3. Springer, second edition, 2005.

Murphy, S. A., Oslin, D. W., Rush, A. J., and Zhu, J. Methodological challenges in constructing effective treatment sequences for chronic psychiatric disorders. Neuropsychopharmacology, 32:257-262, 2007.

Pineau, J., Gordon, G., and Thrun, S. Point-based value iteration: An anytime algorithm for POMDPs. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 1025-1032, 2003.

Pineau, J., Bellemare, M. G., Rush, A. J., Ghizaru, A., and Murphy, S. A. Constructing evidence-based treatment strategies using methods from computer science. Drug and Alcohol Dependence, 88(Suppl 2):S52-S60, May 2007.

Rush, A. J., Fava, M., Wisniewski, S. R., Lavori, P. W., Trivedi, M., Sackeim, H. A., et al. Sequenced treatment alternatives to relieve depression (STAR*D): rationale and design. Controlled Clinical Trials, 25(1):119-142, Feb 2004.

Zhao, Y., Kosorok, M. R., and Zeng, D. Reinforcement learning design for cancer clinical trials. Statistics in Medicine, 28:3294-3315, 2009.