Privacy-Preserving Reinforcement Learning

Jun Sakuma (jun@fe.dis.titech.ac.jp), Shigenobu Kobayashi (kobayasi@dis.titech.ac.jp)
Tokyo Institute of Technology, 4259 Nagatsuta-cho, Midori-ku, Yokohama, 226-8502, Japan

Rebecca N. Wright (rebecca.wright@rutgers.edu)
Rutgers University, 96 Frelinghuysen Road, Piscataway, NJ 08854, USA

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

We consider the problem of distributed reinforcement learning (DRL) from private perceptions. In our setting, agents' perceptions, such as states, rewards, and actions, are not only distributed but also should be kept private. Conventional DRL algorithms can handle multiple agents, but do not necessarily guarantee privacy preservation and may not guarantee optimality. In this work, we design cryptographic solutions that achieve optimal policies without requiring the agents to share their private information.

1. Introduction

With the rapid growth of computer networks and networked computing, a large amount of information is sensed and gathered by physically or virtually distributed agents. Distributed reinforcement learning (DRL) has been studied as an approach to learn a control policy through interactions between distributed agents and environments, for example in sensor networks and mobile robots. DRL algorithms, such as the distributed value function approach (Schneider et al., 1999) and the policy gradient approach (Moallemi & Roy, 2004), typically seek to satisfy two types of physical constraints. One is constraints on communication, such as an unstable network environment or limited communication channels. The other is memory constraints for managing the huge state/action space. Therefore, the main emphasis of DRL has been to learn good, but sub-optimal, policies with minimal or limited sharing of agents' perceptions.

In this paper, we consider the privacy of agents' perceptions in DRL. Specifically, we provide solutions for privacy-preserving reinforcement learning (PPRL), in which agents' perceptions, such as states, rewards, and actions, are not only distributed but are desired to be kept private. Consider two example scenarios:

Optimized Marketing (Abe et al., 2004): Consider the modeling of a customer's purchase behavior as a Markov Decision Process (MDP). The goal is to obtain the optimal catalog mailing strategy which maximizes the long-term profit. Timestamped histories of customer status and mailing records are used as state variables, and purchase patterns are used as actions. Value functions are learned from these records to learn the optimal policy. If these histories are managed separately by two or more enterprises, they may not want to share their histories for privacy reasons (for example, in keeping with privacy promises made to their customers), but might still like to learn a value function from their joint data so that they can all maximize their profits.

Load Balancing (Cogill et al., 2006): Consider load balancing among competing factories. Each factory wants to accept customer jobs, but in order to maximize its own profit, may need to redirect jobs when heavily loaded. Each factory can observe its own backlog, but factories do not want to share their backlog information with each other for business reasons; they would still like to make optimal decisions.

Privacy constraints prevent the data from being combined in a single location where centralized reinforcement learning (CRL) algorithms could be applied. Although DRL algorithms work in a distributed setting, they are designed to limit the total amount of data sent between agents, but do not necessarily do so in a way that guarantees privacy preservation. Additionally, DRL often sacrifices optimality in order to learn with low communication. In contrast, we propose solutions that employ cryptographic techniques to achieve optimal policies (as would be learned if all the information were combined into a centralized reinforcement learning (CRL) problem) while also explicitly protecting the agents' private information. We describe solutions both for data that is "partitioned-by-time" (as in the optimized marketing example) and "partitioned-by-observation" (as in the load balancing example).

Related Work. Private distributed protocols have been considered extensively for data mining, pioneered by Lindell and Pinkas (Lindell & Pinkas, 2002), who presented a privacy-preserving data-mining algorithm for ID3 decision-tree learning. Private distributed protocols have also been proposed for other data mining and machine learning problems, including k-means clustering (Jagannathan & Wright, 2005; Sakuma & Kobayashi, 2008), support vector machines (Yu et al., 2006), boosting (Gambs et al., 2007), and belief propagation (Kearns et al., 2007). Agent privacy in reinforcement learning has been previously considered by Zhang and Makedon (Zhang & Makedon, 2005). Their solution uses a form of average-reward reinforcement learning that does not necessarily guarantee an optimal solution; further, their solution applies only to partitioning by time. In contrast, our solutions guarantee optimality under appropriate conditions, and we provide solutions both when the data is partitioned by time and when it is partitioned by observation.

In principle, private distributed computations such as these can be carried out using secure function evaluation (SFE) (Yao, 1986; Goldreich, 2004), which is a general and well-studied methodology for evaluating any function privately. However, although asymptotically polynomially bounded, these computations can be too inefficient for practical use, particularly when the input size is large. For the reinforcement learning algorithms we address, we make use of existing SFE solutions for small portions of our computation as part of a more efficient overall solution.

Our Contribution. We introduce the concepts of partitioning by time and partitioning by observation in distributed reinforcement learning (Section 2). We show privacy-preserving solutions for SARSA learning algorithms with random action selection for both kinds of partitioning (Section 4). Additionally, these algorithms are extended to Q-learning with greedy or ε-greedy action selection (Section 5). We provide experimental results in Section 6.

Table 1 provides a qualitative comparison of variants of reinforcement learning in terms of efficiency, learning accuracy, and privacy loss. We compare five approaches: CRL, DRL, independent distributed reinforcement learning (IDRL, explained below), SFE, and our privacy-preserving reinforcement learning solutions (PPRL).

Table 1. Comparison of different approaches

             CRL    DRL        IDRL      PPRL      SFE
  comp.      good   good       good      medium    bad
  comm.      good   good       good      medium    bad
  accuracy   good   medium     bad       good      good
  privacy    none   imperfect  perfect   perfect   perfect

In CRL, all the agents send their perceptions to a designated agent, and then centralized reinforcement learning is applied.
In this case, the optimal convergence of value functions is theoretically guaranteed when the dynamics of the environment follow a discrete MDP; however, privacy is not provided, as all the data must be shared. On the opposite end of the spectrum, in IDRL (independent DRL), each agent independently applies CRL using only its own local information; no information is shared. In this case, privacy is completely preserved, but the learning results will be different and independent. In particular, accuracy will be unacceptable if the agents have incomplete but important perceptions about the environment. DRL can be viewed as an intermediate approach between CRL and IDRL, in that the parties share only some information and accordingly reap only some gains in accuracy. The table also includes the direct use of general SFE and our approach of PPRL. Both PPRL and SFE obtain good privacy and good accuracy. Although our solution incurs a significant cost (as compared to CRL, IDRL, and DRL) in computation and communication to obtain this, it does so with significantly improved computational efficiency over SFE. We provide a more detailed comparison of the privacy, accuracy, and efficiency of our approach and other possible approaches along with our experimental results in Section 6.

2. Preliminaries

2.1. Reinforcement Learning and MDP

Let $S$ be a finite state set and $A$ be a finite action set. A policy $\pi$ is a mapping from a state/action pair $(s, a)$ to the probability $\pi(s, a)$ with which action $a$ is taken at state $s$. At time step $t$, we denote by $s_t$, $a_t$, and $r_t$ the state, action, and reward at time $t$, respectively. A Q-function is the expected return
$Q^{\pi}(s, a) = E\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s, a_t = a\right]$,
where $\gamma$ is a discount factor ($0 < \gamma \le 1$). The goal is to learn the optimal policy $\pi^*$ maximizing the Q-function: $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$ for all $(s, a)$. In SARSA learning, Q-values are updated at each step as
$\Delta Q(s_t, a_t) \leftarrow \alpha\,(r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)), \qquad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \Delta Q(s_t, a_t),$   (1)
where $\alpha$ is the learning rate. Q-learning is obtained by replacing the update of $\Delta Q$ by
$\Delta Q(s_t, a_t) \leftarrow \alpha\,(r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)).$
Iterating these updates under appropriate conditions, optimal convergence of the Q-values is guaranteed with probability 1 in discrete MDPs (Sutton & Barto, 1998; Watkins, 1989); the resulting optimal policy can then be readily obtained.
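For concreteness, the two tabular updates can be written as the following sketch. This is only an illustrative, unencrypted implementation; the toy state/action space, hyperparameter values, and dummy transition are assumptions of this example, not taken from the paper.

```python
import random
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Eq. 1: dQ = alpha*(r + gamma*Q(s',a') - Q(s,a)); then Q(s,a) += dQ."""
    dQ = alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
    Q[(s, a)] += dQ

def q_learning_update(Q, actions, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q-learning replaces the bootstrap term with max over a' of Q(s', a')."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# One step with random action selection on a toy two-state, two-action space.
actions = [0, 1]
Q = defaultdict(float)                 # unvisited (s, a) pairs default to 0
s, a = 0, random.choice(actions)
r, s_next = 1.0, 1                     # dummy transition; an environment would supply these
a_next = random.choice(actions)
sarsa_update(Q, s, a, r, s_next, a_next)
q_learning_update(Q, actions, s, a, r, s_next)
```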
2.2. Modeling Private Information in DRL

Let $h_t = (s_t, a_t, r_t, s_{t+1}, a_{t+1})$, let $H = \{h_t\}$, and suppose there are $m$ agents. We consider two kinds of partitioning of $H$ (see Fig. 1).

Figure 1. Partitioning model in the two-agent case: in the partitioned-by-time model, agents A and B each hold complete tuples $(s_t, a_t, r_t, s_{t+1})$ for disjoint sets of time steps; in the partitioned-by-observation model, each tuple is split into A's part $(s^A_t, a^A_t, r^A_t, s^A_{t+1})$ and B's part $(s^B_t, a^B_t, r^B_t, s^B_{t+1})$. (Diagram omitted.)

Partitioned-by-Time. This model assumes that only one agent interacts with the environment at any time step $t$. Let $T^i$ be the set of time steps at which only the $i$th agent interacts with the environment. Then $T^i \cap T^j = \emptyset$ for $i \ne j$, and the set $H^i = \{h_t \mid t \in T^i\}$ is considered the private information of the $i$th agent.

Partitioned-by-Observation. This model assumes that states and actions are represented as a collection of state and action variables. The state space and the action space are $S = \prod_i S^i$ and $A = \prod_i A^i$, where $S^i$ and $A^i$ are the spaces of the $i$th agent's state and action variables, respectively. Without loss of generality (and for notational simplicity), we consider each agent's local state and action spaces to consist of a single variable. If $s_t \in S$ is the joint state of the agents at time $t$, we denote by $s^i_t$ the state that the $i$th agent perceives and by $a^i_t$ the action of the $i$th agent. Let $r^i_t$ be the local reward of the $i$th agent obtained at time $t$. We define the global reward (or reward for short) as $r_t = \sum_i r^i_t$ in this model. Our Q-functions are evaluated based on this global reward. The perception of the $i$th agent at time $t$ is denoted as $h^i_t = \{s^i_t, a^i_t, r^i_t, s^i_{t+1}, a^i_{t+1}\}$, and the private information of the $i$th agent is $H^i = \{h^i_t\}$.

We note that partitioning by observation is more general than partitioning by time, in that one can always represent a sequence that is partitioned by time by one that is partitioned by observation. However, we provide more efficient solutions in the simpler case of partitioning by time.

Let $\pi_c$ be a policy learned by CRL. Then, informally, the objective of PPRL is stated as follows:

Statement 1. The $i$th agent takes $H^i$ as input. After the execution of PPRL, all agents learn a policy $\pi$ which is equivalent to $\pi_c$. Furthermore, no agent can learn anything that cannot be inferred from $\pi$ and its own private input.

This problem statement can be formalized as in SFE (Goldreich, 2004). This is a strong privacy requirement which precludes consideration of solutions that reveal intermediate Q-values, actions taken, or states visited. We assume our agents behave semi-honestly, a common assumption in SFE: agents follow their specified protocol properly, but might also use their records of intermediate computations in order to attempt to learn other parties' private information.

3. Cryptographic Building Blocks

Our solutions make use of several existing cryptographic tools. Specifically, in our protocols, Q-values are encrypted with an additive homomorphic cryptosystem, which allows the addition of encrypted values without requiring their decryption, as described in Section 3.1. The homomorphic properties allow encrypted Q-values to be updated in the regular RL manner, while the unencrypted Q-values remain unknown to the agents. For computations which cannot be handled by the homomorphic property, we use SFE as a primitive, as described in Section 3.2.

3.1. Homomorphic Public Key Cryptosystems

In a public key cryptosystem, encryption uses a public key that can be known to everyone, while decryption requires knowledge of the corresponding private key. Given a corresponding pair $(sk, pk)$ of private and public keys and a message $m$, $c = e_{pk}(m; \rho)$ denotes a (random) encryption of $m$, and $m = d_{sk}(c)$ denotes decryption. The encrypted value $c$ is uniformly distributed over $\mathbb{Z}_N$ if $\rho$ is taken from $\mathbb{Z}_N$ uniformly at random. An additive homomorphic cryptosystem allows addition computations on encrypted values without knowledge of the secret key. Specifically, there is some operation $\cdot$ (not requiring knowledge of $sk$) such that for any plaintexts $m_1$ and $m_2$,
$e_{pk}(m_1 + m_2; \rho) = e_{pk}(m_1; \rho_1) \cdot e_{pk}(m_2; \rho_2),$
where $\rho$ is uniformly random provided that at least one of $\rho_1$ and $\rho_2$ is. Based on this property, it also follows that given a constant $k$ and the encryption $e_{pk}(m_1; \rho)$, we can compute multiplications by $k$ via repeated application of $\cdot$.
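To make the additive homomorphic property concrete, here is a toy sketch using the textbook Paillier cryptosystem with tiny, insecure parameters (Python 3.9+). It is only an illustration of the operations used below; it is not the Damgård-Jurik generalization used in the paper and it has no threshold decryption.

```python
import math, random

# Textbook Paillier with toy, insecure primes (illustration only; real keys are >= 1024 bits).
p, q = 1789, 2003
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)     # precomputed decryption constant

def enc(m):
    """e(m): additively homomorphic, randomized encryption of m mod n."""
    r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

c1, c2 = enc(12), enc(30)
assert dec(c1 * c2 % n2) == 42          # e(m1) * e(m2) = e(m1 + m2)
assert dec(pow(c1, 5, n2)) == 60        # e(m)^k       = e(k * m)
assert dec(c1 * enc(0) % n2) == 12      # multiplying by a fresh e(0) re-randomizes
```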
The homomorphic property also enables re-randomization: from an existing encryption $c = e_{pk}(m; \rho)$ of $m$, one can compute a new random encryption $c' = e_{pk}(m; \rho')$ of $m$, again without knowledge of the private key or of $m$, as $e_{pk}(m; \rho') = e_{pk}(m; \rho_1) \cdot e_{pk}(0; \rho_2)$. In the rest of the paper, we omit the random number $\rho$ from our encryptions for simplicity.

In an $(m, t)$-threshold cryptosystem, $m$ agents share a common public key $pk$, while the agents hold different private keys $sk_1, \ldots, sk_m$. Each agent can encrypt any message with the common public key. Decryption cannot be performed by fewer than $t$ agents, but can be performed by any group of at least $t$ agents using a recovery algorithm based on the public key and their decryption shares $d_{sk_1}(c), \ldots, d_{sk_m}(c)$. We require a cryptosystem that provides semantic security (under appropriate computational hardness assumptions), re-randomization, the additive homomorphic property, and threshold decryption, such as the generalized Paillier cryptosystem (Damgård & Jurik, 2001).

3.2. Private Comparison and Division

As mentioned, secure function evaluation (SFE) is a cryptographic primitive which allows two or more parties to evaluate a specified function of their inputs without revealing (anything else about) their inputs to each other (Goldreich, 2004; Yao, 1986). Although our overall solution is more efficient than using SFE, we do make use of SFE for two kinds of computations.

One is the problem of private comparison of random shares. Let $x = (x_1, \ldots, x_d) \in \mathbb{Z}_N^d$. For our purposes, $A$ and $B$ have random shares of $x$ if $A$ has $x^A = (x^A_1, \ldots, x^A_d)$ and $B$ has $x^B = (x^B_1, \ldots, x^B_d)$ such that $x^A_i$ and $x^B_i$ are uniformly distributed in $\mathbb{Z}_N$ and $x_i = (x^A_i + x^B_i) \bmod N$ for all $i$. If $A$ holds $x^A$ and $B$ holds $x^B$, where $x^A$ and $x^B$ are random shares of $x$, then private comparison of random shares computes the index $i^* = \arg\max_i (x^A_i + x^B_i)$ in such a way that $A$ learns only $i^*$ and $B$ learns nothing.

The other is the problem of private division of random shares. The inputs of $A$ and $B$ are random shares of $x$, namely $x^A \in \mathbb{Z}_N$ and $x^B \in \mathbb{Z}_N$, respectively. Let $K$ be an integer known to both parties. Then private division of random shares computes random shares $Q^A$ and $Q^B$ of the quotient $Q \in \mathbb{Z}_N$ such that $x = (QK + R) \bmod N$, where $R \in \mathbb{Z}_N$ ($0 \le R < K$) and $Q = (Q^A + Q^B) \bmod N$. After the protocol, $A$ and $B$ learn $Q^A$ and $Q^B$, respectively, and nothing else.

We use private division of random shares in several places in our protocols to achieve private division of encrypted values. Suppose agent $A$ has a key pair $(pk, sk)$ and agent $B$ knows $pk$ and $e_{pk}(x)$. The following protocol allows $B$ to learn the encrypted quotient $e_{pk}(Q)$ from $e_{pk}(x)$ and $K$:

1. $B$ computes $c' \leftarrow e_{pk}(x) \cdot e_{pk}(-x^B)$ with $x^B \in_r \mathbb{Z}_N$ and sends $c'$ to $A$.
2. $A$ computes the decryption $x^A \leftarrow d_{sk}(c')\ (= x - x^B \bmod N)$.
3. Using SFE for private division on $A$'s and $B$'s inputs $x^A$ and $x^B$, respectively, $A$ and $B$ obtain outputs $Q^A$ and $Q^B$, respectively.
4. $A$ sends $e_{pk}(Q^A)$ to $B$.
5. $B$ computes $e_{pk}(Q) \leftarrow e_{pk}(Q^A) \cdot e_{pk}(Q^B)$.
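The following sketch walks through these five steps once, reusing the toy Paillier helpers from the sketch above. The `sfe_division` function stands in for a generic two-party SFE (such as a Fairplay circuit); computing it in the clear in one process, as done here, is only for illustration and is of course not private.

```python
import math, random

# Toy Paillier helpers (same parameters as the Section 3.1 sketch; illustration only).
p, q = 1789, 2003
n, n2, g = p * q, (p * q) ** 2, p * q + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)
enc = lambda m: pow(g, m, n2) * pow(random.randrange(1, n), n, n2) % n2
dec = lambda c: ((pow(c, lam, n2) - 1) // n) * mu % n

K = 1000                              # public divisor
x = 123456                            # hidden value; B holds e(x), A holds the secret key

def sfe_division(xA, xB, K):
    """Stand-in for SFE: random shares QA, QB with (QA + QB) mod n = ((xA + xB) mod n) // K."""
    QA = random.randrange(n)
    return QA, ((((xA + xB) % n) // K) - QA) % n

c = enc(x)
xB = random.randrange(n)              # step 1: B blinds e(x) with a random share
xA = dec(c * enc((-xB) % n) % n2)     # step 2: A decrypts and learns only (x - xB) mod n
QA, QB = sfe_division(xA, xB, K)      # step 3: SFE outputs shares of the quotient
cQ = enc(QA) * enc(QB) % n2           # steps 4-5: B combines the encrypted shares
assert dec(cQ) == x // K              # B ends up with e(Q), Q = floor(x / K)
```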
4. Private Q-Value Update

In this section, we describe the privacy-preserving SARSA update of Q-values under random action selection for our two partitioning models. We extend this to (ε-)greedy action selection in Section 5. We assume that the reward $r$, the learning rate $\alpha$, and the discount rate $\gamma$ are non-negative rational numbers and that $\sum_{t=1}^{\infty} \gamma^{t} L r_{\max} < N$, where $r_{\max}$ is the largest reward that agents can obtain and $L \in \mathbb{Z}_N$ is a parameter defined in Section 4.1. In this paper, we describe protocols for two agents; these can be extended to the $m$-agent case ($m \ge 3$) straightforwardly, as will be shown in an extended version of the paper.

4.1. Partitioned-by-Time Model

We first restrict our attention to the case where agent $A$ has perceptions during $T^A = \{1, \ldots, t-1\}$ and $B$ has perceptions during $T^B = \{t\}$. In this setting, $A$ can learn the intermediate Q-values during the time period $T^A$, because they can be computed locally from $A$'s perceptions alone. At time $t$, the new Q-values must be computed based on the intermediate Q-values known to $A$ and $B$'s observation at time $t$. In brief, we do this by carrying out the update on encrypted Q-values, using the homomorphic property to do so privately. However, the update includes multiplication by rational numbers, such as $\alpha$ or $\gamma$, so the computation is not closed in $\mathbb{Z}_N$. Hence, we first scale these rational numbers by large enough integers so that all computations are closed in $\mathbb{Z}_N$, and we use private division of encrypted values to remove the scaling.

- Public input: $L$, $K$, learning rate $\alpha$, discount rate $\gamma$
- $A$'s input: $Q(s, a)$ (trained by $A$ during $T^A$); $B$'s input: $(s_t, a_t)$
- $A$'s output: nothing; $B$'s output: encryption of the updated Q-value $c(s_t, a_t)$
1. $A$: Compute $e_{pk_A}(Q(s, a))$ for all $(s, a)$ and send them to $B$.
2. $B$: Take action $a_t$ and get $r_t$, $s_{t+1}$.
3. $B$: Choose $a_{t+1}$ randomly.
4. Update Q-value:
   (a) $B$: Compute $e_{pk_A}(K \Delta Q(s_t, a_t))$ by eq. 3.
   (b) $B$: Do private division of $e_{pk_A}(K \Delta Q(s_t, a_t))$ with $A$; then $B$ learns $e_{pk_A}(\Delta Q'(s_t, a_t))$.
   (c) $B$: Update $c(s_t, a_t)$ by eq. 4.
Figure 2. Private update of Q-values in the partitioned-by-time model (SARSA/random action selection).

We now describe our protocol for private update, shown in Fig. 2, in more detail. Let $pk_A$ be $A$'s public key. At step 1, $A$ computes $c(s, a) = e_{pk_A}(Q(s, a))$ for all $(s, a)$ and sends them to $B$. $B$ takes action $a_t$, gets $r_t$ and $s_{t+1}$ (step 2), and chooses $a_{t+1}$ randomly (step 3). $A$ and $B$ must now update the encrypted Q-value $c(s, a)$. By encrypting both sides of the SARSA update (eq. 1), we obtain
$c(s_t, a_t) \leftarrow e_{pk_A}(Q(s_t, a_t) + \Delta Q(s_t, a_t)) = e_{pk_A}(Q(s_t, a_t)) \cdot e_{pk_A}(\Delta Q(s_t, a_t)) = c(s_t, a_t) \cdot \Delta c(s_t, a_t),$   (2)
where $\Delta c(s_t, a_t) = e_{pk_A}(\Delta Q(s_t, a_t))$. If $\Delta c(s_t, a_t)$ can be computed by $B$ from what $B$ observes, then $B$ can update $c(s_t, a_t)$ by eq. 2 locally. Therefore, step 4 is devoted to the computation of $\Delta c(s_t, a_t)$.

As mentioned, large integers $K$ and $L$ are used to handle the multiplication by the rationals $\alpha$ and $\gamma$, where $K\alpha, K\alpha\gamma \in \mathbb{Z}_N$ and $L r_t \in \mathbb{Z}_N$ for all $r_t$. Multiplying both sides of eq. 1 by $K$ and scaling $r_t$ by $L$, we obtain
$K \Delta Q(s_t, a_t) \leftarrow K\alpha\,(L r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)),$
in which the computation is closed in $\mathbb{Z}_N$. Encrypting both sides with $A$'s public key, we obtain
$e_{pk_A}(K \Delta Q(s_t, a_t)) = e_{pk_A}(L r_t)^{K\alpha} \cdot c(s_{t+1}, a_{t+1})^{K\alpha\gamma} \cdot c(s_t, a_t)^{-K\alpha}.$   (3)
Since $K$, $L$, $\alpha$, $\gamma$ are public and $B$ has $r_t$ and $c(s, a)$, $B$ can compute $e_{pk_A}(K \Delta Q(s_t, a_t))$ by eq. 3 (step 4(a)). $B$ then needs to divide $e_{pk_A}(K \Delta Q(s_t, a_t))$ by $K$; however, division is not possible under encryption. Instead, a quotient $\Delta Q'(s_t, a_t)$ satisfying $K \Delta Q(s_t, a_t) = K \Delta Q'(s_t, a_t) + R$ ($0 \le R < K$) is computed by private encrypted division, and $B$ obtains $e_{pk_A}(\Delta Q'(s_t, a_t))$ (step 4(b)). Then $B$ finally computes
$c(s_t, a_t) \leftarrow e_{pk_A}(\Delta Q'(s_t, a_t)) \cdot c(s_t, a_t).$   (4)
It follows that eq. 4 is equivalent to eq. 2 except for the truncation error introduced by the private encrypted division step (step 4(c)). This truncation is negligibly small if $L$ is sufficiently large.

Lemma 1. If $A$ and $B$ behave semi-honestly, then after the private update of Q-values for SARSA with random action selection in the partitioned-by-time model, $B$ correctly updates the encrypted Q-values but learns nothing else, and $A$ learns nothing.

The proof of this lemma (omitted for space) follows the standard proof methodology of secure multi-party computation (Goldreich, 2004), showing that one can construct the required algorithms, called simulators, for $A$ and $B$. Intuitively, step 4(b) is secure because it is implemented by SFE. Everything else that $B$ receives, apart from the messages of step 4(b), is encrypted under $A$'s public key and so reveals nothing. $A$ does not receive anything except messages that are part of the SFE in step 4(b), so it does not learn anything. Thus, the protocol is secure overall.

For the general setting of $T^A$ and $T^B$: if $B$ interacts with the environment again at time $t+1$, the protocol can be restarted from step 2. When interaction switches back to $A$, an SFE step is used to switch the encryption of the Q-values from $A$'s key to $B$'s key, and then the roles of $A$ and $B$ are exchanged.
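The sketch below walks through eqs. 2-4 for a single update, again with toy Paillier parameters. The concrete values of K and L, the tiny Q-table, and the private_div_by_K stub (which decrypts locally purely for illustration, standing in for the Section 3.2 subprotocol) are assumptions of this sketch, not the authors' implementation; we also read the paper's scaling as storing L-scaled Q-values inside the ciphertexts.

```python
import math, random

# Toy Paillier helpers as in the earlier sketches (illustration only).
p, q = 1789, 2003
n, n2, g = p * q, (p * q) ** 2, p * q + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)
enc = lambda m: pow(g, m, n2) * pow(random.randrange(1, n), n, n2) % n2
dec = lambda c: ((pow(c, lam, n2) - 1) // n) * mu % n

K, L = 100, 10**4                       # public scaling constants (illustrative)
alpha, gamma = 0.1, 0.9
KA, KAG = round(K * alpha), round(K * alpha * gamma)     # K*alpha, K*alpha*gamma as integers

def private_div_by_K(c_KdQ):
    """Stand-in for the Section 3.2 protocol: returns e(floor(K*dQ / K))."""
    return enc(dec(c_KdQ) // K)

# A's side: encrypt the (L-scaled) Q-table and hand the ciphertexts to B.
Q_scaled = {('s0', 'a0'): 0, ('s1', 'a0'): 2 * L}        # i.e. Q(s1, a0) = 2.0
c = {sa: enc(v) for sa, v in Q_scaled.items()}

# B's side: one SARSA step with (s_t, a_t) = (s0, a0), r_t = 1, (s_{t+1}, a_{t+1}) = (s1, a0).
st_at, r_t, nxt = ('s0', 'a0'), 1, ('s1', 'a0')
c_KdQ = (pow(enc(L * r_t), KA, n2)           # e(L r_t)^(K alpha)
         * pow(c[nxt], KAG, n2)              # c(s_{t+1}, a_{t+1})^(K alpha gamma)
         * pow(c[st_at], -KA, n2)) % n2      # c(s_t, a_t)^(-K alpha); eq. 3
c[st_at] = private_div_by_K(c_KdQ) * c[st_at] % n2       # eqs. 4 and 2

# Check against the plaintext SARSA update (non-negative update in this toy example;
# negative residues near n would need signed decoding, a detail elided here).
expected = 0.0 + alpha * (r_t + gamma * 2.0 - 0.0)
assert abs(dec(c[st_at]) / L - expected) < 1e-2
```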
4.2. Partitioned-by-Observation Model

In this model, we use a (2, 2)-threshold cryptosystem. Both parties share a common public key $pk$; encryption of $m$ under $pk$ is denoted by $e(m)$ in this section. $A$ and $B$ hold different secret keys $sk_A$ and $sk_B$ for decryption shares, respectively, so $A$ and $B$ cannot decrypt without cooperating. In this partitioning model, we write $a_t = (a^A_t, a^B_t)$, $s_t = (s^A_t, s^B_t)$, and $r_t = r^A_t + r^B_t$. $A$ receives only $(s^A_t, a^A_t, r^A_t)$ and $B$ receives only $(s^B_t, a^B_t, r^B_t)$. Private update of Q-values in this model is shown in Fig. 3. In this model, eq. 3 is rewritten as
$e(K \Delta Q(s_t, a_t)) = X^A \cdot X^B \cdot X,$ where $X^A = e(L r^A_t)^{K\alpha}$, $X^B = e(L r^B_t)^{K\alpha}$, and $X = c(s_{t+1}, a_{t+1})^{K\alpha\gamma} \cdot c(s_t, a_t)^{-K\alpha}.$   (5)

$X^A$ and $X^B$ can be computed by $A$ and $B$, respectively. To obtain $c(s_{t+1}, a_{t+1})$ and $c(s_t, a_t)$, let $h, i, j, k$ be indices of the Q-table, where $h \in S^A$, $i \in S^B$, $j \in A^A$, $k \in A^B$. At step 4(a), $A$ sends $X^A$ and tables $\{c_{ik}\}$, $\{c'_{ik}\}$ with re-randomization, where
$c_{ik} = c(s^A_t, i, a^A_t, k) \cdot e(0) \quad (i \in S^B,\ k \in A^B),$   (6)
$c'_{ik} = c(s^A_{t+1}, i, a^A_{t+1}, k) \cdot e(0) \quad (i \in S^B,\ k \in A^B),$   (7)
to $B$. $B$ determines $c(s_t, a_t) = c_{s^B_t, a^B_t}$ and $c(s_{t+1}, a_{t+1}) = c'_{s^B_{t+1}, a^B_{t+1}}$, and obtains $e(K \Delta Q(s_t, a_t))$ by eq. 5 (step 4(b)). $B$ then computes $e(\Delta Q'(s_t, a_t))$ by private division (step 4(c)). For all $(h, i, j, k)$, $B$ sets
$\Delta c_{hijk} = e(\Delta Q'(s_t, a_t))$ if $(i, k) = (s^B_t, a^B_t)$, and $\Delta c_{hijk} = e(0)$ otherwise,   (8)
and sends $\{\Delta c_{hijk}\}$ to $A$ (step 4(d)). Finally, for all $(i, k)$, the Q-values are updated by $A$ as
$c(s^A_t, i, a^A_t, k) \leftarrow c(s^A_t, i, a^A_t, k) \cdot \Delta c_{s^A_t\, i\, a^A_t\, k}.$   (9)
With this update, $e(\Delta Q'(s_t, a_t))$ is added only when $(h, i, j, k) = (s^A_t, s^B_t, a^A_t, a^B_t)$; otherwise, $e(0)$ is added. Note that $A$ cannot tell which element of $\{\Delta c_{hijk}\}$ is $e(\Delta Q'(s_t, a_t))$ because of the re-randomization. Thus, eq. 9 is the desired update.
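The masking step of eqs. 8-9 can be sketched as follows, again with the toy Paillier helpers; the table sizes, values, and the placeholder row standing in for $c(s^A_t, i, a^A_t, k)$ are illustrative assumptions. The point is that every entry B returns is a fresh encryption, so A's entrywise multiplication updates exactly one cell without revealing which one.

```python
import math, random

# Toy Paillier helpers as in the earlier sketches (illustration only).
p, q = 1789, 2003
n, n2, g = p * q, (p * q) ** 2, p * q + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)
enc = lambda m: pow(g, m, n2) * pow(random.randrange(1, n), n, n2) % n2
dec = lambda c: ((pow(c, lam, n2) - 1) // n) * mu % n

S_B, A_B = range(3), range(2)            # B's state and action variables
sB_t, aB_t = 1, 0                        # B's secret observation at time t
dQ_enc = enc(2800)                       # e(dQ'), obtained from the private division step

# B's side (eq. 8): e(dQ') at B's secret index, fresh e(0) everywhere else.
delta = {(i, k): (dQ_enc * enc(0) % n2) if (i, k) == (sB_t, aB_t) else enc(0)
         for i in S_B for k in A_B}

# A's side (eq. 9): multiply its encrypted row, standing in for c(s^A_t, i, a^A_t, k), entrywise.
c_row = {(i, k): enc(0) for i in S_B for k in A_B}
c_row = {ik: c_row[ik] * delta[ik] % n2 for ik in c_row}

assert dec(c_row[(sB_t, aB_t)]) == 2800                   # only the true cell changed
assert all(dec(v) == 0 for ik, v in c_row.items() if ik != (sB_t, aB_t))
```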
- Public input: $L$, $K$, learning rate $\alpha$, discount rate $\gamma$
- $A$'s input: $(s^A_t, a^A_t)$; $B$'s input: $(s^B_t, a^B_t)$
- $A$'s output: encryption of the updated Q-value $c(s_t, a_t)$; $B$'s output: nothing
1. $A$: Initialize $Q(s, a)$ arbitrarily and compute $c(s, a)\ (= e(Q(s, a)))$ for all $(s, a)$.
2. Interaction with the environment:
   - $A$: Take action $a^A_t$ and get $r^A_t$, $s^A_{t+1}$.
   - $B$: Take action $a^B_t$ and get $r^B_t$, $s^B_{t+1}$.
3. Action selection:
   - $A$: Choose $a^A_{t+1}$ randomly.
   - $B$: Choose $a^B_{t+1}$ randomly.
4. Update Q-value:
   (a) $A$: Send $X^A$, $\{c_{ik}\}$, $\{c'_{ik}\}$ to $B$ by eqs. 6, 7.
   (b) $B$: Compute $e(K \Delta Q(s_t, a_t))$ by eq. 5.
   (c) $B$: Do private division of $e(K \Delta Q(s_t, a_t))$ with $A$; then $B$ learns $e(\Delta Q'(s_t, a_t))$.
   (d) $B$: Generate $\{\Delta c_{hijk}\}$ by eq. 8 and send it to $A$.
   (e) $A$: Update $c(s, a)$ with $\{\Delta c_{hijk}\}$ by eq. 9.
Figure 3. Private update of Q-values in the partitioned-by-observation model (SARSA/random action selection).

Lemma 2. If $A$ and $B$ behave semi-honestly, then after the private update of Q-values for SARSA with random action selection in the partitioned-by-observation model, $A$ updates the encrypted Q-values correctly but learns nothing, and $B$ learns nothing.

By iterating private updates, encrypted Q-values trained by SARSA learning are obtained.

5. Private Greedy Action Selection

In this section, we describe private distributed algorithms for greedy action selection, which compute $a^* = \arg\max_a Q(s, a)$ from encrypted Q-values, for both partitioning models. These are used for: (1) (ε-)greedy action selection, (2) the max operation in the updates of Q-learning, and (3) extracting learned policies from the final Q-values. In the partitioned-by-time model, this is readily solved by using private comparison, so it is omitted.

5.1. Private Greedy Action Selection in the Partitioned-by-Observation Model

When $A$ and $B$ observe $s^A_t$ and $s^B_t$, respectively, private greedy action selection requires that (1) $A$ obtains $a^A$ and nothing else, and (2) $B$ obtains $a^B$ and nothing else, where $(a^A, a^B) = \arg\max_{(a^A, a^B)} Q(s^A_t, a^A, s^B_t, a^B)$. The protocol is described in Fig. 4; threshold decryption is used here, too. First, $A$ sends the encrypted Q-values $c(s^A_t, i, j, k)$, with re-randomization, for all $(i, j, k)$. For all $(i, k)$, $B$ generates and sends tables $\{c'_{ik}\}$ and $\{\sigma^B_{ik}\}$ whose values are set to
$c'_{ik} = c(s^A_t, i, s^B_t, \pi(k)) \cdot e(-Q^B_{i\pi(k)}),$   (10)
$\sigma^B_{ik} = d_B(c'_{ik}),$   (11)
where $\pi : A^B \to A^B$ is a random permutation and $Q^B_{i\pi(k)} \in_r \mathbb{Z}_N$. At the third step, $A$ recovers $Q^A_{ik}\ (= Q(s^A_t, i, s^B_t, \pi(k)) - Q^B_{i\pi(k)})$. With these random shares of $Q(s^A_t, i, s^B_t, \pi(k))$, the values $(i^*, k^*) = \arg\max_{(i,k)} (Q^A_{ik} + Q^B_{ik})$ are obtained by $A$ using private comparison. Finally, $B$ learns $a^B = \pi^{-1}(k^*)$, where $\pi^{-1}$ is the inverse of $\pi$.

- $A$'s input: $c(s, a)$ for all $(s, a)$ and $s^A_t$; $B$'s input: $s^B_t$
- $A$'s output: $a^A$; $B$'s output: $a^B$
1. $A$: For all $i \in S^B$, $j \in A^A$, $k \in A^B$, send $c(s^A_t, i, j, k)$ to $B$.
2. $B$: For all $(i, k)$, compute $c'_{ik}$ (eq. 10) and $\sigma^B_{ik}$ (eq. 11), and send $\{c'_{ik}\}$, $\{\sigma^B_{ik}\}$ to $A$.
3. $A$: For all $(i, k)$, compute $\sigma^A_{ik} = d_A(c'_{ik})$. Then compute $Q^A_{ik}$ by applying the threshold decryption recovery algorithm with public key $pk$ and shares $\sigma^A_{ik}$, $\sigma^B_{ik}$.
4. $A$ and $B$: Compute $(i^*, k^*) = \arg\max_{(i,k)} (Q^A_{ik} + Q^B_{ik})$ by private comparison ($A$ learns $(i^*, k^*)$).
5. $A$: Send $k^*$ to $B$. Then output $a^A = i^*$.
6. $B$: Output $a^B = \pi^{-1}(k^*)$.
Figure 4. Private greedy action selection in the partitioned-by-observation model.

Lemma 3. If $A$ and $B$ behave semi-honestly, then after the execution of private greedy action selection, $A$ learns $a^A$ and nothing else, and $B$ learns $a^B$ and nothing else.

Note that $a^B$ is not learned by $A$ because the index $k^*$ is obscured by the random permutation generated by $B$.
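The selection logic can be illustrated as below with the cryptography abstracted away: in the actual protocol, A obtains its shares by threshold-decrypting B's blinded, permuted ciphertexts (eqs. 10-11), and the argmax over share pairs is computed by private comparison. Simulating both parties in one process, as done here, is of course not private; the sketch only shows why A ends up with $(i^*, k^*)$ while only B can map $k^*$ back to its real action. All names and sizes are illustrative.

```python
import random

N = 2**40                                   # stands in for the plaintext modulus
A_ACTIONS, B_ACTIONS = range(2), range(3)
# Joint Q-values at the current joint state (s^A_t, s^B_t); known to neither party alone.
Q = {(i, k): random.randrange(1000) for i in A_ACTIONS for k in B_ACTIONS}

# B's side: a random permutation of its action index and random blinding shares.
pi = list(B_ACTIONS); random.shuffle(pi)                  # permuted index k refers to action pi[k]
Q_B = {(i, k): random.randrange(N) for (i, k) in Q}
# A's side (via threshold decryption in the real protocol): blinded, permuted values.
Q_A = {(i, k): (Q[(i, pi[k])] - Q_B[(i, k)]) % N for (i, k) in Q}

# Private comparison of random shares (abstracted): A learns only (i*, k*).
i_star, k_star = max(Q, key=lambda ik: (Q_A[ik] + Q_B[ik]) % N)
a_A = i_star                     # A's component of the greedy joint action
a_B = pi[k_star]                 # B maps the permuted index back to its real action; A cannot
assert Q[(a_A, a_B)] == max(Q.values())
```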
When a2 is taken at sp (p = 1), the agent moves to sp-1 , but the agent does not move when p = 1. A reward r = 1 is given only when the agent takes a1 at sn-1 ; else, r = 0. The episode is terminated at sn or after 1, 000 steps. A learns 15, 000 steps and then B learns 15, 000 steps. CRL, IDRL, PPRL, and SFE were compared. SARSA learning with random or -greedy action selection was used for all settings. Table 2 shows the comparison results of computational cost, learning accuracy (number of steps to reach the goal state, averaged over 30 trials, and number of trials that successfully reach the goal state), and privacy preservation. Learning accuracy of PPRL and SFE are the same as CRL because the policy learned by PPRL and SFE are guaranteed to be equal to the one learned by CRL. In contrast, the optimal policy is not obtained successfully by IDRL because learning steps for IDRL agents correspond to the half of others. Because most of the computation time is spent for private division and comparison, computation time with random selection is much smaller than with -greedy selection. These experiments demonstrate that PPRL obtains go o d learning accuracy, while IDRL do es not, though computation time is larger than DRL and IDRL. 6.2. Load Balancing Task In these experiments, we consider a load balancing problem (Cogill et al., 2006) in the partitioned-byobservation mo del with two factories A and B . Each factory can observe its own backlog sA , sB {0, ..., 5}. At each time step, each factory decides whether or not to pass a job to other factories; the action variable is aA , aB {0, 1}. Jobs arrive and are processed independently at each time step with probability 0.4 and 0.48, respectively. Agent A receives reward rA = 50 - (sA )2 . If A passes the job to B , then A's reward is reduced by 2 as a cost for redirection. If an overflow happens, the job is lost and rA = 0 is given. Similarly, rB is computed as well. Perceptions (sA , aA , rA ) and (sB , aB , rB ) are to be kept private. (In this task, actions cannot be kept private because the parties learn them from whether the job was passed or not.) Distributed reward DRL (RDRL) (Schneider et al., 1999) is tested in addition to the four RLs tested earlier. RDRL is a variant of DRL, which is the same with IDRL except that global rewards are shared among distributed agents (Schneider et al., 1999). SARSA/ -greedy action selection was used in all settings. Fig 5 shows the changes of sum of global rewards per episo de. For avoiding overflows, co operation be- Theorem 1. SARSA learning with private update of Q-values and random action selection is secure in the sense of Statement 1. Privacy-preserving SARSA learning and Q-learning with ( -)greedy action selection can be constructed by combining private update and private greedy random action selection. However, these PPRLs do not follow Statement 1 because it do es not allow agents to know greedy actions obtained in the middle of the learning. Therefore, the problem definition is relaxed as follows: Statement 2. The ith agent takes H i as inputs. After the execution of PPRL, al l agents learn a series of greedy actions during learning steps and a policy which is equivalent to c . Furthermore, no agent learns anything else. Theorem 2. SARSA and Q-learning with private update of Q-values and private greedy/ -greedy action selection is secure in the sense of Statement 2. 6. Experimental Results We performed experiments to examine the efficiency of PPRL. Programs were written in Java 1.5.0. 
As the cryptosystem, (D°mgard & Jurik, 2001) with 1024-bit a keys was used. For SFE, Fairplay (Malkhi et al., 2004) was used. Experiments were carried out under Linux with 1.2 GHz CPU and 2GB RAM. 6.1. Random Walk Task This random walk task is partitioned by time. The state space is S = {s1 , ..., sn }(n = 40) and the action space is A = {a1 , a2 }. The initial and goal states are s1 and sn , respectively. When a1 is taken at sp (p = n), Privacy-Preserving Reinforcement Learning Table 2. Comparison of efficiency in random walk tasks CRL/rnd. IDRL/rnd. PPRL/ rn d . SFE/rnd. CRL/ -grd. IDRL/ -grd. PPRL/ -grd. SFE/ -grd. comp. (sec) 0.901 0.457 4.71 × 103 > 7.0 × 106 0.946 0.481 3.36 × 104 > 7.0 × 106 accuracy avg. #goal 40.0 30/30 247 8/30 40.0 30/30 40.0 30/30 40.0 30/30 -- 0/30 40.0 30/30 40.0 30/30 privacy loss disclosed all Stmt. 1 Stmt. 1 Stmt. 1 disclosed all Stmt. 2 Stmt. 2 Stmt. 2 References Ab e, N., Verma, N., Apte, C., & Schroko, R. (2004). Cross channel optimized marketing by reinforcement learning. ACM SIGKDD Int'l Conf. on Know ledge Discovery and Data Mining (pp. 767­772). Cogill, R., Rotkowitz, M., Van Roy, B., & Lall, S. (2006). An Approximate Dynamic Programming Approach to Decentralized Control of Stochastic Systems. LNCIS, 329, 243­256. D°mgard, I., & Jurik, M. (2001). A Generalisation, a a Simplification and Some Applications of Paillier's Probabilistic Public-Key System. Public Key Cryptography 2001. Springer. Gambs, S., K´gl, B., & A¨meur, E. (2007). Privacye i preserving b oosting. Data Mining and Know ledge Discovery, 14, 131­170. Goldreich, O. (2004). Foundations of Cryptography: Volume 2, Basic Applications. Cambridge University Press. Jagannathan, G., & Wright, R. N. (2005). Privacypreserving distributed k-means clustering over arbitrarily partitioned data. ACM SIGKDD Int'l Conf. on Know ledge Discovery and Data Mining (pp. 593­599). Kearns, M., Tan, J., & Wortman, J. (2007). PrivacyPreserving Belief Propagation and Sampling. NIPS 20. Table 3. Comparison of efficiency in load balancing tasks. CRL RDRL IDRL PPRL SFE comp. (sec) 5.11 5.24 5.81 8.85 ×105 > 2.0 × 107 accuracy 90.0 87.4 84.2 90.0 90.0 privacy loss disclosed all partially disclosed Stmt. 1 Stmt. 2 Stmt. 2 SARSA/epsilon-greedy, load balancing 92 90 rewards 88 86 84 82 80 0 10 20 30 40 50 60 episodes 70 80 90 100 CRL/PPRL/SFE RDRL IDRL Lindell, Y., & Pinkas, B. (2002). Privacy Preserving Data Mining. Journal of Cryptology, 15, 177­206. Malkhi, D., Nisan, N., Pinkas, B., & Sella, Y. (2004). Fairplay: a secure two-party computation system. USENIX Security Symposium (pp. 287­302). Moallemi, C. C., & Roy, B. V. (2004). Distributed optimization in adaptive networks. NIPS 16. Sakuma, J., & Kobayashi, S. (2008). Large-scale kmeans Clustering with User-Centric Privacy Preservation. Pacific-Asia Conf. on Know ledge Discovery and Data Mining (PAKDD) 2008, to appear. Schneider, J., Wong, W., Moore, A., & Riedmiller, M. (1999). Distributed value functions. Int'l Conf. on Machine Learning (pp. 371­378). Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press. Watkins, C. (1989). Learning from Delayed Rewards. Cambridge University. Yao, A. C.-C. (1986). How to generate and exchange secrets. IEEE Symposium on Foundations of Computer Science (pp. 162­167). Yu, H., Jiang, X., & Vaidya, J. (2006). Privacy-preserving SVM using nonlinear kernels on horizontally partitioned data. ACM SAC (pp. 603­610). Zhang, S., & Makedon, F. (2005). Privacy preserving learning in negotiation. ACM SAC (pp. 
821­825). Figure 5. Performance evaluation (sum of global rewards in an episode, normalized by the numb er of steps in an episode) in load balancing tasks (average of 100 trials). tween agents is essential in this task. The performance of IDRL agents is inferior to others because selfish behavior is learned. In contrast, CRL, PPRL and SFE agents successfully obtain co operative behavior. The performance of RDRL is intermediate because perceptions of RDRL agents are limited. Efficiency is shown in Table 3. Since -greedy action selection was used, the privacy of IDRL, PPRL and SFE follow Statement 2. The privacy preservation of RDRL is between CRL and PPRL. As discussed in Section 1, PPRL achieves both the guarantee of privacy preservation and the optimality which is equivalent to that of CRL; SFE does the same, but at a much higher computational time. Acknowledgments This work was started at Tokyo Institute of Technology and carried out partly while the first author was a visitor of the DIMACS Center. The third author is partially supported by the National Science Foundation under grant number CNS-0822269.