Strategy Evaluation in Extensive Games with Importance Sampling

Michael Bowling (bowling@cs.ualberta.ca)
Michael Johanson (johanson@cs.ualberta.ca)
Neil Burch (burch@cs.ualberta.ca)
Duane Szafron (duane@cs.ualberta.ca)
Department of Computing Science, University of Alberta, Edmonton, Alberta, T6G 2E8 Canada

Abstract

Typically agent evaluation is done through Monte Carlo estimation. However, stochastic agent decisions and stochastic outcomes can make this approach inefficient, requiring many samples for an accurate estimate. We present a new technique that can be used to simultaneously evaluate many strategies while playing a single strategy in the context of an extensive game. This technique is based on importance sampling, but utilizes two new mechanisms for significantly reducing variance in the estimates. We demonstrate its effectiveness in the domain of poker, where stochasticity makes traditional evaluation problematic.

1. Introduction

Evaluating an agent's performance is a component of nearly all research on sequential decision making. Typically, the agent's expected payoff is estimated through Monte Carlo samples of the (often stochastic) agent acting in an (often stochastic) environment. The degree of stochasticity in the environment or agent behavior determines how many samples are needed for an accurate estimate of performance. For results in synthetic domains with artificial agents, one can simply continue drawing samples until the estimate is accurate enough. For non-synthetic environments, domains that involve human participants, or when evaluation is part of an on-line algorithm, accurate estimates with a small number of samples are critical. This paper describes a new technique for tackling this problem in the context of extensive games.

An extensive game is a formal model of a sequential interaction between multiple, independent agents with imperfect information. It is a powerful yet compact framework for describing many strategic interactions between decision-makers, artificial and human.¹ Poker, for example, is a domain modeled very naturally as an extensive game. It involves independent and self-interested agents making sequential decisions based on both public and private information in a stochastic environment. Poker also demonstrates the challenge of evaluating agent performance. In one typical variant of poker, approximately 30,000 hands (or samples of playing the game) are sometimes needed to distinguish between professional and amateur levels of play. Matches between computer and human opponents typically involve far fewer hands, yet still need to draw similar statistical conclusions.

In this work, we present a new technique for deriving low-variance estimators of agent performance in extensive games. We employ importance sampling while exploiting the fact that the strategy of the agent being evaluated is typically known. However, we reduce the variance that importance sampling normally incurs by selectively adding synthetic data that is derived from but consistent with the sample data. As a result we derive low-variance unbiased estimators for agent performance given samples of the outcome of the game. We further show that we can efficiently evaluate one strategy while only observing samples from another. Finally, we examine the important case where we only get partial information of the game outcome (e.g., if a player folds in poker, their private cards are not revealed during the match and so the sequence of game states is not fully known). All of our estimators are then evaluated empirically in the domain of poker in both full and partial information scenarios.

This paper is organized as follows. In Section 2 we introduce the extensive game model, formalize our problem, and describe previous work on variance reduction in agent evaluation. In Section 3 we present a general procedure for deriving unbiased estimators and give four examples of these estimators. We then briefly introduce the domain of poker in Section 4 and describe how these estimators can be applied to this domain. In Section 5 we show empirical results of our approach in poker. Finally, we conclude in Section 6 with some directions for future work.

¹ In this work we use the words "agent", "player", and "decision-maker" interchangeably and, unless explicitly stated, aren't concerned if they are humans or computers.
2. Background

We begin by describing extensive games and then we formalize the agent evaluation problem.

2.1. Extensive Games

Definition 1 (Osborne & Rubinstein, 1994, p. 200) A finite extensive game with imperfect information has the following components:

· A finite set $N$ of players.

· A finite set $H$ of sequences, the possible histories of actions, such that the empty sequence is in $H$ and every prefix of a sequence in $H$ is also in $H$. $Z \subseteq H$ are the terminal histories (those which are not a prefix of any other sequences). $A(h) = \{a : (h, a) \in H\}$ are the actions available after a non-terminal history $h \in H$.

· A player function $P$ that assigns to each non-terminal history (each member of $H \setminus Z$) a member of $N \cup \{c\}$, where $c$ represents chance. $P(h)$ is the player who takes an action after the history $h$. If $P(h) = c$, then chance determines the action taken after history $h$.

· A function $f_c$ that associates with every history $h$ for which $P(h) = c$ a probability measure $f_c(\cdot \mid h)$ on $A(h)$ ($f_c(a \mid h)$ is the probability that $a$ occurs given $h$), where each such probability measure is independent of every other such measure.

· For each player $i \in N$ a partition $\mathcal{I}_i$ of $\{h \in H : P(h) = i\}$ with the property that $A(h) = A(h')$ whenever $h$ and $h'$ are in the same member of the partition. $\mathcal{I}_i$ is the information partition of player $i$; a set $I_i \in \mathcal{I}_i$ is an information set of player $i$.

· For each player $i \in N$ a utility function $u_i$ from the terminal states $Z$ to the reals $\mathbb{R}$. If $N = \{1, 2\}$ and $u_1 = -u_2$, it is a zero-sum extensive game.

A strategy of player $i$, $\sigma_i$, in an extensive game is a function that assigns a distribution over $A(I_i)$ to each $I_i \in \mathcal{I}_i$. A strategy profile $\sigma$ consists of a strategy for each player, $\sigma_1, \sigma_2, \ldots$, with $\sigma_{-i}$ referring to all the strategies in $\sigma$ except $\sigma_i$.

Let $\pi^\sigma(h)$ be the probability of history $h$ occurring if players choose actions according to $\sigma$. We can decompose $\pi^\sigma(h) = \prod_{i \in N \cup \{c\}} \pi_i^\sigma(h)$ into each player's contribution to this probability. Hence, $\pi_i^\sigma(h)$ is the probability that, if player $i$ plays according to $\sigma$, then for all histories $h'$ that are a proper prefix of $h$ with $P(h') = i$, player $i$ takes the subsequent action in $h$. Let $\pi_{-i}^\sigma(h)$ be the product of all players' contributions (including chance) except that of player $i$.
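To make the decomposition of $\pi^\sigma(h)$ concrete, the following sketch (hypothetical Python, not from the paper; the history and strategy representations are assumptions) stores a history as a list of (actor, action) pairs and computes one player's contribution $\pi_i^\sigma(h)$ and the complementary product $\pi_{-i}^\sigma(h)$:

```python
from typing import Callable, Dict, List, Tuple

# A history is a sequence of (actor, action) pairs; actor "c" denotes chance.
History = List[Tuple[str, str]]
# A strategy maps (prefix of the history, action taken) to that action's probability.
# A real implementation would map the prefix to the acting player's information set.
ActionProb = Callable[[History, str], float]

def contribution(history: History, player: str,
                 prob: Dict[str, ActionProb]) -> float:
    """pi_i^sigma(h): product of `player`'s action probabilities along h."""
    p = 1.0
    for k, (actor, action) in enumerate(history):
        if actor == player:
            p *= prob[actor](history[:k], action)
    return p

def contribution_excluding(history: History, player: str,
                           prob: Dict[str, ActionProb]) -> float:
    """pi_{-i}^sigma(h): product of all other players' (and chance's) contributions."""
    p = 1.0
    for k, (actor, action) in enumerate(history):
        if actor != player:
            p *= prob[actor](history[:k], action)
    return p

# pi^sigma(h) then factors as contribution(h, i, prob) * contribution_excluding(h, i, prob).
```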
The overall value to player $i$ of a strategy profile is then the expected payoff of the resulting terminal node, i.e., $u_i(\sigma) = \sum_{z \in Z} u_i(z)\,\pi^\sigma(z)$. For $Y \subseteq Z$, a subset of possible terminal histories, define $\pi^\sigma(Y) = \sum_{z \in Y} \pi^\sigma(z)$ to be the probability of reaching any outcome in the set $Y$ given $\sigma$, with $\pi_i^\sigma(Y)$ and $\pi_{-i}^\sigma(Y)$ defined similarly.

2.2. The Problem

Given some function on terminal histories $V : Z \to \mathbb{R}$, we want to estimate $E_{z|\sigma}[V(z)]$. In most cases $V$ is simply $u_i$, and the goal is to evaluate a particular player's expected payoff. We explore three different settings for this problem. In all three settings, we assume that $\sigma_i$ (our player's strategy) is known, while $\sigma_{j \ne i}$ (the other players' strategies) are not known.

· On-policy full-information. In the simplest case, we get samples $z_{1 \ldots t} \in Z$ from the distribution $\pi^\sigma$.

· Off-policy full-information. In this case, we get samples $z_{1 \ldots t} \in Z$ from the distribution $\pi^{\hat\sigma}$, where $\hat\sigma$ differs from $\sigma$ only in player $i$'s strategy: $\hat\sigma_{-i} = \sigma_{-i}$. In this case we want to evaluate one strategy for player $i$ from samples of playing a different one.

· Off-policy partial-information. In the hardest case, we don't get full samples of outcomes $z_t$, but rather just player $i$'s view of the outcomes. For example, in poker, if a player folds, their cards are not revealed to the other players and so certain chance actions are not known. Formally, in this case we get samples of $K(z_t) \in \mathcal{K}$, where $K$ is a many-to-one mapping and $z_t$ comes from the distribution $\pi^{\hat\sigma}$ as above. Intuitively, $K$ must satisfy the following conditions: for $z, z' \in Z$, if $K(z) = K(z')$ then $V(z) = V(z')$ and $\pi_i(z) = \pi_i(z')$.

2.3. Monte Carlo Estimation

The typical approach to estimating $E_{z|\sigma}[V(z)]$ is through simple Monte Carlo estimation. Given independent samples $z_1, \ldots, z_t$ from the distribution $\pi^\sigma$, simply estimate the expectation as the sample mean of outcome values,

$\frac{1}{t}\sum_{i=1}^{t} V(z_i)$.    (1)

As the estimator has zero bias, the mean squared error of the estimator is determined by its variance. If the variance of $V(z)$ given $\sigma$ is large, the error in the estimate can be large and many samples are needed for accurate estimation.

Recently, we proposed a new technique for agent evaluation in extensive games (Zinkevich et al., 2006). We showed that value functions over non-terminal histories could be used to derive alternative unbiased estimators. If the chosen value function was close to the true expected value given the partial history and players' strategies, then the estimator would result in a reduction in variance. The approach essentially derives a real-valued function $\tilde V(z)$ that is used in place of $V$ in the Monte Carlo estimator from Equation 1. The expectation of $\tilde V(z)$ matches the expectation of $V(z)$ for any choice of $\sigma$, and so the result is an unbiased estimator, but potentially with lower variance and thus lower mean-squared error. The specific application of this approach to poker, using an expert-defined value function, was named the DIVAT estimator and was shown to result in a dramatic reduction in variance. A simpler choice of value function, the expected value assuming the betting is "bet-call" for all remaining betting rounds, can even make a notable reduction. We refer to this conceptually and computationally simpler estimator as (Bet-Call) BC-DIVAT.

Both traditional Monte Carlo estimation and DIVAT are focused on the on-policy case, requiring outcomes sampled from the joint strategy that is being evaluated. Furthermore, DIVAT is restricted to full-information, where the exact outcome is known. Although limited in these regards, neither requires any knowledge about any of the players' strategies.
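As a simple illustration of Equation 1 and of the value-function substitution behind DIVAT, the sketch below (hypothetical Python; the actual DIVAT value function is defined in Zinkevich et al., 2006, and is not reproduced here) averages $V(z)$ over sampled outcomes, and the same routine can be reused with a surrogate $\tilde V$ in place of $V$:

```python
from typing import Callable, Iterable, TypeVar

Z = TypeVar("Z")  # a terminal history / game outcome

def monte_carlo_estimate(outcomes: Iterable[Z], value: Callable[[Z], float]) -> float:
    """Equation 1: the sample mean of V(z) over observed outcomes z_1..z_t."""
    total, count = 0.0, 0
    for z in outcomes:
        total += value(z)
        count += 1
    return total / count

# A DIVAT-style estimator keeps exactly this form but passes a lower-variance
# surrogate value=v_tilde whose expectation matches V's for any opponent strategy,
# e.g. monte_carlo_estimate(outcomes, value=v_tilde).
```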
3. General Approach

We now describe our new approach for deriving low-variance, unbiased estimators for agent evaluation. In this section we almost exclusively focus on the off-policy full-information case. Within this setting we observe a sampled outcome $z$ from the distribution $\pi^{\hat\sigma}$, and the goal is to estimate $E_{z|\sigma}[V(z)]$. The outcomes are observed based on the strategy $\hat\sigma$ while we want to evaluate the expectation over $\sigma$, where they differ only in player $i$'s strategy. This case subsumes the on-policy case, and we touch on the more difficult partial-information case at the end of this section. In order to handle this more challenging case, we require full knowledge of player $i$'s strategies, both the strategy being observed $\hat\sigma_i$ and the one being evaluated $\sigma_i$.

At the core of our technique is the idea that synthetic histories derived from the sampled history can also be used in the estimator. For example, consider the unlikely case when $\sigma$ is known entirely. Given an observed outcome $z \in Z$ (or even without an observed outcome) we can exactly compute the desired expectation by examining every outcome,

$V_Z(z) \equiv \sum_{z' \in Z} V(z')\,\pi^\sigma(z') = E_{z|\sigma}[V(z)]$.    (2)

Although impractical since we don't know $\sigma$, $V_Z(z)$ is an unbiased and zero-variance estimator.

Instead of using every terminal history, we could restrict ourselves to a smaller set of terminal histories. Let $U(z \in Z) \subseteq Z$ be a mapping of terminal histories to a set of terminal histories, where at least $z \in U(z)$. We can construct an unbiased estimator that considers the history $z'$ in the estimation whenever we observe a history from the set $U(z')$. Another way to consider things is to say that $U^{-1}(z)$ is the set of synthetic histories considered when we observe $z$. Specifically, we define the estimator $V_U(z)$ for the observed outcome $z$ as,

$V_U(z) \equiv \sum_{z' \in U^{-1}(z)} V(z')\,\frac{\pi^\sigma(z')}{\pi^{\hat\sigma}(U(z'))}$.    (3)

The estimator considers the value of every outcome $z'$ where the observed history $z$ is in the set $U(z')$. Each outcome, though, is weighted in a fashion akin to importance sampling. The weight term for $z'$ is proportional to the probability of that history given $\sigma$, and inversely proportional to the probability that $z'$ is one of the considered synthetic histories when observing sampled outcomes from $\hat\sigma$. Note that $V_U(z)$ is not an estimate of $V(z)$, but rather has the same expectation. At first glance, $V_U$ may seem just as impractical as $V_Z$ since $\sigma$ is not known. However, with a careful choice of $U$ we can ensure that the weight term depends only on the known strategies $\sigma_i$ and $\hat\sigma_i$.
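The following sketch (hypothetical Python; the function names and the `u_inverse`/`weight` callables are assumptions, not the authors' code) shows the shape of the estimator in Equation 3: every synthetic history $z'$ with $z \in U(z')$ contributes its value, weighted by $\pi^\sigma(z') / \pi^{\hat\sigma}(U(z'))$. For the choices of $U$ discussed next, this weight reduces to a ratio that depends only on player $i$'s known strategies.

```python
from typing import Callable, Iterable, TypeVar

Z = TypeVar("Z")  # a terminal history

def v_u_estimate(z_observed: Z,
                 u_inverse: Callable[[Z], Iterable[Z]],
                 value: Callable[[Z], float],
                 weight: Callable[[Z, Z], float]) -> float:
    """Equation 3: sum over z' in U^{-1}(z) of V(z') * pi^sigma(z') / pi^sigma_hat(U(z')).

    u_inverse(z) enumerates the synthetic histories z' with z in U(z');
    weight(z_prime, z) supplies the ratio pi^sigma(z') / pi^sigma_hat(U(z')),
    which for the U mappings below is computable from player i's strategies alone.
    """
    return sum(value(zp) * weight(zp, z_observed) for zp in u_inverse(z_observed))
```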
Before presenting example choices of $U$, we first prove that $V_U$ is unbiased.

Theorem 1 If $\pi_i^{\hat\sigma}(z)$ is non-zero for all outcomes $z \in Z$, then

$E_{z|\hat\sigma}[V_U(z)] = E_{z|\sigma}[V(z)]$,

i.e., $V_U$ is an unbiased estimator.

Proof: First, let us consider the denominator in the weight term of $V_U$. Since $z' \in U(z')$ and $\pi_i^{\hat\sigma}$ is always positive, the denominator can only be zero if $\pi_{-i}^{\hat\sigma}(z')$ is zero. If this were true, $\pi_{-i}^{\sigma}(z')$ must also be zero, and as a consequence so must the numerator. As a result the terminal history $z'$ is never reached and so it is correct to simply exclude such histories from the estimator's summation. Define $\mathbf{1}(x)$ to be the indicator function that takes on the value 1 if $x$ is true and 0 if false. Then,

$E_{z|\hat\sigma}[V_U(z)]$
$\quad = E_{z|\hat\sigma}\Big[\sum_{z' \in U^{-1}(z)} V(z')\,\frac{\pi^\sigma(z')}{\pi^{\hat\sigma}(U(z'))}\Big]$    (4)
$\quad = E_{z|\hat\sigma}\Big[\sum_{z' \in Z} \mathbf{1}(z \in U(z'))\,V(z')\,\frac{\pi^\sigma(z')}{\pi^{\hat\sigma}(U(z'))}\Big]$    (5)
$\quad = \sum_{z' \in Z} V(z')\,\frac{\pi^\sigma(z')}{\pi^{\hat\sigma}(U(z'))}\,E_{z|\hat\sigma}[\mathbf{1}(z \in U(z'))]$    (6)
$\quad = \sum_{z' \in Z} V(z')\,\frac{\pi^\sigma(z')}{\pi^{\hat\sigma}(U(z'))}\,\pi^{\hat\sigma}(U(z'))$    (7)
$\quad = \sum_{z' \in Z} V(z')\,\pi^\sigma(z') = E_{z|\sigma}[V(z)]$    (8)

The derivation follows from the linearity of expectation, the definition of $\pi^{\hat\sigma}$, and the definition of expectation.

We now look at four specific choices of $U$ for which the weight term can be computed while only knowing player $i$'s portion of the joint strategy $\sigma$.

Example 1: Basic Importance Sampling. The simplest choice of $U$ for which $V_U$ can be computed is $U(z) = \{z\}$. In other words, the estimator considers just the sampled history. In this case the weight term is:

$\frac{\pi^\sigma(z)}{\pi^{\hat\sigma}(U(z))} = \frac{\pi^\sigma(z)}{\pi^{\hat\sigma}(z)}$    (9)
$\quad = \frac{\pi_{-i}^\sigma(z)\,\pi_i^\sigma(z)}{\pi_{-i}^{\hat\sigma}(z)\,\pi_i^{\hat\sigma}(z)}$    (10)
$\quad = \frac{\pi_i^\sigma(z)}{\pi_i^{\hat\sigma}(z)}$    (11)

The weight term only depends on $\sigma_i$ and $\hat\sigma_i$ and so is a known quantity. When $\sigma_i = \hat\sigma_i$ the weight term is 1 and the result is simple Monte Carlo estimation. When $\sigma_i$ is different, the estimator is a straightforward application of importance sampling.

Example 2: Game Ending Actions. A more interesting example is to consider all histories that differ from the sample history by only a single action by player $i$, where that action must be the last action in the history. For example, in poker, the history where the player being evaluated chooses to fold at an earlier point in the betting sequence is considered in this estimator. Formally, define $S_{-i}(z) \in H$ to be the shortest prefix of $z$ where the remaining actions in $z$ are all made by player $i$ or chance. Let $U(z) = \{z' \in Z : S_{-i}(z) \text{ is a prefix of } z'\}$. The weight term becomes:

$\frac{\pi^\sigma(z)}{\pi^{\hat\sigma}(U(z))} = \frac{\pi^\sigma(z)}{\pi^{\hat\sigma}(S_{-i}(z))}$    (12)
$\quad = \frac{\pi_{-i}^\sigma(z)\,\pi_i^\sigma(z)}{\pi_{-i}^{\hat\sigma}(S_{-i}(z))\,\pi_i^{\hat\sigma}(S_{-i}(z))}$    (13)
$\quad = \frac{\pi_{-i}^\sigma(S_{-i}(z))\,\pi_i^\sigma(z)}{\pi_{-i}^{\hat\sigma}(S_{-i}(z))\,\pi_i^{\hat\sigma}(S_{-i}(z))}$    (14)
$\quad = \frac{\pi_i^\sigma(z)}{\pi_i^{\hat\sigma}(S_{-i}(z))}$    (15)

As this only depends on the strategies of player $i$, we can compute this quantity and therefore the estimator.

Example 3: Private Information. We can also use all histories in the update that differ only in player $i$'s private information. In other words, any history that the other players wouldn't be able to distinguish from the sampled history is considered. For example, in poker, any history where player $i$ received different private cards is considered in the estimator, since the opponents' strategy cannot depend directly on what is strictly private information. Formally, let $U(z) = \{z' \in Z : \pi_{-i}(z') = \pi_{-i}(z)\}$. The weight term then becomes:

$\frac{\pi^\sigma(z)}{\pi^{\hat\sigma}(U(z))} = \frac{\pi^\sigma(z)}{\sum_{z' \in U(z)} \pi^{\hat\sigma}(z')}$    (16)
$\quad = \frac{\pi_{-i}^\sigma(z)\,\pi_i^\sigma(z)}{\sum_{z' \in U(z)} \pi_{-i}^{\hat\sigma}(z')\,\pi_i^{\hat\sigma}(z')}$    (17)
$\quad = \frac{\pi_{-i}^\sigma(z)\,\pi_i^\sigma(z)}{\sum_{z' \in U(z)} \pi_{-i}^{\hat\sigma}(z)\,\pi_i^{\hat\sigma}(z')}$    (18)
$\quad = \frac{\pi_i^\sigma(z)}{\sum_{z' \in U(z)} \pi_i^{\hat\sigma}(z')}$    (19)
$\quad = \frac{\pi_i^\sigma(z)}{\pi_i^{\hat\sigma}(U(z))}$    (20)

As this only depends on the strategies of player $i$, we can again compute this quantity and therefore the estimator as well.
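Before turning to the combined choice of $U$ below, here is a sketch of the three weight terms derived so far (hypothetical Python; the probability arguments are assumed to be supplied by an implementation of player $i$'s strategies, as in the earlier sketches):

```python
from typing import Iterable

def basic_weight(pi_i_sampled: float, pi_i_hat_sampled: float) -> float:
    """Example 1 (Equation 11): pi_i^sigma(z) / pi_i^sigma_hat(z) for the sampled z."""
    return pi_i_sampled / pi_i_hat_sampled

def game_ending_weight(pi_i_synthetic: float, pi_i_hat_prefix: float) -> float:
    """Example 2 (Equation 15): for a synthetic history z', the ratio of player i's
    probability of z' under sigma to the observing strategy's probability of reaching
    the shortest prefix S_{-i}(z')."""
    return pi_i_synthetic / pi_i_hat_prefix

def private_info_weight(pi_i_synthetic: float,
                        pi_i_hat_over_U: Iterable[float]) -> float:
    """Example 3 (Equation 20): for a synthetic history z', player i's probability of
    z' under sigma, divided by the sum of player i's probabilities under sigma_hat
    over every history in U(z) (every alternative private-information assignment)."""
    return pi_i_synthetic / sum(pi_i_hat_over_U)
```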
Example 4: Combined. The past two examples show that we can consider histories that differ in the player's private information or by the player making an alternative game ending action. We can also combine these two ideas and consider any history that differs by both an alternative game ending action and the player's private information. Define

$Q(z) = \{h \in H : |h| = |S_{-i}(z)| \text{ and } \pi_{-i}(h) = \pi_{-i}(S_{-i}(z))\}$

and let $U(z) = \{z' \in Z : \text{a prefix of } z' \text{ is in } Q(z)\}$. The weight term becomes:

$\frac{\pi^\sigma(z)}{\pi^{\hat\sigma}(U(z))} = \frac{\pi^\sigma(z)}{\pi^{\hat\sigma}(Q(z))}$    (21)
$\quad = \frac{\pi_{-i}^\sigma(z)\,\pi_i^\sigma(z)}{\sum_{h \in Q(z)} \pi_{-i}^{\hat\sigma}(h)\,\pi_i^{\hat\sigma}(h)}$    (22)
$\quad = \frac{\pi_{-i}^\sigma(z)\,\pi_i^\sigma(z)}{\sum_{h \in Q(z)} \pi_{-i}^{\hat\sigma}(S_{-i}(z))\,\pi_i^{\hat\sigma}(h)}$    (23)
$\quad = \frac{\pi_{-i}^\sigma(S_{-i}(z))\,\pi_i^\sigma(z)}{\pi_{-i}^{\hat\sigma}(S_{-i}(z))\,\sum_{h \in Q(z)} \pi_i^{\hat\sigma}(h)}$    (24)
$\quad = \frac{\pi_i^\sigma(z)}{\pi_i^{\hat\sigma}(Q(z))}$    (25)

Once again this quantity only depends on the strategies of player $i$ and so we can compute this estimator as well.

We have presented four different estimators that try to extract additional information from a single observed game outcome. We can actually combine any of these estimators with other unbiased approaches for reducing variance. This can be done by replacing the $V$ function in the above estimators with any unbiased estimate of $V$. In particular, these estimators can be combined with our previous DIVAT approach by choosing $V$ to be the DIVAT (or BC-DIVAT) estimator instead of $u_i$.
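As a small illustration of this combination (hypothetical Python; `bc_divat_value` is a stand-in for the BC-DIVAT estimate, which this paper does not define), the only change to the earlier estimator sketch is which value function is passed in:

```python
from typing import Callable, Iterable, TypeVar

Z = TypeVar("Z")

def combined_estimate(z_observed: Z,
                      u_inverse: Callable[[Z], Iterable[Z]],
                      weight: Callable[[Z, Z], float],
                      bc_divat_value: Callable[[Z], float]) -> float:
    """Equation 3 with V replaced by another unbiased estimate of V (e.g. a DIVAT or
    BC-DIVAT value instead of u_i). The expectation is unchanged, so the combined
    estimator remains unbiased while typically having lower variance."""
    return sum(bc_divat_value(zp) * weight(zp, z_observed)
               for zp in u_inverse(z_observed))
```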
3.1. Partial Information

The estimators above are provably unbiased for both the on-policy and off-policy full-information cases. We now briefly discuss the off-policy partial-information case. In this case we don't directly observe the actual terminal history $z_t$ but only a many-to-one mapping $K(z_t)$ of the history. One simple adaptation of our estimators to this case is to use the history $z'$ in the estimator whenever it is possible that the unknown terminal history could be in $U(z')$, while keeping the weight term unchanged. Although we lose the unbiased guarantee with these estimators, it is possible that the reduction in variance is more substantial than the error caused by the bias. We investigate empirically the magnitude of the bias and the resulting mean-squared error of such estimators in the domain of poker in Section 5.

4. Application to Poker

To analyze the effectiveness of these estimators, we will use the popular game of Texas Hold'em poker, as played in the AAAI Computer Poker Competition (Zinkevich & Littman, 2006). The game is two-player and zero-sum. Private cards are dealt to the players, and over four rounds, public cards are revealed. During each round, the players place bets that the combination of their public and private cards will be the strongest at the end of the game. The game has just under $10^{18}$ game states, has the properties of imperfect information and stochastic outcomes, and observations of the game outcome during a match exhibit partial information.

Each of the situations described in Section 2, on-policy and off-policy as well as full-information and partial-information, has relevance in the domain of poker. In particular, the on-policy full-information case is the situation where one is trying to evaluate a strategy from full-information descriptions of the hands, as might be available after a match is complete. For example, this could be used to more accurately determine the winner of a competition involving a small number of hands (which is always the case when humans are involved). In this situation it is critical that the estimator is unbiased, i.e., it is an accurate reflection of the expected winnings and therefore does not incorrectly favor any playing style. The off-policy full-information case is useful for examining past games against an opponent to determine which of many alternative strategies one might want to use against them in the future. The introduction of bias (depending on the strategy used when playing the past hands) is not problematic, as the goal in this case is an estimate with as little error as possible. Hence the introduction of bias is acceptable in exchange for significant decreases in variance. Finally, the off-policy partial-information case corresponds to evaluating alternative strategies during an actual match. In this case, we want to evaluate a set of strategies, which aren't being played, to try and identify an effective choice for the current opponent. The player could then choose a strategy whose performance is estimated to be strong even for hands it wasn't playing.

The estimators from the previous section all have natural applications to the game of poker:

· Basic Importance Sampling. This is a straightforward application of importance sampling. The value of the observed outcome of the hand is weighted by the ratio of the probability that the strategy being evaluated ($\sigma_i$) takes the same sequence of actions to the probability that the playing strategy ($\hat\sigma_i$) takes the sequence of actions.

· Game ending actions. By selecting the fold betting action, a player surrenders the game in order to avoid matching an opponent's bet. Therefore, the game ending actions estimator can consider all histories in which the player could have folded during the observed history.² We call this the Early Folds (EF) estimator. The estimator sums over all possible prefixes of the betting sequence where the player could have chosen to fold. In the summation it weights the value of surrendering the pot at that point by the ratio of the probability of the observed betting up to that point and then folding given the player's cards (and $\sigma_i$) to the probability of the observed betting up to that point given the player's cards (and $\hat\sigma_i$).

· Private information. In Texas Hold'em, a player's private information is simply the two private cards they are dealt. Therefore, the private information estimator can consider all histories with the same betting sequence in which the player holds different private cards. We call this the All Cards (AC) estimator. The estimator sums over all possible two-card combinations (excepting those involving exposed board or opponent cards). In the summation it weights the value of the observed betting with the imagined cards by the ratio of the probability of the observed betting given those cards (and $\sigma_i$) to the probability of the observed betting (given $\hat\sigma_i$) summed over all cards.

² In the full-information setting we can also consider situations where the player could have called on the final round of betting to end the hand.
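A minimal sketch of the All Cards idea (hypothetical Python; the card representation and the `betting_prob_eval`, `betting_prob_play`, and `hand_value` callables are assumptions, not the competition agents' code): enumerate the hole-card pairs player $i$ could have held, and weight each imagined outcome by the probability of the observed betting given those cards under $\sigma_i$, normalized by the total probability of that betting under $\hat\sigma_i$ summed over all cards.

```python
from itertools import combinations
from typing import Callable, Iterable, Sequence, Tuple

Card = str                      # e.g. "As", "Td"
Hole = Tuple[Card, Card]

def all_cards_estimate(betting: Sequence[str],
                       dead_cards: Iterable[Card],
                       deck: Sequence[Card],
                       betting_prob_eval: Callable[[Hole, Sequence[str]], float],
                       betting_prob_play: Callable[[Hole, Sequence[str]], float],
                       hand_value: Callable[[Hole, Sequence[str]], float]) -> float:
    """All Cards (AC) estimator sketch: sum over player i's possible private cards.

    betting_prob_eval(hole, betting) ~ P(observed betting | hole) under sigma_i (evaluated)
    betting_prob_play(hole, betting) ~ P(observed betting | hole) under sigma_i_hat (played)
    hand_value(hole, betting)        ~ value of the observed outcome with imagined hole cards
    """
    dead = set(dead_cards)      # exposed board and opponent cards are excluded
    holes = [h for h in combinations(deck, 2) if not (set(h) & dead)]
    denom = sum(betting_prob_play(h, betting) for h in holes)   # pi_i_hat(U(z))
    return sum(hand_value(h, betting) * betting_prob_eval(h, betting) / denom
               for h in holes)
```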
5. Results

Over the past few years we have created a number of strong Texas Hold'em poker agents that have competed in the past two AAAI Computer Poker Competitions. To evaluate our new estimators, we consider games played between three of these poker agents: S2298 (Zinkevich et al., 2007), PsOpti4 (Billings et al., 2003), and CFR8 (Zinkevich et al., 2008). In addition, we also consider Orange, a competitor in the First Man-Machine Poker Championship.

To evaluate these estimators, we examined records of games played between each of three candidate strategies (S2298, CFR8, Orange) against the opponent PsOpti4. Each of these three records contains one million hands of poker, and can be viewed as full information (both players' private cards are always shown) or as partial information (when the opponent folds, their private cards are not revealed). We begin with the full-information experiments.

5.1. Full Information

We used the estimators described previously to find the value of each of the three candidate strategies, using full-information records of games played from just one of the candidate strategies. The strategy that actually played the hands in the record of games is called the on-policy strategy and the others are the off-policy strategies. The results of one of these experiments are presented in Table 1. In this experiment, we examined one million full-information hands of S2298 playing against PsOpti4. S2298 (the on-policy strategy) and CFR8 and Orange (the off-policy strategies) are evaluated by our importance sampling estimators, as well as DIVAT, BC-DIVAT, and a few combination estimators. We present the empirical bias and standard deviation of the estimators in the first two columns. The third column, "RMSE", is the root-mean-squared error of the estimator if it were used as the method of evaluation for a 1000 hand match (a typical match length). All of the numbers are reported in millibets per hand played. A millibet is one thousandth of a small-bet, the fixed magnitude of bets used in the first two rounds of betting. To provide some intuition for these numbers, a player that always folds will lose 750 millibets per hand, and strong players aim to achieve an expected win rate over 50 millibets per hand.

Table 1. Full-Information Case. Empirical bias, standard deviation, and root-mean-squared error over a 1000 hand match for various estimators. One million hands of poker between S2298 and PsOpti4 were observed. A bias of 0* indicates a provably unbiased estimator. All values are in millibets per hand.

                        Bias        StdDev    RMSE
  S2298
    Basic               0*            5103     161
    DIVAT               0*            1935      61
    BC-DIVAT            0*            2891      91
    Early Folds         0*            5126     162
    All Cards           0*            4213     133
    AC+BC-DIVAT         0*            2146      68
    AC+EF+BC-DIVAT      0*            1778      56
  CFR8
    Basic               200 ± 122    62543    1988
    DIVAT                62 ± 104    53033    1678
    BC-DIVAT             84 ± 45     22303     710
    Early Folds         123 ± 120    61481    1948
    All Cards            12 ± 16      8518     270
    AC+BC-DIVAT          35 ± 13      3254     109
    AC+EF+BC-DIVAT        2 ± 12      2514      80
  Orange
    Basic               159 ± 40     20559     669
    DIVAT                 3 ± 25     11350     359
    BC-DIVAT            103 ± 28     12862     420
    Early Folds          82 ± 35     17923     572
    All Cards             7 ± 16      8591     272
    AC+BC-DIVAT           8 ± 13      3154     100
    AC+EF+BC-DIVAT        6 ± 12      2421      77

In the on-policy case, where we are evaluating S2298, all of the estimators are provably unbiased, and so they only differ in variance. Note that the Basic estimator, in this case, is just the Monte Carlo estimator over the actual money lost or won. The Early Folds estimator provides no variance reduction over the Monte Carlo estimate, while the All Cards estimator provides only a slight reduction. However, this is not nearly as dramatic as the reduction provided by the DIVAT estimator. The importance sampling estimators, however, can be combined with the DIVAT estimator as described in Section 3. The combination of BC-DIVAT with All Cards ("AC+BC-DIVAT") results in lower variance than either of the estimators separately.³ The addition of Early Folds ("AC+EF+BC-DIVAT") produces an even further reduction in variance, showing the best performance of all the estimators, even though Early Folds on its own had little effect.

³ The importance sampling estimators were combined with BC-DIVAT instead of DIVAT because the original DIVAT estimator is computationally burdensome, particularly when many evaluations are needed for every observation, as is the case with the All Cards estimator.

In the off-policy case, where we are evaluating CFR8 or Orange, we report the empirical bias (along with a 95% confidence bound) in addition to the variance. As DIVAT and BC-DIVAT were not designed for off-policy evaluation, we report numbers by combining them with the Basic estimator (i.e., using traditional importance sampling).
Note that bias is possible in this case because our on-policy strategy (S2298) does not satisfy the assumption in Theorem 1, as there are some outcomes the strategy never plays. Basic importance sampling in this setting not only shows statistically significant bias, but also exhibits impractically large variance. DIVAT and BC-DIVAT, which caused considerable variance reduction on-policy, also showed considerable variance reduction off-policy, but not enough to offset the extra variance from basic importance sampling. The All Cards estimator, on the other hand, shows dramatically lower variance with very little bias (in fact, the empirical bias is statistically insignificant). Combining the All Cards estimator with BC-DIVAT and Early Folds further reduces the variance, giving off-policy estimators that are almost as accurate as our best on-policy estimators.

The trends noted above continue in the other experiments, when CFR8 and Orange are being observed. For space considerations, we don't present the individual tables, but instead summarize these experiments in Table 2. The table shows the minimum and maximum empirically observed bias, standard deviation, and root-mean-squared error of the estimator for a 1000 hand match. The strategies being evaluated are separated into the on-policy case, when the record involves data from that strategy, and the off-policy case, when it doesn't.

5.2. Partial Information

The same experiments were repeated for the case of partial information. The results of the experiment involving S2298 playing against PsOpti4 and evaluating our three candidate strategies under partial information are shown in Table 3. For DIVAT and BC-DIVAT, which require full information of the game outcome, we used a partial-information variant where the full-information estimator was used when the game outcome was known (i.e., no player folded) and the winnings were used when it was not. This variant can result in a biased estimator, as can be seen in the table of results. The All Cards estimator, although also without any guarantee of being unbiased, actually fares much better in practice, not displaying a statistically significant bias in either the off-policy or on-policy experiments. However, even though the DIVAT estimators are biased, their low variance makes them preferred in terms of RMSE in the on-policy setting.

Table 3. Partial-Information Case. Empirical bias, standard deviation, and root-mean-squared error over a 1000 hand match for various estimators. One million hands of poker between S2298 and PsOpti4 with partial information were observed. A bias of 0* indicates a provably unbiased estimator. All values are in millibets per hand.

                        Bias        StdDev    RMSE
  S2298
    Basic               0*            5104     161
    DIVAT                81 ± 9       2762     119
    BC-DIVAT             95 ± 9       2759     129
    Early Folds          47 ± 1       5065     167
    All Cards             5 ± 13      4218     133
    AC+BC-DIVAT          96 ± 12      2650     127
  CFR8
    Basic               202 ± 80     40903    1309
    DIVAT               175 ± 47     23376     760
    BC-DIVAT            183 ± 47     23402     762
    Early Folds         181 ± 78     39877    1274
    All Cards            13 ± 19      7904     250
    AC+BC-DIVAT         101 ± 16      4014     162
  Orange
    Basic               204 ± 45     23314     765
    DIVAT               218 ± 22     10029     385
    BC-DIVAT            244 ± 21     10045     401
    Early Folds         218 ± 43     22379     741
    All Cards             3 ± 19      8092     256
    AC+BC-DIVAT         203 ± 16      3880     237
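A sketch of the partial-information variant of DIVAT/BC-DIVAT described above (hypothetical Python; `full_info_estimate` and `winnings` are stand-ins, not the authors' implementation): use the full-information estimator whenever the outcome was fully revealed, and fall back to the raw winnings otherwise.

```python
from typing import Callable, Iterable

def partial_info_variant(hands: Iterable[object],
                         outcome_known: Callable[[object], bool],
                         full_info_estimate: Callable[[object], float],
                         winnings: Callable[[object], float]) -> float:
    """Average a per-hand estimate: the full-information estimator when no player
    folded (so the outcome is fully known), and the raw money outcome otherwise.
    As noted in the text, this variant is not guaranteed to be unbiased."""
    total, count = 0.0, 0
    for hand in hands:
        total += full_info_estimate(hand) if outcome_known(hand) else winnings(hand)
        count += 1
    return total / count
```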
In the off-policy setting, the variance caused by Basic importance sampling (as used with DIVAT and BC-DIVAT) makes the All Cards estimator the only practical choice. As in the full-information case, we can combine All Cards and BC-DIVAT for further variance reduction. The resulting estimator has lower RMSE than either All Cards or BC-DIVAT alone, both in the on-policy and off-policy cases. A summary of the results of the other experiments, showing similar trends, is given in Table 4.

Table 2. Summary of the Full-Information Case. Summary of empirical bias, standard deviation, and root-mean-squared error over a 1000 hand match for various estimators. The minimum and maximum values encountered over all combinations of observed and evaluated strategies are presented. A bias of 0* indicates a provably unbiased estimator.

                            Bias           StdDev             RMSE
                         Min    Max      Min      Max      Min    Max
  On Policy
    Basic                0*     0*      5102     5385      161    170
    DIVAT                0*     0*      1935     2011       61     64
    BC-DIVAT             0*     0*      2891     2930       91     92
    AC+GE+BC-DIVAT       0*     0*      1701     1778       54     56
  Off Policy
    Basic                49     200    20559   244469      669   7732
    DIVAT                 2      62    11350   138834      358   4390
    BC-DIVAT             10     103    12862   173715      419   5493
    AC+GE+BC-DIVAT        2       9     1816     2857       58     90

Table 4. Summary of the Partial-Information Case. Summary of empirical bias, standard deviation, and root-mean-squared error over a 1000 hand match for various estimators. The minimum and maximum values encountered over all combinations of observed and evaluated strategies are presented. A bias of 0* indicates a provably unbiased estimator.

                            Bias           StdDev             RMSE
                         Min    Max      Min      Max      Min    Max
  On Policy
    Basic                0*     0*      5104     5391      161    170
    DIVAT                56     144     2762     2876      105    170
    BC-DIVAT             78     199     2759     2859      118    219
    AC+BC-DIVAT          78     206     2656     2766      115    224
  Off Policy
    Basic                17     433    23314   238874      753   7566
    DIVAT               103     282    10029    88791      384   2822
    BC-DIVAT             35     243    10045    99287      400   3139
    AC+BC-DIVAT          63     230     3055     6785      143    258

6. Conclusion

We introduced a new method for estimating agent performance in extensive games based on importance sampling. The technique exploits the fact that the agent's strategy is typically known to derive several low-variance estimators that can simultaneously evaluate many strategies while playing a single strategy. We prove that these estimators are unbiased in both the on-policy and off-policy case. We empirically evaluate the techniques in the domain of poker, showing significant improvements in terms of lower variance and lower bias. We show that the estimators can also be used even in the challenging problem of estimation with partial information observations.
Acknowledgments

We would like to thank Martin Zinkevich and Morgan Kan along with the entire University of Alberta Computer Poker Research Group for their valuable insights. This research was supported by NSERC and iCore.

References

Billings, D., Burch, N., Davidson, A., Holte, R., Schaeffer, J., Schauenberg, T., & Szafron, D. (2003). Approximating game-theoretic optimal strategies for full-scale poker. International Joint Conference on Artificial Intelligence (pp. 661-668).

Osborne, M., & Rubinstein, A. (1994). A course in game theory. Cambridge, Massachusetts: The MIT Press.

Zinkevich, M., Bowling, M., Bard, N., Kan, M., & Billings, D. (2006). Optimal unbiased estimators for evaluating agent performance. American Association of Artificial Intelligence National Conference, AAAI'06 (pp. 573-578).

Zinkevich, M., Bowling, M., & Burch, N. (2007). A new algorithm for generating equilibria in massive zero-sum games. Proceedings of the Twenty-Second Conference on Artificial Intelligence (pp. 788-793).

Zinkevich, M., Johanson, M., Bowling, M., & Piccione, C. (2008). Regret minimization in games with incomplete information. Advances in Neural Information Processing Systems 20. To appear.

Zinkevich, M., & Littman, M. (2006). The AAAI computer poker competition. Journal of the International Computer Games Association, 29. News item.