Knowledge Combination in Graphical Multiagent Models

Quang Duong    Michael P. Wellman    Satinder Singh
University of Michigan, Computer Science & Engineering
Ann Arbor, MI 48109-2121 USA
{qduong,wellman,baveja}@umich.edu

Abstract

A graphical multiagent model (GMM) represents a joint distribution over the behavior of a set of agents. One source of knowledge about agents' behavior may come from game-theoretic analysis, as captured by several graphical game representations developed in recent years. GMMs generalize this approach to express arbitrary distributions, based on game descriptions or other sources of knowledge bearing on beliefs about agent behavior. To illustrate the flexibility of GMMs, we exhibit game-derived models that allow probabilistic deviation from equilibrium, as well as models based on heuristic action choice. We investigate three different methods of integrating these models into a single model representing the combined knowledge sources. To evaluate the predictive performance of the combined model, we treat as actual outcome the behavior produced by a reinforcement learning process. We find that combining the two knowledge sources, using any of the methods, provides better predictions than either source alone. Among the combination methods, mixing data outperforms the opinion pool and direct update methods investigated in this empirical trial.

1 INTRODUCTION

Graphical models provide a compact representation for domains with decomposable structure, with concomitant computational advantages. Multiagent scenarios may be particularly amenable to decomposition, to the extent that interactions among the agents exhibit localized effects. The idea of exploiting conditional independence among the effects of agents' decisions was central to the multiagent influence diagram (MAID) framework developed by Koller and Milch (2003). This observation was also a driving motivation for graphical game models, first introduced by Kearns et al. (2001), and subsequently examined and extended in several research efforts (Kearns, 2007). In the basic graphical game approach, the model is a factored representation of a normal-form game, and special-purpose algorithms operate on this representation to identify approximate or exact Nash equilibria.

Daskalakis and Papadimitriou (2006) demonstrated how to map a graphical game to a Markov random field (MRF), assigning high potential to configurations where an agent plays a best response to its neighbors. They showed that the maximum a posteriori configurations of the MRF correspond to pure-strategy Nash equilibria (PSNE) of the game. This approach enables the exploitation of statistical inference tools for game-theoretic computation, including the full repertoire of graphical model algorithms.

We build on these works to introduce graphical multiagent models (GMMs), which are simply graphical models where the joint probability distribution is interpreted as an uncertain belief (e.g., a prediction) about the agents' play. For instance, the Daskalakis and Papadimitriou mapping can be viewed as a GMM where we believe that the agents will play a PSNE if one exists. This is of course just one candidate for belief based on game-theoretic analysis. When reasoning about strategies to play, or designing a mechanism (which induces a game for other agents), we may wish to adopt alternative bases for forming beliefs about the agents' play (Vorobeychik and Wellman, 2006).
The GMM framework supports such decision making, and moreover allows that beliefs may be based on variant solution concepts, models of bounded rationality or equilibrium selection, or for that matter knowledge that has nothing to do with game-theoretic analysis. At this level, our motivation shares the spirit of the networks of influence diagrams formalism of Gal and Pfeffer (2008), which extends MAIDs to incorporate non-maximizing models of behavior. It also aligns with the goal of Wolpert's information-theoretic framework for modeling bounded rationality in game play (Wolpert, 2006).

In this paper, we illustrate the flexibility of GMMs by showing how to construct plausible multiagent models using quite different sources of belief about agent play. One example model is based on the game form, and another is based on heuristic assumptions about behavior. In both, we assume that the graphical structure representing interactions among players in the game is known. We then introduce and test three approaches to integrating these knowledge sources into a combined model.

To evaluate the results, we posit that actual play is generated by agents who start by playing heuristically, then update their behavior over repeated interactions through a reinforcement learning (RL) process. Thus, the task we set up is to predict the outcome of an RL regime. More precisely, we seek to compute a reasonable estimate of the joint probability distribution of the agents' play. Intuitively, knowledge about the heuristic starting point is relevant, as is knowledge of strategically stable policies (game-theoretic equilibria), but neither directly captures nor necessarily corresponds to the RL outcome. We find experimentally that the two knowledge sources are in fact complementary, as the combined model outperforms either alone. Our investigation further provides support for one particular combination approach, based on mixing data.

We begin with formal definitions and specifications of the GMM framework in Section 2. In Section 3, we present an example multiagent domain, the Internet industry partnership network, and construct two plausible models based on different knowledge sources. Section 4 details the alternative combination methods. We follow with an empirical study in Section 5, designed to evaluate the performance of the respective models and their combinations. We conclude the paper with some observations on these results.

2 GRAPHICAL MULTIAGENT MODELS

Consider a multiagent scenario with n players, where each player i \in \{1, \dots, n\} chooses an action (or strategy) s_i from its strategy domain S_i. The outcome is a joint action, or strategy profile, s, designating the strategy choice of all players. A GMM G for this scenario is a graphical model, G = (V, E, S, \pi), with vertices V = \{v_1, \dots, v_n\} corresponding to the agents (we refer to v_i and i interchangeably), and edges (i, j) \in E indicating a local interaction between i and j. The graph defines for each agent a neighborhood, N_i = \{j \mid (i, j) \in E\} \cup \{i\}, including i and its neighbors. Each neighborhood N_i is associated with a potential function \pi_i(s_{N_i}) : S_{N_i} \rightarrow \mathbb{R}. Intuitively, a local configuration of strategies with a higher potential is more likely to be part of the global outcome than one with lower potential. These local potentials define the joint probability of a global configuration s,

    Pr(s) = \frac{1}{Z} \prod_i \pi_i(s_{N_i}),    (1)

where Z is a normalization term. As in graphical games, the size of the GMM description is exponential only in the size of local neighborhoods rather than in the total number of players.
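To make (1) concrete, the following sketch (our illustration, not code from the paper) enumerates all strategy profiles of a tiny GMM and normalizes the product of neighborhood potentials. The three-agent line graph and the agreement-rewarding potential are assumptions chosen only for the example, and the brute-force enumeration is viable only for small models.

```python
# A minimal sketch of equation (1): Pr(s) = (1/Z) * prod_i pi_i(s_{N_i}).
# Brute-force enumeration, so it is only feasible for small examples.
from itertools import product

def joint_distribution(neighborhoods, potentials, n_agents, strategies=(1, 2)):
    """neighborhoods[i] lists the indices in N_i (including i itself);
    potentials[i] maps the local configuration s_{N_i} to a positive number."""
    weights = {}
    for s in product(strategies, repeat=n_agents):
        w = 1.0
        for i in range(n_agents):
            local = tuple(s[j] for j in neighborhoods[i])
            w *= potentials[i](local)
        weights[s] = w
    Z = sum(weights.values())                     # normalization term Z
    return {s: w / Z for s, w in weights.items()}

# Illustrative 3-agent line graph 0-1-2; each potential rewards local agreement.
neighborhoods = [(0, 1), (0, 1, 2), (1, 2)]
agree = lambda local: 2.0 if len(set(local)) == 1 else 1.0
dist = joint_distribution(neighborhoods, [agree, agree, agree], n_agents=3)
print(max(dist, key=dist.get))                    # an all-agree profile, e.g. (1, 1, 1)
```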
One source of potential functions is a description of the game played by the n agents. Let \Gamma be a game with agents and strategy sets as defined for the GMM G. Let us further assume that \Gamma is a graphical game, such that agent i's payoff depends only on s_{N_i}: its own strategy and those of its neighbors. Formally, i's payoff is defined by a utility function u_i : S_{N_i} \rightarrow \mathbb{R}. The payoff functions in \Gamma can be mapped into G's potential functions in several different ways.

For example, Daskalakis and Papadimitriou (2006) defined a binary potential function, associating a high value with configurations where each agent's strategy choice is a best response to its neighbors (i.e., maximizes payoff given s_{N_{-i}}), and a low value with all other configurations. A natural generalization of this approach would smooth out the binary distinction, assigning intermediate potentials based on the degree to which agents deviate from their best responses. We may not wish to assume that agents play best responses with certainty, as they may not be perfectly rational, or our attributions of payoff functions may be inexact. For a given payoff model, let \epsilon_i(s_{N_i}) denote i's regret function, representing the maximum gain i can obtain by unilaterally reconsidering its own strategy s_i given s_{N_{-i}}, where N_{-i} = N_i \setminus \{i\}:

    \epsilon_i(s_{N_i}) = \max_{s_i' \in S_i} u_i(s_i', s_{N_{-i}}) - u_i(s_{N_i}).

Intuitively, we expect that high-regret profiles are less likely to be played (all else equal), since as regret increases agents are more apt to recognize and select the better alternatives. We can capture this intuition in a regret potential,

    \pi_i(s_{N_i}) = e^{-\epsilon_i(s_{N_i})/T_i},    (2)

where T_i, the temperature parameter, provides a way to calibrate our association between regret and relative likelihood. Greater values of T_i accord more likelihood to agents making less than perfectly rational decisions. For simplicity in notation below, we also define \lambda_i = 1/T_i, and \lambda as the vector of the \lambda_i. Let reG denote a GMM employing the regret potential function. In the current study, we consider the regret GMM reG as one plausible form of predictive model. We also consider models that are not based directly on payoffs in an associated game. In particular, we construct for our particular example a rule-based GMM, hG, encoding heuristic assumptions about agents' behavior.
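The regret potential (2) can be sketched the same way; the coordination payoff below is an assumed stand-in for a graphical-game utility u_i, and the helper names are ours.

```python
# A sketch of equation (2): pi_i(s_{N_i}) = exp(-eps_i(s_{N_i}) / T_i),
# where eps_i is agent i's regret for the local configuration s_{N_i}.
import math

def regret(i, local, neighborhood, payoff, strategies=(1, 2)):
    """Maximum gain i could obtain by unilaterally switching its own strategy,
    holding its neighbors' strategies in `local` fixed."""
    idx = neighborhood.index(i)                      # i's slot within s_{N_i}
    best = max(payoff(i, local[:idx] + (a,) + local[idx + 1:]) for a in strategies)
    return best - payoff(i, local)

def regret_potential(i, local, neighborhood, payoff, T=1.0):
    return math.exp(-regret(i, local, neighborhood, payoff) / T)

# Assumed 2-agent coordination payoff: agreeing with the neighbor pays 1, else 0.
payoff = lambda i, local: 1.0 if len(set(local)) == 1 else 0.0
print(regret_potential(0, (1, 1), (0, 1), payoff))   # 1.0, zero regret
print(regret_potential(0, (1, 2), (0, 1), payoff))   # ~0.37, regret of 1
```

Plugging regret_potential into the joint-distribution sketch above yields a regret GMM's outcome distribution; lower temperatures T concentrate probability mass on low-regret profiles.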
3 EXAMPLE: INTERNET INDUSTRY PARTNERSHIPS

We illustrate the GMM framework and motivate the problem of combining knowledge sources through an example multiagent scenario. In the Internet industry partnership domain, companies must decide whether to retain (s_i = 1) or upgrade (s_i = 2) their current technology. The payoff for each strategy depends on the choices of the other companies to which they are related through some kind of partnership, that is, their neighbors in the interaction graph. For example, the benefits of upgrading may be larger when one's partners also upgrade, since keeping their technologies synchronized enhances compatibility. The example is inspired by the network model of Krebs (2002), cited by Kearns (2002).

3.1 GAME DEFINITION

Figure 1: Part of the Internet industry partnership network, from Krebs (2002).

We construct our example scenario using a fragment of the partnership network consisting of 10 representative companies, as depicted in Figure 1. Each node represents a company, which we characterize by three parameters: (1) size class, z_i; (2) sector, t_i: either commerce, infrastructure, or content; and (3) change coefficient, ch_i \in [0, 1], representing the intrinsic adaptability of the company's technology. The study by Krebs (2002) provides sector and a rough order of size; the change coefficient is assigned by us in an arbitrary manner. Although somewhat contrived, the scenario specification serves our purpose of demonstrating some capabilities of the GMM approach.

The payoff function, u, defines the value of retaining or upgrading technology, given the actions of a firm's neighboring companies and the parameters describing these companies. Qualitatively, payoff is increased by agreement with neighbors, where larger neighbors from the same sector are relatively more important. Let us first introduce an intermediate value w_{ij} for each connected pair of companies, reflecting the strength of i and j's partnership:

    w_{ij}(s_i, s_j) = (z_i + z_j) \left( \frac{I_s}{2} + \frac{I_t + (1 - I_t)\, y_{ij}}{4} \right),    (3)

where y_{ij} \sim U[0, 1] is a random variable, and I_s and I_t are indicator functions representing agreement in strategy and technology sector: I_s = 1 if s_i = s_j, and 0 otherwise; I_t = 1 if t_i = t_j, and 0 otherwise. The intuition for (3) is that the mutual influence between two firms increases with their size, agreement in action, and sector commonality.

We also define a function \phi(ch_i, s_i) to compare i's action to its change parameter ch_i. Specifically, \phi's value is positive if s_i is upgrade (retain) and ch_i is greater (smaller) than 0.5. The interpretation is that a value greater (smaller) than 0.5 for ch_i implies that i is flexible (inflexible) with respect to technology change:

    \phi(ch_i, s_i) =
      \begin{cases}
        s_i/2 - ch_i & \text{if } s_i/2 - ch_i < 0.5 \\
        0.5 - s_i/2 + ch_i & \text{otherwise.}
      \end{cases}

Finally, the overall payoff combines the pairwise partnership weights, further adjusted by the company's flexibility in upgrading its technology:

    u_i(s_{N_i}) = \left(1 + y_i\, \phi(ch_i, s_i)\right) \sum_{j \in N_{-i}} w_{ij}(s_i, s_j),  with  y_i \sim U[0, 1].
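For concreteness, one possible rendering of this payoff in code follows. The weight formula reflects our reading of equation (3), and the company parameters in the usage line are placeholders; treat this as an illustrative sketch rather than an exact specification.

```python
# Sketch of the partnership payoff: pairwise weights w_ij (our reading of (3)),
# the flexibility adjustment phi, and the overall payoff u_i.
import random

def w_ij(z_i, z_j, s_i, s_j, t_i, t_j, y_ij):
    """Partnership weight: grows with combined size, action agreement (I_s),
    and sector commonality (I_t); y_ij ~ U[0,1] softens cross-sector ties."""
    I_s = 1.0 if s_i == s_j else 0.0
    I_t = 1.0 if t_i == t_j else 0.0
    return (z_i + z_j) * (I_s / 2.0 + (I_t + (1.0 - I_t) * y_ij) / 4.0)

def phi(ch_i, s_i):
    """Positive when the action matches the firm's flexibility: upgrade (s_i = 2)
    with ch_i > 0.5, or retain (s_i = 1) with ch_i < 0.5."""
    x = s_i / 2.0 - ch_i
    return x if x < 0.5 else 0.5 - s_i / 2.0 + ch_i

def payoff(i, s, size, sector, change, partners, rng=random):
    """u_i: flexibility-adjusted sum of partnership weights with i's partners."""
    y_i = rng.random()                               # y_i ~ U[0, 1]
    pair_sum = sum(w_ij(size[i], size[j], s[i], s[j], sector[i], sector[j],
                        rng.random()) for j in partners[i])
    return (1.0 + y_i * phi(change[i], s[i])) * pair_sum

# Hypothetical two-company example: both upgrade, same sector.
print(payoff(0, {0: 2, 1: 2}, size={0: 3, 1: 2},
             sector={0: "commerce", 1: "commerce"},
             change={0: 0.8, 1: 0.4}, partners={0: [1], 1: [0]}))
```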
3.2 GMM CONSTRUCTIONS

Given the payoff function and the game definition above, we can generate a regret potential function (2) in a straightforward manner, parametrized by temperature. This potential function in turn defines a regret GMM, reG.

Our heuristic rule-based model, hM, in contrast, is specified without direct reference to the payoff function. In this model, each company independently applies a local heuristic to stochastically choose its action. Specifically, agent i changes its technology with probability pChange(i), where

    pChange(i) = 0.5\, (1 - 10^{-3})^{|N_i|} (1 - 10^{-3} z_i).

The intuition behind this heuristic is that the more partners (|N_i| - 1) a company has and the greater its size z_i, the less likely it is to change. Given the pChange values, it is straightforward to define a potential function for the GMM hG such that the outcome distribution is the same as that generated by applying the rule independently for each company. As a result, hG's potential is a function of s_i only, rather than of s_{N_i}.

The two GMMs, reG and hG, are based on qualitatively different sources of knowledge. If we believe that agents are essentially rational and aware of their environment, we may expect the regret GMM reG to predict behavior well. If instead we have evidence that agents choose heuristically based on partnership density and size, we might have greater confidence in predictions based on hG, or on some other heuristic model that captures our intuition of agents' behavior. In other situations, we may consider that both models capture factors determining behavior, and view the knowledge sources as complementary.

3.3 SIMULATION MODEL

The role of our simulation model is to generate play data from a plausible agent interaction process. In this study, we treat these data as the actual outcome, and use them to evaluate the GMMs above as well as the combined models. The simulation is based on the idea that actual agent behavior is produced via repeated interaction through a reinforcement learning (RL) procedure. In the model, each agent is an independent learner, employing an RL procedure designed for partially observable environments (Jaakkola et al., 1995). The environment for company i comprises i and its partners (i.e., its neighbors N_{-i}), and each configuration of partner strategies s_{N_{-i}} is a possible state. The agent seeks to learn a stochastic play policy \sigma_i(s_i \mid s_{N_{-i}}), which denotes the probability of playing action s_i in state s_{N_{-i}}. However, the agent does not actually observe s_{N_{-i}} before taking its action. Thus, the action is actually selected based on the policy, for a = 1, 2,

    Pr_{\sigma_i}(s_i = a) = \sum_{s_{N_{-i}}} \sigma_i(s_i = a \mid s_{N_{-i}})\, Pr(s_{N_{-i}}),    (4)

where Pr(s_{N_{-i}}) is a stationary probability distribution over states, which, for simplicity, we take to be uniform. (The scarcity of companies' knowledge about others' strategies, and the network effects, motivate this choice of Pr(s_{N_{-i}}) over a more complicated model of network dynamics.)

To learn the policy \sigma_i, we apply the RL procedure of Jaakkola et al. (1995):

1. Initialize \sigma_i according to pChange(i) (i.e., the heuristic policy from hM).
2. Generate a play s_i using (4). Observe the resulting local state s_{N_{-i}} and receive as reward the payoff u_i(s_{N_i}). Update Q_i(s_{N_{-i}}, s_i), the average reward for taking action s_i in the associated local state.
3. Choose \hat{\sigma}_i(s_i \mid s_{N_{-i}}) to maximize Q_i(s_{N_{-i}}, s_i). Adjust the current policy in the direction of \hat{\sigma}_i: \sigma_i \leftarrow (1 - \alpha)\, \sigma_i + \alpha\, \hat{\sigma}_i, where \alpha is the learning rate.
4. Repeat steps 2 and 3 until convergence.

For our experiments, we used \alpha = 0.2 and iterated steps 2 and 3 above 40 times, the point after which few changes occurred in the learned RL policy \sigma_i. We denote the simulation model at the end of the RL procedure by simM. Note that the RL process starts with the local heuristic rule-based policy, but is updated based on payoff experience. Thus we expect that both the heuristic rule-based model and the rationalistic regret-based model may offer value for predicting the outcomes of simM.
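A simplified sketch of this learning loop appears below. It samples the local state before acting rather than marginalizing over states as in (4), and the sample_local_state and reward callables are assumed placeholders for the partners' strategy dynamics and the payoff u_i, so it is a rough approximation of the procedure rather than a faithful implementation of Jaakkola et al. (1995).

```python
# Simplified sketch of the Section 3.3 policy-update loop: initialize from the
# hM heuristic, track average rewards Q, and nudge the policy toward greedy.
import random
from collections import defaultdict

def learn_policy(sample_local_state, reward, p_change, alpha=0.2, iters=40,
                 actions=(1, 2)):
    # Step 1: heuristic initialization; action 2 (upgrade) has probability pChange(i).
    init = {1: 1.0 - p_change, 2: p_change}
    policy = defaultdict(lambda: dict(init))        # state -> {action: prob}
    Q = defaultdict(lambda: defaultdict(float))     # running-average rewards
    n = defaultdict(lambda: defaultdict(int))
    for _ in range(iters):
        state = sample_local_state()                # partners' strategies (a tuple)
        # Step 2: play from the current stochastic policy and record the reward.
        a = random.choices(actions, weights=[policy[state][x] for x in actions])[0]
        n[state][a] += 1
        Q[state][a] += (reward(state, a) - Q[state][a]) / n[state][a]
        # Step 3: move the policy toward the greedy policy for this state.
        best = max(Q[state], key=Q[state].get)
        for x in actions:
            target = 1.0 if x == best else 0.0
            policy[state][x] = (1 - alpha) * policy[state][x] + alpha * target
    return policy
```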
4 METHODS FOR COMBINING KNOWLEDGE SOURCES

Given two complementary sources of knowledge, how can we integrate them into a single GMM? We formulate the problem for the case in which one knowledge source is expressed explicitly as a GMM, G_1, and the other in the form of some data D = \{s^1, \dots, s^m\} of joint plays related to the multiagent scenario. Note that D may not reflect the actual distribution of play accurately, for example because it is small in size or because it was observed in a different multiagent setting. (Since we can also induce a GMM from such a data set, the combination methods can be applied to more general settings where knowledge sources come in either form.) In this section we address the question abstractly, and in the next section we show how we apply the methods to our specific reG model and data D derived from hM.

4.1 DIRECT UPDATE

The direct update method combines the two sources of knowledge, G_1 and D, into a new GMM, directG, derived by adjusting the parameters of G_1 to maximize the predictive performance with respect to the data D. We measure predictive performance using the logarithmic scoring rule (Gneiting and Raftery, 2007),

    Score(G \mid D) = \sum_{k=1}^{|D|} \log Pr_G(s^k),

which assesses the log-likelihood of the data. We take as our problem to tune the GMM's parameters \lambda in order to maximize this score (Kappen and Rodríguez, 1997),

    Score(G_1 \mid D) = \sum_{k=1}^{|D|} \log Pr_{G_1}(s^k) = L(D \mid \lambda = \lambda_{G_1}).

We employ gradient ascent to maximize data likelihood, which entails computing the gradient of L with respect to \lambda,

    \frac{\partial L(D \mid \lambda)}{\partial \lambda_i} = -\sum_{k=1}^{|D|} \epsilon_i(s^k_{N_i}) - |D| \frac{\partial \log Z}{\partial \lambda_i},    (5)

and adjusting \lambda \leftarrow \lambda + \eta \nabla_\lambda L, where \eta is the learning rate, until the gradient is below some threshold.

A major problem in graphical-model parameter learning is the intractability of calculating \log Z in (5): it entails iterating over all possible outcomes of the game, which is exponential in the number of players n, rendering exact inference and learning in undirected graphical models intractable. Since our priority is to evaluate knowledge combination methods for GMMs accurately, our current implementation of the inference and learning algorithms does not employ any approximation. Our pilot study of generalized belief propagation approximations in GMMs (Yedidia et al., 2001) has indeed yielded positive results, and will be incorporated in future reports.
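A brute-force sketch of the direct-update step follows, under the assumption that each regret \epsilon_i is available as a callable over full profiles; it computes Z by enumeration and so, like the exact implementation described above, is feasible only for small games.

```python
# Sketch of direct update: gradient ascent on the log score of a regret GMM
# with parameters lambda_i = 1/T_i (equation (5)), computing Z by enumeration.
from itertools import product
import math

def fit_lambda(data, regrets, n_agents, strategies=(1, 2), eta=0.01, steps=200):
    """data: list of observed joint plays (tuples); regrets[i]: profile -> eps_i."""
    lam = [1.0] * n_agents
    profiles = list(product(strategies, repeat=n_agents))
    for _ in range(steps):
        weights = {s: math.exp(-sum(lam[i] * regrets[i](s) for i in range(n_agents)))
                   for s in profiles}
        Z = sum(weights.values())
        # d(log Z)/d(lambda_i) = -E_Pr[eps_i], so the score gradient is
        # -sum_k eps_i(s^k) + |D| * E_Pr[eps_i].
        expected = [sum((w / Z) * regrets[i](s) for s, w in weights.items())
                    for i in range(n_agents)]
        grad = [-sum(regrets[i](s) for s in data) + len(data) * expected[i]
                for i in range(n_agents)]
        lam = [l + eta * g for l, g in zip(lam, grad)]
    return lam
```

With regrets built from the partnership payoff and D drawn from hM, fit_lambda returns the tuned \lambda defining directG.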
4.2 OPINION POOL

Unlike direct update, which depends on the availability of both play-outcome data and the potential function's parametrized form, the next two methods, opinion pool and mixing data, push the knowledge combination problem toward potentially greater independence from the input knowledge sources' forms.

The opinion pool method starts by using half of the given data D to learn a GMM G_2, adjusting its parameters to maximize the likelihood of the data, as in the direct update method above. The combined model OPG is then an aggregation of G_1 and G_2 into a single probability distribution:

    Pr_{OPG}(s) = f(Pr_{G_1}(s), Pr_{G_2}(s)).

Note that this aggregation does not involve defining a separate pooled potential function for each player in the combined model. The rationale is that potentials are not normalized like the joint probability, and thus their absolute values carry little meaning when taken out of the context of their corresponding models. We adopt for our aggregation function the weighted geometric mean, called the logarithmic opinion pool (logOP),

    Pr_{OPG}(s) = \frac{1}{Z} Pr_{G_1}(s)^w\, Pr_{G_2}(s)^{1-w}.

The logOP is the only pool known to preserve independence structure in graphical models (Pennock and Wellman, 2005), which is an important property in our context. The weight parameter w can be set by fiat, or tuned with data. In our experiments described below, we employ the other half of the input game-play data D, denoted \bar{D}, and set w by maximizing the objective function

    L(\bar{D} \mid w) = \sum_k \log Pr_{OPG}(s^k).    (6)

In brief, given the two components G_1 and G_2, the opinion pool method first initializes w = 0.5. It then repeatedly computes the gradient \Delta w of the log-likelihood with respect to w by differentiating (6), and updates w \leftarrow w + \eta\, \Delta w, where \eta is the learning rate, until the gradient is acceptably small.

4.3 MIXING DATA

The mixing data method samples joint plays from the given GMM G_1 to generate a new data set D_1. We combine D and D_1 into one data set mD by sampling from the two sources equally, though one could easily weight one source more than the other by adjusting the sampling ratio. We then induce a new GMM of a given parametrized form by tuning the parameter \lambda, as detailed in the direct update approach (Section 4.1), to maximize the likelihood of the new data mD. Below is the procedure of the mixing data method, which produces the combined GMM mixG; note that we leave out step 4 in our implementation.

1. Generate a sample of play outcomes D_1 from G_1.
2. Build the mixed data set mD from D and D_1 with sampling ratio 0.5.
3. Initialize mixG with some \lambda_{mixG}. Update \lambda_{mixG} as in direct update, using the data mD.
4. (optional) Tune \lambda_{mixG} in the direction determined by the gradient-descent method to maximize mixG's performance on a small held-out part of the testing data. Repeat steps 3 and 4 until the gradient is below some threshold.
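A compact sketch of steps 1-3 (step 4 omitted, as in the implementation described above); gmm_dist stands for the input GMM's outcome distribution and fit_lambda for the maximum-likelihood routine of Section 4.1, both assumed available rather than taken from the paper.

```python
# Sketch of mixing data: sample |D| plays from the input GMM G_1, mix them
# 50/50 with the observed plays D, and refit the regret-GMM parameters on mD.
import random

def mix_and_fit(gmm_dist, data_D, fit_lambda, regrets, n_agents):
    profiles, probs = zip(*gmm_dist.items())
    data_G1 = random.choices(profiles, weights=probs, k=len(data_D))  # step 1
    mixed = list(data_D) + data_G1            # step 2: equal shares (ratio 0.5)
    random.shuffle(mixed)
    return fit_lambda(mixed, regrets, n_agents)   # step 3: MLE on mD -> mixG
```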
5 EMPIRICAL STUDY

We evaluate our combination approaches experimentally using our simplified version of the Internet industry partnership domain. We concentrate mainly on Example 1, the scenario depicted in Figure 1. In the first experiment, we also examine Example 2, which employs a smaller graph including only the top four companies.

5.1 EXPERIMENT SETTINGS

In our experiments we use the following components defined above: (i) the regret GMM reG, with the temperature parameters generated uniformly at random (except in one case, explicitly specified below); (ii) a data set D of joint plays generated using the heuristic rule-based model hM; (iii) the heuristic model hM and its associated GMM hG; and (iv) a testing data set D^* derived from the RL-based model simM. We compare our different combination approaches based on their ability to predict the test data set, as measured by the score function Score(G \mid D^*). In particular, we compare the performance of a model that combines knowledge sources, combinedG, relative to the performance of a baseline model baseG using a ratio of scores,

    R = \frac{Score(baseG \mid D^*)}{Score(combinedG \mid D^*)}.

The inverted order of baseG and combinedG is due to Score's negativity; thus any R > 1 indicates an improvement of combinedG over baseG. We experiment with several environment settings. For each setting, we conduct 20 trials, each of which involves a training data set D of 500 plays from hM and a testing data set D^* of 500 plays from simM.

5.2 RESULTS AND ANALYSIS

First, in Figures 2 and 3 we present an overview of our combination methods' effectiveness. For both figures, reG and D are used to derive the model directG using the direct update combination method, the model OPG using the opinion pool method, and the model mixG using the mixing data method. Figure 2 displays predictive performance across the two examples (1 and 2) and the two baseline models (reG and hG). Mixing data is consistently the best of the three combination methods, and direct update performs relatively better than opinion pool. All three methods yield better results than the individual input models, suggesting that combining two knowledge sources is beneficial regardless of which of the proposed methods is adopted.

Figure 2: Combination methods' performance across different examples and baselines.

Figure 3 shows the performance of various models compared to a model derived from the same gold-standard source, simM, as our test data D^*. We sample a separate data set D' from simM and employ it in learning a GMM simG of the parametrized regret form (2), using maximum likelihood to adjust the temperature parameters. The results reveal that our combined models, especially mixG, closely match simG in terms of predictive performance.

Figure 3: Combination and input models versus the underlying model.

Next, we study the effect of varying the quality of the two input sources. Figure 4 shows the effect of varying the amount of joint-play data available in D. Specifically, we make a fraction \gamma |D| of the play observations, \gamma \in [0, 1], available to the three combination methods. From Figure 4, we observe that as long as \gamma > 0.1, performance remains fairly stable. In our experiments, this corresponds to a threshold data set size of approximately 500 \times 0.1 = 50. When the amount of data falls below this threshold, the combined model may well become inferior to the models reG and hG.

Figure 4: Combination methods' performance versus data availability portion \gamma.

Figure 5 shows the effect of varying the quality of the GMM provided as input to the combination methods. To modulate the accuracy of reG, we introduce a parameter \beta controlling the relation of its temperature parameters to those of simG; specifically, we set the temperatures of reG to (1 + \beta) times those of simG. When compared with the unchanged heuristic model hG, the combination models show a slight decrease in their relative performance as \beta grows, which reflects the effect of reG's inaccuracy on the combination methods. When the baseline is reG, in contrast, the degradation of the combined model is dominated by the effect of compromising the baseline reG itself. In other words, combining knowledge sources effectively compensates for degrading one of them.

Figure 5: Combination models' performance versus input regret model's inaccuracy as controlled by \beta.

In these experiments, OPG's poor performance relative to that of directG and mixG may be due in part to its reliance on only a single parameter, w, compared to the vector \lambda available to the other methods. mixG's overall superiority is likely a result of its directly sampling from reG, which employs the information contained in reG more effectively than directG, where reG matters only at the initialization stage.

Figure 6 presents the results of an experiment designed to strengthen our claims about the benefits of integrating knowledge sources in a single model. We examine the combined models' performance in environments where the simulation model is not the product of RL starting from the input model hM. In particular, we define a different heuristic model, hM^{(E)}, such that pChange(i) = 0.05 for all i. Let E be the input data set generated from hM^{(E)}. Based on hM^{(E)}, we subsequently build the simulation model simM^{(E)} and its corresponding test data E^*. Given these data sets and models, we can evaluate the combined models across drastically different starting points, by comparing their performances when different input sources, D and E, are provided, on the same testing data (either D^* or E^*).
First, we compare the performance of the heuristic models employed in generating the input data, hG^{(D)} (induced from hM) and hG^{(E)} (induced from hM^{(E)}): hG^{(D)} performs 60% better than hG^{(E)} when tested on D^*, whereas hG^{(E)} outperforms hG^{(D)} by 54% on E^*. This assessment affirms that the two different input data sets, D and E, are indeed differentiable in terms of the behavior models they represent.

Figure 6: Combination methods in different input and test settings (input D/test D^*, input E/test D^*, input E/test E^*, input D/test E^*).

The results presented in Figure 6 indicate that better performance is achieved in cases where the test and input environments coincide (the first and third measures) than in those where they are unrelated (the second and fourth). This observation confirms that combined models whose input data contain some information about the underlying behavior model outperform those extracting irrelevant information from their inputs. The gaps in mixG's performance between scenarios where test cases are derived from the same input (first and third) and from an unrelated model (second and fourth) are relatively smaller than those of the other methods. This phenomenon is likely a result of mixG's more effective usage of both knowledge sources, which limits the impact of irrelevant input data, and possibly contributes to the good performance of mixG, tuned to fit D, when tested on E^*.

6 CONCLUSIONS

GMMs provide a flexible representation framework for graphically structured multiagent scenarios, supporting the specification of probability distributions based on game-theoretic models as well as heuristic or other qualitatively different characterizations of agent behavior. We explored the possibility of exploiting this flexibility by employing multiple knowledge sources for prediction, as demonstrated for the task of predicting the outcome of a reinforcement learning process. Our basic finding is that combining two knowledge sources in this scenario does improve predictive power over either input source alone. We have also identified the most effective combination method among those tried (mixing data) and the existence of a threshold for data availability that can help boost efficiency. Furthermore, we have found that our knowledge combination approaches, especially mixing data, can effectively match the performance of modeling the reinforcement learning process directly.

This study is a first step in what we expect to be an extended effort to develop the GMM framework for supporting reasoning about strategic situations. One important issue to address is computational feasibility; although the graphical representation facilitates scalability in the number of agents, accurate approximation techniques are nevertheless essential to support practical applications with large models. Knowledge about multiagent behavior may come from sources other than play history and regret functions, so another logical research direction is to develop canonical models for capturing and combining such sources. Learning may be extended to cover not only the model's parameters, but also the structure of the potential functions and the graph topology itself. Such extensions present more complicated problems for combining models derived from different knowledge sources. Finally, we might look to dynamic Bayesian network concepts to extend the GMM framework with a time dimension, and thus enable modeling sequential and interactive multiagent environments.
References

Daskalakis, C. and Papadimitriou, C. H. (2006). Computing pure Nash equilibria in graphical games via Markov random fields. In Seventh ACM Conference on Electronic Commerce, pages 91–99, Ann Arbor.

Gal, Y. and Pfeffer, A. (2008). Networks of influence diagrams: A formalism for reasoning about agents' decision-making processes. Journal of Artificial Intelligence Research.

Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378.

Jaakkola, T., Singh, S., and Jordan, M. I. (1995). Reinforcement learning algorithm for partially observable Markov decision problems. In Advances in Neural Information Processing Systems 7, pages 345–352.

Kappen, H. J. and Rodríguez, F. B. (1997). Mean field approach to learning in Boltzmann machines. Pattern Recognition Letters, 18:1317–1322.

Kearns, M. (2002). Computational game theory: A tutorial. http://www.cis.upenn.edu/~mkearns/nips02tutorial/nips.pdf. Presented at the Neural Information Processing Systems conference.

Kearns, M. (2007). Graphical games. In Nisan, N., Roughgarden, T., Tardos, E., and Vazirani, V. V., editors, Algorithmic Game Theory, pages 159–180. Cambridge University Press.

Kearns, M., Littman, M. L., and Singh, S. (2001). Graphical models for game theory. In Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 253–260, Seattle.

Koller, D. and Milch, B. (2003). Multi-agent influence diagrams for representing and solving games. Games and Economic Behavior, 45:181–221.

Krebs, V. (2002). Internet industry partnerships. http://www.orgnet.com/netindustry.html.

Pennock, D. M. and Wellman, M. P. (2005). Graphical models for groups: Belief aggregation and risk sharing. Decision Analysis, 2:148–164.

Vorobeychik, Y. and Wellman, M. P. (2006). Mechanism design based on beliefs about responsive play (position paper). In ACM EC-06 Workshop on Alternative Solution Concepts for Mechanism Design, Ann Arbor, MI.

Wolpert, D. H. (2006). Information theory: The bridge connecting bounded rational game theory and statistical physics. In Complex Engineered Systems. Springer.

Yedidia, J. S., Freeman, W. T., and Weiss, Y. (2001). Generalized belief propagation. In Advances in Neural Information Processing Systems 13, pages 689–695.