Improving on Expectation Propagation

Manfred Opper
Computer Science, TU Berlin
opperm@cs.tu-berlin.de

Ulrich Paquet
Computer Laboratory, University of Cambridge
ulrich@cantab.net

Ole Winther
Informatics and Mathematical Modelling, Technical University of Denmark
owi@imm.dtu.dk

Abstract

A series of corrections is developed for the fixed points of Expectation Propagation (EP), which is one of the most popular methods for approximate probabilistic inference. These corrections can lead to improvements of the inference approximation or serve as a sanity check, indicating when EP yields unreliable results.

1 Introduction

The expectation propagation (EP) message passing algorithm is often considered the method of choice for approximate Bayesian inference when both good accuracy and computational efficiency are required [4]. One recent example is a comparison of EP with extensive MCMC simulations for Gaussian process (GP) classifiers [3], which has shown that not only the predictive distribution, but also the typically much harder marginal likelihood (the partition function) of the data, are approximated remarkably well for a variety of data sets. However, while such empirical studies hold great value, they cannot guarantee the same performance on other data sets or when completely different types of Bayesian models are considered.

In this paper methods are developed to assess the quality of the EP approximation. We compute explicit expressions for the remainder terms of the approximation. This leads to various corrections for partition functions and posterior distributions. Under the hypothesis that the EP approximation works well, we identify quantities which can be assumed to be small and which can be used in a series expansion of the corrections with increasing complexity. The computation of low order corrections in this expansion is often feasible, typically requires only moderate computational effort, and can lead to an improvement of the EP approximation or to an indication that the approximation cannot be trusted.

2 Expectation Propagation in a Nutshell

Since it is the goal of this paper to compute corrections to the EP approximation, we will not discuss details of EP algorithms but rather characterise the fixed points which are reached when such algorithms converge. EP is applied to probabilistic models with an unobserved latent variable $\mathbf{x}$ having an intractable distribution $p(\mathbf{x})$. In applications $p(\mathbf{x})$ is usually the Bayesian posterior distribution conditioned on a set of observations. Since the dependency on the latter variables is not important for the subsequent theory, we will omit them from our notation.

It is assumed that $p(\mathbf{x})$ factorizes into a product of terms $f_n$ such that

$$p(\mathbf{x}) = \frac{1}{Z} \prod_n f_n(\mathbf{x}) \,, \qquad (1)$$

where the normalising partition function $Z = \int d\mathbf{x} \prod_n f_n(\mathbf{x})$ is also intractable. We then assume an approximation to $p(\mathbf{x})$ in the form

$$q(\mathbf{x}) = \prod_n g_n(\mathbf{x}) \,, \qquad (2)$$

where the terms $g_n(\mathbf{x})$ belong to a tractable, e.g. exponential, family of distributions. To compute the optimal parameters of the $g_n$ term approximation, a set of auxiliary tilted distributions is defined via

$$q_n(\mathbf{x}) = \frac{1}{Z_n} \frac{q(\mathbf{x})\, f_n(\mathbf{x})}{g_n(\mathbf{x})} \,. \qquad (3)$$

Here a single approximating term $g_n$ is replaced by the original term $f_n$. Assuming that this replacement leaves $q_n$ still tractable, the parameters in $g_n$ are determined by the condition that $q(\mathbf{x})$ and all $q_n(\mathbf{x})$ should be made as similar as possible. This is usually achieved by requiring that these distributions share a set of generalised moments (which usually coincide with the sufficient statistics of the exponential family).
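As a concrete illustration of these fixed-point conditions, the following minimal Python sketch implements EP with moment matching for a toy 1-D model. The model, the pseudo-data y, and all variable names are our own illustrative assumptions, not an example from the paper: the prior is N(0,1), the intractable terms are probit-style factors f_n(x) = Phi(y_n x), and the sites are unnormalised Gaussians g_n(x) proportional to exp(r_n x - s_n x^2/2), so q and each tilted distribution q_n in (3) can be matched on their first two moments.

import numpy as np
from scipy.stats import norm

# Minimal EP sketch for the toy model p(x) ∝ N(x; 0, 1) prod_n Phi(y_n x).
# Each term f_n(x) = Phi(y_n x) gets a Gaussian site g_n(x) ∝ exp(r_n x - s_n x²/2);
# sites are updated so that q matches the mean and variance of the tilted q_n.
y = np.array([1.0, 1.0, -1.0, 1.0])   # assumed pseudo-observations
r = np.zeros(len(y))                   # site linear (natural) parameters
s = np.zeros(len(y))                   # site precisions

for sweep in range(100):
    for n in range(len(y)):
        # Cavity distribution: q with site n removed (the prior has precision 1).
        s_cav = 1.0 + s.sum() - s[n]
        r_cav = r.sum() - r[n]
        v_c, m_c = 1.0 / s_cav, r_cav / s_cav
        # Closed-form mean/variance of q_n(x) ∝ N(x; m_c, v_c) Phi(y_n x).
        z = y[n] * m_c / np.sqrt(1.0 + v_c)
        a = norm.pdf(z) / norm.cdf(z)
        m_new = m_c + v_c * y[n] * a / np.sqrt(1.0 + v_c)
        v_new = v_c - v_c**2 * a * (z + a) / (1.0 + v_c)
        # Moment matching: reset site n so that q acquires the tilted moments.
        s[n] = 1.0 / v_new - s_cav
        r[n] = m_new / v_new - r_cav

m_q, v_q = r.sum() / (1.0 + s.sum()), 1.0 / (1.0 + s.sum())

# Sanity check against numerical quadrature on a grid.
xs = np.linspace(-8.0, 8.0, 8001)
post = norm.pdf(xs) * np.prod(norm.cdf(y[:, None] * xs), axis=0)
post /= np.trapz(post, xs)
m_ex = np.trapz(xs * post, xs)
v_ex = np.trapz(xs**2 * post, xs) - m_ex**2
print("EP   :", m_q, v_q)
print("exact:", m_ex, v_ex)

At convergence q shares its first two moments with every tilted distribution q_n, which is exactly the expectation-consistency condition discussed next.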
Note that we will not assume that this expectation consistency [7] for the moments is derived by minimising a Kullback-Leibler divergence, as was done in the original derivations of EP [4]. Such an assumption would limit the applicability of the approximate inference and would exclude, e.g., the approximation of models with binary, Ising variables by a Gaussian model, as in one of the applications in the last section. The corresponding approximation to the normalising partition function in (1) was given in [7] and [6] and reads in our present notation¹

$$Z_{EP} = \prod_n Z_n \,. \qquad (4)$$

¹The definition of the partition functions $Z_n$ is slightly different from previous works.

3 Corrections to EP

An expression for the remainder terms which are neglected by the EP approximation can be obtained by solving for $f_n$ in (3) and taking the product:

$$\prod_n f_n(\mathbf{x}) = \prod_n \frac{Z_n\, q_n(\mathbf{x})\, g_n(\mathbf{x})}{q(\mathbf{x})} = Z_{EP}\; q(\mathbf{x}) \prod_n \frac{q_n(\mathbf{x})}{q(\mathbf{x})} \,. \qquad (5)$$

Hence $Z = \int d\mathbf{x} \prod_n f_n(\mathbf{x}) = Z_{EP}\, R$, with

$$R = \int d\mathbf{x}\; q(\mathbf{x}) \prod_n \frac{q_n(\mathbf{x})}{q(\mathbf{x})} \qquad \text{and} \qquad p(\mathbf{x}) = \frac{q(\mathbf{x})}{R} \prod_n \frac{q_n(\mathbf{x})}{q(\mathbf{x})} \,. \qquad (6)$$

This shows that corrections to EP are small when all distributions $q_n$ are indeed close to $q$, justifying the optimality criterion of EP. For a similar attempt at corrections to loopy belief propagation, see [8].

Exact probabilistic inference with the corrections described here again leads to intractable computations. However, we can derive exact perturbation expansions involving a series of corrections with increasing computational complexity. Assuming that EP already yields a good approximation, the computation of a small number of these terms may be sufficient to obtain the most dominant corrections. On the other hand, when the leading corrections come out large, or do not decrease sufficiently with order, this may indicate that the EP approximation is inaccurate. Two such perturbation expansions are presented in this section.

3.1 Expansion I: Clusters

The most basic expansion is based on the variables $\varepsilon_n(\mathbf{x}) = \frac{q_n(\mathbf{x})}{q(\mathbf{x})} - 1$, which we can assume to be typically small when the EP approximation is good. Expanding the products in (6), we obtain the correction to the partition function

$$R = \int d\mathbf{x}\; q(\mathbf{x}) \prod_n \big(1 + \varepsilon_n(\mathbf{x})\big) = 1 + \sum_{n_1 < n_2} \langle \varepsilon_{n_1} \varepsilon_{n_2} \rangle_q + \sum_{n_1 < n_2 < n_3} \langle \varepsilon_{n_1} \varepsilon_{n_2} \varepsilon_{n_3} \rangle_q + \dots \,, \qquad (7)$$

which is a finite series in terms of growing clusters of "interacting" variables $\varepsilon_n(\mathbf{x})$. Here the brackets $\langle \cdot \rangle_q$ denote expectations with respect to the distribution $q$. Note that the first order term $\sum_n \langle \varepsilon_n(\mathbf{x}) \rangle_q = 0$ vanishes by the normalisation of $q_n$ and $q$. As we will see later, the computation of corrections is feasible when $q_n$ is just a finite mixture of $K$ simpler densities from the exponential family to which $q$ belongs. Then the number of mixture components in the $j$-th term of the expansion of $R$ is of the order $O(K^j)$, and an evaluation of low order terms should be tractable. In a similar way, we get

$$p(\mathbf{x}) = \frac{q(\mathbf{x})}{R} \left[ 1 + \sum_n \varepsilon_n(\mathbf{x}) + \sum_{n_1 < n_2} \varepsilon_{n_1}(\mathbf{x})\, \varepsilon_{n_2}(\mathbf{x}) + \dots \right] \,.$$

Figure 1(a): This is a regression model $y_n = x_n + \nu_n$, where the i.i.d. noise variables $\nu_n$ have a uniform distribution and the observed outputs are all zero, i.e. $y_n = 0$. For this case, the exact posterior variance does not shrink to zero even if the number of data points goes to infinity. The EP approximation, however, has the variance decrease to zero, and our corrections increase with sample size.
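The cluster expansion of Section 3.1 can be checked numerically on a small example. The sketch below reuses the illustrative 1-D probit toy model (and the compact EP loop) from the sketch in Section 2; all data and variable names are our own assumptions. It evaluates $\varepsilon_n = q_n/q - 1$ on a grid and compares the exact $R$ of (6) with its second-order truncation from (7); by normalisation of $q$ and $q_n$ the first-order term vanishes identically, so the leading correction is the pairwise one.

import numpy as np
from scipy.stats import norm

# EP fixed point for the toy model of the Section 2 sketch (probit terms,
# N(0,1) prior); see that sketch for commentary on the update equations.
y = np.array([1.0, 1.0, -1.0, 1.0])
r, s = np.zeros(len(y)), np.zeros(len(y))
for sweep in range(100):
    for n in range(len(y)):
        s_cav, r_cav = 1.0 + s.sum() - s[n], r.sum() - r[n]
        v_c, m_c = 1.0 / s_cav, r_cav / s_cav
        z = y[n] * m_c / np.sqrt(1.0 + v_c)
        a = norm.pdf(z) / norm.cdf(z)
        m_new = m_c + v_c * y[n] * a / np.sqrt(1.0 + v_c)
        v_new = v_c - v_c**2 * a * (z + a) / (1.0 + v_c)
        s[n], r[n] = 1.0 / v_new - s_cav, m_new / v_new - r_cav

# The EP approximation q and the tilted distributions q_n on a grid.
xs = np.linspace(-8.0, 8.0, 16001)
q = norm.pdf(xs, loc=r.sum() / (1 + s.sum()), scale=np.sqrt(1 / (1 + s.sum())))
eps = []
for n in range(len(y)):
    g_n = np.exp(r[n] * xs - 0.5 * s[n] * xs**2)   # unnormalised Gaussian site
    q_n = q * norm.cdf(y[n] * xs) / g_n            # ∝ q f_n / g_n, as in (3)
    q_n /= np.trapz(q_n, xs)                       # normalising constant is Z_n
    eps.append(q_n / q - 1.0)                      # eps_n = q_n / q - 1

# Exact R from (6) versus its second-order truncation from (7).
R_exact = np.trapz(q * np.prod([1.0 + e for e in eps], axis=0), xs)
R_2nd = 1.0 + sum(np.trapz(q * eps[i] * eps[j], xs)
                  for i in range(len(eps)) for j in range(i + 1, len(eps)))
print("R exact    :", R_exact)
print("R 2nd order:", R_2nd)

When EP is accurate the two numbers agree closely and lie near 1; a large pairwise term, or a slowly shrinking sequence of higher-order terms, signals that the approximation should not be trusted.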
4.3 Ising models

Somewhat surprising (and probably less well known) is the fact that EP and our corrections apply well to a rather extreme limiting case of the GP model, where the terms are of the form $t_n(x_n) = e^{\theta_n x_n} \big( \delta(x_n + 1) + \delta(x_n - 1) \big)$, where $\delta(x)$ is the Dirac distribution. These terms, together with a "Gaussian" $f_0(\mathbf{x}) = \exp\!\big[ \tfrac{1}{2} \sum_{i \neq j} x_i J_{ij} x_j \big]$, define an Ising model over the binary variables $x_n \in \{-1, +1\}$.
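To see why these Dirac-delta terms still fit the scheme of Section 2, note that replacing a Gaussian site by $t_n$ inside the tilted distribution (3) restricts $x_n$ to $\pm 1$, so the tilted marginal is a two-point distribution with closed-form moments. The sketch below is our own illustration (variable names are assumptions, and the multivariate Gaussian message bookkeeping of a full implementation is omitted); it computes these moments given a Gaussian cavity marginal N(m_c, v_c) on $x_n$.

import numpy as np

# Tilted moments for one Ising term t_n(x_n) = exp(theta * x_n) *
# (delta(x_n - 1) + delta(x_n + 1)) under a Gaussian cavity N(m_c, v_c).
# The deltas restrict x_n to {-1, +1}; the log-odds of +1 versus -1 are
#   2*theta + [(1 + m_c)**2 - (1 - m_c)**2] / (2 * v_c) = 2*(theta + m_c/v_c),
# so the tilted mean is tanh(theta + m_c/v_c) and, since x_n**2 = 1,
# the tilted variance is 1 - mean**2.
def tilted_moments(theta, m_c, v_c):
    mean = np.tanh(theta + m_c / v_c)
    var = 1.0 - mean**2
    return mean, var

print(tilted_moments(theta=0.3, m_c=0.1, v_c=0.5))

These are the moments a Gaussian EP implementation would match when updating the site for term $n$; the tilted distributions are then two-component mixtures, so the low-order cluster corrections of Section 3.1 remain computable.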