Improving on Expectation Propagation

Manfred Opper
Computer Science, TU Berlin
opperm@cs.tu-berlin.de

Ulrich Paquet
Computer Laboratory, University of Cambridge
ulrich@cantab.net

Ole Winther
Informatics and Mathematical Modelling, Technical University of Denmark
owi@imm.dtu.dk

Abstract

A series of corrections is developed for the fixed points of Expectation Propagation (EP), which is one of the most popular methods for approximate probabilistic inference. These corrections can lead to improvements of the inference approximation or serve as a sanity check, indicating when EP yields unreliable results.

1 Introduction

The expectation propagation (EP) message passing algorithm is often considered the method of choice for approximate Bayesian inference when both good accuracy and computational efficiency are required [4]. One recent example is a comparison of EP with extensive MCMC simulations for Gaussian process (GP) classifiers [3], which has shown that not only the predictive distribution, but also the typically much harder marginal likelihood (the partition function) of the data, are approximated remarkably well for a variety of data sets. However, while such empirical studies hold great value, they cannot guarantee the same performance on other data sets or when completely different types of Bayesian models are considered.

In this paper methods are developed to assess the quality of the EP approximation. We compute explicit expressions for the remainder terms of the approximation. This leads to various corrections for partition functions and posterior distributions. Under the hypothesis that the EP approximation works well, we identify quantities which can be assumed to be small and which can be used in a series expansion of the corrections with increasing complexity. The computation of low order corrections in this expansion is often feasible, typically requires only moderate computational effort, and can lead to an improvement of the EP approximation or to an indication that the approximation cannot be trusted.

2 Expectation Propagation in a Nutshell

Since it is the goal of this paper to compute corrections to the EP approximation, we will not discuss details of EP algorithms but rather characterise the fixed points which are reached when such algorithms converge. EP is applied to probabilistic models with an unobserved latent variable $\mathbf{x}$ having an intractable distribution $p(\mathbf{x})$. In applications $p(\mathbf{x})$ is usually the Bayesian posterior distribution conditioned on a set of observations. Since the dependency on the latter variables is not important for the subsequent theory, we will omit them from our notation.

It is assumed that $p(\mathbf{x})$ factorizes into a product of terms $f_n$ such that

$$p(\mathbf{x}) = \frac{1}{Z} \prod_n f_n(\mathbf{x}) \,, \qquad (1)$$

where the normalising partition function $Z = \int d\mathbf{x} \prod_n f_n(\mathbf{x})$ is also intractable. We then assume an approximation to $p(\mathbf{x})$ in the form

$$q(\mathbf{x}) = \prod_n g_n(\mathbf{x}) \,, \qquad (2)$$

where the terms $g_n(\mathbf{x})$ belong to a tractable, e.g. exponential, family of distributions. To compute the optimal parameters of the $g_n$ term approximation, a set of auxiliary tilted distributions is defined via

$$q_n(\mathbf{x}) = \frac{1}{Z_n} \frac{q(\mathbf{x})\, f_n(\mathbf{x})}{g_n(\mathbf{x})} \,. \qquad (3)$$

Here a single approximating term $g_n$ is replaced by the original term $f_n$. Assuming that this replacement leaves $q_n$ still tractable, the parameters in $g_n$ are determined by the condition that $q(\mathbf{x})$ and all $q_n(\mathbf{x})$ should be made as similar as possible. This is usually achieved by requiring that these distributions share a set of generalised moments (which usually coincide with the sufficient statistics of the exponential family).
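As a concrete illustration of these fixed-point conditions, the following minimal Python sketch implements EP with moment matching for a toy 1-D model. The model, the pseudo-data y, and all variable names are our own illustrative assumptions, not an example from the paper: the prior is N(0,1), the intractable terms are probit-style factors f_n(x) = Phi(y_n x), and the sites are unnormalised Gaussians g_n(x) proportional to exp(r_n x - s_n x^2/2), so q and each tilted distribution q_n in (3) can be matched on their first two moments.

import numpy as np
from scipy.stats import norm

# Minimal EP sketch for the toy model p(x) ∝ N(x; 0, 1) prod_n Phi(y_n x).
# Each term f_n(x) = Phi(y_n x) gets a Gaussian site g_n(x) ∝ exp(r_n x - s_n x²/2);
# sites are updated so that q matches the mean and variance of the tilted q_n.
y = np.array([1.0, 1.0, -1.0, 1.0])   # assumed pseudo-observations
r = np.zeros(len(y))                   # site linear (natural) parameters
s = np.zeros(len(y))                   # site precisions

for sweep in range(100):
    for n in range(len(y)):
        # Cavity distribution: q with site n removed (the prior has precision 1).
        s_cav = 1.0 + s.sum() - s[n]
        r_cav = r.sum() - r[n]
        v_c, m_c = 1.0 / s_cav, r_cav / s_cav
        # Closed-form mean/variance of q_n(x) ∝ N(x; m_c, v_c) Phi(y_n x).
        z = y[n] * m_c / np.sqrt(1.0 + v_c)
        a = norm.pdf(z) / norm.cdf(z)
        m_new = m_c + v_c * y[n] * a / np.sqrt(1.0 + v_c)
        v_new = v_c - v_c**2 * a * (z + a) / (1.0 + v_c)
        # Moment matching: reset site n so that q acquires the tilted moments.
        s[n] = 1.0 / v_new - s_cav
        r[n] = m_new / v_new - r_cav

m_q, v_q = r.sum() / (1.0 + s.sum()), 1.0 / (1.0 + s.sum())

# Sanity check against numerical quadrature on a grid.
xs = np.linspace(-8.0, 8.0, 8001)
post = norm.pdf(xs) * np.prod(norm.cdf(y[:, None] * xs), axis=0)
post /= np.trapz(post, xs)
m_ex = np.trapz(xs * post, xs)
v_ex = np.trapz(xs**2 * post, xs) - m_ex**2
print("EP   :", m_q, v_q)
print("exact:", m_ex, v_ex)

At convergence q shares its first two moments with every tilted distribution q_n, which is exactly the expectation-consistency condition discussed next.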
Note that we will not assume that this expectation consistency [7] for the moments is derived by minimising a Kullback-Leibler divergence, as was done in the original derivations of EP [4]. Such an assumption would limit the applicability of the approximate inference and would exclude, e.g., the approximation of models with binary, Ising variables by a Gaussian model, as in one of the applications in the last section. The corresponding approximation to the normalising partition function in (1) was given in [7] and [6] and reads in our present notation¹

$$Z_{EP} = \prod_n Z_n \,. \qquad (4)$$

¹The definition of the partition functions $Z_n$ is slightly different from previous works.

3 Corrections to EP

An expression for the remainder terms which are neglected by the EP approximation can be obtained by solving for $f_n$ in (3) and taking the product:

$$\prod_n f_n(\mathbf{x}) = \prod_n \frac{Z_n\, q_n(\mathbf{x})\, g_n(\mathbf{x})}{q(\mathbf{x})} = Z_{EP}\; q(\mathbf{x}) \prod_n \frac{q_n(\mathbf{x})}{q(\mathbf{x})} \,. \qquad (5)$$

Hence $Z = \int d\mathbf{x} \prod_n f_n(\mathbf{x}) = Z_{EP}\, R$, with

$$R = \int d\mathbf{x}\; q(\mathbf{x}) \prod_n \frac{q_n(\mathbf{x})}{q(\mathbf{x})} \qquad \text{and} \qquad p(\mathbf{x}) = \frac{q(\mathbf{x})}{R} \prod_n \frac{q_n(\mathbf{x})}{q(\mathbf{x})} \,. \qquad (6)$$

This shows that corrections to EP are small when all distributions $q_n$ are indeed close to $q$, justifying the optimality criterion of EP. For a similar attempt at corrections to loopy belief propagation, see [8].

Exact probabilistic inference with the corrections described here again leads to intractable computations. However, we can derive exact perturbation expansions involving a series of corrections with increasing computational complexity. Assuming that EP already yields a good approximation, the computation of a small number of these terms may be sufficient to obtain the most dominant corrections. On the other hand, when the leading corrections come out large, or do not decrease sufficiently with order, this may indicate that the EP approximation is inaccurate. Two such perturbation expansions are presented in this section.

3.1 Expansion I: Clusters

The most basic expansion is based on the variables $\varepsilon_n(\mathbf{x}) = \frac{q_n(\mathbf{x})}{q(\mathbf{x})} - 1$, which we can assume to be typically small when the EP approximation is good. Expanding the products in (6), we obtain the correction to the partition function

$$R = \int d\mathbf{x}\; q(\mathbf{x}) \prod_n \big(1 + \varepsilon_n(\mathbf{x})\big) = 1 + \sum_{n_1 < n_2} \langle \varepsilon_{n_1} \varepsilon_{n_2} \rangle_q + \sum_{n_1 < n_2 < n_3} \langle \varepsilon_{n_1} \varepsilon_{n_2} \varepsilon_{n_3} \rangle_q + \dots \,, \qquad (7)$$

which is a finite series in terms of growing clusters of "interacting" variables $\varepsilon_n(\mathbf{x})$. Here the brackets $\langle \cdot \rangle_q$ denote expectations with respect to the distribution $q$. Note that the first order term $\sum_n \langle \varepsilon_n(\mathbf{x}) \rangle_q = 0$ vanishes by the normalisation of $q_n$ and $q$. As we will see later, the computation of corrections is feasible when $q_n$ is just a finite mixture of $K$ simpler densities from the exponential family to which $q$ belongs. Then the number of mixture components in the $j$-th term of the expansion of $R$ is of the order $O(K^j)$, and an evaluation of low order terms should be tractable. In a similar way, we get

$$p(\mathbf{x}) = \frac{q(\mathbf{x})}{R} \left[ 1 + \sum_n \varepsilon_n(\mathbf{x}) + \sum_{n_1 < n_2} \varepsilon_{n_1}(\mathbf{x})\, \varepsilon_{n_2}(\mathbf{x}) + \dots \right] \,.$$

Figure 1(a): This is a regression model $y_n = x_n + \nu_n$, where the i.i.d. noise variables $\nu_n$ have a uniform distribution and the observed outputs are all zero, i.e. $y_n = 0$. For this case, the exact posterior variance does not shrink to zero even if the number of data points goes to infinity. The EP approximation, however, has the variance decrease to zero, and our corrections increase with sample size.
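The cluster expansion of Section 3.1 can be checked numerically on a small example. The sketch below reuses the illustrative 1-D probit toy model (and the compact EP loop) from the sketch in Section 2; all data and variable names are our own assumptions. It evaluates $\varepsilon_n = q_n/q - 1$ on a grid and compares the exact $R$ of (6) with its second-order truncation from (7); by normalisation of $q$ and $q_n$ the first-order term vanishes identically, so the leading correction is the pairwise one.

import numpy as np
from scipy.stats import norm

# EP fixed point for the toy model of the Section 2 sketch (probit terms,
# N(0,1) prior); see that sketch for commentary on the update equations.
y = np.array([1.0, 1.0, -1.0, 1.0])
r, s = np.zeros(len(y)), np.zeros(len(y))
for sweep in range(100):
    for n in range(len(y)):
        s_cav, r_cav = 1.0 + s.sum() - s[n], r.sum() - r[n]
        v_c, m_c = 1.0 / s_cav, r_cav / s_cav
        z = y[n] * m_c / np.sqrt(1.0 + v_c)
        a = norm.pdf(z) / norm.cdf(z)
        m_new = m_c + v_c * y[n] * a / np.sqrt(1.0 + v_c)
        v_new = v_c - v_c**2 * a * (z + a) / (1.0 + v_c)
        s[n], r[n] = 1.0 / v_new - s_cav, m_new / v_new - r_cav

# The EP approximation q and the tilted distributions q_n on a grid.
xs = np.linspace(-8.0, 8.0, 16001)
q = norm.pdf(xs, loc=r.sum() / (1 + s.sum()), scale=np.sqrt(1 / (1 + s.sum())))
eps = []
for n in range(len(y)):
    g_n = np.exp(r[n] * xs - 0.5 * s[n] * xs**2)   # unnormalised Gaussian site
    q_n = q * norm.cdf(y[n] * xs) / g_n            # ∝ q f_n / g_n, as in (3)
    q_n /= np.trapz(q_n, xs)                       # normalising constant is Z_n
    eps.append(q_n / q - 1.0)                      # eps_n = q_n / q - 1

# Exact R from (6) versus its second-order truncation from (7).
R_exact = np.trapz(q * np.prod([1.0 + e for e in eps], axis=0), xs)
R_2nd = 1.0 + sum(np.trapz(q * eps[i] * eps[j], xs)
                  for i in range(len(eps)) for j in range(i + 1, len(eps)))
print("R exact    :", R_exact)
print("R 2nd order:", R_2nd)

When EP is accurate the two numbers agree closely and lie near 1; a large pairwise term, or a slowly shrinking sequence of higher-order terms, signals that the approximation should not be trusted.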
4.3 Ising models

Somewhat surprising (and probably less well known) is the fact that EP and our corrections apply well to a rather extreme limiting case of the GP model, where the terms are of the form $t_n(x_n) = e^{\theta_n x_n} \big( \delta(x_n + 1) + \delta(x_n - 1) \big)$, where $\delta(x)$ is the Dirac distribution. These terms, together with a "Gaussian" $f_0(\mathbf{x}) = \exp\!\big[ \tfrac{1}{2} \sum_{i \neq j} x_i J_{ij} x_j \big]$, define an Ising model over the binary variables $x_n \in \{-1, +1\}$.
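To see why these Dirac-delta terms still fit the scheme of Section 2, note that replacing a Gaussian site by $t_n$ inside the tilted distribution (3) restricts $x_n$ to $\pm 1$, so the tilted marginal is a two-point distribution with closed-form moments. The sketch below is our own illustration (variable names are assumptions, and the multivariate Gaussian message bookkeeping of a full implementation is omitted); it computes these moments given a Gaussian cavity marginal N(m_c, v_c) on $x_n$.

import numpy as np

# Tilted moments for one Ising term t_n(x_n) = exp(theta * x_n) *
# (delta(x_n - 1) + delta(x_n + 1)) under a Gaussian cavity N(m_c, v_c).
# The deltas restrict x_n to {-1, +1}; the log-odds of +1 versus -1 are
#   2*theta + [(1 + m_c)**2 - (1 - m_c)**2] / (2 * v_c) = 2*(theta + m_c/v_c),
# so the tilted mean is tanh(theta + m_c/v_c) and, since x_n**2 = 1,
# the tilted variance is 1 - mean**2.
def tilted_moments(theta, m_c, v_c):
    mean = np.tanh(theta + m_c / v_c)
    var = 1.0 - mean**2
    return mean, var

print(tilted_moments(theta=0.3, m_c=0.1, v_c=0.5))

These are the moments a Gaussian EP implementation would match when updating the site for term $n$; the tilted distributions are then two-component mixtures, so the low-order cluster corrections of Section 3.1 remain computable.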