Gaussian Covariance and Scalable Variational Inference

Matthias W. Seeger, mseeger@mmci.uni-saarland.de
Saarland University and Max Planck Institute for Informatics, Campus E1.7, 66123 Saarbruecken, Germany

Abstract

We analyze computational aspects of variational approximate inference techniques for sparse linear models, which have to be understood to allow for large scale applications. Gaussian covariances play a key role, whose approximation is computationally hard. While most previous methods gain scalability by not even representing most posterior dependencies, harmful factorization assumptions can be avoided by employing data-dependent low-rank approximations instead. We provide theoretical and empirical insights into algorithmic and statistical consequences of low-rank covariance approximation errors on decision outcomes in nonlinear sequential Bayesian experimental design.

1. Introduction

Sparse linear models (SLMs) enjoy enormous popularity in high-dimensional statistics, signal and image processing, and machine learning. A large part of this success story is due to regard for computational details: maximum a posteriori (MAP) estimation is formulated in terms of convex problems, which are reduced to standard primitives of numerical mathematics and digital signal processing.
In such point estimation techniques, the Bayesian posterior is used as a criterion to be maximized rather than a distribution to be approximated and queried. Many applications require posterior information beyond its mode's location. Decision theory and Bayesian experimental design can be used to optimize sampling patterns (Seeger et al., 2009) or data acquisition, and sparse bilinear model reconstruction is greatly improved by Bayesian averaging (Levin et al., 2009). However, today's approximate inference technology lags far behind MAP estimation in terms of scalability, robustness, and theoretical understanding. In this paper, we focus on computational aspects of scalable variational inference for large SLMs.

Bayesian inference is hard and useful for the same underlying reason: the emergence of very many nonlocal dependencies in the posterior distribution. In the large scale continuous-variable context, these are approximated by Gaussian covariances of restricted structure and dimensionality. The choice of these restrictions not only impacts the final best fit to the posterior, but also the optimization process leading there. By far most methods to date attain scalability through factorization assumptions, whereby all dependencies are forced into a predetermined form, and most of them are ruled out up front. In contrast, Seeger et al. (2009) show how to avoid factorizations entirely, using low-rank covariance approximations such as PCA or the Lanczos algorithm (Schneider & Willsky, 2001) instead. The latter concept of tracking a limited number of principal covariance directions alongside the variational optimization has advantages in practice, since most Bayesian decision making or experimental design applications are driven by dominating modes of posterior dependencies.

We point out the fundamental role of Gaussian (co)variance computations for large scale variational inference and experimental design in Section 2, and review approximation methods in Section 3. Our main contribution is an analysis of how low-rank Gaussian covariance approximations affect inference outcomes in the framework of Seeger et al. (2009). First, we prove that if covariances are approximated by PCA rather than computed exactly, their algorithm remains convergent. Our argument is based on convexity of spectral functions (Davis, 1957). Second, we show that in the context of SLM inference, PCA approximation errors lead to a systematic strengthening of the sparsity regularization. A running example in this paper is optimizing real-world image acquisition (adaptive compressive sensing) by Bayesian experimental design (Seeger et al., 2009). We discuss the impact of PCA approximations on design score computations and sequential decisions, and provide experimental results on real-world images in Section 5.

2. Variational Inference for Sparse Linear Models

We are interested in sparse linear models (SLMs) applied to image reconstruction (see (Seeger, 2008) for a detailed exposition). A latent image u ∈ R^n (n pixels) is sampled by way of a design matrix X ∈ R^{m×n}. Observations y ∈ R^m are modelled as y = Xu + ε, where ε ∼ N(0, σ²I) is noise of variance σ². For example, X is a partial discrete Fourier transform in MRI reconstruction applications (Seeger et al., 2009). The sparsity of filter coefficients s = Bu ∈ R^q (such as wavelet coefficients or spatial derivatives) is encouraged by way of a Laplacian prior distribution P(u) ∝ ∏_{j=1}^q e^{−τ_j |s_j|/σ}. We are interested in sparse inference, for example to approximate the posterior covariance in order to optimize the design X (see Section 2.2). Even for modest image resolutions, n (number of pixels) is beyond 100000, with m (number of samples) and q (number of sparsity coefficients) of the same order. Note that in many image reconstruction applications, X is neither sparse, nor has simple graphical model structure. The posterior distribution has the form

P(u|y) = Z^{−1} N(y | Xu, σ²I) ∏_{j=1}^q e^{−τ_j |s_j|/σ},   s = Bu,   (1)

which we would like to integrate against and obtain moments of. This is hard for two reasons coming together. First, P(u|y) is highly coupled, since X is neither diagonal nor sparse. Second, it is non-Gaussian due to the Laplacian prior potentials. In large scale regimes, a third problem is the sheer size even of basic moments such as the covariance matrix. In cases of interest here, inference is practically intractable even for linear models with Gaussian prior.
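As a concrete companion to these definitions, here is a minimal NumPy sketch of the model (1) on a toy problem. The sizes, the dense design matrix, and the identity sparsity transform are illustrative assumptions made only for this sketch; at the scales discussed above, X is a fast transform applied implicitly and is never stored as a matrix.

```python
import numpy as np

# Toy instantiation of the SLM of equation (1). All sizes are illustrative;
# in the applications above, n is on the order of 10^5 and X is a partial
# Fourier transform accessed only through matrix-vector products.
rng = np.random.default_rng(0)
n, m, q = 64, 32, 64
sigma = np.sqrt(1e-3)                     # noise standard deviation
tau = np.full(q, 0.1)                     # Laplacian scales tau_j
X = rng.standard_normal((m, n))           # design matrix (dense only in the toy)
B = np.eye(q, n)                          # sparsity transform (wavelets, derivatives, ...)
u_true = rng.laplace(scale=0.2, size=n)   # sparse-ish ground truth image
y = X @ u_true + sigma * rng.standard_normal(m)

def log_posterior_unnorm(u):
    """Log of the posterior (1), up to the constant -log Z."""
    s = B @ u
    return (-0.5 * np.sum((y - X @ u) ** 2) / sigma**2
            - np.sum(tau * np.abs(s)) / sigma)
```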
2.1. Scalable Algorithms

How can P(u|y) be approximated at large scales? Most methods make use of a Gaussian approximation Q(u|y; γ), either fitting Q to P globally, or using Q as a carrier for self-consistency equations between marginals. The rationale is that global covariances can still be represented this way, while Gaussian integrals are tractable to compute. In this section, we show that large scale variational inference crucially relies on bulk computation of Gaussian variances.

Most algorithms to date can be grouped into two classes: either [1] Q(u|y) is restricted to factorize, with factors ranging over small disjoint subsets of u, or [2] updates are done based on marginals Q(s_j|y), j = 1, ..., q, kept up-to-date by message passing. Both notions lead to easily implementable algorithms, iterating between local factor or node updates and Gaussian message passing. Unfortunately, none of these approaches result in scalable algorithms in general. Both factorization assumptions and single-marginal updating lead to non-convex inference relaxations in all cases we know of. More important, while each single update is easy to do, far too many of them are required until convergence. For [2], we require q updates to even visit each marginal, and the absence of a sparsity or graph structure of X precludes fast message passing in between: an n × n linear system has to be solved for each update. In order to be applied at very large scales, a variational algorithm has to do few global iterations¹ until convergence, which in turn have to be reducible to scalable computational problems.

¹ In principle, methods from [1] and [2] can be run doing parallel updates, which would look like "global" steps. However, single components of such parallel updates are derived by assuming the rest of Q(u|y) remains static: they are not coupled themselves. To our knowledge, parallel updating for [1] or [2] has not been reported to run faster for SLMs than simpler sequential variants.

A scalable variational inference method has been proposed by Seeger et al. (2009), see also (Nickisch & Seeger, 2009). We sketch some details to outline computational demands and prepare the ground for further analysis. Variational methods target the log partition function log Z of (1), the cumulant-generating function of the posterior. It is lower bounded by plugging in the representation e^{−τ_j|s_j|/σ} = max_{γ_j>0} e^{−(s_j/σ)²/(2γ_j) − τ_j²γ_j/2} for Laplacian sites, then interchanging max_{γ⪰0} with the integral over u. The variational problem constitutes finding the closest bound over variational parameters γ:

min_{γ⪰0} φ(γ)   with   φ(γ) = log|A| + (τ²)^T γ + min_u R(u, γ),   R := σ^{−2} (‖y − Xu‖² + s^T Γ^{−1} s),   (2)

where Γ := diag γ and s = Bu. The posterior P(u|y) is fitted by a Gaussian approximation Q(u|y) = N(u | u_*, σ²A^{−1}) parameterized in terms of γ, in that A = X^T X + B^T Γ^{−1} B and u_* = u_*(γ). Denote posterior marginals by Q(s_j|y) = N(h_j, σ²z_j) in the sequel, z = (z_j). It is easy to see that z_j = σ^{−2} Var_Q[s_j|y] ≤ γ_j: the variational parameters directly control the variances. SLMs implement selective shrinkage, in that most |s_j| are strongly forced to small values, while some (the "relevant" ones) are shrunken little at all. The key statistic for sorting coefficients in this way is posterior variance, and the role of the γ_j is to implement selective shrinkage within the Gaussian approximation Q(u|y).

The variances are not only statistically decisive, but also computationally. For gradient-based minimization of φ, we certainly require ∇_{γ^{−1}} log|A| = diag^{−1}(B A^{−1} B^T) = z. While all marginal means (h_j) can be computed by solving a single linear system with conjugate gradients (Golub & Van Loan, 1996), bulk variance computation is much more difficult. A key property of the scalable algorithm is that Gaussian variances have to be computed few times only until convergence. By affinely upper-bounding the coupling term log|A| in (2), we iterate between the following two steps (called outer loop update and inner loop minimization):

z ← diag^{−1}(B A^{−1} B^T),
u_* ← argmin_u φ_z(u),   φ_z(u) := σ^{−2} ‖y − Xu‖² + 2 ∑_j τ_j (z_j + (s_j/σ)²)^{1/2},   s = Bu,   (3)
γ_j ← τ_j^{−1} (z_j + (|s_{*,j}|/σ)²)^{1/2},   s_* = B u_*.

Updating (u_*, γ) is a standard penalized least squares problem, which can be solved at large scales. Variance computations are required only for computing z, which happens few times until convergence. To sum up, the main computational primitives of large scale SLM variational inference are Gaussian means and variances; the latter are much more difficult to approximate. Gaussian mean computations (least squares) are bread-and-butter in any computational discipline, while variance computations, not required for point estimation, are less frequently addressed. Among inference algorithms aiming at large scales, those stand out which require Gaussian variances as seldom as possible. Note that variances are not required in MAP and most other sparse estimation methods. In the context of large scale sparse linear models, the requirement of Gaussian variances is the most important computational difference between variational sparse inference and sparse estimation. Since they can in general not be approximated to close relative accuracy, it is important to understand effects of Gaussian variance errors on variational inference outcomes.
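The following is a small, self-contained sketch of the double loop iteration (3) on a toy problem. It is not the authors' implementation: exact variances are obtained from a dense inverse for clarity (at real scales z comes from the Lanczos/PCA approximations of Section 3), and the inner problem is handed to a generic quasi-Newton solver instead of a dedicated penalized least squares code. All sizes and data are assumptions of the sketch.

```python
import numpy as np
from scipy.optimize import minimize

# Toy double-loop sketch of iteration (3); only the outer/inner loop structure
# mirrors the algorithm described above.
rng = np.random.default_rng(0)
n, m, q, sigma2 = 64, 32, 64, 1e-3
tau = np.full(q, 0.1)
X = rng.standard_normal((m, n))
B = np.eye(q, n)
y = X @ rng.standard_normal(n) + np.sqrt(sigma2) * rng.standard_normal(m)

def outer_update(gamma):
    """Outer loop: z = diag(B A^{-1} B^T), A = X^T X + B^T Gamma^{-1} B."""
    A = X.T @ X + B.T @ (B / gamma[:, None])
    A_inv = np.linalg.inv(A)                 # Lanczos/PCA approximation in practice
    return np.einsum('ij,jk,ik->i', B, A_inv, B)

def inner_criterion(u, z):
    """Inner objective: sigma^{-2}||y - Xu||^2 + 2 sum_j tau_j sqrt(z_j + (s_j/sigma)^2)."""
    s = B @ u
    return (np.sum((y - X @ u) ** 2) / sigma2
            + 2.0 * np.sum(tau * np.sqrt(z + s ** 2 / sigma2)))

u, gamma = np.zeros(n), np.ones(q)
for _ in range(5):                           # a handful of outer loop iterations
    z = outer_update(gamma)
    u = minimize(inner_criterion, u, args=(z,), method='L-BFGS-B').x
    gamma = np.sqrt(z + (B @ u) ** 2 / sigma2) / tau   # closed-form gamma update
```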
2.2. Sequential Bayesian Experimental Design

Approximating a SLM posterior P(u|y) by a Gaussian Q(u|y) is hard at large scales. But how accurate do we have to be? Which additional structural restrictions on Q can be tolerated? Fortunately, while a uniformly close approximation of P(u|y) or its marginals P(s_j|y) is presently unattainable at large scales, this is too much to ask for in typical applications, such as the image acquisition optimization problem. In this section, we argue that in many such decision making scenarios, the critical information is the maximum covariance directions of the posterior, a fact which directly motivates the PCA and Lanczos covariance approximations discussed in Section 3.

In Bayesian experimental design (ED), the design matrix X is optimized sequentially, appending parts X_* which maximize an information gain score, in this case Δ(X_*) := log|I + X_* Cov_Q[u|y] X_*^T| (Seeger et al., 2009). Considering {Δ(X_*)} over many candidates X_*, it is the posterior covariance Cov_Q[u|y] these score values depend upon: the mode or mean of Q(u|y) is irrelevant. Δ(X_*) measures the overlap of X_* with leading eigendirections of Cov_Q[u|y], the directions of maximum posterior covariance. In order to drive Bayesian ED successfully, it is not necessary to closely approximate all of P(u|y): decision making depends mainly on its leading covariance eigendirections. Consider the extreme case, where X_* is a single row and all unit norm vectors are candidates: the score maximizer is the single maximal eigenvector of Cov_Q[u|y] = σ²A^{−1}. While A cannot even be stored as a dense matrix, let alone inverted, the leading eigenvector of Cov_Q[u|y] can be obtained tractably (see Section 3). On the other hand, it is not enough to approximate posterior marginals only, or to fit any factorized distribution to P(u|y): neither gives sufficient information about the leading covariance eigenmodes in general.
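As an illustration, here is a small sketch of how such scores might be evaluated once the posterior covariance is only available through a rank-L factor, Cov_Q[u|y] ≈ σ² U Λ^{−1} U^T (the L-PCA approximation introduced in Section 3). The candidate set, sizes, and names are assumptions of this sketch, not the setup used in the experiments of Section 5.

```python
import numpy as np

# Sketch: scoring design extensions by Delta(X_*) = log|I + X_* Cov X_*^T| when
# the posterior covariance is only known through a low-rank factor,
# Cov ~= sigma^2 U diag(1/lam) U^T. Candidates and sizes are illustrative.
def design_scores(candidates, U, lam, sigma2):
    """candidates: iterable of (d x n) matrices X_*; U: (n x L) orthonormal;
    lam: the L smallest eigenvalues of A (so 1/lam are the largest of A^{-1})."""
    scores = []
    for X_star in candidates:
        V = X_star @ U                                    # project onto kept modes
        M = np.eye(X_star.shape[0]) + sigma2 * (V / lam) @ V.T
        scores.append(np.linalg.slogdet(M)[1])
    return np.array(scores)

# toy usage: pick the best of 20 random 4-row candidates
rng = np.random.default_rng(1)
n, L = 64, 10
U, _ = np.linalg.qr(rng.standard_normal((n, L)))          # stand-in eigenvectors
lam = np.sort(rng.uniform(0.5, 5.0, L))                   # stand-in eigenvalues of A
cands = [rng.standard_normal((4, n)) for _ in range(20)]
best = int(np.argmax(design_scores(cands, U, lam, sigma2=1e-3)))
```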
3. Approximating Gaussian Variances

We exposed Gaussian variance computation as a general bottleneck for scalable variational inference, no matter which specific method is used. As discussed in Section 2.1, most previous algorithms are not scalable up front, since variances are computed one by one, and consequently there has been little machine learning interest in Gaussian variances so far. Still, their dominant role in (Seeger et al., 2009) and in scalable inference in general motivates closer attention. Methods have been proposed in spatial statistics (Willsky, 2002), often exploiting graph structure or sparsity of X, neither of which is present in our case.

A general idea for approximating variances is to estimate them from a number L ≪ n of linear projections, thus to approximate Cov_Q[u|y] by a low-rank matrix. In special cases, good projections can be constructed based on prior knowledge about signal structure (Malioutov et al., 2008). Perhaps the most promising general approach is to choose projections that capture as much covariance as possible: the L principal components. Namely, if A ≈ U Λ U^T (the L smallest eigenvalues/-vectors of A), then Cov_Q[u|y] ≈ σ² U Λ^{−1} U^T and z^(L) := diag^{−1}(B U Λ^{−1} U^T B^T). Of course, L-PCA remains an academic exercise without a scalable way of computing U, Λ for L sufficiently large. Fortunately, this can often be done by way of the Lanczos algorithm (Schneider & Willsky, 2001). Details about the Lanczos algorithm are found in (Golub & Van Loan, 1996). At one matrix-vector multiplication (MVM) with A per iteration, we obtain Q_k^T A Q_k = T_k, with Q_k ∈ R^{n×k} orthonormal and T_k tridiagonal. A's extremal eigenvalues/-vectors are closely approximated by those of Q_k T_k Q_k^T (the latter are called Ritz values/vectors). Eigenconvergence can easily be monitored inside the method: the SVD of T_k can be done in O(k²). For example, the maximum eigenvector of A is typically obtained to high accuracy after few iterations (see Section 2.2). In practice, L-PCA is approximated by running Lanczos for K > L steps, until the L smallest eigenvalues of A have converged.

Whether this is feasible or not depends on the spectral structure of A. Lanczos convergence theory (Golub & Van Loan, 1996) states that eigenvalues converge from the fringe of the spectrum inwards, roughly ordered by the gap size between neighbouring entries. For example, if spec(A) decreases geometrically, eigenvalues converge from largest to smallest. However, precision matrices in typical SLM scenarios show roughly linear spectral decay (Seeger, 2009), so that largest and smallest eigenvalues converge even for K ≪ n. At present, the Lanczos algorithm may well be the best general method for variance approximations in the context of SLM variational inference. Of course, rather than settling for L-PCA after K Lanczos steps, we may as well make use of the complete Lanczos representation, approximating A^{−1} by Q_K T_K^{−1} Q_K^T instead of U Λ^{−1} U^T, referred to as K-Lanczos approximation in the sequel. The corresponding variance estimator is z_K = diag^{−1}(B Q_K T_K^{−1} Q_K^T B^T).

The Lanczos algorithm is not easy to implement or analyze, and comes with a higher cost than conjugate gradients. Ironically, the convergence of Ritz values causes the difficulties: they continue to contaminate subsequent steps by way of numerical roundoff, avoided only by orthogonalizing each new vector against all previously converged Ritz vectors (or all columns of Q_k). Therefore, Q_k has to be stored at O(n K), and orthogonalization costs up to O(n K²). Can accurate variances be obtained with few iterations? In general, this is possible only if A's spectrum decays geometrically, which does not happen for typical system matrices in our case. The part of the spectrum we miss out on with moderate K carries substantial mass. A^{−1} is not closely approximated by any matrix of low rank in terms of (co)variance explained, and relative errors of z_K tend to be substantial. This uniformly bleak picture will be contrasted with a fine-grained analysis in Section 4.1, exposing structure in Lanczos variance errors which can be beneficial for SLM inference applications.

Can the Lanczos algorithm be improved in the context of variance approximation? After all, we do not require eigenvectors/-values as such, but a specific estimate based on them only. First, reorthogonalization cannot be skipped: doing so renders the Lanczos algorithm practically useless if more than the leading eigendirection is required. Deflation time can be saved by selective orthogonalization (Golub & Van Loan, 1996), whereby Ritz vector convergence is monitored during the course of the algorithm. Unfortunately, this requires even more memory, and in some comparisons of ours did not lead to substantial speed-ups.
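Below is a small, self-contained sketch of the K-Lanczos variance estimator z_K = diag(B Q_K T_K^{−1} Q_K^T B^T) with full reorthogonalization, checked against exact variances on a toy SPD matrix. The matrix and sizes are stand-ins; a practical implementation would be matrix-free, monitor Ritz convergence, and handle breakdown.

```python
import numpy as np

# K-Lanczos variance sketch: build Q_K, T_K with full reorthogonalization and
# form z_K = diag(B Q_K T_K^{-1} Q_K^T B^T). The SPD matrix A below is a generic
# stand-in for the precision matrix; real code would only use MVMs with A.
def lanczos(matvec, n, K, rng):
    Q = np.zeros((n, K))
    alpha, beta = np.zeros(K), np.zeros(K - 1)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    for k in range(K):
        Q[:, k] = v
        w = matvec(v)
        alpha[k] = v @ w
        w -= Q[:, :k + 1] @ (Q[:, :k + 1].T @ w)   # full reorthogonalization
        if k < K - 1:
            beta[k] = np.linalg.norm(w)
            v = w / beta[k]
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return Q, T

rng = np.random.default_rng(2)
n, K = 100, 30
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                            # toy SPD "precision matrix"
B = np.eye(n)
Q, T = lanczos(lambda v: A @ v, n, K, rng)
BQ = B @ Q
z_K = np.einsum('ij,jk,ik->i', BQ, np.linalg.inv(T), BQ)
z_exact = np.diag(B @ np.linalg.inv(A) @ B.T)
assert np.all(z_K <= z_exact + 1e-8)               # z_K never exceeds the true z
```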
4. Consequences of Lanczos/PCA Approximations

In this section, we analyze effects of Lanczos variance approximation errors on SLM variational inference, within the double loop algorithm of Seeger et al. (2009) sketched in (3). First, we highlight the overall statistical role played by structure in these errors. Second, by using a result on convexity of spectral functions, we show that the convergence proof of the algorithm is retained with L-PCA variance approximation.

4.1. Lanczos Approximations and Sparsity

Gaussian variances are fundamental for sparse Bayesian inference (Section 2.1) and experimental design (Section 2.2), yet cannot be obtained to high relative accuracy for large scale models of interest (Section 3). Why can we still obtain sensible results? In this section, we aim to understand the statistical role of Lanczos variance approximation errors. Recalling from Section 2.1 that sparsity priors enforce selective shrinkage, we show that this effect is strengthened by variance errors.

First, both z^(L) (L-PCA) and z_K (K-Lanczos) are monotonically nondecreasing (w.r.t. L, K) and lower-bound z in each component. This is immediate for z^(L) (since eigenvalues are positive) and easy to show for z_K. Interestingly, the ratio of underestimation z_{K,j}/z_j has a clear structure. In (Seeger, 2009), the maps z_j ↦ z_{K,j}/z_j are plotted for different values of K. The error is smallest for those coefficients whose true variance z_j is large, while coefficients with moderate true z_j are most strongly damped. As K grows, these worst ratios are lifted towards 1, while the errors for the largest z_j, smallish to begin with, are affected least.

In summary, there is a stable structure in the Lanczos variance errors, coming from the algorithm's working. The largest eigenvalues of A^{−1} converge rapidly, even for small K. In general, the largest z_j depend most strongly on these largest eigenvalue contributions. Still, we run inner loop optimizations plugging in values for z which are overall substantially too small (since z is replaced by z_K). In (3), the z_j feature in the penalty term (z_j + |s_j/σ|²)^{1/2}, whose strength (in terms of enforcing |s_j| → 0) grows with shrinking² z_j. The effect of Lanczos variance errors within this framework is to strengthen soft sparsity penalization. This does not happen uniformly across the s_j: if the true z_j is among the largest coefficients, shrinkage by Lanczos is least pronounced, while moderately small true z_j are strongly diminished. The selective shrinkage effect discussed in Section 2.1 is strengthened by Lanczos variance errors. Those coefficients most relevant under exact computation are least affected by Lanczos variance errors, but the damping of less relevant ones is amplified. While there is no proof that Lanczos errors do not hurt SLM variational inference in general, they do not work against this decisive effect.

² If z_j = 0 throughout, this term becomes |s_j|/σ, and the inner loop minimization reduces to MAP estimation.
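The following self-contained snippet illustrates the componentwise lower-bound property and lets one inspect the underestimation ratios, here for the L-PCA estimator built from exact eigenpairs of a toy matrix. The random matrix is only a stand-in; the pronounced error structure described above is a property of the SLM precision matrices studied in (Seeger, 2009), not of arbitrary SPD matrices.

```python
import numpy as np

# Underestimation structure of low-rank variance approximations (Section 4.1):
# z^(L) from the L smallest eigenpairs of A never exceeds the exact z, and the
# ratio z^(L)_j / z_j can be inspected as a function of the true z_j.
rng = np.random.default_rng(3)
n, L = 200, 20
M = rng.standard_normal((n, n))
A = M @ M.T / n + 0.1 * np.eye(n)              # toy SPD stand-in, B = I here
lam, U = np.linalg.eigh(A)                     # eigenvalues in ascending order
z_exact = np.diag(np.linalg.inv(A))
z_L = np.einsum('ij,j,ij->i', U[:, :L], 1.0 / lam[:L], U[:, :L])

assert np.all(z_L <= z_exact + 1e-10)          # componentwise lower bound
ratio = z_L / z_exact                          # underestimation factor in (0, 1]
order = np.argsort(z_exact)
print("mean ratio, half with smallest true z:", ratio[order[:n // 2]].mean())
print("mean ratio, half with largest true z :", ratio[order[n // 2:]].mean())
```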
4.2. Convergence of Double-Loop Algorithm

Recall from Section 2.1 that the inference relaxation of interest here is a convex optimization problem for Laplacian potentials, and that its double loop algorithm is provably convergent in general (Seeger et al., 2009). These statements hold if Gaussian variances are computed exactly, which is impossible at large scales. It is natural to ask which of these beneficial properties are provably retained if variances are approximated by L-PCA.

Similar questions should be asked more frequently in machine learning. It is often the case that desirable properties like convexity or guaranteed convergence are proven assuming exact computations, yet real-world experiments are run based on low-rank or subsampling approximations. In all cases we are aware of, the effects of such approximations on the former properties remain unanalyzed. In order to lift "exact" properties to methods with embedded approximations, the challenge is to characterize the latter in a strong way, so that even with the approximation in place, crucial steps in the "exact computation" proof remain valid. This is possible for L-PCA, using a result on convexity of spectral functions (Davis, 1957), as summarized in the following theorem.

Theorem 1. Consider the tractable variant of the method of (Seeger et al., 2009), replacing outer loop updates in (3) by z ← z^(L) (L-PCA variance approximation). This modified algorithm is provably convergent, in the general setting given by Nickisch & Seeger (2009). However, the convexity of the modified relaxation may be compromised.

Proof. Inspecting the convergence proof in (Seeger et al., 2009), the crucial points are the concavity of γ^{−1} ↦ log|A|, and that z is updated as z ← ∇_{γ^{−1}} log|A|. This ensures that φ and the inner loop criterion φ_z meet tangentially at the current point γ, see also (Wipf & Nagarajan, 2008). For L-PCA approximations A ≈ U Λ U^T (U ∈ R^{n×L} orthonormal, Λ the L smallest eigenvalues), log|A| is replaced by log|Λ|, and z by z^(L) = diag^{−1}(B U Λ^{−1} U^T B^T), where A, U, Λ are mappings of γ ⪰ 0. We have to show that γ^{−1} ↦ log|Λ| is concave, and that ∇_{γ^{−1}} log|Λ| = z^(L). We assume that at each γ of interest, the smallest L eigenvalues of A are separated (by continuity, this holds in a small environment as well), so that eigenvalue derivatives are well-defined. If λ is an eigenvalue at γ with unit eigenvector u, then (du)^T u = 0 (since u^T u = 1 around γ), thus dλ = u^T(dA)u + 2u^T A(du) = u^T(dA)u + 2λ u^T(du) = u^T(dA)u. Therefore, ∇_{γ^{−1}} log|Λ| = ∑_{i>n−L} λ_i^{−1} (B u_i)² = z^(L), the square taken componentwise.

Proving the concavity of γ^{−1} ↦ log|Λ| is harder, due to the implicit definition of Λ. We draw on results for spectral functions. Such functions f(A) are induced from symmetric functions f: R^n → R (f(Px) = f(x) for any coefficient permutation P ∈ P_n) by way of f(A) := f(spec(A)), where spec(A) are the eigenvalues of A. Clearly, A ↦ log|Λ| = ∑_{i>n−L} log λ_i(A) is a spectral function. It is shown in (Davis, 1957) that if f is symmetric, convex, and lower semicontinuous, then its induced spectral function is convex and lower semicontinuous over Hermitian matrices. Let h(x) := −∑_{i=1}^L log(x_i) for x ≻ 0, h(x) := ∞ elsewhere, and f(x) := max_{P∈P_n} h(Px). Since −log is convex and decreasing, so is h. f is convex as a maximum over convex functions, and since h is decreasing, it is the L smallest entries of x that f(x) depends on: f(A) = −log|Λ|. Since A is an affine function of γ^{−1}, this means that γ^{−1} ↦ log|Λ| is concave. Therefore, the modified algorithm remains provably convergent.
The convexity of φ(γ) hinges on the convexity of γ ↦ log|A| (Nickisch & Seeger, 2009). However, γ ↦ log|Λ| is not convex in general, as the following counter-example shows: let n = q = 2, L = 1, X^T X = I_2, B = I_2, so that log|A| = log|I_2 + Γ^{−1}|. Then log|Λ| = min_i log(1 + 1/γ_i), which is convex at each γ with γ_1 ≠ γ_2. But for γ = (1, 1)^T + t p, f(t) := min_i log(1 + 1/(1 + p_i t)) is not convex at t = 0 if p_1 ≠ p_2. If 0 < p_1 < p_2, the argmin is i = 2 for t > 0 and i = 1 for t < 0. The derivative of component i at t = 0 is −p_i/2, so that f'(t) → −p_1/2 as t → 0, t < 0, and f'(t) → −p_2/2 as t → 0, t > 0. Since −p_1/2 > −p_2/2, f(t) is not convex at t = 0. Note that log|Λ| is locally convex in regions where eigenvalues do not cross over. However, it fails to be globally convex in general. This concludes the proof.
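A quick numerical check of this counter-example (the values of p below are arbitrary): the one-sided slopes of f at t = 0 differ in the direction that violates convexity, and the value at the midpoint exceeds the average of nearby endpoint values.

```python
import numpy as np

# Numerical illustration of the counter-example above: with gamma = (1,1) + t*p,
# f(t) = min_i log(1 + 1/(1 + p_i t)) has one-sided slopes -p_1/2 (t < 0) and
# -p_2/2 (t > 0) at t = 0, so the midpoint value exceeds the average of the
# endpoints; a convex function would satisfy the opposite inequality.
p = np.array([0.5, 1.5])

def f(t):
    return np.min(np.log(1.0 + 1.0 / (1.0 + p * t)))

eps = 1e-3
left_slope = (f(0.0) - f(-eps)) / eps           # ~ -p[0] / 2
right_slope = (f(eps) - f(0.0)) / eps           # ~ -p[1] / 2
print(left_slope, right_slope)                  # left > right: concave kink
assert f(0.0) > 0.5 * (f(-eps) + f(eps))        # convexity would require <=
```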
Even if variances are approximated by L-PCA with any L ≥ 1, the double loop algorithm is guaranteed to converge. Of course, we optimize the wrong criterion in this case, but the method is self-consistent and enjoys the same convergence properties. Our proof does not extend to K-Lanczos variance approximations. If log|A| is replaced by log|T_K|, then ∇_{γ^{−1}} log|T_K| = z_K. However, we are lacking a description of T_K strong enough in order to prove (or disprove) global concavity of γ^{−1} ↦ log|T_K|.

Finally, the concavity of A ↦ log|Λ| may be more generally useful for analyzing PCA approximations embedded in machine learning methods. For example, there is recent interest in estimating Gaussian model structure by way of penalized maximum likelihood with l1 potentials on the entries of the precision matrix P (Banerjee et al., 2008). The likelihood part of the criterion has the form tr E^T P − log|P|: a convex function in P or some linear parameterization thereof. If P is large, a natural approximation would be L-PCA. In this case, our result above implies that the problem remains convex if log|P| is replaced by log|Λ|, Λ the L smallest eigenvalues of P.

4.3. Evaluations of Approximate Inference

Arguments in this section are developed for SLM variational inference with the double loop algorithm discussed above, but may have wider significance. Besides computational tractability, it is important to understand differences in robustness to variance errors across algorithms: does the double loop algorithm stand out, or are other methods (see Section 2.1) equally tolerant (while much slower)? In our view, it does not make much sense to relate approximate Bayesian techniques to the inherently intractable ideas they tend to be motivated with, or to grant too much value to such a motivation in the first place. After all, inference is used to drive real-world problems, and approximation errors have to be understood in these contexts (see Section 2.2). Moreover, variational inference methods should be compared against today's feasible alternatives for these decision making problems. For example, by switching to a model of simpler graphical structure, we risk optimizing acquisitions for the wrong reconstruction setup, thus solving the wrong problem more accurately. As noted in Section 2.1, we can opt for variational mean field Bayes (Attias, 2000), tractable at large scales only with many factorization assumptions. However, most covariances are not represented in a factorized Q(u|y). The choice of which dependencies to suppress is typically made beforehand, without even looking at data. Factorization assumptions seem disadvantageous in decision scenarios like experimental design, where the dominant posterior covariances are all that matters. While more difficult to implement, Lanczos (PCA) approximations provide a superior alternative in this context, since leading posterior covariances can be tracked in a data-dependent manner.

Our arguments have implications for how (variational) approximate inference methods should be evaluated. At present, the quality of a new technique is evaluated by comparing the relative accuracy of marginals on small regular graphs, where brute force exact computations can still be done. But virtually no applications of approximate Bayesian inference rely crucially on highly accurate marginal numbers. Typically, decision scenarios such as in Section 2.2 are faced: what matters is that the highest scoring candidates come out on top for the approximation as well. When Lanczos approximations are used for moderate K, almost all variances are underestimated substantially (Section 3), but special structure in these errors implies that they do not adversely affect selective shrinkage in SLM inference (Section 4.1). Similar to marginal variances, Lanczos approximations of design score curves Δ(X_*) are globally much too small, but their maximum points tend to stick out even for small K (see Section 2.2). There are many other practically relevant properties of a method, such as computational efficiency, robustness, uniqueness, guaranteed convergence, and ease of use, and some of these may well be more important than accuracy of marginals. A main message of this paper is that variational inference techniques should be analyzed and compared on real-world decision scenarios. Testing them on artificial problems may paint a misleading picture.

5. Experiments

Recall the image acquisition optimization setup from Section 2.2. The underlying assumption is that with good posterior approximations, highly informative designs will be found (tailored to the reconstruction model), and MAP reconstruction results on a test set will be improved. The goal of our study³ is to test how reconstruction quality varies with different levels of Gaussian variance errors. In line with Section 4.3, we choose a realistic setup: image reconstruction from noisy Fourier measurements (required for magnetic resonance imaging). We use the model setup previously employed in (Seeger et al., 2009; Seeger & Nickisch, 2008): B consists of an orthonormal wavelet transform B_a and horizontal/vertical differences B_r, with corresponding prior parameters τ_a, τ_r. We adopt the "Cartesian" variant of (Seeger, 2009): candidates X_* are "phase encodes" (complete columns in Fourier space). MRI data has additional complexities (complex-valued u, phase noise) which would interfere with our goals here: we employ a dataset of conventional natural images (the 75 images previously used in (Seeger & Nickisch, 2008)), at resolution 256 × 256. Measurement noise is fixed to σ² = 10^{−3}, and the hyperparameters τ_a, τ_r are adjusted based on MAP reconstructions with a fixed design picked ad hoc (τ_a = 0.08, τ_r = 0.16). We split the data into five blocks, each containing one image for design optimization (Figure 1) and 5 test images drawn randomly from the pool. Sequential experimental design is run as previously done in (Seeger et al., 2009); both inference and MAP estimation use the same hyperparameters. Our test error measure is ‖û − u_true‖ / ‖ũ − u_true‖, where û is the MAP estimate in question and ũ the MAP image for a low-frequency-only design⁴; ⟨·⟩ denotes test set and block averaging.

³ Our goal is not to compare the method of Seeger et al. (2009) against others, but to analyze effects of variance errors for the same method.

⁴ The central 32 columns in Fourier space. Note that design optimization starts with this basis: it is contained in all other designs used here.
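A small helper showing one way to compute this test error measure. Whether the averaging is applied before or after taking the ratio is not fully clear from the text, so the placement below (average each norm over images, then take the ratio) is an assumption, as are all names.

```python
import numpy as np

# Relative test error of Section 5 (one reading): average reconstruction error
# under the optimized design, normalized by the average error of MAP images
# from the low-frequency-only baseline design. Array shapes: (num_images, n).
def relative_test_error(u_map, u_baseline, u_true):
    err = np.linalg.norm(u_map - u_true, axis=1).mean()
    base = np.linalg.norm(u_baseline - u_true, axis=1).mean()
    return err / base
```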
Figure 1. Training set for image acquisition optimization.

Recall K-Lanczos and L-PCA from Section 3. For inference with L-PCA, we can prove desirable properties (Section 4.2), yet it is not very practical, for a number of reasons. First, there is no good reason for not using the full Lanczos representation after K steps: for L-PCA, we throw part of it away. Second, running Lanczos for K steps, we observe a significant fluctuation of the number of converged eigenvalues, certainly as long as designs are small. Finally, it is not obvious how to separate smallest and largest converged eigenvalues: the true spectrum of the matrices A encountered tends to have gaps at both ends. We give L-PCA results here for comparison, but note that K-Lanczos will typically be used in practice.

Figure 2. Relative test reconstruction errors for designs found with variational inference based on different-quality Gaussian variance approximations, plotted against the number of design columns. Top: K-Lanczos, curves labelled K_var/K_score (800/500, 500/500, 200/200, 100/100, 50/50). Bottom: L-PCA, curves labelled L_var/L_score (135/85, 135/34, 85/85, 85/34, 59/59, 34/34).

Results are shown in Figure 2. First, K-Lanczos does much better than L-PCA⁵: the latter cannot be recommended. Results for K-Lanczos are remarkably robust across a wide range of Lanczos steps. The curves between K = 200 and K = 800 differ insignificantly only (we show the most variable part). However, for too few Lanczos steps, results deteriorate (see also the error images in Figure 3). Can we evaluate scores with fewer Lanczos iterations? We repeated the experiment with K_var ∈ {500, 800} (variances at outer loop updates), but K_score = 200, obtaining virtually the same results. For L-PCA, a similar manoeuvre leads to further degradation (Figure 2, lower panel).

⁵ Rather than trying to spot the "true central gap", we run Lanczos for K̃ (about 6L) steps, obtaining Ñ converged eigenvalues, then use the min{L, Ñ} smallest of these.

These results underline our comments in Section 4.3. It should be noted that relative variance errors for K-Lanczos are of very different size between K = 200 and K = 800: the average relative error across all coefficients scales somewhat linearly in K (Seeger (2009), Sect. 5.1), and a sizeable number of variance coefficients are orders of magnitude too small at K = 200. However, if posterior variances are used within a decision scenario, such as Bayesian experimental design for natural images, outcomes can be entirely robust in the presence of such errors.

Figure 3. MAP reconstructions of a training image under designs of 70, 80, 90, 100, 110 columns. Shown are residuals w.r.t. the true image. Top: 200-Lanczos. Middle: 100-Lanczos. Bottom: 50-Lanczos.
6. Discussion

We have highlighted the significance of Gaussian variance approximation for variational (sparse) Bayesian inference, and provided novel analyses of the effects of PCA/Lanczos variance approximation errors on outcomes of nonlinear Bayesian experimental design. Our results show that outcomes can be robust in the presence of substantial overall marginal variance errors, at least for methods aiming to track dominating posterior covariances rather than imposing factorization constraints up front. While variational Bayesian methods are used in diverse applications, most evaluations of novel technology to date concentrate almost solely on closeness of marginals to the true posterior, a single point of merit which may often be of minor importance in practice. In order to understand the real-world impact of Bayesian technology, theoretical analyses and empirical evaluations may have to broaden their focus.

Acknowledgements

This work is partly funded by the Excellence Initiative of the German research foundation (DFG). Thanks to Hannes Nickisch and David Wipf for helpful discussions.

References

Attias, H. A variational Bayesian framework for graphical models. In NIPS 12, pp. 209–215, 2000.

Banerjee, O., El Ghaoui, L., and d'Aspremont, A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. JMLR, 9:485–516, 2008.

Davis, C. All convex invariant functions of Hermitian matrices. Archiv der Mathematik, 8:276–278, 1957.

Golub, G. and Van Loan, C. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.

Levin, A., Weiss, Y., Durand, F., and Freeman, W. Understanding and evaluating blind deconvolution algorithms. In CVPR, 2009.

Malioutov, D., Johnson, J., Choi, M., and Willsky, A. Low-rank variance estimation in GMRF models: Single and multiscale approaches. IEEE Trans. Sig. Proc., 56(10):4621–4634, 2008.

Nickisch, H. and Seeger, M. Convex variational Bayesian inference for large scale generalized linear models. In ICML 26, pp. 761–768, 2009.

Schneider, M. and Willsky, A. Krylov subspace estimation. SIAM J. Comp., 22(5):1840–1864, 2001.

Seeger, M. Bayesian inference and optimal design for the sparse linear model. JMLR, 9:759–813, 2008.

Seeger, M. Speeding up magnetic resonance image acquisition by Bayesian multi-slice adaptive compressed sensing. In NIPS 22, pp. 1633–1641, 2009.

Seeger, M. and Nickisch, H. Compressed sensing and Bayesian experimental design. In ICML 25, 2008.

Seeger, M., Nickisch, H., Pohmann, R., and Schölkopf, B. Bayesian experimental design of magnetic resonance imaging sequences. In NIPS 21, pp. 1441–1448, 2009.

Willsky, A. Multiresolution Markov models for signal and image processing. Proc. IEEE, 90(8):1396–1458, 2002.

Wipf, D. and Nagarajan, S. A new view of automatic relevance determination. In NIPS 20, 2008.