Convex Variational Bayesian Inference for Large Scale Generalized Linear Models

Hannes Nickisch (hn@tuebingen.mpg.de), Empirical Inference Department, Max Planck Institute for Biological Cybernetics, Tübingen, Germany
Matthias W. Seeger (mseeger@mmci.uni-saarland.de), Department of Computer Science, Saarland University, Saarbrücken, Germany

Abstract

We show how variational Bayesian inference can be implemented for very large generalized linear models. Our relaxation is proven to be a convex problem for any log-concave model. We provide a generic double loop algorithm for solving this relaxation on models with arbitrary super-Gaussian potentials. By iteratively decoupling the criterion, most of the work can be done by solving large linear systems, rendering our algorithm orders of magnitude faster than previously proposed solvers for the same problem. We evaluate our method on problems of Bayesian active learning for large binary classification models, and show how to address settings with many candidates and sequential inclusion steps.

1. Introduction

Modern machine learning applications require methods for inferring good decisions from incomplete data, in settings with many unknown variables. Bayesian inference is a powerful technique towards this end, but most current Bayesian algorithms cannot reliably operate in large scale domains, where convex point estimation techniques are routinely used. We present a Bayesian inference approximation which can be used on a scale previously achieved by point estimation techniques only. We show that our variational relaxation constitutes a convex optimization problem whenever the search for the posterior mode is convex. We propose a novel double loop algorithm for inference in large generalized linear models, reaching scalability by decoupling the criterion and reducing all efforts to standard techniques from numerical linear algebra. We generalize the algorithm of (Seeger et al., 2009) to arbitrary super-Gaussian sites and provide technology for a fully generic implementation. We show how our method applies to Bayesian active learning for binary classification, and extend the framework of (Seeger et al., 2009) so as to cope with thousands of sequential inclusions.

The structure of the paper is as follows: our novel inference algorithm is presented in Section 2, along with theoretical results. Its application to binary classification active learning is given in Section 3, and relations to other work are discussed in Section 4. We present large scale experiments in Section 5. Most of the technical details are given in the Appendix.

2. Scalable Convex Inference

Our method applies to generalized linear models over continuous latent variables u ∈ R^n. More specifically, models may have Gaussian sites N(y | Xu, σ² I) and non-Gaussian factors t_i(s_i), s = Bu. For example, applied to magnetic resonance imaging (Seeger et al., 2009), u is the unknown MR image, y = Xu + ε are noisy scanner measurements, and the t_i(s_i) form a sparsity prior on multiscale image gradients s = Bu ∈ R^q. In binary classification, u are classifier weights, B collects feature vectors b_i, and the t_i(s_i) are Bernoulli likelihoods. For a Gaussian weight prior, X = I, y = 0. A sparsity prior on u leads to X = 0, y = 0, appending I to B, and adding sparsity sites to the t_i(s_i). The posterior of the model has the form
P(u|D) = P(D)^{-1} N(y | Xu, σ² I) ∏_{i=1}^q t_i(s_i),   (1)

where P(D) is a normalization constant. We refer to a site t_i(s_i) as (strongly) super-Gaussian if log[t_i(s_i) e^{-σ^{-2} b_i s_i}] is even for some b_i ∈ R, and strictly convex and decreasing as a function of s_i² (Palmer et al., 2006). For example, the sparsity promoting Laplace sites

t_i(s_i) = e^{-(τ_i/σ) |s_i|},   τ_i > 0,   (2)

are super-Gaussian with b_i = 0. For binary classification, Bernoulli (or logistic) sites are frequently used:

t_i(s_i) = (1 + e^{-c_i (τ_i/σ) s_i})^{-1},   c_i ∈ {±1},   (3)

which are super-Gaussian as well (Jaakkola, 1997, Sect. 3.B) with b_i = c_i τ_i σ/2 (see Appendix). Laplace and Bernoulli sites are log-concave: s_i ↦ log t_i(s_i) is concave. This property, which is shared by many other commonly used potentials, will have important consequences.

In the context of the applications of interest here, Bayesian inference amounts to approximating the mean and covariance matrix of P(u|D). All of n, q, m can be hundreds of thousands. Our algorithms operate on this scale by exploiting structure in the model matrices X, B, so that matrix-vector multiplications (MVMs) can be computed rapidly, as is the case for sparse matrices, wavelet, or Fourier transforms.

Super-Gaussian sites can be lower-bounded by scaled Gaussians of any width. Let x_i := σ^{-2} s_i² and g_i(x_i) := log t_i(s_i) − σ^{-2} b_i s_i, where g_i is convex and decreasing, so that g_i(x_i) = max_{γ_i>0} [−x_i/(2γ_i) − h_i(γ_i)/2] by Legendre-Fenchel duality (Palmer et al., 2006), and

t_i(s_i) = max_{γ_i>0} e^{σ^{-2}(b_i s_i − s_i²/(2γ_i)) − h_i(γ_i)/2}.

Here, h_i scales the height of the Gaussian lower bound such that it just touches the site t_i(s_i) from below. To simplify notation, we will write g_i(s_i) = g_i(x_i) in the following (g_i as dependent variable, rather than function).

Variational Bayesian inference centers on the log-partition function log P(D) (1), the cumulant generating function of the posterior. The inference relaxation we employ here (Girolami, 2001; Palmer et al., 2006; Jaakkola, 1997) consists of lower-bounding log P(D) by plugging in the bounds for each site:

P(D) ≥ e^{-h(γ)/2} ∫ N(y | Xu, σ² I) Q_0(u) du,   Q_0(u) := e^{σ^{-2}(b^T s − s^T Γ^{-1} s/2)},   (4)

where Γ = diag γ and h(γ) = ∑_i h_i(γ_i). This is a tractable Gaussian integral, and previous algorithms maximize it one γ_i at a time (Girolami, 2001). If both n and q are very large, such coordinate ascent methods cannot be used anymore. Our novel contributions to this problem are twofold: First, we prove that this variational lower bound relaxation is a convex optimization problem iff all sites t_i(s_i) are log-concave. Second, we provide a scalable algorithm to solve it orders of magnitude faster than previous algorithms. Our methods allow for the application of Bayesian techniques to problems where previously only point estimation had been feasible, as will be demonstrated in our experiments.
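The per-site Gaussian lower bound can be checked numerically. The following minimal sketch (our own illustration, not code from the paper) uses a single Laplace site with τ = σ = 1, computes h(γ) = −min_{x≥0}(x/γ + 2g(x)) by scalar minimization, and verifies that the scaled Gaussian e^{−s²/(2σ²γ) − h(γ)/2} lies below t(s) and touches it; for Laplace sites h(γ) = τ²γ in closed form (used only as a cross-check here, the expression is derived in Section 2.2).

```python
import numpy as np
from scipy.optimize import minimize_scalar

tau, sigma = 1.0, 1.0                                   # Laplace site t(s) = exp(-(tau/sigma)|s|), offset b = 0
t = lambda s: np.exp(-(tau / sigma) * np.abs(s))
g = lambda x: -tau * np.sqrt(x)                         # g(x) = log t(s) as a function of x = s^2/sigma^2

def h(gamma):
    # h(gamma) = -min_{x >= 0} [x/gamma + 2 g(x)], the height term of the Gaussian bound
    res = minimize_scalar(lambda x: x / gamma + 2.0 * g(x), bounds=(0.0, 1e6), method='bounded')
    return -res.fun

gamma = 0.7
s = np.linspace(-3.0, 3.0, 601)
bound = np.exp(-s**2 / (2.0 * sigma**2 * gamma) - h(gamma) / 2.0)
assert np.all(bound <= t(s) + 1e-6)                     # the scaled Gaussian never exceeds the site ...
assert np.isclose(h(gamma), tau**2 * gamma, atol=1e-4)  # ... and h(gamma) matches the closed form tau^2 * gamma
print("max gap at the touching point:", np.max(bound - t(s)))   # close to 0
```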
2.1. Convex Inference Relaxation

The log-partition function lower bound implies a Gaussian approximate posterior Q(u|D) with covariance

Cov_Q[u|D] = σ² A^{-1},   A := X^T X + B^T Γ^{-1} B.

We start with an equality for Gaussians:

∫ N(y | Xu, σ² I) Q_0(u) du = |2π σ² A^{-1}|^{1/2} max_u N(y | Xu, σ² I) Q_0(u),

so that the lower bound can be written as P(D) ≥ C_1 e^{-φ(γ)/2}, with

φ(γ) := log|A| + h(γ) + min_u R(u, γ),   R := σ^{-2} (‖y − Xu‖² + s^T Γ^{-1} s − 2 b^T s),

to be minimized w.r.t. γ ≻ 0. R(u, γ) is jointly convex, since (s_i, γ_i) ↦ s_i²/γ_i is jointly convex for γ_i > 0. We will establish that log|A| is convex as well, and that h(γ) is convex iff all t_i(s_i) are log-concave. We will show how to solve min_γ φ efficiently.

Theorem 1  Let X ∈ R^{m×n}, B ∈ R^{q×n}, and let the f_i(γ_i) be continuously differentiable functions into R_+ such that the log f_i(γ_i) are convex. Then, γ ↦ log|X^T X + B^T (diag f(γ)) B| is convex. Especially, γ ↦ log|A| is convex.

A proof is given in the Appendix. The conditions are fulfilled for f_i(γ_i) = γ_i^{-α_i}, α_i > 0 (α_i = 1 for the case of log|A|), but also for f_i(γ_i) = e^{γ_i}, generalizing the convexity of the logsumexp function to matrix values.

Theorem 2  Consider a model of the form (1), with strongly super-Gaussian sites, and let g_i(x_i) = g_i(s_i) = log t_i(s_i) − σ^{-2} b_i s_i, x_i = σ^{-2} s_i². If s_i ↦ g_i(s_i) is concave and twice continuously differentiable for s_i > 0, then h_i(γ_i) = −min_{x_i ≥ 0} (x_i/γ_i + 2 g_i(x_i)) is convex. On the other hand, if g_i''(s_i) > 0 for some s_i > 0, then h_i(γ_i) is not convex at some γ_i > 0.

A proof is given in the Appendix. We conclude from these theorems that the minimization of φ(γ) is a convex problem if all sites are strongly super-Gaussian and log-concave. On the other hand, for a model of the form (1), if some site t_i(s_i) fails to be log-concave, then h_i(γ_i) is not convex, and we can easily construct some X, y, B such that φ(γ) is not convex. Our result settles a longstanding problem in approximate inference: if the posterior mode of a super-Gaussian model can be found via a convex problem, then a frequently used approximation (Girolami, 2001; Palmer et al., 2006; Jaakkola, 1997) is convex as well.
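Theorem 1 is easy to probe numerically. The toy check below (our own sketch, not part of the paper) evaluates γ ↦ log|X^T X + B^T Γ^{-1} B| on small random matrices and verifies midpoint convexity along random segments in the positive orthant.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, q = 8, 5, 12
X, B = rng.standard_normal((m, n)), rng.standard_normal((q, n))

def phi1(gamma):
    # log|A| with A = X^T X + B^T diag(1/gamma) B; f_i(gamma_i) = 1/gamma_i is log-convex
    A = X.T @ X + B.T @ np.diag(1.0 / gamma) @ B
    return np.linalg.slogdet(A)[1]

for _ in range(1000):
    g0, g1 = rng.uniform(0.1, 10.0, q), rng.uniform(0.1, 10.0, q)
    assert phi1(0.5 * (g0 + g1)) <= 0.5 * (phi1(g0) + phi1(g1)) + 1e-9   # midpoint convexity
print("log|A| passed the midpoint convexity checks")
```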
2.2. Scalable Double Loop Algorithm

Convexity of min_γ φ does not necessarily imply tractability on a large scale. For example, high resolution image reconstruction problems of the kind addressed in (Seeger et al., 2009) could not sensibly be approached by previous coordinate descent algorithms (Girolami, 2001): each γ_i update requires the corresponding marginal Q(s_i|D), for which an n × n linear system has to be solved, and each of the q sites has to be visited at least once. If q and n are very large, such algorithms are intractable even for highly structured matrices. Standard joint optimization code is problematic as well, since even a single gradient of φ is expensive. The algorithm we propose here circumvents these problems by decoupling the expensive parts of φ in a nested double loop fashion, related to concave-convex (Yuille & Rangarajan, 2003) or difference-of-convex ideas, as used in expectation maximization. While double loop algorithms have been proposed for non-convex approximate inference, we show that they can also be used to drastically speed up solving convex inference problems.

We proceed as in (Seeger et al., 2009). While γ ↦ log|A| is convex (Theorem 1), concavity of γ^{-1} ↦ log|A| is well known. Using conjugate duality once more, log|A| = min_{z ⪰ 0} z^T (γ^{-1}) − g*(z). We have

φ_z := z^T (γ^{-1}) + h(γ) + R(u, γ) − g*(z).

The upper bound φ_z ≥ φ is jointly convex in (u, γ), and compared to φ, the difficult coupling term log|A| is replaced by the decoupled convex term z^T (γ^{-1}). Our algorithm proceeds in inner loop minimizations of φ_z for fixed z, and outer loop updates, where z and g*(z) are re-estimated. For the inner loop, we note that

min_γ φ_z = σ^{-2} ‖y − Xu‖² + 2 ∑_{i=1}^q h*_i(s_i) − g*(z),
h*_i(s_i) = (1/2) min_{γ_i > 0} ((z_i + σ^{-2} s_i²)/γ_i + h_i(γ_i)) − σ^{-2} b_i s_i.

The problem min_u (min_γ φ_z) is of standard penalized quadratic form, and can be solved very efficiently by the iteratively reweighted least squares (IRLS) algorithm, which typically converges in a few Newton steps, each of which requires the solution of one linear system (X^T X + B^T (diag e) B) d = r, where r, e are simple functions of u. Given useful structure in X, B (such as sparsity), this optimization is scalable to very large sizes; the systems are solved by (preconditioned) linear conjugate gradients (LCG). Upon inner loop convergence, the minimizer u* is the mean of Q(u|D).

Once an inner loop converges, z has to be refitted. The minimizer is z* = ∇_{γ^{-1}} log|A| = diag^{-1}(B A^{-1} B^T) = σ^{-2} (Var_Q[s_i|D])_i, the marginal variances of Q(s|D), and g*(z*) = z*^T (γ^{-1}) − log|A|. While the means of a large linear-Gaussian model Q(u|D) can be estimated by a single linear system, the variances are much harder to obtain. In fact, we do not know of a general bulk variance estimator which would be as accurate as LCG, but not vastly more expensive.

To understand the rationale behind our algorithm, note that the computation of ∇φ is as difficult as the estimation of z. Our algorithm requires these expensive steps only a few times (usually 4 or 5 outer loop iterations are sufficient), since they are kept out of the inner loop, where most of the progress is made. In contrast, most standard gradient-based optimizers require many evaluations of ∇φ to converge. As discussed below, our decomposition also means that the variances can be estimated rather poorly, while still obtaining a practically useful algorithm. Finally, we note that our double loop algorithm is guaranteed to converge, under the assumption that the z are fitted exactly (Seeger et al., 2009; Wipf & Nagarajan, 2008).

For the inner loop optimization, we require h*_i(s_i) and its first and second derivatives. For Laplace sites, h_i(γ_i) = τ_i² γ_i, and h*_i(s_i) = τ_i (z_i + s_i²/σ²)^{1/2}. However, for Bernoulli sites, we are not aware of an analytic expression for h_i(γ_i), let alone h*_i(s_i). Since h_i and h*_i are defined by scalar convex minimizations, all terms can be computed implicitly whenever required. We show in the Appendix how to implement our algorithm generically, given g_i(x_i) and its derivatives only. Even with many implicitly defined h*_i, this can be done efficiently. Moreover, these computations can be parallelized straightforwardly.
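For Laplace sites, the inner loop takes a particularly simple form, since h*_i(s_i) = τ_i (z_i + s_i²/σ²)^{1/2}. The sketch below is a dense toy version under our own simplifying assumptions (Laplace sites only, so b_i = 0; fixed z; no preconditioning): it minimizes σ^{-2}‖y − Xu‖² + 2∑_i τ_i (z_i + s_i²/σ²)^{1/2} by reweighted least squares, where each sweep solves a system of the stated form (X^T X + B^T (diag e) B) u = X^T y with conjugate gradients driven purely by MVMs. Solving directly for u with r = X^T y is a simple majorize-minimize variant of the IRLS scheme described above.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def inner_loop(X, B, y, z, tau, sigma=1.0, sweeps=20):
    """IRLS for the inner criterion sigma^-2 ||y - X u||^2 + 2 sum_i tau_i sqrt(z_i + s_i^2/sigma^2),
    s = B u (Laplace sites, b_i = 0). Each sweep solves (X^T X + B^T diag(e) B) u = X^T y by CG."""
    n = X.shape[1]
    u = np.zeros(n)
    for _ in range(sweeps):
        s = B @ u
        e = tau / np.sqrt(z + s**2 / sigma**2)            # weights of the quadratic majorizer
        mv = lambda v: X.T @ (X @ v) + B.T @ (e * (B @ v))
        u, _ = cg(LinearOperator((n, n), matvec=mv), X.T @ y, x0=u, maxiter=500)
    return u

# toy usage with small dense matrices; in the paper's setting X, B are structured (sparse, FFT, ...)
rng = np.random.default_rng(1)
m, n, q = 30, 20, 40
X, B = rng.standard_normal((m, n)), rng.standard_normal((q, n))
y = X @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)
u_star = inner_loop(X, B, y, z=np.full(q, 0.1), tau=np.full(q, 1.0))
```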
2.3. Covariance Estimation by Lanczos

Outer loop updates require the computation of z = diag^{-1}(B A^{-1} B^T), or all variances Var_Q[s_i|D], which we estimate by the Lanczos algorithm (Schneider & Willsky, 2001). In a nutshell, the precision matrix A is approximated by a low-rank representation Q T Q^T, with Q ∈ R^{n×k} orthonormal, T tridiagonal, and k ≪ n, where the extremal eigenvectors of A are optimally approximated by Q. We then approximate expressions linear in A^{-1} by plugging in Q T^{-1} Q^T instead, for example ẑ_k := diag^{-1}(B Q T^{-1} Q^T B^T) ≈ z. In fact, ẑ_k = ẑ_{k−1} + v_k², with v_k = (B q_k − d_{k−1} v_{k−1})/e_k (where the d_k, e_k form the bidiagonal Cholesky factor of T), so that ẑ_k increases monotonically towards z in each component. In this usage, the Lanczos algorithm can be thought of as solving many linear systems in parallel, with the same A but different right hand sides.

This approach is not straightforward. Lanczos codes are complicated and scale superlinearly in the number of iterations k. The algorithm can be run with moderate k only, since at least O(n·k) memory is required, and many components in ẑ_k are significantly underestimated. This inaccuracy may be unavoidable: we are not aware of a general bulk variance estimator improving on Lanczos, and variances are required to drive any algorithm for min_γ φ. Importantly, these errors in ẑ_k ≈ z seem to affect our algorithm little in practice. Inaccurate variances mean that φ_z is not exactly tangent to φ at the current γ after an outer loop update. However, its (inner loop) minimization is accurate, since only mean computations are required. In short, our algorithm seems to be rather robust in practice towards significant errors in posterior variance estimation. Given the apparent intractability of this computation, this is a critical feature of our decoupling approach. Some intuition about this robustness will be given in a longer paper. Compared to what is done in other tractable inference approximations, where many dependencies are ruled out up front without even inspecting the data (factorization assumptions in structured mean field), our approximation is fully data-dependent, with the extremal covariance eigenvectors being homed in on by Lanczos, just as is done in PCA.
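A minimal sketch of this variance estimator is given below (our own toy version: full reorthogonalization, dense matrices, and ẑ_k formed via an explicit T^{-1} rather than the incremental bidiagonal-Cholesky recurrence described above). It illustrates the two properties used in the text: ẑ_k underestimates z componentwise and improves with k.

```python
import numpy as np

def lanczos(Amv, n, k, rng):
    """k-step Lanczos with full reorthogonalization: A ~= Q T Q^T (Q: n x k orthonormal, T tridiagonal)."""
    Q, alpha, beta = np.zeros((n, k)), np.zeros(k), np.zeros(k - 1)
    v = rng.standard_normal(n); v /= np.linalg.norm(v)
    for j in range(k):
        Q[:, j] = v
        r = Amv(v)
        alpha[j] = v @ r
        r = r - alpha[j] * v - (beta[j - 1] * Q[:, j - 1] if j > 0 else 0.0)
        r -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ r)          # reorthogonalize (affordable at toy scale only)
        if j < k - 1:
            beta[j] = np.linalg.norm(r)
            v = r / beta[j]
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return Q, T

rng = np.random.default_rng(2)
n, q, k = 50, 80, 25
X, B = rng.standard_normal((60, n)), rng.standard_normal((q, n))
gamma = rng.uniform(0.5, 2.0, q)
A = X.T @ X + B.T @ (B / gamma[:, None])                  # A = X^T X + B^T diag(1/gamma) B
Q, T = lanczos(lambda w: A @ w, n, k, rng)
W = B @ Q
z_hat = np.einsum('ij,ij->i', W, W @ np.linalg.inv(T))    # diag(B Q T^-1 Q^T B^T)
z = np.einsum('ij,ij->i', B, np.linalg.solve(A, B.T).T)   # exact diag(B A^-1 B^T)
print(np.all(z_hat <= z + 1e-6), np.mean(z_hat / z))      # underestimates; ratio approaches 1 as k grows
```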
2.4. Properties of the Algorithm

The scalability of our algorithm comes from a number of appropriate reductions, illustrated in Figure 1. On the first level, the complicated inference problem is relaxed to a convex program. The optimization problem is decoupled in the double loop algorithm: iterations reduce to the estimation of means and variances in a linear-Gaussian model, which is done by standard algorithms of numerical mathematics (LCG and Lanczos), routinely employed for very large systems. These naturally reduce to matrix-vector multiplications (MVMs). On a higher level, we fit a sequence of Gaussian models to the true posterior. The computational complexity of the algorithm is measured in the number of MVMs needed, and can be related to MAP estimation and to a naive approach to minimizing φ(γ).

[Figure 1. Reductions in variational inference: variational inference (general, adjustable; high-dimensional optimization, convexity) reduces to linear systems (stable, well understood numerical mathematics; conjugate gradients for means, Lanczos for covariances), which reduce to structured matrix-vector multiplications (signal processing, highly parallelizable; sparse matrices, (nonequispaced) FFT, filterbanks).]

Recall that n is the number of latent variables, m the number of Gaussian sites, and q the number of non-Gaussian sites. Denote by k the number of Lanczos iterations in outer loop updates, by N_CG the number of LCG iterations to solve a system with A, and by N_Newt the number of Newton steps for IRLS.

  algorithm                        # MVMs             storage
  full Newton for MAP              N_Newt · N_CG      O(m + n + q)
  one coordinate ascent step       q · N_CG           O(m + n + q)
  one exact gradient ∇φ            q · N_CG           O(m + n + q)
  one approximate gradient ∇φ      k + N_CG           O(k·n + q)
  one inner loop (u, γ) update     N_Newt · N_CG      O(n + q)
  one outer loop z update          k                  O(k·n + q)

The cost of an MVM for sparse matrices is typically linear in the number of nonzeros. Empirically, N_Newt ≈ 10 for our inner loops, and we never run more than 5 outer loop iterations, typically 1 or 2 only. Lanczos codes come with additional costs to keep Q orthonormal, up to O(n·k²). The table shows that a naive minimization of φ(γ) is not scalable, since we have to solve O(q) n × n linear systems for a single gradient step. While MAP estimation is faster in practice, its scaling differs from our algorithm's only by a moderate constant factor.

3. Bayesian Active Learning

Proper Bayesian active learning builds on the posterior distribution, therefore inference is needed. More specifically, the evaluation of typical scores, such as the information gain, requires (approximate) inference. Within our method, these computations are naturally dealt with as part of the (convex) relaxation, without requiring artificial additional Laplace approximations. Our efficient double loop algorithm lets us address large scale Bayesian optimal decision problems. A special case was used in (Seeger et al., 2009) in order to optimize measurements for image reconstruction. Their framework can feasibly be run only for a moderate number of sequential design extensions. We present extensions in order to deal with binary classification active learning, with weights u and features given by the rows of B, retaining scalability even with very many sequential inclusions.

The likelihood sites t_i(s_i) are Bernoulli (3), and the prior on u is a spherical Gaussian with variance σ² (X = I, y = 0). If n ≫ q, a sparsity prior on the weights may be more appropriate, and we employ such a variant in our experiments as well, using n Laplace sites (2) in addition to the likelihood sites. Technically, this is done by appending I to B and removing the Gaussian factor. For simplicity of notation, the Gaussian prior case is discussed in the remainder of this section.

Our active learning algorithm starts with a posterior approximation based on randomly drawn instances. In the subsequent design phase, we sequentially include blocks of K data points. If the task requires a large number of sequential inclusions, tractability is retained by choosing K large enough. We use a candidate list of potential rows for B. Each iteration consists of an initial Lanczos run to estimate marginal posterior moments, K ≥ 1 inclusions (appending K new rows to B), and a re-optimization of all site parameters γ. Within a block, the marginals Q(s_i) = N(μ_i, σ² ρ_i), i ∈ J (J containing all model and candidate sites), are kept valid at all times. Note that μ_J = B_J u* (since u* = E_Q[u|D]), and that B is a part of B_J. For larger K, our method runs faster, since the variational parameters are updated less frequently, while for smaller K, the more frequent refits to the non-Gaussian posterior may result in better sequential decisions.

Each inclusion within a block consists of scoring all remaining candidates, picking the winner, and updating the marginals μ_J, ρ_J. Let b_i be a potential new row of B, and s_i = b_i^T u. In our experiments, we use several design scores, based on the current marginal Q(s_i). The information gain score (Seeger, 2008, Sect. 4.1) is

S_IG(b_i) := ∑_{c_i=±1} Q(c_i) D[Q'(s_i; c_i) ‖ Q(s_i)],

where Q'(·; c_i) is the new approximation to Q(s_i) t_i(s_i; c_i). The classifier uncertainty score is simpler: S_CU(b_i) := −|Q(c_i = +1) − 1/2|, preferring candidates with predictive probability Q(c_i = +1) closest to 1/2. We compute Q(c_i = +1) = E_Q[t_i(s_i; c_i = +1)] by quadrature.

Once the winner is determined (say, i), its label c_i is queried and its site included, using the novel γ_i (detailed in the Appendix). The marginals are updated as suggested in (Seeger & Nickisch, 2008):

ρ_J' = ρ_J − (γ_i + ρ_i)^{-1} w²,   μ_J' = μ_J + ((b_i − μ_i/γ_i)/κ_i) w,

where κ_i = 1 + ρ_i/γ_i, w = B_J d, and d = A^{-1} b_i (one linear system); w² is taken componentwise, and the b_i in the mean update denotes the offset of the new site (for Bernoulli sites, b_i = c_i τ_i σ/2). We use the solution to recompute ρ_i, μ_i, solve again for γ_i, and plug these back into μ_J, ρ_J. This corrects for Lanczos inaccuracies (especially, ρ_i is underestimated by Lanczos). Moreover, u*' = u* + ((b_i − μ_i/γ_i)/κ_i) d, and log|A'| = log|A| + log κ_i.
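The block update above can be sketched as follows (our own dense toy version with σ = 1 and exact linear solves in place of Lanczos/CG; variable names are ours). It applies the rank-one formulas for ρ_J, μ_J, u*, and log|A| after including one site, and checks them against direct recomputation.

```python
import numpy as np

def include_site(solve_A, B_J, u, rho_J, logdetA, b_row, b_off, gamma_new):
    """Update marginals after including one site on s = b_row^T u with parameters (gamma_new, b_off);
    sigma = 1, solve_A(r) returns A^{-1} r (one linear system per inclusion)."""
    d = solve_A(b_row)
    w = B_J @ d
    rho_i, mu_i = b_row @ d, b_row @ u
    kappa = 1.0 + rho_i / gamma_new
    coef = (b_off - mu_i / gamma_new) / kappa
    return (rho_J - w**2 / (gamma_new + rho_i),          # rho_J'
            B_J @ u + coef * w,                          # mu_J'
            u + coef * d,                                # u*'
            logdetA + np.log(kappa))                     # log|A'|

# toy check against direct recomputation
rng = np.random.default_rng(3)
n, J = 15, 10
B_J = rng.standard_normal((J, n))
beta = rng.standard_normal(J)                            # offsets of the existing sites (Gamma = I)
A = np.eye(n) + B_J.T @ B_J                              # Gaussian prior (X = I) plus existing sites
u = np.linalg.solve(A, B_J.T @ beta)                     # current posterior mean
rho_J = np.einsum('ij,ij->i', B_J, np.linalg.solve(A, B_J.T).T)
b_row, b_off, gamma_new = rng.standard_normal(n), 0.5, 2.0
rho2, mu2, u2, ld2 = include_site(lambda r: np.linalg.solve(A, r), B_J, u, rho_J,
                                  np.linalg.slogdet(A)[1], b_row, b_off, gamma_new)
A2 = A + np.outer(b_row, b_row) / gamma_new
assert np.allclose(u2, np.linalg.solve(A2, B_J.T @ beta + b_row * b_off))
assert np.allclose(rho2, np.einsum('ij,ij->i', B_J, np.linalg.solve(A2, B_J.T).T))
assert np.allclose(ld2, np.linalg.slogdet(A2)[1])
```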
At the end of a block, we re-run our variational algorithm in order to update all site parameters γ jointly (within a block, only the γ_i of novel model sites are updated). In practice, a single outer loop iteration suffices for these runs. Importantly, the first outer loop update comes for free, since the model marginals (part of μ_J, ρ_J), u*, and log|A| have been kept valid. Therefore, only a single Lanczos run per block is required. Finally, since variances are underestimated by Lanczos, it may happen that components of ρ_J become negative within a block. Such components are simply removed, and if they correspond to model sites, their marginals are recomputed by solving linear systems at the end of the block. While there is some complexity to our scheme, note that the principal computational primitives are always the same: solving linear systems with A (or simple variants thereof), and variance estimation by Lanczos based on A. This is discussed in more detail in Section 2.4.

4. Related Work

Active learning can be done using a large variety of criteria. For an empirical review and collection of heuristics see (Schein & Ungar, 2007). We use Bayesian active learning, meaning that the scores for inclusion decisions are computed based on the posterior distribution, which seems the approach favoured by decision theory in general. Given that, we can employ a host of different scores, and the particular ones used in our experiments here could certainly be improved upon by heuristic experience with the task. However, the general consensus is that to compute Bayesian scores, such as the information gain, accurately over many different candidates, Bayesian inference (in particular pertaining to the posterior covariance) has to be approximated well.

Our lower bound variational inference approximation has been used many times before (Girolami, 2001; Jaakkola & Jordan, 1997; Palmer et al., 2006). Our scalable double loop algorithm is novel, as is the characterization of the relaxation as a convex problem. Both have been given for the special case of Laplace sites in (Seeger et al., 2009). We extend these results to generalized linear models with arbitrary super-Gaussian sites, and give a generic implementation which can be run as is with any kind of super-Gaussian sites that are either log-concave or for which h_i(γ_i) can be worked out analytically. Moreover, their method can be applied to problems with many inclusions only with the additional technology presented here.

5. Experiments

We use three standard data sets for binary classification (available from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), outlined in the table below. The feature vectors are sparse, and an MVM with the matrix B costs O(#nonzeros).

  Dataset    q        q_+/q_-   n       # nonzeros
  a9a        32,561   0.32      123     451,592
  real-sim   72,201   0.44      20,958  3,709,083
  rcv1       677,399  1.10      42,736  49,556,258
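These data sets are distributed in the sparse LIBSVM text format; the snippet below (our own loading sketch, assuming a local copy of the a9a file) shows that the design matrix is a sparse CSR matrix, for which an MVM indeed touches each nonzero once.

```python
import numpy as np
from sklearn.datasets import load_svmlight_file

B, c = load_svmlight_file("a9a")                 # assumes a local copy of the LIBSVM file "a9a"
q, n = B.shape                                   # B: sparse CSR matrix of feature rows, c: labels
u = np.random.default_rng(0).standard_normal(n)
s = B @ u                                        # MVM, cost O(B.nnz)
r = B.T @ (c * s)                                # adjoint MVM, same cost
print(q, n, B.nnz)
```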
We randomly select 16, 36 and 50 thousand instances for training; the rest is kept for testing. The hyperparameters τ_bern, σ², and τ_lapl were determined on the full datasets. Results are given in Figure 2. We ran sparse logistic regression (with Laplace prior) on a9a only; results for the larger sets will be included in the final version of this paper. As expected, our algorithm runs longer in this case, and is less tolerant w.r.t. larger block sizes K: the Laplace prior site parameters have to be updated in response to new cases, in order to do their job properly. Although sparse classification improves on the Gaussian prior case beyond about 2800 cases, active learning works better with a Gaussian prior for fewer inclusions. This may be due to the fact that the Lanczos variance estimation is exact for q < k, and in general more accurate in the Gaussian prior case. Over all sets, we see clear improvements of active learning with the classifier uncertainty score S_CU over random sampling of data cases. Somewhat surprisingly, the information gain score does much less well in the binary classification case. Since sequential experimental design with this score performs well in other cases (Seeger et al., 2009), a closer analysis of the respective merits of different design scores w.r.t. different statistical tasks is an important point for future research.

[Figure 2. Classification errors for different design scores (information gain, classifier uncertainty), vs. random sampling (results on full training set also shown). Design started after 100, 100, 500, 800 randomly drawn initial cases respectively; all remaining training cases were candidates. Prior variance σ² = 1 in all cases, τ_bern = 1, 1, 3, 3 respectively. k = 80, 80, 750, 750 Lanczos vectors for outer loop updates/candidate scoring. For a9a, we used design blocks of size K = 3, and K = 20 for the others.]

6. Discussion

We have shown that a frequently used variational relaxation to Bayesian inference in super-Gaussian models is convex if and only if the posterior is log-concave. Our double loop inference algorithm decouples the criterion and runs orders of magnitude faster than previous coordinate descent methods. Computational efforts are reduced to fast algorithms of numerical mathematics, exploiting fast MVMs due to the structured matrices X and B. Our generic implementation can be run with any configuration of super-Gaussian, log-concave sites. We have pointed out the crucial role of the Lanczos algorithm for bulk variance estimation in large Gaussian models, an essential step for variational approximate inference methods. Our framework can be used for a multitude of Bayesian decision and experimental design problems, and is most useful for tasks where the choice of good decisions depends on posterior covariance information.

Appendix

Proof of Theorem 1

Let A = X^T X + B^T f(γ) B for now, and φ_1 = log|A|. Then dφ_1 = tr SD, with S = B A^{-1} B^T and D = (diag f'(γ))(diag dγ). Then,

d²φ_1 = −tr SDSD + tr S f''(γ)(dγ)² = tr DSD(p(γ) − S),

where p_i(γ_i) = f_i''(γ_i)/(f_i'(γ_i))². Since S ⪰ 0 (positive semidefinite), S = VV^T for some V, and d²φ_1 = tr (DV)^T (p(γ) − S)(DV). If we show that p(γ) − S ⪰ 0, then for γ_t = γ + t(dγ) we have (d²/dt²) φ_1|_{t=0} = tr M^T (p(γ) − S) M ≥ 0 for any direction dγ, where M = DV, so that φ_1 is convex. We show that f(γ)^{-1} − S ⪰ 0, employing the identity r^T M^{-1} r = max_x (2 r^T x − x^T M x) for M ≻ 0 (positive definite). Now, r^T B A^{-1} B^T r = max_x (2 r^T Bx − x^T A x) ≤ max_{y=Bx} (2 r^T y − y^T f(γ) y), since x^T X^T X x = ‖Xx‖² ≥ 0. Taking the maximum over all y ∈ R^q, we see that r^T S r ≤ max_y (2 r^T y − y^T f(γ) y) = r^T f(γ)^{-1} r, so that f(γ)^{-1} − S ⪰ 0. We are done if p(γ) − f(γ)^{-1} ⪰ 0, which is equivalent to f_i''(γ_i) f_i(γ_i) ≥ (f_i'(γ_i))² for all i, which in turn is equivalent to all log f_i(γ_i) being convex.

Proof of Theorem 2

We focus on a single site i and drop its index. Since the offset b is dealt with separately, assume that t(s) is even itself. Since positive scaling does not influence convexity, assume that σ = 1 in this section only, so that g(s) = g(x) = log t(s), x = s². By conjugate duality, h(γ) = max_{x≥0} f(x, γ), f := −x/γ − 2g(x). We only consider s ≥ 0. Let x* := argmax_{x≥0} f(x, γ) (unique, since g(x) is strictly convex). If γ_0 := sup{γ | f(s², γ) ≤ −2g(0) for all s} (γ_0 = 0 for an empty set), then x* = 0 and h(γ) = −2g(0) for 0 < γ ≤ γ_0, while for γ > γ_0, h is strictly increasing and x* > 0. Since g(x) is convex and decreasing, conjugate duality implies that for every x > 0, there exists some γ with x*(γ) = x. It suffices to show that h is convex at all γ > γ_0, where x* = s*² > 0. We use the notation f_s = ∂f/∂s; functions are evaluated at (x*, γ) if nothing else is said. Now, f_s = −2s*/γ − 2g_s(s*) = 0, so that g_s(s*) = −s*/γ. Next, x ↦ g(x) is twice continuously differentiable, and x* = s*² ≠ 0 at γ. Therefore, f_x = ∂f/∂x is continuously differentiable, and g_{x,x}(x*) > 0 by the strict convexity of g(x). By the implicit function theorem, x*(γ) is continuously differentiable at γ, and since h(γ) = f(x*(γ), γ), h'(γ) exists. Moreover, 0 = (d/dγ) f_x(x*(γ), γ) = f_{x,γ} + f_{x,x} (dx*)/(dγ), so that (dx*)/(dγ) = γ^{-2}/(2 g_{x,x}(x*)) > 0: x*(γ) is increasing. From f_s = 0, we have that h'(γ) = f_γ = s*²/γ² = (g_s(s*))², since g_s(s*) = −s*/γ. Now, g_s(s) is nonincreasing by the concavity of g(s), and g_s(s*) < 0, so that s* ↦ h'(γ) is nondecreasing. Since s*² = x* is increasing in γ, so is s*. Therefore, h'(γ) is nondecreasing, and h(γ) is convex for γ > γ_0.

The concavity of g(s) is necessary as well. Suppose that g_{s,s}(s̃) > 0 for some s̃ > 0. Now, g_s(s*) < 0 whenever s* > 0, and there exists some γ̃ > 0 so that s*(γ̃) = s̃. But if g_{s,s}(s*) > 0 at γ̃, then s* ↦ h'(γ) is decreasing at s* = s̃, and just as above h'(γ) is decreasing at γ̃, so that h is not convex at γ̃.
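Theorem 2 can be illustrated numerically (a toy check of our own, not from the paper): compute h(γ) = −min_{x≥0}(x/γ + 2g(x)) by scalar minimization for a log-concave Bernoulli site and for a Student-t site, which is super-Gaussian but not log-concave (our own choice of counterexample), and inspect discrete second differences of h.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# g(x) = even part of log t(s) as a function of x = s^2 (sigma = 1 here)
g_logistic = lambda x: -np.log(2.0 * np.cosh(0.5 * np.sqrt(x)))   # Bernoulli site (3), tau = 1: log-concave
nu = 2.0
g_student = lambda x: -0.5 * (nu + 1.0) * np.log1p(x / nu)        # Student-t site: not log-concave

def h(g, gamma):
    res = minimize_scalar(lambda x: x / gamma + 2.0 * g(x), bounds=(0.0, 1e7), method='bounded')
    return -res.fun

gammas = np.linspace(0.1, 30.0, 400)
for name, g in [("logistic", g_logistic), ("student-t", g_student)]:
    hv = np.array([h(g, gm) for gm in gammas])
    print(name, "min second difference of h:", np.diff(hv, 2).min())
# logistic: second differences are (numerically) nonnegative, h is convex;
# student-t: clearly negative values appear, h is not convex, as Theorem 2 predicts
```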
Implicit Computation of h_i and h*_i

We focus on a single site i and drop its index. Our method requires the evaluation of

h*(s) = (1/2) (min_γ k(x, γ)) − σ^{-2} b s,   k := (z + x)/γ + h(γ),   x = s²/σ²,

as well as (h*)'(s) and (h*)''(s) (for the inner loop Newton steps). Let γ* = argmin_γ k(x, γ). Assuming for now that h and its derivatives are available, γ* is found by univariate Newton minimization (we find a root of ∂_γ k by first constructing a bracket, then using a robust combination of Newton and bisection steps), where

γ² ∂_γ k = −(z + x) + γ² h'(γ),   γ³ ∂²_γ k = 2(z + x) + γ³ h''(γ).

Now, ∂_γ k = 0 (always evaluated at (x, γ*)), so that (h*)'(s) = σ^{-2}(s/γ* − b). Moreover, 0 = (d/ds) ∂_γ k = k_{γ,x} (2s/σ²) + k_{γ,γ} (dγ*)/(ds), so that

(h*)''(s) = (σ² γ*)^{-1} (1 − γ*^{-1} s (dγ*)/(ds)) = (σ² γ*)^{-1} (1 − 2x/(γ*³ k_{γ,γ}(x, γ*))).

Next, by Fenchel-Legendre duality, h(γ) = −min_{x≥0} l(x, γ), l := x/γ + 2g(x), where g(x) is strictly convex and decreasing. We need methods to evaluate g(x) and its first and second derivatives (g''(x) > 0). The minimizer x* = x*(γ) is found by convex minimization once more, started from the last recently found x* for this site, where γ l(x, γ) should be minimized instead of l(x, γ). Note that x* = 0 iff γ ≤ γ_0 := −1/(2g'(0)), which has to be checked up front. Given x*, we have that h(γ) = −x*/γ − 2g(x*). Since l_x = 0 whenever x* > 0 (always evaluated at (x*, γ)), we have that γ² h'(γ) = −γ² ∂_γ l = x* (this holds even if l_x > 0 and x* = 0). Moreover, if x* > 0 (for γ > γ_0), then (d/dγ) l_x(x*, γ) = 0, so that (dx*)/(dγ) = γ^{-2}/(2g''(x*)), and γ³ h''(γ) = (2γ g''(x*))^{-1} − 2x*. If x* = 0 and l_x > 0, then x*(γ̃) = 0 for γ̃ close to γ, so that h''(γ) = 0. A critical case is x* = 0 and l_x = 0, which happens for γ = γ_0: h''(γ) does not exist at this point in general. This is not a problem for our code, since we employ a robust Newton/bisection search for γ*. If γ > γ_0 but is very close, we approximate x*(γ) by θ_0 (log γ − log γ_0) with θ_0 := −g'(0)/g''(0).

For Bernoulli sites (3), we have that g(x) = −log cosh(v) − log 2, v := (c τ/2) x^{1/2} = σ^{-1}(c τ/2)|s|, so that g'(x) = −C (tanh v)/v and g''(x) = (C/2) x^{-1} ((tanh v)/v + tanh² v − 1), with C = (c τ/2)²/2. Care has to be taken when evaluating these for x → 0. Moreover, γ_0 = 1/(2C) and θ_0 = 3/(2C).
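The generic scheme can be sketched as follows (a simplified toy version of our own: the nested scalar minimizations over x and γ are done with scipy's bounded scalar optimizer rather than the bracketed Newton/bisection search described above, and only h*(s) and (h*)'(s) are returned). The envelope identity (h*)'(s) = σ^{-2}(s/γ* − b) is checked against a finite difference.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def make_logistic_site(tau=1.0, c=1.0, sigma=1.0):
    """g(x) (even part of log t, x = s^2/sigma^2) and offset b for the Bernoulli site (3)."""
    g = lambda x: -np.log(2.0 * np.cosh((c * tau / 2.0) * np.sqrt(x)))
    return g, c * tau * sigma / 2.0

def h_of_gamma(g, gamma, xmax=1e7):
    # h(gamma) = -min_{x >= 0} [x/gamma + 2 g(x)]   (scalar convex problem in x)
    res = minimize_scalar(lambda x: x / gamma + 2.0 * g(x), bounds=(0.0, xmax), method='bounded')
    return -res.fun

def h_star(g, b, s, z, sigma=1.0, gmax=1e4):
    # h*(s) = 0.5 min_gamma [(z + s^2/sigma^2)/gamma + h(gamma)] - b s/sigma^2,
    # with (h*)'(s) = (s/gamma* - b)/sigma^2 at the minimizer gamma*
    x = s**2 / sigma**2
    res = minimize_scalar(lambda gm: (z + x) / gm + h_of_gamma(g, gm),
                          bounds=(1e-8, gmax), method='bounded')
    return 0.5 * res.fun - b * s / sigma**2, (s / res.x - b) / sigma**2

g, b = make_logistic_site()
s, z, eps = 0.8, 0.3, 1e-4
val, grad = h_star(g, b, s, z)
fd = (h_star(g, b, s + eps, z)[0] - h_star(g, b, s - eps, z)[0]) / (2.0 * eps)
print(val, grad, fd)     # analytic and finite-difference derivatives should agree closely
```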
Active Learning Scores

Recall Section 3. For the classifier uncertainty score S_CU, we require Q(c_i = +1) = E_Q[t(s_i; c_i = +1)], which we approximate by Gaussian quadrature. The simple information gain score S_IG is computed as follows. If the candidate i is scored under the label assumption c_i, then the offset is b_i = c_i τ_i σ/2. The lower bound on P(D ∪ {(b_i, c_i)}) is

e^{-h_i(γ_i)/2} E_Q[e^{σ^{-2}(b_i s_i − s_i²/(2γ_i))}] ∝ e^{-φ_i(γ_i)/2}

up to a constant not depending on γ_i, and simple calculations give

φ_i(γ_i) = h_i(γ_i) + log κ_i − σ^{-2} (μ_i + b_i ρ_i)²/(ρ_i κ_i),   κ_i := 1 + ρ_i/γ_i,

where Q(s_i) = N(μ_i, σ² ρ_i). This convex function is minimized just as k(x, γ) in the previous subsection. If Q'(s_i) ∝ Q(s_i) e^{σ^{-2}(b_i s_i − s_i²/(2γ_i))} at the minimizer γ_i*, then

D[Q' ‖ Q] = (1/2) (log κ_i + σ^{-2} ρ_i κ_i^{-2} (b_i − μ_i/γ_i*)² + κ_i^{-1} − 1).

The score S_IG(i) is obtained by averaging over Q(c_i).

References

Girolami, M. (2001). A variational method for learning sparse and overcomplete representations. Neural Computation, 13, 2517–2532.

Jaakkola, T. (1997). Variational methods for inference and estimation in graphical models. Doctoral dissertation, Massachusetts Institute of Technology.

Jaakkola, T., & Jordan, M. (1997). Improving the mean field approximation via the use of mixture distributions. In M. I. Jordan (Ed.), Learning in graphical models. Kluwer.

Palmer, J., Wipf, D., Kreutz-Delgado, K., & Rao, B. (2006). Variational EM algorithms for non-Gaussian latent variable models. Advances in Neural Information Processing Systems 18 (pp. 1059–1066).

Schein, A. I., & Ungar, L. H. (2007). Active learning for logistic regression: An evaluation. Machine Learning, 68, 235–265.

Schneider, M., & Willsky, A. (2001). Krylov subspace estimation. SIAM Journal on Scientific Computing, 22, 1840–1864.

Seeger, M. (2008). Bayesian inference and optimal design for the sparse linear model. Journal of Machine Learning Research, 9, 759–813.

Seeger, M., & Nickisch, H. (2008). Compressed sensing and Bayesian experimental design. International Conference on Machine Learning 25 (pp. 912–919).

Seeger, M., Nickisch, H., Pohmann, R., & Schölkopf, B. (2009). Bayesian experimental design of magnetic resonance imaging sequences. Advances in Neural Information Processing Systems 21 (pp. 1441–1448).

Wipf, D., & Nagarajan, S. (2008). A new view of automatic relevance determination. Advances in Neural Information Processing Systems 20 (pp. 1625–1632).

Yuille, A., & Rangarajan, A. (2003). The concave-convex procedure. Neural Computation, 15, 915–936.