Posterior Consistency of the Silverman g-prior in Bayesian Model Choice

Zhihua Zhang
Department of EECS, University of California, Berkeley, CA, USA

Michael I. Jordan
Departments of EECS and Statistics, University of California, Berkeley, CA, USA

Dit-Yan Yeung
Department of CSE, HKUST, Hong Kong, China

Abstract

Kernel supervised learning methods can be unified by utilizing the tools from regularization theory. The duality between regularization and prior leads to interpreting regularization methods in terms of maximum a posteriori estimation, and it has motivated Bayesian interpretations of kernel methods. In this paper we pursue a Bayesian interpretation of sparsity in the kernel setting by making use of a mixture of a point-mass distribution and a prior that we refer to as "Silverman's g-prior." We provide a theoretical analysis of the posterior consistency of a Bayesian model choice procedure based on this prior. We also establish the asymptotic relationship between this procedure and the Bayesian information criterion.

1 Introduction

We address a supervised learning problem over a set of training data $\{x_i, y_i\}_{i=1}^n$, where $x_i \in \mathcal{X} \subset \mathbb{R}^p$ is a $p$-dimensional input vector and $y_i$ is a univariate response. Using the theory of reproducing kernels, we seek to find a predictive function $f(x)$ from the training data. Suppose $f = u + h \in (\{1\} + \mathcal{H}_K)$, where $\mathcal{H}_K$ is a reproducing kernel Hilbert space (RKHS). The estimation of $f(x)$ is then formulated as a regularization problem of the form

$$\min_{f} \; \frac{1}{n}\sum_{i=1}^n L(y_i, f(x_i)) + \frac{g}{2}\|h\|^2_{\mathcal{H}_K}, \tag{1}$$

where $L(y, f(x))$ is a loss function, $\|h\|^2_{\mathcal{H}_K}$ is the RKHS norm and $g>0$ is the regularization parameter. By the representer theorem [7], the solution of (1) is of the form

$$f(x) = u + \sum_{j=1}^n \beta_j K(x, x_j), \tag{2}$$

where $K(\cdot,\cdot)$ is the kernel function. Noticing that $\|h\|^2_{\mathcal{H}_K} = \sum_{i,j=1}^n K(x_i,x_j)\beta_i\beta_j$ and substituting (2) into (1), we obtain the minimization problem with respect to (w.r.t.) the $\beta_i$ as

$$\min_{u,\,\boldsymbol{\beta}} \; \frac{1}{n}\sum_{i=1}^n L(y_i, f(x_i)) + \frac{g}{2}\boldsymbol{\beta}'\mathbf{K}\boldsymbol{\beta}, \tag{3}$$

where $\mathbf{K} = [K(x_i,x_j)]$ is the $n\times n$ kernel matrix and $\boldsymbol{\beta} = (\beta_1,\ldots,\beta_n)'$ is the vector of regression coefficients.

From the Bayesian standpoint, the role of the regularization term $\frac{g}{2}\boldsymbol{\beta}'\mathbf{K}\boldsymbol{\beta}$ can be captured by assigning a design-dependent prior $N_n(\mathbf{0}, g^{-1}\mathbf{K}^{-1})$ to the regression vector $\boldsymbol{\beta}$. The prior $N_n(\mathbf{0}, \mathbf{K}^{-1})$ was first proposed by [5] in his Bayesian formulation of spline smoothing. Here we refer to the prior $N_n(\mathbf{0}, g^{-1}\mathbf{K}^{-1})$ as the Silverman g-prior, by analogy to the Zellner g-prior [9]. When $\mathbf{K}$ is singular, by analogy to the generalized singular g-prior (gsg-prior) [8], we call $N_n(\mathbf{0}, g^{-1}\mathbf{K}^{-})$, with $\mathbf{K}^{-}$ a generalized inverse of $\mathbf{K}$, the generalized Silverman g-prior.

Given the high dimensionality generally associated with RKHS methods, sparseness has emerged as a significant theme, particularly when computational concerns are taken into account. For example, the number of support vectors in the support vector machine (SVM) is equal to the number of nonzero components of $\boldsymbol{\beta}$. That is, if $\beta_j = 0$, the $j$th input vector is excluded from the basis expansion in (2); otherwise the $j$th input vector is a support vector. We are thus interested in a prior for $\boldsymbol{\beta}$ which allows some components of $\boldsymbol{\beta}$ to be zero. To specify such a prior we first introduce an indicator vector $\boldsymbol{\gamma} = (\gamma_1,\ldots,\gamma_n)'$ such that $\gamma_j = 1$ if $x_j$ is a support vector and $\gamma_j = 0$ if it is not. Let $n_\gamma = \sum_{j=1}^n \gamma_j$ be the number of support vectors, let $\mathbf{K}_\gamma$ be the $n\times n_\gamma$ submatrix of $\mathbf{K}$ consisting of those columns of $\mathbf{K}$ for which $\gamma_j = 1$, and let $\boldsymbol{\beta}_\gamma$ be the corresponding subvector of $\boldsymbol{\beta}$. Accordingly, we let $\boldsymbol{\beta}_\gamma \sim N_{n_\gamma}(\mathbf{0}, g^{-1}\tilde{\mathbf{K}}_\gamma^{-1})$, where $\tilde{\mathbf{K}}_\gamma$ is the $n_\gamma\times n_\gamma$ submatrix of $\mathbf{K}_\gamma$ consisting of those rows of $\mathbf{K}_\gamma$ for which $\gamma_j = 1$.
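To make the indicator-vector notation concrete, the following Python sketch (an illustration, not code from the paper; the RBF kernel choice, the simulated data, and names such as `rbf_kernel`, `K_gamma`, and `K_tilde` are our own assumptions) builds a kernel matrix, extracts $\mathbf{K}_\gamma$ and $\tilde{\mathbf{K}}_\gamma$ for a given $\boldsymbol{\gamma}$, and draws $\boldsymbol{\beta}_\gamma$ from the Silverman g-prior $N_{n_\gamma}(\mathbf{0}, g^{-1}\tilde{\mathbf{K}}_\gamma^{-1})$.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    """n x n RBF kernel matrix with entries exp(-||x_i - x_j||^2 / (2 l^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * lengthscale**2))

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
K = rbf_kernel(X)                       # full n x n kernel matrix

gamma = rng.random(n) < 0.2             # indicator vector: gamma_j = 1 marks a "support vector"
idx = np.flatnonzero(gamma)
n_gamma = idx.size

K_gamma = K[:, idx]                     # n x n_gamma submatrix (selected columns of K)
K_tilde = K[np.ix_(idx, idx)]           # n_gamma x n_gamma submatrix (selected rows and columns)

g = 1.0 / n                             # prior precision scale; the paper later recommends g = 1/n
prior_cov = np.linalg.inv(g * K_tilde)  # Silverman g-prior covariance of beta_gamma (up to sigma^2)
beta_gamma = rng.multivariate_normal(np.zeros(n_gamma), prior_cov)
print(n_gamma, beta_gamma.shape)
```

The nonzero pattern of $\boldsymbol{\beta}$ is thus governed entirely by $\boldsymbol{\gamma}$, which is what turns the sparsity question into a model choice problem below.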
We thus have a Bayesian model choice problem in which a family of models is indexed by an indicator vector $\boldsymbol{\gamma}$. Within the Bayesian framework we can use Bayes factors to choose among these models [3]. In this paper we provide a frequentist theoretical analysis of this Bayesian procedure. In particular, motivated by the work of [1] on the consistency of the Zellner g-prior, we investigate the consistency for model choice of the Silverman g-prior for sparse kernel-based regression.

2 Main Results

Our analysis is based on the following regression model:

$$M_\gamma:\quad \mathbf{y} = u\mathbf{1}_n + \mathbf{K}_\gamma\boldsymbol{\beta}_\gamma + \boldsymbol{\epsilon},\qquad \boldsymbol{\epsilon}\sim N_n(\mathbf{0},\sigma^2\mathbf{I}_n),\qquad \boldsymbol{\beta}_\gamma\,|\,\sigma^2 \sim N_{n_\gamma}\big(\mathbf{0},\, \sigma^2(g\tilde{\mathbf{K}}_\gamma)^{-1}\big), \tag{4}$$

where $\mathbf{y} = (y_1,\ldots,y_n)'$. Here and later, $\mathbf{1}_m$ denotes the $m\times 1$ vector of ones and $\mathbf{I}_m$ denotes the $m\times m$ identity matrix. We compare each model $M_\gamma$ with the null model $M_0$, formulating the model choice problem via the hypotheses $H_0: \boldsymbol{\beta} = \mathbf{0}$ and $H_\gamma: \boldsymbol{\beta}_\gamma \in \mathbb{R}^{n_\gamma}$. Throughout this paper, $n_\gamma$ is always assumed to take a finite value even though $n\to\infty$. Let $\mathbf{K}^*_\gamma = [\mathbf{1}_n, \mathbf{K}_\gamma]$. The following condition is also assumed:

$$\text{For a fixed } n_\gamma < n,\quad \tfrac{1}{n}\mathbf{K}_\gamma^{*\prime}\mathbf{K}^*_\gamma\ \text{ is positive definite and converges to a positive definite matrix as } n\to\infty. \tag{5}$$

Suppose that the sample $\mathbf{y}$ is generated by model $M_{\gamma^*}$ with parameter values $u^*$, $\boldsymbol{\beta}^*_{\gamma^*}$ and $\sigma^2_*$. We formalize the problem of consistency for model choice as follows [1]:

$$\operatorname*{plim}_{n\to\infty}\, p(M_{\gamma^*}|\mathbf{y}) = 1 \quad\text{and}\quad \operatorname*{plim}_{n\to\infty}\, p(M_\gamma|\mathbf{y}) = 0\ \ \text{for all } M_\gamma\neq M_{\gamma^*}, \tag{6}$$

where "plim" denotes convergence in probability and the limit is taken w.r.t. the sampling distribution under the true model $M_{\gamma^*}$.

2.1 A Noninformative Prior for $(u, \sigma^2)$

We first consider the case when $(u,\sigma^2)$ is assigned the following noninformative prior:

$$p(u,\sigma^2) \propto 1/\sigma^2. \tag{7}$$

Moreover, we assume $\mathbf{1}_n'\mathbf{K} = \mathbf{0}$. In this case, we have $\mathbf{1}_n'\mathbf{K}_\gamma = \mathbf{0}$, so that the intercept $u$ may be regarded as a common parameter for both $M_\gamma$ and $M_0$. After some calculations the marginal likelihood is found to be

$$p(\mathbf{y}|M_\gamma) = \frac{\Gamma\!\big(\tfrac{n-1}{2}\big)}{\pi^{\frac{n-1}{2}}\sqrt{n}}\;\|\mathbf{y}-\bar y\mathbf{1}_n\|^{-n+1}\;|\mathbf{Q}_\gamma|^{-\frac12}\,(1-F_\gamma)^{-\frac{n-1}{2}}, \tag{8}$$

where $\bar y = \frac1n\sum_{i=1}^n y_i$, $\mathbf{Q}_\gamma = \mathbf{I}_{n_\gamma} + g^{-1}\tilde{\mathbf{K}}_\gamma^{-1}\mathbf{K}_\gamma'\mathbf{K}_\gamma$ and

$$F_\gamma = \frac{\mathbf{y}'\mathbf{K}_\gamma\big(g\tilde{\mathbf{K}}_\gamma + \mathbf{K}_\gamma'\mathbf{K}_\gamma\big)^{-1}\mathbf{K}_\gamma'\mathbf{y}}{\|\mathbf{y}-\bar y\mathbf{1}_n\|^2}.$$

Let $\mathrm{RSS}_\gamma = (1-R^2_\gamma)\|\mathbf{y}-\bar y\mathbf{1}_n\|^2$ be the residual sum of squares. Here,

$$R^2_\gamma = \frac{\mathbf{y}'\mathbf{K}_\gamma(\mathbf{K}_\gamma'\mathbf{K}_\gamma)^{-1}\mathbf{K}_\gamma'\mathbf{y}}{\|\mathbf{y}-\bar y\mathbf{1}_n\|^2} = \frac{(\mathbf{y}-\bar y\mathbf{1}_n)'\mathbf{K}_\gamma(\mathbf{K}_\gamma'\mathbf{K}_\gamma)^{-1}\mathbf{K}_\gamma'(\mathbf{y}-\bar y\mathbf{1}_n)}{\|\mathbf{y}-\bar y\mathbf{1}_n\|^2}.$$

It is easily proven that, for fixed $n$, $\operatorname*{plim}_{g\to 0}F_\gamma = R^2_\gamma$ and $\operatorname*{plim}_{g\to 0}(1-F_\gamma)\|\mathbf{y}-\bar y\mathbf{1}_n\|^2 = \mathrm{RSS}_\gamma$, where $\mathrm{RSS}_\gamma = \mathbf{y}'(\mathbf{I}_n - \mathbf{H}^*_\gamma)\mathbf{y}$ and $\mathbf{H}^*_\gamma = \mathbf{K}^*_\gamma(\mathbf{K}_\gamma^{*\prime}\mathbf{K}^*_\gamma)^{-1}\mathbf{K}_\gamma^{*\prime}$. As a special case of (8), it is also immediate to obtain the marginal likelihood of the null model as

$$p(\mathbf{y}|M_0) = \frac{\Gamma\!\big(\tfrac{n-1}{2}\big)}{\pi^{\frac{n-1}{2}}\sqrt{n}}\;\|\mathbf{y}-\bar y\mathbf{1}_n\|^{-n+1}.$$

Then the Bayes factor for $M_\gamma$ versus $M_0$ is

$$\mathrm{BF}_{\gamma 0} = |\mathbf{Q}_\gamma|^{-\frac12}\,(1-F_\gamma)^{-\frac{n-1}{2}}.$$

In the limiting case when $g\to 0$ and both $n$ and $n_\gamma$ are fixed, $\mathrm{BF}_{\gamma 0}$ tends to 0. This implies that a large spread of the prior forces the Bayes factor to favor the null model. Thus, as in the case of the Zellner g-prior [4], Bartlett's paradox arises for the Silverman g-prior. The Bayes factor for $M_\gamma$ versus $M_{\gamma^*}$ is given by

$$\mathrm{BF}_{\gamma\gamma^*} = \frac{\mathrm{BF}_{\gamma 0}}{\mathrm{BF}_{\gamma^* 0}} = \frac{|\mathbf{Q}_\gamma|^{-\frac12}\,(1-F_\gamma)^{-\frac{n-1}{2}}}{|\mathbf{Q}_{\gamma^*}|^{-\frac12}\,(1-F_{\gamma^*})^{-\frac{n-1}{2}}}. \tag{9}$$
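As a numerical companion to (8) and (9), the sketch below evaluates $\ln\mathrm{BF}_{\gamma 0} = -\tfrac12\ln|\mathbf{Q}_\gamma| - \tfrac{n-1}{2}\ln(1-F_\gamma)$ directly from $\mathbf{K}_\gamma$, $\tilde{\mathbf{K}}_\gamma$ and $\mathbf{y}$. It is a minimal illustration assuming the expressions for $\mathbf{Q}_\gamma$ and $F_\gamma$ given above; the function and variable names are ours, not the paper's.

```python
import numpy as np

def log_bf_gamma_0(y, K_gamma, K_tilde, g):
    """log BF_{gamma,0} = -0.5*log|Q_gamma| - ((n-1)/2)*log(1 - F_gamma), with
    Q_gamma = I + g^{-1} K_tilde^{-1} K_gamma' K_gamma and
    F_gamma = y' K_gamma (g K_tilde + K_gamma' K_gamma)^{-1} K_gamma' y / ||y - ybar 1||^2."""
    n = y.shape[0]
    yc = y - y.mean()                          # y - ybar * 1_n
    KtK = K_gamma.T @ K_gamma
    A = g * K_tilde + KtK
    Ky = K_gamma.T @ y
    F = Ky @ np.linalg.solve(A, Ky) / (yc @ yc)
    # |Q_gamma| = |g K_tilde + K'K| / (g^{n_gamma} |K_tilde|); both determinants assumed positive
    _, logdet_A = np.linalg.slogdet(A)
    _, logdet_Kt = np.linalg.slogdet(K_tilde)
    n_gamma = K_tilde.shape[0]
    logdet_Q = logdet_A - n_gamma * np.log(g) - logdet_Kt
    return -0.5 * logdet_Q - 0.5 * (n - 1) * np.log(1.0 - F)

# Bartlett's paradox in action: as g -> 0 the log Bayes factor drifts to -infinity,
# favoring the null model, e.g.
# for g in [1e-1, 1e-3, 1e-6]: print(g, log_bf_gamma_0(y, K_gamma, K_tilde, g))
```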
Based on the Bayes factor, we now explore the consistency of the Silverman g-prior. Suppose that the sample $\mathbf{y}$ is generated by model $M_{\gamma^*}$ with parameter values $u^*$, $\boldsymbol{\beta}^*_{\gamma^*}$ and $\sigma^2_*$. Then the consistency property (6) is equivalent to

$$\operatorname*{plim}_{n\to\infty}\mathrm{BF}_{\gamma\gamma^*} = 0 \quad\text{for all } M_\gamma\neq M_{\gamma^*}.$$

Assume that under any model $M_\gamma$ that does not contain $M_{\gamma^*}$, i.e., $M_\gamma\not\supseteq M_{\gamma^*}$,

$$\lim_{n\to\infty}\frac{\boldsymbol{\theta}^{*\prime}\mathbf{K}_{\gamma^*}^{*\prime}(\mathbf{I}_n - \mathbf{H}^*_\gamma)\mathbf{K}^*_{\gamma^*}\boldsymbol{\theta}^*}{n} = c_\gamma \in (0,\infty), \tag{10}$$

where $\boldsymbol{\theta}^* = (u^*, \boldsymbol{\beta}^{*\prime}_{\gamma^*})'$. Note that $\mathbf{I}_n - \mathbf{H}^*_\gamma$ is a symmetric idempotent matrix which projects onto the subspace of $\mathbb{R}^n$ orthogonal to the span of $\mathbf{K}^*_\gamma$. Given that $(\mathbf{I}_n - \mathbf{H}^*_\gamma)\mathbf{1}_n = \mathbf{0}$ and $\mathbf{1}_n'\mathbf{K} = \mathbf{0}$, condition (10) reduces to

$$\lim_{n\to\infty}\frac{\boldsymbol{\beta}^{*\prime}_{\gamma^*}\mathbf{K}_{\gamma^*}'(\mathbf{I}_n - \mathbf{H}_\gamma)\mathbf{K}_{\gamma^*}\boldsymbol{\beta}^*_{\gamma^*}}{n} = c_\gamma\in(0,\infty),$$

where $\mathbf{H}_\gamma = \mathbf{K}_\gamma(\mathbf{K}_\gamma'\mathbf{K}_\gamma)^{-1}\mathbf{K}_\gamma'$. We now have the following theorem, whose proof is given in Sec. 3.

Theorem 1 Consider the regression model (4) with the noninformative prior for $(u,\sigma^2)$ in (7). Assume that conditions (5) and (10) are satisfied, and assume that $g$ can be written in the form $g = \frac{w_1(n_\gamma)}{w_2(n)}$ with

$$\lim_{n\to\infty} w_2(n) = \infty \quad\text{and}\quad \lim_{n\to\infty}\frac{w_2'(n)}{w_2(n)} = 0 \tag{11}$$

for particular choices of functions $w_1$ and $w_2$, where $w_2$ is differentiable and $w_2'(n)$ is its first derivative w.r.t. $n$. When the true model $M_{\gamma^*}$ is not the null model, i.e., $M_{\gamma^*}\neq M_0$, the posterior probabilities are consistent for model choice.

Theorem 1 can provide an empirical methodology for setting $g$. For example, it is clear that $g = 1/n$, where $w_1(n_\gamma) = 1$ and $w_2(n) = n$, satisfies condition (11).

It is interesting to consider the (asymptotic) relationship between the Bayes factor and the Bayesian information (or Schwarz) criterion (BIC) in our setting. Given two models $M_\gamma$ and $M_{\gamma^*}$, the difference between the BICs of these two models is given by

$$S_{\gamma\gamma^*} = \frac{n}{2}\ln\frac{\mathrm{RSS}_{\gamma^*}}{\mathrm{RSS}_\gamma} + \frac{n_{\gamma^*}-n_\gamma}{2}\ln(n).$$

We thus obtain the following asymptotic relationship (the proof is given in Sec. 3):

Theorem 2 Under the regression model and the conditions in Theorem 1, we have

$$\operatorname*{plim}_{n\to\infty}\frac{\ln\mathrm{BF}_{\gamma\gamma^*}}{S_{\gamma\gamma^*} + \frac{n_{\gamma^*}-n_\gamma}{2}\ln w_2(n)} = 1.$$

Furthermore, if $M_{\gamma^*}$ is not nested within $M_\gamma$, then $\operatorname*{plim}_{n\to\infty}\frac{\ln\mathrm{BF}_{\gamma\gamma^*}}{S_{\gamma\gamma^*}} = 1$. Here the probability limits are taken w.r.t. the model $M_{\gamma^*}$.

2.2 A Natural Conjugate Prior for $(u, \sigma^2)$

In this section, we analyze consistency for model choice under a different prior for $(u,\sigma^2)$, namely the standard conjugate prior:

$$p(u,\sigma^2) = N\big(u\,|\,0,\ \sigma^2\tau^{-1}\big)\,\mathrm{Ga}\big(\sigma^{-2}\,|\,a_\sigma/2,\ b_\sigma/2\big), \tag{12}$$

where $\mathrm{Ga}(u|a,b)$ is the Gamma distribution:

$$p(u) = \frac{b^a}{\Gamma(a)}\,u^{a-1}\exp(-bu),\qquad a>0,\ b>0. \tag{13}$$

We further assume that $u$ and $\boldsymbol{\beta}_\gamma$ are independent. Then $(u,\boldsymbol{\beta}_\gamma')' \sim N_{n_\gamma+1}(\mathbf{0}, \sigma^2\boldsymbol{\Sigma}_\gamma^{-1})$ with

$$\boldsymbol{\Sigma}_\gamma = \begin{bmatrix}\tau & \mathbf{0}\\ \mathbf{0} & g\tilde{\mathbf{K}}_\gamma\end{bmatrix}.$$

The marginal likelihood of model $M_\gamma$ is thus

$$p(\mathbf{y}|M_\gamma) = \frac{\Gamma\!\big(\tfrac{n+a_\sigma}{2}\big)\,b_\sigma^{a_\sigma/2}}{\Gamma\!\big(\tfrac{a_\sigma}{2}\big)\,\pi^{n/2}}\;|\mathbf{M}_\gamma|^{-\frac12}\,\big(b_\sigma + \mathbf{y}'\mathbf{M}_\gamma^{-1}\mathbf{y}\big)^{-\frac{a_\sigma+n}{2}}, \tag{14}$$

where $\mathbf{M}_\gamma = \mathbf{I}_n + \mathbf{K}^*_\gamma\boldsymbol{\Sigma}_\gamma^{-1}\mathbf{K}_\gamma^{*\prime}$. The Bayes factor for $M_\gamma$ versus $M_{\gamma^*}$ is given by

$$\mathrm{BF}_{\gamma\gamma^*} = \frac{|\mathbf{M}_{\gamma^*}|^{\frac12}}{|\mathbf{M}_\gamma|^{\frac12}}\left(\frac{b_\sigma + \mathbf{y}'\mathbf{M}_{\gamma^*}^{-1}\mathbf{y}}{b_\sigma + \mathbf{y}'\mathbf{M}_\gamma^{-1}\mathbf{y}}\right)^{\frac{a_\sigma+n}{2}}.$$

Because $\mathbf{M}_\gamma^{-1} = \mathbf{I}_n - \mathbf{K}^*_\gamma\boldsymbol{\Omega}_\gamma^{-1}\mathbf{K}_\gamma^{*\prime}$ and $|\mathbf{M}_\gamma| = |\boldsymbol{\Omega}_\gamma|\,|\boldsymbol{\Sigma}_\gamma|^{-1} = \tau^{-1}g^{-n_\gamma}|\tilde{\mathbf{K}}_\gamma|^{-1}|\boldsymbol{\Omega}_\gamma|$, where $\boldsymbol{\Omega}_\gamma = \mathbf{K}_\gamma^{*\prime}\mathbf{K}^*_\gamma + \boldsymbol{\Sigma}_\gamma$, we have

$$\mathrm{BF}_{\gamma\gamma^*} = \frac{g^{n_\gamma/2}\,|\tilde{\mathbf{K}}_\gamma|^{1/2}\,|\boldsymbol{\Omega}_{\gamma^*}|^{1/2}}{g^{n_{\gamma^*}/2}\,|\tilde{\mathbf{K}}_{\gamma^*}|^{1/2}\,|\boldsymbol{\Omega}_\gamma|^{1/2}}\left(\frac{b_\sigma + \mathbf{y}'(\mathbf{I}_n - \mathbf{K}^*_{\gamma^*}\boldsymbol{\Omega}_{\gamma^*}^{-1}\mathbf{K}_{\gamma^*}^{*\prime})\mathbf{y}}{b_\sigma + \mathbf{y}'(\mathbf{I}_n - \mathbf{K}^*_\gamma\boldsymbol{\Omega}_\gamma^{-1}\mathbf{K}_\gamma^{*\prime})\mathbf{y}}\right)^{\frac{a_\sigma+n}{2}}.$$

Theorem 3 Consider the regression model (4) with the conjugate prior for $(u,\sigma^2)$ in (12). Assume that conditions (5) and (10) are satisfied and that $g$ takes the form in (11) with $w_1(n_\gamma)$ being a decreasing function. When the true model $M_{\gamma^*}$ is not the null model, i.e., $M_{\gamma^*}\neq M_0$, the posterior probabilities are consistent for model choice.

Note the difference between Theorem 1 and Theorem 3: in the latter theorem $w_1(n_\gamma)$ is required to be a decreasing function of $n_\gamma$. Thanks to the fact that $g = w_1(n_\gamma)/w_2(n)$, such a condition is equivalent to assuming that $g$ is a decreasing function of $n_\gamma$. Again, $g = 1/n$ satisfies these conditions. Similarly to Theorem 2, we also have:

Theorem 4 Under the regression model and the conditions in Theorem 3, we have

$$\operatorname*{plim}_{n\to\infty}\frac{\ln\mathrm{BF}_{\gamma\gamma^*}}{S_{\gamma\gamma^*} + \frac{n_{\gamma^*}-n_\gamma}{2}\ln w_2(n)} = 1.$$

Furthermore, if $M_{\gamma^*}$ is not nested within $M_\gamma$, then $\operatorname*{plim}_{n\to\infty}\frac{\ln\mathrm{BF}_{\gamma\gamma^*}}{S_{\gamma\gamma^*}} = 1$. Here the probability limits are taken w.r.t. the model $M_{\gamma^*}$.
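Before turning to the proofs, the conjugate-prior marginal likelihood can also be evaluated in a few lines. The sketch below assumes the reconstructed form of (14) above; the hyperparameter names `tau`, `a_sig`, `b_sig` and the helper name `log_marginal_conjugate` are chosen for illustration and are not identifiers from the paper.

```python
import numpy as np
from math import lgamma, log, pi

def log_marginal_conjugate(y, K_gamma, K_tilde, g, tau=1.0, a_sig=1.0, b_sig=1.0):
    """log p(y | M_gamma) under the conjugate prior (12), using
    M_gamma = I_n + K*_gamma Sigma_gamma^{-1} K*_gamma' with Sigma_gamma = diag(tau, g * K_tilde)."""
    n = y.shape[0]
    K_star = np.column_stack([np.ones(n), K_gamma])   # K*_gamma = [1_n, K_gamma]
    m = K_star.shape[1]
    Sigma = np.zeros((m, m))
    Sigma[0, 0] = tau
    Sigma[1:, 1:] = g * K_tilde
    M = np.eye(n) + K_star @ np.linalg.solve(Sigma, K_star.T)
    _, logdet_M = np.linalg.slogdet(M)                # M_gamma is positive definite
    quad = float(y @ np.linalg.solve(M, y))           # y' M_gamma^{-1} y
    return (lgamma((n + a_sig) / 2.0) - lgamma(a_sig / 2.0)
            + 0.5 * a_sig * log(b_sig) - 0.5 * n * log(pi)
            - 0.5 * logdet_M - 0.5 * (a_sig + n) * log(b_sig + quad))
```

The Bayes factor $\mathrm{BF}_{\gamma\gamma^*}$ is then the exponential of the difference of two such log marginal likelihoods.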
3 Proofs

In order to prove these theorems, we first give the following lemmas.

Lemma 1 Let $\mathbf{A} = \begin{bmatrix}\mathbf{A}_{11} & \mathbf{A}_{12}\\ \mathbf{A}_{21} & \mathbf{A}_{22}\end{bmatrix}$ be symmetric and positive definite, and let $\mathbf{B} = \begin{bmatrix}\mathbf{A}_{11}^{-1} & \mathbf{0}\\ \mathbf{0} & \mathbf{0}\end{bmatrix}$ have the same size as $\mathbf{A}$. Then $\mathbf{A}^{-1} - \mathbf{B}$ is positive semidefinite.

Proof The proof follows readily once we express $\mathbf{A}^{-1}$ and $\mathbf{B}$ as

$$\mathbf{A}^{-1} = \begin{bmatrix}\mathbf{I} & -\mathbf{A}_{11}^{-1}\mathbf{A}_{12}\\ \mathbf{0} & \mathbf{I}\end{bmatrix}\begin{bmatrix}\mathbf{A}_{11}^{-1} & \mathbf{0}\\ \mathbf{0} & \mathbf{A}_{22\cdot 1}^{-1}\end{bmatrix}\begin{bmatrix}\mathbf{I} & \mathbf{0}\\ -\mathbf{A}_{21}\mathbf{A}_{11}^{-1} & \mathbf{I}\end{bmatrix},\qquad
\mathbf{B} = \begin{bmatrix}\mathbf{I} & -\mathbf{A}_{11}^{-1}\mathbf{A}_{12}\\ \mathbf{0} & \mathbf{I}\end{bmatrix}\begin{bmatrix}\mathbf{A}_{11}^{-1} & \mathbf{0}\\ \mathbf{0} & \mathbf{0}\end{bmatrix}\begin{bmatrix}\mathbf{I} & \mathbf{0}\\ -\mathbf{A}_{21}\mathbf{A}_{11}^{-1} & \mathbf{I}\end{bmatrix},$$

where $\mathbf{A}_{22\cdot 1} = \mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}$ is also positive definite.

The following two lemmas were presented by [1].

Lemma 2 Under the sampling model $M_{\gamma^*}$: (i) if $M_{\gamma^*}$ is nested within or equal to a model $M_\gamma$, i.e., $M_{\gamma^*}\subseteq M_\gamma$, then

$$\operatorname*{plim}_{n\to\infty}\frac{\mathrm{RSS}_\gamma}{n} = \sigma^2_*;$$

and (ii) for any model $M_\gamma$ that does not contain $M_{\gamma^*}$, if (10) is satisfied, then

$$\operatorname*{plim}_{n\to\infty}\frac{\mathrm{RSS}_\gamma}{n} = \sigma^2_* + c_\gamma.$$

Lemma 3 Under the sampling model $M_{\gamma^*}$, if $M_{\gamma^*}$ is nested within a model $M_\gamma$, i.e., $M_{\gamma^*}\subseteq M_\gamma$, then $n\ln\frac{\mathrm{RSS}_{\gamma^*}}{\mathrm{RSS}_\gamma}\xrightarrow{\;d\;}\chi^2_{n_\gamma - n_{\gamma^*}}$ as $n\to\infty$, where $\xrightarrow{\;d\;}$ denotes convergence in distribution.

Lemma 4 Under the regression model (4), if $\lim_{n\to\infty} g(n) = 0$ and condition (5) is satisfied, then

$$\operatorname*{plim}_{n\to\infty}\Big[(1-F_\gamma)\|\mathbf{y}-\bar y\mathbf{1}_n\|^2 - \mathrm{RSS}_\gamma\Big] = 0.$$

Proof It is easy to compute

$$(1-F_\gamma)\|\mathbf{y}-\bar y\mathbf{1}_n\|^2 - \mathrm{RSS}_\gamma = \mathbf{y}'\mathbf{K}_\gamma\big[(\mathbf{K}_\gamma'\mathbf{K}_\gamma)^{-1} - (\mathbf{K}_\gamma'\mathbf{K}_\gamma + g(n)\tilde{\mathbf{K}}_\gamma)^{-1}\big]\mathbf{K}_\gamma'\mathbf{y}.$$

Since both $\mathbf{K}_\gamma'\mathbf{K}_\gamma/n$ and $\tilde{\mathbf{K}}_\gamma$ are positive definite, there exist an $n_\gamma\times n_\gamma$ nonsingular matrix $\mathbf{A}_n$ and an $n_\gamma\times n_\gamma$ positive diagonal matrix $\boldsymbol{\Lambda}_n$ such that $\mathbf{K}_\gamma'\mathbf{K}_\gamma/n = \mathbf{A}_n\boldsymbol{\Lambda}_n\mathbf{A}_n'$ and $\tilde{\mathbf{K}}_\gamma = \mathbf{A}_n\mathbf{A}_n'$. Letting $\mathbf{z} = \sigma^{-1}(n\boldsymbol{\Lambda}_n)^{-1/2}\mathbf{A}_n^{-1}\mathbf{K}_\gamma'\mathbf{y}$, we have $\mathbf{z}\sim N_{n_\gamma}\big(\sigma^{-1}(n\boldsymbol{\Lambda}_n)^{1/2}\mathbf{A}_n'\boldsymbol{\beta}_\gamma,\ \mathbf{I}_{n_\gamma}\big)$ and

$$f(\mathbf{z}) \doteq \frac{(1-F_\gamma)\|\mathbf{y}-\bar y\mathbf{1}_n\|^2 - \mathrm{RSS}_\gamma}{\sigma^2} = \mathbf{z}'\mathbf{z} - \mathbf{z}'\,n\boldsymbol{\Lambda}_n\big(n\boldsymbol{\Lambda}_n + g(n)\mathbf{I}_{n_\gamma}\big)^{-1}\mathbf{z} = \sum_{j=1}^{n_\gamma}\frac{g(n)}{n\lambda_j(n)+g(n)}\,z_j^2.$$

Note that $z_j^2$ follows a noncentral chi-square distribution, $\chi^2(1, v_j)$, with $v_j = n\lambda_j(n)\big(\mathbf{a}_j(n)'\boldsymbol{\beta}_\gamma\big)^2/\sigma^2$, where $\lambda_j(n)>0$ is the $j$th diagonal element of $\boldsymbol{\Lambda}_n$ and $\mathbf{a}_j(n)$ is the $j$th column of $\mathbf{A}_n$. We thus have $\mathrm{E}(z_j^2) = 1 + v_j$ and $\mathrm{Var}(z_j^2) = 2(1+2v_j)$. It follows from condition (5) that

$$\lim_{n\to\infty}\mathbf{K}_\gamma'\mathbf{K}_\gamma/n = \lim_{n\to\infty}\mathbf{A}_n\boldsymbol{\Lambda}_n\mathbf{A}_n' = \mathbf{A}\boldsymbol{\Lambda}\mathbf{A}',$$

where $\mathbf{A}$ is nonsingular and $\boldsymbol{\Lambda}$ is a diagonal matrix with positive diagonal elements, and both are independent of $n$. Hence,

$$\lim_{n\to\infty}\mathrm{E}\Big[\frac{g(n)}{n\lambda_j(n)+g(n)}\,z_j^2\Big] = 0\quad\text{and}\quad\lim_{n\to\infty}\mathrm{Var}\Big[\frac{g(n)}{n\lambda_j(n)+g(n)}\,z_j^2\Big] = 0.$$

We thus have $\operatorname*{plim}_{n\to\infty}f(\mathbf{z}) = 0$. The proof is completed.

Lemma 5 Assume that $M_{\gamma^*}$ is nested within $M_\gamma$ and $g$ is a decreasing function of $n_\gamma$. Then

$$\mathbf{y}'(\mathbf{I}_n - \mathbf{K}^*_{\gamma^*}\boldsymbol{\Omega}_{\gamma^*}^{-1}\mathbf{K}_{\gamma^*}^{*\prime})\mathbf{y} \;\geq\; \mathbf{y}'(\mathbf{I}_n - \mathbf{K}^*_\gamma\boldsymbol{\Omega}_\gamma^{-1}\mathbf{K}_\gamma^{*\prime})\mathbf{y}.$$

Proof Since $M_{\gamma^*}$ is nested within $M_\gamma$, we express $\mathbf{K}^*_\gamma = [\mathbf{K}^*_{\gamma^*}, \mathbf{K}_2]$ without loss of generality. We now write $\boldsymbol{\Sigma}_\gamma = \begin{bmatrix}\boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12}\\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22}\end{bmatrix}$, where $\boldsymbol{\Sigma}_{11}$ is of size $(n_{\gamma^*}{+}1)\times(n_{\gamma^*}{+}1)$. Hence, we have

$$\boldsymbol{\Omega}_\gamma = \begin{bmatrix}\mathbf{K}_{\gamma^*}^{*\prime}\mathbf{K}^*_{\gamma^*} + \boldsymbol{\Sigma}_{11} & \mathbf{K}_{\gamma^*}^{*\prime}\mathbf{K}_2 + \boldsymbol{\Sigma}_{12}\\ \mathbf{K}_2'\mathbf{K}^*_{\gamma^*} + \boldsymbol{\Sigma}_{21} & \mathbf{K}_2'\mathbf{K}_2 + \boldsymbol{\Sigma}_{22}\end{bmatrix}.$$

Because $0 < g_\gamma \leq g_{\gamma^*}$, where $g_\gamma = w_1(n_\gamma)/w_2(n)$ and $g_{\gamma^*} = w_1(n_{\gamma^*})/w_2(n)$ denote the values of $g$ under the two models,

$$\mathbf{K}_{\gamma^*}^{*\prime}\mathbf{K}^*_{\gamma^*} + \boldsymbol{\Sigma}_{\gamma^*} - \big(\mathbf{K}_{\gamma^*}^{*\prime}\mathbf{K}^*_{\gamma^*} + \boldsymbol{\Sigma}_{11}\big) = \begin{bmatrix}0 & \mathbf{0}\\ \mathbf{0} & (g_{\gamma^*}-g_\gamma)\tilde{\mathbf{K}}_{\gamma^*}\end{bmatrix}$$

is positive semidefinite. Consequently, $(\mathbf{K}_{\gamma^*}^{*\prime}\mathbf{K}^*_{\gamma^*}+\boldsymbol{\Sigma}_{11})^{-1} - (\mathbf{K}_{\gamma^*}^{*\prime}\mathbf{K}^*_{\gamma^*}+\boldsymbol{\Sigma}_{\gamma^*})^{-1}$ is positive semidefinite. It follows from Lemma 1 that

$$\boldsymbol{\Omega}_\gamma^{-1} - \begin{bmatrix}(\mathbf{K}_{\gamma^*}^{*\prime}\mathbf{K}^*_{\gamma^*}+\boldsymbol{\Sigma}_{11})^{-1} & \mathbf{0}\\ \mathbf{0} & \mathbf{0}\end{bmatrix}$$

is also positive semidefinite. We thus have

$$\mathbf{y}'(\mathbf{I}_n - \mathbf{K}^*_{\gamma^*}\boldsymbol{\Omega}_{\gamma^*}^{-1}\mathbf{K}_{\gamma^*}^{*\prime})\mathbf{y} - \mathbf{y}'(\mathbf{I}_n - \mathbf{K}^*_\gamma\boldsymbol{\Omega}_\gamma^{-1}\mathbf{K}_\gamma^{*\prime})\mathbf{y} = \mathbf{y}'\mathbf{K}^*_\gamma\left[\boldsymbol{\Omega}_\gamma^{-1} - \begin{bmatrix}(\mathbf{K}_{\gamma^*}^{*\prime}\mathbf{K}^*_{\gamma^*}+\boldsymbol{\Sigma}_{\gamma^*})^{-1} & \mathbf{0}\\ \mathbf{0} & \mathbf{0}\end{bmatrix}\right]\mathbf{K}_\gamma^{*\prime}\mathbf{y} \;\geq\; 0.$$
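Lemma 4 can also be illustrated numerically. The simulation below (an illustration under assumed settings, not an experiment from the paper) draws $\mathbf{y}$ from model (4) with a toy design standing in for $\mathbf{K}_\gamma$ and tracks the gap $(1-F_\gamma)\|\mathbf{y}-\bar y\mathbf{1}_n\|^2 - \mathrm{RSS}_\gamma$ as $g$ shrinks; this is the fixed-$n$ version of the statement, whereas Lemma 4 itself concerns $g = g(n)\to 0$ as $n\to\infty$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_gamma, sigma = 400, 5, 0.5

# toy design playing the role of K_gamma; columns centered so that 1_n' K_gamma = 0
K_gamma = rng.normal(size=(n, n_gamma))
K_gamma -= K_gamma.mean(axis=0)
B = rng.normal(size=(n_gamma, n_gamma))
K_tilde = B @ B.T + n_gamma * np.eye(n_gamma)      # any fixed positive definite stand-in for K_tilde

beta = rng.normal(size=n_gamma)
y = 2.0 + K_gamma @ beta + sigma * rng.normal(size=n)

yc = y - y.mean()
KtK = K_gamma.T @ K_gamma
Ky = K_gamma.T @ y
rss = yc @ yc - Ky @ np.linalg.solve(KtK, Ky)      # RSS_gamma = (1 - R^2_gamma) ||y - ybar 1||^2

for g in [1.0, 1e-1, 1e-2, 1e-3]:
    F = Ky @ np.linalg.solve(g * K_tilde + KtK, Ky) / (yc @ yc)
    gap = (1.0 - F) * (yc @ yc) - rss              # nonnegative, and -> 0 as g -> 0
    print(f"g = {g:7.0e}   gap = {gap:.6f}")
```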
3.1 Proof of Theorem 1

We now prove Theorem 1. Consider that

$$\ln\mathrm{BF}_{\gamma\gamma^*} = \frac12\ln\frac{|\mathbf{Q}_{\gamma^*}|}{|\mathbf{Q}_\gamma|} + \frac{n-1}{2}\ln\frac{1-F_{\gamma^*}}{1-F_\gamma}.$$

Because

$$|\mathbf{Q}_\gamma|^{-\frac12} = \frac{g^{n_\gamma/2}\,|\tilde{\mathbf{K}}_\gamma|^{1/2}}{|g\tilde{\mathbf{K}}_\gamma + \mathbf{K}_\gamma'\mathbf{K}_\gamma|^{1/2}},$$

we have

$$\ln\frac{|\mathbf{Q}_{\gamma^*}|}{|\mathbf{Q}_\gamma|} = \ln\frac{w_1(n_\gamma)^{n_\gamma}\,|\tilde{\mathbf{K}}_\gamma|}{w_1(n_{\gamma^*})^{n_{\gamma^*}}\,|\tilde{\mathbf{K}}_{\gamma^*}|} + \ln\frac{\big|\tfrac1n\mathbf{K}_{\gamma^*}'\mathbf{K}_{\gamma^*} + \tfrac{w_1(n_{\gamma^*})}{nw_2(n)}\tilde{\mathbf{K}}_{\gamma^*}\big|}{\big|\tfrac1n\mathbf{K}_\gamma'\mathbf{K}_\gamma + \tfrac{w_1(n_\gamma)}{nw_2(n)}\tilde{\mathbf{K}}_\gamma\big|} + (n_{\gamma^*}-n_\gamma)\ln\big(nw_2(n)\big).$$

Because

$$\lim_{n\to\infty}\ln\frac{\big|\tfrac1n\mathbf{K}_{\gamma^*}'\mathbf{K}_{\gamma^*} + \tfrac{w_1(n_{\gamma^*})}{nw_2(n)}\tilde{\mathbf{K}}_{\gamma^*}\big|}{\big|\tfrac1n\mathbf{K}_\gamma'\mathbf{K}_\gamma + \tfrac{w_1(n_\gamma)}{nw_2(n)}\tilde{\mathbf{K}}_\gamma\big|} = \lim_{n\to\infty}\ln\frac{\big|\tfrac1n\mathbf{K}_{\gamma^*}'\mathbf{K}_{\gamma^*}\big|}{\big|\tfrac1n\mathbf{K}_\gamma'\mathbf{K}_\gamma\big|}\in(-\infty,\infty),$$

it is easily proven that

$$\lim_{n\to\infty}\frac12\ln\frac{|\mathbf{Q}_{\gamma^*}|}{|\mathbf{Q}_\gamma|} = \begin{cases}+\infty & n_\gamma < n_{\gamma^*},\\ -\infty & n_\gamma > n_{\gamma^*},\\ \mathrm{const}_\gamma & n_\gamma = n_{\gamma^*},\end{cases}\tag{15}$$

where $\mathrm{const}_\gamma = \frac12\lim_{n\to\infty}\ln\frac{|\frac1n\mathbf{K}_{\gamma^*}'\mathbf{K}_{\gamma^*}|}{|\frac1n\mathbf{K}_\gamma'\mathbf{K}_\gamma|} + \frac12\ln\frac{|\tilde{\mathbf{K}}_\gamma|}{|\tilde{\mathbf{K}}_{\gamma^*}|}$. According to Lemma 4, we also have

$$\operatorname*{plim}_{n\to\infty}\frac{n-1}{2}\ln\frac{1-F_{\gamma^*}}{1-F_\gamma} = \operatorname*{plim}_{n\to\infty}\frac{n-1}{2}\ln\frac{(1-F_{\gamma^*})\|\mathbf{y}-\bar y\mathbf{1}_n\|^2}{(1-F_\gamma)\|\mathbf{y}-\bar y\mathbf{1}_n\|^2} = \operatorname*{plim}_{n\to\infty}\frac{n-1}{2}\ln\frac{\mathrm{RSS}_{\gamma^*}}{\mathrm{RSS}_\gamma}.$$

Now consider the following two cases:

(a) $M_{\gamma^*}$ is not nested within $M_\gamma$: From Lemma 2, we obtain

$$\operatorname*{plim}_{n\to\infty}\ln\frac{\mathrm{RSS}_{\gamma^*}}{\mathrm{RSS}_\gamma} = \operatorname*{plim}_{n\to\infty}\ln\frac{\mathrm{RSS}_{\gamma^*}/n}{\mathrm{RSS}_\gamma/n} = \ln\frac{\sigma^2_*}{\sigma^2_*+c_\gamma}.$$

Moreover, we have the following limit

$$\lim_{n\to\infty}\Big[\frac{n-1}{2}\ln\frac{\sigma^2_*}{\sigma^2_*+c_\gamma} + \frac{n_{\gamma^*}-n_\gamma}{2}\ln\big(nw_2(n)\big)\Big] = -\infty,$$

due to $\lim_{n\to\infty}\frac{1}{n-1}\ln(nw_2(n)) = \lim_{n\to\infty}\frac{w_2(n)+nw_2'(n)}{nw_2(n)} = 0$ and $\frac{\sigma^2_*}{\sigma^2_*+c_\gamma} < 1$. This implies that $\operatorname*{plim}_{n\to\infty}\ln\mathrm{BF}_{\gamma\gamma^*} = -\infty$. Thus we obtain $\operatorname*{plim}_{n\to\infty}\mathrm{BF}_{\gamma\gamma^*} = 0$.

(b) $M_{\gamma^*}$ is nested within $M_\gamma$: We always have $n_\gamma > n_{\gamma^*}$. By Lemma 3, we have $(n-1)\ln(\mathrm{RSS}_{\gamma^*}/\mathrm{RSS}_\gamma)\xrightarrow{\;d\;}\chi^2_{n_\gamma-n_{\gamma^*}}$ as $n\to\infty$. Hence, $(\mathrm{RSS}_{\gamma^*}/\mathrm{RSS}_\gamma)^{(n-1)/2}\xrightarrow{\;d\;}\exp\big(\chi^2_{n_\gamma-n_{\gamma^*}}/2\big)$. Combining this result with (15) leads to a zero limit for $\mathrm{BF}_{\gamma\gamma^*}$.

3.2 Proof of Theorem 2

Using the same notation as in the proof of Theorem 1, we have

$$C_n \doteq \frac{\ln\mathrm{BF}_{\gamma\gamma^*}}{S_{\gamma\gamma^*} + \frac{n_{\gamma^*}-n_\gamma}{2}\ln w_2(n)} = \frac{\frac{n-1}{2}\ln\frac{1-F_{\gamma^*}}{1-F_\gamma} + \frac{n_{\gamma^*}-n_\gamma}{2}\ln\big(nw_2(n)\big) + \mathrm{Const}_n}{\frac{n}{2}\ln\frac{\mathrm{RSS}_{\gamma^*}}{\mathrm{RSS}_\gamma} + \frac{n_{\gamma^*}-n_\gamma}{2}\ln\big(nw_2(n)\big)}.$$

(a) $M_{\gamma^*}$ is not nested within $M_\gamma$: From Lemmas 2 and 4, we obtain

$$\operatorname*{plim}_{n\to\infty} C_n = \frac{\ln\frac{\sigma^2_*}{\sigma^2_*+c_\gamma} + \lim_{n\to\infty}\frac{n_{\gamma^*}-n_\gamma}{n}\ln\big(nw_2(n)\big)}{\ln\frac{\sigma^2_*}{\sigma^2_*+c_\gamma} + \lim_{n\to\infty}\frac{n_{\gamma^*}-n_\gamma}{n}\ln\big(nw_2(n)\big)} = 1.$$

In this case, we also have

$$\operatorname*{plim}_{n\to\infty}\frac{\ln\mathrm{BF}_{\gamma\gamma^*}}{S_{\gamma\gamma^*}} = \frac{\ln\frac{\sigma^2_*}{\sigma^2_*+c_\gamma} + \lim_{n\to\infty}\frac{n_{\gamma^*}-n_\gamma}{n}\ln\big(nw_2(n)\big)}{\ln\frac{\sigma^2_*}{\sigma^2_*+c_\gamma} + \lim_{n\to\infty}\frac{n_{\gamma^*}-n_\gamma}{n}\ln n} = 1.$$

(b) $M_{\gamma^*}$ is nested within $M_\gamma$: We obtain

$$\operatorname*{plim}_{n\to\infty} C_n = \operatorname*{plim}_{n\to\infty}\frac{(n-1)\ln\frac{1-F_{\gamma^*}}{1-F_\gamma} + (n_{\gamma^*}-n_\gamma)\ln\big(nw_2(n)\big) + 2\,\mathrm{Const}_n}{n\ln\frac{\mathrm{RSS}_{\gamma^*}}{\mathrm{RSS}_\gamma} + (n_{\gamma^*}-n_\gamma)\ln\big(nw_2(n)\big)} = 1,$$

due to $n_\gamma > n_{\gamma^*}$ and $n\ln(\mathrm{RSS}_{\gamma^*}/\mathrm{RSS}_\gamma)\xrightarrow{\;d\;}\chi^2_{n_\gamma-n_{\gamma^*}}$.

3.3 Proof of Theorem 3

We now sketch the proof of Theorem 3. For the case that $M_{\gamma^*}$ is not nested within $M_\gamma$, the proof is similar to that of Theorem 1. When $M_{\gamma^*}$ is nested within $M_\gamma$, Lemma 5 shows the following relationship:

$$\ln\frac{b_\sigma + \mathbf{y}'(\mathbf{I}_n - \mathbf{K}^*_{\gamma^*}\boldsymbol{\Omega}_{\gamma^*}^{-1}\mathbf{K}_{\gamma^*}^{*\prime})\mathbf{y}}{b_\sigma + \mathbf{y}'(\mathbf{I}_n - \mathbf{K}^*_\gamma\boldsymbol{\Omega}_\gamma^{-1}\mathbf{K}_\gamma^{*\prime})\mathbf{y}} \;\leq\; \ln\frac{\mathbf{y}'(\mathbf{I}_n - \mathbf{K}^*_{\gamma^*}\boldsymbol{\Omega}_{\gamma^*}^{-1}\mathbf{K}_{\gamma^*}^{*\prime})\mathbf{y}}{\mathbf{y}'(\mathbf{I}_n - \mathbf{K}^*_\gamma\boldsymbol{\Omega}_\gamma^{-1}\mathbf{K}_\gamma^{*\prime})\mathbf{y}}.$$

We thus have

$$\operatorname*{plim}_{n\to\infty}\frac{a_\sigma+n}{2}\ln\frac{b_\sigma + \mathbf{y}'(\mathbf{I}_n - \mathbf{K}^*_{\gamma^*}\boldsymbol{\Omega}_{\gamma^*}^{-1}\mathbf{K}_{\gamma^*}^{*\prime})\mathbf{y}}{b_\sigma + \mathbf{y}'(\mathbf{I}_n - \mathbf{K}^*_\gamma\boldsymbol{\Omega}_\gamma^{-1}\mathbf{K}_\gamma^{*\prime})\mathbf{y}} \;\leq\; \operatorname*{plim}_{n\to\infty}\frac{a_\sigma+n}{2}\ln\frac{\mathbf{y}'(\mathbf{I}_n - \mathbf{K}^*_{\gamma^*}\boldsymbol{\Omega}_{\gamma^*}^{-1}\mathbf{K}_{\gamma^*}^{*\prime})\mathbf{y}}{\mathbf{y}'(\mathbf{I}_n - \mathbf{K}^*_\gamma\boldsymbol{\Omega}_\gamma^{-1}\mathbf{K}_\gamma^{*\prime})\mathbf{y}} \;\leq\; \operatorname*{plim}_{n\to\infty}\frac{a_\sigma+n}{2}\ln\frac{\mathbf{y}'(\mathbf{I}_n - \mathbf{K}^*_{\gamma^*}\boldsymbol{\Omega}_{\gamma^*}^{-1}\mathbf{K}_{\gamma^*}^{*\prime})\mathbf{y}}{\mathbf{y}'(\mathbf{I}_n - \mathbf{H}^*_\gamma)\mathbf{y}} \in (0,\infty).$$

From this result the proof follows readily.

4 Conclusions

In this paper we have presented a frequentist analysis of a Bayesian model choice procedure for sparse regression. We have captured sparsity by a particular choice of prior distribution which we have referred to as a "Silverman g-prior." This prior emerges naturally from the RKHS perspective. It is similar in spirit to the Zellner g-prior, which has been widely used for Bayesian variable selection and Bayesian model selection due to its computational tractability in the evaluation of marginal likelihoods [6, 2]. Our analysis provides a theoretical foundation for the Silverman g-prior and suggests that it can play a similarly wide-ranging role in the development of fully Bayesian kernel methods.

References

[1] C. Fernández, E. Ley, and M. F. J. Steel. Benchmark priors for Bayesian model averaging. Journal of Econometrics, 100:381–427, 2001.

[2] E. I. George and R. E. McCulloch. Approaches for Bayesian variable selection. Statistica Sinica, 7:339–374, 1997.

[3] R. E. Kass and A. E. Raftery. Bayes factors. Journal of the American Statistical Association, 90:773–795, 1995.

[4] F. Liang, R. Paulo, G. Molina, M. A. Clyde, and J. O. Berger. Mixtures of g-priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481):410–423, 2008.

[5] B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting (with discussion). Journal of the Royal Statistical Society, Series B, 47(1):1–52, 1985.

[6] M. Smith and R. Kohn. Nonparametric regression using Bayesian variable selection. Journal of Econometrics, 75:317–344, 1996.

[7] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, 1990.
[8] M. West. Bayesian factor regression models in the "large p, small n" paradigm. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West, editors, Bayesian Statistics 7, pages 723–732. Oxford University Press, 2003.

[9] A. Zellner. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In P. K. Goel and A. Zellner, editors, Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, pages 233–243. North-Holland, Amsterdam, 1986.