Sparse probabilistic projections

Cédric Archambeau
Department of Computer Science
University College London, United Kingdom
c.archambeau@cs.ucl.ac.uk

Francis R. Bach
INRIA - Willow Project
École Normale Supérieure, Paris, France
francis.bach@mines.org

Abstract

We present a generative model for performing sparse probabilistic projections, which includes sparse principal component analysis and sparse canonical correlation analysis as special cases. Sparsity is enforced by means of automatic relevance determination or by imposing appropriate prior distributions, such as generalised hyperbolic distributions. We derive a variational Expectation-Maximisation algorithm for the estimation of the hyperparameters and show that our novel probabilistic approach compares favourably to existing techniques. We illustrate how the proposed method can be applied in the context of cryptanalysis as a preprocessing tool for the construction of template attacks.

1 Introduction

Principal component analysis (PCA) is widely used for data pre-processing, data compression and dimensionality reduction. However, PCA suffers from the fact that each principal component is a linear combination of all the original variables. It is thus often difficult to interpret the results. In recent years, several methods for sparse PCA have been designed to find projections which retain maximal variance, while enforcing many entries of the projection matrix to be zero [20, 6]. While most of these methods are based on convex or partially convex relaxations of the sparse PCA problem, [16] has looked at using the probabilistic PCA framework of [18] along with $\ell_1$-regularisation.

Canonical correlation analysis (CCA) is also commonly used in the context of dimensionality reduction. The goal here is to capture features that are common to several views of the same data. Recent attempts at constructing sparse CCA include [10, 19].

In this paper, we build on the probabilistic interpretation of CCA outlined by [2] and further extended by [13]. We introduce a general probabilistic model, which allows us to infer, from an arbitrary number of views of the data, both a shared latent cause and individual low-dimensional representations of each one of them. Hence, the probabilistic reformulations of PCA and CCA fit this framework. Moreover, we are interested in sparse solutions, as these are important for interpretation purposes, denoising or feature extraction.

We consider a Bayesian approach to the problem. A proper probabilistic approach allows us to treat the trade-off between the modelling accuracy (of the high-dimensional observations by low-dimensional latent variables) and the degree of sparsity of the projection directions in a principled way. For example, we do not need to estimate the sparse components successively, using, e.g., deflation, but we can estimate all sparse directions jointly as we take the uncertainty of the latent variables into account. In order to ensure sparse solutions we propose two strategies. The first one, discussed in Appendix A, is based on automatic relevance determination (ARD) [14]. No parameter needs to be set in advance; the solution can thus be viewed as a natural sparse solution. In fact, entries in the projection matrix which are not well determined by the data are automatically driven to zero. The second approach uses priors from the generalised hyperbolic family [3], and more specifically the inverse Gamma.
In this case, the degree of sparsity can be adjusted, potentially leading to very sparse solutions if desired. For both approaches we derive a variational EM algorithm [15].

Figure 1: (a) Graphical model (see text for details). Arrows denote conditional dependencies. Shaded and unshaded nodes are respectively observed and unobserved random variables. Plates indicate repetitions. (b) Marginal prior on the individual matrix entries, shown for a = 1, 10 and 100 (b = 100).

2 Generative model

We consider the graphical model shown in Figure 1(a). For each observation, we have $P$ independent measurements $x_1, \ldots, x_P$ in different measurement spaces or views. The measurement $x_p \in \mathbb{R}^{D_p}$ is modelled as a mix of a common (or view independent) continuous latent vector $y_0 \in \mathbb{R}^{Q_0}$ and a view dependent continuous latent vector $y_p \in \mathbb{R}^{Q_p}$, such that
\[
x_p = W_p y_0 + V_p y_p + \mu_p + \epsilon_p, \qquad W_p \in \mathbb{R}^{D_p \times Q_0}, \; V_p \in \mathbb{R}^{D_p \times Q_p}, \tag{1}
\]
where $\{\mu_p\}_{p=1}^P$ are the view dependent offsets and $\epsilon_p \sim \mathcal{N}(0, \tau_p^{-1} I_{D_p})$ is the residual error in view $p$. We are interested in the case where $y_0$ and $y_p$ are low-dimensional, i.e., $Q_0, Q_p \ll D_p$ for all $p$. We impose Gaussian priors on the latent vectors:
\[
y_0 \sim \mathcal{N}(0, \Lambda_0^{-1}), \qquad y_p \sim \mathcal{N}(0, \Lambda_p^{-1}), \quad p \in \{1, \ldots, P\}. \tag{2}
\]
The resulting generative model comprises a number of popular probabilistic projection techniques as special cases. If there is a single view (and a single latent cause) and the prior covariance is diagonal, we recover probabilistic factor analysis [9]. If the prior is also isotropic, then we get probabilistic PCA [18]. If there are two views, we recover probabilistic CCA [2].

We seek a solution for which the matrices $\{W_p\}_{p=1}^P$ and $\{V_p\}_{p=1}^P$ are sparse, i.e., most of their entries are zero. One way to achieve sparsity is by means of ARD-type priors [14]. In this framework, a zero-mean Gaussian prior is imposed on the entries of the weight matrices:
\[
w_{i_p j} \sim \mathcal{N}(0, 1/\alpha_{i_p j}), \qquad i_p \in \{1, \ldots, D_p\}, \; j \in \{1, \ldots, Q_0\}, \tag{3}
\]
\[
v_{i_p k_p} \sim \mathcal{N}(0, 1/\alpha_{i_p k_p}), \qquad i_p \in \{1, \ldots, D_p\}, \; k_p \in \{1, \ldots, Q_p\}. \tag{4}
\]
Type II maximum likelihood then leads to a sparse solution when considering independent hyperparameters. The updates arising in the context of probabilistic projections are given in Appendix A. Since marginalisation with respect to both the latent vectors and the weights is intractable, we apply variational EM [15]. Unfortunately, following this route does not allow us to adjust the degree of sparsity, which is important, e.g., for interpretation purposes or for feature extraction.

Hence, we seek a more flexible approach. In the remainder of this paper, we will assume that the marginal prior on each weight $\phi_{ij}$, which is either an entry of $\{W_p\}_{p=1}^P$ or $\{V_p\}_{p=1}^P$ and will be defined shortly, has the form of an (infinite) weighted sum of scaled Gaussians:
\[
p(\phi_{ij}) = \int \mathcal{N}(\phi_{ij}; 0, \gamma_{ij}^{-1})\, p(\gamma_{ij})\, d\gamma_{ij}. \tag{5}
\]
We will choose the prior over $\gamma_{ij}$ in such a way that the resulting marginal prior over the corresponding $\phi_{ij}$ induces sparsity. A similar approach was followed in the context of sparse nonparametric Bayesian regression in [4, 5].

2.1 Compact reformulation of the generative model

Before discussing the approximate inference scheme, we rewrite the model in a more compact way. Let us denote the $n$th observation, the corresponding latent vector and the means respectively by
\[
x_n = (x_{n1}^\top, \ldots, x_{nP}^\top)^\top, \qquad z_n = (y_{n0}^\top, y_{n1}^\top, \ldots, y_{nP}^\top)^\top, \qquad \mu = (\mu_1^\top, \ldots, \mu_P^\top)^\top.
\]
The generative model can then be reformulated as follows:
\[
z_n \sim \mathcal{N}(0, \Lambda^{-1}), \qquad \Lambda \in \mathbb{R}^{Q \times Q}, \quad Q = Q_0 + \textstyle\sum_p Q_p, \tag{6}
\]
\[
\phi_{ij} \mid \gamma_{ij} \sim \mathcal{N}(0, \gamma_{ij}^{-1}), \qquad i \in \{1, \ldots, D\}, \; j \in \{1, \ldots, Q\}, \quad D = \textstyle\sum_p D_p, \tag{7}
\]
\[
x_n \mid z_n, \Phi \sim \mathcal{N}(\Phi z_n + \mu, \Psi^{-1}), \qquad \Phi \in \mathbb{R}^{D \times Q}, \; \Psi \in \mathbb{R}^{D \times D}, \tag{8}
\]
where
\[
\Phi = \begin{pmatrix} W_1 & V_1 & \cdots & 0 \\ \vdots & & \ddots & \\ W_P & 0 & \cdots & V_P \end{pmatrix}, \qquad
\Psi = \begin{pmatrix} \tau_1 I_{D_1} & \cdots & 0 \\ & \ddots & \\ 0 & \cdots & \tau_P I_{D_P} \end{pmatrix}.
\]
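To make the block structure in (6)-(8) concrete, here is a minimal sketch (in NumPy, our choice; the function names and example sizes are illustrative assumptions, not the authors' code) that assembles $\Phi$ from the per-view loadings and draws observations from the compact model.

```python
import numpy as np

def build_Phi(W, V):
    """Assemble Phi as in (8): the W_p's stacked in the first block column,
    the V_p's on the block diagonal of the remaining block columns."""
    Ds = [w.shape[0] for w in W]
    Qs = [v.shape[1] for v in V]
    Q0 = W[0].shape[1]
    Phi = np.zeros((sum(Ds), Q0 + sum(Qs)))
    row, col = 0, Q0
    for p in range(len(W)):
        Phi[row:row + Ds[p], :Q0] = W[p]              # shared loadings
        Phi[row:row + Ds[p], col:col + Qs[p]] = V[p]  # view-specific loadings
        row += Ds[p]
        col += Qs[p]
    return Phi

def sample(N, Phi, mu, tau_per_dim, Lambda_inv, rng):
    """Draw N observations from the compact model (6)-(8):
    z_n ~ N(0, Lambda^{-1}), x_n ~ N(Phi z_n + mu, Psi^{-1}),
    with Psi = diag(tau_per_dim) the per-dimension noise precisions."""
    D, Q = Phi.shape
    Z = rng.multivariate_normal(np.zeros(Q), Lambda_inv, size=N)
    noise = rng.standard_normal((N, D)) / np.sqrt(tau_per_dim)
    return Z @ Phi.T + mu + noise, Z

# Tiny usage example with two views (hypothetical sizes).
rng = np.random.default_rng(0)
W = [rng.standard_normal((5, 2)), rng.standard_normal((4, 2))]
V = [rng.standard_normal((5, 1)), rng.standard_normal((4, 1))]
Phi = build_Phi(W, V)
X, Z = sample(200, Phi, mu=np.zeros(9), tau_per_dim=np.full(9, 4.0),
              Lambda_inv=np.eye(4), rng=rng)
```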
Note that we do not assume that the latent spaces are correlated, as $\Lambda = \mathrm{diag}\{\Lambda_0, \Lambda_1, \ldots, \Lambda_P\}$. This is consistent with the fact that the common latent space is modelled independently through $y_0$. Subsequently, we will also denote the matrix of the hyperparameters by $\Gamma \in \mathbb{R}^{D \times Q}$, where we set (and fix) $\gamma_{ij} = \infty$ for all entries $\phi_{ij} = 0$ (the structural zeros of $\Phi$).

2.2 Sparsity inducing prior over the individual scaling variables

We impose an inverse Gamma prior on the scaling variable $\gamma_{ij}$:
\[
\gamma_{ij} \sim \mathcal{IG}(a/DQ, b), \tag{9}
\]
for all $i$ and $j$. The shape parameter $a$ and the scale parameter $b$ are non-negative. The marginal prior on the weight $\phi_{ij}$ is then in the class of generalised hyperbolic distributions [3] and is defined in terms of the modified Bessel function of the third kind $K_\nu(\cdot)$:
\[
p(\phi_{ij}) = \frac{2\, b^{\frac{a}{DQ}}}{\sqrt{2\pi}\, \Gamma\!\left(\frac{a}{DQ}\right)} \left(\frac{\phi_{ij}^2}{2b}\right)^{\frac{a}{2DQ} - \frac14} K_{\frac{a}{DQ} - \frac12}\!\left(\sqrt{2b\,\phi_{ij}^2}\right) \tag{10}
\]
for $\phi_{ij} \neq 0$, and
\[
\lim_{\phi_{ij} \to 0} p(\phi_{ij}) = \begin{cases} \sqrt{\frac{b}{2\pi}}\, \frac{\Gamma\left(\frac{a}{DQ} - \frac12\right)}{\Gamma\left(\frac{a}{DQ}\right)} & \text{if } \frac{a}{DQ} > \frac12, \\ \infty & \text{otherwise.} \end{cases} \tag{11}
\]
The function $\Gamma(\cdot)$ is the (complete) Gamma function. The effective prior on the individual weights is shown in Figure 1(b). Intuitively, the joint distribution over the weights is sparsity inducing as it is sharply peaked around zero (and in fact infinite for sufficiently small $a$). It favours only a small number of weights being non-zero if the scale parameter $b$ is sufficiently large. For a more formal discussion in the context of regression we refer to [7].

It is interesting to note that for $a/DQ = 1$ we recover the popular Laplace prior, which is equivalent to the $\ell_1$-regulariser or the LASSO [17], and for $a/DQ \to 0$ and $b \to 0$ the resulting prior is the Normal-Jeffreys prior. In fact, the automatic thresholding method described in Appendix A also fits into the framework defined by (5). However, it corresponds to imposing a flat prior on the scaling variables over the log-scale, which is a limiting case of the Gamma distribution. When imposing independent Gamma priors on the scaling variables, the effective joint marginal is a product of Student-t distributions, which again is sharply peaked around zero and sparsity inducing.

3 Variational approximation

We view $\{z_n\}_{n=1}^N$ and the matrix $\Gamma$ as latent variables, and optimise the parameters $\theta = \{\mu, \Phi, \Lambda, \Psi\}$ by EM. In other words, we view the weight matrix $\Phi$ as a matrix of parameters and estimate its entries by maximum a posteriori (MAP) learning. The other parameters are estimated by maximum likelihood (ML). The variational free energy is given by
\[
F_q(x_1, \ldots, x_N, \theta) = -\sum_{n=1}^N \langle \ln p(x_n, z_n, \Gamma \mid \theta) \rangle_q - H[q(z_1, \ldots, z_N, \Gamma)], \tag{12}
\]
where $\langle \cdot \rangle_q$ denotes the expectation with respect to the variational distribution $q$ and $H[\cdot]$ is the differential entropy. Since the Kullback-Leibler (KL) divergence is non-negative, the negative free energy is a lower bound on the log-marginal likelihood:
\[
\sum_{n=1}^N \ln p(x_n \mid \theta) = -F_q(\{x_n\}, \theta) + \mathrm{KL}\big[q(\{z_n\}, \Gamma) \,\|\, p(\{z_n\}, \Gamma \mid \{x_n\}, \theta)\big] \geq -F_q(\{x_n\}, \theta). \tag{13}
\]
Interestingly, it is not required to make a factorised approximation of the joint posterior $q$ to find a tractable solution. Indeed, the posterior factorises naturally given the data and the weights, such that the posteriors we will obtain in the E-step are exact. The variational EM algorithm finds maximum likelihood estimates for the parameters by cycling through the following two steps until convergence:

1. In the E-step, the posterior over the latent variables is computed for fixed parameters by minimising the KL divergence in (13). It can be shown that the variational posteriors are given by
\[
q(z_1, \ldots, z_N) \propto e^{\sum_{n=1}^N \langle \ln p(x_n, z_n, \Gamma \mid \theta) \rangle_{q(\Gamma)}}, \tag{14}
\]
\[
q(\Gamma) \propto e^{\sum_{n=1}^N \langle \ln p(x_n, z_n \mid \Gamma, \theta) \rangle_{q(z_1, \ldots, z_N)}}\, p(\Gamma). \tag{15}
\]

2. In the M-step, the variational free energy (12) is minimised with respect to the parameters for fixed $q$. This leads in effect to type II ML estimates for the parameters and is equivalent to maximising the expected complete log-likelihood:
\[
\theta \leftarrow \operatorname*{argmax}_{\theta} \sum_{n=1}^N \langle \ln p(x_n, z_n, \Gamma \mid \theta) \rangle_q. \tag{16}
\]
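The overall algorithm therefore alternates two closed-form updates while monitoring the bound. The sketch below only illustrates this control flow; e_step, m_step and free_energy are hypothetical placeholders standing in for the expressions derived in Sections 3.1-3.3, and the stopping rule is our own choice.

```python
import numpy as np

def variational_em(X, e_step, m_step, free_energy, params, max_iter=200, tol=1e-6):
    """Generic variational EM loop for the model of Section 3.

    e_step(X, params)         -> posterior statistics q (e.g. z_bar, S_bar, GIG moments)
    m_step(X, q, params)      -> updated parameters (mu, Phi, Lambda, Psi)
    free_energy(X, q, params) -> value of the lower bound -F_q in (13)

    All three callables are placeholders for the closed-form expressions in
    (17)-(23); this function only encodes the alternation and the convergence
    check on the monotonically increasing lower bound.
    """
    bound_old = -np.inf
    for it in range(max_iter):
        q = e_step(X, params)               # E-step: exact posteriors for fixed parameters
        params = m_step(X, q, params)       # M-step: maximise the expected complete log-lik.
        bound = free_energy(X, q, params)   # lower bound on the log-marginal likelihood
        assert bound >= bound_old - 1e-8, "the bound should never decrease"
        if bound - bound_old < tol:
            break
        bound_old = bound
    return params, q, bound
```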
Depending on the initialisation, the variational EM algorithm converges to a local maximum of the log-marginal likelihood. The convergence can be checked by monitoring the variational lower bound, which monotonically increases during the optimisation. The explicit expression of the variational bound is omitted here due to a lack of space.

3.1 Posterior of the latent vectors

The joint posterior of the latent vectors factorises into $N$ posteriors because the observations are independent. Hence, the posterior of each low-dimensional latent vector is given by
\[
q(z_n) = \mathcal{N}(\bar z_n, \bar S_n), \tag{17}
\]
where $\bar z_n = \bar S_n \Phi^\top \Psi (x_n - \mu)$ is the mean and $\bar S_n = (\Lambda + \Phi^\top \Psi \Phi)^{-1}$ is the covariance.

3.2 Posterior of the scaling variables

The inverse Gamma distribution is not conjugate to the exponential family. However, the posterior over the matrix $\Gamma$ is tractable. It has the form of a product of generalised inverse Gaussian distributions (see Appendix B for a formal definition):
\[
q(\Gamma) = \prod_{i=1}^D \prod_{j=1}^Q p(\gamma_{ij} \mid \phi_{ij}) = \prod_{i=1}^D \prod_{j=1}^Q \mathcal{N}^{-}(\bar\omega_{ij}, \bar\chi_{ij}, \bar\psi_{ij}), \tag{18}
\]
where $\bar\omega_{ij} = -\frac{a}{DQ} + \frac12$ is the index, and $\bar\chi_{ij} = 2b$ and $\bar\psi_{ij} = \phi_{ij}^2$ are the shape parameters. The factorised form arises from the scaling variables being independent conditioned on the weights.

3.3 Update for the parameters

Based on the properties of the Gaussian and the generalised inverse Gaussian distributions, we can compute the variational lower bound, which can then be maximised. This leads to the following updates:
\[
\mu \leftarrow \frac{1}{N} \sum_{n=1}^N (x_n - \Phi \bar z_n), \qquad \Lambda^{-1} \leftarrow \frac{1}{N} \sum_{n=1}^N \mathrm{diag}\{\bar z_n \bar z_n^\top + \bar S_n\}, \tag{19}
\]
\[
\phi_i \leftarrow \Big( \Psi(i,i) \sum_{n=1}^N \langle z_n z_n^\top \rangle + \bar\Gamma_i \Big)^{-1} \Psi(i,i) \sum_{n=1}^N (x_n(i) - \mu(i))\, \bar z_n, \tag{20}
\]
\[
\tau_p^{-1} \leftarrow \frac{1}{N D_p} \sum_{n=1}^N \Big[ (x_{np} - \mu_p)^\top (x_{np} - \mu_p) - 2 (x_{np} - \mu_p)^\top \Phi_p \bar z_n + \langle (\Phi_p z_n)^\top \Phi_p z_n \rangle \Big], \tag{21}
\]
where $\phi_i$ denotes the $i$th row of $\Phi$, $\Phi_p$ denotes the rows of $\Phi$ associated to view $p$, and the required expectations are given by
\[
\langle z_n z_n^\top \rangle = \bar S_n + \bar z_n \bar z_n^\top, \qquad \bar\Gamma_i = \mathrm{diag}\{\langle \gamma_{i1} \rangle, \ldots, \langle \gamma_{iQ} \rangle\}, \tag{22}
\]
\[
\langle (\Phi_p z_n)^\top \Phi_p z_n \rangle = \mathrm{tr}\{\langle z_n z_n^\top \rangle \Phi_p^\top \Phi_p\}, \qquad \langle \gamma_{ij} \rangle = \sqrt{\frac{\bar\chi_{ij}}{\bar\psi_{ij}}}\; \frac{K_{\bar\omega_{ij}+1}(\sqrt{\bar\chi_{ij}\bar\psi_{ij}})}{K_{\bar\omega_{ij}}(\sqrt{\bar\chi_{ij}\bar\psi_{ij}})}. \tag{23}
\]
Note that $\mathrm{diag}\{\cdot\}$ denotes a block-diagonal operation in (19). More importantly, since we are seeking a sparse projection matrix, we do not suffer from the rotational ambiguity problem as is, for example, the case for standard probabilistic PCA.

4 Experiments

4.1 Synthetic denoising experiments

Because of identifiability issues, which are the subject of ongoing work, we prefer to compare the various methods for sparse PCA in a denoising experiment. That is, we assume that the data were generated from sparse components plus some noise, and we compare the various sparse PCA methods on the denoising task, i.e., on the task of recovering the original data. We generated the data as follows: select uniformly at random $M = 4$ unit-norm sparse vectors in $P = 10$ dimensions with a known number $S = 4$ of non-zero entries, then generate i.i.d. values of the random variables $Z$ from three possible distributions (Gaussian, Laplacian, uniform), then add isotropic noise of relative standard deviation 1/2. When the latent variables are Gaussian, our model exactly matches the data and our method should provide a better fit; however, we also consider situations where the model is misspecified in order to study the robustness of our probabilistic model.
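For illustration, the synthetic data described above could be generated along the following lines; the paper does not spell out every detail (e.g. how exactly the supports are drawn or how the noise is scaled), so this is an assumed reconstruction rather than the authors' protocol.

```python
import numpy as np

def make_sparse_pca_data(N, P=10, M=4, S=4, latent="gaussian", rel_noise=0.5, rng=None):
    """Sparse components plus isotropic noise, as in the denoising experiment:
    M unit-norm sparse directions in P dimensions with S non-zero entries each,
    latent values drawn i.i.d. from a Gaussian, Laplace or uniform distribution,
    and isotropic noise of relative standard deviation rel_noise."""
    rng = rng or np.random.default_rng()
    A = np.zeros((P, M))
    for m in range(M):
        support = rng.choice(P, size=S, replace=False)   # which entries are non-zero
        vals = rng.standard_normal(S)
        A[support, m] = vals / np.linalg.norm(vals)       # unit-norm sparse direction
    if latent == "gaussian":
        Z = rng.standard_normal((N, M))
    elif latent == "laplace":
        Z = rng.laplace(size=(N, M))
    else:                                                 # "uniform"
        Z = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(N, M))
    clean = Z @ A.T
    sigma = rel_noise * clean.std()                       # noise scaled relative to the signal
    noisy = clean + sigma * rng.standard_normal((N, P))
    return noisy, clean, A
```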
We consider our two models: SCA-1 (which uses automatic relevance determination type sparsity priors) and SCA-2 (which uses the generalised hyperbolic prior), where we used 6 latent dimensions (larger than 4) and fixed hyperparameters that lead to vague priors. These two models thus have no free parameters, and we compare them to the following methods, which all have two regularisation parameters (rank and regularisation): DSPCA [6], the method of Zou et al. [20], and the recent method of [16], which essentially considers a probabilistic PCA with an $\ell_1$-penalty on the weights. In Table 1 we report the mean-squared reconstruction error averaged over 10 replications. It can be seen that the two proposed probabilistic approaches perform similarly and significantly outperform the other sparse PCA methods, even when the model is misspecified.

Table 1: Denoising experiment with sparse PCA (we report mean squared errors): (top) Gaussian distributed latent vectors, (middle) latent vectors generated from the uniform distribution, (bottom) latent vectors generated from the Laplace distribution.

  N     SCA-1   SCA-2   Zou     DSPCA   L1-PCA
  100   39.9    40.8    42.2    42.9    50.8
  200   36.5    36.8    40.8    41.4    50.4
  400   35.5    35.5    39.8    40.3    42.5

  N     SCA-1   SCA-2   Zou     DSPCA   L1-PCA
  100   39.9    40.9    42.6    43.6    49.8
  200   36.8    37.0    40.9    41.1    48.1
  400   36.4    36.4    40.5    40.7    46.8

  N     SCA-1   SCA-2   Zou     DSPCA   L1-PCA
  100   39.3    40.3    42.7    43.4    48.5
  200   36.5    36.7    40.2    40.8    46.2
  300   35.8    35.8    40.6    40.9    41.0

4.2 Template attacks

Power consumption and electromagnetic radiation are among the most extensively used side-channels for analysing physically observable cryptographic devices. A common belief is that the useful information for attacking a device is hidden at times where the traces (or time series) have large variance. Once the relevant samples have been identified, they can be used to construct templates, which can then be used to assess whether a device is secure. A simple, yet very powerful, approach recently proposed by [1] is to select time samples based on PCA.

Figure 2: Power traces and first three principal directions. (a) Probabilistic PCA. (b) Sparse probabilistic PCA (SCA-2).

Figure 2(a) shows the weight associated to each time sample by the first three principal directions found by PCA. The problem with this approach is that all time samples get a non-zero weight. As a result, the user has to define a threshold manually in order to decide whether the information leakage at time $t$ is relevant or not. Figure 2(b) shows the weights associated to the time samples by SCA-2 when using a Laplace prior (i.e., for $a/DQ = 1$). It can be observed that one gets a much better picture of where the relevant information is. Clearly, sparse probabilistic PCA can be viewed as being more robust to spurious noise and provides a more reliable and amenable solution.

5 Conclusion

In this paper we introduced a general probabilistic model for inferring sparse probabilistic projection matrices. Sparsity was enforced by either imposing an ARD-type prior or by means of a Normal-inverse Gamma prior.
Although the inverse Gamma is not conjugate to the exponential family, the posterior is tractable as it is a special case of the generalised inverse Gaussian [12], which in turn is a conjugate prior to this family. Future work will include the validation of the method on a wide range of applications, and in particular as a feature extraction tool.

Acknowledgments

We are grateful to the PASCAL European network of excellence for partially supporting this work.

A Automatic thresholding of the weights by ARD

In this section, we provide the updates for achieving automatic thresholding of the projection matrix entries in a probabilistic setting. We apply Tipping's sparse Bayesian theory [8], which is closely related to ARD [14]. More specifically, we assume the prior over the scaling variables is uniform over a log-scale, which is a limiting case of the Gamma distribution.

Let us view $\{z_n\}_{n=1}^N$ and $\Phi$ as latent variables and optimise the parameters $\theta = \{\mu, \Lambda, \Psi, \Gamma\}$ by variational EM. The variational free energy is given by
\[
F_q(x_1, \ldots, x_N, \theta) = -\sum_{n=1}^N \langle \ln p(x_n, z_n, \Phi \mid \theta) \rangle_q - H[q(z_1, \ldots, z_N, \Phi)]. \tag{24}
\]
In order to find a tractable solution, we further have to assume that the approximate posterior $q$ has a factorised form. We can then compute the posterior of the low-dimensional latent vectors:
\[
q(z_n) = \mathcal{N}(\bar z_n, \bar S_n), \tag{25}
\]
where $\bar z_n = \bar S_n \bar\Phi^\top \Psi (x_n - \mu)$ and $\bar S_n = \big(\Lambda + \sum_i \Psi(i,i)\, \bar\Sigma_i + \bar\Phi^\top \Psi \bar\Phi\big)^{-1}$. The posterior of the weights is given by
\[
q(\Phi) = \prod_{i=1}^D q(\phi_i) = \prod_{i=1}^D \mathcal{N}(\bar\phi_i, \bar\Sigma_i), \tag{26}
\]
where $\bar\phi_i = \bar\Sigma_i\, \Psi(i,i) \sum_n (x_n(i) - \mu(i))\, \bar z_n$ and $\bar\Sigma_i = \big(\Gamma_i + \Psi(i,i) \sum_n \{\bar S_n + \bar z_n \bar z_n^\top\}\big)^{-1}$, with $\Gamma_i = \mathrm{diag}\{\gamma_{i1}, \ldots, \gamma_{iQ}\}$. The partially factorised form $q(\Phi) = \prod_i q(\phi_i)$ arises naturally. Note also that the update for the mean weights has the same form as in (20).

Finally, the updates for the parameters are found by maximising the negative free energy, which corresponds to performing type II ML for the scaling variables. This yields
\[
\mu \leftarrow \frac{1}{N} \sum_{n=1}^N (x_n - \bar\Phi \bar z_n), \qquad \Lambda^{-1} \leftarrow \frac{1}{N} \sum_{n=1}^N \mathrm{diag}\big\{\bar z_n \bar z_n^\top + \bar S_n\big\}, \qquad \gamma_{ij} \leftarrow \langle \phi_{ij}^2 \rangle^{-1}, \tag{27}
\]
\[
\tau_p^{-1} \leftarrow \frac{1}{N D_p} \sum_{n=1}^N \Big[ (x_{np} - \mu_p)^\top (x_{np} - \mu_p) - 2 (x_{np} - \mu_p)^\top \bar\Phi_p \bar z_n + \langle (\Phi_p z_n)^\top \Phi_p z_n \rangle \Big], \tag{28}
\]
where $\langle \phi_{ij}^2 \rangle = \bar\phi_{ij}^2 + \bar\Sigma_i(j,j)$ and $\langle (\Phi_p z_n)^\top \Phi_p z_n \rangle = \mathrm{tr}\big\{(\bar z_n \bar z_n^\top + \bar S_n) \sum_{i_p} (\bar\phi_{i_p} \bar\phi_{i_p}^\top + \bar\Sigma_{i_p})\big\}$.

B Generalised inverse Gaussian distribution

The generalised inverse Gaussian distribution is in the class of generalised hyperbolic distributions. It is defined as follows [12, 11]:
\[
y \sim \mathcal{N}^{-}(\omega, \chi, \psi): \qquad p(y) = \frac{(\psi/\chi)^{\omega/2}}{2 K_\omega(\sqrt{\chi\psi})}\, y^{\omega - 1}\, e^{-\frac12 (\chi y^{-1} + \psi y)}, \tag{29}
\]
where $y > 0$ and $K_\omega(\cdot)$ is the modified Bessel function of the third kind¹ with index $\omega$. The following expectations are useful [12]:
\[
\langle y \rangle = \sqrt{\frac{\chi}{\psi}}\, R_\omega(\sqrt{\chi\psi}), \qquad \langle y^{-1} \rangle = \sqrt{\frac{\psi}{\chi}}\, R_{-\omega}(\sqrt{\chi\psi}), \qquad \langle \ln y \rangle = \ln \sqrt{\frac{\chi}{\psi}} + \frac{d \ln K_\omega(\sqrt{\chi\psi})}{d\omega}, \tag{30}
\]
where $R_\omega(\cdot) \equiv K_{\omega+1}(\cdot)/K_\omega(\cdot)$. The expectation $\langle \ln y \rangle$ can also be computed in closed form by noting that
\[
\frac{d \ln K_\omega(\zeta)}{d\zeta} = \frac{\frac{\omega}{\zeta} K_\omega(\zeta) - K_{\omega+1}(\zeta)}{K_\omega(\zeta)} = \frac{\omega}{\zeta} - R_\omega(\zeta). \tag{31}
\]

Inverse Gamma distribution

When $\psi = 0$ and $\omega < 0$, the generalised inverse Gaussian distribution reduces to the inverse Gamma distribution:
\[
\mathcal{IG}(a, b): \qquad p(x) = \frac{b^a}{\Gamma(a)}\, x^{-a-1} e^{-\frac{b}{x}}, \qquad a, b > 0. \tag{32}
\]
It is straightforward to verify this result by posing $a = -\omega$ and $b = \chi/2$, and noting that
\[
\lim_{y \to 0} K_\omega(y) = \Gamma(-\omega)\, 2^{-\omega - 1}\, y^{\omega} \tag{33}
\]
for $\omega < 0$.

¹ The modified Bessel function of the third kind is known under various names. In particular, it is also known as the modified Bessel function of the second kind (cf. E. W. Weisstein, "Modified Bessel Function of the Second Kind", MathWorld: http://mathworld.wolfram.com/ModifiedBesselFunctionoftheSecondKind.html).
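As a practical aside, the GIG moments in (30), and hence the expectation $\langle \gamma_{ij} \rangle$ used in the weight update (20), can be evaluated with standard special-function routines. The snippet below is a small numerical illustration using SciPy's kv (the modified Bessel function of the second kind, i.e. the same function under another name, cf. the footnote); the parameter values are arbitrary.

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second/third kind

def gig_moments(omega, chi, psi):
    """First moments of y ~ N^-(omega, chi, psi), cf. (30):
    <y>   = sqrt(chi/psi) * K_{omega+1}(s) / K_{omega}(s)
    <1/y> = sqrt(psi/chi) * K_{-omega+1}(s) / K_{-omega}(s)
    with s = sqrt(chi * psi) (recall K_{-v} = K_v)."""
    s = np.sqrt(chi * psi)
    mean = np.sqrt(chi / psi) * kv(omega + 1.0, s) / kv(omega, s)
    inv_mean = np.sqrt(psi / chi) * kv(-omega + 1.0, s) / kv(-omega, s)
    return mean, inv_mean

# Posterior moment <gamma_ij> of Section 3.2 for an illustrative weight value:
a_over_DQ, b, phi_ij = 1.0, 1.0, 0.3            # a/DQ = 1 corresponds to the Laplace prior
omega = 0.5 - a_over_DQ                         # index of the GIG posterior, cf. (18)
chi, psi = 2.0 * b, phi_ij ** 2                 # shape parameters, cf. (18)
print(gig_moments(omega, chi, psi)[0])          # <gamma_ij> entering the update (20)
```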
References

[1] C. Archambeau, E. Peeters, F.-X. Standaert, and J.-J. Quisquater. Template attacks in principal subspaces. In L. Goubin and M. Matsui, editors, 8th International Workshop on Cryptographic Hardware and Embedded Systems (CHES), volume 4249 of Lecture Notes in Computer Science, pages 1-14. Springer, 2006.
[2] F. Bach and M. I. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley, 2005.
[3] O. Barndorff-Nielsen and R. Stelzer. Absolute moments of generalized hyperbolic distributions and approximate scaling of normal inverse Gaussian Lévy processes. Scandinavian Journal of Statistics, 32(4):617-637, 2005.
[4] P. J. Brown and J. E. Griffin. Bayesian adaptive lassos with non-convex penalization. Technical Report CRiSM 07-02, Department of Statistics, University of Warwick, 2007.
[5] F. Caron and A. Doucet. Sparse Bayesian nonparametric regression. In 25th International Conference on Machine Learning (ICML). ACM, 2008.
[6] A. d'Aspremont, E. L. Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-48, 2007.
[7] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348-1360, 2001.
[8] A. C. Faul and M. E. Tipping. Analysis of sparse Bayesian learning. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14 (NIPS), pages 383-389. The MIT Press, 2002.
[9] Z. Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, Department of Computer Science, University of Toronto, 1996.
[10] D. Hardoon and J. Shawe-Taylor. Sparse canonical correlation analysis. Technical report, PASCAL EPrints, 2007.
[11] W. Hu. Calibration of multivariate generalized hyperbolic distributions using the EM algorithm, with applications in risk management, portfolio optimization and portfolio credit risk. PhD thesis, Florida State University, United States of America, 2005.
[12] B. Jørgensen. Statistical Properties of the Generalized Inverse Gaussian Distribution. Springer-Verlag, 1982.
[13] A. Klami and S. Kaski. Local dependent components. In Z. Ghahramani, editor, 24th International Conference on Machine Learning (ICML), pages 425-432. Omnipress, 2007.
[14] D. J. C. MacKay. Bayesian methods for backprop networks. In E. Domany, J. L. van Hemmen, and K. Schulten, editors, Models of Neural Networks, III, pages 211-254. 1994.
[15] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355-368. The MIT Press, 1998.
[16] C. D. Sigg and J. M. Buhmann. Expectation-maximization for sparse and non-negative PCA. In 25th International Conference on Machine Learning (ICML). ACM, 2008.
[17] R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society B, 58:267-288, 1996.
[18] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society B, 61:611-622, 1999.
[19] D. Torres, D. Turnbull, B. K. Sriperumbudur, L. Barrington, and G. Lanckriet. Finding musically meaningful words using sparse CCA. In NIPS Workshop on Music, Brain and Cognition, 2007.
[20] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.