Implicit Regularization in Variational Bayesian Matrix Factorization

Shinichi Nakajima
Nikon Corporation, Tokyo 140-8601, Japan
nakajima.s@nikon.co.jp

Masashi Sugiyama
Tokyo Institute of Technology and JST PRESTO, Tokyo 152-8552, Japan
sugi@cs.titech.ac.jp

Abstract

Matrix factorization into the product of low-rank matrices induces non-identifiability, i.e., the mapping between the target matrix and the factorized matrices is not one-to-one. In this paper, we theoretically investigate the influence of non-identifiability on Bayesian matrix factorization. More specifically, we show that a variational Bayesian method involves a regularization effect even when the prior is non-informative, which is intrinsically different from the maximum a posteriori approach. We also extend our analysis to empirical Bayes scenarios where hyperparameters are also learned from data.

1. Introduction

The goal of matrix factorization (MF) is to find a low-rank expression of a target matrix. MF has been used for learning linear relations between vectors, such as reduced-rank regression (Baldi & Hornik, 1995; Reinsel & Velu, 1998), canonical correlation analysis (Rao, 1965; Anderson, 1984), and partial least-squares (Rosipal & Krämer, 2006). More recently, MF has been applied to collaborative filtering (CF) in the context of recommender systems (Konstan et al., 1997; Funk, 2006) and to microarray data analysis (Baldi & Brunak, 1998). For this reason, MF has attracted considerable attention these days.

Recently, the variational Bayesian (VB) approach (Attias, 1999) has been applied to MF (Lim & Teh, 2007; Raiko et al., 2007). The VBMF method was shown to perform very well in experiments. However, its good performance was not completely understood beyond its experimental success. The purpose of this paper is to provide new insight into Bayesian MF.

A key characteristic of MF models is non-identifiability (Watanabe, 2009), where the mapping between parameters and functions is not one-to-one; in the context of MF, the mapping between the target matrix and the factorized matrices is not one-to-one. Previous theoretical studies on non-identifiable models showed that, when combined with full-Bayesian (FB) estimation, the regularization effect is significantly stronger than with the MAP method (Watanabe, 2001; Yamazaki & Watanabe, 2003). Since a single point in the function space corresponds to a set of points in the (redundant) parameter space of a non-identifiable model, simple distributions such as the Gaussian distribution in the function space produce highly complicated multimodal distributions in the parameter space. This causes the MAP and FB solutions to be significantly different. Thus the behavior of non-identifiable models is substantially different from that of identifiable models.

Theoretical properties of VB have been investigated for Gaussian mixture models (Watanabe & Watanabe, 2006) and linear neural networks (Nakajima & Watanabe, 2007). In this paper, we extend these results and investigate the behavior of the VBMF estimator. More specifically, we show that VBMF consists of two shrinkage factors, the positive-part James-Stein (PJS) shrinkage (James & Stein, 1961; Efron & Morris, 1973) and the trace-norm shrinkage (Srebro et al., 2005), operating on each singular component separately to produce low-rank solutions. The trace-norm shrinkage is simply induced by non-flat prior information, as in the MAP approach (Salakhutdinov & Mnih, 2008); thus, no trace-norm shrinkage remains when the priors are non-informative. On the other hand, we show the counter-intuitive fact that the PJS shrinkage factor still remains even with uniform priors. This allows the VBMF method to avoid overfitting (or, in some cases, this might cause underfitting) even when non-informative priors are provided.

We further extend the above analysis to empirical VBMF (EVBMF) scenarios, where the hyperparameters in the prior distributions are also learned based on the VB free energy. We derive bounds of the EVBMF estimator, and show that the effect of PJS shrinkage is at least doubled compared with the uniform-prior case.
[Figure 1: the matrix factorization model, illustrating $U = BA^\top$ with $A \in \mathbb{R}^{M\times H}$ and $B \in \mathbb{R}^{L\times H}$.]
Figure 1. Matrix factorization model ($H \le L \le M$).

2. Bayesian Approaches to Matrix Factorization

In this section, we give a probabilistic formulation of the matrix factorization (MF) problem and review its Bayesian methods.

2.1. Formulation

The goal of the MF problem is to estimate a target matrix $U\ (\in \mathbb{R}^{L\times M})$ from its $n$ observations $V^n = \{V^{(i)} \in \mathbb{R}^{L\times M} \mid i = 1, \ldots, n\}$. Throughout the paper, we assume $L \le M$. If $L > M$, we may simply re-define the transpose $U^\top$ as $U$ so that $L \le M$ holds; thus this does not impose any restriction.

A key assumption of MF is that $U$ is a low-rank matrix. Let $H\ (\le L)$ be the rank of $U$. Then the matrix $U$ can be decomposed into the product of $A = (a_1, a_2, \ldots, a_H) \in \mathbb{R}^{M\times H}$ and $B = (b_1, b_2, \ldots, b_H) \in \mathbb{R}^{L\times H}$ as follows (see Figure 1):

$$U = BA^\top.$$

Assume that the observed matrix $V$ is subject to the additive-noise model $V = U + \mathcal{E}$, where $\mathcal{E}\ (\in \mathbb{R}^{L\times M})$ is a noise matrix. Each entry of $\mathcal{E}$ is assumed to independently follow the Gaussian distribution with mean zero and variance $\sigma^2$. Then the likelihood $p(V^n|A, B)$ is given by

$$p(V^n|A, B) \propto \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n \|V^{(i)} - BA^\top\|_{\mathrm{Fro}}^2\right), \qquad (1)$$

where $\|\cdot\|_{\mathrm{Fro}}$ denotes the Frobenius norm of a matrix.

2.2. Bayesian Matrix Factorization

We use Gaussian priors on the parameters $A$ and $B$:

$$\varphi_U(U) = \varphi_A(A)\,\varphi_B(B),$$
$$\varphi_A(A) \propto \exp\left(-\sum_{h=1}^H \|a_h\|^2/(2c_{a_h}^2)\right), \qquad \varphi_B(B) \propto \exp\left(-\sum_{h=1}^H \|b_h\|^2/(2c_{b_h}^2)\right).$$

Here $c_{a_h}^2$ and $c_{b_h}^2$ are hyperparameters corresponding to the prior variances of those vectors. Without loss of generality, we assume that the product $c_{a_h} c_{b_h}$ is non-increasing with respect to $h$.

The Bayes posterior $p(A, B|V^n)$ can be written as

$$p(A, B|V^n) = \frac{p(V^n|A, B)\,\varphi_A(A)\,\varphi_B(B)}{\langle p(V^n|A, B)\rangle_{\varphi_A(A)\varphi_B(B)}}, \qquad (2)$$

where $\langle\cdot\rangle_p$ denotes the expectation over $p$. The full-Bayesian (FB) solution is given by the Bayes posterior mean:

$$\hat{U}^{\mathrm{FB}} = \langle BA^\top \rangle_{p(A,B|V^n)}. \qquad (3)$$

The hyperparameters $c_{a_h}$ and $c_{b_h}$ may be determined so that the Bayes free energy $F(V^n)$ is minimized:

$$F(V^n) = -\log\langle p(V^n|A, B)\rangle_{\varphi_A(A)\varphi_B(B)}. \qquad (4)$$

This method is called the empirical Bayes method (Bishop, 2006).

2.3. Maximum A Posteriori Matrix Factorization (MAPMF)

When computing the Bayes posterior (2), the expectation in the denominator of Eq.(2) is often intractable due to the high dimensionality of the parameters $A$ and $B$. A simple approach to mitigating this problem is to use the maximum a posteriori (MAP) approximation. The MAP solution $\hat{U}^{\mathrm{MAP}}$ is given by $\hat{U}^{\mathrm{MAP}} = \hat{B}^{\mathrm{MAP}} (\hat{A}^{\mathrm{MAP}})^\top$, where

$$(\hat{A}^{\mathrm{MAP}}, \hat{B}^{\mathrm{MAP}}) = \mathop{\mathrm{argmax}}_{A,B}\, p(A, B|V^n).$$
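To make the formulation of Sections 2.1-2.3 concrete, here is a minimal NumPy sketch (ours, not from the paper): it samples $A$ and $B$ from the Gaussian priors, generates observations according to the likelihood (1), and evaluates the unnormalized log-posterior of Eq.(2) that MAPMF maximizes. The problem sizes and all variable names are illustrative assumptions.

```python
# Minimal sketch (not from the paper): sample from the MF model of Section 2.1
# and evaluate the unnormalized log-posterior of Eq.(2). All names are ours.
import numpy as np

rng = np.random.default_rng(0)
L, M, H, n, sigma2 = 5, 8, 2, 3, 0.1
c_a, c_b = np.ones(H), np.ones(H)            # prior scales c_{a_h}, c_{b_h}

# True factors and noisy observations V^(i) = B A^T + E^(i)
A_true = rng.normal(0, c_a, size=(M, H))
B_true = rng.normal(0, c_b, size=(L, H))
V = [B_true @ A_true.T + rng.normal(0, np.sqrt(sigma2), size=(L, M))
     for _ in range(n)]

def log_posterior_unnorm(A, B):
    """log p(V^n|A,B) + log phi_A(A) + log phi_B(B), up to an additive constant."""
    ll = -sum(np.sum((Vi - B @ A.T) ** 2) for Vi in V) / (2 * sigma2)   # Eq.(1)
    lp = -np.sum(A ** 2 / (2 * c_a ** 2)) - np.sum(B ** 2 / (2 * c_b ** 2))
    return ll + lp

print(log_posterior_unnorm(A_true, B_true))
```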
2.4. Variational Bayesian Matrix Factorization (VBMF)

Another approach to avoiding the computational intractability of the FB method is to use the VB approximation (Attias, 1999; Bishop, 2006). Here, we review the VBMF method proposed by Lim and Teh (2007) and Raiko et al. (2007).

For a trial distribution

$$r(A, B|V^n) = \prod_{h=1}^H r_{a_h}(a_h|V^n)\, r_{b_h}(b_h|V^n),$$

the VB free energy is defined as

$$F(r|V^n) = \left\langle \log\frac{r(A, B|V^n)}{p(V^n, A, B)} \right\rangle_{r(A,B|V^n)}.$$

The VB approach minimizes the VB free energy with respect to the trial distribution $r(A, B|V^n)$. The resulting distribution is called the VB posterior. The VB solution $\hat{U}^{\mathrm{VB}}$ is given by the VB posterior mean:

$$\hat{U}^{\mathrm{VB}} = \langle BA^\top\rangle_{r(A,B|V^n)}.$$

Applying the variational method to the VB free energy shows that the VB posterior satisfies

$$r_{a_h}(a_h|V^n) \propto \varphi_{a_h}(a_h)\exp\left(\langle \log p(V^n|A,B)\rangle_{r(\backslash a_h|V^n)}\right),$$
$$r_{b_h}(b_h|V^n) \propto \varphi_{b_h}(b_h)\exp\left(\langle \log p(V^n|A,B)\rangle_{r(\backslash b_h|V^n)}\right),$$

where $r(\backslash a_h|V^n)$ denotes the VB posterior except $a_h$.

2.5. Empirical Variational Bayesian Matrix Factorization (EVBMF)

The VB free energy also allows us to determine the hyperparameters $c_{a_h}^2$ and $c_{b_h}^2$ in a computationally tractable way. That is, instead of the Bayes free energy $F(V^n)$, the VB free energy $F(r|V^n)$ is minimized with respect to $c_{a_h}^2$ and $c_{b_h}^2$. We call this method empirical VBMF (EVBMF).

3. Analysis of MAPMF, VBMF, and EVBMF

In this section, we theoretically analyze the behavior of the MAPMF, VBMF, and EVBMF solutions. More specifically, we derive an analytic-form expression of the MAPMF solution (Section 3.1) and semi-analytic expressions of the VBMF solution (Section 3.2) and the EVBMF solution (Section 3.3), and we elucidate their regularization mechanisms.

3.1. MAPMF

Let $\gamma_h\ (\ge 0)$ be the $h$-th largest singular value of

$$V = \frac{1}{n}\sum_{i=1}^n V^{(i)},$$

and let $\omega_{a_h}$ and $\omega_{b_h}$ be the associated right and left singular vectors:

$$V = \sum_{h=1}^L \gamma_h\, \omega_{b_h}\omega_{a_h}^\top.$$

Then we have the following theorem.

Theorem 1  The MAP estimator $\hat{U}^{\mathrm{MAP}}$ is given by

$$\hat{U}^{\mathrm{MAP}} = \sum_{h=1}^H \hat{\gamma}_h^{\mathrm{MAP}}\, \omega_{b_h}\omega_{a_h}^\top,$$

where, for $[c]_+ = \max(0, c)$,

$$\hat{\gamma}_h^{\mathrm{MAP}} = \left[\gamma_h - \frac{\sigma^2}{n c_{a_h} c_{b_h}}\right]_+. \qquad (5)$$

The theorem implies that the MAP solution cuts off the singular values smaller than $\sigma^2/(n c_{a_h} c_{b_h})$; otherwise it reduces the singular values by $\sigma^2/(n c_{a_h} c_{b_h})$ (see Figure 2). This shrinkage effect allows the MAPMF method to avoid overfitting.

Similarly to Theorem 1, we can show that the maximum likelihood (ML) estimator is given by

$$\hat{U}^{\mathrm{ML}} = \sum_{h=1}^H \hat{\gamma}_h^{\mathrm{ML}}\, \omega_{b_h}\omega_{a_h}^\top, \qquad (6)$$

where $\hat{\gamma}_h^{\mathrm{ML}} = \gamma_h$ for all $h$. Thus the ML solution is reduced to $V$ when $H = L$ (see Figure 2):

$$\hat{U}^{\mathrm{ML}} = \sum_{h=1}^L \hat{\gamma}_h^{\mathrm{ML}}\, \omega_{b_h}\omega_{a_h}^\top = V.$$

[Figure 2: singular-value shrinkage curves of ML, MAP, and the VB upper and lower bounds, plotted against the component index $h$.]
Figure 2. Shrinkage of the ML estimator (6), the MAP estimator (5), and the VB estimator (8) when $n = 1$, $\sigma^2 = 0.1$, $c_{a_h} c_{b_h} = 0.1$, $L = 100$, and $M = 200$.
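The estimators in Theorem 1 and Eq.(6) are simple functions of the SVD of $V$. The following sketch (ours, with arbitrary toy sizes and random stand-in observations) applies the soft-thresholding of Eq.(5) to the singular values of $V = \frac{1}{n}\sum_i V^{(i)}$.

```python
# Minimal sketch of the MAP/ML shrinkage in Theorem 1 (assumed setup, not the
# authors' code). gamma_map implements Eq.(5); the ML estimator keeps gamma as is.
import numpy as np

rng = np.random.default_rng(1)
L, M, H, n, sigma2 = 20, 30, 5, 4, 0.5
c_a = c_b = np.full(H, 0.3)                      # c_{a_h}, c_{b_h}

Vs = [rng.normal(size=(L, M)) for _ in range(n)] # stand-in observations
Vbar = sum(Vs) / n
w_b, gamma, w_a_T = np.linalg.svd(Vbar, full_matrices=False)

thresh = sigma2 / (n * c_a * c_b)                # sigma^2 / (n c_{a_h} c_{b_h})
gamma_map = np.maximum(gamma[:H] - thresh, 0.0)  # Eq.(5): [gamma_h - thresh]_+
gamma_ml = gamma[:H]                             # Eq.(6)

U_map = (w_b[:, :H] * gamma_map) @ w_a_T[:H, :]
U_ml = (w_b[:, :H] * gamma_ml) @ w_a_T[:H, :]
print(gamma_ml, gamma_map)                       # MAP uniformly shrinks or cuts off
```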
A parametric model is said to be identifiable if the mapping between parameters and functions is one-to-one; otherwise the model is said to be non-identifiable (Watanabe, 2001). Since the decomposition $U = BA^\top$ is redundant, the MF model is non-identifiable. For identifiable models, the MAP estimator with the uniform prior is reduced to the ML estimator (Bishop, 2006). On the other hand, the MF model is non-identifiable because of the redundancy of the decomposition $U = BA^\top$. This implies that a single point in the space of $U$ corresponds to a set of points in the joint space of $A$ and $B$. For this reason, uniform priors on $A$ and $B$ do not produce a uniform prior on $U$. Nevertheless, Eqs.(5) and (6) imply that MAP is reduced to ML when the priors on $A$ and $B$ are uniform (i.e., $c_{a_h}, c_{b_h} \to \infty$).

More precisely, Eqs.(5) and (6) show that $c_{a_h} c_{b_h} \to \infty$ is sufficient for MAP to be reduced to ML, which is weaker than $c_{a_h}, c_{b_h} \to \infty$. This implies that the priors on $A$ and $B$ do not both have to be uniform; the condition that one of the priors is uniform is sufficient for MAP to be reduced to ML in the MF model. This phenomenon is distinctively different from the case of identifiable models.

When the prior is uniform and the likelihood is Gaussian, the posterior is Gaussian. Thus the mean and the mode of the posterior agree with each other due to the symmetry of the Gaussian density. For identifiable models, this fact implies that the FB and MAP solutions agree with each other. However, the FB and MAP solutions are generally different in non-identifiable models, since the symmetry of the Gaussian density in the space of $U$ is no longer kept in the joint space of $A$ and $B$.

In Section 4.1, we further investigate these distinctive features of the MF model using illustrative examples.

3.2. VBMF

Next, let us analyze the behavior of the VBMF estimator. We have the following theorem.

Theorem 2  $\hat{U}^{\mathrm{VB}}$ is expressed as

$$\hat{U}^{\mathrm{VB}} = \sum_{h=1}^H \hat{\gamma}_h^{\mathrm{VB}}\, \omega_{b_h}\omega_{a_h}^\top. \qquad (7)$$

When $\gamma_h \le \sqrt{M\sigma^2/n}$, $\hat{\gamma}_h^{\mathrm{VB}} = 0$. When $\gamma_h > \sqrt{M\sigma^2/n}$, $\hat{\gamma}_h^{\mathrm{VB}}$ is bounded as

$$\left[\left(1 - \frac{M\sigma^2}{n\gamma_h^2}\right)\gamma_h - \frac{\sqrt{M/L}\,\sigma^2}{n c_{a_h} c_{b_h}}\right]_+ \le \hat{\gamma}_h^{\mathrm{VB}} < \left(1 - \frac{M\sigma^2}{n\gamma_h^2}\right)\gamma_h. \qquad (8)$$

The upper- and lower-bounds in Eq.(8) are illustrated in Figure 2. In the limit when $c_{a_h} c_{b_h} \to \infty$, the lower-bound agrees with the upper-bound, and we have

$$\lim_{c_{a_h} c_{b_h}\to\infty} \hat{\gamma}_h^{\mathrm{VB}} = \left[\left(1 - \frac{M\sigma^2}{n\gamma_h^2}\right)\gamma_h\right]_+ \qquad (9)$$

if $\gamma_h > \sqrt{M\sigma^2/n}$; otherwise $\hat{\gamma}_h^{\mathrm{VB}} = 0$.

This is the same form as the positive-part James-Stein (PJS) shrinkage estimator (James & Stein, 1961; Efron & Morris, 1973). The factor $M\sigma^2/n$ is the expected contribution of the noise to $\gamma_h^2$: when the target matrix is $U = 0$, the expectation of $\gamma_h^2$ over all $h$ is given by $M\sigma^2/n$. When $\gamma_h^2 \le M\sigma^2/n$, Eq.(9) implies that $\hat{\gamma}_h^{\mathrm{VB}} = 0$. Thus, the PJS estimator cuts off the singular components dominated by noise. As $\gamma_h^2$ increases, the PJS shrinkage factor $M\sigma^2/(n\gamma_h^2)$ tends to 0, and thus the estimated singular value $\hat{\gamma}_h^{\mathrm{VB}}$ becomes close to the original singular value $\gamma_h$.

Let us compare the behavior of the VB solution (9) with that of the MAP solution (5) when $c_{a_h} c_{b_h} \to \infty$. In this case, the MAP solution merely results in the ML solution, where no regularization is incorporated. In contrast, VB offers PJS-type regularization even when $c_{a_h} c_{b_h} \to \infty$; thus VB can still mitigate overfitting. This fact is in good agreement with the experimental results reported in Raiko et al. (2007), where no overfitting is observed when $c_{a_h}^2 = 1$ and $c_{b_h}^2$ is set to large values. This counter-intuitive fact stems again from the non-identifiability of the MF model: the Gaussian noise $\mathcal{E}$ imposed in the space of $U$ possesses a very complex surface in the joint space of $A$ and $B$, in particular a multimodal structure. This causes the MAP solution to be distinctively different from the VB solution. In Section 4.2, we investigate the above phenomena in more detail using illustrative examples.

We can derive another upper-bound of $\hat{\gamma}_h^{\mathrm{VB}}$, which depends on the hyperparameters $c_{a_h}$ and $c_{b_h}$.

Theorem 3  When $\gamma_h > \sqrt{M\sigma^2/n}$, $\hat{\gamma}_h^{\mathrm{VB}}$ is upper-bounded as

$$\hat{\gamma}_h^{\mathrm{VB}} \le \left[\sqrt{\left(1 - \frac{M\sigma^2}{n\gamma_h^2}\right)\left(1 - \frac{L\sigma^2}{n\gamma_h^2}\right)}\cdot\gamma_h - \frac{\sigma^2}{n c_{a_h} c_{b_h}}\right]_+.$$

When $L = M$ and $\gamma_h > \sqrt{M\sigma^2/n}$, the above upper-bound agrees with the lower-bound in Eq.(8), and thus we have

$$\hat{\gamma}_h^{\mathrm{VB}} = \left[\left(1 - \frac{M\sigma^2}{n\gamma_h^2}\right)\gamma_h - \frac{\sigma^2}{n c_{a_h} c_{b_h}}\right]_+ \qquad (10)$$

if $\gamma_h > \sqrt{M\sigma^2/n}$; otherwise $\hat{\gamma}_h^{\mathrm{VB}} = 0$. Then the complete VB posterior can be obtained analytically.

Corollary 1  When $L = M$, the VB posteriors are given by

$$r_A(A|V^n) = \prod_{h=1}^H N_M(a_h;\, \mu_{a_h}, \Sigma_{a_h}), \qquad r_B(B|V^n) = \prod_{h=1}^H N_L(b_h;\, \mu_{b_h}, \Sigma_{b_h}),$$

where

$$\mu_{a_h} = \pm\sqrt{\frac{c_{a_h}}{c_{b_h}}\hat{\gamma}_h^{\mathrm{VB}}}\;\omega_{a_h}, \qquad \Sigma_{a_h} = \frac{c_{a_h}\zeta_h^{\mathrm{VB}}}{2M c_{b_h}} I_M,$$
$$\mu_{b_h} = \pm\sqrt{\frac{c_{b_h}}{c_{a_h}}\hat{\gamma}_h^{\mathrm{VB}}}\;\omega_{b_h}, \qquad \Sigma_{b_h} = \frac{c_{b_h}\zeta_h^{\mathrm{VB}}}{2M c_{a_h}} I_M,$$
$$\zeta_h^{\mathrm{VB}} = \sqrt{\left(\hat{\gamma}_h^{\mathrm{VB}} + \frac{\sigma^2}{n c_{a_h} c_{b_h}}\right)^2 + \frac{4M\sigma^2}{n}} - \left(\hat{\gamma}_h^{\mathrm{VB}} + \frac{\sigma^2}{n c_{a_h} c_{b_h}}\right),$$

where $N_d(\cdot;\mu,\Sigma)$ denotes the $d$-dimensional Gaussian density with mean $\mu$ and covariance matrix $\Sigma$, and $\hat{\gamma}_h^{\mathrm{VB}}$ is given by Eq.(10).
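For a quick numerical comparison of the shrinkage rules derived above, the sketch below (ours; it assumes $L = M$ so that the closed form (10) applies, and uses toy parameter values) evaluates the MAP rule (5) and the VB rule (10). With a very large $c_{a_h} c_{b_h}$ the MAP rule barely shrinks, while VB still applies the PJS factor of Eq.(9).

```python
# Minimal sketch (not the authors' code): per-component shrinkage rules of
# Section 3 under the L = M assumption of Eq.(10). All numbers are toy choices.
import numpy as np

def gamma_map(gamma, sigma2, n, ca_cb):
    """MAP shrinkage, Eq.(5): soft-threshold by sigma^2 / (n c_a c_b)."""
    return max(gamma - sigma2 / (n * ca_cb), 0.0)

def gamma_vb(gamma, sigma2, n, ca_cb, M):
    """VB shrinkage for L = M, Eq.(10): PJS factor plus a trace-norm term."""
    if gamma <= np.sqrt(M * sigma2 / n):          # noise-dominated component
        return 0.0
    return max((1.0 - M * sigma2 / (n * gamma**2)) * gamma
               - sigma2 / (n * ca_cb), 0.0)

n, sigma2, M, ca_cb = 1, 0.1, 100, 1e6            # huge c_a c_b ~ almost flat priors
for gamma in (1.0, 3.0, 5.0, 10.0):
    # With flat priors MAP is essentially ML (no shrinkage), while VB still
    # applies the PJS factor (1 - M sigma^2 / (n gamma^2)) of Eq.(9).
    print(f"gamma={gamma:4.1f}  MAP={gamma_map(gamma, sigma2, n, ca_cb):6.3f}  "
          f"VB={gamma_vb(gamma, sigma2, n, ca_cb, M):6.3f}")
```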
3.3. EVBMF

Finally, we analyze the behavior of EVBMF, where the hyperparameters $c_{a_h}$ and $c_{b_h}$ are also estimated from data. We have the following theorem.

Theorem 4  The EVB estimator is given by the following form:

$$\hat{U}^{\mathrm{EVB}} = \sum_{h=1}^H \hat{\gamma}_h^{\mathrm{EVB}}\, \omega_{b_h}\omega_{a_h}^\top. \qquad (11)$$

When $\gamma_h < (\sqrt{L} + \sqrt{M})\sigma/\sqrt{n}$, $\hat{\gamma}_h^{\mathrm{EVB}} = 0$. When $\gamma_h \ge (\sqrt{L} + \sqrt{M})\sigma/\sqrt{n}$, $\hat{\gamma}_h^{\mathrm{EVB}}$ is upper-bounded as

$$\hat{\gamma}_h^{\mathrm{EVB}} < \left(1 - \frac{M\sigma^2}{n\gamma_h^2}\right)\gamma_h. \qquad (12)$$

When $\gamma_h \ge \sqrt{7M}\,\sigma/\sqrt{n}\ \left(> (\sqrt{L} + \sqrt{M})\sigma/\sqrt{n}\right)$, $\hat{\gamma}_h^{\mathrm{EVB}}$ is lower-bounded as

$$\hat{\gamma}_h^{\mathrm{EVB}} > \left[1 - \frac{2M\sigma^2}{n\gamma_h^2 - \sigma^2\left(L + M + \sqrt{LM}\right)}\right]_+ \gamma_h. \qquad (13)$$

Note that the inequality in Eq.(13) is strict.

As pointed out by Raiko et al. (2007), if $c_{a_h}$, $c_{b_h}$, $A$, and $B$ are all estimated so that the Bayes posterior is maximized (i.e., 'empirical MAP'; EMAP), the solution is trivial (i.e., $\hat{U}^{\mathrm{EMAP}} = 0$) irrespective of the observation. In contrast, Theorem 4 implies that EVB gives a non-trivial solution (i.e., $\hat{\gamma}_h^{\mathrm{EVB}} > 0$) when $\gamma_h \ge \sqrt{7M}\,\sigma/\sqrt{n}$. It is also noteworthy that the upper-bound in Eq.(12) is the same as that in Eq.(8). Thus, even when the hyperparameters $c_{a_h}$ and $c_{b_h}$ are learned from data, the same upper-bound as in the fixed-hyperparameter case holds.

Another upper-bound of $\hat{\gamma}_h^{\mathrm{EVB}}$ is given as follows.

Theorem 5  When $\gamma_h \ge (\sqrt{L} + \sqrt{M})\sigma/\sqrt{n}$, $\hat{\gamma}_h^{\mathrm{EVB}}$ is upper-bounded as

$$\hat{\gamma}_h^{\mathrm{EVB}} < \sqrt{\left(1 - \frac{L\sigma^2}{n\gamma_h^2}\right)\left(1 - \frac{M\sigma^2}{n\gamma_h^2}\right)}\,\gamma_h - \frac{\sqrt{LM}\,\sigma^2}{n\gamma_h}.$$

When $L = M$, the above upper-bound is sharper than that in Eq.(12), resulting in

$$\hat{\gamma}_h^{\mathrm{EVB}} < \left(1 - \frac{2M\sigma^2}{n\gamma_h^2}\right)\gamma_h. \qquad (14)$$

Thus, the PJS shrinkage factor of the upper-bound (14) of EVBMF is $2M\sigma^2/(n\gamma_h^2)$. On the other hand, as shown in Eq.(9), the PJS shrinkage factor of plain VBMF with uniform priors on $A$ and $B$ (i.e., $c_{a_h}, c_{b_h} \to \infty$) is $M\sigma^2/(n\gamma_h^2)$, which is less than a half of that of EVBMF. Thus, EVBMF provides a substantially stronger regularization effect than plain VBMF with uniform priors. Furthermore, from Eq.(10), we can confirm that the upper-bound (14) is equivalent to the VB solution when $c_{a_h} c_{b_h} = \gamma_h/M$.
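The 'at least doubled' shrinkage can be seen numerically. The short sketch below (ours, with toy numbers and $L = M$) tabulates the flat-prior VB value (9) next to the EVB upper bound (14) for singular values above the $\sqrt{7M}\,\sigma/\sqrt{n}$ threshold; it only evaluates the stated bounds, not the exact EVB solution.

```python
# Minimal sketch (not the authors' code): the PJS shrinkage factor is
# M*sigma^2/(n*gamma^2) for flat-prior VB (Eq. 9), but at least
# 2*M*sigma^2/(n*gamma^2) for EVB (upper bound (14), L = M). Toy numbers.
import numpy as np

n, sigma2 = 1, 0.1
L = M = 100
thr_evb = np.sqrt(7 * M * sigma2 / n)           # EVB is provably non-trivial above this
for gamma in (9.0, 12.0, 20.0, 40.0):
    assert gamma >= thr_evb
    x = M * sigma2 / (n * gamma**2)             # PJS shrinkage factor of flat-prior VB
    vb_flat = (1 - x) * gamma                   # Eq.(9) (gamma above sqrt(M*sigma2/n))
    evb_upper = (1 - 2 * x) * gamma             # Eq.(14): EVB shrinks at least twice as much
    print(f"gamma={gamma:5.1f}  VB(flat)={vb_flat:6.2f}  EVB < {evb_upper:6.2f}")
```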
4. Illustration of Influence of Non-identifiability

In order to understand the regularization mechanisms of MAPMF, VBMF, and EVBMF more intuitively, let us illustrate the influence of non-identifiability when $L = M = H = 1$ (i.e., $U$, $V$, $A$, and $B$ are merely scalars). In this case, any $A$ and $B$ such that their product is unchanged form an equivalence class and give the same value of $U$ (see Figure 3). When $U = 0$, the equivalence class is a cross shape on the $A$- and $B$-axes; otherwise, it forms a hyperbolic curve.

[Figure 3: equivalence classes in the $(A, B)$-plane, i.e., the curves $BA = U$ for $U = -2, -1, 0, 1, 2$.]
Figure 3. Equivalence class. Any $A$ and $B$ such that their product is unchanged give the same matrix $U$.

4.1. MAPMF

When $L = M = H = 1$, the Bayes posterior $p(A, B|V^n)$ is proportional to

$$\exp\left(-\frac{n}{2\sigma^2}(V - BA)^2 - \frac{A^2}{2c_a^2} - \frac{B^2}{2c_b^2}\right). \qquad (15)$$

Figure 4 shows the contour of the above Bayes posterior when $V = 0, 1, 2$ are observed, where the number of samples is $n = 1$, the noise variance is $\sigma^2 = 1$, and the hyperparameters are $c_a = c_b = 100$ (i.e., almost flat priors). When $V = 0$, the surface has a cross shape and its maximum is at the origin. When $V > 0$, the surface is divided into the positive orthant (i.e., $A, B > 0$) and the negative orthant (i.e., $A, B < 0$), and the two 'modes' get farther apart as $V$ increases.

[Figure 4 panels: contours of the Bayes posterior for $V = 0, 1, 2$; the MAP estimators are at $(A, B) = (0, 0)$, $(A, B) \approx (\pm 1, \pm 1)$, and $(A, B) \approx (\pm\sqrt{2}, \pm\sqrt{2})$, respectively.]
Figure 4. Bayes posteriors with $c_a = c_b = 100$ (i.e., almost flat priors). The asterisks are the MAP solutions, and the dashed lines indicate the modes when $c_a, c_b \to \infty$.

For finite $c_a$ and $c_b$, the MAP solution can be expressed as

$$\hat{A}^{\mathrm{MAP}} = \pm\sqrt{\frac{c_a}{c_b}\left[|V| - \frac{\sigma^2}{n c_a c_b}\right]_+}, \qquad \hat{B}^{\mathrm{MAP}} = \pm\,\mathrm{sign}(V)\sqrt{\frac{c_b}{c_a}\left[|V| - \frac{\sigma^2}{n c_a c_b}\right]_+},$$

where $\mathrm{sign}(\cdot)$ denotes the sign of a scalar. In Figure 4, the MAP estimators are indicated by asterisks; the dashed lines indicate the modes of the contour of Eq.(15) when $c_a, c_b \to \infty$. When $V = 0$, the Bayes posterior takes its maximum value on the $A$- and $B$-axes, which results in $\hat{U}^{\mathrm{MAP}} = 0$. When $V = 1$, the profile of the peaks of the Bayes posterior is hyperbolic and the maximum value is achieved on the hyperbolic curves in the positive orthant (i.e., $A, B > 0$) and the negative orthant (i.e., $A, B < 0$); in either case, $\hat{U}^{\mathrm{MAP}} \approx 1$. When $V = 2$, a similar multimodal structure is observed. From these plots, we can visually confirm that the MAP solution with almost flat priors ($c_a = c_b = 100$) approximately agrees with the ML solution: $\hat{U}^{\mathrm{MAP}} \approx \hat{U}^{\mathrm{ML}} = V$.

Furthermore, these graphs explain the reason why $c_a c_b \to \infty$ is sufficient for MAP to agree with ML in the MF setup (see Section 3). Suppose $c_a$ is kept small, say $c_a = 1$, in Figure 4. Then the Gaussian 'decay' remains along the horizontal axis in the profile of the Bayes posterior. However, the MAP solution $\hat{U}^{\mathrm{MAP}}$ does not change, since the mode of the Bayes posterior keeps lying on the dashed line (equivalence class). Thus, MAP agrees with ML if either of $c_a$ and $c_b$ tends to infinity.

If $V = 0, 1, 2$ are observed, the FB solutions (see Eq.(3)) are given by $0$, $0.92$, and $1.93$, respectively (which were numerically computed). Since the corresponding MAP solutions are $0$, $1$, and $2$, FB and MAP are shown to produce different solutions. This happens because the Gaussian density in the space of $U$ is no longer symmetric in the joint space of $A$ and $B$ (see Figure 4 again), and thus the posterior mean and mode are different.

We can further show that the prior proportional to $\sqrt{A^2 + B^2}$ (which is improper) corresponds to the Jeffreys non-informative prior (Jeffreys, 1946) for the MF model.
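The FB-versus-MAP comparison above is easy to reproduce numerically. In the sketch below (ours), $B$ is integrated out of the posterior (15) in closed form (it is conditionally Gaussian given $A$), and the remaining one-dimensional integral over $A$ is computed on a grid; the grid range and resolution are our choices, and the output should roughly match the values 0, 0.92, and 1.93 quoted above.

```python
# Minimal sketch (not the authors' code): numerically compare the FB posterior
# mean <BA> with the MAP solution for the scalar model (15), integrating B out
# analytically and A by quadrature.
import numpy as np

n, sigma2, c_a, c_b = 1, 1.0, 100.0, 100.0
A = np.linspace(-600.0, 600.0, 240001)          # ~ +-6 prior standard deviations of A

def fb_mean(V):
    tau = n * A**2 / sigma2 + 1.0 / c_b**2      # precision of B given A
    mu_B = (n * A * V / sigma2) / tau           # mean of B given A
    logw = -A**2 / (2 * c_a**2) + 0.5 * mu_B**2 * tau - 0.5 * np.log(tau)
    w = np.exp(logw - logw.max())               # unnormalized marginal of A
    return np.sum(A * mu_B * w) / np.sum(w)     # E[BA] = E_A[ A * E[B|A] ]

for V in (0.0, 1.0, 2.0):
    u_map = max(abs(V) - sigma2 / (n * c_a * c_b), 0.0) * np.sign(V)
    print(f"V={V:.0f}  U_FB~{fb_mean(V):5.2f}  U_MAP~{u_map:5.2f}")
```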
4.2. VBMF

Next, we illustrate the behavior of the VB estimator, where the Bayes posterior is approximated by a spherical Gaussian. In the current one-dimensional setup, Corollary 1 implies that the VB posteriors $r_A(A|V^n)$ and $r_B(B|V^n)$ are expressed as

$$r_A(A|V^n) = N\!\left(A;\, \pm\sqrt{\frac{c_a}{c_b}\hat{\gamma}^{\mathrm{VB}}},\ \frac{c_a}{c_b}\zeta^{\mathrm{VB}}\right),$$
$$r_B(B|V^n) = N\!\left(B;\, \pm\,\mathrm{sign}(V)\sqrt{\frac{c_b}{c_a}\hat{\gamma}^{\mathrm{VB}}},\ \frac{c_b}{c_a}\zeta^{\mathrm{VB}}\right),$$
$$\zeta^{\mathrm{VB}} = \sqrt{\left(\frac{\hat{\gamma}^{\mathrm{VB}}}{2} + \frac{\sigma^2}{2n c_a c_b}\right)^2 + \frac{\sigma^2}{n}} - \left(\frac{\hat{\gamma}^{\mathrm{VB}}}{2} + \frac{\sigma^2}{2n c_a c_b}\right),$$

where

$$\hat{\gamma}^{\mathrm{VB}} = \left[\left(1 - \frac{\sigma^2}{nV^2}\right)|V| - \frac{\sigma^2}{n c_a c_b}\right]_+ \ \text{if } V \ne 0, \quad \text{and } \hat{\gamma}^{\mathrm{VB}} = 0 \text{ otherwise.}$$

[Figure 5 panels: contours of the VB posteriors for $V = 0, 1, 2$; the VB estimator is $(A, B) = (0, 0)$ for $V = 0$ and $V = 1$, and $(A, B) \approx (\pm\sqrt{1.5}, \pm\sqrt{1.5})$ for $V = 2$.]
Figure 5. VB posteriors and VB solutions when $L = M = 1$ (i.e., the matrices $V$, $U$, $A$, and $B$ are scalars). When $V = 2$, VB gives either one of the two solutions shown in the bottom row.

Figure 5 shows the contour of the VB posteriors $r_A(A|V^n)$ and $r_B(B|V^n)$ when $V = 0, 1, 2$ are observed, where the number of samples is $n = 1$, the noise variance is $\sigma^2 = 1$, and the hyperparameters are $c_a = c_b = 100$ (i.e., almost flat priors). When $V = 0$, the cross-shaped contour of the Bayes posterior (see Figure 4) is approximated by a spherical Gaussian function located at the origin. Thus, the VB estimator is $\hat{U}^{\mathrm{VB}} = 0$, which is equivalent to the MAP solution. When $V = 1$, the two hyperbolic 'modes' of the Bayes posterior are again approximated by a spherical Gaussian function located at the origin. Thus, the VB estimator is still $\hat{U}^{\mathrm{VB}} = 0$, which is different from the MAP solution.

$V = \sqrt{M\sigma^2/n} = 1$ is actually a transition point of the behavior of the VB estimator. When $V$ is not larger than the threshold $\sqrt{M\sigma^2/n}$, the VB method tries to approximate the two 'modes' of the Bayes posterior by a single Gaussian located at the origin. When $V$ goes beyond the threshold, the 'distance' between the two hyperbolic 'modes' of the Bayes posterior becomes so large that the VB method chooses to approximate one of the two modes in the positive and negative orthants. As such, the symmetry is broken spontaneously and the VB solution is detached from the origin. Note that, as discussed in Section 3, $M\sigma^2/n$ amounts to the expected contribution of the noise $\mathcal{E}$ to the squared singular value $\gamma^2$ ($= V^2$ in the current setup).

The bottom row of Figure 5 shows the contours of the two possible VB posteriors when $V = 2$. Note that, in either case, the VB solution is the same: $\hat{U}^{\mathrm{VB}} \approx 3/2$. The VB solution is closer to the origin than the MAP solution $\hat{U}^{\mathrm{MAP}} \approx 2$, and the difference between the VB and MAP solutions tends to shrink as $V$ increases.
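The one-dimensional VB posterior of this subsection can be evaluated directly. The sketch below (ours, with the same toy setting $n = 1$, $\sigma^2 = 1$, $c_a = c_b = 100$, and our own symbol $\zeta$ for the posterior-width quantity defined above) prints the VB posterior parameters for $V = 0, 1, 2$ and shows the symmetry-breaking transition at $V = \sqrt{M\sigma^2/n} = 1$.

```python
# Minimal sketch (not the authors' code): 1D VB posterior of Section 4.2 for the
# almost-flat-prior setting n = 1, sigma^2 = 1, c_a = c_b = 100.
import numpy as np

n, sigma2, c_a, c_b = 1, 1.0, 100.0, 100.0

def vb_1d(V):
    t = sigma2 / (n * c_a * c_b)
    gamma_vb = max((1 - sigma2 / (n * V**2)) * abs(V) - t, 0.0) if V != 0 else 0.0
    s = gamma_vb / 2 + t / 2
    zeta = np.sqrt(s**2 + sigma2 / n) - s            # posterior-width quantity
    mean_A = np.sqrt(c_a / c_b * gamma_vb)           # one of the +- solutions
    mean_B = np.sign(V) * np.sqrt(c_b / c_a * gamma_vb) if V != 0 else 0.0
    return gamma_vb, mean_A, mean_B, zeta

for V in (0.0, 1.0, 2.0):
    g, mA, mB, z = vb_1d(V)
    # U_VB = mean_B * mean_A = gamma_vb; it stays 0 up to the threshold V = 1
    # and detaches from the origin beyond it (U_VB ~ 1.5 at V = 2).
    print(f"V={V:.0f}  U_VB={mB * mA:5.2f}  means A={mA:5.2f}, B={mB:5.2f}  width~{z:4.2f}")
```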
4.3. EVBMF

Finally, we illustrate the behavior of the EVB estimator. When $L = M$, the EVB estimators of $c_{a_h}$ and $c_{b_h}$ can be analytically expressed (the details are omitted due to lack of space). Combining the analytic expressions and Corollary 1, we can explicitly plot the EVB posterior (Figure 6). When $V = 2 \le (\sqrt{L} + \sqrt{M})\sigma/\sqrt{n} = 2$, the infimum of the free energy is attained at $\sigma_a, \sigma_b, c_a, c_b \to 0$ under $\sigma_a/c_a = 1$ and $\sigma_b/c_b = 1$ (where $\sigma_a^2$ and $\sigma_b^2$ denote the VB posterior variances of $A$ and $B$). Thus, the EVB posterior is the delta function located at the origin, and the EVB estimator is $(\hat{A}^{\mathrm{EVB}}, \hat{B}^{\mathrm{EVB}}) = (0, 0)$ (see the left graph). On the other hand, when $V = 3 \ge \sqrt{7M}\,\sigma/\sqrt{n} = \sqrt{7} \approx 2.65$, the solution $(\hat{A}^{\mathrm{EVB}}, \hat{B}^{\mathrm{EVB}})$ is detached from the origin (see the right graph). Note that the EVB solution is not unique in terms of $(A, B)$ in this case, but is unique in terms of $U = BA$. The graphs exhibit the stronger shrinkage effect of EVB compared with VB with almost flat priors.

[Figure 6 panels: EVB posteriors for $V = 2$ (estimator at $(A, B) = (0, 0)$) and $V = 3$ (estimator at $(A, B) \approx (\sqrt{2.28}, \sqrt{2.28})$).]
Figure 6. EVB posteriors and EVB solutions when $L = M = 1$. Left: when $V = 2$, the EVB posterior is the delta function located at the origin. Right: the solution is detached from the origin when $V = 3$.

5. Conclusion

In this paper, we theoretically analyzed the behavior of Bayesian matrix factorization methods. More specifically, in Section 3, we derived non-asymptotic bounds of the maximum a posteriori matrix factorization (MAPMF) estimator, the variational Bayesian matrix factorization (VBMF) estimator, and the empirical VBMF (EVBMF) estimator. We then showed that MAPMF consists of the trace-norm shrinkage alone, while VBMF consists of the positive-part James-Stein (PJS) shrinkage and the trace-norm shrinkage. An interesting finding was that, while the trace-norm shrinkage does not take effect when the priors are flat, the PJS shrinkage remains activated even with flat priors. We also showed that in EVBMF, the 'strength' of the PJS shrinkage is more than doubled compared with VBMF with flat priors. We illustrated these facts using one-dimensional examples in Section 4.

The fact that the PJS shrinkage remains activated even with flat priors is induced by the non-identifiability of the MF model, where parameters form equivalence classes. Thus, flat priors in the space of the factorized matrices are no longer flat in the space of the target (composite) matrix. Furthermore, simple distributions such as the Gaussian distribution in the space of the target matrix produce highly complicated multimodal distributions in the space of the factorized matrices. As Gelman (2004) pointed out, re-parameterization involving modification of conjugate priors affects the behavior of statistical models. Although such re-parameterization is often introduced solely for computational purposes, its role is essential in matrix factorization.

Acknowledgments: The authors thank anonymous reviewers for their helpful comments. MS thanks the support from the FIRST program.

References

Anderson, T. W. (1984). An introduction to multivariate statistical analysis (2nd ed.). New York, NY, USA: Wiley.

Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. Proceedings of the Fifteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-99) (pp. 21-30).

Baldi, P., & Brunak, S. (1998). Bioinformatics: The machine learning approach. Cambridge, MA, USA: MIT Press.

Baldi, P. F., & Hornik, K. (1995). Learning in linear neural networks: A survey. IEEE Transactions on Neural Networks, 6, 837-858.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York, NY, USA: Springer.

Efron, B., & Morris, C. (1973). Stein's estimation rule and its competitors--an empirical Bayes approach. Journal of the American Statistical Association, 68, 117-130.

Funk, S. (2006). Try this at home. http://sifter.org/~simon/journal/20061211.html.

Gelman, A. (2004). Parameterization and Bayesian modeling. Journal of the American Statistical Association, 99, 537-545.

James, W., & Stein, C. (1961). Estimation with quadratic loss. Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability (pp. 361-379). Berkeley, CA, USA: University of California Press.

Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, Series A, Mathematical and Physical Sciences (pp. 453-461).

Konstan, J. A., Miller, B. N., Maltz, D., Herlocker, J. L., Gordon, L. R., & Riedl, J. (1997). GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40, 77-87.
Lim, Y. J., & Teh, Y. W. (2007). Variational Bayesian approach to movie rating prediction. Proceedings of KDD Cup and Workshop.

Nakajima, S., & Watanabe, S. (2007). Variational Bayes solution of linear neural networks and its generalization performance. Neural Computation, 19, 1112-1153.

Raiko, T., Ilin, A., & Karhunen, J. (2007). Principal component analysis for large scale problems with lots of missing values. Proceedings of the 18th European Conference on Machine Learning (pp. 691-698).

Rao, C. R. (1965). Linear statistical inference and its applications. New York, NY, USA: Wiley.

Reinsel, G. R., & Velu, R. P. (1998). Multivariate reduced-rank regression: Theory and applications. New York, NY, USA: Springer.

Rosipal, R., & Krämer, N. (2006). Overview and recent advances in partial least squares. Subspace, Latent Structure and Feature Selection Techniques (pp. 34-51). Berlin, Germany: Springer.

Salakhutdinov, R., & Mnih, A. (2008). Probabilistic matrix factorization. Advances in Neural Information Processing Systems 20 (pp. 1257-1264). Cambridge, MA, USA: MIT Press.

Srebro, N., Rennie, J., & Jaakkola, T. (2005). Maximum-margin matrix factorization. Advances in Neural Information Processing Systems 17.

Watanabe, K., & Watanabe, S. (2006). Stochastic complexities of Gaussian mixtures in variational Bayesian approximation. Journal of Machine Learning Research, 7, 625-644.

Watanabe, S. (2001). Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13, 899-933.

Watanabe, S. (2009). Algebraic geometry and statistical learning theory. Cambridge, UK: Cambridge University Press.

Yamazaki, K., & Watanabe, S. (2003). Singularities in mixture models and upper bounds of stochastic complexity. Neural Networks, 16, 1029-1038.