Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model Kai Yu John Lafferty Shenghuo Zhu Yihong Gong NEC Laboratories America, Cupertino, CA 95014 USA Carnegie Mellon University, Pittsburgh, PA 15213 USA kyu@sv.nec-labs.com lafferty@cs.cmu.edu zsh@sv.nec-labs.com ygong@sv.nec-labs.com Abstract A nonparametric model is introduced that allows multiple related regression tasks to take inputs from a common data space. Traditional transfer learning models can be inappropriate if the dependence among the outputs cannot be fully resolved by known inputspecific and task-specific predictors. The proposed model treats such output responses as conditionally independent, given known predictors and appropriate unobserved random effects. The model is nonparametric in the sense that the dimensionality of random effects is not specified a priori but is instead determined from data. An approach to estimating the model is presented uses an EM algorithm that is efficient on a very large scale collaborative prediction problem. The obtained prediction accuracy is competitive with state-of-the-art results. approach is to model all of the data jointly, as Yij ij = iid µ + m(xi , zj ) + N(0, ) 2 ij , (1) where now mij = m(xi , zj ) is a regression function based on task-specific predictors xi X . In order to directly model the dependency between tasks, a multi-task Gaussian process approach (Yu et al., 2007; Bonilla et al., 2008) may assume m GP(0, ) where (zj , zj ; ) 0 is a covariance function among inputs and (xi , xi ; ) 0 a covariance function among tasks, which means Cov(mij , mi j ) = ii jj . The parameters and can be estimated, for example, by maximizing the marginalized likelihood. The model is nonparametric in the sense that m lives in an infinite-dimensional function space; it can be more appropriate for capturing complex dependencies than a fixed-dimension model. But this flexibility comes at a computational price. If Y RM ×N is the random response matrix, including those elements Yij that are observed and those that need to be predicted, the computational complexity of inference is O(M 3 N 3 ). It is thus infeasible to do inference if M and (or) N are large. For example, in our experiments on the Netflix dataset, M = 480, 189 and N = 17, 770. Just as significantly, from a modeling perspective this model may not be suitable in the situation where the responses Yij cannot be fully explained by the predictors xi and zj alone. In such a case the conditional independence assumption p(Y | m, x, z) = i,j 1. Introduction In transfer learning and multi-task learning, the observed responses of interest can be seen as relational measurements under a pair of heterogeneous conditions. If there are M tasks and N shared observations, the standard regression model forms M separate estimates Yij ij = iid µ + mi (zj ) + N(0, ) 2 ij , j = 1, . . . , N where µ is a bias parameter and zj Z is the vector of input predictors for observation j. A more flexible Appearing in Proceedings of the 26 International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s). th p(Yij |m, xi , zj ) implied by Eq. (1) is invalid. Such is often the case for relational or networked observations. For instance, Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model in the application of predicting users' unobserved ratings on movies, the observed ratings themselves are often more informative than the users' demographic attributes and the movies' genre attributes. Motivated by these considerations, in this paper we propose a transfer learning approach that satisfies three desiderata. The model (1) is nonparametric with sufficient flexibility, (2) is appropriate for dependent responses that are not solely conditioned on the given predictors, and (3) supports efficient learning on very large scale data. where > 0, 0 (xi , xi ) is a given covariance function based on task-specific predictors, and fi denotes the function values of the random effects on task i. Following a hierarchical Bayesian framework, we let the covariance function be a random variable sampled from a hyper-prior distribution IWP(, 0 + ), which defines an inverse-Wishart process, where > 0, is a degree-of-freedom parameter, is a Dirac delta kernel function, and 0 (zj , zj ) is a given covariance function based on input data. The inverse-Wishart process (IWP) is a nonparametric prior for random covariance functions, based on a nonstandard notation of the inverse-Wishart distribution (Dawid, 1981); a brief introduction is given in the Appendix. The deviation of Y from the population mean µ is decomposed into two parts, random effects m and f . Let 0 (xi , xi ) = (xi ), (xi ) , where is an implicit feature mapping; then m can be parameterized as mij = mj (xi ) = m(xi , j ) = (xi ), j where {j } are latent Gaussian random variables whose covariance follows E( j , j ) = j,j . Because of the coupling between mi and xi , mi is nonexchangeable given and the predictors xi , while fi is exchangeable under the same condition. The separation of these two kinds of random effects can enable computational advantages. For instance, one can let the nonexchangeable part be parametric while the exchangeable part is nonparametric; we will develop this strategy below in Sec. 3. We note that the independent noise ij is no longer separately required in this model, but can be absorbed into the random effects f . Since the covariance is a random variable, it is fully capable of modeling the total variance of the random effects and independent noise. Indeed, this treatment can considerably improve the computational efficiency, as we shall see in Sec. 3.4. Since the population mean µ can be reliably estimated by pooling all the observations together and computing their average, hereafter, for simplicity of presentation, we assume Yij to be centered and thus µ will not occur in the model. 2.2. An Equivalent Model Note that the above row-wise generative model seems to be arbitrary, because there is no particular reason to prefer it over the alternative column-wise sampling 2. A Random Effects Model In statistics, a random effects model is a kind of hierarchical linear model that that allows dependent observations occurring in repeated or multilevel structures. The scenario of transfer learning matches such a form; in particular, the population is grouped together according to tasks, and each group repeatedly generates random observations. A natural extension of the previous model is to assume the Yij are conditionally independent, given mij m(xi , zj ) and additional additive random effects fij . The model becomes Yij = µ + mij + fij + ij , iid ij N(0, 2 ) where the distribution of m and f determines the dependence among the variables Yij . One may wish to choose a form of f that is appropriate to explain the data dependence, for example, introducing latent variables ui and vj such that fij has a parametric form f (ui , vj ; f ), as suggested in (Hoff, 2005). A nonparametric random effects model may assume f GP(0, ), where the covariances and are parameters to be determined from data. While this model is natural and flexible, it is subject to the large computational limitations discussed above. We address these limitations by imposing additional structure and developing an efficient estimation procedure. 2.1. Our Model In the following we introduce a refinement of the above model that has specific structure, but exhibits some attractive conceptual and computational properties. We again take a hierarchical linear model: Yij = µ + mij + fij , but now assume that m fi iid GP(0, 0 ), GP(0, ), i = 1, . . . , M. Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model process, m fj iid 3. Large-scale Inference and Learning IWP(, 0 + ), GP(0, 0 ), GP(0, ), j = 1, . . . , N, Given the model and some observed elements, denoted by YO , we can make predictions by computing the conditional expectation on unseen elements. To this end, we need to compute the posterior distribution p(Y|YO , ), where includes 0 , 0 , , and . It is highly challenging to apply nonparametric models to large-scale problems. We will present several algorithms, each with an analysis of its computational complexity in the context of the Netflix collaborative prediction problem, which contains M = 480, 189 users and more than 100 millions ratings on N = 17, 770 movies. Our estimates of computing time are based on a 2.66GHz Intel PC with 3.0GB memory. 3.1. Modeling Large-scale Data Assuming M N , e.g., the Netflix case, it is computationally more convenient to work with the row-wise model. Since the computation is always on a finite matrix Y RM ×N , including those elements Yij that are observed and those that need to be predicted, and since the model is consistent under marginalization, in the following we state the model on finite matrices IW(, 0 + IN ), m N(0, 0 ), Yi N(mi , ), i = 1, . . . , M where IW defines a finite inverse-Wishart distribution, and 0 are both N × N covariance matrices, 0 is M × M , mi is the i-th row of m, and Yi is the i-th row of Y. To further facilitate a large-scale implementation, we assume 0 has finite dimensions. i.e., 0 (xi , xi ) = (xi ) (xi ), with : X Rp . For large-scale data, it is reasonable to assume M N p. The finitedimension assumption is mainly due to computational concerns for large-scale problems, though the model itself does not require this. However, the sampled latent covariance functions and are always infinitedimensional, and so is the function Y . Therefore the overall model is still nonparametric. Without loss of generality, we let 0 (xi , xi ) = 1 1 p 2 xi , p 2 xi . In this case, the random effects m follow mi = xi , k N(0, ), k = 1, . . . , p iid where similarly fj denotes the function values of the random effects on case j, across tasks. It turns out that the two models are equivalent -- in both cases, if we compute the marginal distribution of Y , namely 1. 2. p(Y |m, , )p(m|, 0 )p(|0 , , ) dm d; p(Y |m, , )p(m|, 0 )p(|0 , , ) dm d; we obtain exactly the same closed-form distribution Y MTP , 0, (0 + ), (0 + ) , where MTP defines a matrix-variate Student-t process. That is, any subset of matrix values Y RM ×N follows a matrix-valued Student-t distribution (Gupta & Naga, 1999). The proof of the equivalence is sketched in the Appendix. 2.3. Discussion The model is nonparametric at two levels. First, the covariance functions and are nonparametric, because they both live in infinite dimensional function spaces. Moreover, they do not have explicit parametric forms. Second, the dimensionality of any finite Y is not specified a priori, but instead increases with the observation size. The equivalence between the two generative processes reveals an appealing symmetry in the model. One can either view the model as adapting the free-form covariance function to the data, or equivalently, adapting the covariance to the data. Therefore the model is somewhat similar to those that deal with both and simultaneously, such as the multi-task Gaussian process discussed in Sec. 1. However, as we shall see in Sec. 3, the model potentially allows us to avoid the prohibitive computational cost O(M 3 N 3 ), without sacrificing significant flexibility. The model can easily incorporate group-specific effects, as do most classical random-effects models. For example, one can expand feature representation the by adding a constant term c, such that 0 (xi , xi ) = (xi ), (xi ) + c. Then m can be parameterized as mij = (xi ), j + j where j is the j-th element of GP(0, c), i.e., the column-specific random effects. Similarly, with 0 (zj , zj ) = (zj ), (zj ) + b, one can deal with the row-specific random effects as well. where is a p × N random matrix, and k is its k-th row. Then the model becomes IW(, 0 + IN ), N(0, Ip ), Yi N( xi , ), i = 1, . . . , M Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model 3.2. Approximate Inference ­ Gibbs Sampling Based on the joint distribution p(Y, , |) described by the model, the prediction can be approximated by a Markov chain Monte Carlo method p(Y|YO , )YdY 1 T T · E-step: compute the posterior distribution of Y given the current , M M Q(Y) = i=1 p(Yi |YOi , , ) = i=1 N(Yi | i , Ci ), Y (Y; Y(t) ) t=1 where the sufficient statistics are mi = xi , i = mi + [:,Ji ] -1,Ji ] (Y[i,Ji ] - m[i,Ji ] ) , [Ji Ci = - [:,Ji ] -1,Ji ] [Ji ,:] . [Ji · M-step: optimize and , = arg min EQ(Y) [- log p(Y, , |)] , where the samples Y(t) , together with (t) , (t) are i.i.d. drawn from p(Y, , |YO , ). For Gibbs sampling, whose details are omitted here due to space limitation, the computational cost of one Gibbs iteration is roughly M M O i=1 Ni3 + 2 i=1 Ni2 + 2M N 2 + N 3 , (2) where we count mainly multiplications, and ignore the impact of factor p, since M N p. Compared with O(M 3 N 3 ) for the direct Gaussian process, this is a dramatically reduced complexity. However, since M is large, the cost O(2M N 2 ) is still prohibitive-- each Gibbs iteration will take hundreds of hours on the Netflix data. 3.3. Approximate Inference ­ EM We first introduce some necessary notation. Let Ji {1, . . . , N } be the index set of the Ni observed elements in the row Yi , and [:,Ji ] RN ×Ni be the matrix obtained by keeping the columns of indexed by Ji . Then let [Ji ,Ji ] RNi ×Ni be obtained from [:,Ji ] by further keeping only the rows indexed by Ji . We can similarly define [Ji ,:] , Y[i,Ji ] and m[i,Ji ] . Alteratively, the conditional expectation can be approximated by p(Y|YO , )YdY = p(Y|YO , , )p(, |YO , )Y dddY p(Y|YO , , )Y dY and then let , . It turns out that can be nicely decoupled from , and has a closed form: = (x x + Ip )-1 x . Plugging into the optimization cost, we have = -1 [C - x(x x + Ip )-1 x ] + 0 + IN M + 2N + p + M where C = i=1 (Ci + i i ). Further detail on the M-step can be found in the Appendix. The overall cost of a single EM iteration is M M M O i=1 Ni3 + i=1 Ni2 N+ M+ i=1 Ni N2 . The computational cost seems to be even higher than that of Gibbs sampling, since N 2 is multiplied by the M large factor (M + i=1 Ni ), and N is a large multiM plicative factor of i=1 Ni2 . We estimate that, in a M single EM step, excluding the term O( i=1 Ni3 ), the computation would take several thousands of hours for the Netflix problem. 3.4. Efficient EM Implementation In order to introduce a faster algorithm, we now fix some notation. Let Ui RN ×Ni be a column selection operator, whose (s, t) element is one, if the index s equals to the t-th element of Ji , otherwise zero, so that [:,Ji ] = Ui and [Ji ,Ji ] = Ui Ui . Furthermore, Ui [Ji ,Ji ] Ui restores a N × N sparse matrix where the elements of [Ji ,Ji ] is placed back to their original positions in . where and are the maximum a posteriori estimates , = arg max p(, |YO , ). , When the number of observations is sufficiently large that the posterior distribution is concentrated around its mode, the above approximation can be tight. The expectation-maximization (EM) algorithm is a popular approach to finding the mode, alternating the following two steps: Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model The major cost is from the E-step that computes the sufficient statistics i and Ci for every i. Our key observation is that, the M-step only needs C = M i=1 Ci + and x from the previous E-step. To obtain them, in fact it is unnecessary to compute i and Ci at all. For example, by using [:,Ji ] = Ui , we have M M Ci i=1 = i=1 M - [:,Ji ] -1,Ji ] [Ji ,:] [Ji - Ui -1,Ji ] Ui [Ji i=1 M This is a dramatic efficiency improvement; for each EM iteration on Netflix, the original thousands of hours are now reduced to several hours. The remaining maM jor cost O( i=1 Ni3 ) comes from the matrix inversion -1,Ji ] . Occasionally a row may have an unusually [Ji large number of observations, so we truncate Ni by randomly selecting 2000 observations if Ni > 2000. In the end, a single EM step takes about 5 hours, and the whole optimization typically converges in about 30 steps. As mentioned in Sec. 2.1, considerable computation is also saved by avoiding to separately model the noise M 2 3 ij . Otherwise, one has to spend O( i=1 (2Ni +2Ni )) operations to compute the noise statistics in a single EM step, which amounts to about 10 hours of computing time. = = M - i=1 Ui -1,Ji ] Ui [Ji Since multiplication with Ui is done by memory access M only, above reduces O( i=1 Ni3 + N Ni2 + N 2 Ni ) to M O( i=1 Ni3 + 2N 3 ). In a similar way we have M 4. Related Work There is a large body of research on multi-task learning using Gaussian processes, including those that learn the covariance shared across tasks (Lawrence & Platt, 2004; Schwaighofer et al., 2005; Yu et al., 2005), and those that additionally consider the covariance between tasks (Yu et al., 2007; Bonilla et al., 2008). The methods that only use have been applied to collaborative prediction (Schwaighofer et al., 2005; Yu et al., 2006). However, due to the computational cost, they were both evaluated on very small data sets, where M and N are several hundreds only. Our work differs from the above models in two aspects: First, a new nonparametric model is proposed to explicitly combine known predictors and random effects for multi-task predictions; Second, considerations in both model designing and algorithm engineering are put together to scale the nonparametric model to very large data sets. Another class of related work is the so-called semiparametric models, where the observations Yij are linearly generated from a finite number of basis functions randomly sampled from a Gaussian process prior (Teh et al., 2005). Our approach is more related to a recent work (Zhu et al., 2009), where Yij is modeled by two sets of multiplicative factors, one sampled from GP(0, ) and the other from GP(0, ), namely, Y becomes a low-rank random function. Low-rank matrix factorization is perhaps the most popular and the state-of-the-art method for collaborative filtering, e.g., (Salakhutdinov & Mnih, 2008a). Our model generalizes them in the sense that it models an infinite-dimensional relational function. A simplification of our work leads to a nonparametric prin- = x x + i=1 M Ui ai ai Ui M + i=1 Ui ai xi M + i=1 Ui ai xi x = x x+ i=1 Ui ai xi where ai = -1,Ji ] (Y - m)[i,Ji ] . [Ji Before running the EM iterations, we pre-compute x x and (x x + Ip )-1 , since they are both repeatedly used in the algorithm. Then we implement the E-step as the following Reset For A1 RN ×N , A2 RN ×p ; i = 1, . . . , M ; P = -1,Ji ] ; [Ji a = P(Y[i,Ji ] - xi [:,Ji ] ) ; A1 A1 + Ui (aa - P)Ui ; A2 A2 + Ui axi ; End. The statistics required by the M-step are computed as C = M + x x + A1 + A2 + A2 x = x x + A2 With this new implementation, the major cost of a single EM step is M M O i=1 Ni3 + 2 i=1 Ni2 + 2N 3 . Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model Table 1. Prediction Error on EachMovie Data Table 2. Computation Time on EachMovie Data Method User Mean Movie Mean FMMMF PPCA BSRM-1 BSRM-2 NREM-1 NREM-2 RMSE 1.4251 1.3866 1.1552 1.1045 1.0902 1.0852 1.0816 1.0758 Standard Error 0.0004 0.0004 0.0008 0.0004 0.0003 0.0003 0.0003 0.0003 Method FMMMF PPCA BSRM-1 BSRM-2 NREM-1 NREM-2 Run Time (hours) 4.94 1.26 1.67 1.70 0.59 0.59 cipal component analysis (NPCA) model (Yu et al., 2009). A non-probabilistic nonparametric method, i.e., maximum-margin matrix factorization, was introduced in (Srebro et al., 2005). Very few matrix factorization methods use known predictors. One such a work is by (Hoff, 2005) that introduces low-rank multiplicative random effects, in addition to known predictors, to model networked observations. · User Mean and Movie Mean: baseline methods that predict ratings by computing the empirical mean of the same users' observed ratings, or, other users' ratings on the same movie. · FMMMF: fast max-margin matrix factorization (Rennie & Srebro, 2005), a low-rank method with 2 norm regularization on the factors, optimized by conjugate gradient descent. · PPCA: probabilistic principal component analysis (Tipping & Bishop, 1999), which is a probabilistic low-rank matrix factorization method optimized by EM algorithm. · BSRM: Bayesian stochastic relational model (Zhu et al., 2009), a semi-parametric model implemented by Gibbs sampling, where Y is finitedimensional. BSRM-1 does not use additional user/movie attributes, while BSRM-2 does. · NREM: Nonparametric random effects model, the work of this paper, which models Y as an infinitedimensional random function. NREM-1 does not use additional attributes, i.e., we let 0 and 0 be a constant c, as suggested in Sec. 2.3. For both BSRM-2 and NREM-2, we obtain the user and movie attributes from the top 20 eigenvectors of the binary matrix that indicates whether or not ratings are observed in the training set. Note that there can be different ways to obtain useful attributes, which is not the focus of this paper. All the results are summarized in Tab. 1 and Tab. 2. Our models achieved lower prediction errors than other methods. This result is perhaps not entirely surprising, because nonparametric models are generally more flexible to explain complex data than parametric models. On the other hand, the training of our models is even faster, which is a significant result. 5.2. Netflix Problem The data contain 100, 480, 507 ratings from 480, 189 users on 17, 770 movies. In this case Yij 5. Experiments 5.1. EachMovie Data We did experiments on the entire EachMovie data, which include 74, 424 users' 2, 811, 718 numeric ratings Yij {1, 2, 3, 4, 5, 6} on 1, 648 movies. The data are very sparse because 97.17% of the elements are missing. We randomly selected 80% of the observed ratings of each user for training and used the user's rest 20% ratings for testing. This random partition was repeated 10 times independently. We calculated RMSE (root mean square error) for each partition and averaged the 10 results to compute the mean and the standard error. In our experiments, for each evaluated method, we used one training/testing partition to do model selection, including determining hyper parameters and dimensionality. The selected model was then used for other partitions. For low-rank matrix factorization methods, we chose the dimensionality from D = 50, 100, 200. We found that larger dimensionality generally gave rise to better performances, but the accuracy was tending to saturate after the dimensionality got more than 100. We used 1% of the training ratings for setting the stopping criterion: for each method, we terminated the iterations if RMSE on the holdout set is observed to increase, and then included the holdout set to run one more iteration of training. We recorded the total run time of each method. In the experiments, we implemented the following methods: Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model {1, 2, 3, 4, 5}. In addition, Netflix.com provides a set of validation data with 1, 408, 395 ratings. Therefore there are 98.81% of elements missing in the rating matrix. For evaluation, a set of 2, 817, 131 ratings are withheld. People need to submit their results in order to obtain the RMSE result from Netflix.com. We decided to direct compare our results with those reported in the literature, since they were all evaluated on the same test data. The results are shown in Tab. 3, where Cinematch is the baseline provided by Netflix. Besides those introduced in Sec. 5.1, there are several other methods: · SVD: a method almost the same as FMMMF, using a gradient-based method for optimization (Kurucz et al., 2007). · RBM: Restricted Boltzmann Machine trained by contrast divergence (Salakhutdinov et al., 2007). · PMF and BPMF: probabilistic matrix factorization (Salakhutdinov & Mnih, 2008b), and its Bayesian version (Salakhutdinov & Mnih, 2008a). · PMF-VB: probabilistic matrix factorization using a variational Bayes method for inference (Lim & Teh, 2007). Note that sometimes the running time was not found in the papers. For BPMF, we gave a rough estimation by assuming it went through 300 iterations (each took 220 minutes as reported in the paper). Similar to the EachMovie experiments, BSRM-2 and NREM2 used the top 40 eigenvectors of the binary indicator matrix as additional attributes. As shown in Tab. 3, if compared with those matrix factorization methods, NREM-1 used no additional attributes either, but generated more accurate predictions. Furthermore, NREMs are highly competitive if compared with the semi-parametric method BSRM. In terms of efficiency for training, our nonparametric model turns out to be even faster than those compared parametric and semi-parametric models. We note that the top performers in the Netflix competition reported better results by combining heterogenous models. For example, the progress award winner in 2007 combined predictions from about one hundred of different models (Bell et al., 2007). However, our focus here is not on developing ensemble methods. Table 3. Performance on Netflix Data Method Cinematch SVD PMF RBM PMF-VB BPMF BSRM-2 NREM-1 NREM-2 RMSE 0.9514 0.920 0.9265 0.9060 0.9141 0.8954 0.8881 0.8876 0.8853 Run Time (hours) 300 1100 350 148 150 random effects and known attributes to explain the complex dependence of data; Second, an algorithm is proposed to speed up the model on very large-scale problems. Our experiments demonstrate that the algorithm works very well on two challenging collaborative prediction problems. In the near future, it will be promising to perform a full Bayesian inference by a parallel Gibbs sampling method. References Bell, R. M., Koren, Y., & Volinsky, C. (2007). The BellKor solution to the Netflix prize (Technical Report). AT&T Labs. Bonilla, E. V., Chai, K. M. A., & Williams, C. K. I. (2008). Multi-task Gaussian process prediction. Advances in Neural Information Processing Systems 20 (NIPS), 153­160. Dawid, A. P. (1981). Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68, 265­274. Gupta, A. K., & Naga, D. K. (1999). Matrix variate distributions. Chapman & Hall/CRC. Hoff, P. (2005). Bilinear mixed-effects models for dyadic data. Journal of the American Statistical Assosciation, 100, 286­295. Kurucz, M., Benczur, A. A., & Csalogany, K. (2007). Methods for large scale SVD with missing values. Proceedings of KDD Cup and Workshop. Lawrence, N. D., & Platt, J. C. (2004). Learning to learn with the informative vector machine. Proc. of the 21st International Conference on Machine Learning (ICML), 65­72. Lim, Y. J., & Teh, Y. W. (2007). Variational Bayesian approach to movie rating prediction. Proceedings of KDD Cup and Workshop. 6. Conclusion In this paper a nonparametric model for multi-task learning is introduced. The contributions are twofold: First, the model provides a novel way to use Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model Rennie, J. D. M., & Srebro, N. (2005). Fast maximum margin matrix factorization for collaborative prediction. Proc. of the 22nd International Conference on Machine Learning (ICML), 713­719. Salakhutdinov, R., & Mnih, A. (2008a). Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. Proc. of the 25th International Conference on Machine Learning (ICML), 880­887. Salakhutdinov, R., & Mnih, A. (2008b). Probabilistic matrix factorization. Advances in Neural Information Processing Systems 20 (NIPS), 1257­1264. Salakhutdinov, R., Mnih, A., & Hinton, G. (2007). Restricted Boltzmann machines for collaborative filtering. Proc. of the 24th International Conference on Machine Learning (ICML), 791­798. Schwaighofer, A., Tresp, V., & Yu, K. (2005). Hierarchical Bayesian modelling with Gaussian processes. Advances in Neural Information Processing Systems 17 (NIPS), 1209­1216. Srebro, N., Rennie, J. D. M., & Jaakola, T. S. (2005). Maximum-margin matrix factorization. Advances in Neural Information Processing Systems 18 (NIPS), 1329­1336. Teh, Y., Seeger, M., & Jordan, M. (2005). Semiparametric latent factor models. The 8th Conference on Artificial Intelligence and Statistics (AISTATS). Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statisitical Scoiety, B, 611­622. Yu, K., Chu, W., Yu, S., Tresp, V., & Xu, Z. (2007). Stochastic relational models for discriminative link prediction. Advances in Neural Information Processing Systems 19 (NIPS), 1553­1560. Yu, K., Tresp, V., & Schwaighofer, A. (2005). Learning Gaussian processes from multiple tasks. Proc. of the 22nd International Conference on Machine Learning (ICML), 1012­1019. Yu, K., Zhu, S., Lafferty, J., & Gong, Y. (2009). Fast nonparametric matrix factorization for large-scale collaborative filtering. The 32nd SIGIR conference. Yu, S., Yu, K., Tresp, V., & Kriegel, H.-P. (2006). Collaborative ordinal regression. Proc. of the 23rd International Conference on Machine Learning (ICML), 1089­1096. Zhu, S., Yu, K., & Gong, Y. (2009). Stochastic relational models for large-scale dyadic data using MCMC. Advances in Neural Information Processing Systems 21 (NIPS), 1993­2000. 7. Appendix 7.1. Inverse-Wishart Processes A nonstandard definition of inverse-Wishart distribution IW(, ) was introduced in (Dawid, 1981), +2N with its p.d.f. proportional to ||- 2 etr( -1 -1 ). 2 The distribution is consistent under marginalization, namely, its any principal submatrix 11 follows IW(, 11 ). Therefore, just like Gaussian processes, one can characterize a random covariance function by an inverse-Wishart process. 7.2. Proof Sketch of the Model Equivalence The proof is based on Theorem 4.2.1 in (Gupta & Naga, 1999). Using our notation, it states: Let S W( + N - 1, -1 ), X N(0, IN ), if Y = 1 X (S- 2 ), then Y follows a matrix-variate Studentt distribution MT(, 0, , ). The theorem implies: If S-1 IW(, ) and Y N(0, S-1 ), then Y MT(, 0, , ). This can be used to derive our result in the case where Y is a random matrix. Since both inverse-Wishart and Gaussian are consistent under marginalization (Dawid, 1981), the result can be generalized to the case of random functions Y . 7.3. Derivation of the M-step The optimization cost of the M-step can be written as EQ(Y) [- log p(Y|, )] - log p(|) - log p(|) J1 (,)+const1 J2 (,)+const2 J3 ()+const3 where 1 M tr -1 C - 2 x + x x + log || 2 2 p 1 J2 = tr -1 ( ) + log || 2 2 + 2N 1 -1 J3 = tr (0 + IN ) + log || 2 2 J1 = The part containing can be organized into 1 tr -1 ( - b) (x x + Ip )( - b) - xb 2 where b = (x x + Ip )-1 x . This clearly suggests = b. Removing irrelevant constants, we have J(, ) = M + 2N + p + 1 log || + tr -1 2 2 where = -1 (C - x(x x + Ip )-1 x ) + 0 + J IN . Then a closed form is obtained from = 0.