Non-linear Matrix Factorization with Gaussian Processes

Neil D. Lawrence (neill@cs.man.ac.uk), School of Computer Science, University of Manchester, Kilburn Building, Oxford Road, M13 9PL, U.K.
Raquel Urtasun (rurtasun@icsi.berkeley.edu), ICSI and EECS, UC Berkeley, 1947 Center Street, Berkeley, CA 94704

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

A popular approach to collaborative filtering is matrix factorization. In this paper we develop a non-linear probabilistic matrix factorization using Gaussian process latent variable models. We use stochastic gradient descent (SGD) to optimize the model. SGD allows us to apply Gaussian processes to data sets with millions of observations without approximate methods. We apply our approach to benchmark movie recommender data sets. The results show better than previous state-of-the-art performance.

1. Introduction

Collaborative filtering is the process of filtering information from different viewpoints. A particular use of the approach is the prediction of user tastes. Given a body of works, such as books or movies, what does a given user's quality rating of one item say about their likely rating for another? There are two dominant approaches to collaborative filtering. The neighborhood approach involves expressing a similarity measure between the items to be rated. When a prediction for a new item is required, the similarity between the new item and previously rated items is computed, and the new prediction is given by a normalized similarity-weighted sum of the rated items. The alternative approach is the latent factor approach, which typically involves a low rank approximation. There is evidence (Bell & Koren, 2007) that both of these approaches to collaborative filtering are important for good performance on real data. In this paper we develop a non-linear extension of the latent factor approach to collaborative filtering.

We show how a probabilistic matrix factorization is equivalent to probabilistic principal component analysis. We then extend this model in a non-linear way to give a probabilistic non-linear matrix factorization. Moreover, our formula for predicting a new test rating from previously rated items turns out to be strikingly similar to the weighted similarity equations used in neighborhood approaches. Our model therefore embodies a combined approach to collaborative filtering. Heuristic algorithms exhibiting this combined approach have already been suggested (see e.g. Koren, 2008). Our algorithm emerges naturally through Gaussian process prediction, and we show how the parameters of the algorithm can all be learned using maximum likelihood via stochastic gradient descent.

The latent factor approach to collaborative filtering sees the data as a sparsely populated matrix of ratings. For a data set with N items and D users we take this matrix to be Y \in \Re^{N \times D}. The objective is to factorize Y into a lower rank form, Y \approx U^\top V (see e.g. Rennie & Srebro, 2005; DeCoste, 2006), where U \in \Re^{q \times N} and V \in \Re^{q \times D}. Prediction can then be done by estimating (U, V) from the training data and computing the resulting approximation to Y. In this paper we consider a non-linear generalization of this approach.

To guide algorithmic development we favor a probabilistic perspective; we therefore consider the very natural probabilistic interpretation of this factorization given by Salakhutdinov and Mnih (2008b), which they referred to as probabilistic matrix factorization (PMF). Their probabilistic model takes the form
p(Y | U, V, \sigma^2) = \prod_{i=1}^{N} N(y_{i,:} | V^\top u_{:,i}, \sigma^2 I),

where u_{:,i} is the ith column of U and y_{i,:} is a column vector taken from the ith row of Y, containing the ratings of the ith item from the users. In practice y_{i,:} will have many missing values, but we will ignore this aspect for the moment. In PMF we are modeling the matrix Y as a noise corrupted low rank matrix. The mean of the distribution is given by the matrix factorization, U^\top V, and the noise is taken to be Gaussian with variance \sigma^2. In PMF a Gaussian prior is placed over U,

p(U) = \prod_{i=1}^{N} \prod_{j=1}^{q} N(u_{j,i} | 0, \alpha_u^{-1}),

and over V,

p(V) = \prod_{i=1}^{D} \prod_{j=1}^{q} N(v_{j,i} | 0, \alpha_v^{-1}).

Ideally, the marginal likelihood of the model would be calculated, but in practice this is not tractable. Instead, in their original paper, Salakhutdinov and Mnih suggest maximum a posteriori (MAP) approximations.
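To make the MAP approximation concrete, the following minimal sketch (not the authors' code; the function and variable names, and the use of numpy, are our own) writes down the negative log posterior that such a MAP treatment minimizes over U and V for the observed entries of Y, under the Gaussian likelihood and priors above.

```python
import numpy as np

def pmf_map_objective(U, V, Y, observed, sigma2, alpha_u, alpha_v):
    """Negative log posterior of PMF, up to additive constants.

    U: q x N item factors, V: q x D user factors,
    Y: N x D rating matrix, observed: boolean N x D mask of known ratings.
    """
    R = (U.T @ V - Y) * observed                   # residuals on observed ratings only
    data_term = 0.5 / sigma2 * np.sum(R ** 2)      # Gaussian likelihood term
    prior_term = 0.5 * alpha_u * np.sum(U ** 2) \
               + 0.5 * alpha_v * np.sum(V ** 2)    # zero-mean Gaussian priors on U and V
    return data_term + prior_term
```

Minimizing this objective by (stochastic) gradient descent recovers the familiar regularized matrix factorization; the approach developed below avoids this point estimate over both matrices by marginalizing one of them analytically.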
It is well known that the SVD of a data matrix is directly equivalent to principal component analysis (PCA). In this case we consider Y to be a design matrix, with each row representing a separate data point. Salakhutdinov and Mnih consider their model to be a probabilistic generalization of SVD, so it makes sense to ask: what is the relationship between PMF and probabilistic PCA (PPCA, Tipping & Bishop, 1999)?

2. PMF is Bayesian PCA

It turns out that the unconstrained PMF model is probabilistically equivalent to Bayesian PCA (Bishop, 1999a). This is a little easier to see with a small change of notation. Consider a matrix of latent variables, X \equiv U^\top \in \Re^{N \times q}, and a mapping matrix which goes from the latent space to the observed data space, W \equiv V^\top \in \Re^{D \times q}. Using this new notation,

p(Y | W, X, \sigma^2) = \prod_{i=1}^{N} N(y_{i,:} | W x_{i,:}, \sigma^2 I),   (1)

the likelihood has a familiar look: it is a multi-output linear regression from a q dimensional feature matrix X to matrix targets Y. Placing a prior over X,

p(X) = \prod_{i=1}^{N} \prod_{j=1}^{q} N(x_{i,j} | 0, \alpha_x^{-1}),

and marginalizing leads to

p(Y | W, \sigma^2, \alpha_x) = \prod_{i=1}^{N} N(y_{i,:} | 0, \alpha_x^{-1} W W^\top + \sigma^2 I).

When optimizing with respect to parameters we can absorb \alpha_x into W, leaving the likelihood function associated with PPCA (Tipping & Bishop, 1999). Unfortunately, attempting to marginalize W is now intractable. Instead, we take a step back and consider marginalizing W instead of X. Taking the prior over W,

p(W) = \prod_{i=1}^{D} \prod_{j=1}^{q} N(w_{i,j} | 0, \alpha_w^{-1}),

and combining with the likelihood, we recover the following marginal likelihood:

p(Y | X, \sigma^2, \alpha_w) = \prod_{j=1}^{D} N(y_{:,j} | 0, \alpha_w^{-1} X X^\top + \sigma^2 I).

This is the marginal likelihood of a Bayesian linear regression model with multiple outputs. However, we are not given the input matrix, X; instead we optimize with respect to it. Optimization with respect to these inputs is also equivalent to probabilistic PCA, and is known as dual probabilistic PCA (DPPCA, Lawrence, 2005). If we were to marginalize both X and W we would recover Bayesian PCA. This is not analytically tractable, but the Laplace approximation (Bishop, 1999a; Minka, 2001) and variational approaches (Bishop, 1999b) have been proposed; Salakhutdinov and Mnih (2008a) used Gibbs sampling to deal with the intractabilities. The equivalences of the models we've laid out above imply we can deal with marginalization of either X or W analytically, leaving us to optimize the resulting marginal likelihood with respect to the remaining matrix and the hyperparameters of the model.

2.1. Missing Values

Both models we've described are Gaussian models with a particular covariance structure. This means that marginalizing over missing values is straightforward. Consider a Gaussian distribution over a vector y with mean \mu and covariance \Sigma, y \sim N(\mu, \Sigma). We represent an observed subset of y by y_i, where i represents the indexes of the observed values. Marginalizing the missing values leads to a Gaussian of the form y_i \sim N(\mu_i, \Sigma_{i,i}), where \mu_i and \Sigma_{i,i} represent the mean vector and covariance matrix with the rows (and, for the covariance, columns) associated with the unobserved elements of y removed. This implies that when the data matrix is sparse (as is typically the case in collaborative filtering) the likelihoods have the form

p(Y | W, \sigma^2, \alpha_x) = \prod_{i=1}^{N} N(y_{i,j_i} | 0, \alpha_x^{-1} W_{j_i,:} W_{j_i,:}^\top + \sigma^2 I)

and

p(Y | X, \sigma^2, \alpha_w) = \prod_{j=1}^{D} N(y_{i_j,j} | 0, \alpha_w^{-1} X_{i_j,:} X_{i_j,:}^\top + \sigma^2 I),   (2)

where j_i indexes the users who rated item i and i_j indexes the items rated by user j.

Which of these likelihoods should we select? Either the likelihood in the PPCA form or the DPPCA form can be used. Whilst there are sensible statistical arguments for selecting the likelihood with fewer parameters (which one this is will depend on whether there are more users or items), the maximum likelihood solutions for the fully observed case are equivalent. We will proceed with the DPPCA form, though, as typically there are more users than items in collaborative filtering, and in this case DPPCA has fewer parameters. However, it should be borne in mind that the ideas that follow apply equally to the PPCA form, but with the roles of users and items reversed (there is a duality between the likelihoods that should have become apparent by now).

When Y is fully observed, a global maximum for (2) with respect to X and \sigma^2 can be found through an eigenvalue problem. When there are missing values, Tipping and Bishop (1999) suggested an expectation-maximization (EM) algorithm. However, when D is very large EM is computationally too expensive, and it makes sense to consider stochastic gradient descent.

2.2. Stochastic Gradient Descent

Rather than maximizing the likelihood through an EM algorithm, we propose presenting the ratings of each user one at a time and computing the gradient of the log likelihood for that user. The parameters X, \alpha_w and \sigma^2 can then be updated by the gradients for that user, and a new user is presented. The objective is to maximize the log likelihood; equivalently, we minimize the negative log likelihood. For the jth user this is given by

E_j(X) = \frac{1}{2} \log |C_j| + \frac{1}{2} y_{i_j,j}^\top C_j^{-1} y_{i_j,j} + \text{const.},

where C_j = \alpha_w^{-1} X_{i_j,:} X_{i_j,:}^\top + \sigma^2 I and N_j is the number of items rated by user j. The gradient with respect to X can be computed as

\frac{dE_j(X)}{dX_{i_j,:}} = -G X_{i_j,:},

with G = C_j^{-1} y_{i_j,j} y_{i_j,j}^\top C_j^{-1} - C_j^{-1}. Gradients with respect to \sigma^2 and \alpha_w can also be found. The computational complexity of the gradient computation is O(N_j^3) or, if q < N_j, the matrix inversion lemma,

C_j^{-1} = \sigma^{-2} I - \sigma^{-4} X_{i_j,:} (\alpha_w I + \sigma^{-2} X_{i_j,:}^\top X_{i_j,:})^{-1} X_{i_j,:}^\top,

can be used to give a complexity of O(q^3) for the inversion.
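As a concrete illustration, here is a minimal numpy sketch of one stochastic gradient step on E_j(X) for the linear (DPPCA) model. It is not the authors' released code: the names are our own, \alpha_w is absorbed into X so that C_j = X_j X_j^\top + \sigma^2 I, and \sigma^2 is held fixed rather than updated.

```python
import numpy as np

def user_gradient_step(X, y, items, sigma2, lr):
    """One SGD step on the negative log likelihood of a single user.

    X      : N x q matrix of latent item representations (updated in place).
    y      : ratings given by this user (length N_j).
    items  : indices of the items this user rated (length N_j).
    sigma2 : noise variance (held fixed in this sketch).
    lr     : learning rate.
    """
    Xj = X[items]                                  # N_j x q latent points of the rated items
    C = Xj @ Xj.T + sigma2 * np.eye(len(items))    # per-user covariance C_j
    Cinv = np.linalg.inv(C)
    a = Cinv @ y                                   # C_j^{-1} y
    G = np.outer(a, a) - Cinv                      # G = C^{-1} y y^T C^{-1} - C^{-1}
    grad = -G @ Xj                                 # dE_j / dX_{i_j,:}
    X[items] = Xj - lr * grad                      # gradient descent update
    return X
```

With momentum added, and the matrix inversion lemma substituted for np.linalg.inv when q < N_j, this corresponds to the O(q^3)-per-user update described above.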
3. Non-Linear PMF via GP-LVMs

We have already highlighted the fact that probabilistic matrix factorization, with the parameters W marginalized, is a Bayesian multi-output regression model in which we optimize with respect to the inputs to the regression. This type of model is equivalent to probabilistic PCA. However, it also belongs to a larger class of models called Gaussian process latent variable models (GP-LVM). Lawrence (2005) showed how the matrix C has an interpretation as a Gaussian process (GP) covariance matrix. The GP associated with the covariance C = \alpha_w^{-1} X X^\top + \sigma^2 I is a linear model. However, by replacing the inner product matrix, X X^\top, by a Mercer kernel the model becomes a non-linear GP model. Maximization of the log likelihood can no longer be done through an eigenvalue problem, but it is straightforward to apply stochastic gradient descent in the manner described above.

The regression model from (1) can be written as a product of univariate Gaussian distributions,

p(Y | W, X, \sigma^2) = \prod_{j=1}^{D} \prod_{i=1}^{N} N(y_{i,j} | f_j(x_{i,:}), \sigma^2),

where the mean of each Gaussian is given by the inner product f_j(x_{i,:}) = w_{j,:} x_{i,:}^\top. Probabilistic PCA can be recovered by marginalizing either W or X. The GP-LVM is recovered by recognizing that we can place the prior distribution directly over the function f(\cdot) through a Gaussian process (Rasmussen & Williams, 2006). A Gaussian process (GP) can be thought of as a probability distribution for generating functions. The GP is specified by a mean and a covariance function. For any given set of observations of the function, f, the joint distribution over those observations is Gaussian. Restricting ourselves to GPs with a zero mean function, they are distributed as

p(f | X) = N(f | 0, K),

where K is the covariance matrix. Its elements, k(x_{i,:}, x_{j,:}), encode the degree of correlation between two samples, f_i and f_j, from f as a function of the inputs associated with those samples, x_{i,:} and x_{j,:}. For a covariance function to be valid, it has to lead to a positive semi-definite matrix K for all valid inputs to the function. In practice that means that valid covariance functions have to be positive definite functions, i.e. the class of valid covariance functions is the same as the class of Mercer kernels (Schölkopf & Smola, 2001). A linear regression model is a GP in which the covariance function is taken to be k(x_{i,:}, x_{j,:}) = x_{i,:} x_{j,:}^\top. A widely used covariance function that gives a prior over non-linear functions is known as the RBF covariance,

k(x_{\ell,:}, x_{i,:}) = \alpha \exp\left(-\frac{\gamma}{2} \|x_{\ell,:} - x_{i,:}\|^2\right).

This covariance can be substituted directly for the linear covariance function in (2), giving the following probabilistic model,

p(Y | X, \sigma^2, \theta) = \prod_{j=1}^{D} N(y_{i_j,j} | 0, K_{i_j,i_j} + \sigma^2 I),

where \theta are the parameters of the covariance function and K_{i_j,i_j} is the covariance evaluated at the latent points of the items rated by user j. Alternative covariance functions can also be considered, but in this paper we focus only on the RBF and linear covariance functions.
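To make the swap from the linear to the non-linear model explicit, the following sketch (our own illustrative code; the parameter names alpha and gamma are assumptions, not taken from the paper) builds the per-user covariance from an RBF kernel. The per-user negative log likelihood and its optimization by SGD then proceed as in the linear case, with gradients now taken through the kernel.

```python
import numpy as np

def rbf_covariance(Xj, alpha, gamma, sigma2):
    """Covariance K_{i_j,i_j} + sigma^2 I over the items rated by one user.

    Xj: N_j x q latent points of the rated items.
    """
    sq_dists = np.sum((Xj[:, None, :] - Xj[None, :, :]) ** 2, axis=-1)
    K = alpha * np.exp(-0.5 * gamma * sq_dists)    # RBF covariance
    return K + sigma2 * np.eye(Xj.shape[0])        # add noise variance

def user_neg_log_likelihood(Xj, y, alpha, gamma, sigma2):
    """E_j for the non-linear (RBF) model, up to an additive constant."""
    C = rbf_covariance(Xj, alpha, gamma, sigma2)
    sign, logdet = np.linalg.slogdet(C)
    return 0.5 * logdet + 0.5 * y @ np.linalg.solve(C, y)
```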
3.1. Prediction of User Rating: Relations to Neighborhood Approaches

The parameters that need to be stored after model training are the latent variables, X, and the parameters of the covariance function. When a prediction is required for a new item, these parameters are combined using the standard formula for prediction by Gaussian processes (Rasmussen & Williams, 2006). The mean of user j's prediction for item \ell is given by

\mu_{\ell,j} = s^\top y_{i_j,j},   (3)

where we have defined

s = (K_{i_j,i_j} + \sigma^2 I)^{-1} k_{i_j,\ell}.

In other words, the mean prediction for an unseen item for the user is given by a weighted sum of the user's rated items. This has a very similar feel to neighborhood models for collaborative filtering, where the neighborhood is based on item similarity. In that approach to collaborative filtering, similarity scores are computed between the test item and the user's rated items. These are often based on correlation measures. Prediction of the test item is made by summing the user's rated items weighted by their normalized similarities to the test item. For the GP-LVM approach, the neighborhood is being constructed in the latent space¹, X. Consider the RBF covariance function, k_{\ell,i} = \alpha \exp(-\frac{\gamma}{2} \|x_{\ell,:} - x_{i,:}\|^2). As a user's rated items become better separated in the latent space, K_{i_j,i_j} becomes a constant diagonal matrix and s becomes a scaled version of k_{i_j,\ell}. Now note that the elements of k_{i_j,\ell} are just scaled similarities between the test rating and the user's rated items. For this case our prediction is functionally identical to the neighborhood approach to collaborative filtering.

Note that the model can rapidly deal with new users. A prediction for a new user can be made using (3) without any change to the parameters of the system. An additional advantage of the Gaussian process is that it provides a full posterior distribution over the predictions. This allows us to compute a variance for the prediction,

\sigma^2_{\ell,j} = k_{\ell,\ell} + \sigma^2 - k_{i_j,\ell}^\top (K_{i_j,i_j} + \sigma^2 I)^{-1} k_{i_j,\ell}.

The variance could be helpful in determining whether or not to propose the item to the user: it expresses the confidence with which the model is estimating the user's preference. As we will see in the experiments, there is a strong correlation between the variance and the number of items the user has rated.

¹ By reversing the roles of W and X we could also develop a user-orientated neighborhood model.
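The prediction step can be written down compactly. The sketch below is illustrative only, using names of our own choosing and recomputing the RBF covariance inline; it returns the predictive mean of (3) and the corresponding predictive variance for a candidate item.

```python
import numpy as np

def predict_rating(X, items, y, x_new, alpha, gamma, sigma2):
    """GP predictive mean and variance of user j's rating for a new item.

    X     : N x q latent item representations.
    items : indices of the items user j has rated.
    y     : user j's ratings for those items.
    x_new : latent representation of the candidate item (length q).
    """
    Xj = X[items]
    d2 = np.sum((Xj - x_new) ** 2, axis=1)
    k_star = alpha * np.exp(-0.5 * gamma * d2)              # k_{i_j, ell}
    K = alpha * np.exp(-0.5 * gamma *
        np.sum((Xj[:, None, :] - Xj[None, :, :]) ** 2, axis=-1))
    C = K + sigma2 * np.eye(len(items))                     # K_{i_j,i_j} + sigma^2 I
    s = np.linalg.solve(C, k_star)                          # weights over rated items
    mean = s @ y                                            # equation (3)
    var = alpha + sigma2 - k_star @ s                       # predictive variance
    return mean, var
```

The weights s play exactly the role of the normalized similarities used in a neighborhood method, which is the correspondence discussed above.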
4. Experimental Evaluation

We consider three data sets to assess the quality of the GP-LVM for collaborative filtering. The data sets are widely used benchmarks for collaborative filtering, each containing a set of item ratings from different users. These data sets are EachMovie, 1M MovieLens, and the recently released 10M MovieLens data². We don't show results on Netflix. State-of-the-art performance on Netflix typically involves model combinations and bespoke algorithm modifications to account for the particular peculiarities of the data (Salakhutdinov & Mnih, 2008b; Salakhutdinov & Mnih, 2008a; Bell & Koren, 2007; Bell et al., 2008). This makes it an interesting challenge, but perhaps a less good evaluator of generic algorithms for collaborative filtering. Two of the data sets we consider (EachMovie and 1M MovieLens) have been widely used benchmarks, allowing us to compare our approach to a range of other approaches. There are currently no other published results on the third data set (the recently released 10M MovieLens), but we expect it will also become a well used benchmark, and our result serves to show that our algorithm is easily extended to larger systems. For the 1M MovieLens and EachMovie data sets, we mimic the setup used by Marlin (2004a), allowing us to compare directly to the best published results, which include: the user rating profile model (URP, Marlin, 2004b), Attitude (Marlin, 2004a), maximum margin matrix factorization (MMMF, Rennie & Srebro, 2005), the approach of Park and Pennock (2007) that combines collaborative filtering with item proximities, and ensembles of MMMF (DeCoste, 2006). To our knowledge the results of DeCoste (2006) were the best reported to date on the EachMovie and 1M MovieLens databases.

We explore the importance of the dimensionality of the latent space, q, how performance varies with the amount of available training data, and the effect of the choice of covariance function on the results. For training the models we always use stochastic gradient descent (SGD). Parameters of the stochastic gradient descent algorithm were "tuned" on the 100k MovieLens data³ (results not presented) and were fixed for the three presented data sets. For all our experiments we used 10 epochs of SGD, with a learning rate of 1×10⁻⁴ and a momentum of 0.9. Latent points were initialized by drawing from a zero mean spherical Gaussian with standard deviation 1×10⁻³. All covariance parameters were initialized as 1, except the noise variance, which was fixed to 5, and the bias variance, which was fixed to 0.11. Our code is available from http://www.cs.man.ac.uk/~neill/collab/.

4.1. Marlin's setup

In his thesis, Marlin (2004a) defines two types of generalization, "weak" and "strong". Weak generalization is a single step process which involves filling in missing data in the rating matrix. Strong generalization is a two-stage process, where the models are trained on one set of users and the test predictions are made on a disjoint set of users. The learner is given sample ratings of those users, but may not use those ratings until after the initial model is constructed. Marlin used the normalized mean absolute error (NMAE) as an error measure, so that random guessing produces a score of 1. NMAE is computed by normalizing the mean absolute error by a factor that depends on the range of the ratings⁴. The factor is 1.6 for the MovieLens data sets (i.e., ratings varying over {1, ..., 5}) and 1.944 for the EachMovie data set (i.e., ratings varying over {1, ..., 6}).

² See http://www.grouplens.org/.
³ The tuning process was very coarse: we always used momentum 0.9 and 10 epochs of SGD. The learning rate was then logarithmically decreased from 0.1 until results seemed to be converging.
⁴ The motivation for this factor is that it is the mean absolute error that would be obtained if the ratings were uniformly distributed and the prediction was uniform random guessing. A much better prediction in this case would be to guess the median. For MovieLens the prediction would be 3 and it would lead to a score of 1.2. A still better "dumb" score could be obtained by predicting the median of the observed data. However, the use of the factors 1.6 and 1.944 has become somewhat standardized, so we follow this practice.
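The normalization factors quoted above can be checked directly: they are the expected absolute error when both the true rating and the guess are drawn independently and uniformly from the rating scale. A quick check (our own snippet, not from the paper):

```python
import itertools

def uniform_random_guessing_mae(low, high):
    """Expected |rating - guess| for independent uniform rating and guess on {low, ..., high}."""
    values = range(low, high + 1)
    pairs = list(itertools.product(values, values))
    return sum(abs(r - g) for r, g in pairs) / len(pairs)

print(uniform_random_guessing_mae(1, 5))   # 1.6    (MovieLens scale)
print(uniform_random_guessing_mae(1, 6))   # 1.944  (EachMovie scale, approximately 1.9444)
```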
EachMovie Data. The EachMovie data set contains 2.6 million ratings of 1,648 movies by 74,424 users. The ratings range over {1, 2, ..., 6}. Following Marlin (2004a), users with fewer than 20 ratings were discarded; this left us with 36,656 users. We selected 30,000 users at random for "weak" generalization, while the remaining 6,656 users were used for strong generalization. We report results averaged over the same 3 partitions originally used by Marlin (2004a) and replicated by Rennie and Srebro (2005), DeCoste (2006), and Park and Pennock (2007), where one movie is withheld to construct the test set.

Table 1 shows the NMAE for the different baselines as well as for our approach. Note that our approach, with either a linear or an RBF kernel, significantly outperforms all the baselines. Importantly, our approach uses latent dimensionalities much smaller than the 100 dimensions used in MMMF and E-MMMF. In particular, for both weak and strong generalization we use 20D latent spaces for the RBF and linear models. Larger dimensional spaces resulted in better performance especially for the strong generalization, since the latent space is trained using more data and is thus less prone to overfitting. Experiments showing the influence of the latent dimensionality are shown below. The weak RMSE⁵ of our approach when using a 20D latent space is 1.1110 ± 0.0028 for the RBF kernel and 1.1118 ± 0.0022 for the linear kernel. The strong RMSE when using a 20D latent space is 1.0981 ± 0.0077 for the RBF kernel and 1.1008 ± 0.0080 for the linear kernel.

Table 1. EachMovie. Our approach outperforms the URP and Attitude algorithms (Marlin, 2004a), MMMF (Rennie & Srebro, 2005), E-MMMF (DeCoste, 2006), and the item-based approach of Park and Pennock (2007).

              Weak NMAE           Strong NMAE
URP           0.4422 ± 0.0008     0.4557 ± 0.0008
Attitude      0.4520 ± 0.016      0.4550 ± 0.0023
MMMF          0.4397 ± 0.0006     0.4341 ± 0.0025
Item          0.4382 ± 0.0009     0.4365 ± 0.0024
E-MMMF        0.4287 ± 0.0023     0.4301 ± 0.0035
Ours Linear   0.4209 ± 0.0017     0.4171 ± 0.0054
Ours RBF      0.4179 ± 0.0018     0.4134 ± 0.0049

⁵ Note that for EachMovie the RMSE is higher than for MovieLens, as the range of the ratings is larger.

1M MovieLens Data. The 1M MovieLens data consists of 1 million ratings from 6,040 users for 3,952 movies, with ratings ranging over {1, 2, ..., 5}. 5,000 users were used for weak generalization and the remaining 1,040 for strong generalization. As before, we report results averaged over the same 3 partitions used by Marlin (2004a), Rennie and Srebro (2005), DeCoste (2006), and Park and Pennock (2007). Table 2 shows the NMAE for the baselines as well as for our approach. Using an RBF covariance function, our approach results in the best performance for both weak and strong generalization. As before, the latent space required by our approach is small. In particular, for weak generalization we use a 10D latent space for the RBF model and 11D for the linear model, and for strong generalization a 14D latent space for the non-linear model and 15D for the linear one. The weak RMSE of our approach is 0.8801 ± 0.0082 for a 12D RBF model and 0.8791 ± 0.0080 for a 14D linear model. The strong RMSE of our approach is 0.8748 ± 0.0268 for a 15D RBF model and 0.8775 ± 0.0239 for an 11D linear model.

Table 2. 1M MovieLens. Our approach outperforms the URP and Attitude algorithms (Marlin, 2004a), MMMF (Rennie & Srebro, 2005), E-MMMF (DeCoste, 2006), and the item-based approach (Park & Pennock, 2007) when using a non-linear latent space.

              Weak NMAE           Strong NMAE
URP           0.4341 ± 0.0023     0.4444 ± 0.0032
Attitude      0.4320 ± 0.0055     0.4375 ± 0.0028
MMMF          0.4156 ± 0.0037     0.4203 ± 0.0138
Item          0.4096 ± 0.0029     0.4113 ± 0.104
E-MMMF        0.4029 ± 0.0027     0.4071 ± 0.0093
Ours linear   0.4052 ± 0.0011     0.4071 ± 0.0081
Ours RBF      0.4026 ± 0.0020     0.3994 ± 0.0145

As for the EachMovie data set, the baseline that previously performed best is the ensemble of MMMF models (DeCoste, 2006), which combines predictions from multiple MMMF models. In particular, predictions from 100 models created using bagging and multiple random weight seeds are combined by plurality voting, voting by averaging, and voting by confidence. Our approach could also benefit from averaging the predictions from multiple models.
We set up a small experiment in which we averaged the predictions of 11 models whose latent space dimensionality ranges from 5 to 15. The weak NMAE was reduced to 0.3987 ± 0.0013.

A further point of interest is that with our approach, unlike most of the baselines, the strong NMAE is often smaller than the weak NMAE. This might look surprising; however, the data set partitioning means that the models learned for strong generalization have seen more data (30,000 users for EachMovie and 5,000 for 1M MovieLens) than those learned for weak generalization. Combined with the fact that all user specific parameters are marginalized in the model, this leads to improved performance.

Figure 1 shows the variance of the GP as a function of the number of movies rated for a 10D RBF model learned on 1M MovieLens Weak. The variance of the GP is a good indicator of the uncertainty in the model: its value decreases as the number of movies rated increases.

[Figure 1. GP variance as a function of the number of movies rated (users binned as 19-37, 37-69, 69-125, 125-252, 252-2313), for a 10D latent space learned on 1M MovieLens Weak. The variance of the GP is a good indicator of the uncertainty in the model; its value decreases with the number of movies rated.]

4.2. Latent dimensionality and training set size

To test the response of our models to increasing data set size, we also conduct experiments in a different setting from Marlin's. We generate different training set sizes by splitting different percentages of the data into training and test sets. Figure 2 depicts NMAE and RMSE errors averaged over 5 random partitions as a function of the dimensionality of the latent space. As expected, our model's predictions are more accurate when more data is used for training. Small latent dimensionalities are preferred for small training set sizes (e.g. 30%) to avoid overfitting. When using more data, higher dimensional latent spaces result in better performance.

[Figure 2. 1M MovieLens: NMAE and RMSE errors as a function of the latent space dimensionality for different percentages of the database used as training, i.e., 30-90%.]

4.3. Using different kernels

Figure 3 shows NMAEs when using different kernels for the MovieLens and EachMovie databases. Note that in general non-linear models (e.g. RBF) outperform linear models, showing the benefit of our approach with respect to other approaches that assume that the mapping is linear (e.g. Rennie & Srebro, 2005; DeCoste, 2006; Salakhutdinov & Mnih, 2008b; Salakhutdinov & Mnih, 2008a).

[Figure 3. 1M MovieLens (1MML) and EachMovie (EaM): NMAE and RMSE errors for different kernels (RBF, linear, and with meta-data), for weak and strong generalization. In general non-linear latent spaces result in better performance.]
4.4. Larger database: 10M MovieLens

We also perform experiments on a larger data set. The 10M MovieLens data set consists of 10 million ratings from 71,567 users for 10,681 movies, with ratings ranging over {1, 2, ..., 5}. We use the ra and rb partitions provided with the database, which split the data into training and test sets such that there are 10 ratings per user in the test set. This database was made available on 9th January 2009, so there are currently no other published results to compare with. Our approach gave an RMSE of 0.8740 ± 0.0278 using a 10 dimensional latent space.

4.5. Adding Movie Meta-Data

The backbone of the GP-LVM is a Gaussian process regression model, in which we are optimizing with respect to the input values to maximize the likelihood of the data. For the case of movies, there is additional data about each movie that we might want to include; for example, with the 1M MovieLens data there is information about the genre of the movie (e.g., comedy and western). This can be encoded in a binary vector that defines the genre for each movie in the database. We can include this information in the kernel matrix. If the meta-data for a particular movie is given in a vector m_{i,:}, then a covariance function can be created from the meta-data,

k_m(m_{i,:}, m_{j,:}) = \alpha_m \exp\left(-\frac{\gamma_m}{2} \|m_{i,:} - m_{j,:}\|^2\right).

This can be combined with the covariance function defined in x-space through a tensor product,

k_{i,j} = k_m(m_{i,:}, m_{j,:}) \, k_x(x_{i,:}, x_{j,:}).

We now optimize the parameters of the meta-data covariance function along with X and the parameters of the latent data covariance function. The dashed green line in Figure 3 shows the NMAE and RMSE when using meta-data. No significant improvement in performance is seen with the meta-data, but conversely it hasn't significantly harmed the performance. In practice the use of meta-data may be important when, for example, a new movie is released and ratings data is not yet available for it.
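As an illustration of the product construction, the following sketch (our own code, assuming the genre information is available as binary vectors) evaluates the combined covariance between two movies; it simply multiplies the meta-data RBF by the latent-space RBF.

```python
import numpy as np

def combined_covariance(x_i, x_j, m_i, m_j, alpha, gamma, alpha_m, gamma_m):
    """Product of the latent-space RBF and the meta-data RBF for two movies.

    x_i, x_j : latent representations of the two movies (length q).
    m_i, m_j : binary genre vectors for the two movies.
    """
    k_x = alpha * np.exp(-0.5 * gamma * np.sum((x_i - x_j) ** 2))
    k_m = alpha_m * np.exp(-0.5 * gamma_m * np.sum((m_i - m_j) ** 2))
    return k_x * k_m
```

Because the product of two valid covariance functions is itself a valid covariance function, the combined k_{i,j} can be dropped into the per-user covariance matrices without any other change to training or prediction.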
5. Conclusions

We have shown how probabilistic matrix factorization is equivalent to Bayesian probabilistic PCA. This inspired a new algorithm which allows for non-linear extensions in the manner of the Gaussian process latent variable model. The predictions from the model have a very similar form to neighborhood based methods for collaborative filtering. We tested the resulting algorithm on two popular movie databases; our approach outperformed all existing published results for these data. We also briefly considered extensions to the model that incorporate side information (meta-data) about movie genre. We didn't see a significant difference in performance given the meta-data, but speculate that such data may be useful for new movies with few ratings. As well as pushing forward the boundary of state-of-the-art performance for the data sets we've studied, our model provides probabilistic predictions: an item/user dependent variance can be computed, giving the confidence with which the model is prepared to make its predictions.

References

Bell, R., & Koren, Y. (2007). Lessons from the Netflix prize challenge. SIGKDD Explorations, 9, 75-79.

Bell, R. M., Koren, Y., & Volinsky, C. (2008). The BellKor 2008 solution to the Netflix prize. Available from http://www.research.att.com/~volinsky/netflix/.

Bishop, C. M. (1999a). Bayesian PCA. Advances in Neural Information Processing Systems (pp. 382-388). Cambridge, MA: MIT Press.

Bishop, C. M. (1999b). Variational principal components. Proceedings of the Ninth International Conference on Artificial Neural Networks, ICANN'99 (pp. 509-514).

DeCoste, D. (2006). Collaborative prediction using ensembles of maximum margin matrix factorizations. Proceedings of the International Conference in Machine Learning (pp. 249-256). Omnipress.

Koren, Y. (2008). Factorization meets the neighborhood: a multifaceted collaborative filtering model. KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 426-434).

Lawrence, N. D. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6, 1783-1816.

Marlin, B. (2004a). Collaborative filtering: A machine learning perspective. Master's thesis, University of Toronto.

Marlin, B. (2004b). Modeling user rating profiles for collaborative filtering. Advances in Neural Information Processing Systems (pp. 627-634). Cambridge, MA: MIT Press.

Minka, T. P. (2001). Automatic choice of dimensionality for PCA. Advances in Neural Information Processing Systems (pp. 598-604). Cambridge, MA: MIT Press.

Park, S.-T., & Pennock, D. M. (2007). Applying collaborative filtering techniques to movie search for better ranking and browsing. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Cambridge, MA: MIT Press.

Rennie, J. D. M., & Srebro, N. (2005). Fast maximum margin matrix factorization for collaborative prediction. Proceedings of the International Conference in Machine Learning (pp. 713-719).

Salakhutdinov, R., & Mnih, A. (2008a). Bayesian probabilistic matrix factorization using MCMC. Proceedings of the International Conference in Machine Learning (pp. 872-879). Omnipress.

Salakhutdinov, R., & Mnih, A. (2008b). Probabilistic matrix factorization. Advances in Neural Information Processing Systems (pp. 1257-1264). Cambridge, MA: MIT Press.

Schölkopf, B., & Smola, A. J. (2001). Learning with kernels. Cambridge, MA: MIT Press.

Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61, 611-622.