Transfer Learning for Collaborative Filtering via a Rating-Matrix Generative Model

Bin Li, School of Computer Science, Fudan University, Shanghai 200433, China. libin@fudan.edu.cn
Qiang Yang, Dept. of Computer Science & Engineering, Hong Kong University of Science & Technology, Hong Kong, China. qyang@cse.ust.hk
Xiangyang Xue, School of Computer Science, Fudan University, Shanghai 200433, China. xyxue@fudan.edu.cn

Abstract

Cross-domain collaborative filtering solves the sparsity problem by transferring rating knowledge across multiple domains. In this paper, we propose a rating-matrix generative model (RMGM) for effective cross-domain collaborative filtering. We first show that the relatedness across multiple rating matrices can be established by finding a shared implicit cluster-level rating matrix, which is next extended to a cluster-level rating model. Consequently, a rating matrix of any related task can be viewed as drawing a set of users and items from a user-item joint mixture model as well as drawing the corresponding ratings from the cluster-level rating model. The combination of these two models gives the RMGM, which can be used to fill in the missing ratings for both existing and new users. A major advantage of RMGM is that it can share knowledge by pooling the rating data from multiple tasks even when the users and items of these tasks do not overlap. We evaluate RMGM empirically on three real-world collaborative filtering data sets to show that it can outperform the individual models trained separately.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

1. Introduction

Collaborative filtering (CF) in recommender systems aims at predicting an active user's ratings on a set of items based on a collection of like-minded users' rating records on the same set of items. Various CF methods have been proposed in the last decade. For example, memory-based methods (Resnick et al., 1994; Sarwar et al., 2001) find K-nearest neighbors based on some similarity measure. Model-based methods (Hofmann & Puzicha, 1999; Pennock et al., 2000; Si & Jin, 2003) learn preference/rating models for similar users (and items). Matrix factorization methods (Srebro & Jaakkola, 2003) find a low-rank approximation for the rating matrix.

Most of these methods rely on the available ratings in a given rating matrix, so their performance largely depends on the density of that matrix. However, in real-world recommender systems, users can rate only a very limited number of items, and the rating matrix is often extremely sparse. As a result, the available rating data for K-NN search, probabilistic modeling, or matrix factorization are radically insufficient. The sparsity problem has become a major bottleneck for most CF methods.

To alleviate the sparsity problem in collaborative filtering, one promising approach is to pool the rating data from multiple rating matrices in related domains for knowledge transfer and sharing. In the real world, many web sites that recommend similar items, e.g., movies, books, and music, are closely related. On one hand, since many of these items are literary and entertainment works, they should share some common properties (e.g., genre and style). On the other hand, since these web services are geared toward the general population, the users of these services, and the items of interest to them, should share some properties as well. However, much of the shared knowledge across multiple related domains is well hidden, and few studies have been done to uncover this knowledge.
In this paper, we solve the problem of learning a rating-matrix generative model from a set of rating matrices in multiple related recommender systems (domains) for collaborative filtering. Our aim is to alleviate the sparsity problem in individual rating matrices by discovering what is common among them. We first show that the relatedness across multiple rating matrices can be established by sharing an implicit cluster-level rating matrix. Then, we extend the shared cluster-level rating matrix to a more general cluster-level rating model, which defines a rating function in terms of the latent user- and item-cluster variables. Consequently, a rating matrix of any related task can be viewed as drawing a set of users and items from a user-item joint mixture model as well as drawing the corresponding ratings from the cluster-level rating model. The combination of these two models gives the rating-matrix generative model (RMGM). We also propose an algorithm for training the RMGM on the pooled rating data from multiple related rating matrices, as well as an algorithm for predicting the missing ratings for new users in different tasks. Experimental comparisons are carried out on three real-world CF data sets. The results show that our proposed RMGM, learned from multiple CF tasks, can outperform the individual models trained separately.

The remainder of the paper is organized as follows. In Section 2, we introduce the problem setting for cross-domain collaborative filtering and the notation used in this paper. In Section 3, we describe how to establish the relatedness across multiple rating matrices via a shared cluster-level rating matrix. The RMGM is presented in Section 4, together with the training and prediction algorithms. Related work is introduced in Section 5. We experimentally validate the effectiveness of the RMGM for cross-domain collaborative filtering in Section 6 and conclude the paper in Section 7.
2. Problem Setting

Suppose that we are given Z rating matrices in related domains for collaborative filtering. In the z-th rating matrix, a set of users, U_z = {u_1^(z), ..., u_{n_z}^(z)} ⊆ U, make ratings on a set of items, V_z = {v_1^(z), ..., v_{m_z}^(z)} ⊆ V, where n_z and m_z denote the numbers of rows (users) and columns (items), respectively. The random variables u and v are assumed to be independent of each other. To consider the more difficult case, we assume that neither the user sets nor the item sets in the given rating matrices have intersections, i.e., ∩_z U_z = ∅ and ∩_z V_z = ∅ (in fact, there may exist intersections, but they are unobservable). The rating data in the z-th rating matrix is a set of triplets D_z = {(u_1^(z), v_1^(z), r_1^(z)), ..., (u_{s_z}^(z), v_{s_z}^(z), r_{s_z}^(z))}, where s_z is the number of available ratings in the z-th rating matrix. The ratings in {D_1, ..., D_Z} should be on the same rating scale R (e.g., 1-5).

For model-based CF methods, a preference/rating model, e.g., the aspect model (Hofmann & Puzicha, 1999), can be trained on D_z for the z-th task. In our cross-domain collaborative filtering setting, we wish to train a rating-matrix generative model (RMGM) for all the given related tasks on the pooled rating data, namely, ∪_z D_z. Then, the z-th rating matrix can be viewed as drawing a set of users, U_z, and a set of items, V_z, from the learned RMGM. The missing values in the z-th rating matrix can be generated by the RMGM.

3. Cluster-Level Rating Matrix as Knowledge Sharing

To allow knowledge sharing across multiple rating matrices, we first investigate how to establish the relatedness among the given tasks. A difficulty is that no explicit correspondence among the user sets or the item sets in the given rating matrices can be exploited. However, some collaborative filtering tasks are somewhat related in certain aspects. Take movie-rating and book-rating web sites for example. On one hand, movies and books have a correspondence in genre. On the other hand, although the user sets differ from one another, they are subsets sampled from the same population (this assumption holds only for popular web sites) and should reflect similar social aspects (Coyle & Smyth, 2008).

The above observation suggests that, although we cannot find an explicit correspondence among individual users or items, we can establish a cluster-level rating-pattern representation as a "bridge" to connect all the related rating matrices. Figure 1 illustrates how the implicit relatedness among three artificially generated rating matrices is established via a cluster-level rating matrix. By permuting the rows and columns (which is equivalent to co-clustering) in each rating matrix, we can obtain three block rating matrices. Each block comprises a set of ratings provided by a user group on an item group. We can further reduce the block matrices to cluster-level rating matrices, in which each row corresponds to a user cluster and each column to an item cluster. The entries in the cluster-level rating matrices are the average ratings of the corresponding user-item co-clusters. The resulting cluster-level rating matrices reveal that the three rating matrices implicitly share a common 4 × 4 cluster-level rating-pattern representation, namely

        I  II III IV
    A   3   2   3   1
    B   1   3   2   1
    C   2   3   1   2
    D   1   1   2   3

[Figure 1. Sharing cluster-level user-item rating patterns among three toy rating matrices in different domains. The missing values are denoted by '?'. After permuting the rows (users) and columns (items) in each rating matrix, it is revealed that the three rating matrices implicitly share a common 4 × 4 cluster-level rating matrix. Toy matrices omitted here.]

This toy example shows an ideal case in which the users and items in the same cluster behave exactly the same. In many real-world cases, since users may have multiple personalities and items may have multiple attributes, a user or an item can simultaneously belong to multiple clusters with different memberships. Thus, we need to introduce softness into the clustering models. Suppose there are K user clusters, {c_U^(1), ..., c_U^(K)}, and L item clusters, {c_V^(1), ..., c_V^(L)}, in the shared cluster-level rating patterns. The membership of a user-item pair (u, v) in a user-item co-cluster (c_U^(k), c_V^(l)) is the joint posterior membership probability P(c_U^(k), c_V^(l) | u, v). Furthermore, a user-item co-cluster can also have multiple ratings with different probabilities P(r | c_U^(k), c_V^(l)). Then, we can define the rating function f_R(u, v) for a user u on an item v in terms of the two latent cluster variables c_U and c_V:

    f_R(u, v) = Σ_r r P(r | u, v)
              = Σ_r r Σ_{k,l} P(r | c_U^(k), c_V^(l)) P(c_U^(k), c_V^(l) | u, v)
              = Σ_r r Σ_{k,l} P(r | c_U^(k), c_V^(l)) P(c_U^(k) | u) P(c_V^(l) | v),   (1)

where (1) is obtained based on the assumption that the random variables u and v are independent.
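To make Eq. (1) concrete, here is a minimal Python sketch of the soft rating function. The toy probabilities and all variable names (p_k_given_u, p_r_given_kl, etc.) are illustrative assumptions, not values from the paper:

```python
# Expected rating f_R(u, v) = sum_r r * sum_{k,l} P(r|k,l) P(k|u) P(l|v), Eq. (1).
# Toy dimensions: K = 2 user clusters, L = 2 item clusters, rating scale 1..3.
ratings = [1, 2, 3]

p_k_given_u = [0.7, 0.3]   # soft user-cluster membership P(c_U^(k) | u)
p_l_given_v = [0.4, 0.6]   # soft item-cluster membership P(c_V^(l) | v)

# P(r | c_U^(k), c_V^(l)): a rating distribution per user-item co-cluster.
p_r_given_kl = {
    (0, 0): [0.1, 0.2, 0.7],
    (0, 1): [0.2, 0.6, 0.2],
    (1, 0): [0.7, 0.2, 0.1],
    (1, 1): [0.3, 0.4, 0.3],
}

def f_R(p_ku, p_lv, p_r_kl, rating_values):
    """Expected rating under the cluster-level rating model (Eq. 1)."""
    expected = 0.0
    for ri, r in enumerate(rating_values):
        for k in range(len(p_ku)):
            for l in range(len(p_lv)):
                expected += r * p_r_kl[(k, l)][ri] * p_ku[k] * p_lv[l]
    return expected

print(round(f_R(p_k_given_u, p_l_given_v, p_r_given_kl, ratings), 3))
```

The expected rating is a convex combination of the per-co-cluster mean ratings, weighted by the soft cluster memberships of u and v.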
We can further rewrite (1) in matrix form:

    f_R(u, v) = p_u^T B p_v,   ‖p_u‖_1 = 1,  ‖p_v‖_1 = 1,   (2)

where p_u ∈ R^K and p_v ∈ R^L are the user- and item-cluster membership vectors ([p_u]_k = P(c_U^(k) | u) and [p_v]_l = P(c_V^(l) | v)), and B is a K × L relaxed cluster-level rating matrix in which an entry can have multiple ratings with different probabilities:

    B_kl = Σ_r r P(r | c_U^(k), c_V^(l)).   (3)

Eq. (2) implies that the relaxed cluster-level rating matrix B is a cluster-level rating model. In the next section, we focus on learning the user-item joint mixture model as well as the shared cluster-level rating model on the pooled rating data from multiple related tasks.

4. Rating-Matrix Generative Model

[Figure 2. Each rating matrix can be viewed as drawing a set of users (horizontal straight lines) and items (vertical straight lines) from the same user-item joint mixture model (the joint probability of a user-item pair is indicated by gray scales), as well as drawing the corresponding ratings (the crossing points of the horizontal and vertical lines) from a shared cluster-level rating model (the figures denote the ratings most likely to be obtained in those co-clusters). Toy matrices omitted here.]

In order to extend the shared cluster-level rating matrix to a more general cluster-level rating model, we should first define a user-item bivariate probability histogram over U × V. Let P_U(u) and P_V(v) denote the marginal distributions for users and items, respectively. The user-item bivariate probability histogram is a |U| × |V| matrix, H, which is defined as the user-item joint distribution

    H_uv = P(u, v) = P_U(u) P_V(v).   (4)

Thus, the user-item pairs for all the given tasks can be drawn from H:

    (u_i^(z), v_i^(z)) ~ Pr(H),   (5)

for z = 1, ..., Z; i = 1, ..., s_z. Based on the assumption that there are K clusters in U and L clusters in V, we can model the user and item marginal distributions as mixture models, in which each component corresponds to a latent user/item cluster:

    P_U(u) = Σ_k P(c_U^(k)) P(u | c_U^(k)),   (6)
    P_V(v) = Σ_l P(c_V^(l)) P(v | c_V^(l)),   (7)

where P(c_U^(k)) denotes the prior for the user cluster c_U^(k) and P(u | c_U^(k)) the conditional probability of a user u given the user cluster c_U^(k). The user-item bivariate probability histogram (4) can then be rewritten as

    H_uv = Σ_{k,l} P(c_U^(k)) P(c_V^(l)) P(u | c_U^(k)) P(v | c_V^(l)).   (8)

Then, the users and items can be drawn from the user and item mixture models in terms of the two latent cluster variables:

    (u_i^(z), v_i^(z)) ~ Σ_{k,l} P(c_U^(k)) P(c_V^(l)) P(u | c_U^(k)) P(v | c_V^(l)).   (9)

Eq. (9) defines the user-item joint mixture model. Furthermore, the ratings can also be drawn from the conditional distributions given the latent cluster variables:

    r_i^(z) ~ P(r | c_U^(k), c_V^(l)).   (10)

Eq. (10) defines the cluster-level rating model. Combining (9) and (10), we obtain the rating-matrix generative model (RMGM), which can generate rating matrices.

Figure 2 illustrates the rating-matrix generating process on the three toy rating matrices. The 4 × 4 cluster-level rating matrix from Figure 1 is extended to a cluster-level rating model. Each rating matrix can thus be viewed as drawing a set of users U_z and items V_z from the user-item joint mixture model, as well as drawing the corresponding ratings for (U_z, V_z) from the cluster-level rating model. Generally speaking, each rating matrix can be viewed as drawing D_z from the RMGM.

The formulation of RMGM is similar to the flexible mixture model (FMM) (Si & Jin, 2003). The major difference is that RMGM can generate rating matrices for different CF tasks (recall that ∩_z U_z = ∅ and ∩_z V_z = ∅, and that the sizes of the rating matrices also differ from one another). RMGM can be viewed as extending FMM to a multi-task version such that the user- and item-cluster variables are shared by, and learned from, multiple tasks. Furthermore, since the RMGM is trained on the pooled rating data from multiple tasks, the training and prediction algorithms for RMGM also differ from those for FMM.

4.1. Training the RMGM

In this section, we introduce how to train an RMGM on the pooled rating data ∪_z D_z. We need to learn five sets of model parameters in (9) and (10), i.e., P(c_U^(k)), P(c_V^(l)), P(u | c_U^(k)), P(v | c_V^(l)), and P(r | c_U^(k), c_V^(l)), for k = 1, ..., K; l = 1, ..., L; u ∈ ∪_z U_z; v ∈ ∪_z V_z; and r ∈ R. We adopt the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) for RMGM training.
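Before turning to the EM updates, the generative process of Eqs. (9)-(10) can be sketched in Python as follows; all parameter values and names here are toy assumptions for illustration only:

```python
import random

random.seed(0)

# Toy RMGM parameters (illustrative, not learned from data).
K, L = 2, 2
p_cu = [0.5, 0.5]                    # P(c_U^(k)): user-cluster priors
p_cv = [0.5, 0.5]                    # P(c_V^(l)): item-cluster priors
users = ["u1", "u2", "u3"]
items = ["v1", "v2"]
p_u_given_cu = [[0.6, 0.3, 0.1],     # P(u | c_U^(k)), one row per user cluster
                [0.1, 0.2, 0.7]]
p_v_given_cv = [[0.8, 0.2],          # P(v | c_V^(l)), one row per item cluster
                [0.3, 0.7]]
ratings = [1, 2, 3, 4, 5]
# P(r | c_U^(k), c_V^(l)): here a uniform toy distribution per co-cluster.
p_r_given_kl = {(k, l): [0.2] * 5 for k in range(K) for l in range(L)}

def draw_triplet():
    """Draw one (user, item, rating): clusters and entities first (Eq. 9),
    then a rating conditioned on the co-cluster (Eq. 10)."""
    k = random.choices(range(K), weights=p_cu)[0]
    l = random.choices(range(L), weights=p_cv)[0]
    u = random.choices(users, weights=p_u_given_cu[k])[0]
    v = random.choices(items, weights=p_v_given_cv[l])[0]
    r = random.choices(ratings, weights=p_r_given_kl[(k, l)])[0]
    return u, v, r

sample = [draw_triplet() for _ in range(5)]
print(sample)
```

Each task's D_z corresponds to a batch of such draws; only the drawn user/item sets differ across tasks, while the cluster-level parameters are shared.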
In the E-step, the joint posterior probability of (c_U^(k), c_V^(l)) given (u_i^(z), v_i^(z), r_i^(z)) can be computed using the five sets of model parameters:

    P(c_U^(k), c_V^(l) | u_i^(z), v_i^(z), r_i^(z))
        = P(u_i^(z), c_U^(k)) P(v_i^(z), c_V^(l)) P(r_i^(z) | c_U^(k), c_V^(l))
          / Σ_{p,q} P(u_i^(z), c_U^(p)) P(v_i^(z), c_V^(q)) P(r_i^(z) | c_U^(p), c_V^(q)),   (11)

where P(u_i^(z), c_U^(k)) = P(c_U^(k)) P(u_i^(z) | c_U^(k)) and P(v_i^(z), c_V^(l)) = P(c_V^(l)) P(v_i^(z) | c_V^(l)).

In the M-step, the five sets of model parameters for the Z given tasks are updated as follows (writing P(k, l | j^(z)) as a shorthand for P(c_U^(k), c_V^(l) | u_j^(z), v_j^(z), r_j^(z)) for simplicity):

    P(c_U^(k)) = Σ_z Σ_{j=1}^{s_z} Σ_l P(k, l | j^(z)) / Σ_z s_z,   (12)

    P(c_V^(l)) = Σ_z Σ_{j=1}^{s_z} Σ_k P(k, l | j^(z)) / Σ_z s_z,   (13)

    P(u_i^(z) | c_U^(k)) = Σ_l Σ_{j: u_j^(z) = u_i^(z)} P(k, l | j^(z)) / (Σ_z s_z · P(c_U^(k))),   (14)

    P(v_i^(z) | c_V^(l)) = Σ_k Σ_{j: v_j^(z) = v_i^(z)} P(k, l | j^(z)) / (Σ_z s_z · P(c_V^(l))),   (15)

    P(r | c_U^(k), c_V^(l)) = Σ_z Σ_{j: r_j^(z) = r} P(k, l | j^(z)) / Σ_z Σ_{j=1}^{s_z} P(k, l | j^(z)).   (16)

In Eqs. (12)-(16), all the parameters in terms of the two latent cluster variables are computed using the pooled rating data ∪_z D_z. By alternating the E-step and the M-step, an RMGM fitted to a set of related CF tasks can be obtained. In particular, the user-item joint mixture model defined in (9) and the shared cluster-level rating model defined in (10) can be learned. A rating triplet (u_i^(z), v_i^(z), r_i^(z)) from any task can thus be viewed as being drawn from the RMGM.

4.2. RMGM-Based Prediction

After training the RMGM, according to (1), the missing values in the Z given rating matrices can be generated by

    f_R(u_i^(z), v_i^(z)) = Σ_r r Σ_{k,l} P(r | c_U^(k), c_V^(l)) P(c_U^(k) | u_i^(z)) P(c_V^(l) | v_i^(z)),   (17)

where P(c_U^(k) | u_i^(z)) and P(c_V^(l) | v_i^(z)) can be computed from the learned parameters via Bayes' rule.

To predict the ratings on V_z for a new user u^(z) in the z-th task, we can solve a quadratic optimization problem to compute the user-cluster membership p_{u^(z)} ∈ R^K for u^(z), based on the given ratings r_{u^(z)} ∈ {R, 0}^{m_z} (the unobserved ratings are set to 0):

    min_{p_{u^(z)}}  ‖ [B P_{V_z}]^T p_{u^(z)} − r_{u^(z)} ‖²_{W_{u^(z)}}   s.t.  ‖ p_{u^(z)} ‖_1 = 1,   (18)

where P_{V_z} is an L × m_z item-cluster membership matrix with [P_{V_z}]_{li} = P(c_V^(l) | v_i^(z)), and W_{u^(z)} is an m_z × m_z diagonal matrix with [W_{u^(z)}]_{ii} = 1 if [r_{u^(z)}]_i is given and [W_{u^(z)}]_{ii} = 0 otherwise. Here ‖x‖_W denotes the weighted l2-norm √(x^T W x). The quadratic optimization problem (18) is very simple and can be solved by any quadratic solver. After obtaining the optimal user-cluster membership p̂_{u^(z)} for u^(z), the rating of u^(z) on v_i^(z) can be predicted by

    f_R(u^(z), v_i^(z)) = p̂_{u^(z)}^T B p_{v_i^(z)},   (19)

where p_{v_i^(z)} is the i-th column of P_{V_z}. Similarly, based on the learned parameters, we can also predict the ratings of all the existing users in the z-th task on a new item. Due to space limitations, we skip the details.

4.3. Implementation Details

Initialization: Since the optimization problem for RMGM training is non-convex, the initialization of the five sets of model parameters is crucial for finding a better local maximum. We first select the densest rating matrix among the given tasks, and simultaneously cluster the rows (users) and columns (items) of that matrix using orthogonal nonnegative matrix tri-factorization (Ding et al., 2006) (other co-clustering methods are also applicable). Based on the co-clustering results, we can coarsely estimate P(c_U^(k)), P(c_V^(l)), and P(r | c_U^(k), c_V^(l)). We use random values to initialize P(u_i^(z) | c_U^(k)) and P(v_i^(z) | c_V^(l)). Note that the five sets of initialized parameters should be respectively normalized: Σ_k P(c_U^(k)) = 1, Σ_l P(c_V^(l)) = 1, Σ_r P(r | c_U^(k), c_V^(l)) = 1, Σ_z Σ_i P(u_i^(z) | c_U^(k)) = 1, and Σ_z Σ_i P(v_i^(z) | c_V^(l)) = 1.
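As a concrete toy illustration of the new-user fold-in of Eqs. (18)-(19) from Section 4.2: the sketch below solves the weighted least-squares problem on the observed entries via the normal equations, then renormalizes the membership to sum to one, a crude stand-in for the simplex constraint in (18) (nonnegativity is ignored). B and P_V here are made-up values, not learned parameters:

```python
# Predict ratings for a new user: estimate the user-cluster membership p from
# the observed ratings, then predict via f_R = p^T B p_v  (Eqs. 18-19).
K, L, m = 2, 2, 4                      # user clusters, item clusters, items

B = [[4.0, 2.0], [1.0, 5.0]]           # K x L cluster-level rating model (toy)
P_V = [[0.9, 0.1, 0.5, 0.2],           # L x m item-cluster memberships (toy)
       [0.1, 0.9, 0.5, 0.8]]

# A = B @ P_V (K x m): the rating each item would get under each pure user cluster.
A = [[sum(B[k][l] * P_V[l][i] for l in range(L)) for i in range(m)]
     for k in range(K)]

observed = {0: 4.0, 1: 2.0}            # item index -> given rating (others missing)

# Weighted normal equations (A W A^T) p = A W r, restricted to observed items.
G = [[sum(A[k][i] * A[q][i] for i in observed) for q in range(K)] for k in range(K)]
b = [sum(A[k][i] * r for i, r in observed.items()) for k in range(K)]

# Solve the 2x2 system by Cramer's rule (sufficient for this toy K).
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
p = [(b[0] * G[1][1] - b[1] * G[0][1]) / det,
     (G[0][0] * b[1] - G[1][0] * b[0]) / det]

# Renormalize so memberships sum to one (stand-in for the constraint in (18)).
s = sum(p)
p = [pk / s for pk in p]

# Eq. (19): predicted ratings for every item, p^T B p_{v_i} = p^T A[:, i].
preds = [sum(p[k] * A[k][i] for k in range(K)) for i in range(m)]
print([round(x, 2) for x in preds])
```

With two observed ratings and K = 2 the fit is exact on the observed entries; the remaining entries of preds are the fold-in predictions for the unrated items.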
Regularization: In order to avoid unfavorable local maxima, we also impose regularization on the EM algorithm (Hofmann & Puzicha, 1998). We adopt the same strategy used in (Si & Jin, 2003) and skip this part for space limitations.

Model Selection: We need to set the numbers of user and item clusters, K and L, to start with. The cluster-level rating model B should be not only expressive enough to encode and compress various cluster-level user-item rating patterns, but also compact enough to avoid over-fitting. In our empirical tests, we observed that the performance is rather stable when K and L are in the range [20, 50]. Thus, we simply set K = 20 and L = 20 in our experiments.

5. Related Works

The proposed cross-domain collaborative filtering belongs to multi-task learning. The earliest studies on multi-task learning are perhaps (Caruana, 1997; Baxter, 2000), which learn multiple tasks by sharing a hidden layer in a neural network. In our proposed RMGM method, each given rating matrix in the related domains can be generated by drawing a set of users and items, as well as the corresponding ratings, from the RMGM. In other words, each user/item in a given rating matrix is a linear combination of the prototypes of the user/item clusters (see Eq. (19)). The shared cluster-level rating model B is a two-sided feature representation for both users and items. This knowledge-sharing fashion is similar to feature-representation-based multi-task/transfer learning, such as (Jebara, 2004; Argyriou et al., 2007; Raina et al., 2007). These methods intend to find a common feature representation (usually a low-dimensional subspace) that is beneficial for the related tasks. A major difference from our work is that these methods learn a one-sided feature representation (in the row space), while our method learns a two-sided feature representation (in both the row and column spaces). Owing to this two-sided feature representation, RMGM can share knowledge across multiple tabular data sets from different domains.

Since RMGM is a mixture model, our method is also related to various model-based CF methods. The most similar one is the flexible mixture model (FMM) (Si & Jin, 2003), which simultaneously models users and items with mixture models in terms of two latent cluster variables. However, as pointed out in Section 4, our RMGM differs from FMM in both the training and prediction algorithms; moreover, the major difference is that RMGM is able to generate rating matrices in different domains. Several methods also aim at simultaneously clustering users and items for modeling rating patterns, such as the two-sided clustering model (Hofmann & Puzicha, 1999) and the co-clustering-based model (George & Merugu, 2005).

6. Experiments

In this section, we investigate whether CF performance can be improved by applying RMGM to extract the shared knowledge from multiple rating matrices in related domains. We compare our RMGM-based cross-domain collaborative filtering method to two baseline single-task methods. One is the well-known memory-based method based on Pearson correlation coefficients (PCC) (Resnick et al., 1994), for which we search for the 20 nearest neighbors in our experiments. The other is the flexible mixture model (FMM) (Si & Jin, 2003), which can be viewed as a single-task version of RMGM. Since (Si & Jin, 2003) claims that FMM performs better than some well-known state-of-the-art model-based methods, we only compare our method to FMM. We aim to validate that sharing useful information by learning a common rating model for multiple related CF tasks can achieve better performance than learning individual models for these tasks separately.

6.1. Data Sets

The following three real-world CF data sets are used for performance evaluation. Our method learns a shared model (RMGM) on the union of the rating data from these data sets, and the learned model is applicable to any of the tasks.

MovieLens (http://www.grouplens.org/node/73): A movie rating data set comprising 100,000 ratings (scale 1-5) provided by 943 users on 1682 movies. We randomly select 500 users with more than 20 ratings and 1000 movies for the experiments (rating ratio 4.33%).

EachMovie (http://www.cs.cmu.edu/~lebanon/IR-lab.htm): A movie rating data set comprising 2.8 million ratings (scale 1-6) provided by 72,916 users on 1628 movies. We randomly select 500 users with more than 20 ratings and 1000 movies for the experiments (rating ratio 3.28%). For rating-scale consistency with the other tasks, we replace 6 with 5 in the rating matrix so that the rating scale runs from 1 to 5.

Book-Crossing (http://www.informatik.uni-freiburg.de/~cziegler/BX/): A book rating data set comprising more than 1.1 million ratings (scale 1-10) provided by 278,858 users on 271,379 books. We randomly select 500 users and 1000 books with more than 16 ratings for the experiments (rating ratio 2.78%). We also normalize the rating scale to run from 1 to 5.

6.2. Evaluation Protocol

We evaluate the performance of the compared methods under different configurations. The first 100, 200, and 300 users in each of the three rating matrices (each data set forms a 500 × 1000 rating matrix) are used for training, respectively, and the last 200 users are used for testing. For each test user, three different sizes of observed ratings (Given5, Given10, Given15) are provided for training, and the remaining ratings are used for evaluation. In our experiments, the observed rating indices are randomly selected 10 times, so the results reported in Table 1 are averages over 10 splits.

The evaluation metric we adopt is the mean absolute error (MAE): (Σ_{i∈T} |r_i − r̃_i|) / |T|, where T denotes the set of test ratings, r_i is the ground truth, and r̃_i is the predicted rating. A smaller MAE value means better performance.

6.3. Results

The comparison results on the three data sets are reported in Table 1. One can see that our method clearly outperforms the two baseline methods under all testing configurations on all three data sets. FMM performs slightly better than PCC, which implies that model-based methods can benefit from sharing knowledge within user and item clusters. RMGM performs even better than FMM, which implies that clustering users and items across multiple related tasks can aggregate even more useful knowledge than clustering users and items within individual tasks. The overall experimental results validate that the proposed RMGM can indeed gain additional useful knowledge by pooling the rating data from multiple related CF tasks so that these tasks benefit from one another.

Table 1. MAE comparison on MovieLens (ML), EachMovie (EM), and Book-Crossing (BX).

    Train   Method   Given5   Given10   Given15
    ML100   PCC      0.930    0.908     0.895
            FMM      0.908    0.868     0.846
            RMGM     0.868    0.822     0.808
    ML200   PCC      0.934    0.899     0.888
            FMM      0.890    0.863     0.847
            RMGM     0.859    0.821     0.806
    ML300   PCC      0.935    0.896     0.888
            FMM      0.885    0.868     0.846
            RMGM     0.857    0.820     0.804
    EM100   PCC      0.996    0.952     0.936
            FMM      0.969    0.937     0.924
            RMGM     0.942    0.908     0.895
    EM200   PCC      0.983    0.943     0.930
            FMM      0.955    0.933     0.923
            RMGM     0.934    0.905     0.890
    EM300   PCC      0.976    0.937     0.933
            FMM      0.952    0.930     0.924
            RMGM     0.934    0.906     0.890
    BX100   PCC      0.617    0.599     0.600
            FMM      0.619    0.592     0.583
            RMGM     0.612    0.583     0.573
    BX200   PCC      0.621    0.612     0.620
            FMM      0.617    0.602     0.596
            RMGM     0.615    0.591     0.583
    BX300   PCC      0.621    0.619     0.630
            FMM      0.615    0.604     0.596
            RMGM     0.612    0.590     0.581

6.4. Discussion

Although the proposed method clearly outperforms the other compared methods on all three data sets, there still exists some room for further performance improvement. A crucial problem lies in an inherent property of the data sets: the users and items in the rating matrices may not always be groupable into high-quality clusters.
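The MAE metric from Section 6.2 can be computed as follows (the ratings below are toy values, for illustration only):

```python
def mae(truth, pred):
    """Mean absolute error over a set of test ratings (Section 6.2)."""
    assert len(truth) == len(pred) and truth, "non-empty, equal-length lists"
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

truth = [4, 3, 5, 2]          # ground-truth test ratings r_i
pred = [3.5, 3.0, 4.0, 2.5]   # predicted ratings r~_i
print(mae(truth, pred))       # -> 0.5
```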
We observe that the average ratings of the three data sets are far larger than the medians (with the median being 3, the average ratings are 3.64, 3.95, and 4.22 for the three data sets, respectively). This may be caused by the fact that the items with the most ratings are usually the most popular ones. In other words, users are willing to rate their favorite items and to recommend them to others, but have little interest in rating the items they dislike. Given that no clear user and item groups can be discovered in these cases, it is hard to learn a good cluster-level rating model.

7. Conclusion

In this paper, we proposed a novel cross-domain collaborative filtering method based on the rating-matrix generative model (RMGM) for recommender systems. RMGM can share useful knowledge across multiple rating matrices in related domains to alleviate the sparsity problem in individual tasks. The knowledge is shared in the form of a latent cluster-level rating model, which is trained on the pooled rating data from multiple related rating matrices. Each rating matrix can thus be viewed as drawing a set of users and items from the user-item joint mixture model as well as drawing the corresponding ratings from the cluster-level rating model. The experimental results validate that the proposed RMGM can indeed gain additional useful knowledge by pooling the rating data from multiple related tasks so that these tasks benefit from one another.

In our future work, we will 1) investigate how to statistically quantify the "relatedness" between rating matrices in different domains, and 2) consider an asymmetric problem setting where knowledge can be transferred from a dense auxiliary rating matrix in one domain to a sparse target rating matrix in another domain.

Acknowledgments

Bin Li and Qiang Yang are supported by Hong Kong CERG Grant 621307; Bin Li and Xiangyang Xue are supported in part by the Shanghai Leading Academic Discipline Project (No. B114) and the NSF of China (No. 60873178).

References

Argyriou, A., Evgeniou, T., & Pontil, M. (2007). Multi-task feature learning. Advances in Neural Information Processing Systems 19 (pp. 41-48).

Baxter, J. (2000). A model of inductive bias learning. J. of Artificial Intelligence Research, 12, 149-198.

Caruana, R. A. (1997). Multitask learning. Machine Learning, 28, 41-75.

Coyle, M., & Smyth, B. (2008). Web search shared: Social aspects of a collaborative, community-based search network. Proc. of the Fifth Int'l Conf. on Adaptive Hypermedia and Adaptive Web-Based Systems (pp. 103-112).

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. of the Royal Statistical Society, B39, 1-38.

Ding, C., Li, T., Peng, W., & Park, H. (2006). Orthogonal nonnegative matrix tri-factorizations for clustering. Proc. of the 12th ACM SIGKDD Int'l Conf. (pp. 126-135).

George, T., & Merugu, S. (2005). A scalable collaborative filtering framework based on co-clustering. Proc. of the Fifth IEEE Int'l Conf. on Data Mining (pp. 625-628).

Hofmann, T., & Puzicha, J. (1998). Statistical models for co-occurrence data (Technical Report AIM-1625). Artificial Intelligence Laboratory, MIT.

Hofmann, T., & Puzicha, J. (1999). Latent class models for collaborative filtering. Proc. of the 16th Int'l Joint Conf. on Artificial Intelligence (pp. 688-693).

Jebara, T. (2004). Multi-task feature and kernel selection for SVMs. Proc. of the 21st Int'l Conf. on Machine Learning (pp. 329-336).

Pennock, D. M., Horvitz, E., Lawrence, S., & Giles, C. L. (2000). Collaborative filtering by personality diagnosis: A hybrid memory- and model-based approach. Proc. of the 16th Conf. on Uncertainty in Artificial Intelligence (pp. 473-480).

Raina, R., Battle, A., Lee, H., Packer, B., & Ng, A. Y. (2007). Self-taught learning: Transfer learning from unlabeled data. Proc. of the 24th Int'l Conf. on Machine Learning (pp. 759-766).

Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., & Riedl, J. (1994). GroupLens: An open architecture for collaborative filtering of netnews. Proc. of the ACM Conf. on Computer Supported Cooperative Work (pp. 175-186).

Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2001). Item-based collaborative filtering recommendation algorithms. Proc. of the 10th Int'l World Wide Web Conf. (pp. 285-295).

Si, L., & Jin, R. (2003). Flexible mixture model for collaborative filtering. Proc. of the 20th Int'l Conf. on Machine Learning (pp. 704-711).

Srebro, N., & Jaakkola, T. (2003). Weighted low-rank approximations. Proc. of the 20th Int'l Conf. on Machine Learning (pp. 720-727).