SIGIR 2007 Proceedings, Session 2: Routing and Filtering

Efficient Bayesian Hierarchical User Modeling for Recommendation Systems

Yi Zhang, Jonathan Koren
School of Engineering, University of California Santa Cruz
Santa Cruz, CA, USA
{yiz, jonathan}@soe.ucsc.edu

ABSTRACT

A content-based personalized recommendation system learns user-specific profiles from user feedback so that it can deliver information tailored to each individual user's interest. A system serving millions of users can learn a better profile for a new user, or for a user with little feedback, by borrowing information from other users through the use of a Bayesian hierarchical model. However, learning the model parameters that optimize the joint data likelihood over millions of users is computationally expensive, and the commonly used EM algorithm converges very slowly due to the sparseness of the data in IR applications. This paper proposes a new fast learning technique for learning a large number of individual user profiles. The efficacy and efficiency of the proposed algorithm are justified by theory and demonstrated on actual user data from Netflix and MovieLens.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Information filtering

General Terms: Algorithms

Keywords: recommender systems, information filtering, personalization, EM algorithm, Bayesian hierarchical models
1. INTRODUCTION

Personalization is the future of the Web, and it has already achieved great success in industrial applications. For example, online stores such as Amazon and Netflix provide customized recommendations for additional products or services based on a user's history. Recent offerings such as My MSN, My Yahoo!, My Google, and Google News have attracted much attention for their potential to infer a user's interests from his or her history.

One major personalization topic studied in the information retrieval community is content-based personal recommendation systems.(1) These systems learn user-specific profiles from user feedback so that they can recommend information tailored to each individual user's interest without requiring the user to make an explicit query. Learning the user profiles is the core problem for these systems. A user profile is usually a classifier that identifies whether a document is relevant to the user, or a regression model that estimates how relevant a document is to the user.

One major challenge in building a recommendation or personalization system is that the profile learned for a particular user is usually of low quality when the amount of data from that user is small. This is known as the "cold start" problem: any new user must endure poor initial performance until sufficient feedback has been provided to learn a reliable user profile.

There has been much research on improving classification accuracy when the amount of labeled training data is small. The semi-supervised learning approach combines unlabeled and labeled data to achieve this goal [26]. Another approach uses domain knowledge; researchers have modified learning algorithms such as Naïve Bayes [17], logistic regression [7], and SVMs [22] to integrate domain knowledge into a text classifier. A third approach borrows training data from other resources [5][7]. The effectiveness of these approaches is mixed, depending on how well the underlying model assumptions fit the data.

One well-received approach to improving recommendation performance for a particular user is borrowing information from other users through a Bayesian hierarchical modeling approach. Several researchers have demonstrated that this approach effectively trades off between shared and user-specific information, thus alleviating poor initial performance for each user [27][25]. To learn a Bayesian hierarchical model, the system usually tries to find the most likely model parameters for the given data. A mature recommendation system, however, may serve millions of users, and it is well known that learning the optimal parameters of a Bayesian hierarchical model is computationally expensive at that scale. The EM algorithm is a commonly used technique for this parameter learning due to its simplicity and convergence guarantee. However, a content-based recommendation system often handles documents in a very high dimensional space, in which each document is represented by a very sparse vector. With careful analysis of the EM algorithm in this scenario (Section 4), we find that it converges very slowly due to the sparseness of the input variables. We also find that updating the model parameters at each EM iteration is expensive, with computational complexity O(MK), where M is the number of users and K is the number of dimensions.

This paper modifies the standard EM algorithm to create an improved learning algorithm, which we call the Modified EM algorithm. The basic idea is that instead of calculating the numerical solution for all the user profile parameters, we derive the analytical solution of the parameters for some feature dimensions, and at the M step use the analytical solution instead of the numerical solution estimated at the E step for those parameters. This greatly reduces the computation of a single EM iteration, and also increases the convergence speed of the learning algorithm. The proposed technique is supported both by theory and by experimental results.

The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 describes the Bayesian hierarchical linear regression modeling framework used for content-based recommendation. Section 4 describes how to learn the model parameters using the standard EM algorithm, and then introduces the new technique proposed in this paper. The experimental setting and results used to validate the proposed learning technique are reported in Sections 5 and 6. Section 7 summarizes and offers concluding remarks.

(1) Content-based recommendation is also called adaptive filtering, or item-based collaborative filtering. In this paper, the words "filtering" and "recommendation" are used interchangeably.
2. RELATED WORK

Providing personalized recommendations to users has been recognized as an important problem in the IR community since the 1970s. The approaches used to solve this problem can be roughly classified into two major categories: content-based filtering and collaborative filtering.

Content-based filtering studies the scenario where a recommendation system monitors a document stream and pushes documents that match a user profile to the corresponding user. The user may read the delivered documents and provide explicit relevance feedback, which the filtering system then uses to update the user's profile, either with relevance-feedback retrieval models (e.g., Boolean models, vector space models, traditional probabilistic models [20], inference networks [3], and language models [6]) or with machine learning algorithms (e.g., support vector machines (SVMs), k-nearest neighbors (k-NN), neural networks, logistic regression, or Winnow [16][4][23]).

Collaborative filtering goes beyond merely using document content to recommend items: it leverages information from other users with similar tastes and preferences. Both memory-based heuristics and model-based approaches have been used for the collaborative filtering task [15][8][2][14][12][11].

This paper contributes to content-based recommendation research by improving the efficiency and effectiveness of Bayesian hierarchical linear models, which have a strong theoretical basis and good empirical performance on recommendation tasks [27][25]. This paper does not intend to compare content-based filtering with collaborative filtering, or to claim that one is better than the other. We believe each complements the other, and that content-based filtering is extremely useful for handling new documents or items with little or no user feedback. Like other researchers [18][1][21], we have found that a recommendation system is more effective when both techniques are combined; however, that is beyond the scope of this paper and is not discussed here.
3. BAYESIAN HIERARCHICAL LINEAR REGRESSION

Assume there are M users in the system. The task of the system is to recommend documents that are relevant to each user. For each user, the system learns a user model from the user's history. In the rest of this paper, we use the following notation:

m = 1, 2, ..., M: the index of an individual user. M is the total number of users.

w_m: the user model parameter associated with user m; w_m is a K-dimensional vector.

j = 1, 2, ..., J_m: the index of a training example for user m. J_m is the number of training examples for user m.

D_m = {(x_{m,j}, y_{m,j})}: the set of data associated with user m. x_{m,j} is a K-dimensional vector representing the m-th user's j-th training document,(2) and y_{m,j} is a scalar representing the label of document x_{m,j}.

k = 1, 2, ..., K: the dimension index of the input variable x.

(2) The first dimension of x is a dummy variable that always equals 1.

The Bayesian hierarchical modeling approach has been widely used in real-world information retrieval applications. Generalized Bayesian hierarchical linear models, among the simplest Bayesian hierarchical models, are commonly used and have achieved good performance on collaborative filtering [25] and content-based adaptive filtering [27] tasks. Figure 1 shows the graphical representation of a Bayesian hierarchical model. In this graph, each user model is represented by a random vector w_m. We assume a user model is sampled randomly from a prior distribution P(w|Φ). The system can predict the user label y of a document x given an estimate of w_m (or of w_m's distribution) using a function y = f(x, w). The model is called a generalized Bayesian hierarchical linear model when y = f(w^T x) is any generalized linear model, such as logistic regression, an SVM, or linear regression. To reliably estimate the user model w_m, the system can borrow information from other users through the prior Φ = (μ, Σ²).

[Figure 1: Illustration of the dependencies of variables in the hierarchical model. The rating y for a document x is conditioned on the document and the user model w_m associated with user m. Users share information about their models through the prior Φ = (μ, Σ²).]

We now look at one commonly used model in which y = w^T x + ε, where ε ~ N(0, σ²) is random noise [25][27]. Assume that each user model w_m is an independent draw from a population distribution P(w|Φ), governed by an unknown hyperparameter Φ. Let the prior distribution of the user model w be a Gaussian with parameters Φ = (μ, Σ²), which is the commonly used prior for linear models. μ = (μ_1, μ_2, ..., μ_K) is a K-dimensional vector representing the mean of the Gaussian, and Σ² is its covariance matrix. Usually, a normal distribution N(0, aI) and an inverse Wishart distribution

    P(\Sigma^2) \propto |\Sigma^2|^{-b/2} \exp\left(-\tfrac{c}{2}\, \mathrm{tr}((\Sigma^2)^{-1})\right)

are used as hyperpriors for μ and Σ², respectively. Here I is the K-dimensional identity matrix, and a, b, and c are real numbers. With these settings, we have the following generative model:

1. μ and Σ² are sampled from N(0, aI) and the inverse Wishart hyperprior above, respectively.

2. For each user m, w_m is sampled randomly from a normal distribution: w_m ~ N(μ, Σ²).

3. For each item x_{m,j}, y_{m,j} is sampled randomly from a normal distribution: y_{m,j} ~ N(w_m^T x_{m,j}, σ²).

Let Θ = (Φ, w_1, w_2, ..., w_M) represent the parameters of this system that need to be estimated. The joint likelihood of all the variables in the probabilistic model, including the data and the parameters, is

    P(D, \Theta) = P(\Phi) \prod_m \Big[ P(w_m \mid \Phi) \prod_j P(y_{m,j} \mid x_{m,j}, w_m) \Big]    (1)

For simplicity, we assume a, b, c, and σ are provided to the system.
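To make the generative process concrete, here is a small simulation of steps 1-3 in numpy. It is a sketch under simplifying assumptions, not the paper's code: the sizes and hyperparameter values are invented, and the covariance draw is replaced by a simple diagonal stand-in rather than a true inverse Wishart sample (anticipating the diagonal approximation of Section 4.2).

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 1000, 50            # number of users and feature dimensions (illustrative)
a, sigma2 = 0.1, 1.0       # hyperprior scale a and noise variance sigma^2 (assumed)

# Step 1: sample the shared prior Phi = (mu, Sigma^2).
# An inverse-gamma draw stands in for the inverse Wishart hyperprior (diagonal case).
mu = rng.normal(0.0, np.sqrt(a), size=K)
sigma2_prior = 1.0 / rng.gamma(shape=2.0, scale=2.0, size=K)

# Step 2: each user model w_m is an independent draw from N(mu, Sigma^2).
W = mu + rng.normal(size=(M, K)) * np.sqrt(sigma2_prior)

# Step 3: each label y_{m,j} is drawn from N(w_m^T x_{m,j}, sigma^2).
def sample_label(m: int, x: np.ndarray) -> float:
    """Sample a rating for document x; x[0] is the dummy feature fixed at 1."""
    return rng.normal(W[m] @ x, np.sqrt(sigma2))

x = np.zeros(K)
x[0] = 1.0                 # dummy dimension
x[7] = 1.0                 # e.g. "directed by Jean-Pierre Jeunet" at k = 7
print(sample_label(0, x))
```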
4. MODEL PARAMETER LEARNING

If the prior Φ is known, finding the optimal w_m is straightforward: it is a simple linear regression. Therefore, we focus on estimating Φ. The maximum a posteriori solution of Φ is given by

    \Phi_{MAP} = \arg\max_\Phi P(\Phi \mid D)    (2)
               = \arg\max_\Phi P(\Phi, D) / P(D)    (3)
               = \arg\max_\Phi P(D \mid \Phi) P(\Phi)    (4)
               = \arg\max_\Phi \int_w P(D \mid w, \Phi) P(w \mid \Phi) P(\Phi)\, dw    (5)

Finding the optimal solution of the above problem is challenging, since we need to integrate over all w = (w_1, w_2, ..., w_M), which are unobserved hidden variables.

4.1 EM Algorithm for Bayesian Hierarchical Linear Models

In Equation 5, Φ is the parameter that needs to be estimated, and the result depends on the unobserved latent variables w. This kind of optimization problem is usually solved by the EM algorithm. Applying EM to the above problem, the set of user models w are the unobservable hidden variables, and we have

    Q = \int_w P(w \mid \mu, \Sigma^2, D) \log P(\mu, \Sigma^2, w, D)\, dw

Based on the derivation of the EM formulas presented in [24], we have the following expectation-maximization steps for finding the optimal hyperparameters. For space considerations, we omit the derivation, since it is not the focus of this work.

E step: For each user m, estimate the user model distribution P(w_m | D_m, Φ) = N(w_m; \bar{w}_m, Σ²_m) based on the current estimate of the prior Φ = (μ, Σ²):

    \bar{w}_m = \left( (\Sigma^2)^{-1} + S_{xx,m}/\sigma^2 \right)^{-1} \left( (\Sigma^2)^{-1} \mu + S_{xy,m}/\sigma^2 \right)    (6)

    \Sigma^2_m = \left( (\Sigma^2)^{-1} + S_{xx,m}/\sigma^2 \right)^{-1}    (7)

where S_{xx,m} = \sum_j x_{m,j} x_{m,j}^T and S_{xy,m} = \sum_j x_{m,j} y_{m,j}.

M step: Optimize the prior Φ = (μ, Σ²) based on the estimates from the last E step:

    \mu = \frac{1}{M} \sum_m \bar{w}_m    (8)

    \Sigma^2 = \frac{1}{M} \sum_m \left[ \Sigma^2_m + (\bar{w}_m - \mu)(\bar{w}_m - \mu)^T \right]    (9)

Many machine-learning-driven IR systems use a point estimate of the parameters at various stages. Here, in contrast, we estimate the posterior distribution of the variables at the E step. This avoids overfitting w_m to a particular user's data, which may be small and noisy. A detailed discussion of this subject appears in [10].
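For concreteness, a dense single-iteration implementation of Equations 6-9 might look like the following sketch (numpy; the function and variable names are ours, and the data layout is an assumption). Note that it touches every user-feature pair and inverts a K x K matrix per user, which is the cost that Section 4.2 attacks.

```python
import numpy as np

def em_iteration(Xs, ys, mu, Sigma2, sigma2):
    """One standard EM iteration (Eqs. 6-9).
    Xs[m]: (J_m, K) documents of user m; ys[m]: (J_m,) labels;
    mu: (K,) prior mean; Sigma2: (K, K) prior covariance; sigma2: noise variance."""
    M, K = len(Xs), mu.shape[0]
    prior_inv = np.linalg.inv(Sigma2)
    w_bar = np.zeros((M, K))
    Sigma2_m = np.zeros((M, K, K))
    # E step: posterior N(w_bar_m, Sigma2_m) for every user (Eqs. 6-7).
    for m, (X, y) in enumerate(zip(Xs, ys)):
        S_xx = X.T @ X                       # sum_j x_{m,j} x_{m,j}^T
        S_xy = X.T @ y                       # sum_j x_{m,j} y_{m,j}
        Sigma2_m[m] = np.linalg.inv(prior_inv + S_xx / sigma2)         # Eq. 7
        w_bar[m] = Sigma2_m[m] @ (prior_inv @ mu + S_xy / sigma2)      # Eq. 6
    # M step: re-estimate the prior from all M users (Eqs. 8-9).
    mu_new = w_bar.mean(axis=0)                                        # Eq. 8
    d = w_bar - mu_new
    Sigma2_new = (Sigma2_m.sum(axis=0) + np.einsum('mi,mj->ij', d, d)) / M  # Eq. 9
    return mu_new, Sigma2_new, w_bar, Sigma2_m
```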
4.2 New Algorithm: Modified EM

Although the EM algorithm is widely studied and used in machine learning applications, using the above EM process to solve Bayesian hierarchical linear models in a large-scale information retrieval system is still too computationally expensive. In this section, we explain why the EM algorithm learns slowly in our application and introduce a new technique that makes learning the Bayesian hierarchical linear model scalable. The derivation of the new learning algorithm is based on the EM algorithm described in the previous section.

First, the covariance matrices Σ² and Σ²_m are usually too large to be computationally feasible. For simplicity, and as is common practice in IR, we do not model the correlations between features. Thus we approximate these matrices with K-dimensional diagonal matrices, and in the rest of the paper we use the same symbols to denote their diagonal approximations:

    \Sigma^2 = \mathrm{diag}(\sigma_1^2, \sigma_2^2, ..., \sigma_K^2)

    \Sigma^2_m = \mathrm{diag}(\sigma_{m,1}^2, \sigma_{m,2}^2, ..., \sigma_{m,K}^2)

Second, and most importantly, the input space in a real IR application is very sparse, and many dimensions are not "related" to a particular user. For example, consider a movie recommendation system in which the input variable x represents a particular movie. For the j-th movie that user m has seen, let x_{m,j,k} = 1 if the director of the movie is "Jean-Pierre Jeunet" (indexed by k); that is, the k-th dimension indicates whether this director directed the movie. If user m has never seen a movie directed by Jean-Pierre Jeunet, then the corresponding dimension is always zero (x_{m,j,k} = 0 for all j).

One major drawback of the standard EM algorithm is that the importance of a feature, μ_k, may be dominated at the M step (Equation 8) by users who have never encountered the feature (i.e., x_{m,j,k} = 0 for all j). Assume that 100 out of 1 million users have viewed movies directed by Jean-Pierre Jeunet, and that these viewers rated all of his movies as "excellent". Intuitively, he is a good director, and his weight μ_k should be high. Before the EM iterations begin, the initial value of μ is usually set to 0. Since the other 999,900 users have not seen his movies, their corresponding weights (w_{1,k}, w_{2,k}, ..., w_{999900,k}) stay very small initially. Thus the director's weight μ_k in the prior is still very low after the first M step, and the variance σ_{m,k} is large (Equations 8 and 7). It is undesirable that users who have never seen any movie by this director influence his importance so strongly, and it is this effect that makes the convergence of the standard EM algorithm very slow.

Now let us see how to improve the learning speed of the algorithm. Without loss of generality, assume that the k-th dimension of the input variable x is unrelated to a particular user m, by which we mean x_{m,j,k} = 0 for all j = 1, ..., J_m. It is straightforward to prove that the k-th row and k-th column of S_{xx,m} are filled entirely with zeros, and that the k-th dimension of S_{xy,m} is zero as well. Thus the corresponding k-th dimension of the user model's mean should equal that of the prior, \bar{w}_{m,k} = μ_k, with corresponding variance σ_{m,k} = σ_k.

At the M step, the standard EM algorithm uses the numerical solution of the distribution P(w_m | D_m, Φ) estimated at the E step (Equations 6 and 7). However, the numerical solutions are unreliable for \bar{w}_{m,k} and σ_{m,k} when the k-th dimension is unrelated to the m-th user. A better approach is to use the analytical solutions \bar{w}_{m,k} = μ_k and σ_{m,k} = σ_k for the unrelated (m, k) pairs, along with the numerical solutions estimated at the E step for the remaining (m, k) pairs. This yields the following new EM-like algorithm:

Modified E step: For each user m, estimate the user model distribution P(w_m | D_m, Φ) = N(w_m; \bar{w}_m, Σ²_m) based on the current estimates of μ and Σ²:

    \bar{w}_m = \left( (\Sigma^2)^{-1} + S_{xx,m}/\sigma^2 \right)^{-1} \left( (\Sigma^2)^{-1} \mu + S_{xy,m}/\sigma^2 \right)    (10)

    \sigma_{m,k}^2 = \left( (\sigma_k^2)^{-1} + s_{xx,m,k}/\sigma^2 \right)^{-1}    (11)

where s_{xx,m,k} = \sum_j x_{m,j,k}^2 and s_{xy,m,k} = \sum_j x_{m,j,k} y_{m,j}.

Modified M step: Optimize the prior Φ = (μ, Σ²) based on the E-step estimates for the related user-feature pairs. The M step implicitly uses the analytical solution for the unrelated user-feature pairs:

    \mu_k = \frac{1}{M_k} \sum_{m: related} \bar{w}_{m,k}    (12)

    \sigma_k^2 = \frac{1}{M_k} \sum_{m: related} \left[ \sigma_{m,k}^2 + (\bar{w}_{m,k} - \mu_k)^2 \right]    (13)

where M_k is the number of users related to feature k.

We only estimate the diagonals of Σ² and Σ²_m, since we use the diagonal approximation of the covariance matrices. To estimate \bar{w}_m, we only need to calculate the numerical solutions for the dimensions related to user m. To estimate σ²_k and μ_k, we only sum over the users related to the k-th feature.

The new algorithm has two major benefits. First, because only the related (m, k) pairs are needed at the modified M step, the computation of a single EM iteration is much cheaper when the data is sparse and many (m, k) pairs are unrelated. Second, the parameters estimated at the modified M step (Equations 12-13) are more accurate than those from the standard M step described in Section 4.1, because the new algorithm uses the exact analytical solutions \bar{w}_{m,k} = μ_k and σ_{m,k} = σ_k for the unrelated (m, k) pairs instead of the approximate numerical solutions used by the standard algorithm.
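Under the diagonal approximation, the modified algorithm only ever touches the related (m, k) pairs. The sketch below is our own rendering of Equations 10-13, not the authors' code: each user's documents are assumed stored restricted to that user's related feature indices, and Equation 13 is accumulated in one pass via the identity sum[s + (w - mu)^2] = sum[s + w^2] - M_k mu^2.

```python
import numpy as np

def modified_em_iteration(user_feats, Xs, ys, mu, s2_prior, sigma2):
    """One modified EM iteration (Eqs. 10-13) with diagonal covariances.
    user_feats[m]: integer array of indices k with x_{m,j,k} != 0 for some j;
    Xs[m]: (J_m, K_m) documents of user m restricted to those dimensions;
    mu, s2_prior: (K,) prior mean and diagonal prior variance."""
    K = mu.shape[0]
    sum_w = np.zeros(K)   # sum over related users of w_bar_{m,k}
    sum_s = np.zeros(K)   # sum over related users of sigma2_{m,k} + w_bar^2
    M_k = np.zeros(K)     # number of users related to feature k
    for feats, X, y in zip(user_feats, Xs, ys):
        # Modified E step, restricted to this user's related dimensions.
        prior_inv = 1.0 / s2_prior[feats]
        A = np.diag(prior_inv) + X.T @ X / sigma2
        w_bar = np.linalg.solve(A, prior_inv * mu[feats] + X.T @ y / sigma2)  # Eq. 10
        s2_m = 1.0 / (prior_inv + (X * X).sum(axis=0) / sigma2)               # Eq. 11
        np.add.at(sum_w, feats, w_bar)
        np.add.at(sum_s, feats, s2_m + w_bar ** 2)
        np.add.at(M_k, feats, 1)
    # Modified M step over related pairs only (Eqs. 12-13); unrelated pairs
    # implicitly keep the analytical solution w_bar_{m,k} = mu_k, s_{m,k} = s_k.
    rel = M_k > 0
    mu_new, s2_new = mu.copy(), s2_prior.copy()
    mu_new[rel] = sum_w[rel] / M_k[rel]                                       # Eq. 12
    s2_new[rel] = sum_s[rel] / M_k[rel] - mu_new[rel] ** 2                    # Eq. 13
    return mu_new, s2_new
```

Per iteration this costs on the order of the number of related user-feature pairs (plus a small linear solve per user), rather than O(MK), which is the speedup claimed above.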
5. EXPERIMENTAL METHODOLOGY

5.1 Evaluation Data Set

To evaluate the proposed technique, we used three major data sets (Table 1).

Table 1: Data set statistics. On Reuters, the number of ratings for a simulated user is the number of documents relevant to the corresponding topic.

    Data          Users     Docs      Ratings per User
    MovieLens     6,040     3,057     151
    Netflix-all   480,189   17,770    208
    Netflix-1000  1,000     17,770    127
    Reuters-C     34        100,000   3,949
    Reuters-E     26        100,000   1,632
    Reuters-G     33        100,000   2,222
    Reuters-M     10        100,000   6,529

MovieLens Data: This data set was created by combining the relevance judgments from the MovieLens [9] data set with documents from the Internet Movie Database (IMDB). MovieLens lets users rate how much they enjoyed a specific movie on a scale from 1 to 5. This "likeability" rating was used as a measure of how relevant the document representing the corresponding movie is to the user. We considered documents with likeability scores of 4 or 5 as relevant, and documents with scores of 1 to 3 as irrelevant (a small sketch of this mapping follows the data set descriptions). MovieLens provides relevance judgments on 3,057 documents from 6,040 users. On average, each user rated 151 movies, of which 87 were judged relevant; the average score of a document was 3.58. Documents representing the movies were constructed from the portion of the IMDB database available for public download [13]. Based on this database, we created one document per movie containing the relevant information about it (e.g., directors, actors, etc.).

Netflix Data: This data set was constructed by combining documents about movies crawled from the web with a set of actual movie-rental customer relevance judgments from Netflix [19]. Netflix publicly provides the relevance judgments of 480,189 anonymous customers: around 100 million ratings on a scale of 1 to 5 for 17,770 documents. As with MovieLens, we considered documents with likeability scores of 4 or 5 as relevant. For most experiments, the data set was reduced to 1,000 customers through random sampling. The average customer in the reduced data set provided 127 judgments, of which 70 were deemed relevant; the average score of a document was 3.55.

Reuters Data: This is the Reuters Corpus, Volume 1, which contains about 810,000 Reuters English-language news stories from August 20, 1996 to August 19, 1997. Only the first 100,000 stories were used in our experiments. The Reuters corpus comes with a topic hierarchy, and each document is assigned to one or several locations on the hierarchical tree. The first level of the tree contains four topics, denoted C, E, G, and M. For the experiments in this paper, the tree was cut at level 1 to create four smaller trees, each corresponding to one smaller data set: Reuters-C, Reuters-E, Reuters-G, and Reuters-M. For each small data set, we created several profiles, one for each node in the sub-tree, to simulate multiple users, each with a related yet distinct definition of relevance. All the user profiles on a sub-tree are assumed to share the same prior model distribution. Since this corpus explicitly indicates only the documents relevant to a topic (user), all other documents are considered irrelevant.
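As promised above, the likeability-to-relevance mapping used for MovieLens and Netflix is a simple threshold at 4; a minimal sketch (the function name and example values are ours):

```python
import numpy as np

def relevance_labels(ratings: np.ndarray) -> np.ndarray:
    """Map 1-5 likeability scores to binary relevance: 4 or 5 -> relevant."""
    return (ratings >= 4).astype(int)

r = np.array([5, 3, 4, 1, 4])       # one user's ratings on the 1-5 scale
print(relevance_labels(r))          # -> [1 0 1 0 1]
print(r.mean())                     # average score, cf. 3.58 on MovieLens
```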
5.2 Evaluation

We designed the experiments to answer three questions:

1. Do we need to take the effort to use a Bayesian approach and learn a prior from other users?

2. Does the new algorithm work better than the standard EM algorithm for learning the Bayesian hierarchical linear model?

3. Can the new algorithm quickly learn many user models?

To answer the first question, we compared the Bayesian hierarchical models with commonly used norm-2 regularized linear regression models. In fact, this commonly used approach is equivalent to the model learned at the end of the first EM iteration. To answer the second question, we compared the proposed new algorithm with the standard EM algorithm to see whether the new learning algorithm is better. To answer the third question, we tested the efficiency of the new algorithm on the entire Netflix data set, where about half a million user models must be learned together.

For the MovieLens and Netflix data sets, algorithm effectiveness was measured by mean squared error; on the Reuters data set, classification error was used because it is more informative there. We first evaluated the performance for each individual user, and then computed the macro average over all users. Statistical tests (t-tests) were carried out to check whether the differences are significant. For the experiments on the MovieLens and Netflix data sets, we used a random sample of 90% of each user's data for training and the rest for testing. On the Reuters data set, because the corpus contains too many relevant documents for each topic, we used a random sample of 10% of each topic for training, and 10% of the remaining documents for testing. For all runs, we set (a, b, c, σ) = (0.1, 10, 0.1, 1) manually.
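The per-user evaluation with a macro average and a paired t-test over per-user errors can be sketched as follows (numpy/scipy assumed available; the data here is toy and the function names are ours):

```python
import numpy as np
from scipy import stats

def macro_mse(preds_by_user, labels_by_user):
    """Per-user mean squared error, then the unweighted (macro) average."""
    per_user = np.array([np.mean((p - y) ** 2)
                         for p, y in zip(preds_by_user, labels_by_user)])
    return per_user.mean(), per_user

# Toy example: two systems' predictions for three users' held-out ratings.
labels = [np.array([4., 5., 2.]), np.array([1., 3.]), np.array([5., 4., 4., 2.])]
sys_a = [np.array([3.8, 4.6, 2.5]), np.array([1.5, 2.9]), np.array([4.7, 4.2, 3.6, 2.4])]
sys_b = [np.array([3.2, 4.0, 3.1]), np.array([2.2, 2.5]), np.array([4.1, 3.8, 3.2, 3.0])]

mse_a, err_a = macro_mse(sys_a, labels)
mse_b, err_b = macro_mse(sys_b, labels)
t, p = stats.ttest_rel(err_a, err_b)   # paired t-test over per-user errors
print(mse_a, mse_b, p)
```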
6. EXPERIMENTAL RESULTS

[Figure 2: Performance on a Netflix subset with 1,000 users. Two panels plot mean squared error (iterations 0 to 10) and classification error (iterations 1 to 10) for the new algorithm and the traditional EM algorithm. The new algorithm is statistically significantly better than the EM algorithm at iterations 2 through 10. Norm-2 regularized linear models are equivalent to the Bayesian hierarchical models learned at the first iteration, and are statistically significantly worse than the Bayesian hierarchical models.]

[Figure 3: Performance on a MovieLens subset with 1,000 users. Two panels plot mean squared error and classification error over iterations 1 to 21 for the new algorithm and the traditional EM algorithm. The new algorithm is statistically significantly better than the EM algorithm at iterations 2 to 17 (evaluated with mean squared error).]

[Figure 4: Performance on a Reuters-E subset with 26 profiles. Two panels plot mean squared error and classification error over iterations 1 to 5 for the new algorithm and the traditional EM algorithm. Performance on Reuters-C, Reuters-M, and Reuters-G is similar.]

Figures 2, 3, and 4 show that on all data sets, the Bayesian hierarchical modeling approach yields a statistically significant improvement over the regularized linear regression model, which is equivalent to the Bayesian hierarchical model learned at the first iteration. Further analysis shows a negative correlation between the amount of training data for a user and the improvement the system gains: borrowing information from other users helps most for users with little training data, as expected. However, the strength of the correlation differs across data sets, and the amount of training data is not the only characteristic that influences the final performance.

Figures 2 and 3 show that the proposed new algorithm works better than the standard EM algorithm on the Netflix and MovieLens data sets. This is not surprising, since on these two data sets the number of related user-feature pairs is much smaller than the number of unrelated pairs, and the proposed algorithm is therefore expected to work better. Figure 4 shows that the two algorithms work similarly on the Reuters-E data set: the accuracy of the new algorithm is close to that of the standard EM algorithm at each iteration, and the general patterns are very similar on the other Reuters subsets. Further analysis shows that only 58% of the user-feature pairs are unrelated on this data set. Since the number of unrelated user-feature pairs is not extremely large, sparseness is not a serious problem on the Reuters data set, and the two learning algorithms perform similarly. These results suggest that the proposed technique yields a significant improvement over standard EM only on corpora where the number of unrelated user-feature pairs is much larger than the number of related pairs, as on the Netflix data set. The experiments also show, however, that when this assumption does not hold, the new algorithm does not hurt performance.

Although the proposed technique is faster than standard EM, can it really learn millions of user models quickly? Our results show that the modified EM algorithm converges quickly, with 2 to 3 modified EM iterations producing a reliable estimate. We evaluated the algorithm on the whole Netflix data set (480,189 users, 159,836 features, and 100 million ratings) running on a single-CPU PC (2 GB memory, P4 3 GHz). The system finished one modified EM iteration in about 4 hours. This demonstrates that the proposed technique can efficiently handle a large-scale system like Netflix.
7. CONCLUSION

Content-based user profile learning is an important problem and the key to providing personalized recommendations, especially when recommending new items with few ratings. The Bayesian hierarchical modeling approach is becoming an important user profile learning method due to its theoretically justified ability to help one user through information transferred from the other users by way of hyperpriors. This paper examined the weakness of the popular EM-based learning approach for Bayesian hierarchical linear models and proposed a better learning technique called Modified EM. We showed that the new technique is theoretically more computationally efficient than the standard EM algorithm. Evaluation on the MovieLens and Netflix data sets demonstrated the effectiveness of the new technique when the data is sparse, by which we mean that the ratio of related user-feature pairs to unrelated pairs is small. Evaluation on the Reuters data set showed that the new technique performs similarly to the standard EM algorithm when the sparseness condition does not hold. In general, it is better to use the new algorithm, since it is as simple as standard EM, performs better than or comparably to it, and has lower computational complexity at each iteration. It is worth mentioning that even if the original problem space is not sparse, sparseness can be created artificially when a recommendation system uses user-specific feature selection techniques to reduce noise and user model complexity; the proposed technique can be adapted to improve learning in that scenario as well. We also demonstrated that the proposed technique can learn half a million user profiles from 100 million ratings in a few hours on a single CPU.

This research matters because scalability is a major concern for researchers who want to use the Bayesian hierarchical linear modeling approach to build practical large-scale systems, even though the literature has demonstrated the effectiveness of these models in many applications. Our work is a major step toward making Bayesian hierarchical linear models practical. The proposed technique can easily be adapted to run on a cluster of machines, further speeding up learning and allowing systems with hundreds of millions of users. The work also has much potential to benefit those who use the EM algorithm on other IR and machine learning problems. The EM algorithm is a commonly used machine learning technique, employed to find model parameters in many IR problems where the training data is very sparse. Although we focus on Bayesian hierarchical linear models for recommendation and filtering, the idea of using the analytical solution instead of the numerical solution for the unrelated user-feature pairs at the M step can be adapted to many other problems.

8. ACKNOWLEDGMENTS

We thank Wei Xu, David Lewis, and the anonymous reviewers for valuable feedback on the work described in this paper. Part of the work was supported by Yahoo!, Google, the Petascale Data Storage Institute, and the Institute for Scalable Scientific Data Management. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.
9. REFERENCES

[1] C. Basu, H. Hirsh, and W. Cohen. Recommendation as classification: Using social and content-based information in recommendation. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-1998), 1998.

[2] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. Technical report, Microsoft Research, Redmond, WA, 1998.

[3] J. Callan. Document filtering with inference networks. In Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 262-269, 1996.

[4] N. Cancedda, N. Cesa-Bianchi, A. Conconi, C. Gentile, C. Goutte, T. Graepel, Y. Li, J. M. Renders, J. S. Taylor, and A. Vinokourov. Kernel method for document filtering. In The Eleventh Text REtrieval Conference (TREC-11). National Institute of Standards and Technology, Special Publication 500-249, 2003.

[5] C. Chelba and A. Acero. Adaptation of maximum entropy capitalizer: Little data can help a lot. In Proceedings of EMNLP 2004, pages 285-292, Barcelona, Spain, July 2004. Association for Computational Linguistics.

[6] B. Croft and J. Lafferty, editors. Language Modeling for Information Retrieval. Kluwer, 2002.

[7] A. Dayanik, D. D. Lewis, D. Madigan, V. Menkov, and A. Genkin. Constructing informative prior distributions from domain knowledge in text classification. In Proceedings of the 29th Annual International ACM SIGIR Conference (SIGIR '06), pages 493-500, 2006.

[8] J. Delgado and N. Ishii. Memory-based weighted-majority prediction for recommender systems. In ACM SIGIR '99 Workshop on Recommender Systems, 1999.

[9] GroupLens. MovieLens. http://www.grouplens.org/taxonomy/term/14, 2006.

[10] D. Heckerman. A tutorial on learning with Bayesian networks. In M. Jordan, editor, Learning in Graphical Models. Kluwer Academic, 1998.

[11] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd Annual International ACM SIGIR Conference (SIGIR '99), pages 230-237, 1999.

[12] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI '99), pages 688-693. Morgan Kaufmann, 1999.

[13] Internet Movie Database (IMDB). http://www.imdb.com/interfaces/, 2006.

[14] R. Jin, J. Y. Chai, and L. Si. An automatic weighting scheme for collaborative filtering. In Proceedings of the 27th Annual International ACM SIGIR Conference (SIGIR '04), pages 337-344, 2004.

[15] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77-87, 1997.

[16] D. Lewis. Applying support vector machines to the TREC-2001 batch filtering and routing tasks. In Proceedings of the Eleventh Text REtrieval Conference (TREC-11), 2002.

[17] B. Liu, X. Li, W. S. Lee, and P. Yu. Text classification by labeling words. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-2004), 2004.

[18] P. Melville, R. J. Mooney, and R. Nagarajan. Content-boosted collaborative filtering for improved recommendations. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-2002), Edmonton, Canada, 2002.

[19] Netflix. Netflix Prize. http://www.netflixprize.com (visited Nov. 30, 2006), 2006.

[20] S. Robertson and K. Sparck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27:129-146, 1976.

[21] J. Wang, A. P. de Vries, and M. J. T. Reinders. Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In Proceedings of the 29th Annual International ACM SIGIR Conference (SIGIR '06), pages 501-508, 2006.

[22] X. Wu and R. K. Srihari. Incorporating prior knowledge with weighted margin support vector machines. In Proceedings of the ACM Knowledge Discovery and Data Mining Conference (ACM SIGKDD 2004), August 2004.

[23] Y. Yang, S. Yoo, J. Zhang, and B. Kisiel. Robustness of adaptive filtering methods in a cross-benchmark evaluation. In Proceedings of the 28th Annual International ACM SIGIR Conference, 2005.

[24] K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks. In Proceedings of the 22nd International Conference on Machine Learning (ICML '05), pages 1012-1019, 2005.

[25] K. Yu, V. Tresp, and S. Yu. A nonparametric hierarchical Bayesian framework for information filtering. In Proceedings of the 27th Annual International ACM SIGIR Conference (SIGIR '04), pages 353-360, 2004.

[26] X. Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison, December 2006.

[27] P. Zigoris and Y. Zhang. Bayesian adaptive user profiling with explicit & implicit feedback. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM 2006), 2006.