WWW 2008 / Refereed Track: Data Mining - Modeling April 21-25, 2008 · Beijing, China Opinion Integration Through Semi-supervised Topic Modeling Depar tment of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 Yue Lu yuelu2@uiuc.edu Depar tment of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 Chengxiang Zhai czhai@uiuc.edu ABSTRACT Web 2.0 technology has enabled more and more people to freely express their opinions on the Web, making the Web an extremely valuable source for mining user opinions about all kinds of topics. In this paper we study how to automatically integrate opinions expressed in a well-written expert review with lots of opinions scattering in various sources such as blogspaces and forums. We formally define this new integration problem and propose to use semi-supervised topic models to solve the problem in a principled way. Experiments on integrating opinions about two quite different topics (a product and a political figure) show that the proposed method is effective for both topics and can generate useful aligned integrated opinion summaries. The proposed method is quite general. It can be used to integrate a well written review with opinions in an arbitrary text collection about any topic to potentially support many interesting applications in multiple domains. Categories and Sub ject Descriptors: B.3.3 [Information Search and Retrieval]: Text Mining General Terms: Algorithms Keywords: opinion integration, semi-supervised, probabilistic topic modeling, expert review 1. INTRODUCTION As Web 2.0 applications become increasingly popular, more and more people express their opinions on the Web in various ways such as customer reviews, forums, discussion groups, and Weblogs. The wide coverage of topics and abundance of opinions make the Web an extremely valuable source for mining user opinions about all kinds of topics (e.g., products, political figures, etc.). However, with such a large scale of information source, it is quite challenging for a user to integrate and digest all the opinions from different sources. In general, for any given topic (e.g., a product), there are often two kinds of opinions: the first is opinions expressed in some well-structured relatively complete review typically written by some expert about the topic and the second is fragmental opinions scattering around in all kinds of sources such as blog articles and forums. For convenience of discussion, we will refer to the first kind as expert opinions and the second ordinary opinions. The expert opinions are relatively easy for a user to access through some opinion search Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. website such as CNET. Because a comprehensive product review is often written carefully, it is also easy for a user to digest expert opinions. However, finding, integrating, and digesting ordinary opinions pose significant challenges as they are scattering in many different sources, and are generally fragmental and not well structured. While expert opinions are clearly very useful, they may be biased and often out of date after a while. In contrast, ordinary opinions tend to represent the general opinions of a large number of people and get refreshed quickly as people dynamically generate new content. For example, a query "iPhone" returns 330,431 matches in Google's blogsearch (as of Nov. 1, 2007), suggesting that there are many opinions expressed about iPhone in blog articles within a short period of time since it hit the market. To enable a user to benefit from both kinds of opinions, it is thus necessary to automatically integrate these two kinds of opinions and present an integrated opinion summary to a user. To the best of our knowledge, such an integration problem has not been studied in the existing work. In this paper, we study how to integrate a well-written expert review about an arbitrary topic with many ordinary opinions expressed in a text collection such as blog articles. We propose a general method to solve this integration problem in three steps: (1) extract ordinary opinions from text using information retrieval; (2) summarize and align the extracted opinions to the expert review to integrate the opinions; (3) further separate ordinary opinions that are similar to expert opinions from those that are not. Our main idea is to take advantage of the high readability of the expert review to structure the unorganized ordinary opinions while at the same time summarizing the ordinary opinions to extract representative opinions using the expert review as guidance. From the viewpoint of text data mining, we are essentially to use the expert review as a "template" to mine text data for ordinary opinions. The first step in our approach can be implemented with a direct application of information retrieval techniques. Implementing the second and third steps involves special challenges. In particular, without any training data, it is unclear how we should align ordinary opinions to an expert review and separate similar and supplementary opinions. We propose a semi-supervised topic modeling approach to solve these challenges. Specifically, we cast the expert review as a prior in a probabilistic topic model (i.e., PLSA[6]) and fit the model to the text collection with the ordinary opinions with Maximum A Posterior (MAP) estimation. With the estimated probabilistic model, we can then naturally obtain alignments of opinions as well as ad- 121 WWW 2008 / Refereed Track: Data Mining - Modeling ditional ordinary opinions that cannot be well-aligned with the expert review. The separation of similar and supplementary opinions can also be achieved with a similar model. We evaluate our method on integrating opinions about two quite different topics. One is a popular product "iPhone", and the other is a popular political figure Barack Obama. Experiment results show that our method can effectively integrate the expert review (a produce review from CNET for iPhone and a short biography from Wikipedia for Barack Obama) with ordinary opinions from blog articles. This paper makes the following contributions: 1. We define a new problem of opinion integration. To the best of our knowledge, there is no existing work that solves this problem. 2. We propose a new semi-supervised topic modeling approach for integrating opinions scattered around in text articles with those in a well-written expert review for an arbitrary topic. 3. We evaluate the proposed method both qualitatively and quantitatively. The results show that our method is effective for integrating opinions about quite different topics. Collecting and digesting opinions about a topic is critical for many tasks such as shopping, medical decision making, and social interactions. Our proposed method is quite general and can be applied to integrate opinions about any topic in any domain, thus potentially has many interesting applications. The rest of the paper is organized as follows. In Section 2, we formally define the novel problem of opinion integration. After that, we present our Semi-supervised Topic Model in Section 4. We discuss our experiments and results in Section 5. Finally, we conclude in Section 7. April 21-25, 2008 · Beijing, China Blog Collection Aspect r1 Aspect r2 Scattered Opinions ... Aspect rk Review Article ... Partition of Similar Opinions Blog Collection Supplementary Opinions Aspect r1 Aspect r2 ... Aspect rk Aspect rk+1 ... Aspect rk+m Figure 1: Problem Setup real applications, we would link each extracted opinion sentence back to the original document to facilitate navigating into the original document and obtaining context of an opinion. We would like our integrated opinion summary to include both opinions in the expert review and those most representative opinions in the text collection. Since the expert review is well written, we keep their original form and leverage its structure to organize the ordinary opinions extracted from text. To quantify the representativeness of an ordinary opinion sentence, we will compute a "support value" for each extracted ordinary opinion sentence. Specifically, we would like to partition the extracted ordinary opinion sentences into groups that can be potentially aligned with all the review segments r1 , ..., rk . Naturally, there may also be some groups with extra ordinary opinions that are not alignable with any expert opinion segment, and these opinions can be very useful to augment the expert review with additional opinions. Furthermore, for opinions aligned to a review segment ri , we would like to further separate those that are similar to ri from those that are supplementary for ri ; such separation can allow a user to digest the integrated opinions more easily. Finally, if ri has multiple sentences, we can further align each ordinary opinion sentence (both "similar" and "supplementary") with a sentence in ri to increase the readability. This problem setup is illustrated in Figure 1. We now define the problem more formally. Definition (Representative Opinion (RO)) A representative opinion(RO) is an ordinary opinion sentence extracted from the text collection with a support value. Formally, we denote it by oij = ( , sij ) where [1, +) is a support value indicating how many sentences this opinion sentence can represent, and sij is a sentence in document di . Since ordinary opinions tend to be redundant and we are primarily interested in extracting representative opinions, 2. PROBLEM DEFINITION In this section, we define the novel problem of opinion integration. Given an expert review about a topic T (e.g., "iPhone" or "Barack Obama") and a collection of text articles (e.g., blog articles), our goal is to extract opinions from text articles and integrate them with those in the expert review to form an integrated opinion summary. The expert review is generally well-written and coherent, thus we can view it as a sequence of semantically coherent segments, where a segment could be a sentence, a paragraph, or other meaningful segments (e.g., paragraphs corresponding to product features) available in some semistructured review. Formally, we denote the expert review by R = {r1 , ..., rk } where ri is a segment. Since we can always treat a sentence as a segment, this definition is quite general. The text collection is a set of text documents where ordinary opinions are expressed and can be represented as C = {d1 , ..., d|C | } where di = (si1 , ..., si|di | ) is a document and sij is a sentence. To support opinion integration in a general and robust manner, we do not rely on extra knowledge to segment documents to obtain opinion regions; instead, we treat each sentence as an opinion unit. Since a sentence has a well-defined meaning, this assumption is reasonable. To help a user interpret any opinion sentence, in 122 WWW 2008 / Refereed Track: Data Mining - Modeling the support can be very useful to assess the representativeness of an extracted opinion. Let RO(C ) be all the possible representative opinion sentences in C . We can now define the integrated opinion summary that we would like to generate as follows. April 21-25, 2008 · Beijing, China In the second stage, our main idea is to exploit a probabilistic topic model, i.e., Probabilistic Latent Semantic Analysis (PLSA) with conjugate prior [6, 11] to cluster opinion sentences in a special way so that there will be precisely one cluster corresponding to each segment ri in the expert review. These clusters are to collect opinion sentences that can Definition (Integrated Opinion Summary) An intebe aligned with a review segment. There will also be some grated opinion summary of R and C is a tuple clusters that are not aligned with any review segments, and (R, S sim , S supp , S extra ) where (1) R is the given expert rethey are designed to collect extra opinions. Thus the model s s s s view; (2) S sim = {S1 im , ..., Skim } and S supp = {S1 upp , ..., Skupp } provides an elegant way to simultaneously partition opinare similar and supplementary representative opinion senions and align them to the expert review. Interestingly, the s s tences, respectively, that can be aligned to R, and Si im , Sj upp same model can also be adapted to further partition opinions RO(C ) are sets of representative opinion sentences; (3) S extra aligned to a review segment into similar and supplementary opinions. Finally, a simplified version of the model (i.e., RO(C ) is a set of extra representative opinion sentences that no prior, basic PLSA) can be used to cluster any group of cannot be aligned with R. sentences to extract representative opinion sentences. The Note that we define "opinion" broadly as covering all the support of a representative opinion is defined as the size of discussion about a topic in opinionate sources such as blog the cluster represented by the opinion sentences. spaces and forums. The notion of "opinion" is quite vague; Note that what we need in this second stage is semiwe adopt this broad definition to ensure generality of the supervised clustering in the sense that we would like to conproblem set up and its solutions. In addition, any existstrain many of the clusters so that they would correspond to ing sentiment analysis technique could be applied as a postthe segments ri s in the expert review. Thus a direct applicaprocessing step. But since we only focus on the integration tion of any regular clustering algorithm would not be able to problem in this paper, we will not cover sentiment analysis. solve our problem. Instead of doing clustering, we can also imagine using each expert review segment ri as a query to retrieve similar sentences. However, it would be unclear how 3. OVERVIEW OF PROPOSED APPROACH to choose a good cutoff point on the ranked list of retrieved The opinion integration problem as defined in the previous results. Compared with these alternative approaches, PLSA section is quite different from any existing problem setup for with conjugate prior provides a more principled and unified opinion extraction and summarization, and it presents some way to tackle all the challenges. special challenges: (1) How can we extract representative In the optional third stage, we have a review segment opinion sentences with support information? (2) How can ri with multiple sentences and we would like to align all we distinguish alignable opinions from non-alignable opinextracted representative opinions to the sentences in ri . This ions? (3) For any given expert review segment, how can we can be achieved by using each representative opinion as a distinguish similar opinions from those that are supplemenquery and retrieve sentences in ri . Once again, in general, tary? (4) In the case when a review segment ri has mulany retrieval method can be used. In this paper, we again tiple sentences, how can we align a representative opinion used the KL-divergence retrieval method. to a sentence in ri ? In this section, we present our overall From the discussion above, it is clear that we leverage both approach to solving all these challenges, leaving a detailed information retrieval techniques and text mining techniques presentation to the next section. (i.e., PLSA), and our main technical contributions lie in the At a high level, our approach primarily consists of two second stage where we repeatedly exploit semi-supervised stages and an optional third stage: In the first stage, we topic modeling to extract and integrate opinions. We deretrieve only the relevant opinion sentences from C using scribe this step in more detail in the next section. the topic description T as a query. Let CO be the set of all the retrieved relevant opinion sentences. In the second stage, we use probabilistic topic models to cluster sentences 4. SEMI-SUPERVISED PLSA FOR OPINION in CO and obtain S sim , S supp and S extra . When ri has INTEGRATION multiple sentences, we have a third stage, in which we again use information retrieval techniques to align any extracted Probabilistic latent semantic analysis (PLSA) [6] and its representative opinion to a sentence of ri . We now describe extensions [21, 13, 11] have recently been applied to many each of the three stages in detail. text mining problems with promising results. Our work adds The purpose of the first stage is to filter out irrelevant to this line yet another novel use of such models for opinion sentences and opinions in our collection. This can be done integration. by using the topic description as a keyword query to retrieve As in most topic models, our general idea is to use a unirelevant opinion sentences. In general, we may use any regram language model (i.e., a multinomial word distribution) trieval method. In this paper, we used a standard language to model a topic. For example, a distribution that assigns modeling approach (i.e., the KL-divergence retrieval model high probabilities to words such as "iPhone", "battery", "life", [20]). To ensure coverage of opinions, we perform pseudo "hour", would suggest a topic such as "battery life of iPhone." feedback using some top-ranked sentences; the idea is to In order to identify multiple topics in text, we would fit a expand the original topic description query with additional mixture model involving multiple multinomial distributions words related to the topic so that we can further retrieve to our text data and try to figure out how to set the paopinion sentences that do not necessarily match the original rameters of the multiple word distributions so that we can topic description T . After this retrieval stage, we obtain a maximize the likelihood of the text data. Intuitively, if two set of relevant opinion sentences CO . words tend to co-occur with each other and one word is 123 WWW 2008 / Refereed Track: Data Mining - Modeling assigned a high probability, then the other word generally should also be assigned a high probability to maximize the data likelihood. Thus this kind of model generally captures the co-occurrences of words and can help cluster the words based on co-occurrences. In order to apply this kind of model to our integration problem, we assume that each review segment corresponds to a unigram language model which would capture all opinions that can be aligned with a review segment. Furthermore, we introduce a certain number of unigram language models to capture the extra opinions. We then fit the mixture model to CO , i.e., the set of all the relevant opinion sentences generated using information retrieval as described in the previous section. Once the parameters are estimated, they can be used to group sentences into different aspects corresponding to the different review segments and extra aspects corresponding to extra opinions. We now present our mixture model in detail. April 21-25, 2008 · Beijing, China Figure 2: Generation Pro cess of a Word 4.1 Basic PLSA We first present the basic PLSA model as described in [21]. Intuitively, the words in our text collection CO can be classified into two categories (1) background words that are of relatively high frequency in the whole collection. For example, in the collection of topic "iPhone", words like "iPhone", "Apple" are considered as background words. (2) words related to different aspects which we are interested in. So we define k + 1 unigram language models: B as the background model to capture the background words, = {1 , 2 , ..., k } as k theme models, each capturing one aspect of the topic and corresponding to the k review segments r1 , ..., rk . A document d in CO (in our problem it is actually a sentence) can then be regarded as a sample of the following mixture model. jk =1 p(zd,w,j ) p(zd,w,B ) d,j (n+1) p(n+1) (w|j ) 4.2 Semi-supervised PLSA pd (w) = B p(w|B ) + (1 - B ) [d,j p(w|j )] (1) where w is a word, d,j is a document-specific mixing k weight for the j -th aspect ( j =1 d,j = 1), and B is the mixing weight of the background model B . The log-likelihood of the collection CO is logp(CO |) = d CO w V {c(w, d)× (2) log(B p(w|B ) + (1 - B ) k j =1 [d,j p(w |j )]} where V is the set of all the words (i.e., vocabulary), c(w, d) is the count of word w in document d, and is the set of all model parameters. The purpose of using a background model is to "force" clustering to be done based on more discriminative words, leading to more informative and more discriminative theme models. The model can be estimated using any estimator. For example, the Expectation-Maximization (EM) algorithm [3] can be used to compute a maximum likelihood estimate with the following updating formulas: We could have directly applied the basic PLSA to extract topics from CO . However, the extracted topics in this way would generally not be well-aligned to the expert review. In order to ensure alignment, we would like to "force" some of the multinomial distribution component models (i.e., language models) to be "aligned" with all the segments in the expert review. In probabilistic models, this can be achieved by extending the basic PLSA to incorporate a conjugate prior defined based on the expert review segments and using the Maximum A Posterior (MAP) esatimator instead of the Maximum Likelihood estimator as we did in the basic PLSA. Intuitively, a prior defined based on an expert review segment would tend to make the corresponding language model similar to the empirical word distribution in the review segment, thus the language model would tend to attract opinion sentences in CO that are similar to the expert review segment. This ensures the alignment of the extracted opinions with the original review segment. Specifically, we build a unigram language model {p(w|rj )}wV for each review segment rj (j {1, ..., k }) and define a conjugate prior (i.e., a Dirichlet prior) on each multinomial distribution topic model, parameterized as Dir({j p(w|rj )}w V ), where j is a confidence parameter for the prior. Since we use a conjugate prior, j can be interpreted as the "equivalent sample size" which means that the effect of adding the prior would be equivalent to adding j p(w|rj )pseudo counts for word w when we estimate the topic model p(w|j ). Figure 2 illustrates the generation process of a word W in such a semi-supervised PLSA where the prior serves as some "training data" to bias the clustering results. 124 ¥¤£ ©©© $#"! ¡ @985 43 £ (0)(' £§ 85 43 ©©© ©©© 21 ¨ ¨§ (0)(' 75 43 &#% ¦ ¢§ ¢ 6 21BA 5 43 (0)(' C H&QP H IH G F ED ! . . . = = = = (1 - B )d,j p(n) (w|j ) k (n) B p(w|B ) + (1 - B ) j =1 d,j p(n) (w|j ) B p(w|B ) k (n) B p(w|B ) + (1 - B ) j =1 d,j p(n) (w|j ) w c(w, d)p(zd,w,j ) j V w V c(w , d)p(zd,w ,j ) d CO c(w , d)p(zd,w,j ) w d , V CO c(w , d)p(zd,w j ) (n) WWW 2008 / Refereed Track: Data Mining - Modeling The prior for all the parameters is given by p() k+m w j =1 V April 21-25, 2008 · Beijing, China 4.3.1 (3) Theme Extraction from Text Collection p(w|j )j p(w|rj ) Generally we have m > 0, because we may want to find extra opinion topics other than the corresponding segments in the expert review. So we set j = 0 for k < j k + m. With the prior defined above, we can then use the Maximum A Posterior (MAP) estimator to estimate all the parameters as follows ^ = arg max p(CO |)p() (4) The MAP estimate can be computed using essentially the same EM algorithm as presented above with only slightly different updating formula for the component language models. The new updating formula is: d c(w, d)p(zd,w,j ) + j p(w|rj ) (n+1) w CO d p(w|j ) = c(w , d )p(zd ,w ,j ) + j V C O (5) We can see that the main difference between this equation and the previous one for basic PLSA is that we now pool the counts of terms in the expert review segment with those from the opinion sentences in CO , which is essentially to allow the expert review to serve as some training data for the corresponding opinion topic. This is why we call this model semi-supervised PLSA. If we are highly confident of the aspects captured in the prior, we could empirically set a large j . Otherwise, if we need to ensure the impact of the prior without being over restricted by the prior, some regularized estimation techniques are necessary. Following the similar idea of regularized estimation [19], we define a decay parameter and a prior weight µj as µj = w V We start from a topic T , a review R = {r1 , ..., rk } of k segments, a collection CO = {d1 , d2 , ..., dN } of opinion sentences closely relevant to T . We assume that CO covers a number of themes each about one aspect of the topic T . We further assume that there are k + m ma jor themes in the collection, {1 , 2 , ..., k+m }, each being characterized by a multinomial distribution over all the words in our vocabulary V (also known as a unigram language model or a topic model). We propose to use review aspects as priors in the partition of CO into aspects. We could have used the whole expert review segment to construct the priors. But if so, we could only get the opinions that are most similar to the review opinions. However, we would like extract not only opinions supporting the review opinions but also supplementary opinions on the review aspect. So we use only the "aspect words" to estimate the prior. We use a simple heuristic: opinions are usually expressed in the form of adjectives, adverbs and verbs while aspect words are usually nouns. And we apply a Part-of-Speech tagger1 on each review segment ri and further filter out the opinion words to get a ri . The prior {p(w|ri )}wV is estimated by Maximum Likelihood: p(w|ri ) = w c( w , r i ) ,i V c(w r ) (9) Given these priors constructed from the expert review {p(w|ri )}wV , i {1, ..., k }, we could estimate the parameters for the semi-supervised topic model according to Section 4.2. After that, we have a set of theme models extracted from the text collection {i |i = 1, ..., k + m}, and we could group each sentence di in CO into one of the k + m themes by choosing the theme model with the largest probability of generating di : arg max p(di |j ) = arg max j j w V c(w, di )p(w|j ) (10) d C O j c(w , d )p(zd ,w ,j ) + j (6) So we could start from a large j (say 5000) (i.e., starting with perfectly alignable opinion models) and gradually decay it in each EM iteration by equation 7, and we stop the decaying of j until the weight of the prior µj is below some threshold (say 0.5). Decaying allows the model to gradually pick up words from CO . The new updating formulas are (n) j j (n) If we define g (di ) = j if di is grouped into {p(w|j )}wV , then we have a partition of CO : CO = {Si |i = 1, ..., k + m} (11) where each Si is a set of sentences Si = {dj |g (dj ) = i, dj CO } with the following two properties: CO = Si Sj = k+m i =1 Si (12) (13) (n+1) j if µj > (7) if µj (n+1) ,j = i, j {1, ..., k + m} , i = j d p(w|j ) (n+1) = w C O V c(w, d)p(zd,w,j ) + j d c(w , d )p(zd ,w C O p(w|rj ) (n+1) ) + j Thus each Si , i = 1, ..., k , corresponds to the review aspect ri and each Sj , j = k + 1, ..., k + m, is the set of sentences that supplements the expert review with additional aspects. Parameter m, the number of additional aspects, is set empirically. (8) 4.3.2 Further Separation of Opinions 4.3 Overall Process In this section, we describe how we use the semi-supervised topic model to achieve three tasks in the second stage as defined in Section 3. We also summarize the computational complexity of the whole process. In this subsection, we show that how we further partition each Si , i = 1, ..., k into two parts: s s Si = {Si im , Si upp } 1 (14) http://l2r.cs.uiuc.edu/~ogcomp/asoftware.php?skey=LBPPOS c 125 WWW 2008 / Refereed Track: Data Mining - Modeling s such that Si im contains sentences that is similar to the s opinions in the review while Si upp is a set of sentences that supplement the review opinions on the review aspect ri . We assume that each subset of sentences Si , i = 1, ..., k , covers two themes captured by two subtopic models s s {p(w|i im )}wV and {p(w|i upp )}wV . We first construct a unigram language model {p(w|ri )}wV from review segment ri using both the feature words and opinion words. s This model is used as a prior for extracting {p(w|i im )}wV . After that, we estimate the model parameters as described in Section 4.2. And then, we could classify each sentence s s dj Si into either Si im or Si upp in the way similar to equation 10. April 21-25, 2008 · Beijing, China case is O(I · 2 · (k|V | + |W | + |CO |)) = O(I · (k|V | + |W |)). Finally, "Generation of Summaries" makes 2k + m invocations of PLSA, each on a subset of the collection P s s s s {S1 im , ..., Skim }{S1 upp , ..., Skupp }{Sk+1 , ..., Sk+m } = CO . In each invocation, the number of clusters is |P | , and WP c is the total number of words in P . So the total complexity P in this stage is O( I · |P | (|V | + |WP | + |P |)), which in c I the worst case is O( c · (|CO | · |V | + |CO | · |W | + |CO |2 )) = O( I · |CO | · |W |). c Thus, our whole process is bounded by the computational complexity O(I · ((k + m + 1)|W | + k|V | + |CO |c·|W | )). Since k, m, and c are usually much smaller than |CO |, the running time is basically bounded by O(I · |CO | · |W |). 4.3.3 Generation of Summaries So far, we have a meaningful partition over CO : s s s s CO = {S1 im , ..., Skim } {S1 upp , ..., Skupp } {Sk+1 , ..., Sk+m } (15) Now we need to further summarize each block P in the s s s s partition P {S1 im , ..., Skim } {S1 upp , ..., Skupp } {Sk+1 , ..., Sk+m } by extracting representative opinions RO (P ). We take a two-step approach. In the first step, we try to remove the redundancy of sentences in P and group the similar opinions together by unsupervised topic modeling. In detail, we use PLSA (without any prior) to do the clustering and set the number of clusters proportional to the size of P . After the clustering, we get a further partition of P = {P1 , ..., Pl } where l = |P |/c and c is a constant parameter that defines the average number of sentences in each cluster. One representative sentence in Pi is selected by the similarity between the sentence and the cluster centroid (i.e. a word distribution) of Pi . If we define rsi as the representative sentenced of Pi , and i = |Pi | as the support, we have a representative opinion of Pi which is oi = (i , rsi ). Thus RO(P ) = {o1 , o2 , ..., ol }. In the second step, we aim at providing some context information for each representative opinion oi of P to help the user to better understand the opinion expressed. What we propose is to compare the similarity between opinion sentence rsi and each review sentence in segment corresponding to P and assign rsi to the review sentence with the highest similarity. For both steps, we use KL-Divergence as the similarity measure. 5. EXPERIMENTAL RESULTS In this section, we first introduce the data sets used in the experiment. Then we demonstrate the effectiveness of our semi-supervised topic modeling approach by showing two examples in two different scenarios. Finally, we also provide some quantitative evaluation. 5.1 Data Sets Source CNET wikipedia # of words 4434 312 # of asp ects 19 14 Topic Desc. iPhone Barack Obama Table 1: Basic Statistics of the REVIEW data set Topic Desc. iPhone Barack Obama Query Terms iPhone Barack+Obama # of articles 552 639 N 3000 1000 Table 2: Basic Statistics of the BLOG data set We need two types of data sets for evaluation. One type is expert reviews. We construct this data set by leveraging the existing services provided by CNET and wikipedia, i.e., we submit queries to their web sites and download the expert reviews on "iPhone" written by CNET editors2 and the introduction part of articles about "Barack Obama" in wikipedia3 . The composition and basic statistics of this data set (denoted as "REVIEW") is shown in Table 1. The other type of data is a set of opinion sentences related to certain topic. In this paper, we only use Weblog data, but our method can be applied on any kind of data that contain opinions in free text. Specifically, we firstly submit topic description queries to Google Blog Search4 and collect the blog entries returned. The search domain are restricted to spaces.live.com, since schema matching is not our focus. We further build a collection of N opinion sentences CO = {d1 , d2 , ..., dN } which are highly relevant to the given topic T using information retrieval techniques as described as the first stage in Section 3. The basic information of these collections (denoted as "BLOG" is shown in Table 2. For all the data collections, Porter stemmer [18] is used to stem the text and stop words in general English are removed. http://reviews.cnet.com/smart-phones/apple-iPhone-8gbat/4505-6452 7-32309245.html?tag=pdtl-list 3 http://en.wikipedia.org/wiki/Barack Obama 4 http://blogsearch.google.com 2 4.3.4 Computational Complexity PLSA and semi-supervised PLSA have the same complexity: O(I · K (|V | + |W | + |C |)), where I is the number of EM iterations, K is the number of themes, |V | is the vocabulary size, |W | is the total number of words in the collection, |C | is the number of documents. Our whole process makes multiple invocations of PLSA/semi-supervised PLSA, and we suppose we use the same I across different invocations. "Theme Extraction from Text Collection" makes one invocation of semi-supervised PLSA on the whole collection CO , where the number of cluster is k + m. So the complexity is O(I · (k + m) · (|V | + |W | + |CO |) = O(I · (k + m) · |W |). There are k invocations of semi-supervised PLSA in "Further Separation of Opinions", each on a subset of the collection Si (i = 1, ..., k ) with only two clusters. And we know k k+m from equation 11 that i=1 Si i=1 Si = CO . Suppose WSi is the total number of words in Si . So the total comS plexity is O( I · 2 · (|V | + |WSi | + |Si |)) which in the worst i 126 WWW 2008 / Refereed Track: Data Mining - Modeling April 21-25, 2008 · Beijing, China by the mention of the price of iPhone in the background aspect. The next two aspects with highest support are "Bluetooth and Wireless" and "Activation" both with support 101. As stated in the iPhone review "The Wi-Fi compatibility is especially welcome, and a feature that's absent on far too many smart phones.", and our support statistics suggest that people do comment a lot about this unique feature of iPhone. "Activation" is another hot aspect as discovered by our method. As many people know, the activation of iPhone requires a two-year contract with AT&T, which brings much controversy among customers. In addition, we show three of the most supported representative opinions in the extra aspects in Table 4. The first sentence points out another way of activating iPhone, while the second sentence brings up the information that Cisco was the original owner of the trademark "iPhone". The third sentence expresses a opinion in favor of another smartphone, Nokia N95, which could be useful information for a potential smartphone buyer who did not know about Nokia N95 before. 5.2 Scenario I: Product Gathering opinions on products is the main focus of the research on opinion mining, so our first example of opinion integration is a hot product, iPhone. There are 19 defined segments in the "iPhone" review of the REVIEW data set. We use these 19 segments as aspects from the review and define 11 extra aspects in the semi-supervised topic model. Due to the limitation of the spaces, only part of the integration with review aspects are show in Table 3. We can see that there is indeed some interesting information discovered. · In the "background" aspect (which corresponds to the background introduction part of the expert review), we see that lots of people care about the price of iPhone, and the sentences extracted from blog articles show different pricing information which confirms the fact that the price of iPhone has been adjusted. In fact, the first two sentences only mention the original price while the third sentence talks about the cut down of the price but the actual numbers are incorrect. · The displayed sentence in the "activation" aspect describes the results if you do not activate the iPhone. A piece of very interesting information related to this aspect, "unlocking the iPhone" is never mentioned in the expert review but is extracted from blog articles by using our semi-supervised topic modeling approach. Indeed, we know that "unlock" or "hack" is a hot topic since the iPhone hit the market. This is a good demonstration that our approach is able to discover information which is highly related and supplementary to the review. · The last aspect shown is about battery life. There is a high support (support = 19 in the column of similar opinions) of the life of battery described in the review, and there is another supplementary set of sentences (support = 7) which gives a concrete number of battery in hours under real usage of iPhone. 160 140 134 120 101 100 80 60 40 20 0 kg ro un Dd es ig Dn is pl To M ay Ex uch enu s te rio scr Bl r f ee ue ea n to tu M oth Fe res es sa and atu re gi ng wir s an ele iP d e ss ho -m Sa ne' ail fa s iP ri br od ow Yo se uT r Vi u su W b e al idg vo e ic ts e m C ail a C me al Br r ow l qu a se ali r s ty p Ac ee tiv d at io Ba n tte ry 5.3 Scenario II: Political Figure If we want to know more about a political figure, we could treat a short biography of the person as an expert review and apply our semi-supervised topic model. In this subsection, we demonstrate what we can achieve by an example of "Barack Obama". There is no definition of segments in the short introduction part in wikipedia, so we just treat each sentence as a segment. In Table 5, we display part of the opinion integration with the 14 aspects in the review. Since there is no short description of each aspect in this example, we use ID in the first column of the table to distinguish one aspect from another. · Aspect 0 is a brief introduction of the person and his position, which attracts many sentences in the blog articles some directly confirming the information provided in the review, some also suggest his position while stating other facts. · Aspect 1 and 3 talk about his heritage and early life, and we further discover from the blog articles supplementary information such as his birthplace is Honolulu, his parents' names are Barack Hussein Obama Sr. and Ann Dunham, and even why his father came to the US. · For aspect 10 about his presidential candidacy, our summaries not only confirm the fact but also point out another democratic presidential candidate Hillary Clinton. · A brief description of his family is in review aspect 12, and the mention of his daughters has attracted a piece of news related to young daughters of White House aspirants. After further summing up the support for each aspect, we display two of the most supported aspects and one least supported aspect in Table 6. The most supported aspect is aspect 0 with S upport = 68, which as mentioned above is a brief introduction of the person and his position. Aspect 2 talking about his heritage ranks as the second with S upport = 36, which agrees with the fact that he is special among the presidential candidates because of his Kenyan 87 74 50 79 70 73 70 63 95 74 53 68 57 55 73 101 60 ba c Figure 3: Supp ort Statistics for iPhone Asp ects Furthermore, we may also want to know which aspects of iPhone people are most interested in. If we define the support of an aspect as the sum of the support of representative opinions in this aspect, we could easily get the support statistics for each review aspects in our topic modeling approach. As can be seen in Figure 3, the "background" aspect attracts the most discussion. This is mainly caused 127 WWW 2008 / Refereed Track: Data Mining - Modeling Aspect Review Similar Opinions April 21-25, 2008 · Beijing, China Supplementary Opinions [supp ort=19]The iPhone will come in two versions, a 4GB 499 model, and an 8GB 599 model with a two year contract. [supp ort=16]The Price: 499 (4GB) or 599(8GB) with a two year contract , by the time the contract is over your iPhone will probably be scratched all over like the Nano or be made obsolete by better phone on the market. [supp ort=12]Recently, Apple decided to cut down price of iPhone from 399 to 200 , giving rise to much rage from consumers b ought the phone before. [supp ort=10]Several other methods for unlocking the iPhone have emerged on the Internet in the past few weeks, although they involve tinkering with the iPhone hardware or more complicated ways of bypassing the protections for AT T's exclusivity. [supp ort=7]Playing relatively high bitrate VGA H.264 videos, our iPhone lasted almost exactly 9 freaking hours of continuous playback with cell and WiFi on (but Bluetooth off ). Background Even with the new $399 price for the 8GB model (down from an original price of $599), it's still a lot to ask for a phone that lacks so many features and locks you into an iPhone-specific two-year contract with AT&T. Activation You can make emergency calls, but you can't use any other functions, including the iPod music player. Battery life The Apple iPhone has a rated battery life of 8 hours talk time, 24 hours of music playback, 7 hours of video playback, and 6 hours on Internet use. [supp ort=19] iPhone will Feature Up to 8 Hours of Talk Time, 6 Hours of Internet Use, 7 Hours of Video Playback or 24 Hours of Audio Playback Battery Table 3: iPhone Example: Opinion Integration with Review Asp ects Supplementry Opinions on Extra Asp ects [supp ort=15]You may have heard of iASign (http: iphone.fiveforty.net wiki index.php IASign), an iPhone Dev Wiki tool that allows you to activate your phone without going through the iTunes rigamarole. [supp ort=13]Cisco has owned the trademark on the name "iPhone" since 2000, when it acquired InfoGear Technology Corp., which originally registered the name. [supp ort=13]With the imminent availability of Apple's uber cool iPhone, a look at 10 things current smartphones like the Nokia N95 have been able to do for a while and that the iPhone can't currently match... Table 4: iPhone Example: Opinion Integration on Extra Asp ects ID 0 1 12 Review Barack Hussein Obama (born August 4, 1961) is the junior United States Senator from Illinois and a member of the Democratic Party. Born to a Kenyan father and an American mother, Obama grew up in culturally diverse surroundings. He married in 1992 and has two daughters. Supp ort 68 36 3 human consensus on this task and how our approach could recover the choice of human. User Our Approach User 1 User 2 User 3 Sentence ID of the 7 sentences 2, 6, 9, 21, 22, 25, 30 1, 6, 9, 13, 16, 25, 30 9, 11, 16, 20, 21, 30, 31 2, 6, 8, 9, 24, 25, 31 Table 6: Obama Example: Supp ort of Aspects Table 7: Selection of 7 Sentences on Extra Asp ects origin and indicates that people are interested in it. The least covered aspect is aspect 12 about his family, since the total support is only 3. Table 7 displays the selection of the seven sentences on extra aspects by our method and the three users. The only sentence out of seven that all three users agree on is sentence number 9, which suggests that grouping sentences into extra aspects is quite a sub jective task so it is difficult to produce results satisfactory to each individual user. However our method is able to recover 52.4% of the user's choices on average. In the second task, we try to evaluate the performance of our approach in grouping sentences into k review aspects. we s s randomly permutate all the sentences in {S1 im , ..., Skim } supp supp {S1 , ..., Sk } to construct a Sreview and remove the aspect assigned to each sentence. For each of the 27 sentences, the users are asked to assign one of the 14 review aspects to it. In essence, this is a multi-class classification problem where the number of classes is 14. The results turn out to be · Three users agree on 13 sentences about the class label, which means that more than half of the sentences are controversial even among human users. · On average, our method could recover the user's choices by 10.67 sentences out of 27. Note that if we randomly 5.4 Quantitative Evaluation In order to quantitatively evaluate the effectiveness of our semi-supervised topic modeling approach, we designed a test which consists of three tasks, each asks a user to perform a part of our processing. The main goal is to see to what extent can our approach reproduce the human choice. The test is designed based on the above-mentioned "Barack Obama" example. In order to reduce the bias, we collect the evaluation results from three users, who are all PhD students in our department, two males and one female. In the first designed task, we aims at evaluating the effectiveness of our approach in identifying the extra aspects in addition to review aspects. Towards this goal, we generate a big set of sentences Sall by mixing all the sentences s s s s in {S1 im , ..., Skim } {S1 upp , ..., Skupp } with seven most supported sentences in {Sk+1 , ..., Sk+m }. There are |Sall | = 34 sentences in Sall in total. The users are asked to select seven sentences from randomly permutated Sall that do not fit into the k review aspects. In this way, we could see how is the 128 WWW 2008 / Refereed Track: Data Mining - Modeling ID Review Similar Opinions April 21-25, 2008 · Beijing, China Supplementary Opinions [supp ort=21]Barack Obama, another leading Democratic presidential hopeful, campaigns for more dollars with "Dinner With Barack." [supp ort=11]A Chicago, Illinois, radio station recently conducted a live survey on a man called Barack Obama. [supp ort=10]In fact, there is not a single metropolitan area in the country where a family earning minimum wage can afford decent housing, said Senator Barack Obama. [supp ort=16]Barack Obama is an African American whose father was born in Kenya and got a sholarship to study in American. [supp ort=12]Obama was born in Honolulu, Hawaii, to Barack Hussein Obama Sr., a Kenyan, and Kansas born Ann Dunham. 0 Barack Hussein Obama (born August 4, 1961) is the junior United States Senator from Illinois and a member of the Democratic Party. [supp ort=9]Senator Barack Hussein Obama is the junior United States Senator from Illinois and a member of the Democratic Party . 1 3 The U.S. Senate Historical Office lists him as the fifth African American Senator in U.S. history and the only African American currently serving in the U.S. Senate. He lived for most of his childhoo d in the ma jority-minority U.S. state of Hawaii and spent four of his pre-teen years in the multi-ethnic Indonesian capital city of Jakarta. He is among the Democratic Party's leading candidates for nomination in the 2008 U.S. presidential election. [supp ort=2]Mr Obama will contest the Democrat presidential nomination 10 12 He married in 1992 and has two daughters. [supp ort=14](AP) Democratic presidential candidate Barack Obama said Sunday that the front runner for his party's nomination, Hillary Rodham Clinton, does not offer the break from politics as usual that voters need. [supp ort=3]MARCH 4 Senator Barack Obama is threatening legal action against a self describ-ed pedophile who has posted photos of the Democratic politician's young daughters on a web site that purports to handicap the 2008 presidential campaign by evaluating the "cuteness" of underage daughters and granddaughters of White House aspirants Table 5: Obama Example: Opinion Integration with Review Aspects assign one aspect out of 14, (1) the probability of recovering k sentences out of 27 is 2 × 7 prk × (1 - pr)27-k k 1 where pr = 14 . When k = 10, the probability is only around 0.00037; (2) the expected number of sentences recovered would be 2 × k27 7 prk × (1 - pr)27-k = 1 k =0 the choices of human users. Among the different choices between our method and the users, only one aspect has achieved consensus of three users. That is to say, this is a "true" mistake of our method, while other mistakes do not have agreement in the users. 6. RELATED WORK · Our method and all three users assigned the same label to 8 sentences. · Among the many mistakes our method made, three users only agree on 5 sentences. In other words, they assigned the same label to the 5 sentences which is different the label assigned by our method. Again, this task is sub jective, and there is still much controversy among human users. But our approach performs reasonably : in the 13 sentences with human consensus, our method achieves the accuracy of 61.5%. In the third task, our goal is to see how well we can separate similar opinions from supplementary opinions in the semi-supervised topic modeling approach. We first select 5 review aspects out of 14 which our method has identified both similar and supplementary opinions; then for each of the 5 aspects, we mix one similar opinion with several supplementary opinions; the users are supposed to select one sentence which share the most similar opinion with the review aspect. On average, our method could recover 60% of To the best of our knowledge, no previous study has addressed the problem of integrating a well-written expert review with opinions scattering in text documents. But there are some related studies which we will briefly review in this section. Recently there has been a lot of work in opinion mining and summarization especially on customer reviews. In [2], sentiment classifiers are built from some training corpus. Some papers [8, 7, 10, 17] further mine product features from reviews on which the reviewers have expressed their opinions. Zhuang and others focused on movie review mining and summarization [22]. [4] presented a prototype system, named Pulse, for mining topics and sentiment orientation jointly from customer feedback. However, these techniques are limited to the domain of products/movies, and many are highly dependent on the training data set, so are not generally applicable to summarize opinions about an arbitrary topic. Our problem setup aims at shallower but more robust integration. Weblogs mining has attracted many new research work. Some focus on sentiment analysis. Mishne and others used the temporal pattern of sentiments to predict the book sales [14, 15]. Opinmind[16] summarizes the weblog search results with positive and negative categories. On the other hand, researchers also extract the subtopics in weblog collections, 129 WWW 2008 / Refereed Track: Data Mining - Modeling and track their distribution over time and locations [12]. Last year, Mei and others proposed a mixture model to model both facets and opinions at the same time [11]. These previous work aims at generating sentiment summary for a topic purely based on the blog articles. We aim at aligning blog opinions to an expert review. We also take a broader definition of opinions to accommodate the integration of opinions for an arbitrary topic. Topic model has been widely and successfully applied to blog articles and other text collections to mine topic patterns [5, 1, 21, 9]. Our work adds to this line yet another novel use of such models for opinion integration. Furthermore, we explore a novel way of defining prior. April 21-25, 2008 · Beijing, China [5] T. Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intel ligence, UAI'99, Stockholm. [6] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of SIGIR '99, pages 50­57. [7] M. Hu and B. Liu. Mining and summarizing customer reviews. In KDD, pages 168­177. [8] M. Hu and B. Liu. Mining opinion features in customer reviews. In AAAI, pages 755­760. [9] W. Li and A. McCallum. Pachinko allocation: Dag-structured mixture models of topic correlations. In ICML '06: Proceedings of the 23rd international conference on Machine learning, pages 577­584. [10] B. Liu, M. Hu, and J. Cheng. Opinion observer: analyzing and comparing opinions on the web. In WWW '05: Proceedings of the 14th international conference on World Wide Web, pages 342­351. [11] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. Topic sentiment mixture: Modeling facets and opinions in weblogs. In Proceedings of the World Wide Conference 2007, pages 171­180. [12] Q. Mei, C. Liu, H. Su, and C. Zhai. A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In WWW '06: Proceedings of the 15th international conference on World Wide Web, pages 533­542. [13] Q. Mei and C. Zhai. A mixture model for contextual text mining. In KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Know ledge discovery and data mining, pages 649­655. [14] G. Mishne and M. de Rijke. MoodViews: Tools for blog mood analysis. In AAAI 2006 Spring Symposium on Computational Approaches to Analysing Weblogs (AAAI-CAAW 2006), pages 153­154. [15] G. Mishne and N. Glance. Predicting movie sales from blogger sentiment. In AAAI 2006 Spring Symposium on Computational Approaches to Analysing Weblogs (AAAI-CAAW 2006). [16] Opinmind. http://www.opinmind.com. [17] A.-M. Popescu and O. Etzioni. Extracting product features and opinions from reviews. In HLT '05: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 339­346. [18] M. F. Porter. An algorithm for suffix stripping. pages 313­316, 1997. [19] T. Tao and C. Zhai. Regularized estimation of mixture models for robust pseudo-relevance feedback. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 162­169. [20] C. Zhai and J. Lafferty. Model-based feedback in the language modeling approach to information retrieval. In Proceedings of CIKM 2001, pages 403­410. [21] C. Zhai, A. Velivelli, and B. Yu. A cross-collection mixture model for comparative text mining. In Proceedings of KDD '04, pages 743­748. [22] L. Zhuang, F. Jing, and X.-Y. Zhu. Movie review mining and summarization. In CIKM '06: Proceedings of the 15th ACM international conference on Information and know ledge management, pages 43­50. 7. CONCLUSIONS In this paper, we formally defined a novel problem of opinion integration which aims at integrating opinions expressed in a well-written expert review with those in various Web 2.0 sources such as Weblogs to generated an aligned integrated opinion summary. We proposed a new opinion integration method based on semi-supervised probabilistic topic modeling. With this model, we could automatically generate an integrated opinion summary that consists of (1) supporting opinions with respect to different aspects in the expert review; (2) opinions supplementary to those in the expert review but on the same aspect; and (3) opinions on extra aspects which are not even mentioned in the expert review. We evaluate our model on integrating opinions about two quite different topics (a product and a political figure) and the results show that our method works well for both topics. We are also planning to evaluate our method more rigorously. Since integrating and digesting opinions from multiple sources are critical in many tasks, our method can be applied to develop many interesting applications in multiple domains. A natural future research direction would be to address the more general setup of the problem ­ integrating opinions in arbitrary text collections with a set of expert reviews instead of a single expert review. 8. ACKNOWLEDGMENTS This work was in part supported by the National Science Foundation under award numbers 0425852, 0428472, and 0713571. We thank the anonymous reviewers for their useful comments. 9. REFERENCES [1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993­1022, 2003. [2] D. Dave and S. Lawrence. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. [3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of Royal Statist. Soc. B, 39:1­38, 1977. [4] M. Gamon, A. Aue, S. Corston-Oliver, and E. K. Ringger. Pulse: Mining customer opinions from free text. In IDA, volume 3646 of Lecture Notes in Computer Science, pages 121­132, 2005. 130