An Iterative Reinforcement Approach for Fine-Grained Opinion Mining

Weifu Du
Harbin Institute of Technology, Harbin, China
duweifu@software.ict.ac.cn

Songbo Tan
Institute of Computing Technology, Beijing, China
tansongbo@software.ict.ac.cn

Abstract

With the in-depth study of sentiment analysis, fine-grained opinion mining, which aims to detect opinions on individual review features rather than at the whole-review level, has recently been receiving more and more attention in the sentiment analysis research community. Most existing approaches rely mainly on template extraction to identify explicit relatedness between product features and opinion terms. This is insufficient for detecting implicit review features and mining the hidden sentiment associations in reviews, i.e. cases in which (1) the review features do not appear explicitly in the review sentences, and (2) they can nevertheless be deduced from the opinion words in their context. From an information-theoretic point of view, this paper proposes an iterative reinforcement framework based on an improved information bottleneck algorithm to address this problem. More specifically, the approach clusters product features and opinion words simultaneously and iteratively by fusing both their semantic information and their co-occurrence information. The experimental results demonstrate that our approach outperforms template-extraction-based approaches.

1 Introduction

In the Web 2.0 era, the Internet has turned from a static information medium into a platform for dynamic information exchange, on which people can express their views and show their individuality. More and more people are willing to record their feelings (blogs), give voice to public affairs (news reviews), express their likes and dislikes about products (product reviews), and so on. As the volume of sentiment information available on the Internet continues to increase, there is growing interest in helping people better find, filter, and manage these resources.

Automatic opinion mining (Turney et al., 2003; Ku et al., 2006; Devitt et al., 2007) can play an important role in a wide variety of more flexible and dynamic information management tasks. For example, with the help of a sentiment analysis system, administrators in the field of public administration can receive feedback on a policy in a more timely manner; in the field of business, manufacturers can perform more targeted updates on products to improve the consumer experience.

Research on opinion mining began in 1997. Early work mainly focused on the polarity of opinion words (Hatzivassiloglou et al., 1997) and treated text-level opinion mining as classifying a text as either positive or negative based on the number of positive or negative opinion words it contains (Turney et al., 2003; Pang et al., 2002; Zagibalov et al., 2008). As the study of opinion mining deepened, researchers pursued more accurate results: sentiment summarization (Philip et al., 2004; Hu et al., KDD 2004), domain transfer for sentiment analysis (Kanayama et al., 2006; Tan et al., 2007; Blitzer et al., 2007; Tan et al., 2008; Andreevskaia et al., 2008; Tan et al., 2009), and fine-grained opinion mining (Hatzivassiloglou et al., 2000; Takamura et al., 2007; Bloom et al., 2007; Wang et al., 2008; Titov et al., 2008) became the main branches of opinion mining research.

In this paper, we focus on fine-grained (feature-level) opinion mining. For many applications (e.g.
the task of public affairs review analysis and product review analysis), simply judging the sentiment orientation of a review unit is not sufficient. Researchers (Kushal, 2003; Hu et al., KDD 2004; Hu et al., AAAI 2004; Popescu et al., 2005) began to work on finer-grained opinion mining, which predicts the sentiment orientation with respect to different review features. This task is known as feature-level opinion mining.

In feature-level opinion mining, most existing research associates product features and opinion words by their explicit co-occurrence. The template-extraction-based method (Popescu et al., 2005) and the association-rule-mining-based method (Hu et al., AAAI 2004) are representative examples. These approaches perform well at identifying review features that appear explicitly in reviews; however, real reviews from customers are usually more complicated. In some cases, the review features are implicit in the review sentences but can be deduced from the opinion words in their context. The detection of such hidden sentiment associations is a major challenge in feature-level opinion mining on Chinese reviews due to the nature of the Chinese language (Qi et al., 2008). Obviously, neither the template-extraction-based method nor the association-rule-mining-based method is effective in such cases. Moreover, even when the review features do appear explicitly in the review sentences, the co-occurrence information between review features and opinion words is often too quantitatively sparse to be utilized. We therefore consider whether it is more sensible to cluster review features and opinion words into groups in order to mine the implicit or hidden sentiment associations in the reviews.

The conventional approach would cluster the two types of objects separately, which neglects their strong interrelationship. To address this problem, we propose an iterative reinforcement framework under which we cluster product features and opinion words simultaneously and iteratively by fusing both their semantic information and their sentiment link information. We take an improved information bottleneck algorithm (Tishby, 1999) as the kernel of the proposed framework.

The information bottleneck approach was presented by Tishby (1999). Its basic idea is to treat clustering problems from an information-compression point of view, as an instance of a much more fundamental problem: what are the features of the variable X that are relevant for the prediction of another, relevance, variable Y? In information-theoretic terms, the problem can be formulated as follows: find a compressed representation of the variable X, denoted C, such that the mutual information between C and Y is as high as possible, under a constraint on the mutual information between X and C. In our case, taking hotel reviews as an example, X is one type of object, either review features (e.g. facilities, service, surrounding environment) or opinion words (e.g. perfect, circumspect, quiet), and Y is the other. Given some review features (or opinion words) obtained from a review corpus, we want to assemble them into categories while preserving as much information as possible about the opinion words (or review features).
The information bottleneck algorithm has several benefits: (1) it treats the trade-off between precision and complexity of the clustering model through rate distortion theory, a subfield of information theory; (2) it defines the "distance" or "similarity" between objects in a well-founded way based on mutual information. The efficiency of the information bottleneck algorithm (Slonim and Tishby, 2000) motivates us to take it as the kernel of our framework. As far as we know, this approach has not previously been employed in opinion mining.

In the traditional information bottleneck approach, the distance between two data objects is measured by the Jensen-Shannon divergence (Lin, 1991), which measures the divergence between two probability distributions. We alter this measure to integrate more semantic information, as illustrated in detail in the following sections, and the experimental results show the effectiveness of this alteration.

It is worthwhile to highlight several aspects of our work here:

- We propose an iterative reinforcement framework, under which review feature words and opinion words are organized into categories in a simultaneous and iterative manner.
- In the process of clustering, semantic information and co-occurrence information are integrated.
- The experimental results on real Chinese web reviews demonstrate that the proposed method outperforms the template-extraction-based algorithm.

2 Proposed Algorithm

2.1 The Problem

In product reviews, opinion words are used to express the opinion, sentiment, or attitude of reviewers. Although some review units may express general opinions toward a product, most review units concern specific features of the product. A product is always reviewed under a certain feature set F. Suppose we have obtained a lexical list O that includes all the opinion expressions and their sentiment polarities. For feature-level opinion mining, identifying the sentiment association between F and O is essential. The key steps in the whole process are as follows:

- get the opinion word set O (with polarity labels);
- get the product feature set F;
- identify the relationships between F and O.

The focus of this paper is on the latter two steps, especially the case of hidden sentiment association, in which the review features are implicit in the review sentences but can be deduced from the opinion words in their context.

In contrast to existing explicit adjacency approaches, the proposed approach detects the sentiment association between F and O based on review feature categories and opinion word groups obtained from the review corpus. To this end, we first consider two sets of association objects: the set of product feature words F = {f1, f2, ..., fm} and the set of opinion words O = {o1, o2, ..., on}. A weighted bipartite graph can be built from F and O, denoted by G = {F, O, R}. Here R = [rij] is the m x n link weight matrix containing all the pair-wise weights between sets F and O. The weights can be calculated with different weighting schemes; in this paper, we set rij to the co-appearance frequency of fi and oj at the clause level, as shown in the sketch below. We take F and O as two random variables, and the problem of constructing or clustering the object groups can be defined as finding a compressed representation of each variable that preserves as much information as possible about the other variable.
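To make the construction of R concrete, the following is a minimal sketch in Python. It is an illustration under our own assumptions, not the paper's implementation: the clause segmentation on punctuation and the English toy vocabulary (the paper works on Chinese reviews) are ours.

```python
import re
import numpy as np

def cooccurrence_matrix(reviews, features, opinions):
    """Build the m x n link weight matrix R, where r_ij counts how often
    feature f_i and opinion word o_j co-appear within the same clause."""
    f_idx = {f: i for i, f in enumerate(features)}
    o_idx = {o: j for j, o in enumerate(opinions)}
    R = np.zeros((len(features), len(opinions)))
    for review in reviews:
        # Split into clauses on common clause-level punctuation
        # (illustrative; the paper does not specify its segmenter).
        for clause in re.split(r"[,;.!?]", review):
            found_f = [f for f in features if f in clause]
            found_o = [o for o in opinions if o in clause]
            for f in found_f:
                for o in found_o:
                    R[f_idx[f], o_idx[o]] += 1
    return R

# Toy usage: two review units, two features, two opinion words.
reviews = ["the service is perfect, the room is quiet",
           "quiet room, perfect service"]
R = cooccurrence_matrix(reviews, ["service", "room"], ["perfect", "quiet"])
print(R)  # [[2. 0.]
          #  [0. 2.]]
```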
Taking F as an example, we want to find its compressed representation, denoted C, such that the mutual information between C and O is as high as possible, under a constraint on the mutual information between F and C. We propose an iterative reinforcement framework to deal with these tasks. An improved information bottleneck algorithm is employed in this framework, which is illustrated in detail in the following sections.

2.2 Information Bottleneck Algorithm

The information bottleneck method (IB) was presented by Tishby et al. (1999). According to Shannon's information theory (Cover and Thomas, 1991), for two random variables X and Y, the mutual information I(X;Y) between them is given by the symmetric functional:

$$I(X;Y) = \sum_{x \in X,\, y \in Y} p(x)\, p(y|x) \log \frac{p(y|x)}{p(y)} \quad (1)$$

The mutual information measures the relative entropy between the joint distribution p(x, y) and the product of the marginal distributions p(x)p(y), and it is the only consistent statistical measure of the information that variable X contains about variable Y (and vice versa). Roughly speaking, some of the mutual information is lost in the process of compression, i.e. $I(C,Y) \le I(X,Y)$, where C is a compressed representation of X. This representation is defined through a (possibly stochastic) mapping between each value $x \in X$ and each representative value $c \in C$. Formally, this mapping can be characterized by a conditional distribution p(c|x), inducing a soft partitioning of the values of X; specifically, each value of X is associated with all the codebook elements (values of C) with some normalized probability.

The IB method is based on the following simple idea. Given the empirical joint distribution of two variables, one variable is compressed so that the mutual information about the other variable is preserved as much as possible. The method can be considered as finding a minimal sufficient partition, or an efficient relevant coding, of one variable with respect to the other. This problem can be solved by introducing a Lagrange multiplier $\beta$ and then minimizing the functional:

$$\mathcal{L}[p(c|x)] = I(C,X) - \beta\, I(C,Y) \quad (2)$$

The solution is given in terms of the three distributions that characterize every cluster $c \in C$: the prior probability of the cluster, p(c), its membership probabilities p(c|x), and its distribution over the relevance variable, p(y|c). In general, the membership probabilities p(c|x) are "soft", i.e. every $x \in X$ can be assigned to every $c \in C$ with some (normalized) probability. The information bottleneck principle determines the distortion measure between the points x and c to be

$$D_{KL}[p(y|x)\,\|\,p(y|c)] = \sum_{y} p(y|x) \log \frac{p(y|x)}{p(y|c)},$$

the Kullback-Leibler divergence (Cover and Thomas, 1991) between the conditional distributions p(y|x) and p(y|c). Specifically, the formal optimal solution is given by the following equations, which must be solved together:

$$p(c|x) = \frac{p(c)}{Z(\beta, x)} \exp\!\big(-\beta\, D_{KL}[p(y|x)\,\|\,p(y|c)]\big)$$
$$p(y|c) = \frac{1}{p(c)} \sum_{x} p(c|x)\, p(x)\, p(y|x) \quad (3)$$
$$p(c) = \sum_{x} p(c|x)\, p(x)$$

where $Z(\beta, x)$ is a normalization factor, and the single positive (Lagrange) parameter $\beta$ determines the "softness" of the classification. Intuitively, in this procedure the information contained in X about Y is "squeezed" through a compact "bottleneck" of clusters C, which is forced to represent the "relevant" part of X with respect to Y.
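As a small numerical illustration of Equation (1) and of the KL distortion above, the following sketch computes I(X;Y) and $D_{KL}$ from the joint distribution obtained by normalizing the toy matrix R built earlier. The function names are ours, not the paper's.

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) for a joint distribution given as a 2-D array (Eq. 1)."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                         # convention: 0 log 0 = 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

def kl_divergence(p, q, eps=1e-12):
    """D_KL[p || q] between two distributions over y."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / (q[mask] + eps))).sum())

# Joint distribution p(f,o) from the toy matrix R of the previous sketch.
p_fo = R / R.sum()
print(mutual_information(p_fo))  # log 2 ~ 0.693 nats for the diagonal toy R
```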
An important special case is the "hard" clustering case, in which C is a deterministic function of X. That is, p(c|x) can only take the values zero or one. This restriction, which corresponds to the limit $\beta \to \infty$ in Eqs. (3), means that every $x \in X$ is assigned to exactly one cluster $c \in C$ with probability one and to all the others with probability zero:

$$p(c|x) = \begin{cases} 1, & \text{if } x \in c \\ 0, & \text{otherwise} \end{cases}$$
$$p(y|c) = \frac{1}{p(c)} \sum_{x \in c} p(x, y) \quad (4)$$
$$p(c) = \sum_{x \in c} p(x)$$

This yields a natural distance measure between distributions, which can easily be implemented in an agglomerative hierarchical clustering procedure (Slonim and Tishby, 1999). The algorithm starts with a trivial partitioning into |X| singleton clusters, where each cluster contains exactly one element of X. At each step we merge two components of the current partition into a single new component in a way that locally minimizes the loss of mutual information about the categories. Every merger, $(c_i, c_j) \Rightarrow c^*$, is formally defined by the following equations:

$$p(c^*|x) = \begin{cases} 1, & \text{if } x \in c_i \text{ or } x \in c_j \\ 0, & \text{otherwise} \end{cases}$$
$$p(y|c^*) = \frac{p(c_i)}{p(c^*)}\, p(y|c_i) + \frac{p(c_j)}{p(c^*)}\, p(y|c_j) \quad (5)$$
$$p(c^*) = p(c_i) + p(c_j)$$

The decrease in the mutual information I(C, Y) due to this merger is defined by

$$\delta I(c_i, c_j) \equiv I(C_{before}, Y) - I(C_{after}, Y) \quad (6)$$

where $I(C_{before}, Y)$ and $I(C_{after}, Y)$ are the information values before and after the merger, respectively. After a little algebra, one can see that

$$\delta I(c_i, c_j) = \big(p(c_i) + p(c_j)\big)\, D_{JS}\big[p(y|c_i), p(y|c_j)\big] \quad (7)$$

where the functional $D_{JS}$ is the Jensen-Shannon divergence (Lin, 1991), defined as

$$D_{JS}[p_i, p_j] = \pi_i\, D_{KL}[p_i \,\|\, \hat{p}] + \pi_j\, D_{KL}[p_j \,\|\, \hat{p}] \quad (8)$$

where in our case

$$\{p_i, p_j\} \equiv \{p(y|c_i),\, p(y|c_j)\}$$
$$\{\pi_i, \pi_j\} \equiv \left\{\frac{p(c_i)}{p(c^*)},\, \frac{p(c_j)}{p(c^*)}\right\} \quad (9)$$
$$\hat{p} = \pi_i\, p(y|c_i) + \pi_j\, p(y|c_j)$$

By introducing the information optimization criterion, the resulting similarity measure emerges directly from the analysis. The algorithm is now very simple: at each step we perform "the best possible merger", i.e. we merge the clusters $\{c_i, c_j\}$ that minimize $\delta I(c_i, c_j)$.

2.3 Improved Information Bottleneck Algorithm for Semantic Information

In the traditional information bottleneck approach, the distance between two data objects is measured by the difference in information values before and after the merger, which is in turn measured by the Jensen-Shannon divergence. This divergence measures the divergence between two probability distributions, and in our case it is based on the co-occurrence information between the two variables F and O. However, co-occurrence in a corpus is usually quantitatively sparse, and statistics based on word occurrence lose semantically related information. To avoid these adverse effects, in the proposed framework we combine the co-occurrence information and the semantic information into the final distance between the two types of objects:

$$D(X_i, X_j) = \lambda\, D_{semantic}(X_i, X_j) + (1 - \lambda)\, \delta I(X_i, X_j),$$
$$\text{where } \{X_i \in F \wedge X_j \in F\} \vee \{X_i \in O \wedge X_j \in O\} \quad (10)$$

In Equation (10), the distance between two data objects Xi and Xj is defined as a linear combination of the semantic distance and the information value difference. The parameter $\lambda$ controls the contribution of the two distances to the final distance. A sketch of the resulting merge step follows the algorithm outline below.

Input: Joint probability distribution p(f, o)
Output: A partition of F into m clusters, m ∈ {1, ..., |F|}, and a partition of O into n clusters, n ∈ {1, ..., |O|}
1. t ← 0
2. Repeat
   a. Construct CF_t ← F_t
   b. ∀i, j = 1, ..., |CF_t|, i
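To make the merge criterion concrete, here is a minimal sketch of selecting one merger under Equations (7)-(10). It is a simplified illustration under our own assumptions, not the paper's implementation: the semantic distance is passed in as a precomputed matrix (the paper's semantic measure is defined separately), and the weight lam plays the role of the combination parameter in Eq. (10).

```python
import numpy as np

def js_divergence(p_i, p_j, pi_i, pi_j, eps=1e-12):
    """Prior-weighted Jensen-Shannon divergence (Eq. 8)."""
    p_hat = pi_i * p_i + pi_j * p_j
    def kl(p, q):
        mask = p > 0
        return float((p[mask] * np.log(p[mask] / (q[mask] + eps))).sum())
    return pi_i * kl(p_i, p_hat) + pi_j * kl(p_j, p_hat)

def best_merge(p_c, p_y_given_c, d_semantic, lam=0.5):
    """Return the pair of clusters (i, j) minimizing the combined
    distance of Eq. (10): lam * D_semantic + (1 - lam) * delta_I."""
    k = len(p_c)
    best, best_pair = np.inf, None
    for i in range(k):
        for j in range(i + 1, k):
            p_star = p_c[i] + p_c[j]                      # p(c*), Eq. (5)
            pi_i, pi_j = p_c[i] / p_star, p_c[j] / p_star
            delta_i = p_star * js_divergence(             # Eq. (7)
                p_y_given_c[i], p_y_given_c[j], pi_i, pi_j)
            d = lam * d_semantic[i, j] + (1 - lam) * delta_i
            if d < best:
                best, best_pair = d, (i, j)
    return best_pair, best
```

Each accepted merger updates p(c), p(y|c), and the semantic distance matrix per Eq. (5), and the procedure repeats until the desired number of clusters remains, as in standard agglomerative IB.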