WWW 2007 / Poster Paper Topic: Search

Generative Models for Name Disambiguation

Yang Song1, Jian Huang2, Isaac G. Councill2, Jia Li3,1, C. Lee Giles2,1
1 Department of Computer Science and Engineering, 2 Information Sciences and Technology, 3 Department of Statistics, The Pennsylvania State University, University Park, PA 16802, USA

ABSTRACT
Name ambiguity is a special case of identity uncertainty, where one person can be referenced by multiple name variations in different situations or even share the same name with other people. In this paper, we present an efficient framework based on two novel topic-based models, extended from Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). Our models explicitly introduce a new variable for persons and learn the distribution of topics with regard to persons and words. Experiments indicate that our approach consistently outperforms other unsupervised methods, including spectral clustering and DBSCAN. Scalability is addressed by disambiguating authors in over 750,000 papers from the entire CiteSeer dataset.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval

General Terms
Algorithms, Experimentation, Theory

2. TOPIC-BASED PLSA

[Figure 1: Graphical model representation of (a) PLSA and (b) LDA. K is the number of topics, D is the total number of documents, N_d is the number of tokens in document d, and A_d is the number of name appearances in document d.]

The joint probability of the topic-based PLSA model (see Figure 1(a)) over d x a x w is defined as the mixture P(d, a, w) = P(d) P(a, w | d), where P(a, w | d) = \sum_{z \in Z} P(a, w | z) P(z | d). The generative model can be described as follows.
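As a quick numerical check, the mixture decomposition above can be sketched with toy factor matrices; the sizes and random values below are illustrative only and are not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
D, A, W, K = 4, 3, 5, 2  # toy sizes: documents, names, words, topics


def normalize(x, axis):
    """Scale x so it sums to 1 along the given axis."""
    return x / x.sum(axis=axis, keepdims=True)


P_d = normalize(rng.random(D), 0)          # P(d)
P_zd = normalize(rng.random((K, D)), 0)    # P(z|d): each column sums to 1
P_wz = normalize(rng.random((W, K)), 0)    # P(w|z): each column sums to 1
P_az = normalize(rng.random((A, K)), 0)    # P(a|z): each column sums to 1

# P(d, a, w) = P(d) * sum_z P(z|d) P(w|z) P(a|z)
P_daw = np.einsum('d,kd,wk,ak->daw', P_d, P_zd, P_wz, P_az)

# the mixture defines a proper joint distribution over (d, a, w)
assert np.isclose(P_daw.sum(), 1.0)
```

Because each conditional factor is normalized, the triple sum over d, a, and w telescopes to \sum_d P(d) = 1, which the final assertion verifies.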
(1) Pick a document d from the corpus D with probability P(d); (2) select a latent class z_k with probability P(z_k | d); (3) generate a word w with probability P(w | z_k); (4) generate a name a with probability P(a | z_k). Putting it all together, the joint probability can be parameterized as

P(d, a, w) = P(d) \sum_{z \in Z} P(z | d) P(w | z) P(a | z).   (1)

Keywords
Unsupervised Machine Learning, Name Disambiguation.

1. INTRODUCTION
Name queries make up approximately 5-10% of all searches on the Internet, but search engines usually treat them as ordinary keyword searches, paying no attention to the ambiguity of particular names. For example, searching Google for "Yang Song" returns more than 11,000,000 pages, of which even the first page shows five different people's home pages. Beyond the problem of sharing the same name, name misspellings, name abbreviations, and other issues compound the challenge of name disambiguation. The same issue exists in most Digital Libraries (DL), due to the existence of both synonyms and polysems. In the case of synonyms, an author may have multiple name variations in citations; e.g., the author "C. Lee Giles" sometimes appears as "C. L. Giles" in his citations. For polysems, different authors may share the same name label in multiple citations; e.g., both "Guangyu Chen" and "Guilin Chen" appear as "G. Chen" in their citations.

Copyright is held by the author/owner(s). WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.

2.1 Model Fitting with the EM Algorithm
The standard Expectation-Maximization (EM) algorithm is applied to estimate the parameters. In the E-step, we compute

P(z | d, a, w) = P(z | d) P(a | z) P(w | z) / \sum_{z'} P(z' | d) P(a | z') P(w | z').

In the M-step, we maximize the expectation of the complete-data likelihood; the update formulas are

P(a | z) = \sum_{d,w} n(d, a, w) P(z | d, a, w) / \sum_{d,a',w} n(d, a', w) P(z | d, a', w),
P(w | z) = \sum_{d,a} n(d, a, w) P(z | d, a, w) / \sum_{d,a,w'} n(d, a, w') P(z | d, a, w'),
P(z | d) = \sum_{a,w} n(d, a, w) P(z | d, a, w) / \sum_{a,w} n(d, a, w),

where n(d, a, w) denotes the number of occurrences of word w in document d with name a. The EM algorithm stops on convergence, i.e., when the improvement of the log-likelihood becomes negligibly small.

To predict the topics of new documents, the learned P(w | z) are used to estimate P(a | z) for new names a in a test document d through a "folding-in" process [2]. Specifically, the E-step is the same as before; however, the M-step keeps the original P(w | z) and only updates P(a | z) and P(z | d).

[Figure 2: Clustering results on the CiteSeer data set. 1: A. Gupta, 2: A. Kumar, 3: C. Chen, 4: D. Johnson, 5: J. Robinson, 6: J. Smith, 7: K. Tanaka, 8: M. Jones, 9: M. Miller.]

We collected metadata from the CiteSeer digital library. The nine most ambiguous author names from the entire data set were tested; the results are shown in Figure 2.

[Table 1, left column] Topic 40 "Database": query 0.0375, xml 0.0321, database 0.0321, scalability 0.0315, process 0.0315, storage 0.0215, memory 0.0215; Jun Yang (Duke) 0.1258, Jun Yang (CMU) 0.0477.

3. TOPIC-BASED LDA
The generative process of our topic-based LDA model, extended from [1] (shown in Figure 1(b)), can be formalized as follows. (1) Draw a multinomial word distribution \phi_z for each topic z from a Dirichlet distribution with prior \beta, and likewise a multinomial name distribution \psi_z with prior \gamma; (2) for each document d, draw a multinomial distribution \theta_d from a Dirichlet distribution with prior \alpha; (3) for each word w_di in d, draw a topic z_di from the multinomial distribution \theta_d; (4) draw a word w_di from the multinomial distribution \phi_{z_di}; (5) draw a name a_di from the multinomial distribution \psi_{z_di}.
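Steps (1)-(5) of the generative process can be sketched as a small ancestral sampler. The sizes, priors, and the separate name distribution psi are illustrative assumptions for this sketch, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
K, W, A, D, N_d = 3, 8, 4, 2, 6     # topics, words, names, docs, tokens per doc
alpha, beta, gamma = 0.5, 0.1, 0.1  # Dirichlet priors (illustrative values)

phi = rng.dirichlet([beta] * W, size=K)    # (1) word multinomial phi_z ~ Dir(beta)
psi = rng.dirichlet([gamma] * A, size=K)   #     name multinomial psi_z ~ Dir(gamma)

docs = []
for d in range(D):
    theta = rng.dirichlet([alpha] * K)     # (2) theta_d ~ Dir(alpha)
    tokens = []
    for i in range(N_d):
        z = rng.choice(K, p=theta)         # (3) topic z_di ~ Mult(theta_d)
        w = rng.choice(W, p=phi[z])        # (4) word w_di ~ Mult(phi_z)
        a = rng.choice(A, p=psi[z])        # (5) name a_di ~ Mult(psi_z)
        tokens.append((int(z), int(w), int(a)))
    docs.append(tokens)
```

Each document is a list of (topic, word, name) triples; inference then inverts this process to recover the latent topic assignments from the observed words and names.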
3.1 Inference and Parameter Estimation
The inference problem in LDA is to compute the posterior of the (document-level) hidden variables given a document d = (w, a):

p(\theta, \phi, \psi, z | w, a, \alpha, \beta, \gamma) = p(\theta, \phi, \psi, z, w, a | \alpha, \beta, \gamma) / p(w, a | \alpha, \beta, \gamma).

Here p(w, a | \alpha, \beta, \gamma) is usually referred to as the marginal distribution of document d:

p(w, a | \alpha, \beta, \gamma) = \int\int\int p(\theta | \alpha) p(\phi | \beta) p(\psi | \gamma) \prod_{n=1}^{N} p(w_n | \theta, \phi) \prod_{m=1}^{M} p(a_m | \theta, \psi) d\theta d\phi d\psi.

By marginalizing over the hidden variable z, the name distribution p(a | \theta, \psi) can be represented as \sum_z p(a | z, \psi) p(z | \theta). As a result, the likelihood of a corpus D can be calculated by taking the product of the marginal probabilities of each of the documents:

p(D | \alpha, \beta, \gamma) = \int\int\int \prod_{z=1}^{K} p(\phi_z | \beta) p(\psi_z | \gamma) \prod_{d=1}^{D} p(\theta_d | \alpha) \prod_{n=1}^{N_d} p(w_n | \theta_d, \phi) \prod_{m=1}^{M_d} p(a_m | \theta_d, \psi) d\theta d\phi d\psi.

To estimate the parameters, we construct a Markov chain that converges to the posterior distribution on z. The posterior distribution can be derived as follows:

p(z_i = j | z_{-i}, w, a) \propto p(z_i = j | z_{-i}) p(w_i | z, w_{-i}) p(a_i | z, a_{-i})   (2)

\propto \frac{H^{WT}_{w_i j} + \beta}{\sum_m H^{WT}_{mj} + W\beta} \cdot \frac{H^{AT}_{a_i j} + \gamma}{\sum_m H^{AT}_{mj} + A\gamma} \cdot \frac{H^{DT}_{dj} + \alpha}{\sum_{j'} H^{DT}_{dj'} + K\alpha},   (3)

where W and A are the numbers of distinct words and names, and H^{AT} counts name-topic assignments analogously to H^{WT}.

[Table 1, right column] Topic 42 "Multimedia": retrieval 0.0411, multimedia 0.0411, broadcast 0.0360, video 0.0311, shot 0.0311, labeling 0.0311, flash 0.0215; Jun Yang (Duke) 0.0398, Jun Yang (CMU) 0.2781.

[Plot: per-topic probability distributions for the two "Jun Yang"s over 50 topics, peaking at topic 40 (Database) and topic 42 (Multimedia).]

Table 1: LDA results for two different "Jun Yang"s.

Table 1 lists an illustrative result from LDA. We depict the topics that most clearly show the differences when disambiguating authors with exactly the same name. One "Jun Yang" has a very high probability on the topic "Database" while the other is highly related to the topic "Multimedia"; thus the two can be clearly disambiguated from each other. We empirically tested our models on the entire CiteSeer data set of more than 750,000 documents. PLSA yields 418,500 unique authors in 2,570 minutes, while LDA finishes in 4,390 minutes with 418,775 authors.
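One collapsed Gibbs update of the kind in Eq. (3) can be sketched from count matrices as follows. The function name, the toy counts, and the hyperparameter values are illustrative assumptions, not the paper's implementation; the counts are assumed to already exclude the current token:

```python
import numpy as np


def gibbs_update(H_wt, H_at, H_dt, w_i, a_i, d, alpha, beta, gamma, rng):
    """Resample the topic of one token from count matrices.

    H_wt: word-topic counts (W x K), H_at: name-topic counts (A x K),
    H_dt: doc-topic counts (D x K), all excluding the current instance.
    """
    W, K = H_wt.shape
    A = H_at.shape[0]
    p = ((H_wt[w_i] + beta) / (H_wt.sum(axis=0) + W * beta)
         * (H_at[a_i] + gamma) / (H_at.sum(axis=0) + A * gamma)
         * (H_dt[d] + alpha))  # the doc-term denominator is constant in j and cancels
    p = p / p.sum()
    return int(rng.choice(K, p=p))


rng = np.random.default_rng(0)
H_wt = np.array([[3, 0], [1, 2], [0, 4]])  # W=3 words x K=2 topics
H_at = np.array([[2, 1], [2, 5]])          # A=2 names x K=2 topics
H_dt = np.array([[4, 2], [0, 4]])          # D=2 docs x K=2 topics
z_new = gibbs_update(H_wt, H_at, H_dt, w_i=0, a_i=1, d=0,
                     alpha=0.5, beta=0.1, gamma=0.1, rng=rng)
```

A full sampler would decrement the counts for token i, call this update, then increment the counts under the newly drawn topic, sweeping over all tokens until the chain mixes.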
Considering that our methods only make use of a small portion of the text of each instance (metadata plus the first page), we believe the framework can be efficient for large-scale data sets.

In Eq. (3), z_{-i} denotes all topic assignments excluding the i-th word; H^{WT}_{mj} is the number of times word m is assigned to topic j, excluding the current instance, and H^{DT}_{dj} is the number of times document d contains topic j, excluding the current instance.

4. EXPERIMENTS
To disambiguate names, we use a hierarchical agglomerative clustering method. Two sets of metrics are applied in our experiments, namely the pair-level pairwise F1 score (F1_P) and the cluster-level pairwise F1 score (F1_C). We also compare with basic agglomerative clustering, spectral clustering, and DBSCAN.

5. REFERENCES
[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, 2003.
[2] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR '99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50-57, Berkeley, California, 1999.
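The pairwise F1 used in Section 4 scores a clustering by treating every pair of items as a binary same-cluster decision. A minimal sketch (the function name and toy labels below are ours, for illustration):

```python
from itertools import combinations


def pairwise_f1(true_labels, pred_labels):
    """Pairwise F1: a pair is positive when a clustering puts both items together."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# a perfect clustering up to relabeling scores 1.0
assert pairwise_f1([0, 0, 1, 1], [1, 1, 0, 0]) == 1.0
```

Because only pair co-membership is compared, the score is invariant to how the clusters are labeled, which is the property needed when the number of true authors is unknown.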