SIGIR 2007 Proceedings Poster

Generative Modeling of Persons and Documents for Expert Search

Pavel Serdyukov, Djoerd Hiemstra, Maarten Fokkinga and Peter M.G. Apers
Database Group, University of Twente
PO Box 217, 7500 AE Enschede, The Netherlands
{serdyukovpv, hiemstra, fokkinga, apers}@cs.utwente.nl

ABSTRACT

In this paper we address the task of automatically finding an expert within an organization, known as the expert search problem. We present a theoretically grounded probabilistic algorithm which models retrieved documents as mixtures of expert candidate language models. Experiments show that our approach outperforms existing theoretically sound solutions.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]; H.3.3 [Information Search and Retrieval]

General Terms

Algorithms, Theory, Performance, Experimentation

Keywords

expert finding, expertise, enterprise search, e-mail

Copyright is held by the author/owner(s). SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

1. INTRODUCTION

Expert finding is a new, rapidly evolving direction of Information Retrieval research [4]. An expert search system finds persons with certain expertise within an organization. It takes as input a short user query together with the information stored on personal desktops or in centralized databases. There are two main approaches to modeling and ranking expert candidates. The first is profile-centric: all documents related to a candidate expert are merged into a single personal profile, the profiles are ranked as in standard document retrieval, and the best-matching candidates are returned to the user. The second is document-centric: it runs the query against all documents and ranks candidates by the summed scores of their associated documents. Our generative modeling method combines features of both approaches: it ranks candidates using their language models built from retrieved documents.
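The contrast between the profile-centric and document-centric strategies can be sketched in a few lines. This is a minimal illustration, not any cited system's implementation; the toy scoring function, data layout, and candidate names are assumptions made for the example.

```python
# Minimal sketch contrasting the two classical expert-ranking strategies.
# score() is a placeholder for any document-retrieval scoring function.

def score(text, query):
    # Toy relevance score: total query-term frequency in the text.
    words = text.lower().split()
    return sum(words.count(q) for q in query)

def profile_centric(docs_by_candidate, query):
    # Merge each candidate's documents into one profile, then rank profiles.
    return sorted(docs_by_candidate,
                  key=lambda e: score(" ".join(docs_by_candidate[e]), query),
                  reverse=True)

def document_centric(docs_by_candidate, query):
    # Score individual documents, then sum the scores per candidate.
    totals = {e: sum(score(d, query) for d in docs)
              for e, docs in docs_by_candidate.items()}
    return sorted(totals, key=totals.get, reverse=True)

docs = {"alice": ["language model retrieval", "retrieval evaluation"],
        "bob": ["database systems"]}
print(profile_centric(docs, ["retrieval"]))   # alice ranked first
print(document_centric(docs, ["retrieval"]))
```

On this toy data the two strategies agree; in practice they can rank candidates differently, because merging documents into a profile changes term statistics before scoring.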
2. EXPERT MODELING

The most popular and effective assumption in expert finding research states that the level of personal expertise can be determined by analyzing the co-occurrence of the query terms and a person's identifier within the context of a document or a passage [1]. We similarly suppose that our task comes down to estimating the joint probability P(e, q1, ..., qk) of observing the candidate expert e together with the query terms q1, ..., qk in the documents ranked by the query. In this paper we examine two estimation methods.

Method 1 considers a candidate and the query terms to be conditionally independent given a ranked document (see Figure 1a). Thus, the total joint probability is

    P(e, q1, ..., qk) = Σ_D P(D) P(e, q1, ..., qk | D),    (1)

    P(e, q1, ..., qk | D) = P(e|D) Π_{i=1..k} P(qi|D).

This method is analogous to the most successful and theoretically sound approach proposed so far [1]. Thus, it serves as the baseline in our experiments.

Method 2, which is the contribution of our paper, is based on the assumption of a dependency between the query terms and a candidate. We suppose that candidates actually generate the query terms within the retrieved documents (see Figure 1b). Considering the query terms to be sampled independently given an expert candidate, we calculate the required joint probability as

    P(e, q1, ..., qk) = P(q1, ..., qk | e) P(e),    (2)

    P(q1, ..., qk | e) = Π_{i=1..k} P(qi|e).

We set the candidate prior P(e) to be

    P(e) = Σ_D P(e|D) P(D).    (3)

[Figure 1: Dependence networks for the two methods of estimating the joint probability P(e, q1, ..., qk).]

So now we need to estimate the probabilities P(qi|e). Since we have already postulated that candidates are responsible for generating query terms in the documents they are mentioned in, we represent the language model of a ranked document as a mixture of the expert candidate language models and the global language model.
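As a concrete illustration, the two estimates can be sketched as follows. All probability tables below are made-up toy values, not estimates from any collection; in the paper, P(qi|e) would come from the EM estimation of the candidate language models.

```python
# Sketch of the two joint-probability estimates for one candidate e and a
# two-term query. All distributions here are toy assumptions.

docs = ["d1", "d2"]
P_D = {d: 1.0 / len(docs) for d in docs}                 # uniform P(D), as in the paper
P_e_given_D = {"d1": 0.7, "d2": 0.2}                     # P(e|D) for candidate e
P_q_given_D = {"d1": [0.05, 0.02], "d2": [0.01, 0.03]}   # P(q_i|D) per query term

def method1():
    # Eq. (1): sum over documents of P(D) * P(e|D) * prod_i P(q_i|D).
    total = 0.0
    for d in docs:
        p = P_D[d] * P_e_given_D[d]
        for pq in P_q_given_D[d]:
            p *= pq
        total += p
    return total

def method2(P_q_given_e):
    # Eqs. (2)-(3): prior P(e) = sum_D P(e|D) P(D), times prod_i P(q_i|e).
    prior = sum(P_e_given_D[d] * P_D[d] for d in docs)
    p = prior
    for pq in P_q_given_e:
        p *= pq
    return p

print(round(method1(), 6))
print(round(method2([0.04, 0.03]), 6))   # P(q_i|e): toy stand-ins for EM estimates
```

The structural difference is visible in the code: Method 1 multiplies the term probabilities inside the sum over documents, while Method 2 factors the candidate out into a prior and scores the terms against the candidate's own language model.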
We define the likelihood of the set R of top retrieved documents as

    P(R) = Π_{D∈R} Π_{w∈D} ((1 − λ_G) Σ_{i=1..m} P(ei|D) P(w|ei) + λ_G P(w|G))^c(w,D).

Here e1, ..., em are the expert candidates, c(w, D) is the count of term w in document D, and λ_G is the probability that a term is generated from the global language model rather than from any of the candidate language models. P(w|G) is the global language model, estimated over the whole document collection. Further, we apply the EM algorithm [3], traditionally used to estimate unknown parameters, to calculate P(w|ei). We propose the following updating formulas, applied iteratively to maximize the likelihood of the set R:

E-step:

    P(e|w, D) = (1 − λ_G) P(e|D) P(w|e) / ((1 − λ_G) Σ_{i=1..m} P(ei|D) P(w|ei) + λ_G P(w|G))    (4)

M-step:

    P(w|e) = Σ_{D∈R} c(w, D) P(e|w, D) / (Σ_w Σ_{D∈R} c(w, D) P(e|w, D))    (5)
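The updates (4)-(5) can be sketched as a short EM loop. The corpus, the association probabilities P(e|D), and the mixing weight λ_G below are toy assumptions chosen only to make the example runnable.

```python
from collections import Counter

# Toy EM sketch for estimating the candidate language models P(w|e) with the
# E-step (4) and M-step (5). All data and lam_G are illustrative assumptions.

R = {"d1": "expert search language model", "d2": "language model retrieval"}
experts = ["e1", "e2"]
P_e_given_D = {"d1": {"e1": 0.8, "e2": 0.2}, "d2": {"e1": 0.3, "e2": 0.7}}
lam_G = 0.5

counts = {d: Counter(text.split()) for d, text in R.items()}
vocab = sorted({w for c in counts.values() for w in c})
total = sum(sum(c.values()) for c in counts.values())
P_w_G = {w: sum(c[w] for c in counts.values()) / total for w in vocab}  # global LM

# Initialize each candidate language model uniformly over the vocabulary.
P_w_e = {e: {w: 1.0 / len(vocab) for w in vocab} for e in experts}

for _ in range(20):
    # E-step, eq. (4): posterior that candidate e generated term w in doc D.
    post = {}
    for d in R:
        for w in counts[d]:
            mix = sum(P_e_given_D[d][e] * P_w_e[e][w] for e in experts)
            denom = (1 - lam_G) * mix + lam_G * P_w_G[w]
            post[d, w] = {e: (1 - lam_G) * P_e_given_D[d][e] * P_w_e[e][w] / denom
                          for e in experts}
    # M-step, eq. (5): re-estimate P(w|e) from expected term counts.
    for e in experts:
        expc = {w: sum(counts[d][w] * post[d, w][e] for d in R if w in counts[d])
                for w in vocab}
        norm = sum(expc.values())
        P_w_e[e] = {w: expc[w] / norm for w in vocab}

print({w: round(P_w_e["e1"][w], 3) for w in vocab})
```

After a few iterations the models separate: terms of documents strongly associated with a candidate receive most of that candidate's probability mass, while the global model absorbs collection-wide terms.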
The probabilities P(e|D) are calculated using association scores a(e, D) between a document and the expert candidates: P(e|D) = a(e, D) / Σ_{i=1..m} a(ei, D). The distribution P(D) is taken to be uniform in both methods.

Our approach to modeling expert candidates rests on a hypothesis similar to the one used in model-based pseudo-relevance feedback methods for document retrieval [5]: the relevance model of a user can be mined from the top retrieved documents. The significant difference is that we represent the relevance model, which is in fact the model of the query topic, as a mixture of the models of the expert candidates who actually hold and share the desired knowledge.

3. EXPERIMENTS

For the evaluation of our approach we use data from the expert search task of the TREC 2006 Enterprise track. The track provides 1092 expert candidates and 50 queries with corresponding lists of relevant experts. We use only the e-mail part of the collection, since it allows us to extract candidate-document associations easily from the from, to and cc fields, given the candidates' e-mail addresses. The association scores for these fields are set to 1.5, 1.0 and 2.5 respectively, which is the best combination according to recent studies [2]. The number of retrieved documents used for modeling is restricted to 1000, which is the standard for document retrieval tasks in TREC. The standard language-model-based IR approach is used for the retrieval of documents. Table 1 contains the results of both expert candidate ranking methods.

Table 1: Performance of expert ranking methods

         MAP      MRR      R-pr     P@5      P@10     P@20
    M1   0.1587   0.6550   0.2598   0.4285   0.4122   0.3341
    M2   0.1712   0.6712   0.2755   0.4306   0.4304   0.3653

Our Method 2 improves on the baseline Method 1 across all standard IR measures, including mean average precision, mean reciprocal rank, R-precision, and precision at the top 5, 10 and 20 expert candidates. This suggests that the assumption that the terms and the candidates associated with a document are independent (see Figure 1a) is the less realistic one. It seems important for expert ranking methods to model candidates, queries and documents under the view that the occurrence of a specific candidate in a document determines which terms the document consists of (see Figure 1b).
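The candidate-document association step can be sketched as follows, using the field weights reported above (from: 1.5, to: 1.0, cc: 2.5) and the normalization P(e|D) = a(e, D) / Σ_i a(ei, D). The message structure and addresses are simplified assumptions for illustration.

```python
# Sketch of candidate-document association scoring for P(e|D).
# Field weights follow the values stated in the experiments section; the
# message format and addresses are illustrative assumptions.

FIELD_WEIGHTS = {"from": 1.5, "to": 1.0, "cc": 2.5}

def association_scores(message):
    # a(e, D): sum of field weights for each candidate address in the message.
    scores = {}
    for field, weight in FIELD_WEIGHTS.items():
        for addr in message.get(field, []):
            scores[addr] = scores.get(addr, 0.0) + weight
    return scores

def p_e_given_d(message):
    # P(e|D) = a(e, D) / sum_i a(e_i, D).
    scores = association_scores(message)
    norm = sum(scores.values())
    return {e: a / norm for e, a in scores.items()}

msg = {"from": ["alice@example.org"],
       "to": ["bob@example.org"],
       "cc": ["carol@example.org"]}
print(p_e_given_d(msg))   # alice 0.3, bob 0.2, carol 0.5
```

Normalizing per document keeps P(e|D) a proper distribution over the candidates mentioned in D, which the mixture likelihood and EM updates above require.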
4. CONCLUSIONS

We presented a new generative model-based method for expert finding and evaluated it on the TREC Enterprise collection. The results suggest that it is more effective to drop the assumption of independence between candidates and document terms, although this claim should be studied further with additional experiments. We see the potential of the presented method for understanding the distribution of expertise in an enterprise. In the future, we plan to take a closer look at pseudo-relevance feedback techniques, namely query expansion, since our approach inherits much from this family of approaches.

5. ACKNOWLEDGMENTS

We thank Sergey Chernov, Gianluca Demartini and Julien Gaugaz from the L3S Lab Hannover for their help with data preprocessing and a series of fruitful discussions.

6. REFERENCES

[1] K. Balog, L. Azzopardi, and M. de Rijke. Formal models for expert finding in enterprise corpora. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 43-50, 2006.
[2] K. Balog and M. de Rijke. Finding experts and their details in e-mail corpora. In WWW '06: Proceedings of the 15th International World Wide Web Conference, 2006.
[3] A. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[4] D. Hawking. Challenges in enterprise search. In ADC '04: Proceedings of the 15th Australasian database conference, pages 15-24, Darlinghurst, Australia, 2004. Australian Computer Society, Inc.
[5] C. Zhai and J. Lafferty. Model-based feedback in the language modeling approach to information retrieval. In CIKM '01: Proceedings of the tenth international conference on Information and knowledge management, pages 403-410, 2001.