Model-Averaged Latent Semantic Indexing

Miles Efron
School of Information
University of Texas, Austin
miles@ischool.utexas.edu

ABSTRACT

This poster introduces a novel approach to information retrieval that uses statistical model averaging to improve latent semantic indexing (LSI). Instead of choosing a single dimensionality k for LSI, we propose using several models of differing dimensionality to inform retrieval. To manage this ensemble we weight each model's contribution in inverse proportion to its AIC (Akaike information criterion). Thus each model contributes according to its estimated expected Kullback-Leibler divergence from the distribution that generated the data, with smaller divergence earning greater weight. We present results on three standard IR test collections, demonstrating significant improvement over both the traditional vector space model and single-model LSI.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Retrieval Models

General Terms: Experimentation, Performance, Theory

Keywords: latent semantic indexing, model selection, model averaging

1. INTRODUCTION

This poster reports efforts to improve latent semantic indexing (LSI) by statistical model averaging. LSI projects documents and queries onto a low-dimensional subspace of the observed vector space by use of the singular value decomposition (SVD) [4]. According to its proponents, LSI's dimensionality reduction improves retrieval by accounting for linguistic ambiguity. But how aggressively to truncate the SVD remains an open research question [5, 7]. We propose an ensemble approach, using many models weighted by their estimated expected Kullback-Leibler (KL) divergence from the distribution that generated the data.

The author thanks Chris H. Q. Ding for generous advice during early work on this paper.

2. APPROACH

Recent work [2, 6] suggests that no single choice of k, the dimensionality of an LSI model, will be optimal for all queries. With this in mind, our approach, model-averaged LSI (MALSI), uses a set of M LSI models. The intuition behind MALSI is that choosing any one model is risky. If we reduce dimensionality too far, we lose important information. On the other hand, if we retain too many dimensions, we risk overfitting the data, incurring the vocabulary mismatch problem. MALSI compensates for this risk by allowing many models to "vote" on the relevance of a document to a query, weighting each vote according to our confidence in that model.

To quantify our confidence in a model, let U_k be a k-dimensional LSI model; thus U_k contains the first k left singular vectors of the n-term by p-document matrix X. We assess the fit of U_k by the Akaike information criterion (AIC), an estimate of the expected KL divergence between U_k and the unknown true model [3]:

    $\mathrm{AIC} = -2 \log L(U_k \mid X) + 2D$    (1)

where D = rk + 1 - k(k-1)/2 and r is the rank of X. Although LSI does not strictly define a generative model, work by Chris Ding [5] derives the following log-likelihood function for an LSI model of k dimensions:

    $\log L(U_k) = \lambda_1 + \cdots + \lambda_k - n \log Z(U_k)$    (2)

where $\lambda_k$ is the kth eigenvalue of $XX^T$ and Z is a partition function:

    $Z(U_k) = \int \exp\left[ (x \cdot u_1)^2 + \cdots + (x \cdot u_k)^2 \right] dx_1 \cdots dx_p$    (3)

Given a set of candidate models M (we use all models between k_min and k_max), we find the p-vector of query-document similarities by

    $\hat{r}(q) = \sum_{k \in M} w_k \, (q^T U_k)(\Sigma_k V_k^T)$    (4)

where w_k is the weight of the kth model (inversely proportional to its AIC), $\Sigma_k$ is diagonal, containing the first k singular values of X, and V_k holds the first k right singular vectors.
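To make equations (1) and (4) concrete, the sketch below shows one way to implement MALSI scoring in Python/NumPy. It is a minimal illustration, not the author's implementation: the poster says only that w_k is inversely proportional to AIC, so the sum-to-one normalization is an assumption; the AIC values are taken as precomputed inputs (evaluating the partition function of equation (3) is nontrivial and outside this sketch); and all function and variable names are illustrative.

```python
# Minimal sketch of MALSI scoring, assuming normalized inverse-AIC weights
# and precomputed AIC values; illustrative only, not the poster's code.
import numpy as np

def free_params(k, r):
    """Free-parameter count D = rk + 1 - k(k-1)/2, from equation (1)."""
    return r * k + 1 - k * (k - 1) // 2

def aic(log_lik, k, r):
    """AIC = -2 log L(U_k | X) + 2D, per equation (1).

    log_lik is assumed to be supplied (equation (2)); computing the
    partition function Z is beyond the scope of this sketch.
    """
    return -2.0 * log_lik + 2.0 * free_params(k, r)

def malsi_scores(X, q, ks, aics):
    """Return the p-vector of query-document similarities, equation (4).

    X    : n x p term-by-document matrix
    q    : length-n query vector
    ks   : candidate dimensionalities (k_min .. k_max)
    aics : AIC value for each model in ks
    """
    # One SVD serves every candidate model, since truncations are nested.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    w = 1.0 / np.asarray(aics, dtype=float)   # inversely proportional to AIC
    w /= w.sum()                              # normalization is an assumption

    r = np.zeros(X.shape[1])
    for w_k, k in zip(w, ks):
        proj_q = q @ U[:, :k]                 # q^T U_k: the query in model k
        docs_k = np.diag(s[:k]) @ Vt[:k]      # Sigma_k V_k^T: docs in model k
        r += w_k * (proj_q @ docs_k)          # model k's weighted "vote"
    return r
```

A practical consequence of the nesting is that the ensemble costs little more than the largest single model: the k-dimensional factors are contained in the (k+1)-dimensional ones, so one truncated SVD yields every candidate U_k, $\Sigma_k$, and V_k.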
Figure 1 shows the log-likelihood and AIC values for all possible dimensionalities on three standard test collections.

[Figure 1: Log-Likelihood and AIC Values for LSI Models of Three Corpora. Six panels plot log-likelihood (top row) and AIC (bottom row) against dimensionality for Medline, CACM, and Reuters.]

3. EXPERIMENTS

We compared MALSI to the vector space model (VSM) and LSI (with single models chosen by minimum AIC) on three corpora: Medline (1033 documents), CACM (3204 documents), and a subset of Reuters-21578 [1] (http://www.daviddlewis.com/resources/testcollections/reuters21578). Due to memory constraints, only the first 4000 Reuters documents were processed. Reuters queries were created by choosing TOPIC elements; topics assigned to fewer than 10 documents were rejected, leaving 29. To gauge performance we used precision averaged across 11 levels of recall, along with R-precision.

Table 1 shows each performance measure averaged over all queries. In all cases MALSI outperformed both LSI and VSM, even when LSI performed worse than the VSM. To test the significance of these results, we conducted paired one-sided t-tests over individual queries (as opposed to the averaged results shown in Table 1). With respect to mean average precision (MAP), MALSI performed significantly better than LSI and VSM in most cases. The p-value of the test between MALSI and LSI was 0.058 for CACM, with p ≤ 0.01 for the other corpora. MALSI decisively outperformed LSI on R-precision, yielding p-values of 0.0007, 0.03, and 0.09 for Medline, Reuters, and CACM, respectively.

Table 1: Performance Averaged Over All Queries

               MED               CACM              REUT
           MAP     R-Prec    MAP     R-Prec    MAP     R-Prec
  VSM      0.466   0.455     0.158   0.155     0.486   0.490
  LSI      0.483   0.467     0.143   0.146     0.516   0.509
  MALSI    0.502   0.486     0.160   0.175     0.553   0.531

Several results from Table 1 are especially important. First, MALSI mitigated LSI's poor showing on CACM, supporting the hypothesis that model averaging can lessen the detrimental effects of dimensionality reduction. MALSI also improved overall accuracy, outperforming VSM and LSI at statistically significant levels (p < 0.05) in all cases except CACM.

Our motivation for using AIC instead of the raw log-likelihood is evident from the different extrema that each function gives over the domain of candidate models. Due to its penalty on free parameters, AIC is optimized at a lower k than the log-likelihood; though more complex models may yield higher likelihood, AIC offers a better basis for model averaging [3].
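To see why these extrema differ, consider a small self-contained illustration (all values below are synthetic, and the partition-function term of equation (2) is omitted): for any steeply decaying eigenvalue spectrum, the raw log-likelihood grows with every added dimension, while the 2D penalty of equation (1) pulls AIC's minimum down to a small interior k.

```python
# Synthetic illustration: the raw log-likelihood always favors the largest
# model, while AIC's parameter penalty selects a much smaller k.
# The eigenvalues are made up; the partition-function term is omitted.
import numpy as np

r = 1000                                  # hypothetical rank of X
lam = 50000.0 / np.arange(1, r + 1)       # synthetic, decaying eigenvalues
loglik = np.cumsum(lam)                   # lambda_1 + ... + lambda_k, eq. (2)
ks = np.arange(1, r + 1)
D = r * ks + 1 - ks * (ks - 1) / 2        # free-parameter count from eq. (1)
aic = -2.0 * loglik + 2.0 * D

print(ks[np.argmax(loglik)])              # 1000: likelihood peaks at max k
print(ks[np.argmin(aic)])                 # 52 with these values: a small k
```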
4. CONCLUSION

Our use of AIC to assess model goodness of fit yielded promising results and puts MALSI on a strong information-theoretic footing. While this approach is sound and yields good results, in future work we hope to compare the technique described here to a fully Bayesian approach, which admits a less restrictive notion of model uncertainty. We also plan to test MALSI on larger, more realistic corpora.

5. REFERENCES

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999.
[2] H. Bast and D. Majumdar. Why spectral retrieval works. In SIGIR '05, pages 11-18, 2005.
[3] K. P. Burnham and D. R. Anderson. Model Selection and Multimodel Inference. Springer, New York, 2nd edition, 2002.
[4] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391-407, 1990.
[5] C. H. Q. Ding. A similarity-based probability model for latent semantic indexing. In SIGIR '99, pages 58-65, 1999.
[6] G. Dupret. Latent concepts and the number of orthogonal factors in latent semantic analysis. In SIGIR '03, pages 221-226, 2003.
[7] M. Efron. Eigenvalue-based model selection during latent semantic indexing. JASIST, 56(9):969-988, 2005.