Model-Averaged Latent Semantic Indexing

Miles Efron
School of Information
University of Texas, Austin
miles@ischool.utexas.edu

ABSTRACT

This poster introduces a novel approach to information retrieval that uses statistical model averaging to improve latent semantic indexing (LSI). Instead of choosing a single dimensionality k for LSI, we propose using several models of differing dimensionality to inform retrieval. To manage this ensemble we weight each model's contribution in inverse proportion to its AIC (Akaike information criterion). Thus each model contributes according to its estimated expected Kullback-Leibler divergence from the distribution that generated the data, with smaller divergence earning greater weight. We present results on three standard IR test collections, demonstrating significant improvement over both the traditional vector space model and single-model LSI.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Retrieval Models

General Terms: Experimentation, Performance, Theory

Keywords: latent semantic indexing, model selection, model averaging

1. INTRODUCTION

This poster reports efforts to improve latent semantic indexing (LSI) by statistical model averaging. LSI projects documents and queries onto a low-dimensional subspace of the observed vector space by use of the singular value decomposition (SVD) [4]. According to its proponents, LSI's dimensionality reduction improves retrieval by accounting for linguistic ambiguity. But how aggressively to truncate the SVD remains an open research question [5, 7]. We propose an ensemble approach, using many models weighted by their estimated expected Kullback-Leibler (KL) divergence from the distribution that generated the data.

The author thanks Chris H. Q. Ding for generous advice during early work on this paper.

2. APPROACH

Recent work [2, 6] suggests that no single choice of k, the dimensionality of an LSI model, will be optimal for all queries. With this in mind, our approach, model-averaged LSI (MALSI), uses a set of M LSI models. The intuition behind MALSI is that choosing any one model is risky. If we reduce dimensionality too far, we lose important information. On the other hand, if we retain too many dimensions, we risk overfitting the data, incurring the vocabulary mismatch problem. MALSI compensates for this risk by allowing many models to "vote" on the relevance of a document to a query, weighting each vote according to our confidence in that model.

To quantify our confidence in a model, let U_k be a k-dimensional LSI model; thus U_k contains the first k left singular vectors of the n-term by p-document matrix X. We assess the fit of U_k by the Akaike information criterion (AIC), an estimate of the expected KL divergence between U_k and the unknown true model [3]:

    $\mathrm{AIC} = -2 \log L(U_k \mid X) + 2D$    (1)

where D = rk + 1 - k(k-1)/2 and r is the rank of X. Although LSI does not strictly define a generative model, work by Chris Ding [5] derives the following log-likelihood function for an LSI model of k dimensions:

    $\log L(U_k) = \lambda_1 + \cdots + \lambda_k - n \log Z(U_k)$    (2)

where $\lambda_k$ is the kth eigenvalue of $XX^T$ and Z is a partition function:

    $Z(U_k) = \int \exp\left[ (x \cdot u_1)^2 + \cdots + (x \cdot u_k)^2 \right] dx_1 \cdots dx_p$    (3)

Given a set of candidate models M (we use all models between k_min and k_max), we find the p-vector of query-document similarities by

    $\hat{r}(q) = \sum_{k \in M} w_k \, (q^T U_k)(\Sigma_k V_k^T)$    (4)

where w_k is the weight of the kth model (inversely proportional to its AIC), $\Sigma_k$ is diagonal, containing the first k singular values of X, and V_k holds the first k right singular vectors.
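To make equations (1) and (4) concrete, the sketch below shows one way to implement MALSI scoring in Python/NumPy. It is a minimal illustration, not the author's implementation: the poster says only that w_k is inversely proportional to AIC, so the sum-to-one normalization is an assumption; the AIC values are taken as precomputed inputs (evaluating the partition function of equation (3) is nontrivial and outside this sketch); and all function and variable names are illustrative.

```python
# Minimal sketch of MALSI scoring, assuming normalized inverse-AIC weights
# and precomputed AIC values; illustrative only, not the poster's code.
import numpy as np

def free_params(k, r):
    """Free-parameter count D = rk + 1 - k(k-1)/2, from equation (1)."""
    return r * k + 1 - k * (k - 1) // 2

def aic(log_lik, k, r):
    """AIC = -2 log L(U_k | X) + 2D, per equation (1).

    log_lik is assumed to be supplied (equation (2)); computing the
    partition function Z is beyond the scope of this sketch.
    """
    return -2.0 * log_lik + 2.0 * free_params(k, r)

def malsi_scores(X, q, ks, aics):
    """Return the p-vector of query-document similarities, equation (4).

    X    : n x p term-by-document matrix
    q    : length-n query vector
    ks   : candidate dimensionalities (k_min .. k_max)
    aics : AIC value for each model in ks
    """
    # One SVD serves every candidate model, since truncations are nested.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    w = 1.0 / np.asarray(aics, dtype=float)   # inversely proportional to AIC
    w /= w.sum()                              # normalization is an assumption

    r = np.zeros(X.shape[1])
    for w_k, k in zip(w, ks):
        proj_q = q @ U[:, :k]                 # q^T U_k: the query in model k
        docs_k = np.diag(s[:k]) @ Vt[:k]      # Sigma_k V_k^T: docs in model k
        r += w_k * (proj_q @ docs_k)          # model k's weighted "vote"
    return r
```

A practical consequence of the nesting is that the ensemble costs little more than the largest single model: the k-dimensional factors are contained in the (k+1)-dimensional ones, so one truncated SVD yields every candidate U_k, $\Sigma_k$, and V_k.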
Figure 1 shows the log-likelihood and AIC values for all possible dimensionalities on three standard test collections.

[Figure 1: Log-Likelihood and AIC Values for LSI Models of Three Corpora. Six panels plot log-likelihood (top row) and AIC (bottom row) against dimensionality for Medline, CACM, and Reuters.]

3. EXPERIMENTS

We compared MALSI to the vector space model (VSM) and LSI (with single models chosen by minimum AIC) on three corpora: Medline (1033 documents), CACM (3204 documents), and a subset of Reuters-21578 [1] (http://www.daviddlewis.com/resources/testcollections/reuters21578). Due to memory constraints, only the first 4000 Reuters documents were processed. Reuters queries were created by choosing TOPIC elements; topics assigned to fewer than 10 documents were rejected, leaving 29. To gauge performance we used precision averaged across 11 levels of recall, along with R-precision.

Table 1 shows each performance measure averaged over all queries. In all cases MALSI outperformed both LSI and VSM, even when LSI performed worse than the VSM. To test the significance of these results, we conducted paired one-sided t-tests over individual queries (as opposed to the averaged results shown in Table 1). With respect to mean average precision (MAP), MALSI performed significantly better than LSI and VSM in most cases. The p-value of the test between MALSI and LSI was 0.058 for CACM, with p ≤ 0.01 for the other corpora. MALSI decisively outperformed LSI on R-precision, yielding p-values of 0.0007, 0.03, and 0.09 for Medline, Reuters, and CACM, respectively.

Table 1: Performance Averaged Over All Queries

               MED               CACM              REUT
           MAP     R-Prec    MAP     R-Prec    MAP     R-Prec
  VSM      0.466   0.455     0.158   0.155     0.486   0.490
  LSI      0.483   0.467     0.143   0.146     0.516   0.509
  MALSI    0.502   0.486     0.160   0.175     0.553   0.531

Several results from Table 1 are especially important. First, MALSI mitigated LSI's poor showing on CACM, supporting the hypothesis that model averaging can lessen the detrimental effects of dimensionality reduction. MALSI also improved overall accuracy, outperforming VSM and LSI at statistically significant levels (p < 0.05) in all cases except CACM.

Our motivation for using AIC instead of the raw log-likelihood is evident from the different extrema that each function gives over the domain of candidate models. Due to its penalty on free parameters, AIC is optimized at a lower k than the log-likelihood; though more complex models may yield higher likelihood, AIC offers a better basis for model averaging [3].
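To see why these extrema differ, consider a small self-contained illustration (all values below are synthetic, and the partition-function term of equation (2) is omitted): for any steeply decaying eigenvalue spectrum, the raw log-likelihood grows with every added dimension, while the 2D penalty of equation (1) pulls AIC's minimum down to a small interior k.

```python
# Synthetic illustration: the raw log-likelihood always favors the largest
# model, while AIC's parameter penalty selects a much smaller k.
# The eigenvalues are made up; the partition-function term is omitted.
import numpy as np

r = 1000                                  # hypothetical rank of X
lam = 50000.0 / np.arange(1, r + 1)       # synthetic, decaying eigenvalues
loglik = np.cumsum(lam)                   # lambda_1 + ... + lambda_k, eq. (2)
ks = np.arange(1, r + 1)
D = r * ks + 1 - ks * (ks - 1) / 2        # free-parameter count from eq. (1)
aic = -2.0 * loglik + 2.0 * D

print(ks[np.argmax(loglik)])              # 1000: likelihood peaks at max k
print(ks[np.argmin(aic)])                 # 52 with these values: a small k
```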
4. CONCLUSION

Our use of AIC to assess model goodness of fit yielded promising results and puts MALSI on a strong information-theoretic footing. While this approach is sound and yields good results, in future work we hope to compare the technique described here to a fully Bayesian approach, which admits a less restrictive notion of model uncertainty. We also plan to test MALSI on larger, more realistic corpora.

5. REFERENCES

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999.
[2] H. Bast and D. Majumdar. Why spectral retrieval works. In SIGIR '05, pages 11-18, 2005.
[3] K. P. Burnham and D. R. Anderson. Model Selection and Multimodel Inference. Springer, New York, 2nd edition, 2002.
[4] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391-407, 1990.
[5] C. H. Q. Ding. A similarity-based probability model for latent semantic indexing. In SIGIR '99, pages 58-65, 1999.
[6] G. Dupret. Latent concepts and the number of orthogonal factors in latent semantic analysis. In SIGIR '03, pages 221-226, 2003.
[7] M. Efron. Eigenvalue-based model selection during latent semantic indexing. JASIST, 56(9):969-988, 2005.