A Parallel Derivation of Probabilistic Information Retrieval Models
Thomas Roelleke, Queen Mary, University of London, thor@dcs.qmul.ac.uk
Jun Wang, Queen Mary, University of London, wangjun@dcs.qmul.ac.uk

ABSTRACT
This paper investigates in a stringent mathematical formalism the parallel derivation of three grand probabilistic retrieval models: binary independent retrieval (BIR), Poisson model (PM), and language modelling (LM). The investigation has been motivated by a number of questions. Firstly, though sharing the same origin, namely the probability of relevance, the models differ with respect to event spaces. How can this be captured in a consistent notation, and can we relate the event spaces? Secondly, BIR and PM are closely related, but how does LM fit in? Thirdly, how are tf-idf and probabilistic models related? The parallel investigation of the models leads to a number of formalised results:
1. BIR and PM assume the collection to be a set of non-relevant documents, whereas LM assumes the collection to be a set of terms from relevant documents.
2. PM can be viewed as a bridge connecting BIR and LM.
3. A BIR-LM equivalence explains BIR as a special LM case.
4. PM explains tf-idf, and both BIR and LM probabilities express tf-idf in a dual way.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms
Theory

Keywords
Binary Retrieval Model, Poisson, Language Modelling

1. INTRODUCTION AND BACKGROUND
The search for the best retrieval model drives information retrieval (IR) research. [17] introduced the BIR model, which forms a sound theoretical foundation for IR. Tuning of tf-idf led to probably the best tf-idf-based retrieval function, known as BM25 ([15] at TREC 1995). In the late 90s, language models emerged as an alternative retrieval model ([13]). The early 00s brought the divergence from randomness (DFR) branch of models ([3], [1]).

A physicist might ask: what is the common ground of your models? For centuries, physicists have been investigating phenomena like gravity, light, and energy. For each phenomenon, there is a model with significant maths around it. Physicists look for the ground model that explains the other models: we hear, for example, of quarks and quantum theory. Are there IR quarks that explain the existing IR models? This paper is not about IR quarks, nor does it propose a ground model for IR. This paper is best described as a parallel investigation of gravity, light, and energy, where the parallel investigation helps to highlight aspects and relationships of the single phenomena (models).

The contribution of this paper is the formal presentation of the parallel derivation of existing IR models. The derivation leads to theorems and formulae that relate and explain existing IR models. This paper looks at the three grand probabilistic retrieval models: binary independent retrieval (BIR), Poisson model (PM), and language modelling (LM). Results include, for example, the formalisation of event spaces.
For BIR, judgements on documents form the event space, whereas for PM, frequencies of terms form the event space, and for LM, terms at locations form the event space (see section 3). Our research also addressed DFR, but this paper focusses on the three basic models only.

A paper on retrieval models is bound to have a low-recall reference list, and we selected relevant papers that deal with the relationships and explanations of models. Highly relevant is the introduction [11] in the book [6] on LM, where the authors make explicit that the probability of relevance is the origin of both BIR and LM. [9] in the same book takes a detailed look at the relevance event, and [18] is a milestone in pointing at the different event spaces of probabilistic models, and in raising awareness of what happens when relating BIR and LM. Further, the papers [8] and [16] are highly relevant. [8] provides a probabilistic explanation of tf-idf, the explanation being based on LM. The elegant formulation of the LM retrieval status value has been an important step for the efficient processing of LM; however, the tf-idf explanation mixes event spaces (usage of document frequencies for estimating the probability of a term in a collection, and usage of term frequencies for estimating the probability of a term in a document), and we pay in our paper particular attention to making the underlying event spaces explicit. [16], on theoretical arguments for idf, draws on the relationship between BIR and idf, pointing out how idf follows from BIR for missing relevance. [7] picked up the latter, and discusses a fully idf-based explanation of BIR.

The next class of relevant papers deals with the Poisson model: [14] divides the collection into two sets: an elite and a non-elite set of documents (the elite set contains the documents that are "about" a query concept/term, and in this set of documents the term usually occurs to a greater extent than in the rest of the documents). Dividing the collection into two sets should show an improvement, since the 2-Poisson model with two parameters better fits the divided collection, this also being reported in the Poisson investigation [12]. The results stress the need to consider two document sets, and this is exactly what we observe when relating BIR and LM probabilities, where the average document length of documents in the elite set of a term and the average document length in the collection form the pillars of the bridge that connects BIR and LM probabilities. [5] investigates idf-based against ilf-based retrieval (the paper refers to token-based retrieval, but we apply the notion ilf, inverse location frequency). The work of Church/Gale has shown that the burstiness of terms (a bursty term is a term that, if it occurs, occurs several times) is a crucial feature of a good term.

The book [21] is probably to date the most adventurous approach towards a ground model of IR, claiming that IR can be explained by linear algebra, vector (Hilbert) spaces and extensions. Different from work on ground models, our paper does not claim any unifying ground model. We derive the existing models in a unifying formalism and notation, without making any claim about a new model.

The remainder of this paper is structured as follows:
Section 2: Brief review of the probability of relevance being the origin of probabilistic models.
Section 3: Parallel derivation of BIR, PM, and LM.
Section 4: Relevance assumption: BIR and PM assume the collection to be a set of non-relevant documents, whereas LM assumes the collection to be a set of terms from relevant documents.
Section 5: Poisson bridge: The Poisson parameter can be viewed as a bridge connecting BIR and LM probabilities.
Section 6: BIR-LM equivalence: Shows a case for which BIR and LM are equivalent.
Section 7: TF-IDF: PM explains tf-idf-like retrieval, and this can be represented using either BIR or LM probabilities.

2. THE PROBABILITY OF RELEVANCE
The probability of relevance can be viewed as the origin of BIR, PM, and LM. The conditional probability P(r|d,q) of relevance is equal to the fraction of the probability P(r,d,q) and the evidence probability P(d,q), where r is the relevance event, d is a document, and q is a query:

P(r|d,q) = P(d,q,r) / P(d,q)

A ranking based on P(r|d,q) is optimal if the costs for reading a relevant document are assumed to be less than the costs for reading a non-relevant document (probabilistic ranking principle). Therefore, P(r|d,q) is the optimal basis for defining a retrieval status value (RSV). The odds of an event (the fraction of the probabilities that the event is true and false, respectively) are a convenient ranking function:

O(r|d,q) := P(r|d,q) / P(r̄|d,q) = P(d,q,r) / P(d,q,r̄)

In the formulation with odds, P(d,q) drops out.
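As a side note for illustration (ours, not part of the original derivation): since O = p/(1-p) is strictly monotone in p, ranking by odds and ranking by the probability of relevance agree. A minimal Python sketch with invented probabilities demonstrates this:

```python
# Minimal sketch: the odds O = p/(1-p) are a monotone transform of p,
# hence ranking by O(r|d,q) equals ranking by P(r|d,q).
# The relevance probabilities below are invented illustration values.
p_rel = {"d1": 0.8, "d2": 0.5, "d3": 0.1}

def odds(p):
    return p / (1.0 - p)

by_prob = sorted(p_rel, key=p_rel.get, reverse=True)
by_odds = sorted(p_rel, key=lambda d: odds(p_rel[d]), reverse=True)
assert by_prob == by_odds  # ['d1', 'd2', 'd3']
```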
For computing P(d,q,r) (P(d,q,r̄), respectively), we need the probability P(d,q|r) (P(d,q|r̄), respectively). Either let d depend on q, or let q depend on d, and we obtain:

P(d,q,r) = P(d|q,r) · P(q|r) · P(r)    (1)
P(d,q,r) = P(q|d,r) · P(d|r) · P(r)    (2)

Equation 1 is the basis of binary independent retrieval (BIR) and the Poisson model (PM), and equation 2 is the basis of language modelling (LM) (see [11]). The events d and q are sequences (conjunctions) of events: for BIR, the events correspond to binary judgements on whether or not a term occurs in d; for PM, the events correspond to the frequencies with which the terms occur in d; for LM, the events correspond to terms. We formalise and discuss the event spaces and other aspects of the probabilistic IR models in the next section.

3. PARALLEL DERIVATION
This section shows the parallel derivation of BIR, PM, and LM. For presenting such a complex mathematical topic in an understandable way, and for giving the reader an anchor for looking up the issues and notations, we developed figures 1 and 2. They are relatively busy figures with lots of mathematical notation, and we will proceed step by step. The notation is partially based on [19]. Figure 1 points at the event spaces, background models, frequencies, probabilities, and term probability interpretations of each model. Figure 2 focuses on the document and query interpretations, retrieval status values (RSVs), and parameter estimations.

3.1 Event spaces, Background models, Frequencies, Probabilities: Figure 1

3.1.1 Event spaces
BIR is based on a space of binary judgements over documents, whereas PM is based on a space of frequencies of terms, and LM is based on a space of terms at locations. (Note in figure 1 the carefully chosen notation for referring to the elements of the event spaces, and for underlining the parallels of the models: j_d is a judgement ({1, 0}) for document d; f_t is a frequency ({0, 1, 2, ...}) of term t; t_l is a term ({t_1, t_2, ...}) at location l.) The next section highlights that BIR involves 2 · N_T(c) judgement sequences (two sequences for each term, where N_T(c) is the number of terms) as background model, whereas PM involves two frequency sequences as background model, and LM involves one term sequence as background model.

Figure 1: Event spaces, Background models, Frequencies, Probabilities, and Term probability interpretations (the three-column layout of the figure is reconstructed row by row):

Event spaces (3.1.1).
BIR: D: set of documents; J: set of judgements; j_d: judgement for document d, j_d ∈ {1, 0}.
PM: T: set of terms; F: set of frequencies; f_t: frequency of term t, f_t ∈ {0, 1, 2, ...}.
LM: L: set of locations; T: set of terms; t_l: term at location l, t_l ∈ {t_1, t_2, ...}.

Background models (3.1.2).
BIR: 2 · N_T(c) sequences c_{t,x} = j_1, j_2, ... of judgements, where j_d is the judgement for document d, and c_{t,x} denotes the judgement sequence for term t and the set x of documents.
PM: Two sequences c_x = f_1, f_2, ... of frequencies, where f_t is the location frequency of term t, and c_x denotes the location frequency sequence for the set x of documents.
LM: One sequence c = t_1, t_2, ... of terms, where t_l is the term at location l, and c is the collection.
For BIR and PM, x is either the set r of relevant or the set r̄ of non-relevant documents.

Frequencies (3.1.3).
BIR: N_D(x): number of documents in set x; ∀t: N_D(x) = N_D(c_{t,x}). n_D(j, c_{t,x}): number of documents in judgement sequence c_{t,x} for which judgement j occurs.
PM: n_L(t,x) := f_{t,x}: location frequency of term t in set x; λ(t,x) := n_L(t,x)/N_D(x): average (expected) within-document term frequency in set x; λ(t,x) = λ(t, c_x).
LM: N_L(y): number of locations in sequence y; n_L(t,y): number of locations in term sequence y at which term t occurs; y is either a document or the collection.

Probabilities (3.1.4).
BIR: probability of judgement j, given judgement sequence c_{t,x}: P(J = j | c_{t,x}) := n_D(j, c_{t,x}) / N_D(x).
PM: probability of frequency f_t, given frequency sequence c_x: P(F = f_t | x) := (λ(t,x)^{f_t} / f_t!) · e^{−λ(t,x)}.
LM: probability of term t, given term sequence y: P(T = t | y) := n_L(t,y) / N_L(y).

Term probability interpretations (3.1.5).
BIR: in P_BIR(t|x), interpret t as the event that j = 1 for t; with n_D(t,x) := n_D(1, c_{t,x}): P_BIR(t|x) := P(J = 1 | c_{t,x}) = n_D(t,x) / N_D(x).
PM: in P_PM(t|x), interpret t as the event that t occurs f_t times: P_PM(t|x) := P(F = f_t | x) = (λ(t,x)^{f_t} / f_t!) · e^{−λ(t,x)}.
LM: in P_LM(t|y), t corresponds to a term event: P_LM(t|y) := P(T = t | y) = n_L(t,y) / N_L(y).
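To make the notation of figure 1 concrete, the following Python sketch (ours; the three-document toy collection is invented) computes, for one term, the document-based, frequency-based, and location-based statistics side by side:

```python
# Toy collection: each document is a sequence of terms (locations).
docs = {"d1": ["sailing", "boats"],
        "d2": ["sailing", "sailing", "east"],
        "d3": ["east", "coast"]}

term = "sailing"
N_D = len(docs)                                   # number of documents
N_L = sum(len(d) for d in docs.values())          # number of locations in c
n_L = sum(d.count(term) for d in docs.values())   # locations at which t occurs
n_D = sum(1 for d in docs.values() if term in d)  # documents with judgement j = 1

P_BIR = n_D / N_D        # document-based: P(J = 1 | c_t)
lam   = n_L / N_D        # Poisson parameter: average within-document frequency
P_LM  = n_L / N_L        # location-based: P(T = t | c)
print(P_BIR, lam, P_LM)  # 0.666..., 1.0, 0.4285...
```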
3.1.2 Background models
A background model is based on the knowledge we have about the collection, and from this point of view, each of BIR, PM, and LM has a background model. Note that the notion "background model" is common for LM; the parallel derivation here applies this notion to all models. The background model could include global (collection-independent) knowledge such as individual user profiles or statistics over previous queries. For the scope of this paper, we consider the collection to be the only evidence source. Still, the discussion here makes explicit how global knowledge is to be included: depending on the model chosen, the global knowledge leads either to an adaptation (mixture) of judgements, frequencies, or terms. Next, consider the background model for each of the probabilistic retrieval models.

BIR: The background model comprises several sequences of judgements. BIR is based on a set of relevant documents and a set of non-relevant documents; hence, for each term t in the collection, there are two judgement sequences c_{t,r} and c_{t,r̄}, where c_{t,r} is for the relevant and c_{t,r̄} is for the non-relevant documents, respectively. Thus, the BIR background model consists of 2 · N_T(c) judgement sequences, where 2 is the number of document sets (relevant and non-relevant documents), and N_T(c) is the number of terms.

PM: The background model consists of two sequences: the sequence c_r represents the location frequencies in relevant documents, and the sequence c_r̄ represents the location frequencies in non-relevant documents.

LM: The background model is simply the sequence of terms in the collection, i.e. the concatenation of all documents.

From the background models' simplicity (number of sequences) point of view, we could argue that LM is the simplest model, followed by PM, followed by BIR.

3.1.3 Frequencies
BIR is based on document frequencies. N_D(x) denotes the total number of documents in set x (x := r for relevant documents, x := r̄ for non-relevant), and n_D(j, x) denotes the number of documents for which a judgement j holds. Since the number of relevant or non-relevant documents is the same for each term (∀t: N_D(c_{t,x}) = N_D(x)), we apply N_D(x) as an abbreviation of N_D(c_{t,x}).

PM is based on the average (expected) within-document term frequency. Here, n_L(t,x) is the number of locations in the set x of documents at which term t occurs. Then, on average, we expect n_L(t,x)/N_D(x) locations with t per document in the set x.

LM is based on location frequencies. N_L(y) denotes the total number of locations in sequence y, where y is a document or the collection, and n_L(t,y) denotes the number of locations at which term t occurs in term sequence y.

3.1.4 Probabilities
BIR: The probability of a judgement is estimated as the fraction of the number of documents for which the judgement holds, divided by the total number of documents. For example, let term t occur in two of five relevant documents, i.e. n_D(1, c_{t,r}) = 2 and N_D(r) = 5. Then, P(J = 1 | c_{t,r}) = n_D(1, c_{t,r})/N_D(r) = 2/5 is the probability that term t occurs in a relevant document.

PM: We need to choose a probability distribution for estimating the probability of a frequency. The probabilist's first choice is Poisson (though terms are not randomly distributed, and the set of relevant documents is usually relatively small; therefore, pure Poisson is known to be not the end of the game, see [2], [4]). For the discussion in this paper, we stick to Poisson, and we will see in section 5 (Poisson bridge) that the Poisson distribution parameter λ(t,c), the average number of locations of term t per document, connects BIR and LM.

For reviewing how the Poisson distribution works, consider a term t that occurs on average λ(t,x) times in document set x. For example, assume that we observe the following location frequencies for six terms in the set x of documents: 5, 10, 5, 3, 3, 4 (n_L(t_1, x) = 5, n_L(t_2, x) = 10, etc.). The average frequency is λ(t,x) = n_L(t,x)/N_D(x). For example, for five documents (N_D(x) = 5), λ(t_1, x) = 5/5 = 1 and λ(t_2, x) = 10/5 = 2. Then, for a document d, P(t|d) = λ(t,x)/N_L(d) is the single event probability that t occurs, and for large documents, the Poisson probability

P(F = f_t | x) = (λ(t,x)^{f_t} / f_t!) · e^{−λ(t,x)}

approximates the probability that t occurs f_t times in a document, assuming that t is randomly distributed in x. The characteristic feature of the Poisson probability is to peak at λ(t,x), i.e. P(F = λ(t,x) | x) is the maximal probability. Thus, P(d) is maximal for a document in which the term (location) frequencies reflect the frequencies in the set x from which the averages were computed.

LM: The probability of a term is estimated as the fraction of the number of locations at which the term occurs, divided by the total number of locations. For example, let term t occur at 100 of 10^6 locations in collection c; then P(T = t | c) = 10^-4.
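A quick numeric check of the peak property of the Poisson probability discussed above (a sketch we add for illustration):

```python
import math

def poisson(f, lam):
    # P(F = f | lambda) = lambda^f / f! * exp(-lambda)
    return lam**f / math.factorial(f) * math.exp(-lam)

lam = 2.0  # average within-document frequency lambda(t, x)
probs = [(f, poisson(f, lam)) for f in range(8)]
# The probability peaks at f = lambda (for integer lambda, f = 1 and f = 2 tie):
best = max(probs, key=lambda p: p[1])
print(best)  # (1, 0.2706...) -- P(F=1) = P(F=2) = 2 * e**-2
```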
3.1.5 Term probability interpretations
The parallel consideration of the probabilities and event spaces leads to a well-defined interpretation of the term probabilities P(t|r), P(t|r̄) and P(t|c). The interpretation and the theoretically sound estimation depend on the model, as the next paragraphs underline.

BIR: The notation P_BIR(t|r) needs to be interpreted as an abbreviation. P_BIR(t|r) is an abbreviation for P(J = 1 | c_{t,r}), i.e. the random variable J takes the judgement, and c_{t,r} is the judgement sequence for the term t and the set r of relevant documents.

PM: The notation P_PM(t|r) is, similar to BIR, just an abbreviation. The notation P_PM(t|r) is an abbreviation for P(F = f_t | c_r), i.e. the random variable F takes the frequency f_t of term t, and c_r corresponds to the sequence of average frequencies derived from the set r of relevant documents.

LM: The notation P_LM(t|c) is consistent with the underlying event space of terms. Again, like for the background models, LM ranks top with respect to simplicity.

3.2 Document and query interpretations, RSV's, Parameter estimations: Figure 2

3.2.1 Document and query interpretations
BIR interprets a document as a conjunction of independent binary judgements, whereas PM interprets a document as a conjunction of independent frequencies, and LM interprets a query as a conjunction of independent terms.
Figure 2: Document and query interpretations, RSV's, Parameter estimations (the three-column layout of the figure is reconstructed row by row):

Document and query interpretations (3.2.1).
BIR: a document is a conjunction of independent judgements:
P(d|q,x) = ∏_{j∈d} P(j|q,x) = [∏_{t∈d} P(t|q,x)] · [∏_{t∉d} P(t̄|q,x)]
PM: a document is a conjunction of independent frequencies:
P(d|q,x) = ∏_{f_t∈d} P(f_t|q,x) = [∏_{t∈d} P(f_t|q,x)] · [∏_{t∉d} e^{−λ(t,x)}]
LM: a query is a conjunction of independent terms:
P(q|d,c) = ∏_{t∈q} P(t|d,c)

Retrieval status values (3.2.2).
BIR: odds and assumption:
O(r|d,q) = (P(d|q,r)/P(d|q,r̄)) · (P(q,r)/P(q,r̄));  assumption: ∀t∉q: P(t|q,r) = P(t|q,r̄). Hence
O(r|d,q) ∝ ∏_{t∈d∩q} P(t|q,r)/P(t|q,r̄) · ∏_{t∈q\d} P(t̄|q,r)/P(t̄|q,r̄)
Multiply O(r|d,q) by the constant ∏_{t∈q} P(t̄|q,r̄)/P(t̄|q,r):
RSV_BIR(d,q) := log ∏_{t∈d∩q} (P(t|r) · P(t̄|r̄)) / (P(t|r̄) · P(t̄|r))
PM: odds and assumption as for BIR:
O(r|d,q) ∝ ∏_{t∈d∩q} P(f_t|q,r)/P(f_t|q,r̄) · ∏_{t∈q\d} e^{−λ(t,r)}/e^{−λ(t,r̄)}
Multiply O(r|d,q) by the constant ∏_{t∈q} e^{−λ(t,r̄)}/e^{−λ(t,r)}:
RSV_PM(d,q) := log ∏_{t∈d∩q} (λ(t,r)/λ(t,r̄))^{n_L(t,d)}
LM: assume O(r|d,q) to be ranking equivalent to P(q|d,r) (P(q|d,c), respectively). This assumption discards the document prior P(d|r)/P(d|r̄); see section 4 for a discussion. Linear mixture for estimating the term probability:
P(t|d,c) := δ · P(t|c) + (1−δ) · P(t|d)
Multiply P(q|d,c) by the constant ∏_{t∈q} 1/(δ · P(t|c)):
RSV_LM(d,q) := log ∏_{t∈d∩q} (1 + ((1−δ)/δ) · (P(t|d)/P(t|c)))

Ranking rationales (3.2.3): see text.

Parameter estimations (3.2.4).
BIR: P_BIR(t|r) = n_D(t,r)/N_D(r);  P_BIR(t|r̄) = n_D(t,r̄)/N_D(r̄) ≈ n_D(t,c)/N_D(c)
PM: λ(t,r) = n_L(t,r)/N_D(r);  λ(t,r̄) = n_L(t,r̄)/N_D(r̄) ≈ n_L(t,c)/N_D(c)
LM: P_LM(t|c) = n_L(t,c)/N_L(c);  P_LM(t|d) = n_L(t,d)/N_L(d)

3.2.2 RSV's
The definitions of the RSV's of BIR, PM, and LM are presented in a condensed way in figure 2. We are aware that this condensed derivation and the RSV definitions are not easy to access, but the consistent and parallel framework in the figure provides the overall picture of the derivation.

BIR applies odds. The assumption that non-query terms do not affect the RSV reduces the multiplication over t∈d to t∈d∩q, and t∉d becomes t∈q\d. The multiplication with a constant factor leads to RSV_BIR. P(t|x) is an abbreviation of P(t|q,x) (this corresponds to q∩x = x).

PM applies odds and assumptions similar to BIR. The multiplication with a constant factor leads to RSV_PM.

LM applies a mixture of collection-based and document-based probabilities. The multiplication with a constant factor leads to RSV_LM. The LM mixture parameter is δ: since λ, the letter used for the mixture in many LM publications, here denotes the parameter of the Poisson distribution, we let δ denote the LM mixture parameter for a clear parallel derivation.

3.2.3 Ranking rationales
BIR: Terms that occur more often in relevant than in non-relevant documents have a positive effect on the RSV, whereas terms that occur more often in non-relevant than in relevant documents have a negative effect on the RSV. Mathematically, we summarise this as follows: P_BIR(t|r) > P_BIR(t|r̄): good term, positive effect on the RSV. P_BIR(t|r) < P_BIR(t|r̄): bad term, negative effect on the RSV. P_BIR(t|r) = P_BIR(t|r̄): neutral term, no effect on the RSV.

PM: Terms for which the average frequency in relevant documents is greater than the average frequency in non-relevant documents have a positive effect on the RSV, whereas terms for which the average occurrence in relevant documents is less than the average occurrence in non-relevant documents have a negative effect on the RSV. Mathematically: λ(t,r) > λ(t,r̄): good term, positive effect on the RSV. λ(t,r) < λ(t,r̄): bad term, negative effect on the RSV. λ(t,r) = λ(t,r̄): neutral term, no effect on the RSV. The occurrence count (frequency) n_L(t,d) increases the effect of a term.

LM: Large (small) P(t|d) implies a strong (little) effect on the RSV. Small (large) P(t|c) implies a strong (little) effect on the RSV. A document that misses a rare term (P(t|d) small or even zero, and P(t|c) small since t is rare) misses a significant contribution to the RSV.

3.2.4 Parameter estimations
BIR applies statistics over judgements for documents, whereas PM applies statistics over frequencies of terms, and LM applies statistics over terms at locations. For estimating the r-based parameters P_BIR(t|r) and λ(t,r), BIR and PM use the relevance information available, or, in case of missing relevance information, assume P_BIR(t|r) = P_BIR(t̄|r) = 0.5 and λ(t,r) = 1, respectively. For estimating the r̄-based parameters, BIR and PM apply two approaches: either consider a set of explicitly non-relevant documents (for example, the set of retrieved documents minus the set of visited documents), or assume r̄ ≈ c, i.e. assume that the majority of the documents in the collection are implicitly non-relevant.
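For illustration, the following sketch (ours; the per-term statistics are invented, and relevance information is assumed to be given) computes the per-term RSV contributions of figure 2 side by side:

```python
import math

# Invented toy statistics for a single query term t:
P_t_r, P_t_rbar = 0.5, 0.1              # BIR: P(t|r), P(t|rbar)
lam_r, lam_rbar = 2.0, 0.5              # PM: lambda(t,r), lambda(t,rbar)
P_t_d, P_t_c, delta = 0.02, 0.001, 0.8  # LM: P(t|d), P(t|c), mixture delta
n_L_t_d = 3                             # within-document frequency of t in d

# Per-term RSV contributions, following the definitions in figure 2:
rsv_bir = math.log((P_t_r * (1 - P_t_rbar)) / (P_t_rbar * (1 - P_t_r)))
rsv_pm  = n_L_t_d * math.log(lam_r / lam_rbar)
rsv_lm  = math.log(1 + (1 - delta) / delta * (P_t_d / P_t_c))
# Summing such contributions over t in d-and-q yields RSV_BIR, RSV_PM, RSV_LM.
```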
Whereas the parameter estimation already reflects the relevance assumption of BIR and PM, the relevance event (and thus the probability of relevance) is not yet related to LM, and we investigate this in the next section.

4. RELEVANCE ASSUMPTION
In section 2, we reviewed the probability P(r|d,q) being the origin of BIR, PM, and LM. Now, we show formally that the parameter estimation (see figure 2) implies that BIR and PM assume the collection to be a set of non-relevant documents, whereas the interpretation of LM as an estimate of P(r,d,q) implies that LM assumes the collection to be a set of terms from relevant documents. We formalise this in the following theorem.

Theorem 1 (Relevance Assumption). BIR and PM assume the collection to contain non-relevant documents, whereas LM assumes the collection to be a set of terms from relevant documents.

Proof. The parameter estimation for BIR and PM (see figure 2) is as follows:

P_BIR(t|r̄) ≈ n_D(t,c) / N_D(c)
λ(t,r̄) ≈ n_L(t,c) / N_D(c)

For BIR and PM, the estimates show that the non-relevant set of documents is approximated by the statistics available for the whole collection. For LM, the product of the term probabilities P_LM(t|d,c) leads to the query probability P_LM(q|d,c). With c = r, the LM query probability is an estimate of the probability P(q|d,r) in equation 2 (section 2). This makes explicit that LM assumes the collection c to be a sequence of terms from relevant documents.

This relevance assumption implies the interpretation of the document prior P(d|r) in equation 2 (remember that for BIR and PM, the query prior P(q|r) can be dropped since it is constant for all documents). The document prior P(d|r) is to be interpreted as P(d|c), since LM assumes the collection c to represent the relevance event. As an example for the importance of P(d|r), consider [10], a web retrieval investigation on entry page search.

5. THE POISSON BRIDGE CONNECTING BIR AND LM
This section shows that the Poisson parameter λ(t,c) can be viewed as a bridge connecting BIR and LM. We capture the mathematical relationship between document-based (BIR-based) and location-based (LM-based) probabilities as follows:

P_BIR(t|c) · n_L(t,c)/n_D(t,c) = λ(t,c) = N_L(c)/N_D(c) · P_LM(t|c)    (3)

For the verification, reconsider the definitions λ(t,c) := n_L(t,c)/N_D(c), P_BIR(t|c) := n_D(t,c)/N_D(c), and P_LM(t|c) := n_L(t,c)/N_L(c) in figure 2 (parameter estimations). The two fractions in equation 3 have the following meaning: avgdl(c) := N_L(c)/N_D(c) is the average document length, and avgtf(t,c) := n_L(t,c)/n_D(t,c) is the average term frequency of term t in t-documents (the documents in which term t occurs are the t-documents). Given these definitions, we obtain the following equation:

P_BIR(t|c) · avgtf(t,c) = λ(t,c) = avgdl(c) · P_LM(t|c)    (4)

We refer to this equation as the Poisson or BIR-LM bridge, since the left and the right side of the equation yield the Poisson parameter λ(t,c). For example, let "sailing" occur in 5 locations and 4 documents (n_L(sailing,c) = 5, n_D(sailing,c) = 4), and let the collection have 100 locations and 10 documents (N_L(c) = 100, N_D(c) = 10). Then, the average within-document frequency of sailing is λ(sailing,c) = 5/10. On average, we expect avgtf(sailing,c) = 5/4 locations containing sailing per sailing-document, and we expect avgdl(c) = 100/10 locations per document. For the probabilities, we obtain P_BIR(sailing|c) = 4/10 and P_LM(sailing|c) = 5/100.

Here, avgtf(sailing,c) < avgdl(c) and P_BIR(sailing|c) > P_LM(sailing|c), and we could view this as typical for real-world terms and collections, i.e. we expect avgtf(t,c) < avgdl(c) to hold for all terms. A term t with a high avgtf(t,c) is referred to as a bursty term, and a term t with a high P_BIR(t|c) is referred to as a frequent term. In the next section, we take a closer look at the burstiness and frequency of terms, when discussing an interesting equivalence of BIR and LM.
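The bridge is easy to verify numerically; this sketch (ours) re-uses the sailing statistics from the example above:

```python
# Sailing example: n_L(sailing,c)=5, n_D(sailing,c)=4, N_L(c)=100, N_D(c)=10.
n_L_t, n_D_t, N_L, N_D = 5, 4, 100, 10

P_BIR = n_D_t / N_D    # 0.4
P_LM  = n_L_t / N_L    # 0.05
avgtf = n_L_t / n_D_t  # 1.25 locations per sailing-document
avgdl = N_L / N_D      # 10.0 locations per document
lam   = n_L_t / N_D    # 0.5, the Poisson parameter lambda(sailing,c)

# Both sides of equation 4 yield lambda(t,c):
assert abs(P_BIR * avgtf - lam) < 1e-12 and abs(avgdl * P_LM - lam) < 1e-12
```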
6. BIR-LM EQUIVALENCE
Deriving the models in parallel poses the question of equivalences. We discuss in this section a condition which makes BIR and LM equivalent. We define the equivalence of two models (rankings) as follows:

Definition 1 (Equivalence). Models A and B are equivalent iff

RSV_A(d,q) = K + RSV_B(d,q)

The RSV's of equivalent models differ by a constant K (this being additive, since the RSV's are logarithmic). Equivalence is a stricter property than mere ranking equivalence. We have results on ranking equivalence, too; however, due to the space restriction of this paper, we decided to present only a result on equivalence. For which conditions are BIR and LM equivalent? Our approach is to express RSV_BIR such that we achieve an equivalence of BIR and LM. We formalise the BIR-LM equivalence in the following theorem.

Theorem 2 (BIR-LM Equivalence). If the RSV contribution coming from relevant documents is constant (the same for all documents), i.e.

∀d: K_r = log ∏_{t∈d∩q} P_BIR(t|r) / P_BIR(t̄|r)

and if the BIR term weight based on the collection (the collection is assumed to represent the statistics of non-relevant documents) is equal to the LM term weight multiplied by a constant γ, i.e.

P_BIR(t̄|c) / P_BIR(t|c) = γ · (1 + ((1−δ)/δ) · (P_LM(t|d)/P_LM(t|c)))    (5)

then BIR and LM are equivalent.

Proof. Start with the definition of RSV_BIR:

RSV_BIR(d,q) = K_r + log ∏_{t∈d∩q} P_BIR(t̄|r̄) / P_BIR(t|r̄)

Insert the condition for the term weights (equation 5), using r̄ ≈ c:

RSV_BIR(d,q) = K_r + log ∏_{t∈d∩q} γ · (1 + ((1−δ)/δ) · (P_LM(t|d)/P_LM(t|c))) = K_r + log γ^{|d∩q|} + RSV_LM(d,q)

What does this theorem bring? It explains BIR in the event space of LM: given P_BIR(t|c), for BIR and LM to be equivalent, we obtain a function P_LM(t|d) for the within-document term frequency, which can be interpreted as the within-document term frequency assumed by BIR. Solving equation 5 for P_LM(t|d) (note that P_BIR(t̄|c) = 1 − P_BIR(t|c)) leads to:

P_LM(t|d) = (γ^{−1} · (1 − P_BIR(t|c))/P_BIR(t|c) − 1) · (δ/(1−δ)) · P_LM(t|c)
          = (1 − (1+γ) · P_BIR(t|c)) · (δ/(1−δ)) · P_LM(t|c) / (γ · P_BIR(t|c))

Apply the Poisson bridge (equation 4), i.e. P_LM(t|c)/P_BIR(t|c) = avgtf(t,c)/avgdl(c), and we obtain:

P_LM(t|d) = (1 − (1+γ) · P_BIR(t|c)) · (δ/(1−δ)) · avgtf(t,c) / (γ · avgdl(c))

Figure 3 shows P_LM(t|d) as a function of P_BIR(t|c), plotted for various average term frequencies avgtf(t,c). The graphs keep avgdl(c) = 1000, the LM mixture δ = 0.8, and γ = 1 constant.
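The function behind figure 3 can be tabulated directly; the following sketch (ours) uses the paper's settings avgdl(c) = 1000, δ = 0.8, γ = 1:

```python
def p_lm_t_d(p_bir_t_c, avgtf, avgdl=1000.0, delta=0.8, gamma=1.0):
    # P_LM(t|d) implied by BIR-LM equivalence (solved from equation 5, with
    # the Poisson bridge substituting P_LM(t|c)/P_BIR(t|c) = avgtf/avgdl):
    return ((1 - (1 + gamma) * p_bir_t_c)
            * delta / (1 - delta) * avgtf / (gamma * avgdl))

# Reproduces the shape of figure 3: maximal at P_BIR(t|c) = 0, zero at 0.5.
for p in (0.0, 0.1, 0.3, 0.5):
    print(p, round(p_lm_t_d(p, avgtf=10), 4))  # 0.04, 0.032, 0.016, 0.0
```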
P_LM(t|d) is maximal for P_BIR(t|c) = 0, which is according to expectation, since rare terms have a positive effect for BIR, this being reflected in the maximal value for the within-document frequency.

[Figure 3: P_LM(t|d) as a function of P_BIR(t|c); curves for avgtf = 1, 2, 5, 10; the x-axis shows P_BIR(t|c) from 0 to 0.5, the y-axis shows P_LM(t|d) up to 0.04.]

The constant γ in equation 5 is a scaling factor that allows us to stretch the value interval of P_LM(t|d). For γ < 1, we obtain accordingly greater values for P_LM(t|d) than shown in figure 3. Next, figure 4 illustrates the product P_BIR(t|c) · avgtf(t,c) = λ(t,c).

[Figure 4: Burstiness and frequency of terms; the plane of P_BIR(t|c) (horizontal axis, 0 to 0.5) versus avgtf(t,c) (vertical axis, 2 to 10) is divided into the quadrants SOLITUDE-RARE, SOLITUDE-FREQUENT, BURSTY-RARE, and BURSTY-FREQUENT, with contour curves for λ(t,c) = 1, 2, 4.]

Figure 4 divides the terms into rare and frequent, based on P_BIR(t|c) on the horizontal axis, and into solitude and bursty, based on avgtf(t,c) on the vertical axis. We will see in the next section on tf-idf how burstiness is reflected in tf-idf when tf-idf is derived from RSV_PM.

7. TF-IDF
In this section, we explore the dual application of BIR and LM parameters for expressing tf-idf.

7.1 Dual Application of BIR and LM Parameters
We show in this section that, starting from RSV_PM, we can derive a tf-idf retrieval function, and we show two equivalent formulations: one based on the probability P_BIR(t|c), and one based on the probability P_LM(t|c). These formulations stress the duality of BIR and LM: depending on the event space, the parameter estimation is different; however, P_BIR(t|c) and P_LM(t|c) can be used alternatively for expressing RSV_PM and tf-idf, respectively.

RSV_PM is defined as follows (see figure 2):

RSV_PM(d,q) := Σ_{t∈d∩q} n_L(t,d) · log(λ(t,r) / λ(t,r̄))    (6)

Apply the BIR side of the Poisson bridge (equation 4), with r̄ ≈ c, and we obtain the BIR-based formulation of RSV_PM:

RSV_PM(d,q) = Σ_{t∈d∩q} n_L(t,d) · log((P_BIR(t|r) · avgtf(t,r)) / (P_BIR(t|c) · avgtf(t,c)))    (7)

Apply the LM side of the Poisson bridge, and we obtain the LM-based formulation of RSV_PM:

RSV_PM(d,q) = Σ_{t∈d∩q} n_L(t,d) · log((P_LM(t|r) · avgdl(r)) / (P_LM(t|c) · avgdl(c)))    (8)

The BIR-based and LM-based formulations of RSV_PM stress that both BIR and LM probabilities can be used in a dual way, and we link this to tf-idf in the next section.
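Since equations 7 and 8 both reduce to equation 6 via the bridge, they are numerically identical; the following sketch (ours; the toy statistics are invented, with the LM-side probabilities derived via the bridge) makes the duality concrete:

```python
import math

# Invented toy statistics for one query term t:
P_BIR_r, avgtf_r = 0.5, 2.0    # BIR side, relevant set: lambda(t,r) = 1.0
P_BIR_c, avgtf_c = 0.1, 1.25   # BIR side, collection:   lambda(t,c) = 0.125
avgdl_r, avgdl_c = 20.0, 10.0  # average document lengths
P_LM_r = P_BIR_r * avgtf_r / avgdl_r  # LM-side probabilities via the bridge
P_LM_c = P_BIR_c * avgtf_c / avgdl_c
n_L_t_d = 3                           # term frequency of t in d

w_bir = n_L_t_d * math.log((P_BIR_r * avgtf_r) / (P_BIR_c * avgtf_c))  # eq. 7
w_lm  = n_L_t_d * math.log((P_LM_r * avgdl_r) / (P_LM_c * avgdl_c))    # eq. 8
assert abs(w_bir - w_lm) < 1e-12  # identical RSV_PM contributions
```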
7.2 IDF and ILF
We apply the definition idf(t,x) := −log P_BIR(t|x) and obtain equation 9 from the BIR-based equation 7:

RSV_PM(d,q) = Σ_{t∈d∩q} n_L(t,d) · (idf(t,c) − idf(t,r) + log(avgtf(t,r)/avgtf(t,c)))    (9)

Next, we define the inverse location frequency (ilf) analogously to idf: ilf(t,x) := −log P_LM(t|x). Then, we obtain equation 10 from the LM-based equation 8:

RSV_PM(d,q) = Σ_{t∈d∩q} n_L(t,d) · (ilf(t,c) − ilf(t,r) + log(avgdl(r)/avgdl(c)))    (10)

The idf-based and ilf-based formulations of the PM differ from classical tf-idf in a factor that is, for idf, based on the average term frequencies of a term t in the documents in which t occurs (the t-documents), and, for ilf, based on the average document lengths. Set avgtf(t,r) = avgtf(t,c), and we are close to classical tf-idf, apart from the document normalisation.

From the probability of relevance point of view, BIR and PM are based on P(d|q,r), whereas LM is based on P(q|d,r). This implies that there is no document normalisation in BIR and PM, whereas normalisation comes for free in LM. To compensate for the lack of document normalisation in P(d|q) approaches, tf-idf, BM25, and pivoted document length ([20]) add normalisations. For example, traditional tf-idf replaces the total term frequency n_L(t,d) in RSV_PM by the linear estimate tf(t,d) := P(t|d) = n_L(t,d)/N_L(d), the normalisation being reflected by the document length N_L(d). BM25 gives evidence that a tf-factor such as tf(t,d) := n_L(t,d)/(n_L(t,d) + K) works better than the linear estimate. The growth of the BM25 tf is non-linear, and for K = 1, we find the lower bound 0.5 ≤ tf(t,d), independent of the document length. In BM25, the factor (avgdl − dl)/(avgdl + dl) awards short documents and penalises long documents. Pivoted document length helps to balance single term weights by replacing the non-pivoted normalisation factor by pivoted_norm = (1 − slope) × pivot + slope × old_norm. For example, view N_L(d) (the document length) as old_norm, and the average document length could be the pivot. Then, we obtain a normalisation that assigns smaller term weights to small documents and larger term weights to large documents than the non-pivoted normalisation does. The slope parameter controls the impact of the pivot.
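To illustrate the two normalisation devices just discussed, here is a small sketch (ours; K, the slope, and the document lengths are invented illustration values):

```python
def bm25_tf(n, K=1.0):
    # Saturating BM25-style tf: non-linear growth, lower bound 0.5 for K = 1.
    return n / (n + K)

def pivoted_norm(doc_len, avgdl, slope=0.5):
    # pivoted_norm = (1 - slope) * pivot + slope * old_norm, pivot = avgdl.
    return (1 - slope) * avgdl + slope * doc_len

for n in (1, 2, 5, 20):
    print(n, round(bm25_tf(n), 3))      # 0.5, 0.667, 0.833, 0.952 -- saturates

avgdl = 100.0
for dl in (50.0, 100.0, 200.0):         # short, average, long document
    print(dl, pivoted_norm(dl, avgdl))  # 75.0, 100.0, 150.0
```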
8. SUMMARY AND CONCLUSIONS
We derived, strictly in parallel, the three grand probabilistic retrieval models: BIR, PM, and LM. The parallel derivation in a stringent mathematical formalism highlighted the event spaces, background models, frequencies, probabilities, and term probability interpretations (summarised in figure 1), and the document/query interpretations, retrieval status values, and parameter estimations (summarised in figure 2).

There have been several motivations for this parallel derivation. First of all, the explanation of retrieval models through the probability of relevance involves notations such as P(t|c) for the probability of a term, and this notation is correct for LM but misleading for BIR and PM, since not terms but judgements and frequencies form the event spaces for BIR and PM, respectively. Therefore, this paper clarifies in a mathematical formalism the event spaces and notations of the models.

The second motivation has been the observation that there are ranking equivalences and even equivalences of the models; due to the space restriction, we reported only one equivalence. Equivalences allow for an interesting insight into what the models assume. For example, looking at BIR through LM glasses shows that BIR assumes the probabilities P_LM(t|d), i.e. the probabilities of the terms in the documents, to be a function of the document occurrence in the collection (P_BIR(t|c)) and of other parameters such as the average term frequency in the elite set of the term and the average document length.

The third motivation was born when looking for theoretical arguments for tf-idf. Whereas for BIR and PM the relationship is well researched, this paper highlights through the Poisson bridge the dual representation of tf-idf retrieval applying either BIR or LM parameters.

We carried out this theoretical investigation of the parallel derivation of the probabilistic models to clarify the origin and assumptions of the models, and to find relationships between the models. The result supports the well-defined refinement of the models, and it might support the search for the quarks of IR.

Acknowledgements: Hugo Zaragoza, Stephen Robertson, Arjen de Vries, and Djoerd Hiemstra influenced significantly the content of this paper. Thanks to Hugo for enjoyable and effective workshops, where Hugo questioned Poisson, Stephen challenged event spaces, and Arjen stressed consistent notation. Thanks to Djoerd for the discussion in our IR seminar, and thanks to the reviewer who went so deeply into the notation and formulae in this paper, and provided most valuable feedback.

9. REFERENCES
[1] G. Amati. Probability Models for Information Retrieval based on Divergence from Randomness. PhD thesis, Glasgow University, June 2003.
[2] G. Amati and C. J. van Rijsbergen. Term frequency normalization via Pareto distributions. In F. Crestani, M. Girolami, and C. J. van Rijsbergen, editors, 24th BCS-IRSG European Colloquium on IR Research, Glasgow, Scotland, 2002.
[3] G. Amati and C. J. van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems (TOIS), 20(4):357-389, October 2002.
[4] M. J. Bates. After the dot-bomb: Getting web information retrieval right this time. First Monday, 7(7), 2002.
[5] K. Church and W. Gale. Inverse document frequency (idf): A measure of deviation from Poisson. In Proceedings of the Third Workshop on Very Large Corpora, pages 121-130, 1995.
[6] B. Croft and J. Lafferty, editors. Language Modeling for Information Retrieval. Kluwer, 2003.
[7] A. de Vries and T. Roelleke. Relevance information: A loss of entropy but a gain for idf? In ACM SIGIR, Salvador, Brazil, 2005.
[8] D. Hiemstra. A probabilistic justification for using tf.idf term weighting in information retrieval. International Journal on Digital Libraries, 3(2):131-139, 2000.
[9] K. Sparck Jones, S. E. Robertson, D. Hiemstra, and H. Zaragoza. Language modelling and relevance. In Language Modelling for Information Retrieval, pages 57-70, 2003.
[10] W. Kraaij, T. Westerveld, and D. Hiemstra. The importance of prior probabilities for entry page search. In SIGIR '02: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 27-34, New York, NY, USA, 2002. ACM Press.
[11] J. Lafferty and C. Zhai. Probabilistic Relevance Models Based on Document and Query Generation, chapter 1. In Croft and Lafferty [6], 2002.
[12] E. L. Margulis. N-Poisson document modelling. In N. Belkin, P. Ingwersen, and M. Pejtersen, editors, Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 177-189, New York, 1992.
[13] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 275-281, New York, 1998. ACM.
[14] S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 232-241, London et al., 1994. Springer-Verlag.
[15] S. E. Robertson, S. Walker, and M. M. Hancock-Beaulieu. Large test collection experiments on an operational interactive system: Okapi at TREC. Information Processing and Management, 31:345-360, 1995.
[16] S. E. Robertson. Understanding inverse document frequency: On theoretical arguments for idf. Journal of Documentation, 60:503-520, 2004.
[17] S. E. Robertson and K. Sparck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27:129-146, 1976.
[18] S. Robertson. On event spaces and probabilistic models in information retrieval. Information Retrieval, 8(2):319-329, 2005.
[19] T. Rölleke, T. Tsikrika, and G. Kazai. A general matrix framework for modelling information retrieval. Journal on Information Processing & Management (IP&M), Special Issue on Theory in Information Retrieval, 42(1), 2006.
[20] A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalisation. In H. P. Frei, D. Harman, P. Schäuble, and R. Wilkinson, editors, Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21-39, New York, 1996. ACM.
[21] K. van Rijsbergen. The Geometry of Information Retrieval. Cambridge University Press, 2004.