1
|
- LBSC 796/CMSC 828o
- Session 4, February 16, 2004
- Douglas W. Oard
|
2
|
|
3
|
- Why distinguish utility and relevance?
- How the coordination measure yields ranked Boolean retrieval
- Why use term weights?
- The meaning of DF
- The problem with log(1) = 0
- How the vectors are built
- How to do cosine normalization (5)
- Why to do cosine normalization (2)
- Okapi graphs
|
4
|
|
5
|
- We ask “is this document relevant?”
- Vector space: we answer “somewhat”
- Probabilistic: we answer “probably”
- The key is to know what “probably” means
- First, we’ll formalize that notion
- Then we’ll apply it to retrieval
|
6
|
- What is probability?
- Statistical: relative frequency as n → ∞
- Subjective: degree of belief
- Thinking statistically
- Imagine a finite amount of “stuff”
- Associate the number 1 with the total amount
- Distribute that “mass” over the possible events (see the sketch after this slide)
|
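A minimal Python sketch of the “thinking statistically” view above: associate a total mass of 1 with everything that can happen and distribute it over the possible events. The event names and numbers are made up for illustration.

```python
# Distribute a total probability mass of 1.0 over the possible events.
# The events and numbers below are illustrative, not from the lecture.
mass = {
    "rain": 0.3,
    "snow": 0.1,
    "clear": 0.6,
}

# Defining properties: every mass is non-negative and the masses sum to 1.
assert all(p >= 0 for p in mass.values())
assert abs(sum(mass.values()) - 1.0) < 1e-9

# The probability of a compound event is the total mass of its members.
p_precipitation = mass["rain"] + mass["snow"]
print(p_precipitation)  # 0.4
```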
7
|
- A and B are independent if and only if:
- P(A and B) = P(A) × P(B)
- Independence formalizes “unrelated”
- P(“being brown eyed”) = 85/100
- P(“being a doctor”) = 1/1000
- P(“being a brown eyed doctor”) = 85/100,000 (see the sketch after this slide)
|
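A quick arithmetic check of the independence definition using the numbers on the slide above: if eye color and profession are unrelated, the joint probability is just the product of the two.

```python
# Independence: P(A and B) = P(A) * P(B) when A and B are unrelated.
p_brown_eyed = 85 / 100    # P("being brown eyed")
p_doctor = 1 / 1000        # P("being a doctor")

p_brown_eyed_doctor = p_brown_eyed * p_doctor
print(p_brown_eyed_doctor)  # 0.00085 (= 85/100,000)
```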
8
|
- Suppose
- P(“having a B.S. degree”) = 2/10
- P(“being a doctor”) = 1/1000
- Would you expect
- P(“having a B.S. degree and being a doctor”)
- = 2/10,000 ???
- Extreme example:
- P(“being a doctor”) = 1/1000
- P(“having studied anatomy”) = 12/1000
|
9
|
- P(A | B) ≡ P(A and B) / P(B)
|
10
|
- Suppose
- P(“having studied anatomy”) = 12/1000
- P(“being a doctor and having studied anatomy”) = 1/1000
- Consider
- P(“being a doctor” | “having studied anatomy”) = 1/12
- But if you assume all doctors have studied anatomy
- P(“having studied anatomy” | “being a doctor”) = 1 (see the sketch after this slide)
|
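A small sketch of the conditional-probability definition, P(A | B) = P(A and B) / P(B), applied to the anatomy example on the slide above.

```python
# P(A | B) = P(A and B) / P(B)
p_anatomy = 12 / 1000              # P("having studied anatomy")
p_doctor_and_anatomy = 1 / 1000    # P("being a doctor and having studied anatomy")

# Probability of being a doctor, given having studied anatomy.
p_doctor_given_anatomy = p_doctor_and_anatomy / p_anatomy
print(p_doctor_given_anatomy)      # 1/12, about 0.083

# If every doctor has studied anatomy, then
# P("doctor and anatomy") = P("doctor"), so the reverse conditional is 1.
p_doctor = 1 / 1000
p_anatomy_given_doctor = p_doctor / p_doctor
print(p_anatomy_given_doctor)      # 1.0
```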
11
|
- Consider
- A set of hypotheses: H1, H2, H3
- Some observable evidence O
- P(O|H1) = probability of O being observed if we knew H1 were true
- P(O|H2) = probability of O being observed if we knew H2 were true
- P(O|H3) = probability of O being observed if we knew H3 were true
|
12
|
- Let
- O = “Joe earns more than $80,000/year”
- H1 = “Joe is a doctor”
- H2 = “Joe is a college professor”
- H3 = “Joe works in food services”
- Suppose we do a survey and we find out
- P(O|H1) = 0.6
- P(O|H2) = 0.07
- P(O|H3) = 0.001
- What should be our guess about Joe’s profession?
|
13
|
- What’s P(H1|O)?  P(H2|O)?  P(H3|O)?
- Theorem (Bayes’ rule): P(Hi | O) = P(O | Hi) × P(Hi) / P(O)
|
14
|
- Suppose we also have good data about priors:
- P(O|H1) = 0.6     P(H1) = 0.0001   (doctor)
- P(O|H2) = 0.07    P(H2) = 0.001    (prof)
- P(O|H3) = 0.001   P(H3) = 0.2      (food)
- We can calculate
- P(H1|O) = 0.00006 / P(“earning > $80,000/year”)
- P(H2|O) = 0.00007 / P(“earning > $80,000/year”)
- P(H3|O) = 0.0002 / P(“earning > $80,000/year”) (worked through in the sketch after this slide)
|
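A sketch of the Bayes’-rule calculation on the slide above: multiply each likelihood P(O|Hi) by its prior P(Hi). The common divisor P(O) is the same for every hypothesis, so these products already determine the best guess; normalizing at the end assumes, purely for illustration, that H1–H3 are the only possibilities.

```python
# Likelihoods P(O|Hi) and priors P(Hi) from the slide.
hypotheses = {
    "doctor":        (0.6,   0.0001),
    "professor":     (0.07,  0.001),
    "food services": (0.001, 0.2),
}

# Numerator of Bayes' rule: P(O|Hi) * P(Hi).  P(O) divides every score
# equally, so the ranking is already decided here.
scores = {h: lik * prior for h, (lik, prior) in hypotheses.items()}
print(scores)   # roughly: doctor 0.00006, professor 0.00007, food services 0.0002

best = max(scores, key=scores.get)
print(best)     # food services -- the priors reverse the likelihood-only guess

# If H1..H3 were the only possibilities (an assumption made here for
# illustration), P(O) would be the sum of the scores and the posteriors
# would sum to 1.
p_o = sum(scores.values())
posteriors = {h: s / p_o for h, s in scores.items()}
```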
15
|
- Defining probability using frequency
- Statistical independence
- Conditional probability
- Bayes’ rule
|
16
|
|
17
|
- Assume binary relevance/document independence
- Each document is either relevant or it is not
- Relevance of one doc reveals nothing about another
- Assume the searcher works down a ranked list
- Seeking some number of relevant documents
- Theorem (provable from assumptions):
- Documents should be ranked in order of decreasing probability of relevance to the query,
- P(d relevant-to q)
|
18
|
- Estimate how terms contribute to relevance
- How do TF, DF, and length influence your judgments about document relevance? (e.g., Okapi; see the BM25 sketch after this slide)
- Combine to find document relevance probability
- Order documents by decreasing probability
|
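The slide above mentions Okapi; the sketch below shows an Okapi BM25-style weight, one common way to turn TF, DF, and document length into a per-term contribution and then a document score. The parameter defaults k1 = 1.2 and b = 0.75 are conventional choices, not values from the lecture.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    """Okapi BM25-style score of one document for a bag-of-words query.

    query_terms: list of query terms
    doc_terms:   list of terms in the document
    doc_freq:    dict mapping term -> number of documents containing it (DF)
    num_docs:    total number of documents in the collection
    avg_len:     average document length in the collection
    """
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)           # term frequency in this document
        df = doc_freq.get(term, 0)
        if tf == 0 or df == 0:
            continue
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)   # rare terms count more
        norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_len))
        score += idf * norm_tf               # long documents are normalized down
    return score
```

Documents would then be sorted by decreasing score, in line with the ranking principle on slide 17.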
19
|
|
20
|
- Binary refers again to binary relevance
- Assume “term independence”
- Presence of one term tells nothing about another
- Assume “uniform priors”
- P(d) is the same for all d
|
21
|
|
22
|
- Models probability of generating any string
|
23
|
- Models probability of generating any string (see the unigram sketch after this slide)
|
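A minimal sketch of a unigram language model of the kind the two slides above refer to: estimate term probabilities from a document’s counts, and the probability of generating any string is the product of the probabilities of its terms (word order is ignored). The toy text is made up.

```python
from collections import Counter

def unigram_model(text):
    """Maximum-likelihood unigram model: P(term) = count(term) / total terms."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def generation_probability(model, string):
    """Probability of generating `string`, one term at a time, under the model."""
    p = 1.0
    for term in string.split():
        p *= model.get(term, 0.0)   # an unseen term gives probability 0 -- hence smoothing
    return p

model = unigram_model("the quick brown fox jumps over the lazy dog")
print(generation_probability(model, "the fox"))   # (2/9) * (1/9)
print(generation_probability(model, "the cat"))   # 0.0 -- motivates smoothing (slide 25)
```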
24
|
- Treat each document as the basis for a model
- Rank document d based on P(d | q)
- P(d | q) = P(q | d) × P(d) / P(q)
- P(q) is same for all documents, can’t change ranks
- P(d) [the prior] is often treated as the same for all d
- But we could use criteria like authority, length, genre
- P(q | d) is the probability of q given d’s model
- Same as ranking by P(q | d)
|
25
|
- Build a smoothed language model for d
- Count the frequency of each term in d
- Count the frequency of each term in the collection
- Combine the two in some way
- Redistribute probabilities to unobserved events
- Example: add 1 to every count
- Combine the probability for the full query
- Summing over the terms in q is a soft “OR” (see the query-likelihood sketch after this slide)
|
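A sketch of the recipe on the slide above, using the add-1 (Laplace) smoothing given as the example: count terms in the document, give every vocabulary term one extra count so unobserved terms keep some probability, then combine the per-term probabilities for the full query. Here the combination sums log probabilities (equivalent to multiplying them), one common query-likelihood choice; mixing document and collection counts, as the slide also suggests, is a frequent alternative to add-1.

```python
from collections import Counter
import math

def query_likelihood(query, doc, vocab):
    """Score a document by P(q | d) under an add-one-smoothed unigram model.

    query: list of query terms
    doc:   list of terms in the document
    vocab: set of all terms seen in the collection
    """
    counts = Counter(doc)
    log_p = 0.0
    for term in query:
        # Add-one smoothing: every term, observed in d or not, gets a nonzero count.
        p_term = (counts[term] + 1) / (len(doc) + len(vocab))
        log_p += math.log(p_term)   # log of the product over query terms
    return log_p                    # higher is better; rank documents by this

# Toy collection, purely illustrative.
docs = {
    "d1": "the quick brown fox".split(),
    "d2": "the lazy dog sleeps".split(),
}
vocab = {t for terms in docs.values() for t in terms}
query = "quick fox".split()
ranked = sorted(docs, key=lambda d: query_likelihood(query, docs[d], vocab), reverse=True)
print(ranked)   # ['d1', 'd2']
```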
26
|
- Probabilistic methods formalize assumptions
- Binary relevance
- Document independence
- Term independence
- Uniform priors
- Top-down scan
- Natural framework for combining evidence
|
27
|
- A flexible way of combining term weights
- Boolean model
- Binary independence model
- Probabilistic models with weaker assumptions
- Key concept: rank based on P(d | q)
- P(d | q) = P(q | d) × P(d) / P(q)
- Efficient large-scale implementation
- InQuery text retrieval system from U Mass
|
28
|
|
29
|
|
30
|
- Turn on exactly one document at a time
- Boolean: Every connected term turns on
- Binary Ind: Connected terms gain their weight
- Compute the query value
- Boolean: AND and OR nodes use truth tables
- Binary Ind: Fraction of the possible weight (see the sketch after this slide)
|
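A sketch of the evaluation loop described on the slide above, for the binary-independence reading: turn on one document at a time, let each connected (matching) query term contribute its weight, and score the document as the fraction of the possible weight. The term weights and documents below are hypothetical.

```python
def score_document(doc_terms, query_weights):
    """Turn on one document: matching query terms contribute their weights.

    doc_terms:     set of terms appearing in the document
    query_weights: dict mapping query term -> weight
    Returns the fraction of the total possible weight that was gained.
    """
    gained = sum(w for term, w in query_weights.items() if term in doc_terms)
    possible = sum(query_weights.values())
    return gained / possible

# Hypothetical query term weights (e.g., IDF-like values).
query_weights = {"probabilistic": 2.0, "retrieval": 1.0, "model": 0.5}

docs = {
    "d1": {"probabilistic", "retrieval", "evaluation"},
    "d2": {"retrieval", "model"},
}
for name, terms in docs.items():
    print(name, round(score_document(terms, query_weights), 3))
# d1 0.857   d2 0.429
```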
31
|
- Most of the assumptions are not satisfied!
- Searchers want utility, not relevance
- Relevance is not binary
- Terms are clearly not independent
- Documents are often not independent
- The best known term weights are quite ad hoc
- Unless some relevant documents are known
|
32
|
- Ranked retrieval paradigm is powerful
- Well suited to human search strategies
- Probability theory has explanatory power
- At least we know where the weak spots are
- Probabilities are good for combining evidence
- Good implementations exist (InQuery, Lemur)
- Effective, efficient, and large-scale
|
33
|
- Similar in some ways
- Term weights can be based on frequency
- Terms often used as if they were independent
- Different in others
- Based on probability rather than similarity
- Intuitions are probabilistic rather than geometric
|
34
|
- Which assumption underlying the probabilistic retrieval model causes you the most concern, and why?
|