1
|
- LBSC 796/CMSC828o
- Session 3, February 9, 2004
- Douglas W. Oard
|
2
|
- Thinking about search
- Design strategies
- Decomposing the search component
- Boolean “free text” retrieval
- The “bag of terms” representation
- Proximity operators
- Ranked retrieval
- Vector space model
- Passage retrieval
|
3
|
|
4
|
- Foster human-machine synergy
- Exploit complementary strengths
- Accommodate shared weaknesses
- Divide-and-conquer
- Divide task into stages with well-defined interfaces
- Continue dividing until problems are easily solved
- Co-design related components
- Iterative process of joint optimization
|
5
|
- Machines are good at:
- Doing simple things accurately and quickly
- Scaling to larger collections in sublinear time
- People are better at:
- Accurately recognizing what they are looking for
- Evaluating intangibles such as “quality”
- Both are pretty bad at:
- Mapping consistently between words and concepts
|
6
|
- Strategy: use encapsulation to limit complexity
- Approach:
- Define interfaces (input and output) for each component
- Query interface: input terms, output representation
- Define the functions performed by each component
- Remove common words, weight rare terms higher, …
- Repeat the process within components as needed
- Result: a hierarchical decomposition
|
7
|
- Choose the same documents a human would
- Without human intervention (less work)
- Faster than a human could (less time)
- As accurately as possible (less accuracy)
- Humans start with an information need
- Machines start with a query
- Humans match documents to information needs
- Machines match document & query representations
|
8
|
|
9
|
- Relevance relates a topic and a document
- Duplicates are equally relevant, by definition
- Constant over time and across users
- Pertinence relates a task and a document
- Accounts for quality, complexity, language, …
- Utility relates a user and a document
- Accounts for prior knowledge
- We seek utility, but relevance is what we get!
|
10
|
- Bag = a “set” that can contain duplicates
- “The quick brown fox jumped over the lazy dog’s back” →
- {back, brown, dog, fox, jump, lazy, over, quick, the, the}
- Vector = values recorded in any consistent order
- {back, brown, dog, fox, jump, lazy, over, quick, the, the} →
- [1 1 1 1 1 1 1 1 2]
|
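A minimal Python sketch of the mapping above. The stemming is faked with a tiny hand-built table, just enough to reproduce the slide’s example; a real system would use a proper tokenizer and stemmer.

    from collections import Counter

    # Toy normalizer: lowercase, plus a tiny hand-built stemming table
    # (invented here; just enough for the slide's example).
    STEM = {"jumped": "jump", "dog's": "dog"}

    def bag_of_terms(text):
        tokens = [t.lower() for t in text.split()]
        return Counter(STEM.get(t, t) for t in tokens)

    bag = bag_of_terms("The quick brown fox jumped over the lazy dog's back")
    terms = sorted(bag)                # one consistent order: alphabetical
    vector = [bag[t] for t in terms]
    print(terms)   # ['back', 'brown', 'dog', 'fox', 'jump', 'lazy', 'over', 'quick', 'the']
    print(vector)  # [1, 1, 1, 1, 1, 1, 1, 1, 2]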
11
|
|
12
|
- Limit the bag of words to “absent” and “present”
- “Boolean” values, represented as 0 and 1
- Represent terms as a “bag of documents”
- Same representation, but rows rather than columns
- Combine the rows using “Boolean operators”
- Result set: every document with a 1 remaining
|
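A small sketch of this scheme: each term keeps a row of 0/1 incidence bits, one per document, and the operators combine rows bit by bit. The three documents are invented for illustration.

    # Term incidence rows: one 0/1 entry per document.
    docs = ["the quick brown fox", "the lazy dog", "the fox and the dog"]

    index = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.split()):
            index.setdefault(term, [0] * len(docs))[doc_id] = 1

    def AND(a, b): return [x & y for x, y in zip(a, b)]
    def OR(a, b):  return [x | y for x, y in zip(a, b)]
    def NOT(a, b): return [x & (1 - y) for x, y in zip(a, b)]  # a AND NOT b

    # dog AND fox: the result set is every document with a 1 remaining.
    row = AND(index["dog"], index["fox"])
    print([i for i, bit in enumerate(row) if bit])   # [2]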
13
|
|
14
|
- dog AND fox
- dog NOT fox
- fox NOT dog
- dog OR fox
- good AND party
- good AND party NOT over
|
15
|
- Boolean operators approximate natural language
- Find documents about a good party that is not over
- AND can discover relationships between concepts
- OR can discover alternate terminology
- NOT can discover alternate meanings
|
16
|
- Every information need has a perfect doc set
- If not, there would be no point in doing retrieval
- Almost every document set has a perfect query
- AND every word to get a query for document 1
- Repeat for each document in the set
- OR every document query to get the set query
- But users find Boolean query formulation hard
- They get too much, too little, useless stuff, …
|
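The construction argued above can be written down directly. A toy version (documents invented); note that it can over-select when one document’s vocabulary contains another’s, which is part of why the slide says “almost every” document set:

    # AND the words of each document, then OR the per-document queries.
    def perfect_query(documents):
        per_doc = ["(" + " AND ".join(sorted(set(d.split()))) + ")"
                   for d in documents]
        return " OR ".join(per_doc)

    print(perfect_query(["the quick brown fox", "the lazy dog"]))
    # (brown AND fox AND quick AND the) OR (dog AND lazy AND the)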
17
|
- Natural language is way more complex
- She saw the man on the hill with a telescope
- AND “discovers” nonexistent relationships
- Terms in different paragraphs, chapters, …
- Guessing terminology for OR is hard
- good, nice, excellent, outstanding, awesome, …
- Guessing terms to exclude is even harder!
- Democratic party, party to a lawsuit, …
|
18
|
- More precise versions of AND
- “NEAR n” allows at most n-1 intervening terms
- “WITH” requires terms to be adjacent and in order
- Easy to implement, but less efficient
- Store a list of positions for each word in each doc
- Stopwords become very important!
- Perform normal Boolean computations
- Treat WITH and NEAR like AND with an extra constraint
|
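A sketch of proximity matching over a positional index, following the definitions above (NEAR n: positions differ by at most n, in either order; WITH: adjacent and in order). The documents are invented.

    from collections import defaultdict

    docs = ["the quick brown fox", "fox quick and brown"]
    index = defaultdict(dict)          # term -> {doc_id: [positions]}
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(pos)

    def near(t1, t2, n):
        # t1 NEAR n t2: at most n-1 intervening terms, either order.
        return sorted(
            doc for doc in index[t1].keys() & index[t2].keys()
            if any(abs(p1 - p2) <= n
                   for p1 in index[t1][doc] for p2 in index[t2][doc]))

    def with_(t1, t2):
        # t1 WITH t2: adjacent, and in order.
        return sorted(
            doc for doc in index[t1].keys() & index[t2].keys()
            if any(p2 - p1 == 1
                   for p1 in index[t1][doc] for p2 in index[t2][doc]))

    print(near("quick", "fox", 2))   # [0, 1]
    print(with_("brown", "fox"))     # [0]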
19
|
- time AND come
- time (NEAR 2) come
- quick (NEAR 2) fox
- quick WITH fox
|
20
|
- Strong points
- Accurate, if you know the right strategies
- Efficient for the computer
- Weaknesses
- Often results in too many documents, or none
- Users must learn Boolean logic
- Sometimes finds relationships that don’t exist
- Words can have many meanings
- Choosing the right words is sometimes hard
|
21
|
- Exact match retrieval often gives useless sets
- No documents at all, or way too many documents
- Query reformulation is one “solution”
- Manually add or delete query terms
- “Best-first” ranking can be superior
- Select every document within reason
- Put them in order, with the “best” ones first
- Display them one screen at a time
|
22
|
- Closer to the way people think
- Some documents are better than others
- Enriches browsing behavior
- Decide how far down the list to go as you read it
- Allows more flexible queries
- Long and short queries can produce useful results
|
23
|
- “Best first” is easy to say but hard to do!
- The best we can hope for is to approximate it
- Will the user understand the process?
- It is hard to use a tool that you don’t understand
- Efficiency becomes a concern
- Only a problem for long queries, though
|
24
|
- Form several result sets from one long query
- Query for the first set is the AND of all the terms
- Then all but the 1st term, all but the 2nd, …
- Then all but the first two terms, …
- And so on until each single term query is tried
- Remove duplicates from subsequent sets
- Display the sets in the order they were made
- Document rank within a set is arbitrary
|
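A sketch of this procedure. One detail is assumed: within each subset size, itertools yields subsets in lexicographic order, which differs slightly from the drop-the-1st-term-first order described above.

    from itertools import combinations

    def subsets_largest_first(terms):
        for k in range(len(terms), 0, -1):   # all terms, drop 1, drop 2, ...
            yield from combinations(terms, k)

    def ranked_sets(terms, docs):
        seen = set()
        for subset in subsets_largest_first(terms):
            # AND semantics: a document matches if it contains every term.
            new = [d for d in docs
                   if all(t in d.split() for t in subset) and d not in seen]
            seen.update(new)                 # duplicates never reappear
            if new:
                yield subset, new

    docs = ["cheap flights to rome", "flights to rome", "cheap hotels"]
    for subset, hits in ranked_sets(["cheap", "flights", "rome"], docs):
        print(subset, hits)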
25
|
|
26
|
- Treat the query as if it were a document
- Create a query bag-of-words
- Find the similarity of each document
- Using the coordination measure, for example
- Rank order the documents by similarity
- Most similar to the query first
- Surprisingly, this works pretty well!
- Especially for very short queries
|
27
|
- How similar are two documents?
- In particular, how similar is their bag of words?
|
28
|
- Count the number of terms in common
- Based on Boolean bag-of-words
- Documents 2 and 3 share two common terms
- But documents 1 and 2 share no terms at all
- Useful for “more like this” queries
- “more like doc 2” would rank doc 3 ahead of doc 1
- Where have you seen this before?
|
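In code, the coordination measure is just the size of the intersection of the two Boolean bags. The three toy documents below are invented to mirror the slide’s description:

    doc1 = {"quick", "brown", "fox"}
    doc2 = {"lazy", "dog", "back"}
    doc3 = {"dog", "back", "home"}

    def coordination(a, b):
        return len(a & b)            # number of terms in common

    print(coordination(doc2, doc3))  # 2 (dog, back)
    print(coordination(doc1, doc2))  # 0 -- no terms at all

    # "More like doc 2": rank the others by overlap with doc2.
    others = {"doc1": doc1, "doc3": doc3}
    print(sorted(others, key=lambda k: coordination(others[k], doc2),
                 reverse=True))      # ['doc3', 'doc1']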
29
|
|
30
|
- Terms tell us about documents
- If “rabbit” appears a lot, it may be about rabbits
- Documents tell us about terms
- “the” is in every document -- not discriminating
- Documents are most likely described well by rare terms that occur in them frequently
- Higher “term frequency” is stronger evidence
- Low “collection frequency” makes it stronger still
|
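One standard way to make the second kind of evidence concrete (the slide does not commit to a formula) is inverse document frequency: idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number that contain t. Rare terms get high idf; a term like “the” that appears in every document gets idf = log(1) = 0.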
31
|
- Humans look for documents with useful parts
- But probabilities are computed for the whole
- Document lengths vary in many collections
- So probability calculations could be inconsistent
- Two strategies
- Adjust probability estimates for document length
- Divide the documents into equal “passages”
|
32
|
- High term frequency is evidence of meaning
- And high IDF is evidence of term importance
- Recompute the bag-of-words
- Compute TF * IDF for every element
|
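A short sketch of the reweighting step, assuming the standard idf(t) = log(N / df(t)) noted earlier (documents invented):

    import math
    from collections import Counter

    docs = ["the quick brown fox", "the lazy dog", "the fox and the dog"]
    bags = [Counter(d.split()) for d in docs]

    N = len(bags)
    df = Counter(t for bag in bags for t in bag)   # document frequency

    weighted = [{t: tf * math.log(N / df[t]) for t, tf in bag.items()}
                for bag in bags]
    print(weighted[0])   # "the" gets weight 0: it occurs in every document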
33
|
- Unweighted queries
- Add up the weights for every matching term
- User specified query term weights
- For each term, multiply the query and doc weights
- Then add up those values
- Automatically computed query term weights
- Most queries lack useful TF, but IDF may be useful
- Used just like user-specified query term weights
|
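The scoring variants above, sketched over dictionaries of term weights (the example weights are invented):

    # doc_w maps each document term to its weight (e.g., TF*IDF).
    def score_unweighted(query_terms, doc_w):
        # Unweighted query: add up the weights of every matching term.
        return sum(doc_w.get(t, 0.0) for t in query_terms)

    def score_weighted(q_w, doc_w):
        # Weighted query: multiply query and document weights, then add.
        return sum(w * doc_w.get(t, 0.0) for t, w in q_w.items())

    doc_w = {"quick": 0.41, "fox": 0.41, "the": 0.0}
    print(score_unweighted(["quick", "fox"], doc_w))          # about 0.82
    print(score_weighted({"quick": 2.0, "fox": 1.0}, doc_w))  # about 1.23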
34
|
|
35
|
- Long documents have an unfair advantage
- They use a lot of terms
- So they get more matches than short documents
- And they use the same words repeatedly
- So they have much higher term frequencies
- Normalization seeks to remove these effects
- Related somehow to maximum term frequency
- But also sensitive to the number of terms
|
36
|
- Compute the length of each document vector
- Multiply each weight by itself
- Add all the resulting values
- Take the square root of that sum
- Divide each weight by that length
|
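The recipe above, step by step (weights invented; the result is 0.6 and 0.8 because the vector [3, 4] has length 5):

    import math

    def normalize(weights):
        # Square each weight, sum, take the square root: the vector length.
        length = math.sqrt(sum(w * w for w in weights.values()))
        # Then divide each weight by that length.
        return {t: w / length for t, w in weights.items()}

    print(normalize({"fox": 3.0, "quick": 4.0}))
    # {'fox': 0.6, 'quick': 0.8}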
37
|
|
38
|
|
39
|
- Think of a document as a vector from zero
- Similarity is the angle between two vectors
- Small angle = very similar
- Large angle = little similarity
- Passes some key sanity checks
- Depends on pattern of word use but not on length
- Every document is most similar to itself
|
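A sketch of the cosine measure, with the two sanity checks from the slide:

    import math

    # cos(a, b) = (a . b) / (|a| * |b|)
    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb)

    d = {"fox": 1.0, "quick": 1.0}
    print(cosine(d, d))                                 # 1.0: most similar to itself
    print(cosine(d, {t: 5 * w for t, w in d.items()}))  # 1.0: length does not matter
    print(cosine(d, {"dog": 1.0}))                      # 0.0: no pattern in common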
40
|
|
41
|
- Another approach to long-document problem
- Break it up into coherent units
- Recognizing topic boundaries is hard
- But overlapping 300-word passages work fine
- Document rank is best passage rank
- And passage information can help guide browsing
|
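A sketch of passage splitting and best-passage scoring. The 300-word size comes from the slide; the 50% overlap (a step of 150 words) is an assumption:

    def passages(words, size=300, step=150):
        # Fixed-size windows that overlap by (size - step) words.
        if len(words) <= size:
            return [words]
        return [words[i:i + size] for i in range(0, len(words) - step, step)]

    # Document rank is best passage rank: score every passage with any
    # ranking function (e.g., cosine against the query) and keep the max.
    def best_passage_score(doc_text, score_fn):
        return max(score_fn(p) for p in passages(doc_text.split()))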
42
|
- Goal: find documents most similar to the query
- Compute normalized document term weights
- Some combination of TF, DF, and Length
- Optionally, get query term weights from the user
- Estimate of term importance
- Compute inner product of query and doc vectors
- Multiply corresponding elements and then add
|
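Putting the whole recipe together in one small end-to-end sketch (toy documents; the standard idf is assumed, and idf-0 terms are dropped):

    import math
    from collections import Counter

    docs = ["the quick brown fox", "the lazy dog", "the fox and the dog"]
    bags = [Counter(d.split()) for d in docs]
    N = len(bags)
    df = Counter(t for bag in bags for t in bag)

    def weigh(bag):
        # TF*IDF, then divide by vector length.
        w = {t: tf * math.log(N / df[t])
             for t, tf in bag.items() if 0 < df[t] < N}
        norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
        return {t: x / norm for t, x in w.items()}

    doc_vecs = [weigh(bag) for bag in bags]
    query = weigh(Counter("quick fox".split()))

    # Inner product: multiply corresponding elements and then add.
    scores = [(sum(q * dv.get(t, 0.0) for t, q in query.items()), i)
              for i, dv in enumerate(doc_vecs)]
    for score, i in sorted(scores, reverse=True):
        print(round(score, 3), docs[i])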
43
|
- On a sheet of paper, please briefly answer the following question (no names):
- What was the muddiest point in today’s lecture?
|