1
|
- Session 12
- LBSC 690
- Information Technology
|
2
|
- The search process
- Information retrieval
- Recommender systems
- Evaluation
|
3
|
|
4
|
|
5
|
|
6
|
- Find something that you want
- The information need may or may not be explicit
- Known item search
- Answer seeking
- Is Lexington or Louisville the capital of Kentucky?
- Directed exploration
- Who makes videoconferencing systems?
|
7
|
- The four components of the information retrieval environment:
- User (user needs)
- Process
- System
- Data
|
8
|
|
9
|
|
10
|
|
11
|
- Machines are good at:
- Doing simple things accurately and quickly
- Scaling to larger collections in sublinear time
- People are better at:
- Accurately recognizing what they are looking for
- Evaluating intangibles such as “quality”
- Both are pretty bad at:
- Mapping consistently between words and concepts
|
12
|
|
13
|
- Searching metadata
- Using controlled or uncontrolled vocabularies
- Searching content
- Characterize documents by the words they contain
- Searching behavior
- User-Item: Find similar users
- Item-Item: Find items that cause similar reactions
|
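The "searching behavior" idea above can be sketched as item-item collaborative filtering: two items are similar if the users who rated them reacted similarly. A minimal sketch, using a tiny hypothetical ratings table (the item and user names are illustrative, not from the slides):

```python
from math import sqrt

# Hypothetical ratings: item -> {user: rating}
ratings = {
    "Blade Runner": {"ann": 5, "bob": 3, "eve": 4},
    "Alien":        {"ann": 4, "bob": 2, "eve": 5},
    "Toy Story":    {"ann": 1, "bob": 5},
}

def cosine(a, b):
    """Cosine similarity over the users two items have in common."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    na = sqrt(sum(a[u] ** 2 for u in shared))
    nb = sqrt(sum(b[u] ** 2 for u in shared))
    return dot / (na * nb)

sim = cosine(ratings["Blade Runner"], ratings["Alien"])
```

A user-item recommender works the same way, transposed: compare rows (users) instead of columns (items) and recommend what similar users liked.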
14
|
|
15
|
- Find all documents with some characteristic
- Indexed as “Presidents -- United States”
- Containing the words “Clinton” and “Peso”
- Read by my boss
- A set of documents is returned
- Hopefully, not too many or too few
- Usually listed in date or alphabetical order
|
16
|
- Every information need has a perfect document set
- Finding that set is the goal of search
- Every document set has a perfect query
- AND every word to get a query for document 1
- Repeat for each document in the set
- OR every document query to get the set query
- The problem isn’t the system … it’s the query!
|
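The "perfect query" construction on the slide above can be made concrete: AND together every word of each document, then OR those per-document queries. A minimal sketch over two toy documents (the texts are made up for illustration):

```python
docs = {
    1: "the peso fell against the dollar",
    2: "clinton spoke about the peso crisis",
}

def doc_query(text):
    # AND every distinct word: only this exact document matches
    return "(" + " AND ".join(sorted(set(text.split()))) + ")"

def set_query(doc_ids):
    # OR the per-document queries to match exactly the target set
    return " OR ".join(doc_query(docs[d]) for d in doc_ids)

q = set_query([1, 2])
```

The query exists in principle, but no searcher would ever write it, which is the slide's point: the bottleneck is query formulation, not the system.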
17
|
- Low query construction effort
- 2.35 (often imprecise) terms per query
- 20% use operators
- 22% are subsequently modified
- Low browsing effort
- Only 15% view more than one page
- Most look only “above the fold”
- One study showed that 10% don’t know how to scroll!
|
18
|
- Informational (30-40% of AltaVista queries)
- Navigational
- Find the home page of United Airlines
- Transactional
- Data: What is the weather in Paris?
- Shopping: Who sells a Vaio Z505RX?
- Proprietary: Obtain a journal article
|
19
|
- Put most useful documents near top of a list
- Possibly useful documents go lower in the list
- Users can read down as far as they like
- Based on what they read, time available, ...
- Provides useful results from weak queries
- Untrained users find exact match harder to use
|
20
|
- Assume “most useful” = most similar to query
- Weight terms based on two criteria:
- Repeated words are good cues to meaning
- Rarely used words make searches more selective
- Compare weights with query
- Add up the weights for each query term
- Put the documents with the highest total first
|
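The two weighting criteria on the slide above are the classic tf-idf scheme: term frequency rewards repeated words, inverse document frequency rewards rare ones, and a document's score is the sum over query terms. A minimal sketch, with a made-up three-document collection:

```python
from collections import Counter
from math import log

# Toy collection (texts are illustrative)
docs = {
    "d1": "the peso fell sharply against the dollar",
    "d2": "clinton commented on the peso",
    "d3": "the dollar rose",
}
N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}   # term counts
df = Counter(w for counts in tf.values() for w in counts)     # document frequency

def score(query, doc):
    # Sum tf * idf over query terms: repeated words count more,
    # rarely used words are more selective
    return sum(tf[doc][w] * log(N / df[w]) for w in query.split() if w in df)

ranked = sorted(docs, key=lambda d: score("peso dollar", d), reverse=True)
```

Here "d1" ranks first because it is the only document containing both query terms.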
21
|
|
22
|
- Major factors
- Uncommon terms are more selective
- Repeated terms provide evidence of meaning
- Adjustments
- Give more weight to terms in certain positions
- Title, first paragraph, etc.
- Give less weight to each term in longer documents
- Ignore documents that try to “spam” the index
- Invisible text, excessive use of the “meta” field, …
|
23
|
|
24
|
- Crawl quality
- Comprehensiveness, dead links, duplicate detection
- Document analysis
- Frames, metadata, imperfect HTML, …
- Document extension
- Anchor text, source authority, category, language, …
- Document restriction (ephemeral text suppression)
- Banner ads, keyword spam, …
|
25
|
- Spam suppression
- “Adversarial information retrieval”
- Every source of evidence has been spammed
- Text, queries, links, access patterns, …
- “Family filter” accuracy
- Link analysis can be very helpful
|
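The link analysis the slide above alludes to can be illustrated with a PageRank-style power iteration: a page is important if important pages link to it, which is hard to spam because it depends on the behavior of other sites. A minimal sketch over a hypothetical three-page graph:

```python
# Toy link graph: page -> pages it links to (names are illustrative)
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}
d = 0.85  # damping factor: chance the random surfer follows a link

for _ in range(50):  # power iteration until (approximate) convergence
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += d * rank[p] / len(outs)  # share p's rank among its links
    rank = new
```

Page "c" ends up ranked highest since both "a" and "b" point to it.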
26
|
- A type of “document expansion”
- Terms near links describe content of the target
- Works even when you can’t index content
- Image retrieval, uncrawled links, …
|
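Document expansion via anchor text, as described above, amounts to collecting the text near each link and indexing it under the link's target. A minimal sketch with a regex over a made-up HTML snippet (a real crawler would use an HTML parser):

```python
import re

# Hypothetical crawled page source
html = '<a href="http://united.com">United Airlines home page</a> and more'

# Anchor text keyed by target URL, so a target can be indexed under
# words it never contains itself (images, uncrawled pages, ...)
anchors = {}
for href, text in re.findall(r'<a href="([^"]+)">([^<]+)</a>', html):
    anchors.setdefault(href, []).append(text)
```

A query for "united airlines" can now retrieve http://united.com even if that page was never crawled.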
27
|
|
28
|
|
29
|
- Document image generation model
- A document consists of many layers, such as handwriting, machine-printed text, background patterns, tables, figures, noise, etc.
|
30
|
|
31
|
|
32
|
|
33
|
|
34
|
|
35
|
|
36
|
- Use ratings to describe objects
- Personal recommendations, peer review, …
- Beyond topicality:
- Accuracy, coherence, depth, novelty, style, …
- Has been applied to many modalities
- Books, Usenet news, movies, music, jokes, beer, …
|
37
|
|
38
|
|
39
|
|
40
|
|
41
|
- What can be measured that reflects the searcher’s ability to use a system? (Cleverdon, 1966)
- Coverage of Information
- Form of Presentation
- Effort required/Ease of Use
- Time and Space Efficiency
- Recall
- Precision
|
42
|
- User-centered strategy
- Given several users, and at least 2 retrieval systems
- Have each user try the same task on both systems
- Measure which system works the “best”
- System-centered strategy
- Given documents, queries, and relevance judgments
- Try several variations on the retrieval system
- Measure which ranks more good docs near the top
|
43
|
|
44
|
- Precision
- How much of what was found is relevant?
- Often of interest, particularly for interactive searching
- Recall
- How much of what is relevant was found?
- Particularly important for law, patents, and medicine
|
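The two measures above reduce to simple set arithmetic once relevance judgments are available: precision divides by what was retrieved, recall by what was relevant. A minimal sketch with hypothetical document IDs:

```python
retrieved = {"d1", "d2", "d3", "d4"}   # what the system returned
relevant = {"d2", "d4", "d7"}          # judged relevant (hypothetical)

hits = retrieved & relevant
precision = len(hits) / len(retrieved)  # fraction of results that are relevant
recall = len(hits) / len(relevant)      # fraction of relevant docs that were found
```

Here precision is 2/4 and recall is 2/3; the patent or medical searcher cares most about the missing "d7".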
45
|
|
46
|
|
47
|
- Measure stickiness through frequency of use
- Non-comparative, long-term
- Key factors (from cognitive psychology):
- Worst experience
- Best experience
- Most recent experience
- Highly variable effectiveness is undesirable
- Bad experiences are particularly memorable
|
48
|
- Google: keyword in context
- Microsoft Live: query refinement suggestions
- Exalead: faceted refinement
- Vivisimo/Clusty: clustered results
- Kartoo: cluster visualization
- WebBrain: structure visualization
- Grokker: “map view”
- PubMed: related article search
|
49
|
|