1
|
- Session 12
- LBSC 690
- Information Technology
|
2
|
- The search process
- Information retrieval
- Recommender systems
- Evaluation
|
3
|
- Find something that you want
- The information need may or may not be explicit
- Known item search
- Answer seeking
- Is Lexington or Louisville the capital of Kentucky?
- Directed exploration
- Who makes videoconferencing systems?
|
7
|
- Machines are good at:
- Doing simple things accurately and quickly
- Scaling to larger collections in sublinear time
- People are better at:
- Accurately recognizing what they are looking for
- Evaluating intangibles such as “quality”
- Both are pretty bad at:
- Mapping consistently between words and concepts
|
9
|
- Searching metadata
- Using controlled or uncontrolled vocabularies
- Free text
- Characterize documents by the words they contain
- Social filtering
- Exchange and interpret personal ratings
|
10
|
- Find all documents with some characteristic
- Indexed as “Presidents -- United States”
- Containing the words “Clinton” and “Peso”
- Read by my boss
- A set of documents is returned
- Hopefully, not too many or too few
- Usually listed in date or alphabetical order
|
11
|
- Put most useful documents near top of a list
- Possibly useful documents go lower in the list
- Users can read down as far as they like
- Based on what they read, time available, ...
- Provides useful results from weak queries
- Untrained users find exact match harder to use
|
12
|
- Assume “most useful” = most similar to query (see the sketch after this slide)
- Weight terms based on two criteria:
- Repeated words are good cues to meaning
- Rarely used words make searches more selective
- Compare weights with query
- Add up the weights for each query term
- Put the documents with the highest total first
|
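The procedure on this slide is essentially TF-IDF scoring. A minimal sketch, assuming a toy in-memory collection; the document texts, names, and whitespace tokenizer are invented for illustration:

```python
import math
from collections import Counter

def rank(query, docs):
    """Rank docs (id -> text) by summed TF-IDF weight of the query terms."""
    n = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    # Document frequency: how many documents contain each term.
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))
    scores = {}
    for d, terms in tokenized.items():
        tf = Counter(terms)
        # Repeated words are good cues (tf); rare words are selective (idf).
        scores[d] = sum(
            tf[t] * math.log(n / df[t])
            for t in query.lower().split() if t in df
        )
    # Put the documents with the highest total first.
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "d1": "the peso fell against the dollar",
    "d2": "clinton commented on the peso crisis",
    "d3": "the weather in paris",
}
print(rank("clinton peso", docs))  # d2 first: it matches both terms
```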
14
|
- Major factors
- Uncommon terms are more selective
- Repeated terms provide evidence of meaning
- Adjustments
- Give more weight to terms in certain positions
- Title, first paragraph, etc.
- Give less weight to each term in longer documents
- Ignore documents that try to “spam” the index
- Invisible text, excessive use of the “meta” field, …
|
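One common way to write a weight combining these factors, shown as an illustration rather than the lecture’s exact formula: term frequency normalized by document length, scaled by an inverse-document-frequency factor.

```latex
% Illustrative TF-IDF weight for term t in document d (not necessarily
% the exact formulation from the lecture):
%   tf_{t,d}: occurrences of t in d,  |d|: length of d,
%   N: collection size,  df_t: number of documents containing t.
\[
  w_{t,d} \;=\; \frac{\mathrm{tf}_{t,d}}{|d|} \cdot \log\frac{N}{\mathrm{df}_t}
\]
```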
16
|
- Crawl quality
- Comprehensiveness, dead links, duplicate detection
- Document analysis
- Frames, metadata, imperfect HTML, …
- Document extension
- Anchor text, source authority, category, language, …
- Document restriction (ephemeral text suppression)
- Banner ads, keyword spam, …
|
17
|
- A type of “document expansion”
- Terms near links describe content of the target
- Works even when you can’t index content
- Image retrieval, uncrawled links, …
|
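A toy sketch of this kind of document expansion, assuming crawler output in the form of (source page, target URL, anchor text) triples; all URLs and anchor strings here are invented:

```python
from collections import defaultdict

# Hypothetical crawl output: (source_page, target_url, anchor_text).
links = [
    ("pageA.html", "http://example.com/logo.png", "company logo image"),
    ("pageB.html", "http://example.com/logo.png", "our new logo"),
    ("pageC.html", "http://example.com/jobs", "open positions"),
]

# Extend each target document with the words used to link to it.
expansion_terms = defaultdict(list)
for source, target, anchor in links:
    expansion_terms[target].extend(anchor.lower().split())

# The image now carries indexable text even though it was never crawled.
print(expansion_terms["http://example.com/logo.png"])
```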
18
|
- Low query construction effort
- 2.35 (often imprecise) terms per query
- 20% use operators
- 22% are subsequently modified
- Low browsing effort
- Only 15% view more than one page
- Most look only “above the fold”
- One study showed that 10% don’t know how to scroll!
|
19
|
- Informational (30-40% of AltaVista queries)
- Navigational
- Find the home page of United Airlines
- Transactional
- Data: What is the weather in Paris?
- Shopping: Who sells a Vaio Z505RX?
- Proprietary: Obtain a journal article
|
23
|
- Use ratings to describe objects
- Personal recommendations, peer review, …
- Beyond topicality:
- Accuracy, coherence, depth, novelty, style, …
- Has been applied to many modalities
- Books, Usenet news, movies, music, jokes, beer, …
|
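A minimal user-based collaborative-filtering sketch of this rating-exchange idea, with invented users and ratings; the predicted rating for an unseen item is a similarity-weighted average of other users’ ratings for it:

```python
import math

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "ann": {"book1": 5, "book2": 3, "movie1": 4},
    "bob": {"book1": 4, "book2": 2, "movie2": 5},
    "cat": {"book2": 5, "movie1": 2, "movie2": 1},
}

def cosine(u, v):
    """Cosine similarity over the items two users both rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    nu = math.sqrt(sum(u[i] ** 2 for i in shared))
    nv = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (nu * nv)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for item."""
    pairs = [
        (cosine(ratings[user], ratings[other]), r[item])
        for other, r in ratings.items()
        if other != user and item in r
    ]
    total = sum(sim for sim, _ in pairs)
    return sum(sim * r for sim, r in pairs) / total if total else None

print(predict("ann", "movie2"))  # bob's and cat's ratings, weighted
```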
28
|
- Browsing histories are easily captured
- Send all links to a central site
- Record the source and destination pages and the user’s cookie
- Redirect the browser to the desired page
- Reading time is correlated with interest
- Can be used to build individual profiles
- Used to target advertising by doubleclick.com
|
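A bare-bones sketch of the redirect trick described above, using only the Python standard library; the query-parameter names and port are invented, and a real service would write to a log store rather than print:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class ClickLogger(BaseHTTPRequestHandler):
    """Log the click, then send the browser on to the real page."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        # Hypothetical parameters: ?to=<target>&from=<source>.
        target = query.get("to", ["/"])[0]
        source = query.get("from", ["unknown"])[0]
        cookie = self.headers.get("Cookie", "no-cookie")
        print(f"click: {source} -> {target} by {cookie}")
        # 302 redirect to the page the user actually wanted.
        self.send_response(302)
        self.send_header("Location", target)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), ClickLogger).serve_forever()
```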
31
|
- Web Pages (using spatial layout)
- Images (based on image similarity)
- http://elib.cs.berkeley.edu/photos/blobworld/
- Multimedia (based on metadata)
- Movies (based on recommendations)
- http://www.movielens.umn.edu
- Grey literature (based on citations)
- http://citeseer.ist.psu.edu/
|
32
|
- What can be measured that reflects the searcher’s ability to use a system? (Cleverdon, 1966)
- Coverage of Information
- Form of Presentation
- Effort required/Ease of Use
- Time and Space Efficiency
- Recall
- Precision
|
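The last two measures have standard set-based definitions:

```latex
% R = relevant documents in the collection, A = documents retrieved.
\[
  \text{Precision} = \frac{|R \cap A|}{|A|}
  \qquad
  \text{Recall} = \frac{|R \cap A|}{|R|}
\]
```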
35
|
- Measure stickiness through frequency of use
- Non-comparative, long-term
- Key factors (from cognitive psychology):
- Worst experience
- Best experience
- Most recent experience
- Highly variable effectiveness is undesirable
- Bad experiences are particularly memorable
|
36
|
- Spam suppression
- “Adversarial information retrieval”
- Every source of evidence has been spammed
- Text, queries, links, access patterns, …
- “Family filter” accuracy
- Link analysis can be very helpful
|