1
|
- LBSC 796/INFM 718R
- Douglas W. Oard
- Session 10, November 12, 2007
|
2
|
- Questions
- Observable Behavior
- Information filtering
|
3
|
|
4
|
- An abstract problem in which:
- The information need is stable
- Characterized by a “profile”
- A stream of documents is arriving
- Each must either be presented to the user or not
- Introduced by Luhn in 1958
- As “Selective Dissemination of Information”
- Named “Filtering” by Denning in 1975
|
5
|
|
6
|
- Use any information retrieval system
- Boolean, vector space, probabilistic, …
- Have the user specify a “standing query”
- Limit the standing query by date
- On each use, show what has arrived since the last use
|
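The standing-query idea above can be sketched as a toy Python class. Everything here is illustrative, not from the lecture: `StandingQuery`, the document format, and the trivial any-term-overlap matcher (standing in for whatever Boolean, vector-space, or probabilistic matcher the real system would use).

```python
import time

class StandingQuery:
    """Toy standing-query filter: any IR matcher plus a date cutoff.

    Hypothetical sketch: the profile is a set of terms, and a document
    "matches" if it shares at least one term with the profile.
    """

    def __init__(self, terms):
        self.terms = set(terms)
        self.last_use = 0.0  # date limit: show only new arrivals

    def check(self, documents):
        """Return matching documents that arrived since the last use."""
        now = time.time()
        hits = [d for d in documents
                if d["arrived"] > self.last_use
                and self.terms & set(d["text"].lower().split())]
        self.last_use = now  # next call sees only newer documents
        return hits

docs = [
    {"arrived": 1.0, "text": "information filtering with profiles"},
    {"arrived": 2.0, "text": "baseball scores"},
]
q = StandingQuery(["filtering"])
print([d["text"] for d in q.check(docs)])
# → ['information filtering with profiles']
```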
7
|
- Unnecessary indexing overhead
- Indexing only speeds up retrospective searches
- Every profile is treated separately
- The same work might be done repeatedly
- Forming effective queries by hand is hard
- The computer might be able to help
- It is OK for text, but what about audio, video, …
- Are words the only possible basis for filtering?
|
8
|
- Boolean filtering using custom hardware
- Up to 10,000 documents per second (in 1996!)
- Words pass through a pipeline architecture
- Each element looks for one word
|
9
|
- Build an inverted file of profiles
- Postings are profiles that contain each term
- RAM can hold 5 million profiles/GB
- And several machines could run in parallel
- Both Boolean and vector space matching
- User-selected threshold for each ranked profile
- Hand-tuned on a web page using today’s news
|
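The profile-indexing idea can be pictured with a hedged sketch: postings map each term to the profiles containing it, so one pass over a document's terms scores every candidate profile at once. The profile contents and thresholds here are invented for illustration.

```python
from collections import defaultdict

# Hypothetical profiles: a term set plus a per-profile match threshold
profiles = {
    "p1": {"terms": {"election", "senate"}, "threshold": 2},
    "p2": {"terms": {"baseball"}, "threshold": 1},
}

# Inverted file of profiles: postings are profiles containing each term
index = defaultdict(set)
for pid, p in profiles.items():
    for t in p["terms"]:
        index[t].add(pid)

def match(doc_terms):
    """Count term overlaps per profile; deliver if the threshold is met."""
    scores = defaultdict(int)
    for t in doc_terms:
        for pid in index.get(t, ()):
            scores[pid] += 1
    return [pid for pid, s in scores.items()
            if s >= profiles[pid]["threshold"]]

print(match({"election", "senate", "budget"}))
# → ['p1']
```

Because only the document's own terms are probed, the cost per document depends on posting-list lengths rather than on the total number of profiles, which is what makes holding millions of profiles in RAM practical.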
10
|
- Privacy
- Central profile registry, associated with known users
- Usability
- Manual profile creation is time consuming
- May not be kept up to date
- Threshold values vary by topic and lack “meaning”
|
11
|
|
12
|
|
13
|
- IDF estimation
- Unseen profile terms would have infinite IDF!
- Incremental updates, side collection
- Interaction design
- Score threshold, batch updates
- Evaluation
|
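The IDF-estimation problem on this slide can be sketched with a small incremental estimator. The class name and the add-one smoothing are assumptions for illustration; the point is only that smoothing keeps an unseen profile term's IDF large but finite.

```python
import math

class IncrementalIDF:
    """Hedged sketch: estimate IDF from the documents seen so far."""

    def __init__(self):
        self.n_docs = 0
        self.df = {}   # document frequency per term

    def update(self, doc_terms):
        """Fold one arriving document into the statistics."""
        self.n_docs += 1
        for t in set(doc_terms):
            self.df[t] = self.df.get(t, 0) + 1

    def idf(self, term):
        # Add-one smoothing: a term never seen in the stream gets a
        # high but *finite* IDF, instead of log(N/0).
        return math.log((self.n_docs + 1) / (self.df.get(term, 0) + 1))
```

A side collection can be handled the same way: seed `n_docs` and `df` from the side collection before the stream starts.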
14
|
- All learning systems share two problems
- They need some basis for making predictions
- This is called an “inductive bias”
- They must balance adaptation with generalization
|
15
|
- Hill climbing (e.g., Rocchio)
- Instance-based learning (kNN)
- Rule induction
- Statistical classification
- Regression
- Neural networks
- Genetic algorithms
|
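The first technique in the list, Rocchio-style hill climbing, has a compact sketch: nudge the profile vector toward the centroid of relevant documents and away from the non-relevant centroid. The parameter values below are common textbook defaults, not figures from the lecture; vectors are simplified to term-to-weight dicts.

```python
def rocchio(profile, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """One Rocchio update step on a filtering profile.

    profile, and each document in rel/nonrel, is a dict term -> weight.
    alpha/beta/gamma are illustrative defaults.
    """
    terms = set(profile)
    for d in rel + nonrel:
        terms |= set(d)
    new = {}
    for t in terms:
        r = sum(d.get(t, 0.0) for d in rel) / max(len(rel), 1)
        n = sum(d.get(t, 0.0) for d in nonrel) / max(len(nonrel), 1)
        w = alpha * profile.get(t, 0.0) + beta * r - gamma * n
        if w > 0:
            new[t] = w   # negative weights are usually dropped
    return new
```

Repeated over batches of judged documents, this is hill climbing: each step moves the profile in the direction the feedback suggests.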
16
|
- Automatically derived Boolean profiles
- (Hopefully) effective and easily explained
- Specificity from the “perfect query”
- AND terms in a document, OR the documents
- Generality from a bias favoring short profiles
- e.g., penalize rules with more Boolean operators
- Balanced by rewards for precision, recall, …
|
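The "perfect query" construction above can be sketched directly: AND the terms within each relevant document, OR the resulting clauses, and score candidate profiles with a penalty per Boolean operator so that shorter (more general) profiles win. The scoring weights are illustrative assumptions.

```python
def perfect_query(rel_docs):
    """AND the terms in each document, OR the documents.

    rel_docs: list of term sets -> profile as a list of AND-clauses.
    """
    return [frozenset(d) for d in rel_docs]

def matches(query, doc_terms):
    """A document matches if it satisfies any AND-clause."""
    return any(clause <= doc_terms for clause in query)

def score(query, rel_docs, nonrel_docs, penalty=0.05):
    """Reward precision/recall, penalize each Boolean operator."""
    ops = sum(len(c) - 1 for c in query) + (len(query) - 1)  # ANDs + ORs
    tp = sum(matches(query, d) for d in rel_docs)
    fp = sum(matches(query, d) for d in nonrel_docs)
    return tp - fp - penalty * ops
```

A rule-induction search would then mutate these clauses (dropping terms, merging clauses) and keep variants whose score improves, which is where the short-profile bias does its work.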
17
|
- Represent documents as vectors
- Usual approach based on TF, IDF, Length
- Build statistical models of relevant and non-relevant
- e.g., (mixture of) Gaussian distributions
- Find a surface separating the distributions
- Rank documents by distance from that surface
|
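A minimal sketch of the separating-surface idea, under a strong simplifying assumption: if both classes are modeled as Gaussians with equal (here, identity) covariance, the optimal surface is a hyperplane whose normal is the difference of the class centroids. Names and the toy vectors are illustrative.

```python
def centroid(vectors, dim):
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def discriminant(rel, nonrel, dim):
    """Hyperplane separating two identity-covariance Gaussian classes."""
    mu_r, mu_n = centroid(rel, dim), centroid(nonrel, dim)
    w = [a - b for a, b in zip(mu_r, mu_n)]          # surface normal
    b = -sum(wi * (r + n) / 2 for wi, r, n in zip(w, mu_r, mu_n))
    return w, b

def signed_distance(w, b, x):
    """Distance from the surface; positive = relevant side."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

rel = [[1.0, 1.0], [0.8, 1.2]]       # toy training vectors
nonrel = [[0.0, 0.0], [0.2, -0.2]]
w, b = discriminant(rel, nonrel, 2)
docs = [[0.9, 1.1], [0.1, 0.0]]
ranked = sorted(docs, key=lambda x: -signed_distance(w, b, x))
```

Ranking by signed distance from the surface is exactly the last bullet: documents deep on the relevant side come first.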
18
|
- Overtraining can hurt performance
- Performance on training data rises and plateaus
- Performance on new data rises, then falls
- One strategy is to learn less each time
- But it is hard to guess the right learning rate
- Usual approach: Split the training set
- Training set, plus a DevTest set for finding the “new data” peak
|
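The split-and-stop strategy above can be sketched as generic early stopping: keep adapting on the training set, but stop when DevTest performance peaks. `train_step` and `evaluate` are placeholders for any learner and metric; the toy demo (a counter whose held-out score peaks at 5) is invented for illustration.

```python
def train_with_early_stopping(model, train, devtest, train_step, evaluate,
                              max_epochs=100, patience=3):
    """Return the model at the DevTest ('new data') performance peak."""
    best_score, best_model, stale = float("-inf"), model, 0
    for _ in range(max_epochs):
        model = train_step(model, train)
        score = evaluate(model, devtest)
        if score > best_score:
            best_score, best_model, stale = score, model, 0
        else:
            stale += 1              # held-out performance is falling
            if stale >= patience:
                break               # overtraining has set in; stop
    return best_model

# Toy demo: the "model" is a counter, DevTest score peaks at 5
best = train_with_early_stopping(0, None, None,
                                 train_step=lambda m, _: m + 1,
                                 evaluate=lambda m, _: -(m - 5) ** 2)
# best == 5: training score would keep "improving", but we stop here
```

This sidesteps guessing a learning rate: rather than learning less each time, we simply refuse to keep the extra learning once new data stops benefiting.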
19
|
|
20
|
- Social filtering will not work in isolation
- Without ratings, no recommendations
- Without recommendations, we rate nothing
- An initial recommendation strategy is needed
- Popularity
- Stereotypes
- Content-based
|
21
|
- Observe user behavior to infer a set of ratings
- Examine (reading time, scrolling behavior, …)
- Retain (bookmark, save, save & annotate, print, …)
- Refer to (reply, forward, include link, cut & paste, …)
- Some measurements are directly useful
- e.g., use reading time to predict reading time
- Others require some inference
- Should you treat cut & paste as an endorsement?
|
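The inference step on this slide can be sketched as a weighted combination of observed behaviors. The weights below are illustrative guesses, not values from the lecture; note in particular the low weight on cut & paste, reflecting the slide's caution that it may not be an endorsement.

```python
# Hypothetical behavior -> evidence weights (Examine / Retain / Refer to)
BEHAVIOR_WEIGHTS = {
    "long_read": 1.0,   # examine: long reading time
    "bookmark": 2.0,    # retain
    "print": 2.0,       # retain
    "forward": 1.5,     # refer to
    "cut_paste": 0.5,   # weak evidence: may not be an endorsement
}

def inferred_rating(behaviors):
    """Combine observed behaviors into a single implicit rating."""
    return sum(BEHAVIOR_WEIGHTS.get(b, 0.0) for b in behaviors)
```

Directly useful measurements (e.g., reading time predicting reading time) would bypass this table; the weights only matter when behavior must stand in for an explicit rating.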
22
|
|
23
|
- Adversarial IR
- Targeting, probing, spam traps, adaptation cycle
- Compression-based techniques
- Blacklists and whitelists
- Members-only mailing lists, zombies
- Identity authentication
- Sender ID, DKIM, key management
|