1
|
- Week 6
- LBSC 796/INFM 718R
- October 15, 2007
|
2
|
- Affordable storage
- Adequate backbone capacity
- 25,000 simultaneous transfers by 1995
- Adequate “last mile” bandwidth
- 1 second/screen (of text) by 1995
- Display capability
- 10% of US population could see images by 1995
- Effective search capabilities
- Lycos and Yahoo! achieved useful scale in 1994-1995
|
3
|
- HTTP, HTML, or URL?
- Static, dynamic or streaming?
- Public, protected, or internal?
|
4
|
|
5
|
- Any server at port 80?
- Misses many servers at other ports
- Some servers host unrelated content
- Some content requires specialized servers
|
6
|
|
7
|
|
8
|
- Put a set of known sites on a queue
- Repeat until the queue is empty:
- Take the first page off of the queue
- Check to see if this page has been processed
- If this page has not yet been processed:
- Add this page to the index
- Add each link on the current page to the queue
- Record that this page has been processed
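A minimal sketch of this queue-based crawl, assuming helper callables `fetch_links(url)` (returns the URLs linked from a page) and `index(url)` (records the page in the index) — both are illustrative stand-ins, not named on the slide:

```python
from collections import deque

def crawl(seed_urls, fetch_links, index):
    """Breadth-first crawl following the queue procedure above."""
    queue = deque(seed_urls)            # put a set of known sites on a queue
    processed = set()
    while queue:                        # repeat until the queue is empty
        url = queue.popleft()           # take the first page off of the queue
        if url in processed:            # check if already processed
            continue
        index(url)                      # add this page to the index
        queue.extend(fetch_links(url))  # add each link on the page to the queue
        processed.add(url)              # record that this page has been processed
    return processed
```

Note that the "already processed" check is what keeps the crawl from looping forever on cyclic links.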
|
9
|
|
10
|
|
11
|
|
12
|
- Discovering “islands” and “peninsulas”
- Duplicate and near-duplicate content
- Server and network loads
- Dynamic content generation
- Link rot
- Temporary server interruptions
|
13
|
- Structural
- Identical directory structure (e.g., mirrors, aliases)
- Syntactic
- Identical bytes
- Identical markup (HTML, XML, …)
- Semantic
- Identical content
- Similar content (e.g., with a different banner ad)
- Related content (e.g., translated)
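The "similar content" case is often detected by comparing word shingles under Jaccard similarity; the slides do not name a technique, so this sketch is illustrative:

```python
def shingles(text, k=3):
    """Set of overlapping k-word windows from a text."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of two shingle sets."""
    return len(a & b) / len(a | b)
```

Two pages that differ only in a banner ad share most of their shingles, so their Jaccard score stays high while byte-level comparison fails.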
|
14
|
- Depends on voluntary compliance by crawlers
- Exclusion by site
- Create a robots.txt file at the server’s top level
- Indicate which directories not to crawl
- Exclusion by document (in HTML head)
- Not implemented by all crawlers
- <meta name="robots" content="noindex,nofollow">
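A compliant crawler can check its requests against such a file with Python's standard library; the example.com URLs below are hypothetical:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([                 # parse robots.txt rules (normally fetched from the site)
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("*", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("*", "http://example.com/public/page.html"))   # True
```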
|
15
|
- Search is user-controlled suppression
- Everything is known to the search system
- Goal: avoid showing things the user doesn’t want
- Other stakeholders have different goals
- Authors risk little by wasting your time
- Marketers hope for serendipitous interest
- Metadata from trusted sources is more reliable
|
16
|
- Goal: Manipulate rankings of an IR system
- Multiple strategies:
- Create bogus user-assigned metadata
- Add invisible text (font in background color, …)
- Alter your text to include desired query terms
- “Link exchanges” create links to your page
|
17
|
- Web crawls since 1997
- Check out Maryland’s Web site in 1997
- Check out the history of your favorite site
|
18
|
|
19
|
- We ask “is this document relevant?”
- Vector space: we answer “somewhat”
- Probabilistic: we answer “probably”
- The key is to know what “probably” means
- First, we’ll formalize that notion
- Then we’ll apply it to ranking
|
20
|
- Build a model for every document
- Rank document d based on P(M_d | q)
- Expand using Bayes’ Theorem
|
21
|
|
22
|
|
23
|
- Assume binary relevance, document independence
- Each document is either relevant or it is not
- Relevance of one doc reveals nothing about another
- Assume the searcher works down a ranked list
- Seeking some number of relevant documents
- Documents should be ranked in order of decreasing probability of relevance to the query
|
24
|
- Suppose there’s a horrible, but very rare disease
- But there’s a very accurate test for it
- Unfortunately, you tested positive…
|
25
|
- You want to find
- But you only know
- How rare the disease is
- How accurate the test is
- Use Bayes’ Theorem (hence Bayesian Inference)
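With illustrative numbers (the slides give none, so the prevalence and accuracy below are assumptions), Bayes' Theorem shows why a positive result from an accurate test can still leave the disease unlikely:

```python
# Assumed numbers: disease strikes 1 in 10,000; test is 99% accurate
# in both directions (sensitivity and specificity).
p_disease = 0.0001
p_pos_given_disease = 0.99   # P(+ | disease), sensitivity
p_pos_given_healthy = 0.01   # P(+ | healthy), false-positive rate

# Bayes' Theorem: P(disease | +) = P(+ | disease) P(disease) / P(+)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
# ~0.0098: under 1%, because the disease's rarity (the prior)
# overwhelms the test's accuracy (the likelihood)
```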
|
26
|
|
27
|
|
28
|
- Probability distribution over strings of text
- How likely is a string in a given “language”?
- Probabilities depend on what language we’re modeling
|
29
|
- Assume each word is generated independently
- Obviously, this is not true…
- But it seems to work well in practice!
- The probability of a string, given a model:
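Under that independence assumption the probability of a string is the product of per-word probabilities, in practice summed in log space to avoid underflow; the toy model below is an assumption for illustration:

```python
import math

def string_log_prob(words, model):
    # log P(w1 ... wn | M) = sum of log P(wi | M), by word independence
    return sum(math.log(model[w]) for w in words)

model = {"the": 0.5, "cat": 0.3, "sat": 0.2}  # toy unigram "language"
# P("the cat" | M) = 0.5 * 0.3 = 0.15
```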
|
30
|
- Colored balls are randomly drawn from an urn (with replacement)
|
31
|
|
32
|
|
33
|
- Build a model for every document
- Rank document d based on P(M_d | q)
- Expand using Bayes’ Theorem
|
34
|
|
35
|
|
36
|
- How do we build a language model for a document?
|
37
|
- Simply count the frequencies in the document = maximum likelihood estimate
|
38
|
- Suppose some event is not in our observation S
- Model will assign zero probability to that event
|
39
|
|
40
|
- Assign some small probability to unseen events
- But remember to take away “probability mass” from other
events
- Some techniques are easily understood
- Add one to all the frequencies (including zero)
- More sophisticated methods improve ranking
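The "add one" idea (Laplace smoothing) can be sketched directly; the vocabulary set stands for whatever events the model must cover:

```python
from collections import Counter

def add_one_model(words, vocabulary):
    """Add one to every count, seen or unseen, then renormalize.

    Adding |vocabulary| to the denominator is what takes the
    'probability mass' away from the observed events.
    """
    counts = Counter(words)
    total = len(words) + len(vocabulary)
    return {w: (counts[w] + 1) / total for w in vocabulary}
```

Unseen words now get small but nonzero probability, so a single missing query term no longer zeroes out a whole document's score.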
|
41
|
- Indexing-time:
- Build a language model for every document
- Query-time Ranking
- Estimate the probability of generating the query according to each
model
- Rank the documents according to these probabilities
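Putting the two stages together, a toy query-likelihood ranker; add-one smoothing here stands in for the more sophisticated methods mentioned above:

```python
import math
from collections import Counter

def build_model(doc_words, vocabulary):
    # Indexing time: one add-one-smoothed unigram model per document
    counts = Counter(doc_words)
    total = len(doc_words) + len(vocabulary)
    return {w: (counts[w] + 1) / total for w in vocabulary}

def rank(query_words, docs):
    # Query time: score each document by log P(query | M_d), highest first
    vocab = {w for words in docs.values() for w in words} | set(query_words)
    scores = {name: sum(math.log(build_model(words, vocab)[w])
                        for w in query_words)
              for name, words in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```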
|
42
|
- Probabilistic methods formalize assumptions
- Binary relevance
- Document independence
- Term independence
- Uniform priors
- Top-down scan
- Natural framework for combining evidence
|
43
|
- Most of the assumptions are not satisfied!
- Searchers want utility, not relevance
- Relevance is not binary
- Terms are clearly not independent
- Documents are often not independent
- Smoothing techniques are somewhat ad hoc
|
44
|
- Ranked retrieval paradigm is powerful
- Well suited to human search strategies
- Probability theory has explanatory power
- At least we know where the weak spots are
- Probabilities are good for combining evidence
- Good implementations exist (e.g., Lemur)
- Effective, efficient, and large-scale
|
45
|
- Similar in some ways
- Term weights based on frequency
- Terms often used as if they were independent
- Different in others
- Based on probability rather than similarity
- Intuitions are probabilistic rather than geometric
|
46
|
- Perform an initial Boolean query
- Balancing breadth with understandability
- Rerank the results
- Using either Okapi or a language model
- Possibly also accounting for proximity, links, …
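A sketch of that two-stage pipeline; the `score` callable is an assumed stand-in for Okapi, a language model, or any richer scorer:

```python
def boolean_then_rerank(query_terms, docs, score):
    """Stage 1: Boolean AND filter (cheap, broad).
    Stage 2: rerank the survivors with a finer-grained scorer."""
    candidates = [d for d, words in docs.items()
                  if all(t in words for t in query_terms)]
    return sorted(candidates, key=lambda d: score(d, query_terms),
                  reverse=True)
```

The Boolean stage keeps the expensive scorer off documents that cannot match, at the cost of the breadth/understandability balance noted above.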
|
47
|
|
48
|
|
49
|
|
50
|
|
51
|
- Dynamic pages, generated from databases
- Not easily discovered using crawling
- Perhaps 400-500 times larger than surface Web
- Fastest growing source of new information
|
52
|
- 60 Deep Sites Exceed Surface Web by 40 Times
|
53
|
|
54
|
|
55
|
- RDF provides the schema for interchange
- Ontologies support automated inference
- Similar to thesauri supporting human reasoning
- Ontology mapping permits distributed creation
- This is where the magic happens ☺
|