1
|
- Week 11
- LBSC 690
- Information Technology
|
2
|
- Affordable storage
- Adequate backbone capacity
- 25,000 simultaneous transfers
- Adequate “last mile” bandwidth
- Display capability
- Effective search capabilities
|
3
|
- HTTP, HTML, or URL?
- Static, dynamic, or streaming?
- Public, protected, or internal?
- Content or behavior?
|
4
|
|
5
|
- OCLC counted any server at port 80
- Misses many servers at other ports
- Some servers host unrelated content
- Some content requires specialized servers
|
6
|
|
7
|
|
8
|
- Discovering “islands” and “peninsulas”
- Duplicate and near-duplicate content
- Server and network loads
- Dynamic content generation
- Link rot
- Temporary server interruptions
|
9
|
- Structural
- Identical directory structure (e.g., mirrors, aliases)
- Syntactic
- Identical bytes
- Identical markup (HTML, XML, …)
- Semantic
- Identical content
- Similar content (e.g., with a different banner ad)
- Related content (e.g., translated)
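A minimal sketch of how two of these categories might be detected, offered as an illustration rather than as the method the slides have in mind. Byte-identical copies are found with a hash; similar content (the banner ad case) is scored with word shingles and Jaccard overlap, which is an assumed technique not named in the source. The sample pages and the regex-based tag stripping are also assumptions.

```python
import hashlib
import re


def byte_fingerprint(data: bytes) -> str:
    """Digest of the raw bytes; equal digests indicate byte-identical copies."""
    return hashlib.sha256(data).hexdigest()


def strip_markup(html: str) -> str:
    """Crude tag removal (assumption: a real system would use an HTML parser)."""
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.lower().split())


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences taken from the text."""
    words = text.split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical pages: page_b is page_a plus a banner ad; mirror is an exact copy.
page_a = ("<html><body><p>Deep Web content is generated from databases "
          "on demand.</p></body></html>")
page_b = ("<html><body><div>AD: buy now</div><p>Deep Web content is generated "
          "from databases on demand.</p></body></html>")
mirror = page_a

same_bytes = byte_fingerprint(page_a.encode()) == byte_fingerprint(mirror.encode())
similarity = jaccard(shingles(strip_markup(page_a)), shingles(strip_markup(page_b)))
```

Here `same_bytes` is true for the mirror, while the page with the banner ad fails the byte test but still scores high on shingle overlap. Related content such as translations would need a different approach altogether.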
|
10
|
- Requires voluntary compliance by crawlers
- Exclusion by site
- Create a robots.txt file at the server’s top level
- Indicate which directories not to crawl
- Exclusion by document (in HTML head)
- Not implemented by all crawlers
- <meta name="robots" content="noindex,nofollow">
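Both exclusion mechanisms can be sketched from the crawler's side with Python's standard library. The robots.txt contents, the `ExampleBot` agent name, and the example.edu URLs below are made up for illustration; a well-behaved crawler would fetch the real file from the server's top level.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt listing directories that should not be crawled.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

robots = RobotFileParser()
robots.parse(robots_txt.splitlines())


def may_crawl(url: str, agent: str = "ExampleBot") -> bool:
    """Exclusion by site: ask the parsed robots.txt rules."""
    return robots.can_fetch(agent, url)


class RobotsMetaParser(HTMLParser):
    """Exclusion by document: collect directives from <meta name="robots">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives |= {d.strip().lower()
                                for d in content.split(",") if d.strip()}


page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
meta = RobotsMetaParser()
meta.feed(page)

allowed_home = may_crawl("http://www.example.edu/index.html")
allowed_private = may_crawl("http://www.example.edu/private/notes.html")
```

Nothing here is enforced by the server: a crawler that never runs such checks can still fetch every page, which is why compliance is voluntary.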
|
11
|
- alexa.com Web crawls since 1997
- Check out the CLIS Web site from 1998!
- Check out the history of your favorite site
|
12
|
- Can we save everything?
- Should we?
- Do people have a right to remove things?
|
13
|
- Dynamic pages, generated from databases
- Not easily discovered using crawling
- Perhaps 400-500 times larger than surface Web
- Fastest growing source of new information
|
14
|
|
15
|
|
16
|
- 60 Deep Sites Exceed Surface Web by 40 Times
|
17
|
|
18
|
|
19
|
|
20
|
|
21
|
|
22
|
|
23
|
|
24
|
|
25
|
|
26
|
|
27
|
- Speech is better for some things than writing
- Spoken bits are as persistent as written bits
- Storage costs 80 times more than text
- Disk cost falls by a factor of 80 in ~16 years
- If speech is searchable, we will keep lots of it
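The two factors of 80 can be turned into an annual rate with a few lines of arithmetic. This sketch assumes the decline is a steady exponential, which the slide does not state; under that assumption, storing speech costs about what storing the same words as text cost roughly 16 years earlier.

```python
import math

cost_factor = 80.0   # speech vs. text storage cost, and the disk cost drop
years = 16.0         # approximate time for disk cost to fall by that factor

# Assumption: cost shrinks by the same ratio every year.
annual_ratio = cost_factor ** (-1.0 / years)   # cost multiplier per year
annual_decline = 1.0 - annual_ratio            # fractional drop per year
halving_time = years * math.log(2) / math.log(cost_factor)

print(f"cost multiplier per year: {annual_ratio:.3f}")
print(f"decline per year: {annual_decline:.1%}")
print(f"years for cost to halve: {halving_time:.2f}")
```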
|
28
|
- Collectable spoken words ≈ 10 Tw/day
- 1 billion users * 100 words/min * 200 min/day / 2
- Compressed speech ≈ 2 words/kiloByte
- (100/60 w/sec) / (6.5 kb/sec / 8 b/B)
- Required storage ≈ 5 PetaBytes/day
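The estimate can be reproduced step by step. All input figures come from the bullets above; the divisor of 2 is kept as given, since its rationale is not stated. Words per kilobyte comes from dividing the speaking rate by the byte rate of the compressed audio.

```python
users = 1_000_000_000       # people whose speech could be collected
words_per_min = 100         # speaking rate
minutes_per_day = 200       # talk time per person per day

# Divisor of 2 is taken from the slide as-is.
words_per_day = users * words_per_min * minutes_per_day / 2

bitrate_kbit_per_sec = 6.5                      # compressed speech
kbytes_per_sec = bitrate_kbit_per_sec / 8       # 8 bits per byte
words_per_sec = words_per_min / 60
words_per_kbyte = words_per_sec / kbytes_per_sec

# Use the slide's rounded figure of 2 words per kilobyte for the total.
kbytes_per_day = words_per_day / 2
petabytes_per_day = kbytes_per_day * 1e3 / 1e15

print(f"{words_per_day:.0e} words/day")
print(f"{words_per_kbyte:.2f} words/kB")
print(f"{petabytes_per_day:.1f} PB/day")
```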
|
29
|
- Storage array sales > 5 PB/day
- 457 PB in 2Q 2005 (increasing 59% per year)
- $22/person/year (decreasing at 31%/year)
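A quick check of the sales comparison, assuming "2Q 2005" means the calendar quarter April through June (91 days); a vendor's fiscal quarter could differ. The projection simply compounds the stated 59% annual growth and is not a forecast from the source.

```python
quarter_sales_pb = 457            # storage array sales in 2Q 2005
days_in_quarter = 30 + 31 + 30    # April, May, June (assumed calendar quarter)
annual_growth = 0.59              # stated growth rate

sales_pb_per_day = quarter_sales_pb / days_in_quarter
required_pb_per_day = 5           # storage needed for collectable speech


def projected_sales_per_day(years_ahead: float) -> float:
    """Daily sales after compounding the stated growth for some years."""
    return sales_pb_per_day * (1 + annual_growth) ** years_ahead


print(f"{sales_pb_per_day:.2f} PB/day sold vs {required_pb_per_day} PB/day needed")
```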
|
30
|
|
31
|
- audio.search.yahoo.com
- blinkx.com
- ocw.mit.edu
- podcasts.net
|
32
|
|
33
|
|
34
|
|
35
|
|
36
|
- Browsing histories are easily captured
- Make all links initially point to a central site
- Encode the desired URL as a parameter
- Build a time-annotated transition graph for each user
- Cookies identify users (when they use the same machine)
- Redirect the browser to the desired page
- Reading time is correlated with interest
- Can be used to build individual profiles
- Used to target advertising by doubleclick.com
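The redirect technique in the bullets above can be sketched without a real web server. The tracker address, the `url` parameter name, and the user ID are hypothetical; in practice the ID would come from a cookie and the final step would be an HTTP redirect response.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlencode, urlparse

TRACKER = "http://tracker.example.com/go"   # hypothetical central site


def tracked_link(target_url: str) -> str:
    """Rewrite a link so it points at the tracker, with the real URL as a parameter."""
    return TRACKER + "?" + urlencode({"url": target_url})


# user id (from a cookie) -> list of (timestamp in seconds, visited URL)
histories = defaultdict(list)


def handle_click(request_url: str, cookie_user_id: str, timestamp: float) -> str:
    """Log the click, then return the URL the browser should be redirected to."""
    target = parse_qs(urlparse(request_url).query)["url"][0]
    histories[cookie_user_id].append((timestamp, target))
    return target


def transition_graph(user_id: str) -> list:
    """Edges (from_url, to_url, seconds between clicks) in time order."""
    visits = sorted(histories[user_id])
    return [(src, dst, t1 - t0)
            for (t0, src), (t1, dst) in zip(visits, visits[1:])]


# Hypothetical session: two clicks by the same user, 42 seconds apart.
first = handle_click(tracked_link("http://example.org/a?x=1&y=2"), "u1", 0.0)
second = handle_click(tracked_link("http://example.org/b"), "u1", 42.0)
edges = transition_graph("u1")
```

The time between consecutive clicks serves as a rough proxy for reading time on the first page, which is the signal a profile would be built from.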
|
37
|
|
38
|
- http://hannu.biz/aolsearch/
|
39
|
|
40
|
- Observe public behavior
- Hypertext linking, publication, citing, …
- Policy protection
- EU: Privacy laws
- US: Privacy policies + FTC enforcement
- Statistical assurance of privacy
- Distributed architecture
- Model and mitigate privacy risks
|
41
|
|
42
|
- User selects an article
- Interpretation: Summary was interesting
- User quickly prints the article
- Interpretation: They want to read it
- User selects a second article
- Interpretation: Another interesting summary
- User scrolls around in the article
- Interpretation: Parts with high dwell time and/or repeated revisits are interesting
- User stops scrolling for an extended period
- Interpretation: User was interrupted
|
43
|
|
44
|
- Protecting privacy
- What absolute assurances can we provide?
- How can we make remaining risks understood?
- Scalable rating servers
- Is a fully distributed architecture practical?
- Non-cooperative users
- How can the effect of spamming be limited?
|