1
|
- Week 11
- LBSC 690
- Information Technology
|
2
|
- Affordable storage
- Adequate backbone capacity
- 25,000 simultaneous transfers
- Adequate “last mile” bandwidth
- Display capability
- Effective search capabilities
|
3
|
- HTTP, HTML, or URL?
- Static, dynamic or streaming?
- Public, protected, or internal?
|
4
|
|
5
|
- OCLC counted any server at port 80
- Misses many servers at other ports
- Some servers host unrelated content
- Some content requires specialized servers
|
6
|
|
7
|
- Discovering “islands” and “peninsulas”
- Duplicate and near-duplicate content
- Server and network loads
- Dynamic content generation
- Link rot
- Temporary server interruptions
|
8
|
|
9
|
- Structural
  - Identical directory structure (e.g., mirrors, aliases)
- Syntactic
  - Identical bytes
  - Identical markup (HTML, XML, …)
- Semantic
  - Identical content
  - Similar content (e.g., with a different banner ad)
  - Related content (e.g., translated)
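The syntactic and "similar content" cases above can be illustrated with a small sketch. It is not a method from the slides: the function names, the crude tag stripping, the 3-word shingle size, and the two sample pages are all assumptions chosen for illustration. Byte-identical copies are caught by comparing hashes, while pages that differ only in a banner ad are compared by the overlap (Jaccard similarity) of their word shingles.

```python
import hashlib
import re


def byte_fingerprint(data: bytes) -> str:
    """Hash of the raw bytes: equal digests indicate byte-identical copies."""
    return hashlib.sha256(data).hexdigest()


def strip_markup(html: str) -> str:
    """Crudely drop tags and collapse whitespace (not a real HTML parser)."""
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.lower().split())


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences in the text."""
    words = text.split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical pages: same body text, different banner ad.
BODY = "<p>The crawler visits each page once and stores the text for the index.</p>"
page_a = "<html><body><p>Ad: buy now!</p>" + BODY + "</body></html>"
page_b = "<html><body><p>Ad: save big today!</p>" + BODY + "</body></html>"

same_bytes = byte_fingerprint(page_a.encode()) == byte_fingerprint(page_b.encode())
similarity = jaccard(shingles(strip_markup(page_a)), shingles(strip_markup(page_b)))

print("byte-identical:", same_bytes)
print("shingle similarity:", round(similarity, 2))
```

A crawler would typically treat pages above some similarity threshold as near-duplicates; the right threshold depends on the collection. Related content such as translations is not detectable this way, since it shares few or no word sequences.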
|
10
|
- Requires voluntary compliance by crawlers
- Exclusion by site
  - Create a robots.txt file at the server’s top level
  - Indicate which directories not to crawl
- Exclusion by document (in HTML head)
  - Not implemented by all crawlers
  - <meta name="robots" content="noindex,nofollow">
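Both exclusion mechanisms can be checked with Python's standard library, as in the sketch below. The robots.txt contents, the `ExampleBot` user agent, the example.com URLs, and the helper names `may_crawl` and `RobotsMetaParser` are hypothetical. Because compliance is voluntary, a well-behaved crawler has to run checks like these itself.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, as it might appear at a server's top level.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

site_rules = RobotFileParser()
site_rules.parse(ROBOTS_TXT.splitlines())


def may_crawl(url: str, agent: str = "ExampleBot") -> bool:
    """Exclusion by site: is this URL allowed for the given user agent?"""
    return site_rules.can_fetch(agent, url)


class RobotsMetaParser(HTMLParser):
    """Exclusion by document: collect directives from <meta name="robots">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
meta = RobotsMetaParser()
meta.feed(page)

print(may_crawl("http://example.com/index.html"))
print(may_crawl("http://example.com/private/notes.html"))
print(sorted(meta.directives))
```

A crawler that honors both would skip fetching disallowed URLs, leave `noindex` pages out of its index, and not follow links from `nofollow` pages.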
|
11
|
- alexa.com Web crawls since 1997
- Check out Maryland’s Web site in 1997
- Check out the history of your favorite site
|
12
|
- Can we save everything?
- Should we?
- Do people have a right to remove things?
|
13
|
- Dynamic pages, generated from databases
- Not easily discovered using crawling
- Perhaps 400-500 times larger than surface Web
- Fastest growing source of new information
|
14
|
|
15
|
|
16
|
- 60 Deep Sites Exceed Surface Web by 40 Times
|
17
|
|
18
|
|
19
|
|
20
|
|
21
|
|
22
|
|
23
|
|
24
|
|
25
|
|
26
|
|
27
|
- Speech is better for some things than writing
- Spoken bits are as persistent as written bits
- Storage cost is 80 times that of text
- Disk cost falls by a factor of 80 in ~16 years
- If speech is searchable, we will keep lots of it
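As a rough check on the cost trend above, the sketch below converts "a factor of 80 in about 16 years" into an implied annual rate. It assumes a constant yearly decline, which the slide does not state, and the variable names are illustrative.

```python
# Figures from the slide.
cost_ratio = 80   # disk cost falls by this factor
years = 16        # over roughly this many years

# Assumption: the decline is the same every year (constant geometric rate).
annual_factor = cost_ratio ** (1 / years)   # cost shrinks by this factor per year
annual_decline = 1 - 1 / annual_factor      # fraction of cost lost per year

print(f"cost divides by about {annual_factor:.2f} each year")
print(f"that is a decline of about {annual_decline:.0%} per year")
```

Under that assumption, storing speech in about 16 years would cost roughly what storing the equivalent text costs today.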
|
28
|
- Collectable spoken words ≈ 10 Tw/day
  - 1 billion users * 100 words/min * 200 min/day / 2
- Compressed speech ≈ 2 words/kiloByte
  - (100/60 w/sec) / (6.5 kb/sec / 8 b/B)
- Required storage ≈ 5 PetaBytes/day
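The estimate above can be reproduced step by step. The sketch uses only the slide's figures; the variable names are illustrative, and decimal units (1 kB = 1,000 bytes, 1 PB = 10^15 bytes) are an assumption. Words per kilobyte is the speaking rate in words per second divided by the compressed data rate in kilobytes per second.

```python
# Figures from the slide.
users = 1_000_000_000
words_per_minute = 100
minutes_per_day = 200
codec_kilobits_per_second = 6.5

# Collectable spoken words per day (the division by 2 is taken from the slide as-is).
words_per_day = users * words_per_minute * minutes_per_day / 2

# Words per kilobyte = (words/sec) / (kilobytes/sec).
words_per_second = words_per_minute / 60
kilobytes_per_second = codec_kilobits_per_second / 8
words_per_kilobyte = words_per_second / kilobytes_per_second

# Required storage, assuming decimal units.
kilobytes_per_day = words_per_day / words_per_kilobyte
petabytes_per_day = kilobytes_per_day * 1_000 / 1e15

print(f"words/day:      {words_per_day:.3g}")
print(f"words/kilobyte: {words_per_kilobyte:.2f}")
print(f"PB/day:         {petabytes_per_day:.2f}")
```

The unrounded result is slightly under 5 PB/day because the slide rounds the compression figure down to 2 words per kilobyte.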
|
29
|
- Collectable spoken words ≈ 10 Tw/day
  - 1 billion users * 100 words/min * 200 min/day / 2
- Compressed speech ≈ 2 words/kiloByte
  - (100/60 w/sec) / (6.5 kb/sec / 8 b/B)
- Required storage ≈ 5 PetaBytes/day
- Storage array sales > 5 PB/day
  - 457 PB in 2Q 2005 (increasing 59% per year)
  - $22/person/year (decreasing at 31%/year)
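The sales comparison can be checked the same way. The sketch below turns the quarterly figure into a daily rate and projects both trends forward. The 91-day quarter (April through June) and the assumption that both rates stay constant are mine, not the slide's, and the function names are illustrative.

```python
# Figures from the slide.
quarter_sales_pb = 457       # storage array sales in 2Q 2005
sales_growth = 0.59          # increase per year
cost_per_person = 22.0       # dollars per person per year
cost_decline = 0.31          # decrease per year

days_in_quarter = 91         # April + May + June
sales_pb_per_day = quarter_sales_pb / days_in_quarter


def projected_sales(years: float) -> float:
    """Daily sales in PB after `years`, assuming constant 59%/year growth."""
    return sales_pb_per_day * (1 + sales_growth) ** years


def projected_cost(years: float) -> float:
    """Dollars/person/year after `years`, assuming a constant 31%/year decline."""
    return cost_per_person * (1 - cost_decline) ** years


print(f"sales in 2Q 2005: {sales_pb_per_day:.2f} PB/day")
for year in (1, 3, 5):
    print(f"+{year} yr: {projected_sales(year):.1f} PB/day, "
          f"${projected_cost(year):.2f}/person/year")
```

The daily rate comes out just above 5 PB, which matches the slide's claim that sales already exceed the estimated storage requirement.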
|
30
|
|
31
|
- singingfish.com
- blinkx.com
- ocw.mit.edu
- podcasts.yahoo.com
|