1
|
- Week 9
- LBSC 690
- Information Technology
|
2
|
- What is the Web?
- What’s on the Web?
- What is the nature of the Web?
- Preserving the Web
|
3
|
- HTTP, HTML, or URL?
- Static, dynamic, or streaming?
- Public, protected, or internal?
|
4
|
- Affordable storage
- Adequate backbone capacity
- 25,000 simultaneous transfers
- Adequate “last mile” bandwidth
- Display capability
- Effective search capabilities
- Lycos (now Google), Yahoo
|
5
|
- Over one billion pages by 1999
- Growing at 25% per month!
- Google indexed about 3 billion pages in 2003
- Unstable
- Redundant
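
As a rough check of what 25% monthly growth implies, the sketch below compounds that rate. It assumes the rate stays constant, which the slide does not claim; the function names are illustrative only.

```python
import math


def doubling_time_months(monthly_rate: float) -> float:
    """Months needed to double in size at a constant compound monthly rate."""
    return math.log(2) / math.log(1 + monthly_rate)


def growth_factor(monthly_rate: float, months: int) -> float:
    """Total multiplication in size after the given number of months."""
    return (1 + monthly_rate) ** months


if __name__ == "__main__":
    rate = 0.25  # 25% per month, the figure quoted on the slide
    print(f"Doubling time: {doubling_time_months(rate):.1f} months")
    print(f"Growth over one year: {growth_factor(rate, 12):.1f}x")
```

Under that assumption the Web would double roughly every three months and grow about 14.6-fold in a year, which is why such a rate cannot hold for long.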
|
6
|
|
7
|
|
8
|
|
9
|
- OCLC counts any server at port 80
- Misses many servers at other ports
- Some servers host unrelated content
- Some content requires specialized servers
|
10
|
|
11
|
|
12
|
|
13
|
|
14
|
|
15
|
|
16
|
|
17
|
|
18
|
- SingingFish indexes 35 million streams
- 60% of queries are for music
- Then movies
- Then sports
- Then news
|
19
|
|
20
|
- Temporary server interruptions
- Discovering “islands” and “peninsulas”
- Duplicate and near-duplicate content
- Dynamic content
- Link rot
- Server and network loads
- Have I seen this page before?
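
One simple way a crawler might answer "Have I seen this page before?" is to normalize each URL and keep a set of the results. This is a minimal sketch, not a description of any particular crawler; `normalize_url` and `SeenUrls` are hypothetical names, and the normalization rules shown (lowercase host, drop default port and fragment) are only a small, assumed subset of what real crawlers apply.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource named.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Reduce trivially different spellings of a URL to one form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{port}"
    # An empty path and "/" name the same page.
    path = parts.path or "/"
    # Drop the fragment: it names a position within a page, not another page.
    return urlunsplit((scheme, host, path, parts.query, ""))


class SeenUrls:
    """Remembers which normalized URLs have already been visited."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def first_visit(self, url: str) -> bool:
        """Return True the first time a URL is offered, False afterwards."""
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

This only catches the same page reached under different URL spellings; identical content served from different URLs needs a content-based check.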
|
21
|
- Structural
- Identical directory structure (e.g., mirrors, aliases)
- Syntactic
- Identical bytes
- Identical markup (HTML, XML, …)
- Semantic
- Identical content
- Similar content (e.g., with a different banner ad)
- Related content (e.g., translated)
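
Two of the categories above lend themselves to a short illustration: "identical bytes" can be tested with a hash, and "similar content" can be approximated by comparing overlapping word sequences (shingles). The sketch below is one common approach, not necessarily the one the lecture has in mind; the function names and the shingle size of 3 are assumptions.

```python
import hashlib


def byte_fingerprint(data: bytes) -> str:
    """Identical bytes always give an identical SHA-256 digest."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of all runs of k consecutive words in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Overlap of two sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two hypothetical pages that differ only in their trailing ad text.
    page_a = "the web is large unstable and redundant buy our product now"
    page_b = "the web is large unstable and redundant visit our sponsor today"
    same_bytes = byte_fingerprint(page_a.encode()) == byte_fingerprint(page_b.encode())
    print("Identical bytes:", same_bytes)
    print("Shingle similarity:", jaccard(shingles(page_a), shingles(page_b)))
```

A crawler would pick a similarity threshold above which two pages count as near-duplicates; the right threshold depends on the collection. Structural duplicates and related content such as translations are not addressed by this sketch.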
|
22
|
- Based on voluntary compliance by crawlers
- Exclusion by site
- Create a robots.txt file at the server’s top level
- Indicate which directories not to crawl
- Exclusion by document (in HTML head)
- Not implemented by all crawlers
- <meta name="robots" content="noindex,nofollow">
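
A compliant crawler checks the site's exclusion rules before requesting a page. The sketch below uses Python's standard `urllib.robotparser` module; the robots.txt content, the `ExampleBot` user agent, and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch the file
# from the server's top level, e.g. with set_url() followed by read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

if __name__ == "__main__":
    for url in ("http://example.com/public/index.html",
                "http://example.com/private/data.html"):
        verdict = "crawl" if parser.can_fetch("ExampleBot", url) else "skip"
        print(verdict, url)
```

Nothing enforces the result: the file only tells a crawler what the site owner asks, and a crawler that ignores it can still fetch the pages.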
|
23
|
|
24
|
- Dynamic pages, generated from databases
- Not easily discovered using crawling
- Perhaps 400-500 times larger than surface Web
- Fastest growing source of new information
|
25
|
|
26
|
- 60 Deep Sites Exceed Surface Web by 40 Times
|
27
|
- Internet Archive
- Stored Alexa.com Web crawls since 1997
- http://archive.org
- Check out Maryland’s Web site in 1997
- Check out the history of your favorite site
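
To look up an old version of a site, the Wayback Machine accepts a timestamp and the original address in its URL and serves the nearest capture it holds. The sketch below only builds the addresses; it assumes the `web.archive.org/web/<timestamp>/<url>` form and the `archive.org/wayback/available` lookup endpoint as publicly documented by the Internet Archive, and example.com is a placeholder for whichever site you want to check.

```python
from urllib.parse import urlencode


def wayback_url(page_url: str, timestamp: str) -> str:
    """Address of the capture closest to timestamp (YYYYMMDD or longer)."""
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


def availability_query(page_url: str, timestamp: str) -> str:
    """Address that asks whether a capture exists near the timestamp."""
    query = urlencode({"url": page_url, "timestamp": timestamp})
    return f"https://archive.org/wayback/available?{query}"


if __name__ == "__main__":
    # Placeholder site; substitute the one you want to see as of 1997.
    print(wayback_url("http://example.com/", "19970101"))
    print(availability_query("example.com", "19970101"))
```

Opening the first address in a browser shows the capture nearest the requested date, which may be months away from it if the site was crawled rarely.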
|
28
|
- Can we save everything?
- Should we?
- Do people have a right to remove things?
|