Fall 2005 Final Exam Answers
LBSC 690 Section 0301

Question 1:

1-a answer is i.

1-b answer is iii. (actual answer is about 8.3 years)

1-c answer is iii.

1-d answer is ii.

Question 2:

The Internet Archive obtains the pages it stores from Web crawl
results donated by search engines (most notably, Alexa.com).  A
particular version of a page may be missing for any of the following
reasons:

- It may have been dynamically generated in response to a query
- It may have changed more than once between two crawls
- It may have changed before the Archive's first crawl (in 1997)
- Crawling may have been precluded by the robot exclusion protocol
- Crawling may have been precluded by password protection
- Crawling may have been precluded by a temporary server interruption
- Crawling may not have discovered the page
- The page may have been deleted by the crawler as a (near-)duplicate
- The archive may not retain the type of content on the page by policy
- Someone may have requested that the page be removed from the archive
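
As an illustration of the robot exclusion reason above, the following
minimal sketch shows how a well-behaved crawler might decide whether
it may fetch a page.  It uses Python's standard urllib.robotparser
module.  The robots.txt content, the crawler name, and the URLs are
made up for illustration and do not describe how any particular
crawler actually works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the lines of a robots.txt file that was already retrieved.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "example-crawler" and the example.com URLs are assumed names.
    for url in ("http://example.com/index.html",
                "http://example.com/private/notes.html"):
        print(url, may_crawl(ROBOTS_TXT, "example-crawler", url))
```

A page under /private/ would never be fetched by a crawler that
honors these rules, so no version of it would reach the archive.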

Question 3: 

Cookies are a threat to privacy because they make it possible to
associate actions taken using a Web browser during one session with
actions taken during earlier sessions.  One possible technical means
of addressing that threat is to install software that automatically
deletes cookies after each session.  A limitation of that solution is
that the user must specifically authorize cookie retention for any
case in which they wish to take advantage of functions that involve
storing information from one session to another (e.g., so that
weather.com knows your location without being told each time).  A
policy option would be to pass a law limiting the conditions under
which information obtained from Web browsers could be combined with
other information.  A possible objection to that policy option is that
it would limit opportunities for commercial activities that could
provide value to citizens (such as targeting advertisements to people
who might wish to receive them based on their interests).  In this
case, the technical means are reasonably effective and the policy
option is somewhat onerous, so it would be reasonable to advocate
technical means of addressing the concern over this particular policy
option.
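
The linking of sessions described above can be sketched with a toy
model.  This is only an illustration under simplifying assumptions:
TrackingServer, Browser, and profiles are hypothetical names, the
"cookie" is just a visitor number, and no real HTTP traffic is
involved.

```python
import itertools


class TrackingServer:
    """Toy server that logs actions under the visitor ID carried in a cookie."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.log = {}  # visitor ID -> list of actions seen under that ID

    def handle(self, cookie, action):
        # A request without a cookie looks like a brand-new visitor.
        if cookie is None:
            cookie = next(self._next_id)
        self.log.setdefault(cookie, []).append(action)
        return cookie  # handed back to the browser to store


class Browser:
    """Toy browser that optionally deletes its cookie after each session."""

    def __init__(self, delete_cookies_after_session):
        self.cookie = None
        self.delete_cookies_after_session = delete_cookies_after_session

    def session(self, server, actions):
        for action in actions:
            self.cookie = server.handle(self.cookie, action)
        if self.delete_cookies_after_session:
            self.cookie = None


def profiles(delete_cookies_after_session):
    """Run two sessions and return the action lists the server could build."""
    server = TrackingServer()
    browser = Browser(delete_cookies_after_session)
    browser.session(server, ["read weather"])
    browser.session(server, ["buy umbrella"])
    return list(server.log.values())


if __name__ == "__main__":
    print(profiles(delete_cookies_after_session=False))
    print(profiles(delete_cookies_after_session=True))
```

With cookies retained, the server sees one visitor whose history spans
both sessions.  With cookies deleted after each session, it sees two
apparently unrelated visitors, which is also why a site could no
longer remember a stored location between visits.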

Question 4:

An example of a good answer to this question is available in a
separate file.

Question 5:

Before the query is received, search engines crawl the Web and then
build an "inverted index" that is arranged for efficient term-based
access in which the key is a query term and the result is a list of
pages on which that term appears, along with scores that indicate the
weight that term has in describing that page (computed based on how
often the term appears, how common the term is, how authoritative the
page is, etc.).  When the query is run, it is first compared against
cached results from recent popular queries -- if a match is found,
then network latency is the only significant delay.  If no results are
found in the cache (which is actually in main memory in this case, not
in some hyper-expensive "cache" memory), then there is a delay for
disk reads (at perhaps 10 ms per query term) and for network latency
(which may be faster or slower than the disk latency, depending on the
speed of your connection and on network congestion at the time).  In
no case is the speed of the processor a significant factor in the
response time to a query since the disk and the network are much, much
slower than the processor.
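
The index-then-cache flow described above can be sketched as follows.
This is a minimal illustration, not how any real search engine is
implemented: the PAGES data and all function and class names are
assumptions, the weight is a simple count of how often the term
appears (ignoring how common the term is and how authoritative the
page is), and everything is held in memory rather than on disk.

```python
from collections import defaultdict

# Hypothetical crawl results (page name -> page text), for illustration only.
PAGES = {
    "a.html": "digital library search",
    "b.html": "search engine index",
    "c.html": "library catalog",
}


def build_inverted_index(pages):
    """Map each term to {page: weight}; here the weight is a raw term count."""
    index = defaultdict(dict)
    for page, text in pages.items():
        for term in text.lower().split():
            index[term][page] = index[term].get(page, 0) + 1
    return index


class SearchEngine:
    def __init__(self, pages):
        self.index = build_inverted_index(pages)  # built before any query arrives
        self.cache = {}          # normalized query -> ranked result list
        self.index_lookups = 0   # stands in for disk reads, one per query term

    def search(self, query):
        key = " ".join(query.lower().split())
        if key in self.cache:
            # Cache hit: no index access is needed at all.
            return self.cache[key]
        scores = defaultdict(int)
        for term in key.split():
            self.index_lookups += 1
            for page, weight in self.index.get(term, {}).items():
                scores[page] += weight
        # Highest score first; ties are broken by page name.
        results = sorted(scores, key=lambda page: (-scores[page], page))
        self.cache[key] = results
        return results


if __name__ == "__main__":
    engine = SearchEngine(PAGES)
    print(engine.search("library search"), engine.index_lookups)
    print(engine.search("library search"), engine.index_lookups)
```

The first call performs one index lookup per query term, and the
repeated call is answered from the cache without touching the index,
so the lookup counter does not change.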