Fall 2005 Final Exam Answers
LBSC 690 Section 0301

Question 1:
1-a answer is i.
1-b answer is iii. (actual answer is about 8.3 years)
1-c answer is iii.
1-d answer is ii.

Question 2:
The Internet Archive obtains the pages it stores from donated Web crawl results from search engines (most notably, Alexa.com). A particular version of a page may be missing for any of the following reasons:
- It may have been dynamically generated in response to a query
- It may have changed more than once between two crawls
- It may have changed before the Archive's first crawl (in 1997)
- Crawling may have been precluded by the robot exclusion protocol
- Crawling may have been precluded by password protection
- Crawling may have been precluded by a temporary server interruption
- Crawling may not have discovered the page
- The page may have been deleted by the crawler as a (near-)duplicate
- The archive may not retain the type of content on the page by policy
- Someone may have requested that the page be removed from the archive

Question 3:
Cookies are a threat to privacy because they make it possible to associate actions taken using a Web browser during one session with actions taken during earlier sessions. One possible technical means of addressing that threat is to install software that automatically deletes cookies after each session. A limitation to that solution is that the user must specifically authorize cookie retention for any case in which they wish to take advantage of functions that involve storing information from one session to another (e.g., so that weather.com knows your location without being told each time). A policy option would be to pass a law limiting the conditions under which information obtained from Web browsers could be combined with other information.
A possible objection to that policy option is that it would limit opportunities for commercial activities that could provide value to citizens (such as targeting advertisements to people who might wish to receive them based on their interests). In this case, the technical means are reasonably effective and the policy option is somewhat onerous, so it would be reasonable to advocate technical means of addressing the concern over this particular policy option.

Question 4:
An example of a good answer to this question is available in a separate file.

Question 5:
Before the query is received, search engines crawl the Web and then build an "inverted index" that is arranged for efficient term-based access, in which the key is a query term and the result is a list of pages on which that term appears, along with scores that indicate the weight that term has in describing that page (computed based on how often the term appears, how common the term is, how authoritative the page is, etc.). When the query is run, it is first compared against cached results from recent popular queries; if a match is found, then network latency is the only significant delay. If no results are found in the cache (which is actually in main memory in this case, not in some hyper-expensive "cache" memory), then there is a delay for disk reads (at perhaps 10 ms per query term) and for network latency (which may be faster or slower than the disk latency, depending on the speed of your connection and on network congestion at the time). In no case is the speed of the processor a significant factor in the response time to a query, since the disk and the network are far slower than the processor.
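
The index-then-cache flow described in Question 5 can be illustrated with a small sketch. This is only a toy under stated assumptions, not how any real search engine is implemented: the names build_inverted_index and SearchEngine are hypothetical, the weight is a simple term-frequency times inverse-document-frequency score (page authority is left out), and both the index and the cache live in an ordinary in-memory dictionary, so the disk and network delays discussed above are not modeled.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(pages):
    """Map each term to a list of (page_id, weight) postings.

    Assumption: weight = term frequency * log(1 + N / document frequency),
    so terms that are frequent on a page and rare overall score higher.
    """
    doc_freq = Counter()   # number of pages containing each term
    term_counts = {}       # per-page term frequencies
    for page_id, text in pages.items():
        counts = Counter(text.lower().split())
        term_counts[page_id] = counts
        doc_freq.update(counts.keys())

    n_pages = len(pages)
    index = defaultdict(list)
    for page_id, counts in term_counts.items():
        for term, tf in counts.items():
            idf = math.log(1 + n_pages / doc_freq[term])
            index[term].append((page_id, tf * idf))
    return index


class SearchEngine:
    """Hypothetical toy engine: inverted index plus a cache of past queries."""

    def __init__(self, pages):
        self.index = build_inverted_index(pages)  # built before any query
        self.cache = {}                           # query -> ranked page ids
        self.cache_hits = 0

    def search(self, query):
        # Normalize so that word order and case do not defeat the cache.
        key = " ".join(sorted(query.lower().split()))
        if key in self.cache:
            self.cache_hits += 1
            return self.cache[key]

        # Cache miss: one postings-list lookup per query term.
        scores = defaultdict(float)
        for term in key.split():
            for page_id, weight in self.index.get(term, []):
                scores[page_id] += weight

        # Highest score first; ties broken by page id for determinism.
        ranked = sorted(scores, key=lambda p: (-scores[p], p))
        self.cache[key] = ranked
        return ranked


pages = {
    "a": "weather forecast for college park",
    "b": "college football scores",
    "c": "weather weather radar",
}
engine = SearchEngine(pages)
print(engine.search("weather"))  # ['c', 'a'] (computed from the index)
print(engine.search("weather"))  # ['c', 'a'] (served from the cache)
print(engine.cache_hits)         # 1
```

In this sketch the expensive work (scanning every page) happens once, when the index is built. Each query then touches only the postings lists for its own terms, which mirrors the "one disk read per query term" cost estimate in the answer.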