WWW 2008 / Poster Paper, April 21-25, 2008 · Beijing, China

Microscale Evolution of Web Pages

Carrie Grimes, Google, 1600 Amphitheatre Pkwy, Mountain View, CA 94043, cgrimes@google.com
Sean O'Brien, Google, 1600 Amphitheatre Pkwy, Mountain View, CA 94043, sobrien@google.com

ABSTRACT
We track a large set of "rapidly" changing web pages and examine the assumption that the arrival of content changes follows a Poisson process on a microscale. We demonstrate that there are significant differences in the behavior of pages that can be exploited to maintain freshness in a web corpus.

Categories and Subject Descriptors: H.4.m [Information Systems]: Miscellaneous; H.3.3 [Information Search and Retrieval]: Search Process
General Terms: Measurement, Experimentation, Algorithms
Keywords: Web evolution, rate of change, change detection

Copyright is held by the author/owner(s). WWW 2008, April 21-25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04.

1. INTRODUCTION
Search engines crawl the web to download a corpus of web pages to index for user queries. One of the most efficient ways to maintain an up-to-date corpus of web pages for a search engine to index is to re-crawl pages preferentially based on their rate of content update [3]. Much of the existing work on estimating expected rates of change has been based on the assumption that changes arrive as a Poisson process, and that the average rate of change can be estimated under this model. Given the rate at which rapidly changing pages need to be crawled, these pages carry high costs both in crawling resources and in corpus consistency with the web. In this paper, we ask whether the Poisson model truly represents the changes for rapidly updating pages, and if not, what can be gained by better understanding the real structure of page changes in terms of acquiring fresh content.

2. METHODOLOGY
Definitions: For each page, the number of updates that occur in a single hour, X, is distributed as a Poisson distribution with parameter 1/λ. The average time between changes is λ, and the time between page updates is distributed exponentially with parameter 1/λ.

Defining a 'Change': We employ a hashing technique, outlined by Charikar [1], which creates a fingerprint-like representation of the page but has the unique benefit that pages with similar content have similar hash values. Distance between hashes can be measured by the number of bits in which they differ; for this study we consider versions of a page whose hashes differ in 6 or more bits to be changed.
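As a concrete illustration of this change test, the sketch below computes a Charikar-style similarity hash over whitespace tokens and flags a change when two hashes differ in 6 or more bits. It is a minimal stand-in rather than the fingerprinting used in our crawl pipeline: the 64-bit width, the MD5-based token hashing, and the unweighted token features are assumptions made only for this example.

```python
import hashlib

def simhash(text, bits=64):
    """Charikar-style similarity hash: pages with similar token content
    receive hashes that differ in only a few bits."""
    weights = [0] * bits
    for token in text.split():
        # Stable per-token hash; MD5 truncated to `bits` bits is used here
        # purely for illustration.
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Each output bit takes the sign of the accumulated weight.
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def changed(prev_hash, curr_hash, threshold=6):
    """Apply the 6-bit rule: a change is a Hamming distance of >= threshold bits."""
    return bin(prev_hash ^ curr_hash).count("1") >= threshold

# Two successive crawls of a hypothetical page that differ only in a timestamp:
crawl_1 = " ".join(f"token{i}" for i in range(500)) + " updated 09:00"
crawl_2 = " ".join(f"token{i}" for i in range(500)) + " updated 10:00"
h1, h2 = simhash(crawl_1), simhash(crawl_2)
print(bin(h1 ^ h2).count("1"), changed(h1, h2))  # small edit -> few bits differ
```

Because a real page contributes many distinct tokens, a small edit typically flips only a handful of bits, which is what makes a fixed 6-bit threshold a workable change test.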
Computing Rates of Change: Given a history of reported "changes," measured on any regular interval of length C, the simple estimator is to divide the total time observed by the total number of changes observed. That simple estimator of the rate of change, λ̂, is λ̂ = T/X, where T = total time and X = total number of changes. However, if the time between crawls of the page is remotely similar to the rate of change of the page, this estimate is significantly asymptotically biased. If more than one change occurs during an interval of length C, the crawler will only observe a single change. As a result, if C is too large compared to λ, no matter how many observations are taken, the estimate will always overestimate the length of time between changes.

Cho and Garcia-Molina [2, Section 4.2] reduce the asymptotic bias of the simple estimator by subtracting off the expected bias and making a small modification to remove singularities. The modified estimator,

    λ̂* = -1 / log((T - X + 0.5) / (T + 0.5)),    (1)

is demonstrated to have significantly better bias, even for small numbers of observations (N < 50), especially when C/λ is in the neighborhood of 0.5-2.0. Although our page crawl interval granularity will be quite small compared to the total space of possible rates of change, the pages we are examining have rates of change on the order of 1-2 days, and therefore the shrinkage of the estimator given by Cho and Garcia-Molina makes a critical difference in the values. However, if the samples are significantly non-Poisson, the asymptotic results for this estimator do not apply. For this reason, we will compute both λ̂ and the modified estimator λ̂*.
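For reference, both estimators can be written down directly from the definitions above. The sketch below is ours, assuming T counts the hour-long observation intervals and X the number of intervals in which a change was detected; the function names are illustrative.

```python
import math

def simple_lambda(T, X):
    """Simple estimator of the average time between changes:
    total observed time divided by the number of detected changes."""
    return T / X

def modified_lambda(T, X):
    """Bias-reduced estimator of equation (1), after Cho and Garcia-Molina [2]:
    -1 / log((T - X + 0.5) / (T + 0.5))."""
    return -1.0 / math.log((T - X + 0.5) / (T + 0.5))

# Toy check: 500 hourly crawls of a page whose true mean time between changes
# is 2 hours. About 1 - exp(-1/2) of the hourly intervals show a change, so the
# simple estimator overstates the 2-hour gap while the modified one stays close.
T = 500
X = round(T * (1 - math.exp(-0.5)))  # expected number of intervals with a detected change
print(simple_lambda(T, X))           # ~2.54 hours
print(modified_lambda(T, X))         # ~2.0 hours
```

On this toy input the simple estimate is pulled upward by the hours in which more than one change occurred, while the modified estimate recovers the 2-hour mean almost exactly, mirroring the bias discussion above.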
Sampling Web Pages: We use a multi-stage process to generate our sample. We first selected hosts according to the number of URLs they had in a feed from the Google crawl containing over 1B documents. From each host we sampled a number of URLs, crawled them twice daily, and kept only the URLs with an average rate of change of less than 48 hours. This left us with 29,000 URLs, which we crawled every hour. Of those, 80% had at least 500 consecutive successful fetches. This is the set of URLs we will examine and refer to as the rapid change sample. Every time a page is accessed in our sample, we compute a hash of the page's content and consider the page changed if the current hash differs from the previous one by 6 or more bits.

3. DISTRIBUTION OF CONTENT CHANGES
Content Changerate Profiles: We begin by examining the overall distribution of rates of change given by this sample. In our data, only a very small (< 5%) portion of the sample changes by 6 bits or more every time we access it. However, applying the modified estimator in (1), we estimate that up to 25% of our rapidly changing sample has an average rate of change of less than or equal to one hour. Over 50% of pages in the sample have an estimated rate of change of less than 4 hours. The primary difference between the simple estimator (λ̂) and the modified estimator (λ̂*) occurs in the fastest-changing bins, because those are the most sensitive to the remaining censoring introduced by our hour-granularity sample.

Pages with Regular Updates: Intuitively, it would seem that many pages should show much more regular behavior than is dictated by the Poisson model, due to automated updating of the sites on an hourly or daily basis. Figure 1 is a heatmap of all observed between-change intervals, plotted by the overall average rate of change observed for the page. The high-frequency bins are concentrated around the fastest-changing pages at the lower left corner. However, there is an additional bright spot at 24-hour observed intervals: there are significant numbers of pages with an average rate of change near 24 hours and a large number of changes at exactly 24 hours. Figure 2 illustrates this effect specifically for pages with λ̂ = 24. The points plotted in Figure 2 show the observed number of change intervals of a given length, and the lighter line shows the predicted number of change intervals of each length from a model where observed times between changes are distributed Exp(1/24), as implied by a Poisson model.

Figure 1: Heatmap of observed interval frequencies given the average observed rate of change of the page. Red = low intensity, yellow = high intensity.
Figure 2: Comparison of the observed interval frequencies for pages with λ̂ = 24 with the number predicted by an exponentially-distributed waiting time between updates.

Temporal Effects in Content Updates: If humans, or machines set up by humans, are the cause of page updates, those updates should have a clear temporal association. Within our rapid change sample, we divided the pages into region associations based on the top-level domain of the page. The data was aggregated into two major groups: CJK = {.cn, .jp, .kr} pages and FIGS = {.fr, .it, .de, .es} pages. Plotting the arrival times of observed updates over the week (days measured with respect to Pacific Standard Time) in Figure 3, we see that there is a significant decrease in the probability of a page change between local daytime and local nighttime, and an even more significant decrease in update volume on the weekend. The graph is smoothed with a 5-hour window to reduce spikiness. This graph suggests that resources for refreshing pages should be prioritized to occur directly after the highest local page update volume.

Figure 3: Arrival times of page updates over one week, with respect to PST clock times.

4. CONCLUSIONS
The case for an aggregate Poisson model for these fast-changing pages is somewhat inconclusive: relatively few pages in our sample were strictly consistent with a Poisson model, but only a small portion differ significantly. We do show several effects that can be exploited to improve refresh performance on fast-changing pages: change volume depends on local time and day of the week, and fast-updating pages bias toward hourly and daily changes. We did not analyze the quality of changes seen on the URLs in our sample. One dominating question in this work is whether the large component (nearly 20%) of pages that are more consistent with a 1-hour regular change pattern than with a Poisson process are updating content useful to the user.

5. REFERENCES
[1] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC '02: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pages 380-388, New York, NY, USA, 2002. ACM Press.
[2] J. Cho and H. Garcia-Molina. Estimating frequency of change. ACM Trans. Inter. Tech., 3(3):256-290, 2003.
[3] J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1-7):161-172, 1998.