WWW 2007 Poster Paper, Topic: Search

GigaHash: Scalable Minimal Perfect Hashing for Billions of URLs

Kumar Chellapilla, Anton Mityagin, Denis Charles
Microsoft Live Labs, One Microsoft Way, Redmond, WA 98052 USA
kumarc@microsoft.com, mityagin@microsoft.com, cdx@microsoft.com

ABSTRACT
A minimal perfect hash function maps a static set of n keys onto the range of integers {0, 1, 2, ..., n-1}. We present a scalable, high performance algorithm based on random graphs for constructing minimal perfect hash functions (MPHFs). For a set of n keys, our algorithm outputs a description of an MPHF h in expected time O(n). Evaluating h(x) requires three memory accesses for any key x, and the description of h takes up 0.89 bytes per key (7.13 bits per key). This is the most space efficient result known to date. Using a simple heuristic and Huffman coding, the space requirement is further reduced to 0.79 bytes per key (6.86 bits per key). We present a high performance architecture that is easy to parallelize and scales well to the very large data sets encountered in internet search applications. Experimental results on a one billion URL dataset obtained from Live Search crawl data show that the proposed algorithm (a) finds an MPHF for one billion URLs in less than 4 minutes, and (b) requires only 6.86 bits/key for the description of h.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; G.2.2 [Discrete Mathematics]: Graph Theory - Graph algorithms

General Terms
Algorithms, Experimentation, Performance.

Keywords
Minimal perfect hashing, perfect hash function, Web search engine, space efficient hash table

Copyright is held by the author/owner(s). WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.

1. INTRODUCTION
A hash function maps elements from an input space to a finite range of integers. Typically, the range of the hash function is much smaller than the input space, so a hash function is not injective. However, for a subset of the input space that is smaller than the range, the hash function has few collisions. In particular, if a hash function is drawn from a 2-universal family of hash functions that map to a set of size m, then with high probability it will be collision free when used to map on the order of √m keys (see [1]). A perfect hash function maps a static set of n keys into a set of m integers without collisions, where n ≤ m. If n = m, the hash function is called minimal. Usually, the range of a minimal perfect hash function (MPHF) is the contiguous set of integers {0, 1, 2, ..., n-1}. One point to note is that while general hash functions deal with a dynamic set of keys, perfect hash functions always work with a static set of keys.

MPHFs have many applications (information retrieval systems, database systems, hypertext, hypermedia, language translation systems, electronic commerce systems, compilers, and operating systems, among others), but we focus on their use in web search engines. Large scale web applications typically work with several billion URLs. For example, commercial search crawlers (such as GoogleBot (Google), Yahoo! Slurp (Yahoo), MSNBot (Live), and Ask.com/Teoma) encounter several tens of billions of URLs during crawling and index them for search and retrieval. However, URLs are of variable length and not suitable for efficient processing. So, rather than process URLs directly, they are first hashed to a fixed size data structure, and all subsequent processing deals with the hashes. The hashing scheme has to be carefully designed to avoid collisions. Given 1 billion URLs, one can uniquely represent each URL using a 30-bit number (2^30 ≈ 1 billion). A simple universal hash would require twice as many bits to have a low probability of collision (see the birthday paradox [1]), which puts the required number of bits per URL at 60. A 60- or 64-bit hash is a perfect hash with high probability; the sketch at the end of this section makes this estimate concrete.

Some search related applications additionally require the generated hashes to be contiguous, i.e., the perfect hash function also needs to be minimal. For example, computing PageRank for a web graph requires a mapping from URL space to a contiguous sequence of integers that represent the rows and columns of the web graph adjacency matrix. In this paper, we propose using MPHFs for very large static web datasets, such as the static set of all URLs seen by a crawler and indexed by a search engine. Evaluating h(x) requires three memory accesses for any key x, and the description of h takes less than one byte per key.
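A back-of-the-envelope sketch (ours, for illustration; the paper itself contains no code) makes the bit-count estimate above concrete, using the standard birthday approximation that n keys hashed uniformly into 2^b values collide with probability about 1 - exp(-n(n-1)/2^(b+1)).

```python
import math

def collision_probability(n_keys: int, hash_bits: int) -> float:
    """Birthday approximation: P(some pair of n_keys collides) when
    hashing uniformly into a range of size 2**hash_bits."""
    m = 2.0 ** hash_bits
    return 1.0 - math.exp(-n_keys * (n_keys - 1) / (2.0 * m))

n = 10**9
print(collision_probability(n, 30))   # ~1.0:  30 bits can index 1B URLs, but collisions are certain
print(collision_probability(n, 60))   # ~0.35: n^2 is about 2^60, the birthday threshold
print(collision_probability(n, 64))   # ~0.03: collision free with high probability
```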
2. MINIMAL HASHING ALGORITHM

2.1 Notation and Terminology
We use the following notation and terminology in this paper:

U: the universe of keys, of size u = |U|.
S: the static set of keys to be hashed, S ⊆ U.
n: the number of static keys, n = |S|.
h: a hash function that maps keys from the universe U into a given range of integers M = {0, 1, 2, ..., m-1}.

h is a perfect hash function if it is one-to-one on S, i.e., if h(k1) ≠ h(k2) for all k1 ≠ k2 with k1, k2 ∈ S. h is a minimal perfect hash function if it is one-to-one on S and n = m.

2.2 The Minimal Hashing Scheme
Perfect hashing and minimal perfect hashing have enjoyed a rich history of development ever since hashing was introduced in the early days of computer science. Our construction improves upon the recent work of Czech et al. [2] and Botelho et al. [3]. These MPHF algorithms are based on random graphs and have the following stages:

Mapping: transform the key set S from the original universe U to a new universe U'.
Ordering: place the keys in a sequence that determines the order in which hash values are assigned to keys.
Searching: assign hash values to the keys in order.

The universe is mapped to the edges of a random graph G = (V, E) with |E| = |S|: key k is mapped to the edge e_k = (h1(k), h2(k)), where h1, h2: U → V are 2-universal hash functions. Next, one picks an arbitrary bijection between E and {0, ..., n-1} and finds a function g: V → {0, ..., n-1} such that the value assigned to edge e_k = (v1, v2) equals (g(v1) + g(v2)) mod n. The MPHF is then simply h(k) = (g(h1(k)) + g(h2(k))) mod n. Czech et al. [2] showed that if |V| = c|E| with c ≥ 2.09, then the graph G obtained in this way is acyclic with sufficiently high probability. Since the graph is acyclic, the set of equations for the g values can be solved very easily. The description of the hash function consists of the stored table of g values, whose size is proportional to |V| = cn.

The algorithm of Botelho et al. [3] sets c < 2.09, and consequently the graph may be cyclic. They show how one can order the assignment of hash values on the cyclic portion of the graph, the so-called 2-core, so that the MPHF can still be constructed (see [3] for the details). Their method works as long as the 2-core contains at most (1/2)|E| edges, and the size of the 2-core is known to be at most (1/2)|E| as long as c ≥ 1.15, thus leading to improvements in space.

Our algorithm begins with the observation that the distribution of the g values produced is skewed. In particular, we analyze the probability that g(v) = 0.
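To make the construction concrete before turning to the compression, here is a minimal sketch (ours; the paper provides no code) of the acyclic case of Czech et al. [2]. Keys become edges of a random graph on m = cn vertices; because the graph is acyclic, a single traversal of each component fixes the table g so that each key's edge receives its assigned value. The seeded h1 and h2 below are stand-ins for the paper's 2-universal hash functions, and retrying on a failed final check stands in for an explicit acyclicity test.

```python
import random

def build_mphf(keys, c=2.09, seed=1):
    """Sketch of the random-graph MPHF: h(k) = (g[h1(k)] + g[h2(k)]) % n,
    where the edge (h1(k), h2(k)) of key number i is assigned the value i."""
    n = len(keys)
    m = int(c * n) + 1                    # |V| = c * |E| vertices
    rnd = random.Random(seed)
    while True:
        s1, s2 = rnd.getrandbits(64), rnd.getrandbits(64)
        h1 = lambda k: hash((s1, k)) % m  # stand-ins for 2-universal hashes
        h2 = lambda k: hash((s2, k)) % m
        adj = [[] for _ in range(m)]      # vertex -> [(neighbor, key index)]
        for i, k in enumerate(keys):
            adj[h1(k)].append((h2(k), i))
            adj[h2(k)].append((h1(k), i))
        g = [0] * m
        visited = [False] * m
        for root in range(m):             # traverse each component once
            if visited[root]:
                continue
            visited[root] = True          # g[root] stays 0
            stack = [root]
            while stack:
                u = stack.pop()
                for v, i in adj[u]:
                    if not visited[v]:
                        visited[v] = True
                        g[v] = (i - g[u]) % n   # forces edge i to hash to i
                        stack.append(v)
        # In an acyclic graph every equation is now satisfied; a cycle or
        # degenerate edge fails the check, and we retry with fresh hashes.
        if all((g[u] + g[v]) % n == i for u in range(m) for v, i in adj[u]):
            return lambda k: (g[h1(k)] + g[h2(k)]) % n

urls = ["http://a.example/", "http://b.example/", "http://c.example/x"]
h = build_mphf(urls)
print(sorted(h(u) for u in urls))         # [0, 1, 2]: minimal and perfect
```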
We use this skew to compress the storage of the table of g values further. Theoretically, we prove that the compression allows us to reach a lower effective value of c ≈ 0.94, which is 23% better. Empirically, on a one billion URL dataset, we show that c can be taken as low as 0.81, an improvement of 42% over c = 1.15.

2.3 Key Improvements
The following proposition allows us to analyze the proportion of vertices with g(v) = 0.

Proposition 1. The expected proportion of vertices that have g(v) = 0 owing to the fact that they belong to a connected component of size at most 2 in G is

p0 = (1 - 1/(cn))^(2n) + (1/c)(1 - 2/(cn))^(2(n-1)).

(The first term accounts for isolated vertices, whose g values are unused; the second accounts for the one vertex per isolated edge whose g value can be fixed to 0.)

One can use a cn-bit vector to identify unused/zero g(v) values. With the bit-vector, the space required for the MPHF drops from cn log2(n) bits to cn(1 - p0) log2(n) + cn bits. For n = 10^9 and c = 1.15 we have p0 ≈ 0.211, and thus the effective c drops to about 0.94. Empirically, one can reduce c down to 0.93, wherein p0 ≈ 0.12 and the effective c drops to 0.81. One can also exploit the non-uniform distribution of the g values using Huffman coding.

2.4 Scalable High Performance Architecture
[Figure 1 omitted: billions of URLs (N) are distributed by b = h0(url) % B into buckets Bucket-1, ..., Bucket-B, with B chosen so that the number of entries per bucket is in the 160-180 range; a local MPH (MPH-1, ..., MPH-B) is built for each bucket.]
Figure 1. Architecture of our minimal perfect hash function.

Figure 1 shows the high level organization of our minimal perfect hash (MPH) function for billions of URLs. Generation of an MPHF consists of the following steps (a code sketch of this scheme follows Table 1 below):

1. Read the input data from a hard drive or the network. Text data is parsed and segmented into individual URLs.
2. Convert URLs to fingerprints and distribute them into buckets. For each URL we compute a 64-bit fingerprint using Jenkins' hash. Based on the fingerprint, we determine a bucket ID and place the URL in that bucket. Bucket sizes are chosen so that buckets contain 160-180 URLs on average.
3. Create a minimal perfect hash function (over the fingerprints) for each of the buckets, as described in Sections 2.2 and 2.3. The buckets are processed in parallel.
4. Stitch the per-bucket MPHFs into a global MPHF using a table of offsets for the buckets.

2.5 Results
In the segmentation step, one billion URLs were segmented into 6.25 million buckets. Each of the buckets was independently processed to obtain local MPHFs. Table 1 shows the space gains from the bit-vector and from Huffman coding of the g table. The time to create the table was 3.9 minutes on an AMD Opteron 285 (dual processor) 64-bit machine with 16 GB RAM. Hash lookups required roughly 0.025 × 10^-6 seconds each, completing 1 billion lookups in 23.7 seconds.

Table 1. Space gains using the zero/unused g(v) bit-vector and Huffman coding for hashing 1 billion URLs.

  c     MPHF size  zero/unused  Huffman coded  Space gain  Bits per  Bits per URL
        (GB)       g(v) values  size (GB)      (%)         URL       (Huffman)
  1.15  1.267      22.99 %      1.011          20.12       8.10      8.09
  1.00  1.114      18.52 %      0.929          14.88       7.59      7.43
  0.95  1.067      16.26 %      0.899          13.43       7.39      7.19
  0.93  1.048      15.35 %      0.885          12.76       7.31      7.08
  0.90  1.017      14.02 %      0.862          11.92       7.17      6.90
  0.89  1.008      13.54 %      0.858          11.60       7.13      6.86
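The bucket-and-stitch scheme of Figure 1 can be sketched as follows (our illustration; the paper gives no code). Here build_mphf is the per-bucket construction sketched in Section 2.2, fingerprint64 is a stand-in for the paper's 64-bit Jenkins hash, and the sequential loop over buckets stands in for the parallel processing of step 3.

```python
import hashlib

def fingerprint64(url: str) -> int:
    # Stand-in for the paper's 64-bit Jenkins hash of a URL.
    return int.from_bytes(
        hashlib.blake2b(url.encode(), digest_size=8).digest(), "big")

def build_global_mphf(urls, avg_bucket=170):
    """Distribute fingerprints into B buckets, build a local MPHF per
    bucket, and stitch them with a table of per-bucket offsets."""
    B = max(1, len(urls) // avg_bucket)    # ~160-180 URLs per bucket
    buckets = [[] for _ in range(B)]
    for url in urls:                       # assumes distinct URLs and
        fp = fingerprint64(url)            # no fingerprint collisions
        buckets[fp % B].append(fp)         # bucket id b = h0(url) % B

    # Step 3 (parallelizable): one local MPHF per non-empty bucket.
    local = [build_mphf(b) if b else None for b in buckets]

    # Step 4: offsets are prefix sums of bucket sizes, so the buckets'
    # hash ranges tile {0, ..., N-1} without gaps or overlaps.
    offsets, total = [], 0
    for b in buckets:
        offsets.append(total)
        total += len(b)

    def h(url):                            # valid only for indexed URLs
        fp = fingerprint64(url)
        b = fp % B
        return offsets[b] + local[b](fp)
    return h
```

Because each local MPHF maps its bucket onto {0, ..., bucket size - 1} and the offsets are prefix sums, the stitched function h maps the full URL set onto {0, ..., N-1}, i.e., it remains minimal and perfect globally.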
3. REFERENCES
[1] D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, 1973.
[2] Z. Czech, G. Havas, and B. Majewski. An optimal algorithm for generating minimal perfect hash functions. Information Processing Letters, 43(5):257-264, 1992.
[3] F. C. Botelho, Y. Kohayakawa, and N. Ziviani. A practical minimal perfect hashing method. In Proc. 4th International Workshop on Efficient and Experimental Algorithms (WEA 2005), LNCS vol. 3505, pages 488-500. Springer-Verlag, 2005.