Family Trees
An ordered dictionary with optimal congestion, locality, degree, and search time

Kevin C. Zatloukal    Nicholas J. A. Harvey*

Abstract

We consider the problem of storing an ordered dictionary data structure over a distributed set of nodes. In contrast to traditional sequential data structures, distributed data structures should ideally have low congestion. We present a novel randomized data structure, called a family tree, to solve this problem. A family tree has optimal expected congestion, uses only a constant amount of state per node, and supports searches and node insertion/deletion in expected O(log n) time on a system with n nodes. Furthermore, a family tree supports keys from any ordered domain. Because the keys are not hashed, searches have good locality in the sense that intermediate nodes on the search path have keys that are not far outside of the range between the source and destination.

1 Introduction

While data structures have long been used to organize data on an individual computer system, the past decade has seen significant work on distributed data structures for organizing nodes and data in a distributed system. This work has proceeded chiefly in three directions: peer-to-peer overlay networks, Scalable Distributed Data Structures, and compact routing structures. This paper describes family trees, which are distributed data structures suitable for use as a peer-to-peer overlay and are an advancement on the existing work in this area.

Peer-to-peer overlay networks are structures for organizing nodes, routing traffic, and searching for data in a distributed system. A primary objective of peer-to-peer overlays is that no node bear a disproportionate amount of work, implying that the overlay's structure should have low congestion. Overlay networks have numerous practical applications, such as multicast communication schemes [17] and distributed caching mechanisms [8, 19]. Overlays such as CAN [16], Chord [18], Viceroy [15], and Koorde [9] use hashing to distribute nodes and data uniformly in a numeric space. Thus, using the consistent hashing approach [10], they can support a distributed hash table. Each overlay node maintains "pointers" to a set of other nodes, where each pointer is typically just a physical network address. Any node may route a message to any target node by using the appropriate pointers to forward the message along a sequence of intermediate nodes. These overlays differ primarily in their scheme for selecting routing pointers. The schemes used by all of these overlays [16, 18, 15, 9] achieve routing in O(log n) time, assuming each pointer can be traversed in unit time.¹

Most existing peer-to-peer overlays require Ω(log n) routing pointers per node in order to achieve this routing performance. Viceroy [15] and Koorde [9] are the notable exceptions: they achieve O(log n) hops with only O(1) pointers per node. Having few routing pointers offers a significant practical benefit, since correspondingly less maintenance traffic is required to check the integrity of the overlay structure. Alternatively, an overlay with fewer routing pointers may check those pointers more frequently while using the same overall bandwidth.

*MIT Computer Science and Artificial Intelligence Laboratory. Email: {kevinz,nickh}@mit.edu. This permutation of the authors' names was chosen uniformly at random in order to ensure fair treatment of the alphabetically challenged.
¹We will use n to denote the number of nodes throughout this paper.
Skip Graphs [1], SkipNet [6], and their variants [7, 2] are peer-to-peer overlays, based on skip lists, that support ordered keys. Ordered overlays have significant practical advantages over distributed hash tables. First, they can take advantage of locality in search requests. For example, a search from a.mit.edu for p.mit.edu will not require contacting any nodes outside of mit.edu. This property does not necessarily hold for distributed hash tables. Second, Skip Graphs [1] and SkipNet [6] support range query operations. For example, they could be used to broadcast a message to all nodes in mit.edu. However, these ordered overlays all require Ω(log n) pointers per node. Prior to this paper, no existing overlay has used O(1) pointers without requiring that the keys be hashed.

Scalable Distributed Data Structures (SDDSs) were first proposed by Litwin et al. [13] as a means to dynamically distribute buckets of data among the nodes of a distributed system and to perform efficient search operations across those nodes. Numerous SDDS structures have been proposed, including distributed hash tables [13] and ordered, distributed dictionaries [14, 11, 3]. Unlike peer-to-peer overlays, SDDSs are not intended for use as a routing structure; accordingly, SDDSs do not focus on congestion.
Family trees could conceivably be used as the basis for an SDDS, but we do not consider such an extension in this paper.

Compact routing structures [4, 5] have significant differences from SDDSs and peer-to-peer overlays. First, the latter two assume connectivity between any pair of nodes (typically via some underlying routing scheme), whereas compact routing structures do not. Second, compact routing structures typically do not allow updates (i.e., node insertions or deletions). Lastly, compact routing structures typically must assign new identifiers to the nodes for routing purposes. Due to these differences, compact routing structures typically cannot be used for implementing ordered dictionaries, so we do not discuss them further.

In this paper we address the previously unsolved problem of designing an ordered, distributed dictionary with constant degree, O(log n/n) congestion, and O(log n) performance for search and update operations. We present a randomized solution to this problem, family trees, which achieve the update, search, and congestion bounds in expectation and use only nine pointers per node. This work is an improvement over existing distributed data structures, which are either not ordered (and hence have poor locality) [15, 9], do not support low-congestion routing [14, 3], or require Ω(log n) routing pointers per node [1, 6, 7, 2].

The remainder of this paper is organized as follows. Section 2 describes family trees and proves some of their important properties. Section 3 presents the algorithms for searching in a family tree and proves bounds on performance. Section 4 gives algorithms for inserting and removing nodes from a family tree. Section 5 concludes the paper.

2 Family Trees

A family tree is a data structure for organizing machines or resources in a distributed system. Family trees combine the techniques of Viceroy [15] and SkipNet [6] in a novel manner. Like Viceroy, and the butterfly network [12] on which it is based, nodes are separated into approximately lg n levels; nodes at level i will have pointers to nodes approximately 2^i lg n positions away in the ordered list of keys. We generate these pointers by separating the nodes at level i randomly into 2^i separate ordered lists, in a manner similar to SkipNet. The resulting data structure has a natural analogy to a genealogical family tree, as will be made clear shortly. Figure 1 shows an example instance of a family tree.

2.1 Definitions. We will now formally define the properties of each node in the data structure. Each node has a name ID, which is an arbitrary element from some ordered domain. This is the normal key for looking up a node in the dictionary. If X is a node, we will denote its name ID by X.NameID. Each node also has a numeric ID, denoted X.NumID, which is an infinite sequence of random bits, each bit chosen uniformly at random. Equivalently, the numeric ID is a random real number in [0, 1). As explained in Section 3.2, nodes can also be looked up by their numeric IDs. Thus, by using consistent hashing [10], family trees support a distributed hash table.

Obviously, each node does not generate an infinite sequence of random bits. Each node only needs to generate as many bits as necessary to distinguish its numeric ID from the others. The following proposition bounds the number of numeric ID bits that will need to be generated.

Proposition 2.1. Every node only generates O(lg n) numeric ID bits with high probability.²

Proof. Let X and Y be nodes. The probability that X and Y choose the first (c+2) lg n bits the same is 1/2^((c+2) lg n) = 1/n^(c+2). Thus, the probability that any node chooses the first (c+2) lg n bits the same as X is (n−1)/n^(c+2) < 1/n^(c+1), and the probability that any node needs more than (c+2) lg n bits is less than n/n^(c+1) = 1/n^c.

²Throughout this paper, to say that X is O(f(n)) with high probability means that there exists a constant a such that Pr(X > a·c·f(n)) < 1/n^c for any c ≥ 1 and for sufficiently large n.
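To make the lazy generation of numeric ID bits concrete, here is a minimal in-memory sketch in Python. The class and helper names (NumericID, distinct_prefix_len) are ours, not the paper's, and the standard random module stands in for whatever randomness source a deployed node would use; Proposition 2.1 is what justifies stopping after O(lg n) bits.

import random

class NumericID:
    """Lazily generated infinite bit sequence (a random real in [0, 1))."""

    def __init__(self, rng=random):
        self.bits = []          # bits generated so far
        self.rng = rng

    def bit(self, i):
        # Generate bits only on demand; by Proposition 2.1, O(lg n) bits
        # suffice to distinguish any two nodes with high probability.
        while len(self.bits) <= i:
            self.bits.append(self.rng.randint(0, 1))
        return self.bits[i]

    def prefix(self, k):
        """First k bits, as a tuple (lexicographic order = real order)."""
        return tuple(self.bit(i) for i in range(k))

def distinct_prefix_len(x, y):
    """Number of bits needed to tell two numeric IDs apart."""
    k = 0
    while x.bit(k) == y.bit(k):
        k += 1
    return k + 1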
Next, each node X has a level, denoted X.Level, which determines the approximate distance, within the ordering by name ID, advanced by the node's level list pointers. The level number is chosen at random from {0, ..., ⌈lg n0⌉ − 1}, where n0 is an estimate of n. We use an estimate for two reasons. First, it is not clear that n can be precisely computed in a manner that is efficient in both time and congestion. Second, it is helpful that different nodes produce different estimates of n, so that nodes are not all updated to a new level simultaneously: if we computed n exactly, then every node would need to update its level when n reached the next power of 2.

We will discuss the method of estimating n in Section 2.2. In the remainder of this subsection, we will define all of the pointers on a node. These pointers are the only other data associated with each node beyond its name ID, numeric ID, and level.

Figure 1: An example family tree with 24 nodes. Each node, denoted by a square, has a name ID (i.e., key), numeric ID, and level. Level 0 nodes are connected into a single list. The level 1 nodes are divided into two disjoint, interleaved lists: nodes whose numeric IDs start with a 0, and nodes whose numeric IDs start with a 1. At level 2, there are four lists: nodes join the list given by the first two bits of their numeric ID. All of these level lists are sorted by the name IDs of the nodes. Nodes also maintain pointers to their parents, one level higher: white pointers denote a mother, and black pointers denote a father. Nodes also point to their first child, one level lower. Lastly, all nodes belong to a list sorted by name IDs and a list sorted by numeric IDs.

In a family tree, each node has nine pointers. The first six pointers link each node into three circularly linked lists. The NamePrev and NameNext pointers link all nodes into a list sorted by name ID. The NumPrev and NumNext pointers link all nodes into a list sorted by numeric ID. As mentioned above, the data structure separates the nodes into levels and, within level i, into 2^i separate lists. Each node chooses its list according to the first i bits of its numeric ID. The LevelPrev and LevelNext pointers link the nodes that have the same level i and have the same first i bits in their numeric IDs into a list sorted by name ID.

Additionally, there are pointers between levels. If X is a node, then X.Mother points into the list of nodes with level equal to X.Level + 1, with numeric ID matching the first X.Level bits of X.NumID, and with (X.Level + 1)-th numeric ID bit equal to 0. Of these nodes, X.Mother points to the node whose name ID is closest but less than X.NameID, or it points to nil if no such node exists. X.Father is defined identically except its (X.Level + 1)-th bit must be 1. Lastly, X.FirstChild points into the list of nodes in level X.Level − 1 whose numeric IDs match the first X.Level − 1 bits of X.NumID. Of these nodes, X.FirstChild points to the node whose name ID is closest but greater than X.NameID, or it points to nil if no such node exists.

For example, consider the node with name ID "N" in Figure 1, which we will denote N. Node N is linked into the list sorted by name ID and the list sorted by numeric ID. Since N has chosen level 1, it belongs to one of the two level 1 lists, according to the first bit of its numeric ID. This bit is 0, so N belongs to the same level 1 list as "B", "H", and "T". The first child of N is "O" because this is the next name after "N" in the sole level 0 list. Node "R" is also a child of N. Node "M" is the closest node to "N" in the level 2 list corresponding to numeric ID prefix 00, so node "M" is N.Mother. Similarly, node "G" is N.Father, as it is the closest node to "N" in the level 2 list with prefix 01.
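The per-node state just described fits in a small record. The following sketch is ours (the snake_case field names paraphrase the paper's pointer names); it shows only the three data fields and the nine pointers, with no behavior attached.

class Node:
    """One family-tree node: three data fields plus the nine pointers."""
    __slots__ = (
        "name_id", "num_id", "level",
        # three circularly linked lists (six pointers)
        "name_prev", "name_next",     # all nodes, sorted by name ID
        "num_prev", "num_next",       # all nodes, sorted by numeric ID
        "level_prev", "level_next",   # same level and numeric-ID prefix
        # inter-level pointers (three pointers)
        "mother", "father", "first_child",
    )

    def __init__(self, name_id, num_id, level):
        self.name_id, self.num_id, self.level = name_id, num_id, level
        for p in self.__slots__[3:]:
            setattr(self, p, None)    # nil until the shape is built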
2.2 Estimating n. Each node must estimate n in order to choose an appropriate level. This estimation process is identical to that of Viceroy. Let X be a node and Y be its successor in numeric ID space. (The last node in the list will take the first as its successor.) Then, regarding the numeric IDs as real numbers in [0, 1), we estimate n as n0 = 1/d(X.NumID, Y.NumID), where d(x, y) = (y − x) mod 1. This estimate may be substantially incorrect. However, since we only need the log of the estimate, we can show that every node estimates lg n0 = Θ(lg n) with high probability.

Proposition 2.2. Every node estimates n0 such that ⌊lg(n/((c+2) ln n))⌋ ≤ ⌊lg n0⌋ ≤ ⌊(c+2) lg n⌋, for any constant c, with high probability.

Proof. First, we will argue the lower bound. Let d = 1/(n/((c+2) ln n)) = (c+2) ln n/n. For a node X to estimate n smaller than 1/d, every other node must have chosen a numeric ID outside of the range of length d after X.NumID. (If X.NumID is near to 1, then this range will be in two pieces: one at the end of [0, 1) and one at the beginning.) The probability that X estimates n this small is (1−d)^(n−1) = (1−d)^n/(1−d) < n(1−d)^n < n·exp(−dn) = n·exp(−(c+2) ln n) = 1/n^(c+1). Thus, the probability that any node estimates n this small is less than n/n^(c+1) = 1/n^c.

Now, we argue the upper bound. Let d = 1/n^(c+2). Fix a node X. The probability that some other node Y has a numeric ID within the range of length d after X.NumID is d. Thus, the probability that X estimated n > 1/d is at most (n−1)d = (n−1)/n^(c+2) < 1/n^(c+1). Thus, the probability that any node estimated n > 1/d is less than n/n^(c+1) = 1/n^c.

We have shown that the desired bounds hold without the floors. Taking the floor of the estimate and both bounds can only increase the probability that the bounds hold, so the proof is complete.

An advantage of this method for estimating n is that each node's estimate is only dependent on the distance to its successor in the numeric ID list. Thus, each node needs to change its estimate only when its successor changes, which will make insertions and deletions efficient. We will discuss this aspect further in Section 4.
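A minimal sketch of the estimation and level-choice steps, assuming the numeric IDs are already available as reals in [0, 1) and are distinct (so the gap d is nonzero). The function names are ours.

import math, random

def estimate_n(x_num, succ_num):
    """Viceroy-style estimate n0 = 1/d(x, succ), with d(x, y) = (y - x) mod 1.

    x_num and succ_num are numeric IDs viewed as reals in [0, 1);
    the wrap-around at 1 is handled by the mod."""
    d = (succ_num - x_num) % 1.0
    return 1.0 / d

def choose_level(x_num, succ_num, rng=random):
    """Pick a level uniformly from {0, ..., ceil(lg n0) - 1}."""
    n0 = estimate_n(x_num, succ_num)
    num_levels = max(1, math.ceil(math.log2(n0)))
    return rng.randrange(num_levels)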
A notable disadvantage of this method is that it makes the levels dependent on the numeric IDs. Furthermore, the levels themselves are not independent. If X has a level of i, then we know that there is no other node within a range of 1/2^i after it. This means that some other node estimated n0 < (n−1)/(1−1/2^i) and thus has a level at most ⌈lg((n−1)/(1−1/2^i))⌉ − 1. Clearly, this is a small amount of dependence: we have only shown that some other node has a level slightly less than E[⌈lg n0⌉ − 1]. However, even this small amount of dependence complicates analysis. The issue of dependence also exists for Viceroy, although the issue appears to be somewhat more involved for the analysis of family trees.

To handle this difficulty, we use the following approach. Proposition 2.2 shows that all estimates of n are reasonable with high probability. Thus Pr(X.Level = i | X estimated n reasonably) is within a constant factor of 1/lg n. This yields a bound on Pr(X.Level = i) as follows. In general, if E is some event that holds with high probability and F is any other event, we can bound Pr(F) using Pr(F | E) and an additional error term. Specifically, we have |Pr(F) − Pr(F | E)| ≤ 1/n^c. As long as our probability and expectation bounds introduce no more than n such error terms, for some fixed c, we can use Pr(F) and Pr(F | E) interchangeably while only introducing an error term of 1/n^(c−1). By choosing suitably large c, this error term disappears in the O notation. In the analysis below, we can assume that all nodes estimate n within the bounds of Proposition 2.2 whenever convenient, because we will not need to introduce more than O(n) error terms. Thus, we will ignore error terms in the proofs below in order to keep the explanations clear.

To simplify notation, we will let the c in Proposition 2.2 be fixed and define the following notation for the upper and lower bounds.

Definition 2.1. The minimum level estimate is Lmin = ⌊lg(n/((c+2) ln n))⌋ and the maximum level estimate is Lmax = ⌊(c+2) lg n⌋.

2.3 Global Properties. Now that all data belonging to a node has been defined, we examine the global properties of family trees.

Proposition 2.3. If all name IDs, numeric IDs, and levels are given, then the family tree has only one possible shape.

Proof. Since we do not allow duplicates, the name ID list has only one possible shape. Similarly, the numeric ID list has only one possible shape. Each node belongs to exactly one level list according to its level and the relevant bits of its numeric ID. Each such list is sorted, so they can have only one shape. Lastly, the Mother, Father, and FirstChild pointers are completely determined by the shape of the level lists.

Corollary 2.1. The probability that a family tree has a given shape is independent of history.

Proof. By Proposition 2.3, the current shape of the data structure depends only on the name IDs, numeric IDs, and levels of the items currently in the dictionary. The name IDs and numeric IDs themselves are clearly independent of history. As long as each node chose its level by estimating n using its current successor (not a past successor), the probability distribution of the levels is independent of history. We will ensure that this is true by having a node choose a new level whenever its successor changes.
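Proposition 2.3 can be made concrete with an offline construction that derives every pointer from the (name ID, numeric ID, level) triples alone. The sketch below is ours and simplifies the paper's structure slightly: level lists are built linear rather than circular, and 32 prefix bits stand in for the infinite numeric IDs (enough to order small examples, per Proposition 2.1).

def build_shape(nodes):
    """Derive every pointer from (name_id, num_id, level); cf. Prop. 2.3."""
    by_name = sorted(nodes, key=lambda v: v.name_id)
    for a, b in zip(by_name, by_name[1:] + by_name[:1]):   # circular
        a.name_next, b.name_prev = b, a
    by_num = sorted(nodes, key=lambda v: v.num_id.prefix(32))
    for a, b in zip(by_num, by_num[1:] + by_num[:1]):       # circular
        a.num_next, b.num_prev = b, a
    # Level lists: same level i and same first i bits of the numeric ID.
    lists = {}
    for v in by_name:
        lists.setdefault((v.level, v.num_id.prefix(v.level)), []).append(v)
    for members in lists.values():
        for a, b in zip(members, members[1:]):
            a.level_next, b.level_prev = b, a
    def closest_below(v, key):
        # Closest name ID strictly below v's, in the level list `key`.
        cands = [u for u in lists.get(key, []) if u.name_id < v.name_id]
        return max(cands, key=lambda u: u.name_id, default=None)
    for v in nodes:
        p = v.num_id.prefix(v.level)
        v.mother = closest_below(v, (v.level + 1, p + (0,)))
        v.father = closest_below(v, (v.level + 1, p + (1,)))
        if v.level > 0:
            down = lists.get((v.level - 1, v.num_id.prefix(v.level - 1)), [])
            bigger = [u for u in down if u.name_id > v.name_id]
            v.first_child = min(bigger, key=lambda u: u.name_id, default=None)

Because every step above is deterministic given the triples, running it twice on the same triples yields the same pointers, which is exactly the content of Proposition 2.3.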
Next, we turn to properties of family trees that will be useful in analyzing the performance of search and update operations. For all operations, it is necessary to bound the number of levels in the data structure, since it may be necessary to visit all of them. We can bound this very tightly.

Theorem 2.1. The data structure has no more than (c+2) lg n levels with high probability.

Proof. By Proposition 2.2, no node estimated lg n0 larger than this. Hence, no node could have chosen a level larger than this.

Since the nodes at level L are separated randomly into 2^L lists, the expected length of these lists is less than n/(2^L · Lmin). (As mentioned above, we are ignoring the error term of size 1/n^(c−1) caused by the dependence of levels and numeric IDs.) Straightforward Chernoff bounds show that the length of these lists does not deviate much from its expectation, except at very high levels. We now consider the probability of encountering empty lists at all levels slightly less than Lmin.

Theorem 2.2. Let L = lg n − 2 lg lg n − 2 lg(c+2). Every list in the family tree with level at most L is non-empty with high probability.

Proof. The probability that a list at level k is empty is at most (1 − 1/(2^k (c+2) lg n))^n. For lists at level at most L, this bound is at most (1 − (c+2) lg n/n)^n < e^(−(c+2) lg n) = 1/n^(c+2). Next, observe that the total number of lists in levels 0 to L is at most 2^(L+1) < 2n. Thus, applying a union bound over all lists yields the desired result.

The preceding discussions focused on properties of the levels and their lists. Next, we will look at properties that hold for arbitrary individual nodes in a family tree. Let X be a node in a family tree. Both X.FirstChild and X.LevelNext.FirstChild point to nodes in the same list (at level X.Level − 1). Consider the number of nodes in this list whose name IDs fall between X.NameID and X.LevelNext.NameID. This is an important concept, so we give it a name.

Definition 2.2. Let X be a node. Let L be the level list containing X.FirstChild. We denote by Children(X) the sublist of L containing all nodes Y such that X.NameID < Y.NameID and Y.NameID < X.LevelNext.NameID.

Intuitively, we would expect every node to have about two children, since the lower level list is about twice as long. The following theorem shows that the expected length is indeed a constant.

Theorem 2.3. For any node X, |Children(X)| is O(1) in expectation and O(lg n) with high probability.

Proof. Consider the process of traversing the name ID list, starting at X, and choosing a numeric ID and level for each of the nodes. Let C denote the list at level X.Level − 1 containing X's children. Our goal is to count the number of nodes that are chosen to belong to C before choosing X's successor in its level list. This process is a sequence of trials with three outcomes: on a "success", we have chosen X's successor; on a "failure", we have added a new node to C; otherwise, we have a "retry", indicating an unrelated node. Equivalently, we can eliminate the "retry" outcome by thinking of each trial as continuing until the first success or failure. Since there are now only two possible outcomes, it is a Bernoulli trial. Call a success in the Bernoulli trial a "heads" and a failure in the Bernoulli trial a "tails". The remainder of the proof bounds the number of tails before the first heads.

First, we must bound the probabilities of success and failure in the three-outcome trial. Let k = X.Level − 1. For the probability of success (finding X's successor in its level list), we have

Pr(success) ≥ 1/(2^(k+1) Lmax) = 1/(2^(k+1) (c+2) lg n)
Pr(success) ≤ 1/(2^(k+1) Lmin) = 1/(2^(k+1) lg(n/((c+2) ln n)))

For the probability of failure (finding another node in C), we have

Pr(failure) ≥ 1/(2^k Lmax) = 1/(2^k (c+2) lg n)
Pr(failure) ≤ 1/(2^k Lmin) = 1/(2^k lg(n/((c+2) ln n)))

Using these, we can bound the probability of a tails. Let Pr(failure) = α/(2^k lg n). Then we have

Pr(tails) = Pr(failure)/(Pr(failure) + Pr(success))
          < (α/(2^k lg n)) / (α/(2^k lg n) + 1/(2^(k+1) (c+2) lg n))
          = α/(α + 1/(2(c+2)))
          = 1/(1 + 1/(2α(c+2)))

To get an upper bound for Pr(tails), we apply an upper bound for α. By the definition of α, we have α < lg n / lg(n/((c+2) ln n)). For sufficiently large n, we can bound α < 2. Thus, for sufficiently large n, we can bound Pr(tails) by some constant p < 1. Then, the expected number of tails before the first heads is less than p/(1−p) = O(1), one less than the expected value of a geometric distribution. This shows that the expected number of children is O(1).

To complete the proof, we compute a high-probability bound on the number of children. The probability that we see at least t tails before a heads is p^t. If we let β = 1/p, then picking t = c·(log_β 2)·lg n gives a probability of 1/β^(c (log_β 2) lg n) = 1/n^c.
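The heads/tails argument is easy to check empirically. The Monte Carlo sketch below is ours and simplifies the model: every node knows n exactly instead of estimating n0, numeric IDs are L-bit integers, and the name-ID list is scanned without wrap-around. Under those assumptions, the reported averages should hover near a small constant as n grows, as Theorem 2.3 predicts.

import math, random

def avg_children(n, trials=10):
    """Estimate E|Children(X)| by direct simulation (sketch, simplified)."""
    L = max(1, math.ceil(math.log2(n)))          # number of levels
    total = samples = 0
    for _ in range(trials):
        level = [random.randrange(L) for _ in range(n)]
        num = [random.getrandbits(L) for _ in range(n)]
        pre = lambda y, k: num[y] >> (L - k)     # first k bits of node y
        for x in range(n):
            k = level[x]
            if k == 0:
                continue                         # level 0 nodes have no children
            kids = 0
            for y in range(x + 1, n):
                if level[y] == k and pre(y, k) == pre(x, k):
                    break                        # X's successor in its level list
                if level[y] == k - 1 and pre(y, k - 1) == pre(x, k - 1):
                    kids += 1                    # a member of Children(X)
            total += kids
            samples += 1
    return total / samples

if __name__ == "__main__":
    for n in (256, 512, 1024):
        print(n, round(avg_children(n), 2))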
Definition 2.3. For any node X, let Parents(X) = {Y | X ∈ Children(Y)}.

A slight variation on the proof of Theorem 2.3 yields the following result.

Theorem 2.4. For any node X, |Parents(X)| is O(1) in expectation and O(lg n) with high probability.

The previous theorems examined the level lists. We now focus on the name ID list and the numeric ID list, both of which contain every node. The next theorem shows that no matter where we are in the name ID or numeric ID list, there is always a nearby node that is at a given level (provided that level is not too large).

Theorem 2.5. The distance in the name ID list (or numeric ID list) from a node X to the nearest node at level L ∈ {0, ..., Lmin − 1} is O(lg n) in expectation and O(lg^2 n) with high probability.

Proof. Imagine traversing through the name ID list (or numeric ID list) and choosing the level of each node as we encounter it. The expected distance to a node at level L is

E[distance] < Σ_{i≥1} i · (1/Lmin)(1 − 1/Lmax)^(i−1)
            = (1/Lmin) / (1 − (1 − 1/Lmax))^2
            = Lmax^2 / Lmin
            = O(lg n)

The probability that there are no level L nodes within a distance of k·Lmax is less than (1 − 1/Lmax)^(k·Lmax) < e^(−k). If we choose k = c ln n, then there are no level L nodes within a distance of c(c+2) lg n ln n = O(lg^2 n) with probability less than 1/n^c.

Another variation on the proofs of Theorem 2.3 yields the following result.

Theorem 2.6. Let X be a node at level L. The number of nodes whose name ID is between X.NameID and X.LevelNext.NameID is O(2^L lg n) in expectation and O(2^L lg^2 n) with high probability.

3 Search Operations

In this section, we will describe the search operations and analyze their performance. In the next section, we will consider update operations.

3.1 Search by Name ID. The Search-By-Name-ID operation searches from a start node to find the node whose NameID is closest but less than or equal to a given destination name. Figure 2 contains the pseudocode for this operation. For the sake of simplicity, we have assumed that the destination name is greater than the start node's name. The other case is similar.

Search-By-Name-ID(start, dest)
 1  X ← Linear-Search-For-Level-0(start, dest)
 2  X ← Find-Closest-At-Level-0(X, dest)
 3  X ← Linear-Search-For-Destination(X, dest)
 4  return X

Linear-Search-For-Level-0(X, dest)
 5  while X.NameNext ≠ nil and
 6        X.NameNext.NameID ≤ dest and
 7        X.Level > 0
 8    do X ← X.NameNext
 9  return X

Find-Closest-At-Level-0(X, dest)
10  left ← X.NameID
11  while true
12    do if X.LevelNext = nil or
13          dest < X.LevelNext.NameID
14       then break
15       P ← X.Mother or X.Father, chosen at random
16       if P ≠ nil
17       then X ← P
18       else break
19       X ← Level-Search-By-Name-ID(X, left)
20  while true
21    do if X.Level > 0
22       then X ← X.FirstChild
23       else break
24       X ← Level-Search-By-Name-ID(X, dest)
25  return X

Level-Search-By-Name-ID(X, name)
26  while X.LevelNext ≠ nil and
27        X.LevelNext.NameID ≤ name
28    do X ← X.LevelNext
29  while X.LevelPrev ≠ nil and
30        name < X.NameID
31    do X ← X.LevelPrev
32  return X

Linear-Search-For-Destination(X, dest)
33  while X.NameNext ≠ nil and
34        X.NameNext.NameID ≤ dest
35    do X ← X.NameNext
36  return X

Figure 2: Search-By-Name-ID finds the node whose name ID is closest to the given destination.

The algorithm works in four phases. It is important for search locality that the second phase begin at a level 0 node. The purpose of the first phase is to find a level 0 node near the start node. This is accomplished in Linear-Search-For-Level-0 by a linear search in the name ID list. The second and third phases are implemented in the helper function Find-Closest-At-Level-0.
In the second phase, lines 10–19, we move upward until we reach a node X, to the left of where we began, such that dest is between X and its successor in that level, or until we reach the top. Line 19 maintains the invariant that X is the closest node to the left of where the second phase began (left). The third phase, lines 20–24, is symmetric to the second phase except that we are centered around dest and moving downward. This third phase is analogous to a binary search or a skip list search. The node returned from Find-Closest-At-Level-0 will be the level 0 node closest to the destination, but it may not be the closest node overall, since the destination node may not be at level 0. In the fourth phase, implemented in Linear-Search-For-Destination, we search through the name ID list to find the node closest to the destination. The correctness of the algorithm is assured by this last phase.

Next, we will analyze the running time. Theorem 2.5 shows that the first phase requires O(lg n) time in expectation and O(lg^2 n) time with high probability. Since the number of levels is O(lg n) with high probability, the outer loops of the second and third phases will execute no more than O(lg n) times. Each iteration in the second phase does O(1) work except for the call to Level-Search-By-Name-ID. Note that each node traversed in this call is a parent of the successor of the node reached by the previous iteration. Theorem 2.4 shows that the number of such parents is O(1) in expectation and O(lg n) with high probability. Each iteration of the third phase is analogous except that we traverse children instead of parents. Thus the total running time of the second and third phases is O(lg n) in expectation and O(lg^2 n) with high probability.³ The analysis of the fourth phase is identical to that of the first. Thus, the running time of Search-By-Name-ID is O(lg n) in expectation and O(lg^2 n) with high probability.

³There is considerable slack in this argument. It can be shown that these phases run in O(lg n) time with high probability. Unfortunately, the tighter bound for these phases does not improve the bound for the algorithm as a whole.
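For concreteness, here is a direct Python port of Figure 2. The port is ours: it assumes node objects with the fields from Section 2.1, reads the comparisons as "less than or equal" to match the operation's "closest but not greater" specification, and adds a nil guard on FirstChild that the pseudocode leaves implicit (Theorem 2.2 makes empty lists unlikely but not impossible).

import random

def search_by_name_id(start, dest):
    """Figure 2 in Python: the node with the largest name ID <= dest."""
    x = linear_search_for_level_0(start, dest)      # phase 1
    x = find_closest_at_level_0(x, dest)            # phases 2 and 3
    return linear_search_for_destination(x, dest)   # phase 4

def linear_search_for_level_0(x, dest):
    while (x.name_next is not None and
           x.name_next.name_id <= dest and x.level > 0):
        x = x.name_next
    return x

def find_closest_at_level_0(x, dest):
    left = x.name_id
    while True:                                     # phase 2: move upward
        if x.level_next is None or dest < x.level_next.name_id:
            break
        p = random.choice((x.mother, x.father))     # random parent choice
        if p is None:
            break
        x = level_search_by_name_id(p, left)
    while True:                                     # phase 3: move downward
        if x.level == 0:
            break
        child = x.first_child
        if child is None:                           # guard for an empty list
            break
        x = level_search_by_name_id(child, dest)
    return x

def level_search_by_name_id(x, name):
    while x.level_next is not None and x.level_next.name_id <= name:
        x = x.level_next
    while x.level_prev is not None and name < x.name_id:
        x = x.level_prev
    return x

def linear_search_for_destination(x, dest):
    while x.name_next is not None and x.name_next.name_id <= dest:
        x = x.name_next
    return x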
3.1.1 Locality. In this section, we show that Search-By-Name-ID has good locality, in the sense that it does not traverse nodes that are far outside the range between the start and destination.

Before doing so, we note that the pseudocode that we presented in Figure 2 is sequential. A distributed implementation would execute various portions of that code at different nodes. As a practical matter, distributed nodes would cache the name IDs of their neighboring nodes. Using this cache, the distributed search algorithm can determine that a node is beyond the destination without accessing it. We will assume that such caching occurs for our discussion of locality and, later, congestion.

Next, recall that the first phase searches from the start node for the closest level 0 node. To see why this first phase is important for locality, suppose that the start node is at a high level and that the destination name is very close to the start node's name. In this case, the start node's LevelNext and FirstChild pointers are both likely to point well beyond the destination node, so traversing them would result in poor search locality. Instead, the first phase finds a node at level 0 (so that the expected distance to its successor in the level list is as small as possible).

We will now analyze the expected locality of Search-By-Name-ID. To begin, note that the first and fourth phases have strict locality, in the sense that they never traverse a node whose name ID is smaller than start or greater than dest. In the second phase, we may follow a parent pointer to a node whose name ID is smaller than start. We want to show that the expected maximum distance from any such node to start is O(D), where D is the distance from start to dest. Similarly, in the third phase, we may follow a FirstChild pointer to a node whose name ID is greater than dest. The analysis of the third phase is symmetric to that of the second.

First, by Theorem 2.6, the expected distance from the leftmost node traversed at level j to start is O(2^j lg n). We can bound the expected maximum distance of all nodes traversed up through level j by the expected sum of these distances. Since these distances increase geometrically, this sum is also O(2^j lg n). Next, we can condition the expected maximum distance on the highest level reached, which will be j for the previous calculation. Define h to be the smallest value such that D ≤ 2^h lg n. Intuitively, we would expect that the highest level reached is h or higher. A short calculation shows that the probability that the highest level reached is h + i is at most 1/2^(i(i−1)/2). Since these probabilities decrease exponentially faster than the conditioned maximum distances increase, the expected maximum distance is O(D).

In comparison, searches in a Skip Graph / SkipNet [1, 6] have strict locality. It is possible to adapt our Search-By-Name-ID algorithm to achieve strict locality. To do this, we would add parent pointers that point right and child pointers that point left. These pointers would allow the search to remain between the start and destination nodes during the second and third phases. (The first and fourth phases already have strict locality.) Due to space constraints, we do not consider this variant any further.
3.2 Search by Numeric ID. The previous subsection considered how to find a node with a given name ID. In this subsection, we show that it is also possible to find the node whose numeric ID is closest to a given number. This can be used to implement consistent hashing and, consequently, a distributed hash table.

The Search-By-Numeric-ID function searches from a start node X to find the node whose numeric ID is closest but less than or equal to the given value. This value can be either a finite or an infinite sequence of bits. If the input is a finite sequence of k bits, the function returns a node in the level k list corresponding to the given bits. In general, this level list will contain more than one node, so we will return the node whose name ID is closest to the name of the start node. Figure 3 contains the pseudocode for this operation. Again, for the sake of simplicity, we have assumed that the target node is to the right of the start node.

Search-By-Numeric-ID(start, bits)
37  name ← start.NameID
38  X ← Find-Closest-At-Level-0(start, name)
39  while true
40    do if X.Level = |bits|
41       then return X
42       if bits[X.Level] = 0
43       then P ← X.Mother
44       else P ← X.Father
45       if P = nil
46       then break
47       X ← Level-Search-By-Name-ID(P, name)
48  X ← Linear-Search-By-Num-ID(X, bits, |bits|)
49  if |bits| < ∞
50  then X ← Level-Search-By-Name-ID(X, name)
51  return X

Figure 3: The Search-By-Numeric-ID function finds the node whose numeric ID is closest to the given value.

Conceptually, this algorithm is the complement of Search-By-Name-ID. Instead of moving to a high level and then back down, this algorithm first searches for a level 0 node and then moves up. The first phase is implemented in the call to Find-Closest-At-Level-0 at line 38. In the second phase, lines 39–47, we start with a level 0 node and in each iteration find a node that matches the given value in one additional bit. Within each level, we find the node whose name is closest to the name of the start node. As with Search-By-Name-ID, the first two phases will find a node whose numeric ID is close but is not guaranteed to be the closest to the given value. (In particular, this could happen because an intermediate level list is empty.) In the third phase, lines 48–50, we perform a linear search in the numeric ID list to find the closest node, which ensures correctness.

It remains to analyze the running time of this algorithm. The last subsection showed that the time required by Find-Closest-At-Level-0 is O(lg n) in expectation and O(lg^2 n) with high probability. The second phase, lines 39–47, is identical to the second phase of Search-By-Name-ID except that the parent choices are given. Thus, the same bound on the running time applies: O(lg n) in expectation, and O(lg^2 n) with high probability. The running time of the last phase depends on the level reached in the second phase. By Theorem 2.2, we will reach at least level k = lg n − 2 lg lg n − 2 lg(c+2) with high probability. This leaves a range of size 1/2^k = (c+2)^2 lg^2 n / n in numeric ID space to be searched. Hence, a loose bound on the running time of the linear search is O(lg^2 n) in expectation and O(lg^3 n) with high probability. By conditioning the expectation on the level reached in the second phase, we can tighten this bound to O(lg n) in expectation and O(lg^2 n) with high probability. We omit the details of this analysis due to space constraints. Thus, Search-By-Numeric-ID requires O(lg n) time in expectation and O(lg^2 n) time with high probability.
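A matching Python port of Figure 3, ours and restricted to finite bit strings (the paper also allows infinite ones). It reuses find_closest_at_level_0 and level_search_by_name_id from the Figure 2 port above. Linear-Search-By-Num-ID is not spelled out in the paper, so the version below is only our guess at its intent.

def search_by_numeric_id(start, bits):
    """Figure 3 in Python; `bits` is a finite tuple of 0/1 values."""
    name = start.name_id
    x = find_closest_at_level_0(start, name)        # phase 1: reach level 0
    while True:                                     # phase 2: one bit per level
        if x.level == len(bits):
            return x
        p = x.mother if bits[x.level] == 0 else x.father
        if p is None:
            break                                   # an intermediate list was empty
        x = level_search_by_name_id(p, name)
    x = linear_search_by_num_id(x, bits)            # phase 3: fix up linearly
    return level_search_by_name_id(x, name)

def linear_search_by_num_id(x, bits):
    # Walk the numeric-ID list to the last node whose numeric-ID prefix is
    # at most the target (our sketch of the routine the paper leaves out).
    target = tuple(bits)
    while (x.num_next is not None and
           x.num_next.num_id.prefix(len(target)) <= target):
        x = x.num_next
    return x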
3.3 Congestion. We define the congestion at node X to be the probability that a search operation with source S and target T, chosen uniformly at random, traverses X. For example, the congestion at the root node of a balanced binary tree is at least (n/2)^2 / n^2 = Ω(1). Congestion of Θ(lg n/n) is optimal when the nodes have constant degree, because Ω(lg n) nodes must be traversed in most search paths. The theorem below shows that we achieve the optimal bound.

The definition of congestion at a node is a probability where the unknown random variables are the numeric IDs and levels of all nodes in the data structure, the random bits used in the search itself, and the choice of S and T. If we imagine exposing the random bits used by every node in the data structure, then we could look at the congestion of the family tree, which is defined to be the maximum congestion at any node. Note that this is a probability on the unexposed random variables. The following theorem shows that the congestion of a family tree does not deviate much from the congestion at an arbitrary node.

Theorem 3.1. The congestion at any particular node in a family tree is O(lg n/n). The congestion of the family tree is O(lg^2 n/n) with high probability.

Proof. We will analyze Search-By-Name-ID. The analysis for Search-By-Numeric-ID is nearly identical. Let X be any node in the family tree. X could be traversed in any of the four phases of the search algorithm. We will consider each in turn and show that, in each, the probability that X is traversed is O(lg n/n) in expectation and O(lg^2 n/n) with high probability.⁴

To be traversed in the first phase, X must lie between S and the closest level 0 node to the right of S. Theorem 2.5 showed that this distance is O(lg n) in expectation and O(lg^2 n) with high probability. Thus, the probability that X lies between a randomly chosen S and its closest level 0 node is O(lg n/n) in expectation and O(lg^2 n/n) with high probability.

For the node X to be traversed in the second phase (upward search), both of the following conditions must hold. First, the X.Level random parent choices must match the corresponding bits of X's numeric ID. This occurs with probability 1/2^X.Level. Second, S must be between X and X.LevelNext.FirstChild. Since all nodes traversed in the upward search are before S, we will not traverse X if S is before it.⁵ If S is after X.LevelNext.FirstChild, then the closest node to S in the previous level is to the right of X.LevelNext, so the parent of that node will be X.LevelNext or further to the right. We can use two applications of Theorem 2.6 to bound the number of nodes between X and X.LevelNext.FirstChild as O(2^X.Level lg n). Thus, the probability that a randomly chosen S is in this set is O(2^X.Level lg n/n) in expectation and O(2^X.Level lg^2 n/n) with high probability. Since these two conditions are independent, we can multiply them to show that the probability that X is traversed in the second phase is O(lg n/n) in expectation and O(lg^2 n/n) with high probability.

The analysis of the third phase (downward search) is symmetric to that of the second, and the analysis of the fourth phase is identical to that of the first. Adding together the congestion due to each phase, we obtain a bound of O(lg n/n) in expectation and O(lg^2 n/n) with high probability.

⁴By "in expectation" and "with high probability", we are referring to the outcome when the structure of the family tree is revealed. The expected value of the probability is identical to the probability when no information is known, which is our definition of congestion.
⁵We ignore the case where S is before X but the closest level 0 node is after X. This is counterbalanced by the analogous case where X.LevelNext.FirstChild is substituted for X, which we do include in the probability.
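Theorem 3.1 lends itself to an empirical sanity check. The harness below is our sketch, not part of the paper: it takes any search routine instrumented to return the list of nodes it traverses (for example, the Figure 2 port with a visited-list added) and estimates the maximum per-node hit rate, which the theorem says should concentrate around O(lg n / n).

import random
from collections import Counter

def estimate_congestion(nodes, traced_search, samples=5000):
    """Monte Carlo estimate of the family tree's congestion.

    `nodes` is the node list of a built tree; `traced_search(s, dest)`
    must return the nodes traversed by a search from s toward dest."""
    hits = Counter()
    for _ in range(samples):
        s, t = random.choice(nodes), random.choice(nodes)
        if t.name_id < s.name_id:          # Figure 2 assumes start <= dest
            s, t = t, s
        for v in traced_search(s, t.name_id):
            hits[id(v)] += 1
    return max(hits.values()) / samples    # max observed per-node hit rate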
4 Update Operations

In this section, we describe the insert and delete operations and analyze their performance. Analogous congestion bounds follow immediately from these definitions.

4.1 Insert. Most of the work required to insert a node is accomplished by calls to the search operations described in Section 3. Finding the predecessor of the new node in the name ID list is a simple matter of calling Search-By-Name-ID. Once this predecessor has been found, linking the node into this doubly linked list requires updating four pointers. Inserting the new node into the numeric ID list is identical, except that the call is instead made to Search-By-Numeric-ID. After this, it remains to set the level list pointers and the inter-level pointers.

In order to choose a level for the new node X, we must first compute the estimate ℓ = ⌈lg n0⌉. To do this, we subtract the numeric IDs of X and its successor (mod 1) and find the first non-zero bit. This will be found before the (c lg n)-th bit with high probability. Next, we choose X.Level uniformly at random from {0, ..., ℓ − 1}.

Once X has a level, we perform a Search-By-Numeric-ID, using just the first X.Level bits of X.NumID, to find the predecessor of X in its level list. Similarly, we can find X.FirstChild by performing a Search-By-Numeric-ID using just the first X.Level − 1 bits of the numeric ID. We then enumerate all the children of X and update their appropriate parent pointers to point to X. We handle X.Mother and X.Father similarly.

Lastly, we turn to the predecessor of X in the numeric ID list. As mentioned in the proof of Corollary 2.1, we must allow this node to re-estimate n and choose a new level. This ensures that the shape of the family tree is independent of history. Note that the predecessor's numeric ID does not change, only its level does, so the re-estimation process does not cascade to other nodes. The procedure for estimating n and choosing a level was described in the previous paragraph. The procedure for removing this node from its old level is described in the next subsection.

It is easy to see that the running time of Insert is dominated by the six calls to search operations. We argued above that computing the estimate of lg n takes O(lg n) time with high probability. The only other work is updating the pointers. There are only nine outbound pointers, and the number of inbound pointers is O(lg n) with high probability by Theorems 2.3 and 2.4. Thus, the total time required is O(lg n) in expectation and O(lg^2 n) with high probability.
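The insertion steps above, as a sketch. splice_after and real_value are ours; set_inter_level_pointers and relevel are hypothetical helpers standing in for the pointer fix-ups and the predecessor's re-leveling described in the text; the search routines and choose_level come from the earlier sketches; and 32 bits stand in for the infinite numeric ID.

def splice_after(pred, node, prev_attr, next_attr):
    """Insert `node` after `pred` in a doubly linked list (four pointers)."""
    succ = getattr(pred, next_attr)
    setattr(node, prev_attr, pred)
    setattr(node, next_attr, succ)
    setattr(pred, next_attr, node)
    if succ is not None:
        setattr(succ, prev_attr, node)

def real_value(num_id, k):
    """View the first k bits of a numeric ID as a real in [0, 1)."""
    return sum(b / 2 ** (i + 1) for i, b in enumerate(num_id.prefix(k)))

def insert(entry_point, new_node, id_bits=32):
    """Steps of Section 4.1; the six search calls dominate the cost."""
    # Name-ID list: one Search-By-Name-ID, then four pointer updates.
    pred = search_by_name_id(entry_point, new_node.name_id)
    splice_after(pred, new_node, "name_prev", "name_next")
    # Numeric-ID list: same, via Search-By-Numeric-ID.
    npred = search_by_numeric_id(entry_point, new_node.num_id.prefix(id_bits))
    splice_after(npred, new_node, "num_prev", "num_next")
    # Level: estimate n from the numeric-ID successor, pick a level.
    me = real_value(new_node.num_id, id_bits)
    succ = real_value(new_node.num_next.num_id, id_bits)
    new_node.level = choose_level(me, succ)
    # Level list: search with just the first `level` bits of the numeric ID.
    lpred = search_by_numeric_id(entry_point,
                                 new_node.num_id.prefix(new_node.level))
    splice_after(lpred, new_node, "level_prev", "level_next")
    set_inter_level_pointers(new_node)   # mother, father, first_child, children
    relevel(new_node.num_prev)           # predecessor re-estimates n (Cor. 2.1)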
4.2 Delete. The algorithm for deleting a node X is straightforward. First, we enumerate the nodes with pointers to X and update them to point to the appropriate predecessor or successor. As argued in the previous section, the number of pointers to X is O(lg n) with high probability. Next, we must allow the predecessor of X in the numeric ID list to re-estimate n and choose a new level. As mentioned earlier, the predecessor's numeric ID does not change, only its level does, so the re-estimation process does not cascade to other nodes. The re-estimation procedure was described in the previous section. The running time of this part, and of the algorithm as a whole, is O(lg n) in expectation and O(lg^2 n) with high probability.

5 Conclusion

We have presented the family tree, a new dictionary data structure suited to implementation in a distributed environment. Each node in a family tree has only nine pointers. A family tree can be searched and updated in expected O(log n) time. Searches and updates both ensure that no node bears more than an O(log n/n) fraction of the traffic in expectation. These results are optimal for any data structure with O(1) pointers per node. Furthermore, searches in a family tree can take advantage of locality in the input, only searching nodes that have names close to the source and target.

Family trees can be used to implement a peer-to-peer overlay network, as well as a distributed hash table. Like Skip Graphs [1] and SkipNet [6], family trees are ordered by keys from an arbitrary domain. Hence they support efficient range queries and have useful locality properties. Like Viceroy [15] and Koorde [9], a family tree's bounds on degree, congestion, and search and update performance are optimal. Family trees are the first data structure to simultaneously support arbitrary key ordering and achieve these optimal bounds.

A family tree could also be used to implement an in-memory dictionary. This could be advantageous in a situation where the dictionary is being accessed by multiple concurrent readers and writers. Our congestion bounds imply that such an implementation would have low lock contention, assuming uniform access patterns. Furthermore, the family tree could take advantage of locality in the searches made by a given client.

In summary, family trees are a new contribution to the theory of data structures. We have shown that they are optimal in congestion, locality, degree, and search and update performance. Furthermore, family trees have immediate practical applications, including the implementation of peer-to-peer overlay networks.

Acknowledgments

We thank John Dunagan for many helpful discussions on this topic and for reviewing drafts of this paper.

References

[1] J. Aspnes and G. Shah. Skip Graphs. In 14th ACM-SIAM Symposium on Discrete Algorithms, Jan. 2003.
[2] B. Awerbuch and C. Scheideler. The Hyperring: A Low-Congestion Deterministic Data Structure for Distributed Environments. In 17th International Symposium on Distributed Computing, Oct. 2003.
[3] P. Bozanis and Y. Manolopoulos. DSL: Accommodating Skip Lists in the SDDS Model. In Workshop on Distributed Data and Structures, June 2000.
[4] L. Cowen. Compact Routing with Minimum Stretch. In Proceedings of the 10th Annual ACM-SIAM Symposium on Discrete Algorithms, Jan. 1999.
[5] C. Gavoille and D. Peleg. Compact and Localized Distributed Data Structures. Technical Report 1261-01, LaBRI, Université Bordeaux I, Aug. 2001.
[6] N. J. A. Harvey, M. B. Jones, S. Saroiu, M. Theimer, and A. Wolman. SkipNet: A Scalable Overlay Network with Practical Locality Properties. In USENIX Symposium on Internet Technologies and Systems, Mar. 2003.
[7] N. J. A. Harvey and J. I. Munro. Deterministic SkipNet. In ACM Symposium on Principles of Distributed Computing, July 2003.
[8] S. Iyer, A. Rowstron, and P. Druschel. Squirrel: A decentralized, peer-to-peer web cache. In Proceedings of the 21st Annual ACM Symposium on Principles of Distributed Computing, July 2002.
[9] M. F. Kaashoek and D. R. Karger. Koorde: A simple degree-optimal distributed hash table. In Proceedings of the 2nd International Workshop on Peer-to-Peer Systems, Feb. 2003.
[10] D. Karger, E. Lehman, F. Leighton, M. Levine, D. Lewin, and R. Panigrahy. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, pages 654-663, May 1997.
[11] B. Kröll and P. Widmayer. Balanced distributed search trees do not exist. In Workshop on Algorithms and Data Structures, 1995.
[12] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, 1992.
[13] W. Litwin, M. A. Neimat, and D. A. Schneider. LH*: Linear hashing for distributed files. In ACM SIGMOD Intl. Conf. on Management of Data, 1993.
[14] W. Litwin, M. A. Neimat, and D. A. Schneider. RP*: A family of order-preserving scalable distributed data structures. In Conf. on Very Large Data Bases, 1994.
[15] D. Malkhi, M. Naor, and D. Ratajczak. Viceroy: A scalable and dynamic emulation of the butterfly. In Proceedings of the 21st Annual ACM Symposium on Principles of Distributed Computing, July 2002.
[16] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A Scalable Content-Addressable Network. In Proc. of ACM SIGCOMM, Aug. 2001.
[17] A. Rowstron, A.-M. Kermarrec, M. Castro, and P. Druschel. Scribe: The design of a large-scale event notification infrastructure. In Third International Workshop on Networked Group Communication, 2001.
[18] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proc. of ACM SIGCOMM, Aug. 2001.
[19] M. Theimer and M. B. Jones. Overlook: Scalable Name Service on an Overlay Network. In Proceedings of the 22nd International Conference on Distributed Computing Systems, July 2002.