The Hyperring: A Low-Congestion Deterministic Data Structure for Distributed Environments
Baruch Awerbuch
Abstract
In this paper we study the problem of designing searchable concurrent data structures with performance guarantees that can be used in a distributed environment where data elements are stored in a dynamically changing set of nodes. Searchable data structures are data structures that provide three basic operations: Insert, Delete, and Search. In addition to searching for an exact match, we demand that for a data structure to be called "searchable", Search also has to be able to search for the closest successor or predecessor of a data item. Such a property has a tremendous advantage over just exact match, because it would allow to implement many data base applications. We are interested in finding a searchable concurrent data structure that has (1) a low degree, (2) requires a small amount of work for Insert and Delete operations, and (3) is able to handle concurrent search requests with low congestion and dilation. We present the first deterministic concurrent data structure, called Hyperring, that can fulfill all of these ob jectives in a polylogarithmic way. In fact, the Hyperring has a degree of O(log n), requires O(log3 n) work for Insert and Delete operations, and can handle concurrent search requests to random destinations, one request per node, with congestion and dilation O(log n) w.h.p. Most of the previous solutions for distributed environments are not searchable (in our sense) but only provide exact lookup, and those that are searchable do not have proofs about the congestion caused by concurrent search requests.

Christian Scheideler
tures with this approach are Chord, CAN, Pastry, and Tapestry [22, 18, 19, 23]. However, the resulting structures do not have the ability to run range queries, fuzzy search, database queries, etc in an efficient way. It takes more work to implement searchable data structures capable of supporting predecessor or successor search. In the sequential domain, these can be implemented via adaptive methods such as 2-3 trees, B-trees, and redblack trees. Implementing adaptive, searchable data structures in a distributed system poses some challenges. A 23 tree, for example, is inappropriate for a distributed environment because of high congestion at the root. What is needed is a distributed, low-congestion analog of 2-3 trees. Many attempts have been made to solve this problem. Work on concurrent variants of search trees has been reported in [10, 13, 16, 21]. However, instead of parallelizing the data structure itself, these approaches only parallelize the way it is accessed, which is not suitable for a dynamic and error-prone distributed environment. Recently, randomized solutions for concurrent and searchable data structures suitable for a distributed environment were suggested, essentially independently, by Li and Plaxton [11], Aspnes and Shah [1] and Harvey et al [8]. The underlying pointer structures are called "hyperdelta networks", "skip graphs" and "skip nets". They constitute a simple and elegant extension of a randomized skip list data structure of Pugh [17] to the distributed environment. In [11, 1, 8], only some basic properties of these data structures such as diameter and expansion were shown, leaving the following issues open: How to keep the congestion low for concurrent search, insert, or delete operations, and how to get rid of randomization in the construction of the data structure? The contribution of this paper is to propose solutions to these problems. More precisely, we present local, deterministic update rules for insert and delete operations and prove upper and lower bounds on their effect on the data structure. The outcome of our investigations is a searchable deterministic data structure, called Hyperring, that can be viewed as a low-congestion version of a 2-3 tree or the deterministric skip list pro-

1 Intro duction A searchable data structure has to provide three basic operations: Insert(d), Delete(name), and Search(name). Insert(d) inserts a data item d with some name into the structure, Delete(name) removes the data item with a given name from the structure, and Search(name) returns the data item representing the closest match (e.g. the closest prefix) to name that has been inserted and has not been deleted (e.g.,[4]). A precise-lookup dictionary is a data structure that looks up exact matches only. It is relatively easy to implement such a data structure via oblivious methods such as hashing. Examples of concurrent data struc Dept. of Computer Science, Johns Hopkins University, Baltimore, USA. Email: baruch@cs.jhu.edu. Supported by NSF grant ANIR-0240551 and NSF grant CCR-0311795.  Dept. of Computer Science, Johns Hopkins University, Baltimore, USA. Email: scheideler@cs.jhu.edu. Supported by NSF grant CCR-0311121 and NSF grant CCR-0311795.

Copyright Š 2004 by the Association for Computing Machinery, Inc. and the Society for industrial and Applied Mathematics. All Rights reserved. Printed in The United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Association for Computing Machinery, 1515 Broadway, New York, NY 10036 and the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688

318


posed by Munro et al [14]. Independently of our work, Harvey and Munro [9] also proposed a deterministic searchable concurrent data structure that is similar to our structure. However, as it turns out from our results, their structure does not even guarantee a polylogarithmic congestion whereas we can show logarithmic congestion bounds. An obvious advantage of a local deterministic over a randomized solution is the ability to local ly self-correct the data structure in order to return the data structure to a low-congestion state after an adversarial disruption or unlucky combination of keys. This property is called self-stabilization by Dijkstra in his 1974 paper [6] and is considered an important property in existing peerto-peer systems [22, 18, 19]. We comment that by definition, pseudo-random constructions cannot be selfcorrecting. However, the work in [22, 18, 19] deals with some systems aspects of self-correction. 1.1 Existing work on concurrent searchable data structures Imagine a collection of data items Data being sorted in a doubly-linked list based on their names. To make searching efficient, we need shortcut pointers. A naive way of doing this is to assign ranks to the items based on some global (e.g. lexicographical) ordering of their names. Each item d  Data keeps pointers to all items d whose ranking is exactly 2i larger than its own for all i. The pointer graph is essentially a hypercube. Because of the expansion properties of the hypercube, it can be shown that, in a data structure with n data items, n concurrent search requests to random items can be executed with congestion and dilation O(log n), w.h.p. However, ranking is a very sensitive function; each data item that joins or leaves changes the ranking, making dynamic operations (Insert, Delete) expensive if a perfect ranking is to be maintained: the update work would be (n). To solve this problem, consider the approach in [11, 1, 8] of interconnecting nodes in a hierarchy of cycles on top of a cycle C containing all data items (or nodes) in sorted order. Each node pads its name with a random bit string. First, we decompose C into two sorted cycles: C0 , which contains all nodes with first bit 0, and C1 , which contains all nodes with first bit 1. We continue this process recursively on C0 and C1 , generating even smaller cycles in the next higher level, namely C00 , C01 , C10 , and C11 , etc. That is, at level i of the recursion, a cycle Cb1 ,b2 ...bi is a doubly-linked cycle of all nodes with the padding sequence b1 , b2 . . . bi . Since the padding sequences are chosen at random, it is not hard to show that the work for inserting or deleting a data item is O(log n) w.h.p., and searching for a data item only requires to traverse O(log n) edges w.h.p.

1.2 The Hyp erring data structure Next, we outline informally the main ideas of our construction. The exact model, the problem and complexity measures are defined in Section 2. One of our main results is: Theorem 1.1. The Hyperring is a deterministic concurrent data structure with the fol lowing features: ˇ the degree of each node in the data structure is at most O(log n) ˇ the work required is O(log n) for a Search operation and O(log3 n) for Insert and Delete ˇ the congestion for n concurrent Search operations forming a permutation, or one operation per node with random destination, is O(log n) w.h.p. (where the probability only depends on the search algorithm) ˇ the node expansion of the data structure is (1/ log n). The degree of a node and the expansion match the bounds known for skip graphs [11, 1, 8]. However, the congestion created by concurrent operations has not been shown to be even polylogarithmic in [11, 1, 8], and our results imply that the only previously known deterministic concurrent data structure suitable for a distributed environment [9] does not even guarantee a polylogarithmic congestion. Recall the bit padding approach for skip graphs. In the Hyperring, we use a deterministic rule for bit padding (see also [9]): there can be at most two consecutive 0's and 1's in each ring. Pictorially, links corresponding to two consecutive 0's or 1's, e.g. 1, 0, 0, 1, look like a bridge (see Figure 1).

bridge 1

bridge 2

Figure 1: An example of a Hyperring. Padding creates 0 and 1
sub-rings. Notice bridge 1 with sub-sequence 1, 0, 0, 1 and bridge 2 with 0, 1, 1, 0. The bridges have a distance of 5 from each other.

In addition to requiring at most two consecutive 0's or 1's, we also require that these sequences, respectively their corresponding bridges, are sufficiently far apart

319


from each other in each ring. We call a Hyperring k separated if the minimal distance between two bridges on a ring is at least k (i.e. there are at least k - 1 nodes between them). The Hyperring in Figure 1 is, for example, 5-separated. It turns out that separation is crucial for handling congestion. 1.3 Main insights In Section 2 we state the formal specifications of a concurrent data structure and state precisely the measures used in Theorem 1.1 and the theorems below. In Section 3, we prove: Theorem 1.2. A Hyperring with a constant separation has, in the worst case, a poor expansion (i.e.  = O(1/n )). This also implies a bad congestion for routing requests. Hence, a Hyperring must have a non-constant separation to be useful for concurrent operations. However, here we face a dilemma. If the separation we require is, for example, log n, then updates to the Hyperring done at some previous time might have used a completely different n than the n used by current updates. It may seem like a disaster, since we may be forced from time to time to revisit old insertions and deletions once the system grows beyond a certain size. Fortunately, we can show in Section 4 that this is not necessary. Theorem 1.3. Logarithmic separation at the time the request was executed is sufficient to guarantee  = (1/ log n).

For simplicity, we assume that every name is only once in the system at any time. We use the standard performance metrics in communication theory to analyze our data structures. All metrics are meant for the worst case, where by "worst case" we mean the worst case over all possible inputs (i.e. selections of names for the data items). ˇ degree: this measures the maximum degree of a node in the data structure, and thus the space necessary to maintain it. ˇ work: this measures the number of times a message has to be sent over a link in a Insert, Delete, or Search operation. We assume that messages can carry at most O(log n) bits. ˇ (node) expansion: this measures the connectivity or fault-tolerance of the data structure. The expansion  of a graph G = (V , E ) is defined as  = minU V , |U ||V |/2 |N (U )|/|U | where N (U ) = {w  V \ U | v  U : (v , w)  E } is the neighbor set of U . ˇ congestion  (G): this measures the (expected value of the) maximum number of Search operations traversing a node in G for the case that every node has a Search request to a random destination.

To give some examples, the best expansion a constant degree graph can have is a constant (and these graphs are known as expanders), whereas the hypercube with  n nodes is known to have an expansion of O(1/ log n). In other words, it is sufficient that a bridge is a The best congestion a constant degree graph with n logarithmic number of nodes away from existing bridges nodes can have is (log n), and there is an easy search at the time when it is created, where the logarithm is strategy for the hypercube with congestion (log n). taken to the current number of nodes, and it is OK if (However, the hypercube is not optimal because it has logarithmic degree.) In contrast, a complete binary tree this is not true any more in the future. has an expansion of O(1/n) and a congestion of (n). The expansion and congestion of a graph are closely 2 Mo del and statement of the problem related to each other. The following model and measures need to be defined in order to formally state Theorem 1.1. Claim 2.1. For every graph G of size n with congestion  (G), (G) = (1/ (G)). 2.1 Searchable concurrent data structure and p erformance metrics We require a (concurrent) Proof. Suppose that there is a set U  V with |U |  searchable data structure C to offer the following op|V |/2 and |N (U )| = o(|U |/ (G)). Then consider the erations experiment of choosing a random search problem with ˇ Insert(d): adds data item d with name (identifier) n requests, one per node. It is easy to see that the Name(d) to C . expected number of requests that have to cross the cut Ż ˇ Delete(name): removes d with Name(d) = name (U, U ) is equal to |U |. Thus, there must be a node in N (U ) that is passed by |U |/|N (U )| =  ( (G)) requests from C . ˇ Search(name) retrieves from C the closest successor on expectation. However, according to the definition of the congestion, every node in G should only be d of name, i.e. passed by O( (G)) requests on expectation, creating a d = argmin{Name(d ) | d  Data, Name(d)  name} contradiction. 3

20


Notice that it is not possible in general to conclude from the expansion of a graph about its congestion, because just knowing that a data structure has a good expansion does not necessarily imply that it allows efficient searching. For the congestion to be low, it is important to select the right paths, which is a nontrivial task in dynamic networks. We note that [12, 15] are the only results in the peer-to-peer literature so far addressing the congestion of concurrent searching (which they only do for DHTbased approaches). 3 Hyp errings of constant separation

So concerning the degree and diameter, k -separated Hyperrings look appealing even for k = 0. Unfortunately, we can also show the following main result. Theorem 3.1. For every k  0, the k -separated Hyperring has, in the worst case, an edge expansion of O(1/n1/(2(3(k+4)) ) ) . Proof. To simplify the calculations, we will ignore rounding effects throughout the proof because they turn out to be insignificant. We will construct a Hyperring that gives a node set of size n/2 with a poor expansion. For this we need the following lemma. Lemma 3.2. Consider some ring R and some set S of consecutive nodes in R with |S |  |R|/2. Then two intertwined rings R and R can be constructed on top of R so that 1 | 1 1 |V ( R )  S | = - S| 2 3(k + 4) and 1 |V ( R | = 2
) a
2

In this section we present a family of dynamic graphs called k -separated Hyperrings. It offers two primitives: ˇ Add(v ): This adds node v to the Hyperring next to the node that v contacted. ˇ Remove(v ): This removes node v from the Hyperring. Detailed proofs of this section can be found in [3]. 3.1 The basic construction The basic idea behind the Hyperring is similar to [1, 8, 9, 11]: it is organized as a hierarchical structure of rings. Suppose it currently contains n nodes. Then the Hyperring consists of approximately log n levels of rings, starting with level 0. Each level i  0 consists of approximately 2i directed cycles of approximately n/2i nodes, which we call rings. All rings have the same orientation. For every ring R at level i, two rings of level i + 1 share its nodes in an intertwined fashion. A ring at level i will also be called an i-ring in the following, and a level i edge will simply be called an i-edge. Consider some i-ring R and let (u, v , w, x) be four consecutive nodes on R. We say that (u, v , w, x) form an i-bridge (or simply a bridge if i is clear from the context) if there is an (i + 1)-edge from u to x and an (i + 1)-edge from v to w. An (i + 1)-edge is called perfect if it bridges exactly two i-edges. It is possible to maintain a Hyperring with at most one bridge in every ring. However, in this case we would create too much update work for Insert or Delete operations. Instead, we only demand that i-bridges are sufficiently far apart from each other. A Hyperring is called k -separated if in every i-ring R the i-bridges on R are at least k nodes apart from each other, which means that there are at least k - 1 nodes between the quadruples of nodes forming a bridge. We start with a few properties of Hyperrings which are easy to prove.

1

1 + 3(k + 4)

| V (R)|

and R

nd R

d

o not violate the k -separation.

Proof. Consider some set S of consecutive nodes on R. If we require the Hyperring to be k -separated, then we can minimize the number of nodes in V (R )  S by using sequences of 1 + k/2 edges where one edge bridges three edges and the k/2 others bridge two edges in R. Hence, R and R can be constructed on top of R without violating the k -separation so that |V (R )  S
|

 = 

|S | ˇ (1 + k/2 ) 3 + 2 ˇ k/2 1 | 1 - S| 2 4( k/2 + 1) + 2 1 | 1 1 - S|. 2 k+4

Lemma 3.1. For every k  0, the k -separated Hyperso that if R covers ring has a maximum degree of at most 2(1 + 2/(k + We now search for an 1 k 1)) log n and a diameter of at most 3 log n. 2 (1 + +4 )|V (R) \ S | no des in |V (R) \ S | then

Now, select a set S  S of |S |/3 consecutive nodes in S . Then it is possible to construct a ring R on top of S without violating the k -separation so that 1 | 1 1 1 ˇ |S \ S | + - S| |V (R )  S | = 2 2 k+4 1 | 1 1 = - S| 2 3(k + 4)

321


1 | |V (R )| = 1 + 3(k1 4) V (R)|. It holds that + 1 ˇ2 1 | 1 k 1 + +4 (|V (R)| - |S |) + 2 - 3(k1 4) S | = 2 + 1 | 1 + 3(k1 4) V (R)| 2 +   3 -1 3 +1 ˇ |V (R)| = |S | 3(k + 4) 3(k + 4) |R| + |S | = 1 3(|R| - |S |)

because we assumed that |S |  |V (R)|/2. Hence, it is possible to obtain the equations stated in the lemma, whicC completes the proof. h

onsider some set S of consecutive nodes on the 0-ring R0 of size n/2. We construct a k -separated Hyperring with a bad expansion for S inductively. Consider some i-ring R. If |V (R)  S |  |V (R)|/2, then we apply Lemma 3.2 to SR = V (R)  S to obtain for all i < k - 1. Furthermore, if si = |SRi |, then 1 i /2 two intertwined rings on top of R. Otherwise, we apply 1 1 Ż Lemma 3.2 to SR = V (R) \ S to obtain two intertwined si+1 = i - si . 2 (3(k + 4))2 rings on top of R. Using this construction, we prove the following For the last subpath pk we distinguish between two lemma. cases. Suppose that at least half of the times in pk , (3.1) resp. (3.3) applies. Then it holds for the final ring Lemma 3.3. For every i-ring R with i  (log n)/(1 + Ż R that for S = SR resp. S = SR , 1/(2(3(k + 4))2 )), either SR = V (R) or SR = . 1 k /2 1 1 Proof. We can view the Hyperring as a tree T = | |S  - ˇ sk (VT , ET ) with its 0-ring representing the root and edges 2k (3(k + 4))2 /2 directed towards the leafs. For every node vR at level 1 1 i representing an i-ring R, its sons represent the i + 1= - ˇ |S | 2 1 (3(k + 4))2 rings on top of R. Consider now an edge (vR , vR ) (i.e. ˇ 1 - 1+ R is on top of R). If |SR |  |V (R)|/2, then it follows 2(3(k+4))2 2 |S | from the inductive construction that 1 |  where is the length of p. Since |S | = n/2 and 1  |SR | = 2 - 3(k1 4) SR | + = (log n)/(1+1/(2(3(k +4))2 )), it follows that |SR | < 1 1 | (3.1) Ż resp. |SR | < 1 and therefore SR =  resp. SR = V (R). and |V (R )| = 1 + 3(k1 4) V (R)|  2 + Otherwise, suppose that at least half of the times in pk , (3.2) resp. (3.4) applies. Then it holds for the final and hence, ring R that 1 |   |SR | = 1 + 3(k1 4) SR | 1 k /2 2 + 1 | 1 1 (3.2) |V (R)|  - ˇ mk 1 and |V (R )| = 2 - 3(k1 4) V (R)|  2k (3(k + 4))2 + /2 1 1 If |SR | > |V (R)|/2, it follows that - ˇn = 1 | 2 1 (3(k + 4))2  ˇ Ż Ż  - 3(k1 4) SR | |SR | = 1 1 2 + - 1+ 2(3(k+4))2 2 n. 1 | (3.3) and |V (R )| = 1 + 3(k1 4) V (R)|  2 + Since = (log n)/(1 + 1/(2(3(k + 4))2 )), it follows that |V (R)|  1, and therefore R does not exist. Combining and hence, 1 | the cases proves the lemma. S  Ż Ż  + 3(k1 4) SR | |SR | = 1 2 + 1 | ince for every ring R with SR =  and SR = V (R), (3.4) 1 1 ) - 3(k+4) V (R)|  and |V (R | = 2 SR can only be connected to two outside nodes via edges

Consider now any path p through T from the root to some node vR at level i = (log n)/(1 + 1/(2(3(k + 4))2 )). We cut p into subpaths p1 , p2 , . . . , pk at those nodes vR ~ ~ with |SR | = |V (R)|/2. Then it must hold for each pi ~ that either |SR |  |V (R )| for every node vR in pi or |SR |  |V (R )| for every node vR in pi . Furthermore, to get back to |SR | = |VR |/2 at the end, half of the edges in pi must be of type (3.1) resp. (3.3), and the other half of the edges in pi must be of type (3.2) resp. (3.4). Hence, if mi is the size of the ring Ri at the beginning of pi and i is the number of edges in pi , then 1 i /2 1 i /2 1 1 1 mi+1 = mi - + 2i 3(k + 4) 3(k + 4) 1 i /2 1 1 = - mi 2i (3(k + 4))2

322


in R, it follows from Lemma 3.3 that S is connected to at most
log n 1+1/(2(3(k+4))2 )

(a)
2i  2 ˇ 2(log n)/(1+1/(2(3(k+4))
2 2

2ˇ

i

))+1

=0

= 4 ˇ n1-1/(2(3(k+4)) outsT e nodes. This proves the theorem. id

)

(b)

heorem 3.1 and Claim 2.1 together imply the following result. Corollary 3.1. For every k  0, the k -separated 2 Hyperring has a congestion of (n1/(2(3(k+4)) ) ).

(c)
Figure 2: The three cases when adding a node. Case (c) reduces to case (b).

Hence, no k -separated Hyperring with k = O((log n)1/2- ) for some constant > 0 can have a bridge polylogarithmic congestion. Hence, to have a good con gestion, we need k = ( log n). However, notice that when k depends on the size of the Hyperring, node insertions and deletions that have been performed in the past might have used a k that significantly differs from the k used by current insertions and deletions. Hence, parts of the Hyperring may be out of date. So the question is whether it is necessary to revisit these parts in Figure 3: Permuting > i-endpoints drags the bridge order to bring the Hyperring up to date. Fortunately, over to obtain, e.g., case (c) in Figure 2. as one of our main results, we show in the next section that this is not necessary. Theorem 3.2. Add preserves the k -separation of the 2 3.2 Adding and removing no des First, we intro- Hyperring and requires O(k log n) work. duce some notation. Let succi (v ) be the successor of v in its i-ring and predi (v ) be the predecessor of v in its Removing a no de We also remove a node u from the i-ring. For every node v on R, its > i-endpoints repre- Hyperring level by level, starting with level 0. In each sent all endpoints of edges in v with level more than i. level, we remove the node by either removing an already Notice that each node has two endpoints in each level. existing bridge in its k + 2-neighborhood or by creating a By "moving" the i-endpoints from u to v , we mean that new bridge. A bridge is removed by first dragging it over we replace the i-edges (predi (u), u) and (u, succi (u)) by (Figure 3) and then applying case (b) or (c) in Figure 4. the i-edges (predi (u), v ) and (v , succi (u)). By "permut- Otherwise, we just apply case (a). Remove terminates ing" the i-endpoints of u and v , we mean that we move once we reach a ring of size in {4, . . . , 7} (rings smaller the i-endpoints of u to v and the i-endpoints of v to than 4 are removed). u. Due to space limitations, we omit the proofs of the Theorem 3.3. Remove preserves the k -separation of theorems in this subsection. See [3] for details. the Hyperring and requires O(k log2 n) work. Adding a no de The basic approach behind Add is that we integrate a node u into the Hyperring level by level, 4 Hyp errings of non-constant separation starting with level 0. In each level i, we integrate the For every ring R, |R| denotes the number of edges (or node by either removing an already existing bridge in nodes) it consists of, and for every edge e, |e| denotes its k + 2-neighborhood or by creating a new bridge. the number of 0-edges bridged by it. For every node v , A bridge is removed by first dragging it over to u by (v) denotes the number of levels it participates in. Consider the case that every node v initiating Add permuting > i-endpoints (see Figure 3). Then case (b) or (c) in Figure 2 is applied. Otherwise, we just or Remove sets kd = 6(d + 3) as separation, where apply case (a). Add terminates once we reach a ring of d = (v) + 2. Since d = O(log n), Theorems 3.2 and size in {4, . . . , 7} (for larger rings, two new subrings are 3.3 imply that Add and Remove operations require a work of at most O(log3 n). (Although this bound may created).

323


n 2i

(a)

+ 1]. Furthermore, it either holds for al l nodes v that (v)  [ log n - 1, log n + 1] or for al l nodes v that (v)  [ log n , log n + 2].

(b)

(c)
Figure 4: The three cases when removing a node. Case (c) reduces to case (b). sound large, we note that with the help of some tricks, the work can be parallelized so that Add and Remove need O(log n) time to complete.) The tricky aspect with using such a separation parameter is that it is not a fixed, global parameter. Different nodes may use a different kd , and the value of kd can vary significantly over time. Nevertheless, we show that the following properties can be guaranteed at any time: Proposition 4.1. 1. the ring distortion is low, i.e. for every i-ring R, |R|  [ 1 ˇ n/2i - 1, 2 ˇ n/2i + 1] and 2 2. the edge distortion is low, i.e. for every i-edge e, | e |  4 ˇ 2i .

Proof. Let R be some i-ring and R be some (i + 1)jR ring on top of R. Let bR = bj be the total number bridges on R. Suppose that b of the bridges assign the inner edge to R and the remaining b bridges assign the outer edge to R . Each time R uses an inner edge, its u offset to traverse R is reduced by one, and each time R ses an outer edge, its offset is increased by one. Hence, it holds that |R | = = |R | + 2 b
-

b

=
.

|R| + b

-

(bR - b )

2

|R | - b R +b 2

Notice that this is always an integer since bR is even if and only if |R| is even, because otherwise R would not be able to go around R just once (which is guaranteed dR by the algorithm). Now, bR = bd and from Invariant 4.1 it follows (for bR  2) that d kd ˇ bR  |R| and d bR  d min{|R|, 2d-i+2 } . kd

Obviously, bR is maximized if using as many bR with d small d as possible, i.e.   j j -i+2  min{|R|, 2 } . bR  max 1,   kj
log |R|+i-2

Hence, For this we need some notation. For every set of consecutive nodes S on an i-ring R and every integer |R |  d  0, bS denotes a class of bridges that have a distance d of at least kd to other bridges in S , where kd = 6(d + 3).  The main invariants we want to prove are: Invariant 4.1. For every i  0 and every set of dS consecutive nodes S on an i-ring R with bd  2 it holds: 1. For every bridge B in bS , no bridge is in the kd d neighborhood of B and 2. bS  d
min{|S |,2d-i+2 } . kd

   S

  j j -i+2  min{|R|, 2 } 1 |R| + max 1,   2 kj log |R|+i-2    j 2j -i+1  1 |R| + max 1,  2 3(j + 3)  log |R|+i-2 | 1 1 2log |R|-1 R| + max , 2 log |R| + i + 1 | 1 | 1 1 ˇ max R| + 1, + R| 2 2(log |R| + i + 1)

|R| + bR 2

We prove this invariant by complete induction on the number of operations: Suppose that after operation q the invariant is still true. Then we can show the following lemma. Lemma 4.1. Suppose that Invariant 4.1 holds. Then,  1n for every i  0 and every i-ring R, |R|  [ e ˇ 2i - 1, eˇ

ince R decomposes into exactly two (i + 1)-rings, it also holds that | 1 | . 1 1 | |R  ˇmin R| - 1, - R| 2 2(log |R| + i + 1) This allows us to prove the following claim.

324


Lemma 4.2. If Invariant 4.1 is true after operation q , or some s < log n (which represents the switching point then Invariant 4.1 remains true after operation q + 1. from the right term of the max (resp. min) expression Proof. Suppose that operation q + 1 is an insertion of to the left. a node u in R. Let the neighbors of u be v and w. Proof. The claim can be shown by complete induction According to Lemma 4.1, it holds for every node v on i. First we show that for all i  s, that (v)  log n - 2. Hence, our topology control 1 . i i 1 scheme will check a neighborhood of kd nodes around 1 n 1 n u with d  (log n - 2) + 2 = log n. If a bridge is (4.1) |R|  - , + 2 log n 2i 2 log n 2i found, it is removed, which certainly does not endanger Invariant 4.1. If no bridge is found in this neighborhood, This is certainly true for i = 0. Given that it is true for then a bridge is created on top of u. If this bridge is some i, it is also true for i + 1, because for any i-ring R, in the kd -neighborhood of some existing bridge in R 1 + i with d > d, then we downgrade this bridge to d. This 1 n i+1 log |R| + i + 1  log - does not endanger Invariant 4.1(1) since we get back to 2 log n 2i neighborhoods of bridges that do not contain any other  log n bridges. It therefore remains to check Invariant 4.1(2), S he d for all i  log n. Consider now some i-ring R with i  s. i.e. for td-i+2 chosen by u it must hold that bd  min{|S |, 2 }/kd . Then it holds for any i + 1-ring R on top of R that  Since on one hand |S |  |R| and |R|  e ˇ n/2i + 1 1 . | |R  2 (|R| - 1), 1 (|R| + 1) and on the other hand d  log n and therefore 2d-i+2  2 4 ˇ n/2i , it holds that min{|S |, 2d-i+2 } = |S |. Hence, we Using this it follows by complete induction that for any only have to show that bS  |S |/kd . This is certainly d i + s-ring R on top of an s-ring R, o 1 1 , 1 . true, because if no bridge is in the neighborhS od of 1 1 1 another bridge, then the bridges counting for bd must | (4.2) |R  i |R| - -i |R| + -i be by at least kd nodes apart, which proves the lemma 2 2 2i 2 for the case that a node is inserted. Combining (4.1) with (4.2) yields the claim. I It remains to consider the event that some node u has to be deleted in R. The proof for this follows along t is well-known that for all s < log n, the same lines as for an insertion event and is therefore 1 s 1 s 1 1 - 1 /2 1 /2 omitted here. F - >e and + <e 2 log n 2 log n inally, using the invariant, we can prove a bound Hence, for every i-ring R with i  1, on the distortion of edges. 1 . n n |R|   ˇ i - 1, e ˇ i + 1 Lemma 4.3. For every i,he number of 0-edges bridged t 2 e2 by an i-edge is at most 2 e ˇ 2i . According to our rules for Add and Remove, the second highest ring R must have a size in [8, 15]. If there are Proof. Consider some fixed i-edge e. For some fixed two second highest rings R1 and R2 at different places of j  {0, . . . , i - 1}, let S be the sequence of (j + 1)-level the Hyperring where R2 is more than two levels higher edges bridged by e and S be the sequence of j -level than R1 , then there must be an i with edges bridged by S . Then it holds that  n 1n d  ˇ i - 1  15 and e ˇ i+3 + 1  8 . 2 e2 |S |  2|S | + bS . d   i i This means that 2  n/(16 e) and 2  e ˇ n/56, which is not possible. Hence, the highest rings can From Invariant 4.1 and the proof of Lemma 4.1 it follows be by at most two levels apart. Thus, the number of that levels in which the nodes participate must be either all . 1 d |S | in [ log n - 1, log n + 1] or all in [ log n , log n + 2]. bS  max , d 2(log |S | + j + 1) W

Claim 4.1. For every i-ring R with i  1, f 1 1 s s 1 n 1 n |R|  - - 1, + +1 2 log n 2i 2 log n 2i

ith the help of Lemma 4.1 we can prove that the invariant also holds after operation q + 1.

325


Hence, = |S |  T

1 |S | 2|S | + max , 2(log |S | + j + 1) 2 1 1 max |S | + 1, 2 + 2(log |S | + j + 1)

| S|

hus, using complete induction similar to the proof of Claim 4.1 it follows that for every i-level edge e, the amount of 0-edges that are bridged by e is at most 1 2ˇ2 4 .1 Randomized Hyp errings At the end of this section, we also address the issue of randomizing Hyperring updates. Suppose that we modify Add for the case that there is no bridge in the k + 2-neighborhood of u in some ring R in the following way: u chooses a random number in {0, 1}. If it is 0, then u includes itself in the ring of its left neighbor, and otherwise u includes itself in the ring of its right neighbor on top of R. If each node now chooses kd = 15(log d + 2), then the same invariant as in the deterministic case can be used to show that Proposition 4.1 also holds, with high probability, in the randomized case. One may ask whether for random updates kd can be further reduced. However, kd has to be (log d) to avoid a more than constant factor ring and edge distortion.
i

Search(name): A request P is sent to name with the help of the fact that the names of the nodes are in a sorted order on the 0-ring: Suppose that v is currently the node storing P . As long as name  [Name(v ), Name(succ0 (v ))), forward P to the node w with maximum i so that w = succi (v ) and Name(w)  name (if there is no such w, choose the w among v 's successors with maximum name). If this rule stops at a node w with Name(w) < name, then go to succ0 (w) and stop.

1 + 2i

i

2 eˇ2 . Lemma 5.2. The congestion caused by n Search requests with random destinations is O(log n), w.h.p.


Figure 5: The Search algorithm.
i

Proof. Fact 5.1 implies that every i-ring R can only receive requests from rings on top of it. Thus, it can only receive requests from its own nodes. Consider now an arbitrary node v in R. It is easy to check that only those requests will be sent to v whose destination is bridged by the i-edge e leaving v in R. From Proposition 4.1 we know that e bridges at most 4 ˇ 2i nodes and that R consists of at most 3 ˇ n/2i nodes. Since every node is the starting point of one request and every request has a random destination, the expected number of requests that want to reach v in R is at most (4 ˇ 2i /n) ˇ (3 ˇ n/2i ) = 12. Combining this with the fact that every request only uses at most 2 edges in R (see Fact 5.1), the expected number of requests that traverse v in R is at most 24. Because every node participates in at most log n + O(1) levels, the overall expected number of search requests 5 The Hyp erring data structure In this section we show how to use the Hyperring as a passing through v is O(log n). Using the fact that every data structure. We assume that every node represents request picks a random destination independently from a data item and that data items are sorted on the 0-ring other requests, it follows from the Chernoff bounds (e.g., [7]) that the congestion caused by Search is also O(log n) according to their names. withT igh probability. h his result can be extended to routing requests according to arbitrary permutations  : V  V on the set of nodes V . The problem here is that we cannot just use Valiant's trick of first routing the packets to random Fact 5.1. Any Search request moves along a sequence intermediate destinations before routing them to their of edges of non-increasing level and uses at most two correct destinations, because in an order-based network (where the nodes do not necessarily have uniformly edges in each level. distributed names as in DHT-based approaches) it is not Combining these facts with Proposition 4.1, we obvious how to select a random destination. However, the following trick works instead: immediately obtain the following lemma. A request P for name is sent to a random ring in each level, starting with level 0. In level 0, the source Lemma 5.1. Every Search request uses a path of length node s of P decides with probability 1/2 whether to at most O(log n). keep P or to forward it to its closest predecessor in level Furthermore, we get the following result. 0 that belongs to the 1-ring s does not belong to. Once 5.1 The Search op eration Consider the Search algorithm in Figure 5. Since Search prefers edges of higher level and every i + 1-edge bridges at most 3 i-edges for every i, we obtain the following facts.

326


P has reached the node it is supposed to reach in level i - 1, it enters level i. In level i, the node v currently storing P decides with probability 1/2 whether to keep P or to forward it to its closest predecessor in level i that belongs to the (i + 1)-ring v does not belong to. Once P reaches the highest possible level, it chooses a node in the ring belonging to this level uniformly at random as its intermediate destination. Arguments based on the structural properties of the Hyperring suffice to show that this results in a near-uniform, random request distribution and creates a congestion of O(log n), w.h.p. 5.2 Insert and Delete op erations The Insert(Name) and Delete(Name) operations are simply a combination of Search(Name) and Add resp. Remove. Hence, they achieve the following result. Theorem 5.1. The work for an Insert and Delete operation is O(log3 n). 6 Conclusions In this paper, we presented the first deterministic data structure suitable for distributed environments that can route Search requests with low congestion. The ma jor drawback of this data structure is that Insert and Delete operations need O(log3 n) work. It would be interesting to come up with deterministic data structures that only need O(log2 n) or even O(log n) work for this. Techniques similar to transforming a hypercube into a butterfly may help. References
[1] J. Aspnes and G. Shah. Skip graphs. In Proc. of 14th ACM/SIAM Symposium on Discrete Algorithms (SODA), pages 384­393, 2003. [2] Y. Aumann and M. A. Bender. Fault tolerant data structures. In Proc. of 37th IEEE Symposium on Foundations of Computer Science (FOCS), pages 580­ 589, 1996. [3] B. Awerbuch and C. Scheideler. The Hyperring: A lowcongestion deterministic data structure for distributed environments. Technical report, Johns Hopkins University, 2003. See http://www.cs.jhu.edu/scheideler. [4] P. Beame and F. Fich. Optimal bounds for the predecessor problem. In In Proc. of the 31st ACM Symposium on Theory of Computing (STOC), pages 295­304, 1999. [5] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press/McGraw-Hill, 1990. [6] E. W. Dijkstra. Self stabilization in spite of distributed control. Communications of the ACM, 17:643­644, 1974.

¨ [7] T. Hagerup and C. Rub. A guided tour of Chernoff bounds. Information Processing Letters, 33:305­308, 1989/90. [8] N. J. Harvey, M. B. Jones, S. Saroiu, M. Theimer, and A. Wolman. Skipnet: A scalable overlay network with practical locality properties. In USITS 2003. [9] N. J. Harvey and I. Munro. Brief Announcement: Deterministic SkipNet. To appear in Proc. of the ACM/SIAM Symp. on Principles of Distributed Computing (PODC), 2003. [10] P. Krishna. Highly Scalable Data Balanced Distributed Search Structures. Ph.D. thesis, University of Florida, 1995. [11] X. Li and C. Plaxton. On name resolution in peerto-peer networks. In Proc. of the 2nd Workshop on Principles of Mobile Computing, pages 82­89, 2002. [12] D. Malkhi, M. Naor, and D. Rata jczak. Viceroy: A scalable and dynamic emulation of the butterfly. In Proc. of the 21st ACM Symposium on Principles of Distributed Computing (PODC), 2002. [13] X. Messeguer. Skip trees, an alternative data structure to skip lists in a concurrent approach. Informatique Theorique et Applications 31(3):251­269, 1997. [14] I. Munro, T. Papadakis, and R. Sedgewick. Deterministic skip lists. In Proc. of the ACM/SIAM Symposium on Discrete Mathematics (SODA), pages 367­ 375, 1992. [15] M. Naor and U. Wieder. Novel architectures for p2p applications: the continuous-discrete approach. In SPAA Proceedings, 2003. [16] W. Paul, U. Vishkin, and H. Wagener. Parallel computation on 2-3 trees. RAIRO ­ Theoretical Informatics 17(4):397­404, 1983. [17] W. Pugh. Skip lists: A probabilistic alternative to balanced trees. In Workshop on Algorithms and Data Structures, pages 437­449, 1989. [18] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In SIGCOMM 2001. [19] A. Rowstron and P. Druschel. Pastry: Scalable, distributed ob ject location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Midd leware), 2001. [20] C. Scheideler. Universal Routing Strategies for Interconnection Networks. Lecture Notes in Computer Science 1390. Spinger, 1998. [21] D. Shasha and N. Goodman. Concurrent search structure algorithms. TODS 13(1):53­90, 1988. [22] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM 2001. [23] B. Y. Zhao, J. Kubiatowicz, and A. Joseph. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. UCB Technical Report UCB/CSD-011141, 2001.

327