Caching Queues in Memory Buffers

Rajeev Motwani*        Dilys Thomas†

* Department of Computer Science, Stanford University, Stanford, CA 94305. Email: rajeev@cs.stanford.edu. Supported by NSF Grant IIS-0118173, NSF ITR Award Number 0331640, an SNRC Grant, and grants from Microsoft and Veritas.
† Department of Computer Science, Stanford University, Stanford, CA 94305. Email: dilys@cs.stanford.edu. Supported by NSF Grant EIA-0137761 and NSF ITR Award Number 0331640.

Abstract

Motivated by the need for maintaining multiple, large queues of data in modern high-performance systems, we study the problem of caching queues in memory under the following simple, but widely applicable, model. At each clock-tick, any number of data items may enter the various queues, while data items are consumed from the heads of the queues. Since the number of unconsumed items may exceed the memory buffer size, some items in the queues need to be spilled to secondary storage and later moved back into memory for consumption. We provide online queue-caching algorithms under a number of interesting cost models.

1 Introduction

Building infrastructure and middleware for high-performance systems dealing with real-time and streaming data is an important problem in a number of disparate settings such as data stream systems [1, 7, 8, 9, 11], networking [15, 20, 21], and distributed transaction processing [14]. Queues are used by these systems to buffer incoming data items, to transfer intermediate data across internal modules, and to temporarily buffer the output produced. While the size of the queues may entail the use of large but slow secondary storage, performance considerations motivate the use of much faster but comparatively smaller memory caches. "Queue spilling" in memory hierarchies is an important problem in all the above-mentioned data-flow management systems, including data stream systems, packet processing in network routers and switches, and distributed messaging and transaction-processing systems.

Irregularity and burstiness in the arrival rates of data items for the above systems have been well studied and documented [19, 16, 17, 23]. Due to bursts in the arrival of data, the number of unprocessed data items at any instant may well exceed the size of the available memory buffer. Over-provisioning memory for periods of high load is not economically sensible: most often a small memory buffer suffices, and for very short bursts of time the number of unconsumed items in the queues may exceed the average number of unconsumed items by over an order of magnitude. Some systems resort to load-shedding [2, 12, 5, 22], i.e., dropping excess data items in times of high load. However, a large class of applications, e.g., transaction processing [14] and real-time stock transactions, have integrity requirements which do not allow load-shedding. We take the position that it is preferable to sacrifice performance by spilling data items to disk when the memory buffer overflows, rather than sacrifice data integrity for the sake of performance. The problem then amounts to designing queue-spilling strategies for maximum performance.

In the context of maintaining packet queues in network routers/switches, recent work [15, 20] has shown that a small amount of fast, expensive SRAM, together with slower, inexpensive DRAM and queue-spilling algorithms, provides substantial benefits over architectures using a large amount of SRAM alone. IBM MQSeries [14], a distributed messaging system which has to be resilient to bursty traffic, employs heuristics for the queue-spilling problem. Unfortunately, current data stream systems do not use secondary storage (disk) efficiently: an experimental study in the context of the Gigascope system [10] demonstrated a sudden degradation in performance and throughput when the system started spilling to disk storage.

We formulate a model for the problem of spilling data queues. Our model and algorithms are applicable to a wide variety of data stream systems. We provide competitive online algorithms for the efficient use of secondary storage to maintain queues in memory buffers.

1.1 Framework

In our model, at any clock-tick an arbitrary number of data items (called "tuples") may enter their respective queues, but only a single tuple is consumed from the head of one of the FIFO (first-in-first-out) queues. (Our algorithms work for an arbitrary number of tuples being consumed at every clock-tick; we make this assumption only for ease of exposition.) We assume that the tuples have uniform size and that n different queues (corresponding to different data streams) are maintained in a memory buffer of a fixed size, say M tuples. Since the total number of tuples in the queues may exceed the size of the memory buffer, some of the tuples in the queues may need to be spilled to secondary storage, which is assumed to be unbounded. We require that the tuple for consumption at the current clock-tick be read back to memory if it is presently on secondary storage, since tuples can only be consumed from memory. The queue-spilling algorithm must decide in an online manner which (and how many) tuples to write or read, as new tuples arrive and old tuples are consumed.

The following aspects of our model and algorithms deserve to be highlighted:

· Acyclicity: A common problem in using a memory hierarchy is thrashing, where, due to some pathological cases, data is repeatedly moved up and down the hierarchy. This, we suspect, is one of the main reasons for the sudden decrease in performance upon using disk storage in Gigascope [10]. It should be intuitively clear that thrashing is aggravated in stream systems, since for FIFO queues the disk blocks that are to be consumed earliest are precisely the ones that are the oldest in the system; consequently, conventional paging algorithms such as LRU will always write exactly the wrong set of blocks onto disk. Formally speaking, we say that an algorithm thrashes if it writes out blocks to slower storage, reads them back into faster storage for consumption, and then writes them back to disk once again before they are actually consumed. We desire algorithms that do not thrash, formally defined as those which have a property we call acyclicity: the algorithm must ensure that it moves each tuple to disk (and back) at most once.
· Cost Model: The cost of an algorithm is defined to be the sum of the costs it incurs for its reads and writes on disk. To a first approximation, disk access times can be decomposed into a seek time (typically 5-10 ms) to position the disk head, and a transfer time to actually read/write the data. Current disk transfer speeds are fairly high; e.g., SCSI disks support 10-160 MBps. For moderate-size data transfers, it is usually accepted that a good way to account for disk I/O cost is to model each read/write to disk as having unit cost, irrespective of the amount read or written. Therefore, we define the unit-cost model, where each read/write of contiguous tuples in the queue is assumed to have the same cost, regardless of the number of tuples being transferred. Note that in the case of multiple queues, for data-locality purposes on disk, we assume that each read/write on disk involves a contiguous block of tuples from a single queue; non-contiguous blocks, or tuples from different queues, require multiple disk accesses at a higher cost. For large data transfers, the cost of writing t blocks is better modeled as c0 + c1 × t, where c0 and c1 are constants with c0 ≫ c1. We refer to this model as the extended cost model.

· Queue Updates: Unless explicitly stated otherwise, we assume that at any clock-tick, at most M/2n new tuples enter the system. In other words, we assume that the input arrival is slow enough that there is at least time to write the excess tuples to disk. This also ensures that tuples entering the system are first placed in memory, which is required in many settings.

· Queue Depletion: For every queue in the system, the order in which tuples are removed from the queue is determined completely by their arrival order (FIFO). In the case of multiple queues, there is a degree of freedom in choosing the queue for consumption. We consider two different queue-depletion scenarios: adversarial and round-robin. In the adversarial scenario, we make no assumption about the order or rate at which the scheduler depletes the different queues. In the round-robin scenario, the scheduler consumes the heads of the queues in a round-robin fashion.

· Competitiveness: We desire online algorithms that incur cost close to that of any offline or clairvoyant algorithm which knows the future arrival pattern of tuples. The competitive ratio of an online algorithm is defined to be the maximum, over all arrival patterns, of the ratio of its cost to that of an optimal offline algorithm [6]. We provide competitive online algorithms for both cost models discussed above.

1.2 Summary of Results

We now summarize our results and present a road-map of the rest of the paper. In Section 2, for a single queue (n = 1) in the unit-cost model, we provide a 2-competitive online algorithm and establish a matching lower bound. Then, in Section 3, we consider the extended cost model, where the cost of a disk access depends on the number of blocks accessed; we present a 4-competitive algorithm which can also be extended to the case of multiple queues. For the n-queue problem, in Section 4, if the scheduler consumes the heads of the different queues in a round-robin fashion, we give a 2n-competitive online acyclic algorithm and an Ω(√n) lower bound. On the other hand, in the case of adversarial schedulers, we show that it is not possible to have an acyclic online algorithm which is o(M)-competitive, where the memory size M is much larger than the number of queues n. However, if the online algorithm has an extra M/2 memory (vis-a-vis the offline algorithm), we provide a 2n-competitive algorithm even for adversarial schedulers. We end the paper in Section 5 with some concluding remarks.
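Before turning to the algorithms, the two cost models above can be stated compactly in code. The following Python sketch is ours, not the paper's; the constants in extended_cost are purely illustrative.

```python
def unit_cost(t: int) -> int:
    """Unit-cost model: a read/write of any number of contiguous tuples
    from a single queue costs one unit, regardless of t."""
    return 1

def extended_cost(t: int, c0: float = 5e-3, c1: float = 5e-9) -> float:
    """Extended cost model: transferring t tuples costs c0 + c1 * t, with
    c0 >> c1 (seek time dominates per-tuple transfer). These constant
    values are illustrative only, not taken from the paper."""
    return c0 + c1 * t
```

The ratio T = c0/c1 reappears in Section 3 as the natural granularity for disk transfers.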
2 Single Queue and Unit-Cost Model

In this section, we consider the problem of maintaining a single queue in a memory buffer of size M, under the unit-cost model. We present Algorithm Half for this version of the problem; this algorithm will form the basis for the later algorithms maintaining multiple queues. Recall that each read/write transfers some set of contiguous tuples and has unit cost; the cost of the algorithm is the total number of such reads and writes.

Algorithm Half keeps the two active ends of the queue buffered in memory. At any point during its execution, Half divides the set of unconsumed tuples into three parts: tail, spilled, and head. The head portion of the queue contains the oldest tuples, while the tail portion contains the most recent tuples. The tail and head portions are in memory, while the spilled portion resides on disk. Throughout the algorithm, the oldest tuple in head is the one that is consumed at the next clock-tick. Initially, all unconsumed tuples are in the head portion, and both the tail and spilled portions are empty. As new tuples arrive, they are appended to the head portion. The first write is forced when the memory buffer of size M is full; at this time, Half writes out the M/2 newest tuples from head to spilled, so that the sizes of head and spilled are both exactly M/2. Any new incoming tuples now become part of the tail portion. In general, if spilled is empty, incoming tuples are placed into head, whereas if spilled is nonempty, incoming tuples are placed into tail.

Algorithm Half schedules writes/reads so as to maintain the following invariants.

Invariant 1: The tuples in head are always older than the tuples in spilled, which in turn are always older than the tuples in tail.

Invariant 2: Whenever spilled is empty, tail is empty too.

Invariant 3: When spilled is nonempty, the number of head tuples and the number of tail tuples are each at most M/2.

The actions of the algorithm to maintain these invariants can be summarized as follows:

· [Write-Out]: If the number of tail tuples reaches M/2 while spilled is nonempty, the algorithm writes out all M/2 tuples from tail to spilled.

· [Read-In]: If head becomes empty before tail reaches M/2, the algorithm reads in the M/2 oldest tuples from spilled to head.

· [Transfer]: If, after a read-in, spilled becomes empty, the tuples in tail are moved to head. The resulting configuration is exactly the one we had at the beginning of the algorithm, before any reads/writes were made and tail and spilled were empty.

It should be clear that all reads/writes involve exactly M/2 tuples, implying that the size of spilled is always a multiple of M/2. Note that since the tail portion is at most M/2 in size, there will always be sufficient space for the tuples being read in. From Invariant 1, it can be seen that Algorithm Half implements FIFO semantics. Half also stays within the desired memory bound and, in fact, maintains the invariants enumerated above. We now focus on the performance analysis of the algorithm, starting with a few simple observations (see Footnote 2).

Footnote 2: An alternate version of Algorithm Half writes not when the tail portion reaches M/2 in size, but rather waits until the head and tail tuples fill up memory, and then writes all the tail tuples. In this version, whenever spilled is nonempty, head is at most M/2 in size as before, but tail may exceed M/2 in size; the sum of the sizes of the tail and head portions is at most M, as before. Since the number of head tuples is at most M/2 when spilled is nonempty, each write involves at least M/2 tuples, but can also involve more. For this modified algorithm, the reads need not be exactly M/2 in size: it has to read the M/2 tuples to be consumed next from spilled, or all of the tuples on disk, whichever is smaller, because the last read must pick up the residual tuples when the amount written is not an integral multiple of M/2. For this modified version of Half, a read may not be directly feasible, since the tuples read from disk may not fit in memory along with the tail tuples, which may exceed M/2 in size; the modified algorithm then needs to write the tail tuples onto disk, adding them to spilled, before actually performing the desired read. Both variants of the algorithm are 2-competitive, but the modified version will have fewer writes than Half. We only present the analysis of the simpler version of Half, leaving the analysis of the other version to the full version of the paper.

Lemma 2.1. In Algorithm Half, all writes/reads are exactly M/2 in size, and when a write is performed the size of the queue is more than M.

Lemma 2.2. Algorithm Half is acyclic.

Proof. The M/2 oldest tuples, which contain all the read-in tuples, are never written out.

Theorem 2.1. Algorithm Half is 2-competitive.

Proof. Let us number the tuples in the order in which they arrive into the queue. Suppose Half performs writes upon the arrival of tuples numbered w1, w2, w3, ..., wi. Upon the arrival of each such tuple wj, the M/2 newest tuples in memory are written out. Since the algorithm is acyclic, no two writes have a common tuple; therefore the number of tuples between wj and wj+2 is at least M. Also, at each wj, by Lemma 2.1, there are at least M + 1 unconsumed tuples, including the set of M tuples just preceding wj, i.e., wj - M + 1, wj - M + 2, ..., wj - 1, wj. As the oldest unconsumed tuple, which is not amongst the M tuples preceding wj, must be in memory for consumption at the current clock-tick, at least one of the M tuples just preceding wj must be written to disk by any algorithm, including an offline one. Since the windows of M tuples preceding each of the odd-numbered tuples w1, w3, w5, ... are disjoint, they cost any offline algorithm distinct writes, thus establishing the 2-competitiveness of Half with respect to writes. The 2-competitiveness of reads can be established by a similar argument. Note that if an algorithm attempts to read tuples from the windows of M tuples preceding both wj and wj+2 in a common read, then it will be forced to perform another write of tuples among the M tuples preceding wj+2 after the common read, and will then have to read them in again, making the situation worse.
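The paper specifies Half only in prose, so a short sketch may help pin down the bookkeeping. The following Python is our reconstruction, not the authors' code: the class name, the deques for the in-memory portions, and the deque of M/2-sized chunks standing in for disk are all illustrative choices.

```python
from collections import deque

class Half:
    """Sketch (ours) of Algorithm Half, single queue, unit-cost model:
    'head' and 'tail' are the in-memory ends of the queue; 'spilled'
    stands in for disk and holds chunks of exactly M/2 tuples."""

    def __init__(self, M):
        assert M > 0 and M % 2 == 0
        self.M = M
        self.head = deque()     # oldest tuples, consumed from the left
        self.tail = deque()     # newest tuples; used only while spilled is nonempty
        self.spilled = deque()  # disk: chunks of M/2 tuples, oldest chunk first
        self.cost = 0           # unit-cost model: one unit per disk transfer

    def arrive(self, tuples):
        for t in tuples:
            # Invariant 2: while spilled is empty, arrivals go to head.
            (self.head if not self.spilled else self.tail).append(t)
            if not self.spilled and len(self.head) == self.M:
                # First write: memory is full, spill the M/2 newest head tuples.
                self.spilled.append([self.head.pop() for _ in range(self.M // 2)][::-1])
                self.cost += 1
            elif self.spilled and len(self.tail) == self.M // 2:
                # [Write-Out]: tail reached M/2 while spilled is nonempty.
                self.spilled.append(list(self.tail)); self.tail.clear()
                self.cost += 1

    def consume(self):
        if not self.head and self.spilled:
            # [Read-In]: bring back the M/2 oldest spilled tuples.
            self.head.extend(self.spilled.popleft()); self.cost += 1
            if not self.spilled:
                # [Transfer]: disk is empty again, so fold tail back into head.
                self.head.extend(self.tail); self.tail.clear()
        return self.head.popleft() if self.head else None
```

In this sketch every disk transfer moves exactly M/2 tuples, matching Lemma 2.1, and the M/2 oldest unconsumed tuples are never spilled, which is the acyclicity property of Lemma 2.2.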
We now establish the optimality of Algorithm Half by showing that no online algorithm can achieve a competitive ratio better than 2.

Theorem 2.2. There does not exist any online algorithm with competitive ratio smaller than 2.

Proof. Assume, for contradiction, that we have an online algorithm A with competitive ratio smaller than 2. Consider an arrival pattern where, when the first write is required, at that instant (say time τ) the queue exceeds the memory buffer size by just one tuple, i.e., there are M + 1 unconsumed tuples. We may assume that the online algorithm does not perform a write before time τ, since otherwise it cannot be competitive. (Note that the optimal offline algorithm performs no writes before τ; we could wait for the queue to flush out without any new arrivals, and repeat the same pattern.) The proof idea is to first establish that, within the first couple of writes, algorithm A must write out a number of tuples p which is at least 2x + 1, even though the size of the queue exceeds the memory buffer size by only x; otherwise A's competitive ratio cannot be less than 2. Then we show that, having written excessively to disk, algorithm A can be forced into a situation where it performs much worse than an optimal offline algorithm, and hence is not better than 2-competitive.

If, at time τ, the online algorithm writes out at least 3 tuples, we are done, since we have established our first goal with x = 1. Suppose, however, that the online algorithm writes at most 2 tuples at time τ. In the next instant, at time τ + 1, suppose 3 new tuples arrive into the queue. Given that the algorithm made a smaller write earlier, it is forced to write some tuples at this time. If the online algorithm now writes out at least 7 tuples, again we are done with our first goal. Otherwise, we inject 7 new tuples into the queue at time τ + 2, which is just enough to force a third write. At this point we stop giving input. The online algorithm has now made 3 writes and will need at least 1 read. In contrast, an optimal offline algorithm for this arrival pattern would have written more tuples at time τ, and would not need any further writes, although it would need to perform a single read later. Since we could wait till the queue is flushed out and then repeat the entire arrival pattern, it follows that the online algorithm is not better than 2-competitive, a contradiction.

We still have to consider the two scenarios: (1) at time τ, the online algorithm wrote at least 3 tuples onto the disk; and (2) at time τ the online algorithm wrote at most 2 tuples, but upon the arrival of 3 new tuples at time τ + 1, it wrote at least 7 tuples onto the disk. We extend the arrival pattern in both cases such that, in the first scenario, an optimal offline algorithm would have written exactly 1 tuple at time τ, and, in the second scenario, an optimal offline algorithm would have written exactly 3 tuples at time τ. In either case, we have arranged a situation where the online algorithm writes out p tuples but an optimal offline algorithm writes out only (p - 1)/2 tuples. Note that in the second scenario the online algorithm has performed 2 writes, while the optimal offline algorithm performs only 1 write in either case. In what follows, we establish the negative result for the first scenario only; a similar argument works for the second scenario, even if we do not charge the online algorithm for the second write it has already performed.

Suppose the arrivals stop at this point and only queue depletion carries on at each clock-tick. As the online algorithm has written out more, it will have to perform a read before one is required by the offline algorithm. If the online algorithm reads in the p tuples in 3 or more read operations, its competitiveness is no better than 2.
Therefore, it must use at most 2 reads, and in one of them read in at least p/2 tuples. Just after this large read, the online algorithm has more tuples in memory than the offline algorithm, regardless of when the offline algorithm performs its read. At this time, inject enough new tuples to just exceed the memory buffer of the online algorithm. The online algorithm is forced to write, but the offline algorithm does not need to do so. The online algorithm must perform a total of 2 reads and 2 writes, as opposed to the single read and single write required by the offline algorithm. After waiting for the queue to empty out, we can repeat the same sequence, and keep doing so indefinitely. Thus algorithm A cannot be better than 2-competitive.

Note that the argument for the lower bound of 2 does not need a large number of tuples to enter the queue at any instant: it uses arrival patterns where only O(1) tuples enter the queue at any clock-tick.

3 Single Queue and Extended Cost Model

For large memory buffers, the cost of reading/writing t disk blocks is better modeled as c0 + c1 × t, with c1 ≪ c0. Under this extended cost model, we provide a 4-competitive algorithm for maintaining a single queue. To understand the solution for this cost model, consider the natural greedy algorithm, which attempts to minimize the number of tuples written to disk: whenever the memory buffer overflows upon the arrival of new tuples, write the minimum number of tuples to disk needed to correct the overflow; furthermore, for consuming tuples that have been written onto disk, read them in, a single tuple at a time, just when they have to be consumed. Each read can cause a write of at most one tuple. The following theorem is easily verified.

Theorem 3.1. The greedy algorithm is optimal from the perspective of minimizing the number of tuples written to disk.

Let T = c0/c1 and assume that T < M/2. Let Algorithm GreedyChunk be the variant of the greedy algorithm which operates on blocks of T tuples, i.e., always reads/writes the minimum possible number of blocks of T tuples.

Theorem 3.2. Algorithm GreedyChunk is acyclic and 4-competitive, provided T = c0/c1 < M/2.

Proof. Since T < M/2, the oldest T tuples are never written from memory, and upon overflow there are at least T new tuples that can be written; GreedyChunk is hence acyclic. At any instant of a queue-caching algorithm, let D denote the sets of contiguous tuples on disk and M denote the sets of contiguous tuples in memory. Let us define the state of the algorithm at that instant to be given by S = (D, M). Note that for all algorithms with the same memory buffer size and the same tuple arrival history, D ∪ M is the same at any given instant: it is the set of unconsumed tuples. Given two algorithms, A1 and A2, in states S1 and S2 respectively, we say S1 = (D1, M1) subsumes S2 = (D2, M2) if there is an injection from D1 to D2 such that the preimage of contiguous tuples consists of contiguous tuples, and every tuple is mapped to an older tuple. It can easily be shown that if S1 subsumes S2 and A2 pays cost c for its future arrival pattern, then there exists an algorithm that starts at state S1 and pays cost at most c for the same future arrival pattern.
We can relax the subsumption property for an acyclic algorithm A so that D1 has t extra tuples, the newest t in D1, which need not participate in the injection (or on which the injection takes arbitrary fixed values), provided the cost of reading these extra tuples has been accounted for. Non-contiguity in the preimage can also be allowed if the overhead of a new read (c0) has already been accounted for each non-contiguity.

Suppose now that the optimal offline algorithm performs a write of aT + b tuples, where a and b are nonnegative integers with 0 ≤ b < T. This costs the offline algorithm at least (a + 1)c0. For the same set of tuples, Algorithm GreedyChunk needs at most a + 1 writes of T tuples each, with cost 2(a + 1)c0 (which is within a factor 2 of the offline cost), to be in a state that subsumes the optimal offline state, if one disregards the non-contiguity and the T - b (newest) extra tuples in the last write. However, since GreedyChunk reads in blocks of T tuples at a time, reading in more tuples than the optimal offline algorithm may cause it to have more tuples in the buffer than the offline algorithm at some intermediate instants. This may cause GreedyChunk to perform an extra write of a block of T tuples, due to incoming tuples (so the subsumption property is broken by at most this extra batch of T tuples written out by GreedyChunk), at an additional cost of 2c0 for the write and 2c0 later for reading these tuples back when they need to be consumed. But this single extra write by GreedyChunk indemnifies it against further writes, and there is no further breakage of the subsumption property within this batch of write-outs due to larger reads of tuples corresponding to the aT + b tuples in this batch: GreedyChunk has T fewer tuples in the memory buffer than the offline algorithm after this extra write, and this extra space of T can hold the further reads of size T in this batch. The cost paid by the online algorithm for reads and writes is thus at most 2(2a + 4)c0, while the optimal offline algorithm pays at least 2(a + 1)c0. This argument holds for every read done by the offline algorithm; thus every read-write pair that costs the offline algorithm at least 2(a + 1)c0 costs GreedyChunk at most 2(2a + 4)c0. This implies the 4-competitiveness of GreedyChunk.

It can be shown that the competitive ratio of Algorithm GreedyChunk also has a lower bound of 4, using techniques from Theorem 2.2.

Theorem 3.3. For T ≥ M/2, Algorithm Half is 4-competitive under the extended cost model.

Proof. We have already seen that Algorithm Half performs at most twice as many reads and writes as any offline algorithm. Furthermore, since it never reads or writes more than M/2 tuples at a time, under the extended cost model Half pays at most c0 + c1 × M/2 for each read/write, which is bounded by 2c0 for T ≥ M/2. Since any algorithm has to pay at least cost c0 for each read/write it performs, this implies the 4-competitiveness of Algorithm Half.

Corollary 3.1. There is an acyclic 4-competitive online algorithm for the extended cost model.

Proof. The following hybrid algorithm suffices: use Algorithm Half when T > M/2, and use Algorithm GreedyChunk otherwise.

Note that for current disk technology, the value of T lies between 100,000 and 2,000,000. Our results above suggest that we must read/write blocks of size much larger (100 KB-2 MB) than the standard disk-block size of 4 KB in order to minimize the access-time costs due to seeking. Conventionally, smaller blocks are used to avoid memory wastage due to fragmentation, but in data stream systems bigger blocks are more suitable.
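As a concreteness check on this discussion, the block-granularity rule and the hybrid policy of Corollary 3.1 can be sketched as follows; the function names and the interface are ours, not the paper's.

```python
import math

def greedychunk_write_size(excess: int, T: int) -> int:
    """GreedyChunk's write rule (our reading of Section 3): on a buffer
    overflow of `excess` tuples, spill the smallest number of whole
    T-tuple blocks that covers the excess, taking the newest tuples."""
    return math.ceil(excess / T) * T

def pick_single_queue_algorithm(M: int, c0: float, c1: float) -> str:
    """Hybrid of Corollary 3.1: with T = c0 / c1, run Half when T > M/2
    (each of its M/2-sized transfers then costs at most 2 * c0), and run
    GreedyChunk otherwise; both are acyclic and 4-competitive in their
    respective regimes."""
    return "Half" if c0 / c1 > M / 2 else "GreedyChunk"
```

The T here is measured in tuples, matching the paper's estimate that current disks put it between 100,000 and 2,000,000.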
Our results are corroborated by the observations of Patterson and Gray [13] (see Footnote 3) on how disks of the future should be looked upon as sequential devices rather than random-access devices.

Footnote 3: Patterson and Gray [13] mention that disks providing 200 accesses per second, each involving data blocks of size a few KB, give a throughput of a few MBps, at which rate it would take a year to read a 20-terabyte disk of the future. On the other hand, sequential scans avoiding random seeks give 500 times more bandwidth, making it possible to read the entire disk in one day. This is because disk density and capacity are increasing much more rapidly than disk access speeds.

4 Multiple Queues

We now consider the general version of the problem, in which the memory buffer of size M is shared by n independent queues. The cost model is once again the sum of the number of reads and writes, i.e., the unit-cost model. As discussed in Section 1, a single read/write can only involve contiguous tuples from a single queue. We will assume, as before, that the input to the various queues is adversarial. The new issue in the case of multiple queues is the choice by the scheduler as to which queue's head is to be consumed next. Round-robin schedulers cycle through the queues, consuming the oldest tuple in each queue in turn, skipping over any empty queues; this model makes sense when the goal is to provide fair and equitable service to each queue. Under round-robin schedulers, we provide a 2n-competitive acyclic algorithm, and also prove a lower bound of Ω(√n) on the competitive ratio of any acyclic online algorithm. We also consider adversarial schedulers, which at each clock-tick pick an arbitrary nonempty queue and consume the oldest tuple from that queue. Here we establish a negative result: it is not even possible to have an o(M)-competitive acyclic algorithm. However, if we allow the online algorithm an extra M/2 units of memory (i.e., the online algorithm has memory size 3M/2 while competing with an offline algorithm with memory size M), we can devise an online algorithm which is O(n)-competitive.

We begin by pointing out that we cannot use the obvious approach of statically allocating M/n units of memory to each queue and running an algorithm such as Half for each queue. To see this, consider the case where essentially all the tuples in the system belong to one specific queue and the number of unconsumed tuples of that queue stays close to M, say M - 1, with a new tuple arriving every clock-tick as a tuple is consumed. Here the optimal offline algorithm allocates the entire memory to the active queue and has cost 0, while the online algorithm, which allocated only M/n memory to that queue, will keep making writes and reads, and so cannot achieve any reasonable competitive ratio.

4.1 Algorithm BufferedHead

We first describe an acyclic algorithm, BufferedHead, and then show that it can be used to derive the positive results for the two types of schedulers discussed above. Our algorithm has a protected head: some fraction of the oldest tuples from every queue will never be written out to disk to make space for incoming tuples. We maintain a protected head of size up to M/2n for each queue in the system.

Algorithm BufferedHead is a generalization of Algorithm Half and works as follows. When tuples enter the system they are placed in their respective queues. The memory is not statically partitioned between the queues, and the queues dynamically change in size as tuples enter and leave. As before, we will only analyze a simple version of BufferedHead (see Footnote 4), which writes exactly the M/2n newest tuples of the largest queue in memory when there is a memory buffer overflow (by our arrival-rate assumption, it never needs to write more than this many tuples). It also reads tuples from disk in chunks of size M/2n. Note that in both versions of this algorithm, read-ins may also cause write-outs, to make space for the tuples being read in.

Footnote 4: As discussed for Algorithm Half in Footnote 2, there are modified versions of BufferedHead that perform a more relaxed form of writes: at a clock-tick, if there is excess input that does not fit into the available memory, select the largest queue in memory and write all but its M/2n oldest tuples to disk, or instead write out the newest half of the largest queue in memory to disk. When tuples have to be read from disk for consumption in a particular queue, read M/2n tuples for that queue, or the entire portion of that queue on disk, whichever is smaller. We defer the analysis of these forms of the algorithm to the extended version of the paper.
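As with Half, BufferedHead is described only in prose. The sketch below is our reconstruction of the simple version, with our own names; it assumes tuples are comparable by arrival order (e.g., sequence numbers), and it glosses over boundary cases, which the assert makes explicit.

```python
class BufferedHead:
    """Sketch (ours) of Algorithm BufferedHead: n FIFO queues share one
    memory buffer of M tuples, and every disk transfer moves a block of
    exactly M/2n tuples (unit-cost model: one unit per block)."""

    def __init__(self, M, n):
        self.M, self.n, self.blk = M, n, M // (2 * n)
        self.mem = [[] for _ in range(n)]    # in-memory tuples per queue, oldest first
        self.disk = [[] for _ in range(n)]   # spilled blocks per queue, kept oldest first
        self.cost = 0

    def _used(self):
        return sum(map(len, self.mem))

    def _write_out(self):
        # Spill the M/2n newest in-memory tuples of the largest queue; the
        # M/2n oldest tuples of each queue (its protected head) stay put.
        q = max(range(self.n), key=lambda i: len(self.mem[i]))
        assert len(self.mem[q]) >= 2 * self.blk, "sketch assumption: a block can be spared"
        block, self.mem[q] = self.mem[q][-self.blk:], self.mem[q][:-self.blk]
        self.disk[q].append(block)
        self.disk[q].sort(key=lambda b: b[0])   # keep blocks in age order
        self.cost += 1

    def arrive(self, q, t):
        self.mem[q].append(t)
        while self._used() > self.M:             # overflow: spill from the largest queue
            self._write_out()

    def consume(self, q):
        # If the oldest unconsumed tuple of q sits on disk, read one block
        # back; making room for it may itself force write-outs.
        if self.disk[q] and (not self.mem[q] or self.disk[q][0][0] < self.mem[q][0]):
            while self._used() + self.blk > self.M:
                self._write_out()
            self.mem[q] = self.disk[q].pop(0) + self.mem[q]
            self.cost += 1
        return self.mem[q].pop(0) if self.mem[q] else None
```

Theorem 4.1 below turns exactly this protected-head property into a proof of acyclicity.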
Theorem 4.1. Algorithm BufferedHead is acyclic.

Proof. The oldest M/2n tuples of any queue are never written out to disk, in any version of BufferedHead. Since the tuples read in from disk are always the oldest tuples in their queue, and read-ins are done in chunks of M/2n, it follows that no tuple is written out and then read back more than once.

We analyze this algorithm for the two kinds of schedulers in the following subsections.

4.2 Round-Robin Schedulers

Under round-robin schedulers, the oldest tuple of each queue is consumed before the second-oldest tuple of any queue. This can be modeled as having to consume the oldest tuple of every nonempty queue at each clock-tick, which requires having the oldest tuple of each queue in memory at every clock-tick. Note that under round-robin schedulers, the M/2n tuples read in for any queue will be among the next M/2 tuples consumed over all queues, as at most M/2n tuples from each of the n queues will be consumed before them. Also note that, for every batch of 2n writes by BufferedHead, at least M new tuples must have entered the system, as those written-out tuples that were read in are now in the protected head.

Theorem 4.2. Algorithm BufferedHead is a 2n-competitive algorithm for round-robin schedulers.

Proof Sketch. We give only a brief sketch of the proof here, deferring the detailed proof to the extended version of the paper. The intuition is that, as write sizes are at least M/2n, at most a factor of 2n is lost against the larger writes performed by the optimal offline algorithm (which cannot exceed M in size). For smaller writes by the optimal offline algorithm, BufferedHead would have to do an early read, but this can cause at most one extra write, after which the online algorithm would have newer tuples on disk as compared to the offline algorithm. The detailed proof uses a natural generalization of the subsumption property (see Theorem 3.2) to multiple queues to show that, for every batch of 2n write-outs performed by BufferedHead, any offline algorithm has to perform at least one write-out. Reads, as usual, can be charged to writes or analyzed similarly.

While we do not have a matching lower bound, the following result indicates that BufferedHead may not be far from the best possible acyclic algorithm.

Theorem 4.3. Under round-robin schedulers, the competitive ratio of an acyclic online algorithm must be Ω(√n).

Proof. Let tuples be provided as input only to queue 1, at the maximum possible rate, until the number of unconsumed tuples reaches 2M. We assume that the rate is large enough that the optimal offline algorithm incurs only O(1) writes. At this time, any algorithm must have at least M tuples on disk. Stop giving any input now, until the algorithm performs a "big" read, as described next. Clearly, a √n-competitive algorithm must perform at least one read of size M/√n if no new input tuples arrive, since an offline algorithm could read in chunks of Θ(M) tuples. Once this big read has been performed, inject new tuples at the maximum possible input rate so as to add M/n new tuples into each queue, for a total of M tuples. Since the online algorithm is acyclic, it cannot write the M/√n tuples from the big read back to disk, so it must write out at least M/√n of the arriving tuples; as each queue has size M/n and each write involves a single queue, this amounts to Ω(√n) writes. The optimal offline algorithm, foreseeing this sudden input spurt, would have read in only M/2n tuples of queue 1 just before this new batch of M tuples arrived. Thus the offline algorithm can have O(1) cost, while any online algorithm has cost Ω(√n).

4.3 Adversarial Schedulers

We first show that there are no o(M)-competitive acyclic algorithms under adversarial schedulers. We then use Algorithm BufferedHead to show that if we give the online algorithm a larger memory buffer vis-a-vis an offline algorithm, then it is an O(n)-competitive algorithm. Since M ≫ n, we are interested only in competitive ratios that are bounded independent of M. The following result shows that, under adversarial schedulers, we cannot achieve this goal.

Theorem 4.4. Under adversarial schedulers, there is no acyclic online algorithm with competitive ratio f(n), for any bound f(n) independent of M.
In other words, it may be possible that at an instant when there are more than M unprocessed tuples, an offline algorithm maintains the newest M tuples in memory and does not pay for any reads/writes. Acknowledgements We thank Gurmeet Manku for his comments in improving the exposition of the paper. References [1] A. Arasu, B. Bab cock, S. Babu, M. Datar, K. Ito, R. Motwani, I. Nishizawa, U. Srivastava, D. Thomas, Proof. Let tuples be provided as input only to queue 1 at the maximum possible rate, until the number of unconsumed tuples reaches 2M . At this point, any algorithm should have at least M tuples on disk. We assume that the rate is large enough that offline will incur only O(1) writes. Stop giving any input now, until the algorithm makes a big read, as described next. For an f (n)-competitive algorithm to read-in the M tuples on disk, there must be at least one read of size M /f (n) if no new tuples arrive, since an optimal offline algorithm could read in chunks of size (M ). As the tuples in the system are all from a single queue, the adversarial scheduler is forced to consume only from this queue up to this point in time. Once this big read has been performed, inject tuples for queue 2 into the system, until the number of unconsumed tuples in queue 2 reaches M - 1. Also from now on suppose the scheduler only consumes tuples from queue 2, and after this at every instant in the future exactly one tuple for queue 2 enters the system to replenish the tuples of queue 2 being consumed. Thus M - 1 unconsumed tuples of queue 2 are constantly being maintained in the system. Observe that the space occupied by the M /f (n) tuples of queue 1, from the large read, can never be reclaimed. Therefore, an acyclic online algorithm will incessantly be performing writes and reads, while an offline algorithm foreseeing this, could have performed a read of just a single tuple for queue 1, just before this deluge of queue 2 tuples began. Since the number of writes and reads of the online algorithm in this case is unbounded, while the offline does not incur any more writes, the competitive ratio of the online algorithm is unbounded. A In spite of the previous result, it turns out that BufferedHead is still applicable under adversarial schedulers, provided we compare its performance to that of offline algorithms with smaller memory size. Theorem 4.5. Under adversarial schedulers, Algorithm BufferedHead is an acyclic 2n-competitive algorithm when provided with extra M /2 memory. Proof Sketch. We give only a brief sketch of the proof here, deferring the detailed proof to the extended version of the paper. The protected region (the oldest parts of all queues that cannot be written to disk) occupy at most M /2 space and all the writes happen from the unprotected region of size at least M . Just as in Theorem 4.2, between 2n writes of BufferedHead there must be atleast one write by the optimal offline algorithm. W e can improve the competitive ratio to n for writes when given memory of total size 3M /2, by changing BufferedHead to write out all except the head (M /2n oldest tuples) of the largest queue in memory, or for a simpler analysis, M /n of the newest tuples from the largest queue in memory. Algorithm BufferedHead remains acyclic, but now writes have size at least M /n. 
4.4 Extended Cost Model

The single-queue algorithm for the extended cost model in Section 3 can easily be extended to maintain multiple queues under the extended cost model, yielding a 4-competitive algorithm for round-robin consumption. We defer the details to the extended version of this paper.

5 Conclusion

We studied the problem of maintaining queues in a memory cache, which arises in a number of important applications such as data stream systems, networking, and distributed messaging services. We analyzed why data stream systems built on top of buffer managers that use traditional caching algorithms like LRU perform badly, as noticed elsewhere [10]. We provided online competitive algorithms for this problem under different interesting cost models. These algorithms will be implemented in the Stanford STREAM system [1].

Acknowledgements

We thank Gurmeet Manku for his comments in improving the exposition of the paper.

References

[1] A. Arasu, B. Babcock, S. Babu, M. Datar, K. Ito, R. Motwani, I. Nishizawa, U. Srivastava, D. Thomas, R. Varma, and J. Widom. "STREAM: The Stanford Stream Data Manager." IEEE Data Engineering Bulletin, 26(1):19-26, 2003.
[2] Yossi Azar and Yossi Richter. "Management of multi-queue switches in QoS networks." In: Proceedings of the 35th Annual ACM Symposium on Theory of Computing, pp. 82-89, 2003.
[3] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. "Models and issues in data stream systems." In: Proceedings of the Twenty-First ACM SIGMOD Symposium on Principles of Database Systems, pp. 1-16, 2002.
[4] Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani. "Chain: Operator Scheduling for Memory Minimization in Stream Systems." In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 253-264, 2003.
[5] Brian Babcock, Mayur Datar, and Rajeev Motwani. "Load shedding techniques for data stream systems." In: Proceedings of the 20th International Conference on Data Engineering, 2004 (to appear).
[6] A. Borodin and R. El-Yaniv. Online Computation and Competitive Analysis. Cambridge University Press, 1998.
[7] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. "Monitoring streams: A New Class of Data Management Applications." In: Proceedings of the 28th International Conference on Very Large Data Bases, pp. 215-226, 2002.
[8] S. Chandrasekaran, O. Cooper, A. Deshpande, M.J. Franklin, J.M. Hellerstein, W. Hong, S. Krishnamurthy, S.R. Madden, V. Raman, F. Reiss, and M.A. Shah. "TelegraphCQ: Continuous Dataflow Processing for an Uncertain World." In: First Biennial Conference on Innovative Data Systems Research, 2003.
[9] Jianjun Chen, David J. DeWitt, Feng Tian, and Yuan Wang. "NiagaraCQ: A scalable continuous query system for Internet databases." In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 379-390, 2000.
[10] Charles Cranor, Theodore Johnson, Oliver Spatscheck, and Vladislav Shkapenyuk. "Gigascope: A Stream Database for Network Applications." In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 647-651, 2003.
[11] Charles Cranor, Theodore Johnson, Oliver Spatscheck, and Vladislav Shkapenyuk. "The Gigascope Stream Database." IEEE Data Engineering Bulletin, 26(1):27-32, 2003.
[12] Abhinandan Das, Johannes Gehrke, and Mirek Riedewald. "Approximate join processing over data streams." In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 40-51, 2003.
[13] Jim Gray and David Patterson. "Storage: A Conversation with Jim Gray." ACM Queue, 1, 2003.
[14] IBM Corporation. IBM MQSeries. http://www-3.ibm.com/software/integration/wmq/.
[15] Sundar Iyer, R.R. Kompella, and Nick McKeown. "Analysis of a Memory Architecture for Fast Packet Buffers." In: IEEE Workshop on High Performance Switching and Routing, 2001.
[16] Jon Kleinberg. "Bursty and hierarchical structure in streams." In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 91-101, 2002.
[17] Will E. Leland, Murad S. Taqqu, Walter Willinger, and Daniel V. Wilson. "On the self-similar nature of Ethernet traffic." IEEE/ACM Transactions on Networking, 2(1):1-15, 1994.
[18] R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma. "Query Processing, Approximation, and Resource Management in a Data Stream Management System." In: First Biennial Conference on Innovative Data Systems Research, pp. 245-256, 2003.
[19] Vern Paxson and Sally Floyd. "Wide-area Traffic: The Failure of Poisson Modeling." IEEE/ACM Transactions on Networking, 3(3):226-244, 1995.
[20] Devavrat Shah, Sundar Iyer, Balaji Prabhakar, and Nick McKeown. "Maintaining statistics counters in line cards." IEEE Micro, pp. 76-81, 2002.
[21] Sandeep Sikka and George Varghese. "Memory-efficient state lookups with fast updates." In: Proceedings of the ACM SIGCOMM Conference, pp. 335-347, 2000.
[22] Nesime Tatbul, Ugur Cetintemel, Stan Zdonik, Mitch Cherniack, and Michael Stonebraker. "Load shedding in a data stream manager." In: Proceedings of the 29th International Conference on Very Large Data Bases, pp. 309-320, 2003.
[23] Walter Willinger, Murad S. Taqqu, and Ashok Erramilli. "A Bibliographical Guide to Self-Similar Traffic and Performance Modeling for Modern High-Speed Networks." In: Stochastic Networks: Theory and Applications, F.P. Kelly, S. Zachary, and I. Ziedins (Eds.), Oxford University Press, pp. 339-366, 1996.