Minimizing Migrations in Fair Multiprocessor Scheduling of Persistent Tasks Tracy Kimbrel Baruch Schieber Maxim Sviridenko Abstract Suppose that we are given n persistent tasks (jobs) that need to be executed in an equitable way on m processors (machines). Each machine is capable of performing one unit of work in each integral time unit and each job may be executed on at most one machine at a time. The schedule needs to specify which job is to be executed on each machine in each time window. The goal is to find a schedule that minimizes job migrations between machines while guaranteeing a fair schedule. We measure the fairness by the drift d defined as the maximum difference between the execution times accumulated by any two jobs. Since jobs are persistent we measure the quality of the schedule by the ratio of the number of migrations to time windows. We show a tradeoff between the drift and the number of migrations. Let n = q m + r with 0 < r < m (the problem is trivial for n m and for r = 0). For any d 1, we show a schedule that achieves a migration ratio of r(m - r)/(n(q (d - 1) + 1)) + o(1); namely, it asymptotically requires r(m - r) job migrations every n(q (d - 1) + 1) time windows. We show how to implement the schedule efficiently. We prove that our algorithm is almost optimal by proving a lower bound of r(m - r)/(nq d) on the migration ratio. We also give a more complicated schedule that matches the lower bound for an infinite number of instances. Our algorithms can be extended to the dynamic case in which jobs enter and leave the system over time. each persistent task is associated with some requirement of resources over time and the challenge is to allocate the limited resources to satisfy all these requirements in a "fair" way. Fair scheduling has also been of great interest in general-purpose operating systems research, mainly due to the popularity of "soft real time" multimedia applications. A variety of schemes have been developed such as the ones in [19, 20, 14, 8, 9]. Recent research has aimed towards extending fair scheduling to multiprocessor systems [11, 17]. Process migration in a multiprocessor system is a ma jor cause of performance degradation, as observed experimentally (see, e.g., [12, 17]). A primary reason for the degradation is the increase in cache misses due to migrations. The performance impact becomes more and more substantial as the processor-memory speed gap widens. Migration also requires system overhead to maintain process scheduling data structures. In this paper we consider both fair scheduling and migration simultaneously, and demonstrate a tradeoff between number of migrations and the level of fairness in the schedule. Suppose that we are given m processors (machines) M1 , . . . , Mm , and n > m persistent tasks (jobs) 1, 2, . . . , n. Each machine can process one unit of work in each unit of time, and each job may be executed on at most one machine at a time. For each time window [t - 1, t], t 1, a schedule needs to determine which m jobs to execute and further must specify which machine executes each job. Each machine must execute a job in each time window; idle time is not allowed. 1 Intro duction A job migrates between two of its consecutive Real-time multiprocessor systems have evolved rapidly executions (which may or may not be in consecutive in recent years (see, e.g., [15, 2, 11, 10]). Such systems time windows) if these two executions are not on the are used in avionics, automotive and astronautics syssame machine. We will think of migrations occurring tems, and also to control automatic manufacturing sysinstantly at integer times, between the unit-length time tems and other autonomic systems [13]. In many conwindows over which jobs are executed. Since jobs are trol applications the system needs to run a collection of persistent we measure the performance of the schedule persistent tasks, each of which senses and controls part by the ratio of the number of migrations to the number of the overall system. Each such task needs to be exof time windows. We call this the migration ratio. Our ecuted frequently enough to guarantee quick response. goal is to schedule the jobs fairly, while minimizing An obvious challenge is the scheduling of these tasks the number of migrations. The notion of fairness is as to ensure proper functioning of the system. Frequently, follows. For a job j and an integer t 0, let PT(j, t) denote IBM T.J. Watson Research Center, P.O. Box 218, Yorktown the number of time windows, out of time windows Heights, NY 10598. email:{kimbrel,sbar,sviri}@us.ibm.com Copyright © 2004 by the Association for Computing Machinery, Inc. and the Society for industrial and Applied Mathematics. All Rights reserved. Printed in The United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Association for Computing Machinery, 1515 Broadway, New York, NY 10036 and the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688 982 1, . . . , t, in which job j is executed on any of the machines. We will refer to this as the processing time of job j at time t. A perfectly fair schedule would guarantee PT(j, t) = tm/n, for every job j and every time t. Of course this is not always possible since this value is not necessarily an integer. It is not difficult to show that there is a schedule such that for each job j and time t, tm/n PT(j, t) tm/n . Note that in this case the difference between the processing times of any two jobs is at most one. We generalize this notion of fairness and define the drift as the maximum over all times t of max1i,j n {PT(j, t) - PT(i, t)}, i.e., the maximum difference between the processing times of any two jobs at time t. We show a tradeoff between the drift and the number of migrations. Let n = q m + r with r < m and q > 0. (Note that the problem is trivial for n m and for r = 0.) For any d 1, we show a schedule that achieves a migration ratio of r(m - r)/(n(q (d - 1) + 1)) + o(1); namely, it asymptotically incurs r(m - r) job migrations every n(q (d - 1) + 1) time windows. We show how to implement the schedule efficiently in a distributed system. We prove that our algorithm is almost optimal by proving a lower bound of r(m - r)/(nq d) on the migration ratio. We also give a more complicated schedule that matches the lower bound whenever 2q d and m = 2r. In the full version we give an improved lower bound that matches the ratio achieved by the complicated schedule whenever m = 2r and 2q d, and in an infinite number of instances in which 2q > d. Finally, we observe that our algorithm can be extended to the dynamic case in which jobs enter and leave the system over time. The proofs of the upper and lower bound on the number of migrations use potential function arguments. Interestingly, this simplifies the proof considerably. Previous versions of our proofs involved a tedious analysis of the group of integers modulo m and the sequence of its elements as generated by r (assuming g cd(m, r) = 1). The correctness proof of our algorithm is based on a careful analysis of the average processing times of jobs on each machine. We concentrate on the case in which all tasks have the same resource requirement. A natural generalization is the case in which each task has a different fractional requirement and the drift is normalized by these fractions. It is not difficult to prove that the computation of a schedule that minimizes the drift is this case is NP-Hard in the strong sense by reduction from 3partition. Still, it is an interesting open problem to design a polynomial time algorithm that achieves a proven bound on migrations in this case. 1.1 Related work Early works by Liu [15] and Liu and Layland [16] consider persistent scheduling of tasks in a uniprocessor environment. They presented a model in which time is partitioned into discrete time windows. Each persistent task has periodic release times and deadlines and an integer processing requirement (number of time windows) that must be satisfied in each period. Deadline i is equal to release time i + 1, each period is the same length, and each processing requirement is the same for a given task. The schedule must meet these requirements for all tasks indefinitely. Liu and Layland [16] give a scheduling algorithm for this model that achieves full processor utilization. More recently, the problem of scheduling persistent periodic tasks has been studied in the work of Baruah et al. [5]. Their setting is closer to ours. They considered the problem of scheduling a set of n tasks with given (arbitrary) frequencies on m machines. To measure "fairness" of a schedule they introduced the notion of P -fairness. A schedule is P -fair (proportionate-fair) if at each time t for each task j , the absolute value of the difference in the number of times j has been scheduled and its "fair" share fj t is strictly less than 1, where fj is the frequency of task j . It is not difficult to see that any P -fair schedule has drift one, and vice versa. Baruah et al. [5] provided an algorithm for computing a P -fair solution to any feasible instance of their problem. Since their introduction much work has been done on P -fair schedules. This follow-up work includes (among others) simplification of the scheduling algorithm [1, 6, 3] and extensions to more general models that include sporadic and persistent tasks, and fixed and migrating tasks [4, 18]. Central to the notion of P -fairness of Baruah et al. [7] is the lag of a schedule, defined to be the maximum absolute value of the difference in the number of times a task has been scheduled and its "fair" share. It is not difficult to see that any schedule with lag d has drift no more than 2d, and any schedule with drift d has lag d as well. The rest of the paper is organized as follows. In Section 2 we introduce our algorithm, prove its correctness and bound the number of migrations as a function of the drift. In Section 3 we prove a lower bound on the number of migrations in any schedule that achieves a given drift. 2 The Algorithm In this section we describe an algorithm that schedules n = q m + r jobs on m machines with drift d and migration ratio r(m - r)/(n(q (d - 1) + 1)). We show how this algorithm can be implemented efficiently, by performing a constant number of operations on each machine per time window, and a constant amount of 983 fully distributed communication between machines for each migration. The algorithm maintains the invariant that at all times, each machine has either q or q + 1 jobs. It consists of a migration phase in which some jobs may migrate from machines with q + 1 jobs to machines with q jobs, and an execution phase in which each machine executes one of the jobs on it. We assume that initially each job is on the first machine that executes it, and thus there are no initial migrations. To describe the algorithm we introduce some notation. Let S (t) and F (t) (for "slow" and "fast") denote the sets of machines with q + 1 and q jobs, respectively. For these and all other time-varying variables, the value at time t denotes the state after the execution of the jobs in the tth time window, but before any migrations in preparation for the next; t = 0 indicates the initial state. Intermediate states during the execution of the algorithm will be indicated using t+ . For a machine M , let J (M , t) be the set of jobs on machine M at time t. Let avg(M , t) be the average processing time of all jj bs in J (M , t); i.e., for o 1 PT(j, t), and for M S (t), avg(M , t) = q+1 jJ (M ,t) 1 M F (t), avg(M , t) = q J (M ,t) PT(j, t). For a subset of machines M, let J (M, t) = M M {J (M , t)}. Let minM (M, t) (resp. maxM (M, t)) be a machine with the lowest (resp. highest) average processing time among machines in M. (Ties are broken arbitrarily.) Let minPT(M , t) = minj J (M ,t) {PT(j, t)} and maxPT(M , t) = maxj J (M ,t) {PT(j, t)}. Define minPT(M, t) and maxPT(M, t) similarly. The algorithm Min-Migrations-Fair-Schedule (MMFS) is described below. Algorithm 2.1. Min-Migrations-Fair-Schedule (MMFS) Migration Phase: Let M1 = minM (S (t), t) Let M2 = maxM (F (t), t) If avg(M2 , t) - avg(M1 , t) = (d - 1) + 2/(q + 1) Migrate a job with processing time minPT(M1 , t) from M1 to M2 Update S (t+ ) and F (t+ ) accordingly Repeat the Migration Phase Execution Phase: For each machine M Execute any job with processing time minPT(M , t+ ). bounds the drift and the migration ratio. First, we describe an efficient implementation of the algorithm. To find M1 = minM (S (t), t) and M2 = maxM (F (t), t) we maintain two queues: QS (t) of all the machines M in S (t) in increasing order of avg(M , t), and QF (t) of all the machines M in F (t) in decreasing order of avg(M , t). The machine M1 (resp. M2 ) is at the head of QS (t) (resp. QF (t)). We prove below that when a job migrates from M1 to M2 , machine M1 leaves QS (t) and joins QF (t) at its tail and machine M2 leaves QF (t) and joins QS (t) at its tail. It follows that if we maintain the machines in a circular list ordered according to their positions in the queue, i.e., the machines in QS (t) ordered (clockwise, say) from head to tail followed by the machines in QF (t) ordered (clockwise) from head to tail, then this order remains fixed throughout the algorithm, and the only things that change are the locations of the heads of the queues, which advance one position (clockwise) whenever a migration occurs. We also show that for each machine M F (t) the processing times of all jobs on M differ by at most one, and that for each machine M S (t) the processing times of q out of the q + 1 jobs on M differ by at most one; one job may be as far as d behind. This implies that a job with processing time minPT(M , t) can be found in constant time. Theorem 2.1. For any d 1, algorithm MMFS produces a schedule with drift d and migration ratio r(m - r)/(n(q (d - 1) + 1)) + o(1). To prove the theorem we first bound the drift and then bound the migration ratio. 2.1 Bounding the drift We prove the bound on the drift by induction. The drift is obviously at most d at time 0, when the processing times of all jobs are 0. Assume that the drift is bounded up to time t - 1. We show that it is bounded also at time t. For the proof we need a few lemmas. Lemma 2.1. At every integer time t 0, the processing times of al l jobs on any machine M F (t) differ by at most one. The processing times of q out of the q + 1 jobs on any machine M S (t) differ by at most one, and the processing time of the remaining job on M may be smal ler than the rest. Proof. We prove the lemma by induction. The basis is trivial since all jobs start at time 0 with processing time 0. Consider a time t > 0 and assume that the lemma holds at time t - 1. If no job migrated from or into M at time t - 1, then since machine M executed a job with We show that the migration phase always termi- the least processing time, the lemma holds also at time nates in the course of proving Theorem 2.1 below, which t. Suppose that a job migrated from M at time t - 1 984 minPT(M1 , t) d. Under the assumption that there was no job migration into M1 or from M2 the remaining possibility is minPT(M1 , t - 1) = minPT(M1 , t) and maxPT(M2 , t) = maxPT(M2 , t - 1) + 1. By Lemma 2.1 the processing times of all jobs on M2 differ by at most one. Since maxPT(M2 , t) = maxPT(M2 , t - 1) + 1, at time t - 1 the processing times of all jobs on M2 must have been the same and thus avg(M2 , t - 1) = maxPT(M2 , t - 1). Since minPT(M1 , t - 1) = minPT(M1 , t), at time t - 1 there were at least two jobs on M1 with processing time minPT(M1 , t - 1). From Lemma 2.1, it follows that at time t - 1 the processing times of all jobs on M1 were either minPT(M1 , t - 1) or minPT(M1 , t - 1) + 1, and thus avg(M1 , t - 1) minPT(M1 , t - 1) + 1 - 2/(q + 1). However, in this case Lemma 2.2. Consider machines M1 S (t) and M2 since avg(M2 , t - 1) - avg(M1 , t - 1) < d - 1 + 2/(q + 1), F (t). If there are no migrations from M1 or to M2 at we must have maxPT(M2 , t - 1) - minPT(M1 , t - 1) < d time t - 1, then at time t, avg(M2 , t) - avg(M1 , t) = and thus maxPT(M2 , t) - minPT(M1 , t) d. 1 Case 2: A job migrated to M1 from machine M3 avg(M2 , t - 1) - avg(M1 , t - 1) + (q+1)q . at time t - 1. By the algorithm this happens only if Proof. The average processing time of J (M1 , t) in- avg(M1 , t - 1) - avg(M3 , t - 1) = d - 1 + 2/(q + 1). creases by 1/q and that of J (M2 , t) by 1/(q + 1). The By the induction hypothesis J (M1 , t) consists of q jobs 1 with processing time maxPT(M1 , t - 1) and a single job difference is (q+1)q . with processing time maxPT(M1 , t - 1) - d + 1; thus, Lemma 2.3. For any two machines M1 S (t) and avg(M1 , t) = maxPT(M1 , t - 1) - (d - 1)/(q + 1). By the induction hypothesis all the jobs in J (M3 , t) have M2 F (t) at time t, processing time maxPT(M1 , t - 1) - d + 1, and thus 1. avg(M2 , t) - avg(M1 , t) d - 1 + 2/(q + 1), avg(M3 , t) = maxPT(M1 , t - 1) - (d - 1). It follows that 2. maxPT(M2 , t) - minPT(M1 , t) d. avg(M3 , t) - avg(M1 , t) = -(d - 1) + (d - 1)/(q + 1) If avg(M2 , t) - avg(M1 , t) = d - 1 + 2/(q + 1) then = -(d - 1)q /(q + 1) 1. PT(j, t) is the same for al l jobs in J (M2 , t), d - 1 + 2/(q + 1). Clearly, maxPT(M3 , t) - minPT(M1 , t) d. For every machine M F (t) F (t - 1), i.e., each machine M F (t) from which no job migrated at time t - 1, 3. PT(j, t) = minPT(M2 , t) - d, for the remaining two avg(M , t) 1/q + avg(M1 , t - 1). This is because jobs j in J (M1 , t). machine M1 had the maximum average processing time among machines in F (t - 1) at time t - 1, and the Proof. The proof is by induction. The induction basis average processing time of machine M grew by 1/q . for t = 0 is trivial. Consider a time t > 0. We consider Since avg(M , t) = avg(M , t - 1) - (d - 1)/(q + 1), 1 1 several cases. we get Case 1: There was no job migration into M1 or avg(M , t) - avg(M1 , t) 1/q + (d - 1)/(q + 1) from M2 at time t - 1. By the induction hypothesis and the algorithm, avg(M2 , t - 1) - avg(M1 , t - 1) < d - 1 + 2/(q + 1). d - 1 + 2/(q + 1). Since the difference in the averages is an integer multiple of q(q1 1) it follows that avg(M2 , t - Since maxPT(M , t) maxPT(M , t - 1) + 1, and + maxPT(M , t - 1) maxPT(M1 , t - 1), we also get 1) - avg(M1 , t - 1) d - 1 + 2/(q + 1) - q(q1 1) . In this maxPT(M , t) - minPT(M , t) d. + 1 case Lemma 2.2 implies that avg(M2 , t) - avg(M1 , t) Case 3: A job migrated from machine M2 to d - 1 + 2/(q + 1). machine M4 at time t - 1. Similar to the analysis above, If either minPT(M1 , t) = minPT(M1 , t - 1) + avg(M2 , t) - avg(M4 , t) = -(d - 1)q /(q + 1) 1 or maxPT(M2 , t - 1) = maxPT(M2 , t), then by the induction hypothesis we have maxPT(M2 , t) - d - 1 + 2/(q + 1), 2. PT(j, t) = minPT(M2 , t) - (d - 1), for q - 1 jobs j in J (M1 , t), and thus M moved from S (t - 1) to F (t). Note that the migrated job is the one with the least processing time, and by the induction hypothesis the processing times of the rest of the jobs on M at time t - 1 differ by at most one. Since machine M executes a job with the least processing time, the lemma holds also at time t. Suppose that a job migrated from machine M into M at time t - 1. By the induction hypothesis the processing times of the rest of the jobs on M differ by at most one at time t - 1. Thus, minPT(M , t - 1) = avg(M, t - 1) . By the algorithm avg(M , t - 1) < avg(M , t - 1). At time t - 1 the processing time of the migrated job is minPT(M , t - 1); thus it is also strictly less than avg(M , t - 1), implying the lemma in this case. 985 and maxPT(M2 , t) - minPT(M4 , t) d. For every machine M S (t) S (t - 1) to which no job migrated at time t - 1, avg(M , t) 1/(q + 1) + avg(M2 , t - 1). This is because machine M2 had the minimum average processing time among machines in S (t - 1) at time t - 1, and the average processing time of machine M grew by 1/(q + 1) at time t - 1. By the induction hypothesis J (M2 , t - 1) consists of q - 1 jobs with processing time maxPT(M2 , t - 1) and two jobs with processing time maxPT(M2 , t - 1) - 1, and J (M2 , t) consists of q jobs with processing time maxPT(M2 , t - 1). This implies avg(M2 , t) = avg(M2 , t - 1) + 2/(q + 1), and thus we get avg(M2 , t) - avg(M , t) 1/(q + 1) d - 1 + 2/(q + 1). If minPT(M , t) = minPT(M , t - 1) + 1, then since maxPT(M2 , t) = maxPT(M2 , t - 1), by the induction hypothesis maxPT(M2 , t) - minPT(M , t) d. Otherwise, since minPT(M , t - 1) = minPT(M , t), at time t - 1 there were at least two jobs on M with processing time minPT(M , t - 1). From Lemma 2.1, it follows that at times t - 1 and t the processing times of all jobs in J (M , t) differ by at most one. Since avg(M , t - 1) avg(M2 , t - 1), we also have minPT(M , t) = minPT(M , t - 1) minPT(M2 , t - 1). induction hypothesis implies avg(M2 , t) - avg(M1 , t) (d - 1) + 1/q . If there were migrations from both M1 and M2 then by Lemma 2.3 both averages are the same. Suppose that a job migrated from machine M3 F (t) to machine M4 F (t - 1) at time t - 1. By the algorithm, the maximum average processing time among the machines in F (t - 1) at time t - 1 is avg(M4 , t - 1). By Lemma 2.3, avg(M3 , t) + (d - 1) = avg(M4 , t - 1); thus for every machine M F (t) F (t - 1), avg(M , t - 1) avg(M3 , t) + (d - 1). By the induction hypothesis for every machine M F (t) F (t - 1), avg(M4 , t - 1) - avg(M , t - 1) (d - 1) + 1/q . Substituting avg(M3 , t) + (d - 1) = avg(M4 , t - 1), we get avg(M3 , t) - 1/q avg(M , t - 1). Since avg(M , t) = avg(M , t - 1) + 1/q , avg(M3 , t) avg(M , t) avg(M3 , t) + (d - 1) + 1/q . It follows that the bound on the average differences holds also if there were migrations from either M1 or M2 . Note also that if a job migrated at time t - 1, then a machine from which a job migrated at time t - 1 has the smallest average processing time in F (t) at time t. This is because avg(M3 , t) avg(M , t) holds for every machine M F (t) F (t - 1). Lemma 2.5. For any two machines M1 , M2 S (t) at time t, avg(M2 , t) - avg(M1 , t) (d - 1) - (d - 2)/(q + 1). Since maxPT(M2 , t) = minPT(M2 , t - 1) + 1, we get maxPT(M2 , t) - minPT(M , t) d. We still need to prove the configuration in case avg(M2 , t) - avg(M1 , t) = d - 1 + 2/(q + 1). By the algorithm, if equality holds then a job migrates to M2 and a job migrates from M1 . (It may not necessarily be the same job.) Thus, M1 S (t) F (t + 1), and M2 F (t) S (t + 1). By the algorithm, since avg(M2 , t) is an integer multiple of 1/q and avg(M1 , t) is an integer multiple of 1/(q + 1) , the equality implies that avg(M2 , t) must be an integer. Lemma 2.1 implies that in this case all the jobs in J (M2 , t) have the same processing time maxPT(M2 , t). By the above analysis maxPT(M2 , t) - minPT(M1 , t) d. Thus, from the equality avg(M2 , t) - avg(M1 , t) = d - 1 + 2/(q + 1) and Lemma 2.1 it follows that there must be exactly two jobs in J (M1 , t) with processing time maxPT(M2 , t) - d, and q - 1 jobs with processing time maxPT(M2 , t - 1) - d + 1. Proof. If there were no migrations to M1 and M2 , the proof is similar to the proof of Lemma 2.4. If there were migrations to both M1 and M2 then by Lemma 2.3 both averages are the same. Suppose that a job migrated from machine M3 S (t - 1) to machine M4 S (t) at time t - 1. By the algorithm, the minimum average processing time among the machines in S (t - 1) at time t - 1 is avg(M3 , t - 1). By Lemma 2.3, avg(M4 , t) - (d - 1) + (d - 3)/(q + 1) = avg(M3 , t - 1); thus for every machine M S (t) S (t - 1), avg(M4 , t) - (d - 1) + (d - 3)/(q + 1) avg(M , t - 1). By the induction hypothesis for every machine M S (t) S (t - 1), avg(M , t - 1) - avg(M3 , t - 1) (d - 1) - (d - 2)/(q + 1). Substituting avg(M4 , t) - (d - 1) + (d - 3)/(q + 1) = avg(M3 , t - 1), we get avg(M , t - 1) avg(M4 , t) - 1/(q + 1). Since avg(M , t) = avg(M , t - 1) + 1/(q + 1), avg(M4 , t) - (d - 1) + (d - 2)/(q + 1) avg(M , t) avg(M4 , t). It follows that the bound on the average differences holds also if there were migrations to either M1 or M2 . Note also Lemma 2.4. For any two machines M1 , M2 F (t), at that if a job migrated at time t - 1, then a machine time t, avg(M2 , t) - avg(M1 , t) (d - 1) + 1/q . to which a job migrated at time t - 1 has the largest Proof. The proof is by induction. The induction basis average processing time in S (t) at time t. for t = 0 is trivial. Consider a time t > 0. If there were no migrations from M1 and M2 , i.e., M1 , M2 Lemma 2.6. For any two machines M1 F (t) and F (t) F (t - 1), then since avg(M1 , t) = avg(M1 , t - M2 S (t), at time t, avg(M2 , t) - avg(M1 , t) d - 1) + 1/q , and avg(M2 , t) = avg(M2 , t - 1) + 1/q , the 1 - (d - 1)/(q + 1). 986 Proof. The proof is again by induction. Clearly, it is enough to consider machines M1 = minM (F (t), t) and M2 = maxM (S (t), t). If both M1 F (t - 1) and M2 S (t - 1), then by Lemma 2.2, avg(M2 , t) - avg(M1 , t) < avg(M2 , t - 1) - avg(M1 , t - 1), and the bound follows by induction. If this is not the case, then either a job migrated from M1 or a job migrated to M2 at time t - 1. By Lemmas 2.4 and 2.5 in this case minM (F (t), t) is a machine from which a job migrated, and maxM (S (t), t) is a machine to which a job migrated, thus it must be that a job migrated from M1 and a job migrated to M2 at time t - 1. In this case by Lemma 2.3, avg(M2 , t) - avg(M1 , t) = d - 1 - (d - 1)/(q + 1). Lemma 2.7. For any two machines M1 and M2 at time t, maxPT(M2 , t) - minPT(M1 , t) d. a job migrates from machine M at time t - 1, M has the smallest average processing time in F (t). It follows that only a constant amount of work per migration is required to maintain the queues QS and QF . 2.3 Bounding the migration ratio To prove the bound on the migration ratio we use the following potential function. Let f (q , d) be any function of q and d, and let 1 2f (q , d) + j j tm n) (t) = J (F (t),t) (PT(j, t) - tm J (S (t),t) ( n - PT(j, t)) . Let M (t) denote the number of migrations up to time t. We show by a standard telescoping argument Proof. This bound is proved in Lemma 2.3, for that for a suitably chosen f (q , d), for each t, M1 S (t) and M2 F (t). For the rest of tr(m - r) the cases we prove it by induction. Again, the in+ O(nd). M (t) + (t) nf (q , d) duction basis t = 0 is trivial. If for some machine M , either minPT(M1 , t) minPT(M , t - 1) + This readily implies the bound on the migration ratio. 1 or maxPT(M2 , t) = maxPT(M , t - 1), then the For the upper bound we set f (q , d) = q (d - 1) + 1; in induction hypothesis implies that maxPT(M2 , t) - a later section we use a different function for the lower minPT(M1 , t) d. Note that this case includes bound proof. all machines that participated in a migration at time If a schedule has drift d, the minimum and maxit - 1. Thus, from now on we assume that M1 and mum processing times over all jobs differ by at most d, M2 have not participated in a migration at time t - and the fair share tm/n lies between the minimum and 1. If minPT(M1 , t) = minPT(M1 , t - 1), then there maximum. Thus each of the n terms in the two sumare at least two jobs in J (M1 , t - 1) with processing mations in (t) is bounded in absolute value by d and time minPT(M1 , t - 1) at time t - 1. Similarly, if for all t, |(t)| = O(nd). maxPT(M2 , t) = maxPT(M2 , t - 1) + 1, then all jobs Given the bound on (t), to prove the upper in J (M2 , t - 1) have processing time minPT(M2 , t - 1) bound on M (t) + (t) we have to show that µ(t) + at time t - 1. It follows from Lemma 2.1 that the pro- (t) r(m-r) , where µ(t) is the number of migrations nf (q ,d) cessing times of the jobs in J (M1 , t) (resp. J (M2 , t)) at performed in the migration phase at time t - 1, and time t differ by at most one. The bounds on the average (t) = (t) - (t - 1). We do this by considering the difference of Lemmas 2.4, 2.5 and 2.6 imply the lemma. effects of executing jobs and of migrations separately. From now on we drop the argument t and simply write for each change we consider. As indicated earlier 2.2 Termination and work p er migration We we consider migrations to occur at integer times, and note that in case avg(M2 , t - 1) - avg(M1 , t - 1) = d - 1 + an execution phase to occur for a unit of time between 2/(q + 1), Lemma 2.3 implies that after the migration integer times. and the execution phases, M1 F (t) and all the jobs in In the execution phase from time t - 1 to t, exactly J (M1 , t) have processing time maxPT(M2 , t - 1) - d + 1. r jobs out of the (q + 1)r jobs in J (S (t), t) execute (one The lemma also implies that M2 S (t) and J (M2 , t) per machine) and exactly m - r jobs out of the q (m - r) consists of q jobs with processing time maxPT(M2 , t - 1) jobs in J (F (t), t) execute. So we have and a single job with processing time maxPT(M2 , t - m m 1) - d + 1. It follows that avg(M1 , t) - avg(M2 , t) < = 2f (1 ,d) - r - q (m - r) + q n d - 1 + 2/(q + 1), and thus the migration phase is = m (q + 1)r - r guaranteed to terminate. n The proofs of Lemmas 2.4 and 2.5 imply that when 1 ((m - r)n - q (m - r)m+ 2nf (q ,d) a job migrates to machine M at time t - 1, M has (q + 1)rm - rn) . the largest average processing time in S (t), and when 987 Recall that n = q m + r and thus (m - r)n - = = = q (m - r)m + (q + 1)rm - rn (m - 2r)(n - q m) + rm r(m - 2r) + rm 2r(m - r). Theorem 3.1. For any d 1, any algorithm that produces a schedule with drift d has a migration ratio at least r(m - r)/(nq d) - o(1). Consider such a schedule with drift d. Without loss of generality assume that migrations are lazy, that is, a job migrates to a new machine immediately before it is first executed on that machine. This assumption is trivially justified by noting that any migration that does not satisfy the assumption can be delayed without changing the migration ratio or the times of any job executions. Without loss of generality assume that at time 0 each machine has either q or q + 1 jobs on it. This assumption is justified since any other initial configuration can be reached in O(nd) migrations, and this does not change the migration ratio. We use the same notation as in the previous section. In addition, we define for each machine M and time t, a set of jobs H (M , t) which is defined as the set of jobs whose home machine is M at time t. We note that H (M , t) may not be identical to J (M , t). We maintain the invariant that for any time t, there are r machines M with |H (M , t)| = q + 1 and m - r machines M with |H (M , t)| = q . Let HS (t) be the set of machines M with |H (M , t)| = q + 1 and HF (t) be the set of machines M with |H (M , t)| = q . Initially, sets H (M , t) are given by the job allocation at time 0. By our assumption the invariant holds for this initial configuration. For each time t, define an edge-labelled vertexweighted directed multigraph Gt = (M, Et , wt , Lt ). The vertex set M of Gt corresponds to the set of machines. The set of edges Et and their labels are defined as follows. For any two machines M1 and M2 , there are |H (M1 , t) J (M2 , t)| directed edges (M1 , M2 ). The label of each such edge is a distinct job in H (M1 , t) J (M2 , t). The set |H (M1 , t) J (M2 , t)| consists of those jobs whose home machine is M1 such that at time t, their most recent executions were on machine M2 . In G0 , the set of edges is empty and all vertex weights (which are actually counters, as we will see shortly) are set to zero, i.e., E1 = and w1 0. We maintain two properties of the graphs Gt . m-r ) It follows that = r(f (q,d) . n As seen above the inequality holds for any value of f (q , d). The change in the value of the potential function at the time of a migration phase determines the function we can use as f (q , d) and thus the performance of the algorithm. Since migrations are considered to occur instantaneously we must ensure that the change in the potential function during the migration phase plus the number of migrations made at that time moment is at most zero. Each migration at time t - 1 is from a machine M1 S (t - 1) to a machine M2 F (t - 1) and those two machines each switch sets, i.e., M1 F (t) and M2 S (t). Accordingly, all jobs in J (M1 , t - 1) except the migrating job switch sets from J (S (t - 1), t - 1) to J (F (t), t) and all jobs in J (M2 , t - 1) switch sets from J (F (t - 1), t - 1) to J (S (t), t). The migrating job does not switch sets and is in J (S (t - 1), t - 1) J (S (t), t). It follows that |J (F (t) \ F (t - 1), t)| = |J (S (t) \ S (t - 1), t)|. Since the same number of summands switch from one summation to the other and vice versa in (t) during the migration phase, we get that during the migration phase (but before the execution phase) at time t is j 1 · = J (F (t)\F (t-1),t) PT(j, t)- f (q , d) . j J (S (t)\S (t-1),t) PT(j, t) Consider any of the µ(t) pairs of machines M1 F (t) \ F (t - 1) and M2 S (t) \ S (t - 1), such that a job migrated from M1 to M2 . Let J2 = J (M2 ) J (S (t) \ S (t - 1), t), i.e., the q jobs that stay on M2 and switch from S (t - 1) to S (t) because of the migration. It follows from Lemma 2.3 that the processing times of the q jobs in J2 are minPT(M2 , t - 1), the processing times of q - 1 jobs in M1 are minPT(M2 , t - 1) - (d - 1), and the processing Degree Prop erty: For each vertex in Gt , either the time of the remaining job on M1 is minPT(M2 , t - 1) - d. indegree or outdegree is zero. In other words, there (Note that these jobs are in J (F (t) \ F (t= 1), t).) Thus, - j are no directed paths of length greater than one. j -q (d - 1) - 1. J (M1 ) PT(j, t) - J2 PT(j, t) Setting f (q , d) = q (d - 1) + 1 ensures that µ(t) + is Forbidden Edges Prop erty: There are no directed edges in Et from machines in HS (t) to machines at most zero. in HF (t). 3 Lower Bound We prove a lower bound on the migration ratio of any schedule with drift d. We need to show how these properties and the invariant are maintained moving from Gt-1 to Gt . Each migration at time t - 1 may induce a change in Gt and 988 the sets H (M , t), HS (t) and HF (t) relative to Gt-1 and the sets H (M , t - 1), HS (t - 1) and HF (t - 1). Initially, we assume that the sets remain unchanged when we advance from t - 1 to t. We consider one migration at a time. First, we show how to maintain the Degree Property and then we handle the Forbidden Edges Property. Suppose that at time t - 1 a job j migrates from machine M1 to machine M2 ; that is, j J (M1 , t - 1) J (M2 , t). If j H (M1 , t - 1), we add an edge to Et from M1 to M2 . Note that the Degree Property may be violated. We may introduce one of three illegal configurations: (1) a 2-cycle on the vertices M1 and M2 , (2) a path of length 2 that either starts at M1 or ends at M2 , and (3) a path of length 3 whose middle edge is (M1 , M2 ). If j H (M1 , t - 1), let j H (M0 , t - 1). We remove the edge (M0 , M1 ) and add the edge (M0 , M2 ). Again, the Degree Property may be violated. There are no edges incoming to M0 , but we may introduce a path of length two starting at M0 . We show how to modify Gt so that the Degree Property holds for any of the illegal configurations. Case 1: A 2-cycle on M1 and M2 . Assume that the labels of the edges are jobs j12 and j21 . Swap the home machines of j12 and j21 ; i.e., move job j12 from H (M1 , t) to H (M2 , t) and vice versa for j21 . Increment the counters wt (M1 ) and wt (M2 ) by 1. Accordingly, remove the edges of the 2-cycle. Case 2: A path M3 , M4 , M5 of length 2, where M3 = M1 , M3 = M0 , or M5 = M2 . Let the labels of the edges on the path be jobs j34 , and j45 . We introduce a "shortcut" as follows. Swap the home machines of j34 and j45 . Note that after the swap j34 is on its home machine. Increment the counter wt (M4 ) by 1. Accordingly, replace the path by the single edge (M3 , M5 ). Case 3: A path M3 , M1 , M2 , M4 of length 3. Assume that the labels of the edges on the path are jobs j31 , j12 , and j24 . We introduce a shortcut using a "circular" move of home machines. Namely, set the home machine of j31 to be M1 , set the home machine of j12 to be M2 , and set the home machine of j24 to be M3 . Note that after the move j31 and j12 are on their home machines. Increment the counters wt (M1 ) and wt (M2 ) by 1. Accordingly, replace the path by the single edge (M3 , M4 ). Observe that in all these cases we have not changed the cardinalities of the sets of jobs in the home machines and thus HS (t) and HF (t) remain the same. Suppose that after the modification there is an edge (M3 , M4 ) labelled by job j H (M3 , t) J (M4 , t), such that M3 HS (t) and M4 HF (t), violating the Forbidden Edges Property. Change the home machine of j from M3 to M4 . Accordingly, delete the edge (M3 , M4 ). Note that after the change |H (M3 , t)| = q and |H (M4 , t)| = q + 1. Thus, machine M3 moves from HS (t - 1) to HF (t), and vice versa for machine M4 . We call the migration that caused this change a transfer and increment a global counter T (t) (initially zero) by one. The counter T (t) is the number of transfers made up to time t. The construction guarantees that the total number of migratM ns made up to time t is lower bounded by io T (t) + M wt (M ) since for every migration at time t an edge in Et is created and every counter is incremented only when some edge is deleted, although sometimes we delete edges without incrementing counters. Proof of Theorem 3.1. We use a potential function (t) similar to the one used in the upper bound proof. Define j 1 tm (t) = J (HF (t),t) (PT(j, t) - n )+ 2f (q , d) . j tm J (HS (t),t) ( n - PT(j, t)) Let f (q , d) = q d. We show for any t 0, M t T (t) + wt (M ) + (t) r(m - r). nq d M This readily implies the lower bound on the migration ratio. As before, each of the n terms in the two summations in (t) is bounded in absolute value by d and for all t, |(t)| = O(nd). We have to show that µ(t) + (t) r(mq-r) , where nd M µ(t) = T (t) - T (t - 1) + M (wt (M ) - wt-1 (M )), and (t) = (t) - (t - 1). We do this by considering the effects of executing jobs and of migrations separately. From now on we drop the argument t and simply write for each change we consider. As indicated earlier we consider migrations to occur at integer times, and an execution phase to occur for a unit of time between integer times. Consider the execution phase starting at time t - 1. Since there are no edges from HS (t) to HF (t), no job whose home machine is in HS (t) is executed on a machine in HF (t). It follows that at least m - r jobs out of the q (m - r) jobs in J (HF (t), t) execute, and at most r jobs out of the (q + 1)r jobs in J (HS (t), t) execute. So we have ( 1 · m - r - q (m - r) m + n 2q d . (q + 1)r m - r n As before using the relation n = q m + r, we get (m - r)n - q (m - r)m + (q + 1)rm - rn = 2r(m - r), 989 and it follows that r(mq-r) . nd Consider now the change in potential caused by migrations at time t - 1. Since migrations are done at the start of each time window we need to show +µ(t) 0. We consider the change in potential after the modifications made to satisfy the Degree Property of Gt and then the change after the modifications made to satisfy the Forbidden Edges Property. Consider the modifications made to satisfy the Degree Property of Gt . If no shortcuts were done then is zero. A shortcut of either a 2-cycle or a path of length 2 may result in swapping a pair of jobs j1 and j2 between machines in HS (t) and HF (t). Since |PT(j1 , t) - PT(j2 , t)| d the effect of this swap on is lower bounded by -d/(q d) -1. Since at least one vertex counter is incremented as a result of the shortcut we have + µ(t) 0. A shortcut of a path of length 3 may result in swapping up to two pairs of jobs between machines in HS (t) and HF (t). Since |PT(j1 , t) - PT(j2 , t)| d the effect of these swaps on is lower bounded by -1 times the number of counter increments and thus + µ(t) 0. Finally, consider a transfer made to satisfy the Forbidden Edges Property of Gt . This transfer causes q pairs of jobs to swap between machines in HS (t) and HF (t). The change in the potential function is lower bounded by -2q d/(2q d) = -1. Since T (t) is incremented, + µ(t) 0. 4 Tightness of the b ounds We present an algorithm of mainly theoretical interest that matches the lower bound of Section 3 when m = 2r and d 2q . Note that if m = 2r we may assume m = 2 since any instance for a larger m can be broken into m/2 identical subproblems. For m = 2 and r = 1, the function f (q , d) in the upper bound proof can be improved to f (q , d) = q (d - 1) + 1 + min( d/2 , q - 1). The potential function proof of Theorem 2.1 is modified for this improved value by using the following trick. For simplicity, suppose d 2q ; the other case is similar. We run the algorithm as before until time t, for which minPT(M1 , t) = maxPT(M1 , t) and minPT(M2 , t) = maxPT(M2 , t) = minPT(M1 , t) + d - 1. At this point we let M1 execute a single job j exclusively for q steps, and then migrate this job. Note that at time t + q , PT(j, t + q ) = minPT(M1 , t) + q and the processing time of the rest of the jobs on M1 is minPT(M1 , t). After the migration, the migrating job j executes exclusively for q steps on M2 . Since d 2q , we get that maxPT(M2 , t + 2q ) minPT(M1 , t) + d and minPT(M1 , t + 2q ) minPT(M1 , t). Moreover, at most q + 1 jobs on M2 have processing time minPT(M1 , t) + d at time t + 2q and at most d - q jobs on M1 have processing time minPT(M1 , t) at time t + 2q . This implies that if the execution of the original algorithm is resumed the drift would be bounded by d. Note that the difference in processing times at the time of each migration is j j PT(j, t) - PT(j, t) J (F (t)\F (t-1),t) J (S (t)\S (t-1),t) = qd as needed to show the migration ratio is for d 2q . r (m-r ) nq d + o(1) 5 Conclusion We have presented a model for fair scheduling of a multiprocessor allowing the minimization of process migration sub ject to a bound on fairness or vice versa. We have given a simple and practical algorithm and shown that it is near optimal for a static set of jobs in the case that all jobs' processing shares are equal. The algorithm requires only a constant amount of work per scheduling operation and a constant amount of work per migration; this work is distributed evenly across the machines. Finally, we note that our algorithm can be easily modified to allow new jobs to enter the system and completing jobs to exit, under a suitable definition of fair share for a newly arriving job, although these operations may require more than constant time. References [1] J. Anderson and A. Srinivasan. Towards a More Efficient and Flexible Pfair Scheduling Framework. 20th IEEE Real-time Systems Symposium (RTSS), Dec. 1999, pp. 46-50. [2] S.K. Baruah. Scheduling Periodic Tasks on Uniform Multiprocessors. 12th Euromicro Conference on RealTime Systems (Euromicro-RTS), 2000, pp. 7-14. [3] S.K. Baruah. Dynamic- and static-priority scheduling of recurring real-time tasks. Real-time Systems, Vol. 24 No. 1, 2003, pp. 93-128. [4] S. Baruah and J. Carpenter. Multiprocessor fixedpriority scheduling with restricted interprocessor migrations. 15th Euromicro Conference on Real-Time Systems (Euromicro-RTS), 2003. [5] S. Baruah, N. Cohen, G. Plaxton and D. Varvel. Proportionate progress: A notion of fairness in resource allocation. Algorithmica Vol. 15, No. 6, June 1996, pp. 600-625. [6] S. Baruah, J. Gehrke and G. Plaxton. Fast scheduling of periodic tasks on multiple resources. Int. Paral lel Processing Symposium (IPPS), April 1995, pp. 280288. [7] S. Baruah, J. Gehrke, G. Plaxton, I. Stoica, H. AbdelWahab and K. Jeffay. Fair on-line scheduling of a dynamic set of tasks on a single resource. Information Processing Letters, Vol. 64, No. 1, Oct. 1997, pp 43-51. 990 ¨ [8] J. Bruno, E. Gabber, B. Ozden and A. Silberschatz. The Eclipse Operating System: Providing Quality of Service via Reservation Domains. 1998 USENIX Annual Technical Conference, pp. 235-246. [9] A. Bar-Noy, A. Mayer, B. Schieber and M. Sudan. Guaranteeing fair service to persistent dependent tasks SIAM J. on Computing, Vol. 24, 1998, pp. 1168-1189. [10] A. Chandra, M. Adler and P.J. Shenoy. Deadline Fair Scheduling: Bridging the Theory and Practice of Proportionate Fair Scheduling in Multiprocessor Systems. IEEE Real Time Technology and Applications Symposium. 2001, pp. 3-14. [11] K.J. Duda and D.R. Cheriton. Borrowed-virtualtime (BVT) scheduling: supporting latency-sensitive threads in a general-purpose schedular. Symposium on Operating Systems Principles (SOSP). 1999, pp. 261276. [12] S. Harizopoulos and A. Ailamaki. Affinity Scheduling in Staged Server Architectures. Technical Report CMU-CS-02-113, March 2002. [13] Autonomic Computing: IBM's perspective on the state of Information Technology. http://www.research.ibm.com/autonomic/manifesto/. [14] M.B. Jones, D. Rosu and M.C. Rosu. CPU Reservations and Time Constraints: Efficient, Predictable Scheduling of Independent Activities. Symposium on Operating Systems Principles pp. 198-211, 1997. [15] C.L. Liu. Scheduling Algorithms for Hard-Real-Time Multiprogramming of a Single Processor. JPL Space Program Summary, Vol. 2, Nov. 1969, pp. 37-60. [16] C.L. Liu and J.W. Layland. Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment. Journal of the ACM, Vol. 20, No. 1, January 1973, pp. 46-61. [17] C.P. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M.S. Lam and M. Rosenblum. Optimizing the Migration of Virtual Computers. 5th Symp. on Operating Systems Design and Implementation, Dec. 2002. [18] A. Srinivasan, P. Holman and J. Anderson. Integrating Aperiodic and Recurrent Tasks on Fair-scheduled Multiprocessors. 14th Euromicro Conference on Real-Time Systems (Euromicro-RTS), 2002. [19] A. Srinivasan, P. Holman, J.H. Anderson and S. Baruah. The Case for Fair Multiprocessor Scheduling 1th International Workshop on Paral lel and Distributed Real-Time Systems, April 2003. [20] C.A. Waldspurger and W.E. Weihl. Stride Scheduling: Deterministic Proportional-Share Resource Mangement. Technical Memorandum MIT/LCS/TM-528, MIT Laboratory for Computer Science, June 1995. 991