A Provisioning Model and its Comparison with Best-Effort for Performance-Cost Optimization in Grids

Gurmeet Singh, Carl Kesselman, Ewa Deelman
Information Sciences Institute, Marina Del Rey, CA 90292
{gurmeet, carl, deelman}@isi.edu

ABSTRACT
The resource availability in Grids is generally unpredictable due to the autonomous and shared nature of the Grid resources and the stochastic nature of the workload, resulting in a best-effort quality of service. The resource providers optimize for throughput and utilization whereas the users optimize for application performance. We present a cost-based model where the providers advertise resource availability to the user community. We also present a multi-objective genetic algorithm formulation for selecting the set of resources to be provisioned that optimizes the application performance while minimizing the resource costs. We use trace-based simulations to compare the application performance and cost of the provisioned and the best-effort approaches with a number of artificially generated workflow-structured applications and a seismic hazard application from the earthquake science community. The provisioned approach shows promising results when the resources are under high utilization and/or the applications have significant resource requirements.

Categories and Subject Descriptors
C.4 [Performance of Systems]: Performance attributes; Reliability, availability, and serviceability. D.4.8 [Operating Systems]: Performance - Modeling and prediction, Simulation.

General Terms: Algorithms, Management, Performance

Keywords: Resource provisioning, Best effort service, Agreement based resource management.

1. INTRODUCTION
Shared distributed infrastructures such as the TeraGrid [1], the Open Science Grid [2], and others provide the software and hardware resources for cross-organizational collaborative work in different scientific fields such as astronomy [3], high energy physics [4], earthquake science [5], etc. Such collaborations, also called Virtual Organizations (VOs) [6], require resources from multiple providers for executing large-scale scientific applications. Furthermore, emerging classes of deadline-driven scientific applications such as severe weather modeling [7] require simultaneous access to multiple resources and predictable completion times. Most of the Grid resources are managed using queuing-based local resource management systems such as PBS [8], LSF [9], Condor [10], etc. Due to the shared nature of these resources and the dynamic nature of the workload, the completion time of the application tasks on these resources is not predictable. Moreover, the applications require co-allocation of multiple resources, for example storage and network capacity for storing the input data and the generated results, communicating data between executing tasks, etc. In the absence of more explicit control, resources are offered to an application on a best-effort basis with little or no influence over exactly when the resource will be delivered to the application. This makes it difficult to predetermine, much less optimize, the resulting application performance and to meet deadline constraints and performance goals.

Agreement-based resource management [11] is an alternative to best-effort execution for overcoming these deficiencies. The agreement-based management model allows VO-level resource brokers to provision resources ahead of the execution of the applications by entering into agreements with the resource providers about the guaranteed availability of desired resources for a mutually agreed upon timeframe and cost. While there can be various approaches for creating agreements, in this paper we use a model where the providers create resource offers or slots and advertise them to the general community periodically or on demand. Each slot represents the availability of a certain resource capability, such as a number of processors, for a certain timeframe (start time and duration), for a certain cost, and can be provisioned independently of any other offered slot, i.e. we don't assume overbooking of resources. Once a slot has been provisioned, it can be used to execute any workload, e.g. a set of tasks, subject to the capacity constraints of the slot, without any further interaction with the provider.

While slot-based management significantly reduces the unpredictability in application performance over best-effort execution of applications, it introduces a new issue: how to identify and select a set of slots to provision for an application that would allow the execution of the application while optimizing the application performance and minimizing the cost of the provisioned resources from the perspective of the user. While there can be many performance objectives, such as execution time, reliability, etc., in this paper we use the execution time of the application, e.g. the makespan of a workflow, as the performance metric to be minimized. The slot selection problem is non-trivial because of the possibly large number of resource slots with different capabilities on offer from various providers at the time an application is submitted, and because the resource providers are free to ask any price for an offered slot. In previous work [12], we explored the problem with two algorithms for minimizing the weighted sum of the two objectives. However, since the objectives are generally non-commensurate, in this paper we create a set of Pareto-optimal solutions using a multi-objective genetic algorithm (MOGA) formulation [13] of the problem and then select one solution from this set using a specified trade-off factor and a normalized objective function. We also examine how the application performance with resource provisioning compares with the application performance under the best-effort quality of service, using trace-based simulations and workload logs from a supercomputer center. In addition to the application performance, we also compare the cost of the two approaches.

The rest of the paper is structured as follows. Section 2 presents the provisioning model. Section 3 discusses the principle of Pareto optimality and describes a heuristic for the provisioning problem and some initial results.
Section 4 presents the design of a Grid scheduler that supports provisioning. Sections 5 and 6 compare the performance of the best-effort and provisioned approaches using a synthetic and a real application. Section 7 presents the discussion, followed by related work in Section 8 and conclusions in Section 9.

2. PROVISIONING MODEL
We assume the system to be composed of r = 1...R Grid sites. The latency and bandwidth between these sites are assumed to be known. Each site advertises its resource availability in the form of a set of resource slots or offers. While the model can be extended to other types of resources, in this paper we focus our attention on compute resources only. Each slot represents the availability of a certain number of processors for a certain duration, starting at a certain time, at a certain cost. The set of resource slots available from resource r is denoted by Er:

Er = {<s1, d1, n1, c1, f1, b1, e1>, ..., <si, di, ni, ci, fi, bi, ei>, ...}   (Eq 1)

where <si, di, ni, ci, fi, bi, ei> represents the availability of ni processors for duration di starting at time si from resource r. The term ci is called the multiplicative cost and represents the per-unit charge for the resource; the term fi is called the additive cost and represents the overhead cost of provisioning the resource. The total cost of the slot is (ci × ni × di + fi). bi is a boolean indicating whether the slot has to be provisioned in its entirety or can be partially provisioned (bi = true: divisible, bi = false: non-divisible). ei indicates whether the duration of the slot can be extended past the specified end time (si + di), i.e. in case we want to provision resources for tasks with longer runtimes than the specified slot duration. The total cost is always calculated based on the actual number of provisioned processors and the actual duration, as allowed by the extensibility and divisibility attributes. The set of resource slots available from resource r between times t0 and tn is denoted by Er,t0,tn:

Er,t0,tn = {<si, di, ni, ci, fi, bi, ei> ∈ Er | t0 ≤ si ≤ tn}   (Eq 2)

The global resource availability, denoted by Â0,t, is the combination of the aggregated resource availability of each resource from the current time to some time t in the future:

Â0,t = ∪r=1..R Er,0,t   (Eq 3)
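For concreteness, the slot tuple of Eq 1, the time-window filter of Eq 2, and the slot cost can be written down directly. The following is a minimal illustrative Python sketch; the names Slot and slots_between are our own and do not refer to an actual implementation:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Slot:
        """One resource offer <s_i, d_i, n_i, c_i, f_i, b_i, e_i> from Eq 1."""
        start: float       # s_i: start time
        duration: float    # d_i: duration
        nprocs: int        # n_i: number of processors
        mult_cost: float   # c_i: multiplicative (per-unit) cost
        add_cost: float    # f_i: additive (overhead) cost
        divisible: bool    # b_i: may be partially provisioned
        extensible: bool   # e_i: may be extended past start + duration

        def total_cost(self) -> float:
            # Cost of provisioning the slot in its entirety:
            # c_i x n_i x d_i + f_i
            return self.mult_cost * self.nprocs * self.duration + self.add_cost

    def slots_between(offers: List[Slot], t0: float, tn: float) -> List[Slot]:
        """E_{r,t0,tn} of Eq 2: the offers whose start time falls in [t0, tn]."""
        return [s for s in offers if t0 <= s.start <= tn]

The global availability Â0,t of Eq 3 is then simply the union of these filtered sets over all R sites.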
Application Model
For the purposes of evaluation we assume that an application can be represented as a set of tasks with dependencies, also known as a workflow or a DAG (directed acyclic graph). We note that our techniques can be applied to other application models as well; however, within this paper we use the terms application and workflow interchangeably. The task runtimes on each of the computational resources are assumed to be known. The runtimes could be estimated analytically using component performance modeling [14] or empirically using historical information. The amount of data transferred between each parent and child task in the application is assumed to be known. If the parent and child tasks are executed on the same site, the data transfer time is assumed to be zero. Otherwise, the data transfer time can be computed using the bandwidth and latency information.

Resource Provisioning Problem
The goal of resource provisioning is to identify a subset â of Â0,t such that the application makespan and the provisioning costs are minimized. The subset â is also called the resource plan. The provisioning cost of the resource plan â is termed the allocation cost and is denoted by AC(â). The allocation cost is defined as

AC(â) = Σ_{i ∈ â} (ni × di × ci + fi)   (Eq 4)

The allocation cost represents the total payment for allocating the slots in the plan from the providers. Minimizing the allocation cost encourages the efficient utilization of the provisioned resources, since any unused capacity represents an unnecessary addition to the allocation cost. The application makespan on â is termed the scheduling cost of the plan, SC(â):

SC(â) = makespan of the application over â   (Eq 5)

In order to determine the makespan of the application over the resource plan â, a scheduling algorithm is required. Since the resource availability and the application task runtimes are known deterministically, the schedule of the application tasks over â can be computed using any DAG scheduling heuristic. We use the Heterogeneous Earliest Finish Time (HEFT) heuristic [15], which is discussed in Section 3 and is regarded as a good scheduling heuristic. A resource plan is called infeasible if the application cannot be completely scheduled onto it; its scheduling cost is then considered infinite. The resource provisioning problem faced by the resource broker is the following multi-objective optimization problem:

min_{â ⊆ Â0,t} [ AC(â), SC(â) ]   (Eq 6)

where both the allocation cost AC(â) and the scheduling cost SC(â) are sought to be minimized. In the next section, we describe the heuristic for obtaining a solution using the concept of Pareto-optimality.

3. CONCEPT OF PARETO OPTIMALITY AND MOGA EVALUATION
While two different solutions can be directly compared to each other based on the value of a single objective, it is not possible to do so in the case of multiple objectives. For example, a particular resource plan â1 may have a lower allocation cost and a higher scheduling cost than another resource plan â2. Thus â1 and â2 cannot be directly compared with each other. However, if â1 and â2 both have a lower allocation cost and a lower scheduling cost than a third resource plan â3, then both â1 and â2 are superior to â3. We use the concept of domination from [16] in order to compare two resource plans in the context of our optimization problem. A resource plan âi is said to dominate another resource plan âj if both condition 1 and condition 2 are true:

1. The allocation cost and scheduling cost of âi are no worse than those of âj, i.e. AC(âi) ≤ AC(âj) and SC(âi) ≤ SC(âj).
2. The solution âi is strictly better than âj in at least one objective, i.e. either AC(âi) < AC(âj) or SC(âi) < SC(âj) or both.

If either of the above conditions is violated, âi does not dominate âj. Given the entire set of solutions to a multi-objective problem, the set of solutions that are not dominated by any other solution in the entire set is known as the Pareto-optimal set. A multi-objective problem can have multiple solutions represented by the Pareto-optimal set, instead of a single solution as would be the case in a single-objective optimization problem. However, since a single solution is usually required for implementation, the most common approach in multi-objective optimization problems is to create a weighted sum of the objectives as the single objective. The Pareto-optimal set allows the user to normalize the objective function values by defining the following normalization functions:

ACnr(â) = (AC(â) - ACmin) / (ACmax - ACmin),  SCnr(â) = (SC(â) - SCmin) / (SCmax - SCmin)   (Eq 7)

where ACmin (ACmax) is the minimum (maximum) allocation cost and SCmin (SCmax) is the minimum (maximum) scheduling cost of any solution in the Pareto-optimal set.

The approach that we have taken in this paper is to create the Pareto-optimal set of solutions (PO) and then select a solution from the set using the following objective function and a trade-off factor α:

min_{â ∈ PO} { α × ACnr(â) + (1 - α) × SCnr(â) }   (Eq 8)

The trade-off factor α represents the user's preference between the allocation and the scheduling cost. Using this approach, multiple solutions representing different trade-offs can be selected from the Pareto-optimal set without any extra effort. There has been a tremendous amount of work in the last decade on using GAs for multi-objective optimization where the goal is to find the Pareto-optimal set [16]. In this paper, we use a multi-objective GA (MOGA) [13] because of its simplicity and the fact that it operates in the objective variable space rather than the decision variable space (explained below).
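Before turning to the MOGA formulation, the domination test and the normalized selection of Eq 7 and Eq 8 can be stated compactly. The sketch below is illustrative only; plans are reduced to their (AC, SC) objective pairs, and the helper names are our own:

    def dominates(ac_i, sc_i, ac_j, sc_j):
        """Plan i dominates plan j: no worse in both objectives
        (condition 1) and strictly better in at least one (condition 2)."""
        return ac_i <= ac_j and sc_i <= sc_j and (ac_i < ac_j or sc_i < sc_j)

    def select_tradeoff(pareto, alpha):
        """Select one plan from the Pareto-optimal set using the normalized
        objective of Eq 8; `pareto` is a non-empty list of (AC, SC) pairs
        and alpha is the trade-off factor in [0, 1]."""
        ac_lo, ac_hi = min(p[0] for p in pareto), max(p[0] for p in pareto)
        sc_lo, sc_hi = min(p[1] for p in pareto), max(p[1] for p in pareto)

        def norm(x, lo, hi):           # Eq 7; a constant objective maps to 0
            return 0.0 if hi == lo else (x - lo) / (hi - lo)

        return min(pareto, key=lambda p: alpha * norm(p[0], ac_lo, ac_hi)
                                         + (1 - alpha) * norm(p[1], sc_lo, sc_hi))

Because the normalization uses only the extreme costs within the Pareto-optimal set, re-evaluating Eq 8 for a different α requires no further scheduling work.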
Multi-objective Genetic Algorithm (MOGA) formulation
A resource plan in MOGA is encoded as an n-bit binary number where n = |Â0,t|. Each bit in the number represents a slot in Â0,t: if the bit is 1, the slot is included in the plan, otherwise it is not. The fact that MOGA operates in the objective variable space implies that it uses the allocation cost and scheduling cost values of resource plans for determining their fitness, rather than the absolute value of the binary number representing the plan, which doesn't represent any meaningful value. The scheduling cost of a resource plan is calculated as the makespan of the application as determined by applying a specified scheduling algorithm, such as HEFT, to the workflow and slots. The allocation cost is computed from the size, duration, and multiplicative and additive costs of each slot in the plan (Equation 4).

The detailed working of MOGA is described in [16]; a high-level description follows. MOGA, like any other GA, operates on a population of a certain number of solutions. Each solution is a resource plan. The initial population is created by randomly picking numbers between 0 and (2^n - 1) and creating resource plans based on the binary representation of the numbers. The allocation and scheduling cost of each solution in the population is evaluated, and the non-dominated members of this initial population are added to a Pareto-optimal set. MOGA goes through a specified number of iterations, where new populations are generated and evaluated and the Pareto-optimal set is updated. At the end of the specified number of iterations, the current Pareto-optimal set is returned as the solution. Note that the Pareto-optimal solutions created by MOGA are only an approximation to the real set of Pareto-optimal solutions, since MOGA, like any other GA, is a stochastic algorithm and cannot guarantee optimality.
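The outer loop just described can be sketched as follows. This is a simplified illustration rather than the implementation evaluated in this paper: the `evaluate` callback, which maps a bitstring plan to its (AC, SC) pair by running the scheduling heuristic and Eq 4, is assumed, and the fitness assignment and selection scheme of [13] is reduced to random parent choice:

    import random

    def dominates(ac_i, sc_i, ac_j, sc_j):  # as in the sketch above
        return ac_i <= ac_j and sc_i <= sc_j and (ac_i < ac_j or sc_i < sc_j)

    def moga(n_slots, evaluate, pop_size=10, iterations=10):
        """Skeleton MOGA loop: n-bit plans, 2-point crossover, bit-flip
        mutation with probability 1/n, and a running non-dominated archive.
        Infeasible plans may carry SC = float('inf')."""
        p_mut = 1.0 / n_slots
        pop = [[random.randint(0, 1) for _ in range(n_slots)]
               for _ in range(pop_size)]
        archive = []  # list of (plan, (AC, SC)) pairs

        def record(plan, objs):
            nonlocal archive
            if any(dominates(o[0], o[1], objs[0], objs[1]) for _, o in archive):
                return  # dominated by an archived plan
            archive = [(p, o) for p, o in archive
                       if not dominates(objs[0], objs[1], o[0], o[1])]
            archive.append((plan[:], objs))

        for plan in pop:
            record(plan, evaluate(plan))
        for _ in range(iterations):
            nxt = []
            for _ in range(pop_size):
                a, b = random.sample(pop, 2)
                c1, c2 = sorted(random.sample(range(n_slots), 2))
                child = a[:c1] + b[c1:c2] + a[c2:]          # 2-point crossover
                child = [bit ^ (random.random() < p_mut)    # 1/n mutation
                         for bit in child]
                record(child, evaluate(child))
                nxt.append(child)
            pop = nxt
        return archive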
Heterogeneous Earliest Finish Time (HEFT) scheduling heuristic
HEFT [15] has been found to perform better than other scheduling heuristics in some studies [17] and also in our experiments (later in this section). HEFT was originally developed for scheduling task graphs on heterogeneous dedicated multiprocessing systems. It works by assigning a rank to each task in the application and then scheduling tasks based on the rank. While in the original HEFT [15] a task can be scheduled on any processor, in our case it may not be possible to schedule a task on a slot if there isn't sufficient overlap between the timeframe when the slot is active and the possible execution time of the task. Thus additional checks are required to determine the feasibility of running a task on a slot. If HEFT fails to schedule a task on any slot, then the resource plan is considered infeasible and its scheduling cost is considered infinite.

MOGA experiments
In order to evaluate MOGA, we created a set of slots representing four Grid sites, where each site is a cluster of 10, 20, 30, and 40 processors respectively. Slots for each site are generated using a Poisson process with a mean slot runtime of 1000 seconds and a mean inter-arrival time of 1000 seconds. For each site, the start time of the first slot is zero; the start time of each subsequent slot is equal to the start time of the previous slot plus the inter-arrival time. The number of processors in each slot is randomly generated from 1 to the maximum number of processors on the resource. The set Â0,t is generated using t = 30000, giving |Â0,t| = 124. All the slots have ci = 1, fi = 0, bi = false, ei = false. The application is generated using a parameterized task graph generator. The application consists of 100 tasks with an average runtime of 100 seconds (less than the 1000 s average slot runtime), which ensures that the tasks can be easily scheduled on the slots. The amount of data transferred between the parent and child tasks is such that the average data transfer time is of the same magnitude as the average runtime of the tasks. We instrumented MOGA to record all feasible solutions encountered. While there were 1742 feasible solutions for this particular problem, only 40 of them were Pareto-optimal (Figure 1).

[Figure 1. The set of all feasible and Pareto solutions: allocation cost versus scheduling cost for all solutions and the Pareto solutions.]

Figure 2 shows the makespan and the allocation cost of the solution selected from the Pareto-optimal set using different values of α. The value of the trade-off factor is shown on the X-axis, while the makespan and allocation cost are plotted against the Y1 (left) and Y2 (right) axes respectively. Figure 2 shows a clear trade-off between the allocation and scheduling costs. We can see that using the combined metric, we can achieve a considerable reduction in the allocation cost for a small increase in the scheduling cost.

[Figure 2. The scheduling and allocation cost of solutions selected from the Pareto set using different α.]

We also evaluated three other scheduling algorithms in addition to HEFT. The first one was a low-complexity greedy scheduling algorithm that creates a topologically sorted list of the tasks in the application; it then goes through each task in the list, scheduling it on a slot that will minimize the current makespan. The other algorithms are the Min-Min and the Max-Min algorithms as described in [14]. These algorithms partition the application into levels, where each level is a set of independent tasks. At each level, the Min-Min and Max-Min algorithms are used to schedule the tasks onto slots. We created the set of Pareto-optimal solutions using each of the scheduling algorithms and then combined them. In the combined set, the dominated solutions were purged, leaving behind 39 non-dominated solutions. Out of these 39 non-dominated solutions, Figure 3 shows the number of solutions found by each of the four scheduling algorithms. Clearly, most of the non-dominated solutions were found using the HEFT algorithm. While all the scheduling algorithms found similar numbers of solutions, the solutions found by HEFT dominated the solutions found using the other scheduling algorithms, showing its superiority. HEFT has also been shown to perform better than other myopic or evolutionary-based workflow scheduling strategies in Grid environments [17]. Hence, we use HEFT as the scheduling policy for all the experiments described in this paper.

[Figure 3. Effect of scheduling policy (uniform slot pricing): number of non-dominated solutions found by HEFT, Greedy, Min-Min, and Max-Min.]

We also performed experiments to determine the effect of different MOGA parameters on the solution quality. The experiments compared the solutions using the domination principle, as described previously for the scheduler policy experiment. The population size was found to have a greater effect than the number of iterations. This can be understood by the fact that in the initial population, each member represents a unique solution. All later solutions are generated from this initial population using different genetic operators, so their solution quality is related to the solution quality of the initial population. Hence, in situations where the runtime complexity of MOGA becomes important, we prefer to have a bigger population size and a smaller number of iterations. Also, we found the 2-point crossover operator to be more effective than a 1-point or a 3-point crossover operator. Similarly, mutation with a mutation probability of 1/n, where n is the total number of slots, was found helpful in finding a diverse set of Pareto-optimal solutions.

4. DESIGN OF A SLOT GENERATING LOCAL SCHEDULER
In this section, we describe extensions that allow a conservative-backfilling-based scheduler to advertise available resources in the form of free slots. Conservative backfilling is an optimization to the first-come-first-serve scheduling policy that allows a task to be started earlier if it would not delay the start time of any task that arrived before it [18]. When a task is put in the queue, the scheduler allocates resources for it, i.e. creates a reservation at the earliest possible time. In order for the non-delaying guarantee to work, we do not move the reservations forward in time if a task completes earlier than its specified runtime, and tasks are not allowed to execute past their specified runtime either. Thus for all practical purposes, the expiration time of the reservation for the task is considered the task completion time. The set of reservations for the currently running and queued tasks represents the site schedule. The scheduler can then determine the free slots in its schedule by creating windows for each processor that show when the processor is free, from the current time to some time in the future called the slot horizon. The slot horizon is strictly larger than the end time of any current reservation, thereby guaranteeing that each processor has at least one free window. For this paper, we use a slot horizon that, in addition, falls on a 24-hour boundary from the current time in order to model availability in multiples of 24-hour periods. Each window is a tuple <processor id, start time, end time>. Windows across processors having the same start and end time are consolidated into free slots and advertised periodically or on demand.
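The window consolidation step can be illustrated with a few lines of Python (an illustrative sketch, not the scheduler's actual code):

    from collections import defaultdict

    def consolidate(windows):
        """Group per-processor free windows (proc_id, start, end) that share
        the same start and end time into slots (start, end, processor count)."""
        groups = defaultdict(list)
        for proc, start, end in windows:
            groups[(start, end)].append(proc)
        return [(start, end, len(procs))
                for (start, end), procs in sorted(groups.items())]

    # Two single-processor windows over [2, 6) merge into one 2-processor slot:
    print(consolidate([(1, 2, 6), (2, 2, 6), (3, 0, 4)]))
    # -> [(0, 4, 1), (2, 6, 2)]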
The set of free slots is affected when a new task is submitted and resources are allocated to the task from those currently available. In order to illustrate the slot formation process, Figure 4 shows a simple site schedule with three reservations (Resv A, Resv B, Resv C) for submitted tasks A, B, and C respectively. The current time is represented by the leftmost end, with the future moving to the right. The slot horizon is at the end of time 6, while the maximum scheduled end time of any reservation is time 4 (for task C). For each processor, we create windows of free time from the current time to the slot horizon based on the current schedule. The scheduler then consolidates windows with the same start and end time into slots, resulting in the six free slots shown in the figure.

[Figure 4. Free slots in the resource schedule: a 5-processor schedule over times 1-6 with reservations Resv A, Resv B, and Resv C, and the six resulting free slots (Slot 1 - Slot 6).]

In order to reduce the time required to create the set of free slots, the scheduler always maintains this set, kept sorted by the start time of the slots in ascending order. This set is updated every time a task is submitted. When the end time of a slot is followed by the start time of a reservation on the processors constituting the slot, the slot is termed indivisible and non-extensible (bi = false, ei = false, Section 2). When the end time of a slot coincides with the slot horizon, the slot is termed divisible and extensible (bi = true, ei = true). By making these slots extensible, the scheduler is able to advertise future resource availability beyond the slot horizon as well. While the extensibility constraints can be easily understood, the divisibility constraints are meant to reduce fragmentation of the resources. Best-effort jobs get better treatment from the scheduler, since a best-effort job can get backfilled onto an indivisible slot paying only for the resources requested, whereas if the same job were run by provisioning that indivisible slot, payment for the whole slot would have to be made regardless of usage. Divisible and extensible slots are more attractive for the provisioned approach but tend to lie farther in the future in this design. All the slots are advertised with a multiplicative cost of 1 and an additive cost of 0.
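The slot horizon and the attribute rules above can likewise be made concrete. In the sketch below (our own helper names, with times assumed in seconds), the horizon lies strictly beyond every reservation end time and on a 24-hour boundary from the current time, and only slots reaching the horizon are marked divisible and extensible:

    import math

    DAY = 24 * 3600

    def slot_horizon(now, reservations):
        """Smallest 24-hour multiple from `now` strictly beyond every
        reservation end time; `reservations` is a list of (start, end)."""
        latest = max((end for _, end in reservations), default=now)
        return now + (math.floor((latest - now) / DAY) + 1) * DAY

    def slot_attributes(slot_end, horizon):
        """A slot truncated by a later reservation is indivisible and
        non-extensible; one reaching the horizon is divisible and extensible."""
        at_horizon = slot_end >= horizon
        return {"divisible": at_horizon, "extensible": at_horizon}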
5. COMPARISON OF BEST EFFORT AND PROVISIONED APPROACH USING TRACE SIMULATION
In this section, we compare the application performance (makespan) when the application is executed using the best-effort and the provisioned approach. In addition, we also compare the resource costs of the two approaches from the perspective of the user. We simulated a Grid site representing a 430-node cluster. The workload simulated on the cluster was based on logs from the 512-node IBM SP2 at the Cornell Theory Center (CTC) (430 nodes operational) obtained from the Parallel Workloads Archive [19]. We have also done experiments simulating another Grid site and found similar results [20]; however, due to space constraints, we only present results for the CTC cluster. The simulator was a modified version of the GridSim simulator [21] with extensions for parallel job scheduling and conservative backfilling. We use the actual runtime of the tasks as reported in the logs as their requested runtime, with the result that the end time of the jobs is also known in advance. The implication of using the requested runtime is discussed in Section 7.

Executing an application using best-effort scheduling was done by simulating a just-in-time scheduling policy. Each task in the application was submitted to the resource queue when it became ready for execution, i.e. when all its parent tasks had finished execution. The time difference between the time when the first task in the application was submitted to the resource queue and the time when the last task finished execution is taken as the best-effort makespan of the application. The allocation cost of best-effort execution is the sum, over all the tasks in the application, of the runtime of each task multiplied by the number of processors it requires:

ACbest-effort = Σ_{vi ∈ V} (vit × vip)   (Eq 9)

where V is the set of all tasks in the application (v1, ..., vn), vit is the runtime of task vi, and vip is the number of processors required by task vi. This allocation cost reflects the fact that in best-effort execution, as per current practice, one only pays for the resources used by the tasks during their execution. Thus the allocation cost for an application is constant for best-effort execution and is a lower bound on the allocation cost of the provisioned approach.

For the experiments, a synthetic workflow is generated using a parameterized task graph generator. The application consists of 100 tasks with an average runtime of 1000 seconds. The workflow consists of 10 levels (the depth of the workflow) with 10 tasks at each level on average. The data transfer time between the tasks is zero since all tasks are executed on the same resource. The average number of processors per task is 22, making the average width of the application half of the maximum number of processors on the resource. We further modified the HEFT algorithm to do cross-slot scheduling. This implies that if a slot does not have enough processors to schedule a task, then HEFT will consider combinations of slots that overlap in time to schedule the task. Furthermore, it uses the extensibility attributes of slots to extend them for scheduling tasks with longer runtimes than the specified slot duration. After the schedule is completed, for divisible slots, we allocate only those parts of the slots where tasks are scheduled, leading to lower allocation costs.

The client queries the set of free slots (using t = ∞) when the application is submitted and generates a resource plan using MOGA and a given trade-off factor α. We compute and record the makespan and the allocation cost of this resource plan. The application is then executed using best effort. The allocation cost of the best-effort approach is always constant for an application, as mentioned earlier. This process is repeated 50 times by submitting the application at different times during the simulation run, and the average is taken. The makespan and allocation cost are measured in hours and service units (SU) respectively, with 1 SU = 1 processor-hour.
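In these units, the constant best-effort cost of Eq 9 above is a one-line computation (an illustrative sketch; a task is assumed to be given as a (runtime, processors) pair):

    def best_effort_allocation_cost(tasks):
        """Eq 9: each task pays runtime x processors; with runtimes in hours
        the result is in service units (1 SU = 1 processor-hour)."""
        return sum(runtime * nprocs for runtime, nprocs in tasks)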
Figure 5 shows the makespan and the allocation cost of the best-effort and the provisioned approach. The results of the provisioned approach are shown for different values of the trade-off factor α. The provisioned makespan is lower than the best-effort makespan for all values of α, at the cost of little or no increase in the allocation cost. While the best-effort allocation cost remains a lower bound on the provisioned allocation cost, MOGA is able to approach that lower bound using higher values of α. In general, the lower makespan of the provisioned approach is due to the fact that each best-effort task experiences some queue wait time at the sites based on the current workload, and the child tasks are not released for submission until the parent tasks have finished execution. With the provisioned approach, resources are provisioned for the entire application ahead of execution, leading to lower makespans in general.

Figure 5 also shows the standard deviation of the makespan and the allocation cost. The standard deviation of the makespan using best effort is generally higher than that of the provisioned approach. This shows that the provisioned approach provides better insulation for the application against dynamic changes in the resource workload. As the value of α increases, the standard deviation of the provisioned makespan increases and the allocation cost decreases, due to the increased emphasis on reducing the allocation cost. The allocation cost of best effort is a constant and hence its deviation is zero.

[Figure 5. Makespan (hrs) and allocation cost (SU) of the best-effort and provisioned approaches for trade-off factors α from 0 to 1, with averages and standard deviations.]

We also performed experiments to determine the effect of the task size on the makespan and the allocation cost. The task size is varied by changing the average number of processors per task in the workflow while the workflow structure remains the same. Figure 6 shows the effect of the task size on the makespan and the allocation cost of the best-effort and the provisioned approach. The performance of the provisioned approach is shown using trade-off factors of zero and one, which represent the lower and upper bounds for the makespan and the allocation cost of the provisioned approach. The figure shows the provisioned makespan and allocation cost in percentage terms relative to best effort, which is taken as 100 percent. As the task size increases, the application performance of the provisioned approach relative to best effort improves. Moreover, there is no significant effect on the allocation cost, since most of the tasks are scheduled on the divisible slots or their combinations due to the large task size. With larger task sizes, the chances of the best-effort scheduler being able to backfill the tasks earlier in the schedule decrease, and their queue wait times increase. This shows that the provisioned approach is well suited for workflows with highly parallel tasks.

[Figure 6. Makespan and allocation cost as a percentage of the best-effort values with increasing task sizes (3 to 30 processors per task on average), for α = 0 and α = 1.]

Production Grid deployments such as the TeraGrid [1] often experience high to very high utilization levels [22]. We performed experiments to ascertain the effect of the resource utilization on the performance of the provisioned approach. In order to increase the load, we made two copies of the workload trace using a method adopted from [23]. The first trace started at midnight on 28th June 1996 and the second one at midnight on 5th July 1996 (exactly one week later). Simulating only the first trace gave a resource utilization of 63%. For increasing the load, both traces were fed simultaneously to the simulator. While all tasks in the first trace were submitted to the resource as per their recorded submission times, the tasks in the second trace were submitted to the resource only with a certain probability. Increasing this probability allowed us to increase the load while not changing the task characteristics of the workload.
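A sketch of this load-scaling method follows (illustrative only; jobs are assumed to be dicts with a 'submit' field, and the offset was one week in our experiments):

    import random

    def scale_load(trace, offset, p, seed=0):
        """Replay the original trace plus a copy shifted by `offset`,
        admitting each shifted job only with probability p; larger p gives
        higher utilization without changing the job characteristics."""
        rng = random.Random(seed)
        extra = [{**job, "submit": job["submit"] + offset}
                 for job in trace if rng.random() < p]
        return sorted(trace + extra, key=lambda job: job["submit"])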
Figure 7 shows the allocation cost and the makespan of the provisioned approach as a percentage of the best-effort values, which are plotted as 100 percent. In this case, a single trade-off factor of 0.5 is used for generating the provisioned results. The results are shown for different average task sizes and different resource utilization levels. As the average task size in the workflow increases, the provisioned makespan becomes smaller as compared to the best-effort makespan. The same trend is visible when the resource utilization increases. The reason is that due to the increased load, the queue wait times of the best-effort tasks increase, leading to an increased makespan. The allocation cost of the provisioned approach is not significantly affected. This shows that the provisioned approach is more attractive when the resources are highly loaded and the applications have significant resource requirements. Both of these characteristics are visible at current operational Grid sites such as the TeraGrid. These sites target large-scale applications with highly parallel tasks, which are a natural match for these resources. Moreover, due to the shared nature of these resources, the large number of users, and their production nature, these sites generally show high to very high utilization levels (60-80+%) [22].

[Figure 7. Makespan and allocation cost as a percentage of the best-effort values with increasing resource utilization (63-88%) and task sizes (NumPE 3 to 30).]

6. SEISMIC HAZARD ANALYSIS
In the previous sections, we compared the performance of the provisioned and the best-effort approaches using artificially generated applications. In this section, we use a seismic hazard analysis application taken from the earthquake engineering community, called CyberShake [5]. The workflow has 5 levels with 8039 tasks; the structure of the workflow along with the module names is shown in Figure 8. Table 1 shows the average runtime and the number of processors required for each module. The last two levels of the workflow contain 4017 tasks each, with the same module (synthSGT and peakValCal respectively) operating on different datasets.

Table 1. Module details of the CyberShake workflow.

Module name    # of tasks   Avg runtime (seconds)   # of processors
fd_grid_xyz    1            1                       1
preSGT         1            300                     1
fd_grid_cvm    1            2100                    288
pmvl_chk1      1            86400                   288
pmvl_chk2      1            86400                   288
synthSGT       4017         519                     1
peakValCal     4017         1                       1

[Figure 8. The seismic hazard analysis workflow. Level 0: fd_grid_xyz; Level 1: preSGT, fd_grid_cvm; Level 2: pmvl_chk1, pmvl_chk2; Level 3: 4017 synthSGT tasks; Level 4: 4017 peakValCal tasks.]

We simulated the execution of this workflow on the CTC cluster. The runtimes of the tasks in the workflow were acquired from the provenance records of previous runs of the workflow using Pegasus [24, 25] and Condor DAGMan [26] on the HPCC (High Performance Computing & Communications) cluster at the University of Southern California (USC).
We simulated different background loads on the CTC cluster by making two copies of the workload and probabilistically adding portions of the second copy to the simulated workload as described in the previous section. As a result, we were able to vary the resource utilization of the CTC 123 cluster from 63% to 94%. For each utilization level, we simulated the workflow execution using both best effort and the provisioned approach at different points during the simulation run and averaged the makespan and allocation cost metrics. We used a trade-off factor of 0.5 for creating the resource plan. Figure 9 show the makespan and the allocation cost of the workflow execution using the best effort and the provisioned approach. At 63% utilization, the best effort makespan is 476811 seconds (5.5 days) which is similar in magnitude to the observed makespan of the workflow on the HPCC cluster. The provisioned makespan at this utilization level is 23% lower than the best effort value. The best effort makespan sharply increases as the resource utilization exceeds 80%. The growth in the provisioned makespan is less pronounced. At 94% utilization level, the provisioned makespan is 56% lower than the best effort makespan. Figure 9 also shows the standard deviation (denoted by bars) of the makespan and allocation cost. The standard deviation of the best effort makespan is more than that of the provisioned makespan and increases with the utilization of the resource. The standard deviation of the provisioned allocation cost is very small at the highest utilization level and negligible at lower levels. Bes t Effort 700 makespan (hrs) 600 500 400 300 200 100 0 63 70 76 82 88 % Resource Utilization Bes t Effort 20000 allocat ion cost (SU) 15000 10000 5000 0 63 70 76 82 88 % Resource Utilization 94 Pr ov is ioned 94 Pr ov is ioned 7. DISCUSSION The makespan using best effort was generally more than the makespan of the provisioned approach in our evaluation due to the fact that the Grid scheduler was not capable of handling dependencies and hence the child tasks has to be submitted only after the parent tasks had finished execution. Some resource managers such as PBS [8], Maui [27], Condor [10], and LSF [9] allow submission of multiple tasks with dependencies and initiate child tasks when the parent has completed. This has the potential to decrease makespan. However, depending on the system used, these tasks might or might not gain priority in queue while waiting for their dependencies to be satisfied. For example, PBS when used standalone allows such tasks to gain priority but when used with the Maui scheduler (which is the most common case), it is generally configured otherwise. The dependency manager in Condor (DAGMan) [26] also waits for the completion of parent tasks before submitting child tasks to resource queue. Any production system must disallow such tasks with unsatisfied dependencies from gaining priority while waiting in order to prevent system abuse. Abuse may occur when a single user submits a large-scale workflow in its entirety thus dominating the resource allocation for the timeframe of the execution of the workflow, preventing other users from gaining their fair share of the resource. Moreover, since dependencies are only recognized between the tasks submitted to the same resource queue, this mechanism does not work across platforms, or across queues on the same platform. One major concern with the use of Genetic Algorithms is their run time complexity. 
In Section 4, we assumed that the tasks report their actual runtimes, so that the end times of these tasks are accurately known to the scheduler. However, in practice users generally quote task runtimes that may be significantly more than the actual runtimes. In a provisioning-based environment, these loose runtime estimates can adversely impact the performance perceived by the provisioned users, since the free slot calculation is based on these task runtime estimates. For example, on increasing the runtimes of the tasks in the workload by 10% and 20%, the provisioned makespan increased by 3% and 7.7% respectively in the case of the CyberShake workflow with 63% resource utilization. It was still less than the best-effort makespan, which was 30% more than the provisioned makespan. Thus the resource providers might decide to charge best-effort users based on their runtime estimates, or to charge penalties for wrong estimates, which would encourage users to provide tight runtime estimates.

The advertisement of a slot is advisory and does not guarantee that the slot will actually be available should an application decide to allocate it. Actual allocation of the slot will be subject to policy enforcement by the resource provider at the point the allocation request is made. However, the model adopted in this paper leads to fewer interactions between the resource brokers and providers than other negotiation-based approaches that provision resources for each application task separately [28, 29]. Considering the large-scale nature of Grid applications and the multitude of users and resource providers, it is essential to make the provisioning process fast and efficient. Moreover, mechanisms such as the SNAP-based three-phase commit protocol [30] may be used to converge to a resource plan quickly using retries when contention between users or policy issues makes some desirable slots unavailable to the user.
In this paper, we have restricted our attention to the provisioning of compute resources. However, this framework can be extended to consider other types of resources as well. In our experiments, provisioning of network resources was not considered since each application was executed on a single cluster. Provisioning of network resources would become important when co-allocating resources from multiple providers. The design of a generic slot-based resource manager that can be used to manage network resources is discussed in [31]. The idea is to manage allocations using a reservation table similar to the site scheduler in Section 4, while doing the actual reservation using the RSVP protocol [32]. Hard guarantees can be obtained if the resource manager does admission control as well.

8. RELATED WORK
There is a large body of research on application scheduling on dedicated systems [14, 15]. There, resource provisioning is implicit in that the entire resource is considered provisioned. In Grid computing, due to the non-deterministic nature of the resource availability, prediction services such as the Network Weather Service [33], queue wait time estimators [34], etc. are used to make scheduling decisions for the applications [35]. However, the resulting application performance is highly dependent on the quality of the predictions. Moreover, frequent adaptation is required to counter the dynamic nature of resource availability.

Advance reservations have been widely proposed for provisioning resources for performance predictability, meeting resource requirements, and providing guaranteed quality of service to applications. A resource model similar to ours is adopted in [36], where a client can probe the possible start times of a task on a resource. However, the reservation is for a single job instead of an application, and a single resource slot has to be reserved. This has been extended in [37] to support co-reservations; however, the reservations are made based on an abstract resource description from the user. In our case, we create the set of reservations based on the resource availability in the Grid and a given application. Additionally, [37] creates all combinations of the resource slots to find a feasible candidate, which might not be a feasible strategy if there is a large number of available resource slots.

In [29], the authors create a reservation plan for a workflow by making reservations for each task individually. This may not be a feasible approach for large-scale workflows containing thousands of fine-grained tasks. The focus of [29] is on increasing the reliability of execution of the workflow, and contention for resources is not considered. A similar strategy is adopted in [28, 38], where the reservation or allocation for each task or activity in the application is done separately by negotiating with the resource providers. The cost of allocation is not considered in [28, 29]. In [38], the authors employ a cost-aware resource model similar to ours; however, the goal there is to concurrently maximize the resource and application utility using a centralized resource allocator. In our case, we use a distributed approach where the resource providers and users maximize their own utility.

There has been a recent focus on the framework [11] and protocols [30] for agreement-based resource management. These provide the underlying plumbing required to instantiate a set of resource agreements for an application and complement the work presented in this paper. However, there has been little work on the reasoning process for selecting the set of agreements to be created for workflow-structured applications.

9. CONCLUSIONS
In this paper, we presented a multi-objective GA formulation for provisioning resources for an application using a slot-based resource model, optimizing the application performance while minimizing the provisioning cost. We have extended a conservative-backfilling-based Grid scheduler to support resource provisioning in addition to providing a best-effort quality of service. Using trace-based simulation and an artificial and a real application, we have compared the application performance using the best-effort and provisioned approaches. We evaluated the sensitivity of the provisioned approach to changes in the user preference (trade-off factor), application task size, and resource utilization. The results show that the provisioned approach outperforms best-effort execution when the size of the tasks in the application increases and when the background load on the resources increases. In the future, we plan to experiment with an advance discount pricing strategy and investigate its effect on application and system performance.

10. ACKNOWLEDGEMENTS
This work is supported in part by the National Science Foundation under Cooperative Agreement CCR-0331645 and NGS-0305390.

11. REFERENCES
1. Catlett, C. The philosophy of TeraGrid: building an open, extensible, distributed TeraScale facility. In 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID 2002), 2002.
2. The Open Science Grid Consortium, http://www.opensciencegrid.org.
3. Katz, D.S., et al. A comparison of two methods for building astronomical image mosaics on a Grid. In International Conference on Parallel Processing (ICPP 2005) Workshops, 2005.
4. Deelman, E., et al. GriPhyN and LIGO, building a virtual data Grid for gravitational wave scientists. In 11th International Symposium on High Performance Distributed Computing, 2002.
5. Deelman, E., et al. Managing large-scale workflow execution from resource provisioning to provenance tracking: the CyberShake example. In Second IEEE International Conference on e-Science and Grid Computing (e-Science '06), 2006.
6. Foster, I., C. Kesselman, and S. Tuecke. The anatomy of the Grid: enabling scalable virtual organizations. International Journal of High Performance Computing Applications, 2001. 15(3): p. 200-222.
7. Droegemeier, K.K., et al. Linked Environments for Atmospheric Discovery (LEAD): a cyberinfrastructure for mesoscale meteorology research and education. In 20th Conference on Interactive Information Processing Systems for Meteorology, Oceanography, and Hydrology, 2004. Seattle, WA.
8. Henderson, R.L. Job scheduling under the Portable Batch System. In Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing, 1995. Springer-Verlag. p. 279-294.
9. Zhou, S., et al. Utopia: a load sharing facility for large, heterogeneous distributed computer systems. Software: Practice and Experience, 1993. 23(12): p. 1305-1336.
10. Litzkow, M.J., M. Livny, and M.W. Mutka. Condor - a hunter of idle workstations. In 8th International Conference on Distributed Computing Systems, 1988.
11. Czajkowski, K., I. Foster, and C. Kesselman. Agreement-based resource management. Proceedings of the IEEE, 2005. 93(3): p. 631-643.
12. Singh, G., C. Kesselman, and E. Deelman. Application-level resource provisioning on the Grid. In 2nd IEEE International Conference on e-Science and Grid Computing, 2006. Amsterdam.
13. Fonseca, C.M. and P.J. Fleming. Genetic algorithms for multiobjective optimization: formulation, discussion, and generalization. In Proceedings of the Fifth International Conference on Genetic Algorithms, 1993.
14. Mandal, A., et al. Scheduling strategies for mapping application workflows onto the Grid. In 14th IEEE International Symposium on High Performance Distributed Computing (HPDC-14), 2005.
15. Topcuoglu, H., S. Hariri, and M.-Y. Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 2002. 13(3): p. 260-274.
16. Deb, K. Multi-Objective Optimization using Evolutionary Algorithms. 2001: John Wiley & Sons.
17. Wieczorek, M., R. Prodan, and T. Fahringer. Scheduling of scientific workflows in the ASKALON grid environment. SIGMOD Record, 2005. 34(3): p. 56-62.
18. Feitelson, D.G. and A.M. Weil. Utilization and predictability in scheduling the IBM SP2 with backfilling. In First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing (IPPS/SPDP 1998), 1998.
19. Feitelson, D.G. Logs of real parallel workloads from production systems. http://www.cs.huji.ac.il/labs/parallel/workload.
20. Singh, G., C. Kesselman, and E. Deelman. A provisioning model and its comparison with best-effort for performance-cost optimization in Grids. USC Technical Report no. 07890, 2007.
21. Buyya, R. and M. Murshed. GridSim: a toolkit for the modeling and simulation of distributed resource management and scheduling for Grid computing. Concurrency and Computation: Practice and Experience, 2002. 14(13-15): p. 1175-1220.
22. Iosup, A., et al. How are real Grids used? The analysis of four Grid traces and its implications. In 7th IEEE/ACM International Conference on Grid Computing, 2006. Barcelona, Spain.
23. Sabin, G., V. Sahasrabudhe, and P. Sadayappan. Assessment and enhancement of meta-schedulers for multi-site job sharing. In 14th IEEE International Symposium on High Performance Distributed Computing (HPDC-14), 2005.
24. Deelman, E., et al. Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Scientific Programming, 2005. 13(3): p. 219-237.
25. Deelman, E., et al. Pegasus: mapping large-scale workflows to distributed resources. In Workflows for e-Science: Scientific Workflows for Grids, I. Taylor, et al., Editors. 2007, Springer.
26. Condor DAGMan: http://www.cs.wisc.edu/condor/dagman.
27. Jackson, D.B., Q. Snell, and M.J. Clement. Core algorithms of the Maui scheduler. In Revised Papers from the 7th International Workshop on Job Scheduling Strategies for Parallel Processing, 2001. Springer-Verlag. p. 87-102.
28. Wieczorek, M., et al. Applying advance reservation to increase predictability of workflow execution on the Grid. In Second IEEE International Conference on e-Science and Grid Computing, 2006. Amsterdam.
29. Zhao, H. and R. Sakellariou. Advance reservation policies for workflows. In 12th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), 2006. Saint-Malo, France.
30. Haji, M.H., et al. A SNAP-based community resource broker using a three-phase commit protocol: a performance study. The Computer Journal, 2005. 48(3): p. 333-346.
31. Foster, I., et al. A distributed resource management architecture that supports advance reservations and co-allocation. In Proceedings of the International Workshop on Quality of Service, 1999.
32. Zhang, L., et al. RSVP: a new resource ReSerVation Protocol. IEEE Network, 1993. 7(5): p. 8-18.
33. Wolski, R., N. Spring, and J. Hayes. The Network Weather Service: a distributed resource performance forecasting service for metacomputing. Future Generation Computer Systems, 1999. 15(5-6): p. 757-768.
34. Brevik, J., D. Nurmi, and R. Wolski. Predicting bounds on queuing delay in space-shared computing environments. In IEEE International Symposium on Workload Characterization, 2006.
35. Nurmi, D., et al. Evaluation of a workflow scheduler using integrated performance modelling and batch queue wait time prediction. In SuperComputing Conference, 2006. Tampa, Florida.
36. Roblitz, T., F. Schintke, and J. Wendler. Elastic Grid reservations with user-defined optimization policies. In Proceedings of the Workshop on Adaptive Grid Middleware, 2004.
37. Roblitz, T. and A. Reinefeld. Co-reservation with the concept of virtual resources. In IEEE International Symposium on Cluster Computing and the Grid, 2005.
38. Siddiqui, M., A. Villazon, and T. Fahringer. Grid capacity planning with negotiation-based advance reservation for optimized QoS. In SuperComputing Conference, 2006. Tampa, Florida.