Using Queue Structures to Improve Job Reliability

Thomas J. Hacker
Discovery Park Cyber Center, Purdue University
250 N. University Street, West Lafayette, Indiana 47907
hacker@cs.purdue.edu

Zdzislaw Meglicki
Office of the Vice President for Information Technology, Indiana University
Bloomington, Indiana 47405
gustav@indiana.edu

ABSTRACT

Many high performance computing systems today exploit the availability and remarkable performance characteristics of standalone server systems and the impressive price/performance ratio of commodity components. Small scale HPC systems, in the range from 16 to 64 processors, have enjoyed significant popularity and are an indispensable tool for the research community. Scaling up to hundreds and thousands of processors, however, has exposed operational issues, which include system availability and reliability. In this paper, we explore the impact of individual component reliability rates on the overall reliability of an HPC system. We derive a mathematical model for determining the failure rate of the system and the probability of failure of a job running on a subset of the system, and show how to design a reasonable queue structure to provide a reliable system over a broad job mix. We also explore the impact of reliability and queue structure on checkpoint intervals and recovery. Our results demonstrate that it is possible to design a reliable high performance computing system with very good operational reliability characteristics from a collection of moderately reliable components.

Categories and Subject Descriptors: C.4 [Performance of Systems]: Reliability and availability

General Terms: Management, Reliability

1. INTRODUCTION

The introduction of cluster computing systems based on commodity components has greatly increased the use of high performance computing in research and industry. Many of the systems in use today contain thousands of processors and hundreds of computational nodes. Queueing systems are commonly used to mediate access to the system by partitioning the system and enforcing time limits on jobs running in a partition. The size of the partitions, time limits, and other restrictions are based on the needs of the user community, application requirements, and site policies. This approach works well when the size of the system is small and when application run time is limited. However, growth in the number of nodes and in the execution time required by applications has exposed several problems. Many computational tasks require the availability of all nodes in the partition for the duration of the job. When even one of these nodes fails, the entire job must be aborted. Strategies such as checkpointing and short queue run times are used to limit the wasted cycles caused by these failures and to improve the chances of job success, but there has been little work on linking the inherent structure of the queue partitions and run time limits with estimated system reliability.

In this paper, we link queue structure and system reliability, and derive a mathematical model that can be used to configure a system to provide an estimated probability of job success. We link the model we derive with Daly's optimal checkpoint model [1] to determine the number of nodes required to achieve an optimal checkpoint time. Using the system initiated checkpoint feature of LAM/MPI [2], we propose a peer-to-peer checkpointing method that can be used with spare failover nodes to significantly improve the chances of job success. Through the combination of queue structures, system initiated checkpointing, failover, and the use of optimal checkpoint intervals, we believe that it is possible to design a reliable high performance computing system from moderately reliable commodity components.

Our approach in developing this model required several steps. First, we determined whether the exponential, Weibull, or some other distribution best characterizes the failure behavior of a system in the field, and estimated reasonable parameters to use for our model. Second, based on observed failure characteristics, we derived a queue reliability model and assessed this model against known existing queue structures. Third, using the inferred queue MTTF and node MTTF along with an optimal checkpoint interval, we derived the number of nodes necessary to maximize effective work time. Finally, using a k-out-of-N reliability model, we determined the reliability of a queue using k failover nodes.

2. RELATED WORK

Kleban [3] studied the effects of queue structure on the risk and rewards of running a portfolio of jobs on a cluster with a specific queue policy. Our work differs in that we investigate the effects of queue structure along with system reliability on the probability of success for an individual job. Ryan [4] analyzed reliability data for the Blue Mountain system at the Los Alamos National Laboratory to estimate trends in system reliability. The authors derived a reliability projection model based on non-homogeneous Poisson processes. Rather than focusing on the prediction of the reliability of the system components, our work uses expected reliability metrics to predict the failure characteristics of a computational task running for a finite time on the system under study. Heath [5] investigated techniques to improve cluster reliability by dividing cluster nodes into production and testing pools. Although Heath's approach could be used to improve the reliability of some nodes, our work focuses on assessing the effects of failures on job reliability for tasks running on a full queue or system. Nurmi, Brevik, and Wolski [6, 7, 8] assessed system availability data to derive several availability models. Our work is a continuation of Nurmi's work in that we assess the effects of availability on computational tasks running on high performance computing systems, which are often aggregations of commodity desktop systems. Zhang [9] examined the impact of failures on performance for a range of scheduling policies. The authors describe a scheduling strategy to reduce the probability of a failed node terminating a running job, using a priority-based node selection scheme that avoids nodes that have recently failed. Our work is different in that we base our analysis on the aggregate statistical behavior of entire systems comprised of nodes with individual reliability characteristics, rather than on simulation based models.
In a recent paper [10], Schroeder and Gibson analyzed system failures collected from several high-performance computing systems at LANL.

3. DETERMINING SYSTEM FAILURE DISTRIBUTION FROM FIELD DATA

To develop a model for queue reliability, we first needed an accurate characterization of the reliability of the system. The purpose of this was severalfold. First, we wanted to use an accurate failure model that was well supported by field data and the literature. Second, we wanted a reliability model that could be used for newly installed systems, as well as for systems that have been in operation for a long time. Finally, we wanted to explore the significant difference between the Weibull distribution and the exponential distribution typically used to model failures.

It is well known that the failure of electrical components follows an exponential distribution [11], and many studies assume that system failures also follow an exponential failure distribution. This assumption is valid as long as the failure rate is constant over time. However, studies of systems in the field have shown that the observed failure rate changes over the useful life of the system and is not well fit by an exponential distribution [9, 5, 10]. The early failure rate of a system is high, and follows a Weibull distribution during the initial deployment and burn-in period of the system. Failures in systems operating in the field are due to premature hardware component failures, as well as software, network, and environmental faults, and human error. After initial problems have been resolved, and the system has been in operation long enough to cull out defective hardware and software components, the failure rate decreases [12]. Late in the life of a system, as the components age and begin to fail, the failure rate increases.

Many manufacturers routinely provide mean time to failure (MTTF) statistics for computer systems and network equipment, which can be used to assess the expected reliability of a system. While these statistics are useful for single server installations, mapping from component MTTF to overall system reliability, architecture, and operational parameters is more complex. To investigate this effect, we analyzed nine years of field failure data from a large supercomputing system at LANL. LANL kept detailed failure logs for 22 high performance computing systems in production between 1997 and November 2005; the raw data are available online [13]. We analyzed the failure data for the largest system in the data set, a 49 node, 6152 processor SMP system. Our analysis sought to answer several questions: i) which statistical distribution best fits observed failures? ii) at what point in the life of the system does the failure rate become constant? iii) is it possible to distill conclusions from this system that could be applicable to other systems in the field? iv) can we derive a set of Weibull parameters for our queue failure model?
To answer these questions, we carefully examined over 6,900 failure records from the LANL data. Since failures during the lifetime of a system generally follow a "bathtub curve", we believed that the failure rate of the system would change over time, and finally stabilize to a constant value as faulty nodes were replaced. To assess this, we subdivided the failure data for each node into annual periods, and calculated the time between failures for each failure event. Figure 2 shows a box plot of the number of failures per node per year of operation. As might be expected, the number of failures per node for every node in the system drops over time. The total number of failures per year across the complete system is shown in Figure 1.

Figure 1: Total number of failures per year. (Totals across the system for Year 1 through Year 8; y-axis 0-2500.)

Figure 2: Number of failures per node over time. (Box plot of failures per node per year for Year 1 through Year 8; y-axis 0-120.)

Examining these plots, it is clear that the first two years of system operation suffer from a large number of failures. The number of failures in the first year is 2.5 times the number of failures in the third year. To determine the failure distribution for each year of operation, we then tested the fit of 17 different distributions to the set of annual failures using EasyFit [14]. For each period, the Weibull distribution was either the best fit or an acceptable next-closest fit among the tested distributions. Figures 3-5 show the P-P plots for years 1, 5, and 7 for the Weibull and exponential distributions. Clearly, Weibull is a much closer fit than exponential for these data. Table 1 shows the fitted Weibull shape parameter β and scale parameter η for each year analyzed. N represents the number of logged failure events per year.

Figure 3: P-P Plot of Weibull and Exponential Fit in Year 1. (P(model) versus P(empirical).)

Figure 4: P-P Plot of Weibull and Exponential Fit in Year 5. (P(model) versus P(empirical).)

Figure 5: P-P Plot of Weibull and Exponential Fit in Year 7. (P(model) versus P(empirical).)

Table 1: Weibull fit parameters for annual failure data.
Year    N       β      η
1       2002    0.71    80.4
2       1604    0.57   185.3
3        761    0.68   416.4
4        704    0.71   494.2
5        721    0.75   514.8
6        618    0.67   542.9
7        416    0.66   478.8
8        138    0.64   332.7

From this table, it is clear, first, that β never approaches 1, which indicates that the distribution is not exponential. Second, η, which is directly related to overall system MTTF (see Equation 3), steadily increases until the sixth year, after which it declines. Third, β remains within the range 0.57 to 0.75, peaking in the fifth year. We believe that this is an indication that the system is not simply deployed and left untouched. Over time, as hardware and software bugs are discovered and patches are applied, system reliability improves. This may explain why the system never settles to a constant failure rate, which would manifest itself as an exponential distribution, rather than the changing failure rate characteristic of a Weibull distribution. In the third year of operation, the number of failures per year seems to stabilize, then decreases in the remaining years as the effects of system repairs and patches restore the reliability of nodes to their design reliability. However, as evidenced by the decrease in η after the sixth year, the nodes that do continue to fail experience a decreasing MTTF, which may stem from the effects of end-of-life failures.
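To make the fitting step concrete, the short Python sketch below fits a Weibull and an exponential distribution to a list of inter-failure times and compares their log-likelihoods. It is only a minimal illustration of the kind of analysis described above, not the EasyFit procedure the authors used; the sample data, variable names, and the use of scipy.stats are our own assumptions.

```python
# Minimal sketch: fit Weibull and exponential models to inter-failure times.
# Illustrative only; the paper's analysis used EasyFit and P-P plots.
import numpy as np
from scipy import stats

# Hypothetical inter-failure times (hours) for one year of operation.
tbf_hours = np.array([12.0, 40.5, 3.2, 95.0, 18.7, 60.1, 7.9, 130.4, 22.3, 55.8])

# Two-parameter Weibull fit (location fixed at 0): returns shape beta and scale eta.
beta, _, eta = stats.weibull_min.fit(tbf_hours, floc=0)

# Exponential fit (location fixed at 0): the fitted scale is the MTTF estimate.
_, mttf_exp = stats.expon.fit(tbf_hours, floc=0)

# Compare goodness of fit with total log-likelihood (higher is better).
ll_weibull = np.sum(stats.weibull_min.logpdf(tbf_hours, beta, loc=0, scale=eta))
ll_expon = np.sum(stats.expon.logpdf(tbf_hours, loc=0, scale=mttf_exp))

print(f"Weibull fit: beta={beta:.2f}, eta={eta:.1f} h, log-likelihood={ll_weibull:.1f}")
print(f"Exponential fit: MTTF={mttf_exp:.1f} h, log-likelihood={ll_expon:.1f}")
```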
From these data, we conclude that: i) node failures follow a Weibull distribution during the useful lifetime of the system; ii) the first two years of operation are by far the worst years of the system in terms of reliability; iii) system failures seem to stabilize by the third year; and iv) in the out years, the nodes that continue to fail do so with increasing regularity.

We searched the literature for other work that measured system reliability, and found several examples. Nurmi, Brevik, and Wolski [6, 7, 8] assessed system availability data from a 24-hour computer lab at UCSB. Nurmi collected system availability data on 83 workstations between February and October of 2003, and derived a Weibull β value of 0.53 ± 0.02 and an η between 109,710 and 678,720 with a 95% confidence interval. Raju [15] also analyzed one of the systems at LANL (the 512 node cluster) and found that the Weibull distribution was the best fit for the logged failure data. In a recent paper [10], Schroeder and Gibson analyzed system failures collected over a period of 9 years from several high-performance computing systems at Los Alamos National Laboratory (LANL). They found that the time between system failures most closely matched a Weibull distribution with shape parameter β in the range of 0.7 to 0.8. Xu [16] analyzed field failure data from 503 Windows NT systems running over a period of four months, and found that the time between failures for all outages best fit a Weibull distribution with a shape parameter of 0.46 and scale parameter of 1.29. Other work [10, 5] has shown that observed failure rates on deployed systems also follow a Weibull distribution. Determining β and η for an individual system greatly depends on the specific architecture, operating environment, and characteristics of the system.

From these results, several general conclusions can be drawn: i) system failures do not follow an exponential distribution during the useful lifetime of the system; ii) the early period of operation, approximately the first two years, demonstrates the worst and most variable reliability characteristics during the lifetime of the system; iii) constant repair and maintenance will somewhat stabilize reliability after the second year, and improve node reliability toward the design MTTF; iv) the "problem" nodes that continue to fail will do so with increasing regularity. The range of observed β values collected from thousands of failure events on hundreds of nodes from different vendors running different operating systems is 0.46 to 0.8.

4. SYSTEM RELIABILITY MODELS

In this section, we derive a model for determining the availability of a high performance computing system built up from a number of individual computational nodes. We perform this analysis for failures arising from an exponential distribution and a two-parameter Weibull distribution, with the goals of deriving a queue reliability model and investigating the difference between the two commonly used distributions.

Ideally, long-term detailed and accurate failure logs would be available for the system under study. For deployed systems that have been in operation for a long period of time, this information could be used to determine the failure characteristics of an individual system. In practical terms, new installations do not have this information available, and collecting accurate and fully annotated failure logs is a considerable effort that not all sites can undertake. In the absence of this information, a reasonable metric for assessing reliability is the mean time to failure statistics provided by manufacturers of the components that make up the system.
4.1 System Reliability Based on Weibull Failure Distribution

To calculate the probability of failure over all of the nodes of a supercomputer, we must first determine the probability of failure of an individual node from vendor supplied MTTF information. We assume that this failure statistic represents a node failure, not the failure of an individual component or a failover to a redundant component. (The statistics provided by the manufacturer represent the expected reliability of a node built up from a number of discrete components, each of which has individual failure characteristics. Validating this information requires significant field data. Recent work [17, 18] assessed empirical data for disk drives and found that the time between drive failures follows a Weibull, not exponential, distribution, and that effects other than load and temperature affect reliability.) For this analysis, we assume a series model of independent components. In this model, the failure of one node results in the termination of a computational job running across all of the nodes, and thus represents a complete failure of the system. We also assume that the component failure distributions are identical, with identical Weibull shape parameters.

The probability of a component failure during the time interval [0, t] for the Weibull distribution of scale η and shape β is [19]

F(t, \eta, \beta) = \int_0^t \frac{\beta}{\eta} \left( \frac{t'}{\eta} \right)^{\beta - 1} e^{-(t'/\eta)^{\beta}} \, dt' = 1 - e^{-(t/\eta)^{\beta}}.   (1)

The probability that the component will not fail in this time is

R(t, \eta, \beta) = 1 - F(t, \eta, \beta) = e^{-(t/\eta)^{\beta}}.   (2)

The mean of the Weibull distribution is given by \eta \, \Gamma(1 + 1/\beta), where Γ is the Euler Gamma function. Assuming that the vendor MTTF corresponds to the mean, we find that η is

\eta = \frac{\mathrm{MTTF}}{\Gamma(1 + 1/\beta)}.   (3)

If the node contains several discrete components, such as a switch interconnect card, disk drive, memory banks, etc., then the probability that the node will not fail in time t, R_node(t), is the product of the reliability functions of the node's components:

R_{node}(t) = \prod_{c \in \{\mathrm{components}\}} R_c(t) = \prod_{c \in \{\mathrm{components}\}} e^{-(t/\eta_c)^{\beta_c}} = \exp\left( - \sum_{c \in \{\mathrm{components}\}} \left( \frac{t}{\eta_c} \right)^{\beta_c} \right).   (4)
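As a quick illustration of Equations (2)-(4), the hedged Python sketch below converts a vendor MTTF and an assumed shape β into the Weibull scale η, and composes per-component reliabilities into a node reliability. The component list, the helper names, and the choice of β = 0.7 are our own assumptions for illustration, not values taken from the paper.

```python
# Sketch of Equations (2)-(4): Weibull scale from MTTF, and a node treated as a
# series system of independent Weibull components (assumed values).
import math

def weibull_eta(mttf_hours: float, beta: float) -> float:
    """Equation (3): scale parameter eta from the vendor MTTF and shape beta."""
    return mttf_hours / math.gamma(1.0 + 1.0 / beta)

def weibull_reliability(t: float, eta: float, beta: float) -> float:
    """Equation (2): probability that a component survives [0, t]."""
    return math.exp(-((t / eta) ** beta))

def node_reliability(t: float, components: list[tuple[float, float]]) -> float:
    """Equation (4): product of component reliabilities, given (MTTF, beta) pairs."""
    total_hazard = sum((t / weibull_eta(mttf, beta)) ** beta for mttf, beta in components)
    return math.exp(-total_hazard)

# Assumed example: three components, all with shape beta = 0.7.
components = [(67_640, 0.7), (100_000, 0.7), (400_000, 0.7)]
print(f"R_node(1000 h) = {node_reliability(1000.0, components):.3f}")
```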
For example, a misb ehaving cooling fan may induce several comp onent failures b efore the fan finally fails. Second, it is clear from the previous section that we can model the aggregate b ehavior of the system, but the individual b ehavior of each nodes is unique, and reflects deterministic and dep endent sources of failure that may not b e well fit with a distribution. Third, analyzing the failure of each node to determine node and combining these data into a single model for system failure is unnecessary, since we already know cluster from analyzing the aggregate failure data. Since we assume that the node MTTF is identical for all nodes, we also assume that the same Weibull shap e parameter applies to the nodes and to the cluster as a whole, and (similar to the case for an exp onential distribution) using node = cluster = , we obtain 1 cluster = 1/ node (7) N and the cluster's MTTF: = 1 1 1 n o d e + MTTFcluster = N 1/ 1 MTTFnode (8) N 1/ 1 MTTFnode . (16) N Comparison b etween equations (16) and (8) shows that for < 1 and for the same values of MTTFnode the Weibull distribution predicts a shorter MTTF for the cluster. 4.3 Calculating Failure Probabilities Using these models with MTTF data provided by manufacturers, we can now calculate system reliability. Table 2: Component MTTF for a cluster node Comp onent MTTF (hours) Intel SE7529AF2 Mainb oard 67,640 Power supply 100,000 CPU Fan (indicator of CPU life) 400,000 Memory 3,860,000 Seagate Cheetah Hard Drive 1,400,000 The first example describ es a typical sup ercomputer cluster node based on commodity comp onents and commodity assembly, consisting of a mainb oard, p ower supply, memory, CPU fan, and hard drive. The MTTF parameters for this node are quoted in Table 2. We have used the exp ected lifetime of the CPU fan as an indicator of the lifetime of the processor, since processor b oards are often configured to shutdown immediately on detecting cooling fan malfunction. If we assume that the node comp onents and the node itself are describ ed by the exp onential distribution, we can calculate the failure rate of a node as the sum of reciprocals of each comp onent MTTF. The resulting MTTFnode is 35,999 hours, which is ab out 1, 500 days or a little more than 4 years. This should not b e understood in terms of a guarantee that the node will not fail for 4 years. On the contrary. The reliability function in this case b eing exp (-t/4years) tells us that only ab out 1/3 of all nodes in the cluster will still op erate without a single failure after 4 years. This can b e read from Figure 6, which plots the exp onential reliability function for this node. The second example is a blade from a ma jor manufacturer that has b een designed sp ecifically for use in large clusters. The vendor supplied MTTF p er blade is 105,000 hours, but the addition of a Myrinet adapter to the blade 3 Since (n + 1) = n! we get that (2) = 1 and equation (3) turns into (11). 4.2 System Reliability Based on Exponential Failure Distribution For the shap e distribution equal to 1, the Weibull distribution turns into the well known exp onential distribution t -1 1 e -(t / ) = e -t / (9) lim 1 It is common to use = 1/ , the rate parameter , in place of , so that the distribution b ecomes e - t (10) with mean 1/. Assuming that the vendor MTTF numb er corresp onds to the mean we obtain 1 = = MTTF (11) 47 lowers MTTFnode to 102,840 hours. Figure 6 shows reliability for b oth node typ es using the Weibull ( values of 0.7 and 0.8) and exp onential failure distributions. 
The second example is a blade from a major manufacturer that has been designed specifically for use in large clusters. The vendor supplied MTTF per blade is 105,000 hours, but the addition of a Myrinet adapter to the blade lowers MTTF_node to 102,840 hours. Figure 6 shows reliability for both node types using the Weibull (β values of 0.7 and 0.8) and exponential failure distributions.

Figure 6: Exponential and Weibull reliability functions for the two types of nodes with MTTF_node of 35,399 hours and 102,840 hours respectively. (Reliability versus time, 0-100,000 hours; curves for β = 0.7, β = 0.8, and exponential for the commodity cluster node (C) and blade node (B).)

It is clear from both the MTTF numbers and from Figure 6 that the specially engineered blades are more reliable. What is worth pointing out is that the Weibull and the exponential distributions converge for large times; the difference between the two manifests itself primarily in the early stages of system operation.

The next step is to calculate the overall system reliability of a supercomputer composed of several individual computational nodes. The reliability of the entire system depends on the successful operation of N nodes without failure for time t, and is given by Equations (5) and (14) for the Weibull and exponential distributions respectively. Using MTTF data for the blade and cluster architectures, and assuming a cluster of N = 256 nodes, we can plot system reliability over time. The resulting reliability functions are shown in Figure 7 for the exponential and Weibull failure distributions. Figure 7 shows that the cluster reliability estimates derived from the exponential distribution function are significantly inflated in the early life of the system, as we have also observed for the individual nodes.

Figure 7: Reliability of 256 Node System Based on the Weibull and Exponential Failure Distribution. (Reliability versus time, 0-1,000 hours; curves for β = 0.7, β = 0.8, and exponential for systems of commodity cluster nodes (C) and blade nodes (B).)

If we fix time t at some value, say 100 hours, and vary the number of nodes, we can assess the dependence of the cluster reliability on its size. This is shown in Figure 8 for the exponential and Weibull distributions.

Figure 8: Reliability function at t = 100 hours as a function of cluster size, based on the exponential failure distribution. (Reliability versus number of nodes, 0-500, for the commodity cluster node (C) and blade node (B).)

Choosing the Weibull distribution parameters for reliability analysis depends on where in the operational life cycle the system is operating. During early deployment and burn-in, a conservative low MTTF would be most appropriate. Once the system has stabilized, the failure rate decreases. The trend in system MTTF (directly related to η) is apparent for the LANL system we analyzed, described in Table 1. The reliability trend apparent in these figures echoes the operational reality experienced by many computer centers.

4.4 Architecting a Mixed System

The model for the reliability of a single system built from homogeneous nodes can easily be extended to systems built from a mix of different nodes. If the system contains M discrete groups of nodes numbered by m = 1, ..., M, with N_m nodes per group m numbered by n = 1, ..., N_m, and the MTTF_m of each group of nodes is known, then the Weibull reliability function for the whole cluster is simply the product of the reliability functions of each node in the system:

R_{cluster}(t) = \prod_{m=1}^{M} \prod_{n=1}^{N_m} R_m(t) = \exp\left( - \sum_{m=1}^{M} N_m \left( \frac{t}{\eta_m} \right)^{\beta_m} \right).   (17)

This can be simplified, if we were to assume universality of β, to

R_{cluster}(t) = \exp\left( - \sum_{m=1}^{M} (\lambda_m t)^{\beta} N_m \right),   (18)

where λ_m = 1/η_m. Finally, for the exponential distribution β = 1:

R_{cluster}(t) = \exp\left( - \sum_{m=1}^{M} \lambda_m t \, N_m \right).   (19)

To calculate R_cluster(t) for the Weibull distribution function, the node failure rate λ_m must be determined for each group, as well as the shape β_m of the aggregate group, unless we can assume the universality of β. To demonstrate this, let us consider a system made up of a 50/50 mix of commodity cluster nodes and blades, as described in Section 4.3. Figure 9 shows the reliability improvement of this system compared to a system that is made entirely of commodity nodes. Adding a proportion of specially designed high reliability nodes improves the reliability of the resulting architecture.

Figure 9: Reliability of mixed versus commodity based 256 node system. (Reliability versus time, 0-1,000 hours; exponential and Weibull (β = 0.75) curves for the mixed and commodity-only clusters.)

In addition to reliability, another factor to consider is cost. To illustrate this, consider a scenario in which the cost of blades is twice the cost of commodity nodes. If the system is a 50/50 split, the overall cost of the system will rise by 25%. Based on the Weibull analysis, if jobs are limited to less than 100 hours, there is only a small difference in system reliability: at t of 100 hours, the reliability rises only from 2.8% to 6.6%. Given this small increase in reliability, the additional 25% in expense may have been better spent, for example, on other components and software to improve the system's functionality.
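To show how Equation (18) plays out for the 50/50 example just discussed, here is a small hedged Python sketch that evaluates the mixed-cluster reliability at a fixed time under an assumed common shape β = 0.75. The MTTF values are the two node types discussed above, but since the exact scale-parameter convention behind Figure 9 is not spelled out, the numbers should be read as being in the same ballpark as the figure rather than as a reproduction of it.

```python
# Sketch of Equation (18): reliability of a cluster mixing node groups with
# different MTTFs, assuming a common Weibull shape beta (values assumed).
import math

def mixed_cluster_reliability(t_hours, groups, beta):
    """groups: list of (number_of_nodes, node_MTTF_hours)."""
    hazard = 0.0
    for n_nodes, mttf in groups:
        eta = mttf / math.gamma(1.0 + 1.0 / beta)   # Equation (3)
        hazard += n_nodes * (t_hours / eta) ** beta
    return math.exp(-hazard)

beta = 0.75
commodity_only = [(256, 35_399)]
fifty_fifty = [(128, 35_399), (128, 102_840)]

for label, groups in [("commodity only", commodity_only), ("50/50 mix", fifty_fifty)]:
    print(f"{label}: R(100 h) = {mixed_cluster_reliability(100.0, groups, beta):.3f}")
```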
5. DETERMINING QUEUE RELIABILITY

Now that we have derived the reliability for a homogeneous and a mixed system, we can apply this analysis to the problem of estimating queue reliability. High performance computing resources are often shared among a workgroup or broadly shared at an institutional level. In practice, the queue structure is designed to meet the needs of the user community, application requirements, and the policies of the institution operating the system. Access to HPC resources (processors, storage, and switch fabric) must be shared among jobs. A queue structure that fairly mediates access can be used to implement an appropriate policy that strikes a good balance between competing needs. We can also use the queue structure and policy to shape the overall availability and reliability profile of the system and the mix of jobs that make use of the system. By extending the mathematical models derived in the previous section, we can estimate the success rate for multiple jobs running across the system with constraints on wall clock time, processors, and a limit on the number of simultaneous jobs running in a queue. In this model, the number of nodes corresponds to the maximum number of computational nodes that can be assigned to a job. The time over which reliability is calculated corresponds to the maximum run time associated with a particular queue.

5.1 Probability of a Job Failure for Entire System Based on Job Mix

To begin, consider the queue structure shown in Table 3, which describes the structure and parameters of a queue system. For each queue q ∈ {"Huge", "Big", "Medium", "Small", "Interactive"} we specify the maximum number of blades that can be assigned to a job, N_bq, the maximum time that the job may take, T_q, and the maximum number of jobs that may be run in this queue simultaneously, N_jq.

Table 3: Example queue structure.
                                               Failure Rate
Queue         N_bq    T_q (hours)    N_jq    Weibull (%)    Exp (%)
Huge           258        48           2        62.38        11.35
Big            130        48           4        38.90         5.89
Medium          65        72           4        28.40         4.45
Small            5       240           4         6.10         1.16
Interactive    516         1           2        10.10         0.13

In our analysis, we use a Weibull β = 0.75, which is within the range of observed values. We can use this queue structure to determine the probability that a job will fail on the entire system within some time t due to a hardware fault. Suppose that the entire system is operated for a very long time t, so that we can neglect idle time for blades due to job sizes not fitting the pools available, and the granularity of job duration T_q in comparison with t. Assuming that the system was fully utilized over this time, and that jobs fully utilize the queue, we would have completed t N_jq / T_q jobs in each queue. The number of failed jobs in each queue is then F_q(t) t N_jq / T_q, where F_q(t) is the failure rate for the queue, and the number of failed jobs across all queues is \sum_q F_q(t) t N_jq / T_q. The probability of a job failing in any queue, F_job, is the ratio of the number of failed jobs to the total number of jobs over a very long time t, which is

F_{job} = \frac{\sum_q F_q(t) \, N_{jq} / T_q}{\sum_q N_{jq} / T_q}   (20)

because t cancels out.
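Equation (20) is easy to evaluate directly. The short Python sketch below computes F_job for the queue structure of Table 3 using the Weibull column of failure rates; the dictionary layout is our own. With these inputs the weighted failure probability comes out near the 12.7% quoted in the text.

```python
# Equation (20): job failure probability across a mix of queues,
# using the Weibull failure rates listed in Table 3.
queues = {
    #            N_jq, T_q (h), F_q (Weibull failure rate)
    "Huge":        (2,   48, 0.6238),
    "Big":         (4,   48, 0.3890),
    "Medium":      (4,   72, 0.2840),
    "Small":       (4,  240, 0.0610),
    "Interactive": (2,    1, 0.1010),
}

numerator = sum(fq * njq / tq for njq, tq, fq in queues.values())
denominator = sum(njq / tq for njq, tq, _ in queues.values())
f_job = numerator / denominator

print(f"F_job = {f_job:.3f}")              # about 0.126, in line with the ~12.7% quoted
print(f"Job success rate = {1 - f_job:.3f}")
```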
Using the proposed queue structure and estimated failure rates from Table 3, we estimate that the overall job failure probability for the entire system will be 12.7%, and the probability of successful job completion over the entire system will be 87.3%. The queue structure of the LANL system analyzed earlier in Section 3 is shown in Table 4, along with the probability of job failure for this queue structure using the method developed in this section. The overall job success rate for this system is 54.5%. (Note that for the Single Node case of N_bq = 1 over a period of 96 hours, the failure rate is higher than might be expected: this statistic indicates that within a large population of single nodes, 7% of the systems will experience a failure after 96 hours.) This figure seems quite pessimistic. However, Petrini [21] notes that on a large ASCI class system with over 12,000 processors, a 4096 processor job has a less than 50% chance of success over a runtime of only 5 hours. Checkpointing on this system is recommended once every hour.

Table 4: Queue structure of the LANL system.
                                               Failure Rate
Queue          N_bq    T_q (hours)    N_jq    Weibull (%)    Exp (%)
Largeq           32        12           1        83.3         49.5
Smallq           14        24           1        22.3         45.0
Single Node       1        96           1         7.0         15.7
5.2 Distribution of Job Runtimes

The queue reliability model presented in the previous section assumes that the offered job load will utilize the maximum amount of resources available for use in the queue. This assumption represents a pessimistic view of queue reliability. In practice, the offered job load will consist of a varying number of nodes, and the actual runtime will be less than or equal to the queue runtime limit. Site administrators usually monitor the number of unused nodes in a queue and adjust queue node limits over time to minimize the number of unused nodes in the system. Schedulers also backfill the queue to increase node utilization. Given this, we believe that the assumption that, over a long period of time, the number of unused nodes in a queue will be very small relative to the total number of available nodes in the queue is valid.

The actual distribution of job runtimes, however, is more complex. In the worst case, all jobs will run up to the wall clock limit of the queue. Lublin and Feitelson [22] analyzed the run times of 166,042 jobs from three locations over a period of 32 months to derive a statistical model of job runtimes. They found that: i) a correlation exists between the run time of a job and the number of nodes used; ii) the distribution of run times most closely matched a hyper-gamma distribution, which is a bi-modal distribution made up of two gamma distributions; iii) the shape of the distribution for a population of jobs depended on the number of nodes used by the job (e.g., small queue, medium queue, large queue). To determine the relative weighting of each gamma distribution for a given number of nodes, they derived a linear relationship between the hyper-gamma parameter p and the number of nodes N. Using this relationship, we can calculate the expectation value of the hyper-gamma distribution for a given N. Using the average fitted α and β parameters for the studied systems (values expressed as the natural log of run time in seconds; the source code for the model described in [22] returns the exp() of the run time, rather than 10^t), α1 = 14.20, β1 = 0.94, α2 = 312.0, β2 = 0.685, and p(N) = -0.0054N + 0.78 (constants also derived from their model), the expectation value of the hyper-gamma distribution as a function of the number of nodes N is

E(N) = p(N) \, \alpha_1 \beta_1 + (1 - p(N)) \, \alpha_2 \beta_2.   (21)

For the range N ∈ [2, 256], E(N) ranges over [2.96, 12.71] hours. Thus, using Lublin's collected job statistics, over a large number of jobs at least half of the offered load requires ≤ 12.71 hours of running time. Using Equation 21 to calculate T_q1, and selecting a pessimistic maximum wall clock time T_q2, a less pessimistic queue reliability statistic can be calculated as the average of the F_q values calculated for T_q1 and T_q2.

5.3 Improving the Overall Job Reliability Profile

An interesting question to consider is whether it is possible to improve the overall job reliability profile of this system by making slight changes to the queue structure. The "Largeq" queue shown in Table 4 has the highest failure rate. If we reduce the number of nodes to 25 and the maximum runtime to six hours, the failure rate for the queue drops to 23.4% (exponential) and 56.4% (Weibull), and the system reliability rises to 72.8% (exponential) and 42% (Weibull). In general, to improve the overall job reliability, there are a number of available options. We can redistribute the nodes assigned to individual queues. For example, if we reduce the number of nodes in the interactive queue shown in Table 3 from 516 to 128, and redistribute the 388 nodes evenly to the Huge and Big queues (increasing the number of nodes assigned to Huge and Big), the system failure rate drops to 7.4%, and the reliability rises to 92.6%. This new queue structure would better serve a user community that needs to run very large jobs, rather than a large number of interactive jobs.

To conclude this section: using MTTF statistics for a node, it is possible to determine the overall operational reliability of individual queues and the overall reliability of a set of queues. Using the analysis presented in this section, a model of overall system reliability can be developed by combining Equation 5 (substituting N with N_bq) and Equation 20. A simple spreadsheet implementation of the model can then be used as a design aid to architect a reliable system when node MTTF is known or can be derived from operational data. The exponential failure distribution provides an optimistic upper bound on reliability, and Weibull a practical lower bound. We assumed that all jobs will require all of the nodes assigned to a queue; in reality, only a fraction of jobs use the maximum number of nodes. By varying N_jq in Equation 20, the effects of a varied mix of jobs on reliability can be assessed.
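The "simple spreadsheet implementation" mentioned above can equally well be a few lines of Python. The hedged sketch below derives a per-queue failure probability from a node MTTF and shape β via Equation (5) (with N replaced by N_bq and t by T_q), and then combines the queues with Equation (20). The node MTTF of 102,840 hours and β = 0.75 are values used elsewhere in the paper, but since the exact scale-parameter convention behind Table 3 is not spelled out, the per-queue numbers produced here should be expected to track, not exactly reproduce, that table.

```python
# Design-aid sketch: per-queue failure probability from node MTTF and beta
# (Equation 5 with N -> N_bq, t -> T_q), combined into F_job (Equation 20).
import math

def queue_failure_rate(n_bq, t_q, mttf_node, beta):
    eta_node = mttf_node / math.gamma(1.0 + 1.0 / beta)      # Equation (3)
    return 1.0 - math.exp(-n_bq * (t_q / eta_node) ** beta)  # 1 - Equation (5)

def f_job(queues, mttf_node, beta):
    """queues: list of (N_bq, T_q, N_jq). Returns Equation (20)."""
    rates = [(queue_failure_rate(n, t, mttf_node, beta), j, t) for n, t, j in queues]
    num = sum(f * j / t for f, j, t in rates)
    den = sum(j / t for _, j, t in rates)
    return num / den

# Table 3 queue shapes: (N_bq, T_q hours, N_jq); node parameters as assumed above.
queues = [(258, 48, 2), (130, 48, 4), (65, 72, 4), (5, 240, 4), (516, 1, 2)]
print(f"F_job = {f_job(queues, mttf_node=102_840, beta=0.75):.3f}")
```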
6. DETERMINING OPTIMAL QUEUE SIZE FOR A CHECKPOINT INTERVAL

We can now use these results to calculate the number of nodes required for a desired checkpoint interval τ̃_opt. Daly [1] determined that for the case in which the time δ required to create a checkpoint file is < MTTF/2, the optimal checkpoint interval is

\tilde{\tau}_{opt} = \sqrt{2 \delta \, \mathrm{MTTF}} - \delta.   (22)

Related work by Vaidya [23] describes a model for determining checkpoint time using a Markov Modulated Poisson Process that requires a memoryless failure process, which is satisfied by an exponentially distributed time between failures. Since the Weibull distribution is not memoryless, there is no convenient closed form solution for the optimal checkpoint time. Nurmi and Wolski [24] developed an approach to dynamically calculate a schedule of optimal checkpoint intervals during execution time. They investigated the effect of memoryless (exponential) and non-memoryless (Weibull, and 2- and 3-phase hyperexponential) distributions on the checkpoint intervals and application efficiency. Their conclusion was that from the perspective of a user, the choice of distribution had little effect on perceived efficiency. However, from an operations perspective over a large number of jobs, the difference was significant.

We apply Daly's formula for optimal checkpoint interval determination, which is optimal if the lifetime distributions are well modeled by an exponential. In our case, we are interested in using it to determine a single "good" checkpoint interval to use for the program. Note that if the lifetime distributions are modeled by a Weibull distribution with shape parameter β < 1, the conditional distributions for lifetime get heavier tailed as time moves forward. That is, a machine modeled by such a Weibull is less likely to fail the longer it remains up. Because of this non-memoryless property, there is no single interval for taking checkpoints that minimizes checkpoint overhead (i.e., is optimal). In practice, however, Daly's method seems to work well in cases where only a single interval is desirable.

Substituting MTTF with the derived Weibull MTTF from Equation 8, the number of nodes required to achieve a desired checkpoint interval τ̃_opt is

N = \left( \frac{2 \delta \, \mathrm{MTTF}_{node}}{(\tilde{\tau}_{opt} - \delta)^2} \right)^{\beta}.   (23)

(We assume that, similar to the practice at the Pittsburgh Supercomputing Center for the Terascale Computing System (TCS) [25], checkpoints are written to local disk, so that the checkpoint time is not related to the number of nodes checkpointed.)

For example, if the time required to create a checkpoint file is 15 minutes, node MTTF is 100,000 hours, β = 0.7, and the desired optimal checkpoint interval τ̃_opt is 10 hours, the number of nodes required to achieve this checkpoint interval is N = 81. Note that as the node MTTF increases, the number of nodes that can be used also increases, since τ̃_opt is fixed. This will allow more work to be performed within a fixed time period with a greater number of nodes, because the nodes are more reliable. Also note that as β decreases, the number of nodes N that can be used also decreases. This is because as the system grows less reliable, fewer nodes can successfully be used to complete a task in the allotted time τ̃_opt.

Design decisions made to minimize cost and maximize job reliability must also take into account the effects of node MTTF, β, and the number of nodes assigned to a queue on the optimal checkpoint interval and the number of nodes necessary to achieve a checkpoint interval. As demonstrated in Table 1, the value of β can vary over the useful lifetime of the system. Designers building a mixed architecture system can blend the ratio of high and low reliability nodes assigned to long runtime queues to achieve a desirable checkpointing profile for the queue and system.
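The worked example above is easy to check numerically. The sketch below evaluates Equation (23) for the stated inputs (δ = 15 minutes, MTTF_node = 100,000 hours, β = 0.7, τ̃_opt = 10 hours); rounding up gives a node count in line with the N = 81 quoted in the text. The function name and variable layout are our own.

```python
# Equation (23): number of nodes whose Weibull cluster MTTF (Eq. 8) yields a
# desired checkpoint interval under Daly's formula (Eq. 22).
import math

def nodes_for_interval(delta_h, mttf_node_h, tau_opt_h, beta):
    return (2.0 * delta_h * mttf_node_h / (tau_opt_h - delta_h) ** 2) ** beta

delta = 0.25           # checkpoint write time: 15 minutes, in hours
mttf_node = 100_000.0  # vendor node MTTF, hours
beta = 0.7             # assumed Weibull shape
tau_opt = 10.0         # desired checkpoint interval, hours

n = nodes_for_interval(delta, mttf_node, tau_opt, beta)
print(f"N = {n:.1f} (round up to {math.ceil(n)})")  # about 80, in line with N = 81 in the text
```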
7. PROBABILITY OF JOB FAILURE USING M HOT SPARE FAILOVER BLADES

Using spare computational nodes within a job to provide failover capability for a running program in the event of a blade failure would help improve job reliability. The reliability model for this scenario is a k-out-of-N system [12]. In this system, at least k out of N nodes must remain functional during the time [0, t] in order for the entire system to be reliable. The reliability function is then

R_{k/N}(t) = \sum_{i=0}^{N-k} \binom{N}{i} \, R(t)^{N-i} \, (1 - R(t))^{i}.   (24)

Using this equation, we can calculate the probability of job failure with k spare nodes. Table 5 shows the job reliability for the blade based system (described earlier) with a maximum queue runtime of 128 hours and a total of 256 computational nodes, which include potentially up to 5 spares.

Table 5: Reliability Using Spares.
# Spares    Exp (%)      Weibull (%)
0           72.7000       6.9
1           95.9000      25.4
2           99.6000      50.2
3           99.9000      72.2
4           99.9990      86.9
5           99.9999      94.7

Clearly, during the start-up phase of a new system installation, provisioning several spares can significantly improve system reliability. Once the system has stabilized and the failure rate becomes less variable, jobs will continue to experience a higher reliability rate. Although this option seems obvious from the hardware point of view, it transfers the fault tolerance burden from hardware to software.
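Equation (24) can be evaluated directly. The sketch below computes the k-out-of-N reliability for a 256 node run of 128 hours with 0-5 spares, using the exponential per-node reliability for the 102,840 hour blade MTTF; with these assumptions the exponential column of Table 5 is reproduced closely. A Weibull per-node reliability can be substituted in the same way, though the exact parameters behind the table's Weibull column are not stated.

```python
# Equation (24): k-out-of-N reliability with a given number of spare nodes.
# Per-node survival uses the exponential model; a Weibull R(t) can be swapped in.
import math

def k_out_of_n_reliability(r_node: float, n_total: int, spares: int) -> float:
    """Job survives if at most `spares` of the n_total nodes fail in [0, t]."""
    return sum(
        math.comb(n_total, i) * r_node ** (n_total - i) * (1.0 - r_node) ** i
        for i in range(spares + 1)
    )

t_hours = 128.0
mttf_blade = 102_840.0
r_node = math.exp(-t_hours / mttf_blade)   # exponential per-node reliability

for spares in range(6):
    r = k_out_of_n_reliability(r_node, n_total=256, spares=spares)
    print(f"{spares} spares: reliability = {100 * r:.4f}%")
# 0 spares gives about 72.7% and 1 spare about 95.9%, matching Table 5's Exp column.
```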
7.1 Combining System Initiated Checkpointing with Failover

We propose a method that uses these spares along with LAM/MPI system initiated checkpointing [2] to improve the probability of job completion. Recovery after failure depends on the existence of periodically created checkpoint files. If no checkpoints are created, the application must be resubmitted and restarted from the beginning with its initial parameters. However, using hot spare nodes for failover, the job scheduler or application controller has the option of replacing the failed node with a spare node and completely restarting the job without requeueing. As long as the wall clock limit for the queue permits another complete run, this strategy will allow an application to quickly restart. Applications can use checkpointing to solve this problem. Sankaran [2] has recently developed a LAM/MPI extension for system initiated checkpointing that does not require user action and is invisible to the application. Using LAM/MPI, we can devise a simple checkpointing strategy to recover from a single node failure or the simultaneous failure of two nodes.

LAM/MPI supports limited process migration by reestablishing network connections between MPI nodes when a job is restarted from checkpoint files. By creating a single cycle of nodes, in which each node has two neighbors, the node checkpoint file can be distributed to the two adjacent neighbors when it is created. When a node fails, the foreman node can retrieve the checkpoint file from the failed node's neighbor and reassign the file to a spare node that is ready to restart the application as a proxy for the failed node. Since every node retains its own checkpoint file, healthy nodes can quickly recover from their own files. If two neighboring nodes fail simultaneously, the checkpoint file for each failed node can be found at the alternate neighbor. As shown in Table 5, overall system reliability significantly increases with the addition of two spares. Going beyond two spares, however, requires a significant amount of data transfer between peers. If checkpoint file transfer uses multicast groups with a high performance full crossbar switch, it may be possible to minimize the time necessary for file replication. To support a larger number of spares, peer-to-peer file distribution schemes, such as Pastiche [26], or the use of neighbor node memory for storing checkpoint data [27], can be used to efficiently distribute checkpoint files and reduce checkpointing time. We plan to continue investigating this method to determine if it is possible to efficiently use a large number of failover nodes to improve job reliability.
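The neighbor-replication bookkeeping described above can be sketched independently of LAM/MPI. The toy Python below arranges ranks in a ring, records where each rank's checkpoint copies live, and answers where to fetch a failed rank's checkpoint from. It only illustrates the placement and lookup logic, with invented names; the actual checkpoint capture and process migration would be handled by the LAM/MPI checkpoint/restart framework [2].

```python
# Toy sketch of the ring-neighbor checkpoint placement described above.
# Each rank's checkpoint is kept locally and replicated to its two ring neighbors.

def checkpoint_replicas(rank: int, n_ranks: int) -> list[int]:
    """Ranks holding a copy of `rank`'s checkpoint: itself plus both neighbors."""
    return [rank, (rank - 1) % n_ranks, (rank + 1) % n_ranks]

def recovery_source(failed: int, n_ranks: int, also_failed: set) -> int | None:
    """Pick a surviving neighbor to fetch the failed rank's checkpoint from."""
    for candidate in ((failed - 1) % n_ranks, (failed + 1) % n_ranks):
        if candidate not in also_failed:
            return candidate
    return None   # both neighbors failed too: the checkpoint copies are lost

n_ranks = 8
print(checkpoint_replicas(3, n_ranks))            # [3, 2, 4]
# Two adjacent failures: each failed rank still has one surviving neighbor copy.
failures = {3, 4}
for f in sorted(failures):
    print(f"rank {f}: recover from rank {recovery_source(f, n_ranks, failures)}")
```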
8. PRACTICAL APPLICATIONS

The mathematical models and checkpointing approaches described in this paper have one goal in mind: to provide a realistic, rigorous, and methodical approach to designing reliable high performance computing clusters from a collection of moderately reliable commodity components. Making use of the results of this paper in practical applications requires several steps to develop a working model of the system and to assess the effects of design choices on reliability. First, the node level mean time to failure statistic must be determined. This figure is often available from a vendor, or can be calculated from field data. The next step is to estimate or derive a Weibull β value to characterize the reliability distribution more accurately than is possible by using an exponential distribution. An important point to note here is that the Weibull distribution is identical to an exponential distribution when β = 1. Using Equation 20 and a simple spreadsheet model describing the number of nodes per queue (N_bq), the maximum wall clock time limit of the queue (T_q), the maximum number of simultaneous jobs that can run per queue (N_jq), and the failure rate for each queue, the composite failure rate for the entire system can be calculated. The optimal checkpointing time τ̃_opt can then be calculated by substituting MTTF in Equation 22 with Equation 8. By adjusting the spreadsheet model parameters, the reliability of each queue and the composite reliability of the system can be determined. To improve reliability beyond the inherent reliability of the nodes and queues, the techniques described in Sections 7 and 7.1 can be used to significantly increase job reliability. Also note that, as shown in Figure 1, the worst system reliability occurs during the first two or three years of operation. This demonstrates the value of negotiating an extended warranty with a system vendor to cover failures in the early years of operation.

9. CONCLUSIONS

In this paper, we derived a model for calculating the operational reliability of a system built up of a heterogeneous set of queues, computational nodes, and spare failover nodes. We based our model on the analysis of failures of systems operating in the field. We found that the time between failures arises from a Weibull distribution over the useful lifetime of the system. Based on our analysis, we described a method to calculate the number of nodes necessary to achieve an optimal checkpoint time, and proposed a model for combining system initiated checkpointing with hot spare failover nodes to improve system reliability. In comparing the effects of using an exponential versus a Weibull distribution for analysis, we have found that the exponential distribution results in inflated reliability estimates in the initial stage of the system's operation. Based on these results, we believe that the Weibull distribution, rather than the exponential distribution, should be used for failure analysis of systems operating in the field.

Acknowledgment

The authors would like to thank Los Alamos National Laboratory for making detailed failure data available, and Rich Wolski for helpful encouragement and comments.

10. REFERENCES

[1] J. T. Daly, "A strategy for running large scale applications based on a model that optimizes the checkpoint interval for restart dumps," International Workshop on Software Engineering for High Performance Computing System Applications, 2004.
[2] S. Sankaran, J. M. Squyres, B. Barrett, V. Sahay, and A. Lumsdaine, "The LAM/MPI checkpoint/restart framework: System-initiated checkpointing," International Journal of High Performance Computing Applications, vol. 19, no. 4, pp. 479-493, 2005.
[3] S. D. Kleban and S. H. Clearwater, "Computation-at-risk: Assessing job portfolio management risk on clusters," in IPDPS. IEEE Computer Society, 2004.
[4] K. J. Ryan and C. S. Reese, "Estimating reliability trends for the world's fastest computer," Los Alamos National Laboratory, Tech. Rep. LA-UR-00-4201, 2000.
[5] T. Heath, R. P. Martin, and T. D. Nguyen, "Improving cluster availability using workstation validation," in SIGMETRICS. ACM, 2002, pp. 217-227.
[6] D. Nurmi, J. Brevik, and R. Wolski, "Quantifying machine availability in networked and desktop grid systems," University of California, Santa Barbara, Computer Science, Tech. Rep. ucsb cs:TR-2003-37, Nov. 2003.
[7] J. Brevik, D. Nurmi, and R. Wolski, "Automatic methods for predicting machine availability in desktop grid and peer-to-peer systems," in CCGRID. IEEE Computer Society, 2004, pp. 190-199.
[8] D. Nurmi, J. Brevik, and R. Wolski, "Modeling machine availability in enterprise and wide-area distributed computing environments," in Euro-Par 2005, Parallel Processing, 11th International Euro-Par Conference, Lisbon, Portugal, August 30 - September 2, 2005, Proceedings, ser. Lecture Notes in Computer Science, vol. 3648. Springer, 2005, pp. 432-441.
[9] Y. Zhang, M. S. Squillante, A. Sivasubramaniam, and R. K. Sahoo, "Performance implications of failures in large-scale cluster scheduling," in JSSPP, ser. Lecture Notes in Computer Science, vol. 3277. Springer, 2004, pp. 233-252.
Gibson, "A large-scale study of failures in high-p erformance computing systems," in Proceedings of International Symposium on Dependable Systems and Networks (DSN). IEEE Computer Society, 2006, pp. 249­258. C. Eb eling, An Introduction to Reliability and Maintainability Engineering. Boston, MA: McGraw-Hill, 1997. D. L. Grosh, Primer of Reliability Theory. New York, NY: John Wiley, 1989. Los Alamos National Lab oratory. (2006) Raw op erational data on system failures. [Online]. Available: http://www.lanl.gov/pro jects/computerscience/data/ EasyFit Statistical Package, "http://www.mathwave.com/products/easyfit.html." N. Ra ju, Gottumukkala, Y. Liu, C. B. Leangsuksun, R. Nassar, and S. Scott2, "Reliability analysis in hp c clusters," Proceedings of the High Availability and Performance Computing Workshop, 2006. M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer, "Failure data analysis of a LAN of windows NT based computers," in Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems (SRDS '99). Washington - Brussels - Tokyo: IEEE, Oct. 1999, pp. 178­189. B. Schroeder and G. Gibson, "Disk failures in the real world: What does an mttf of 1,000,000 hours mean to you?" in Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST 2007). USENIX, Feb. 13­16 2007. [18] E. Pinheiro, W.-D. Web er, and L. A. Barroso, "Failure trends in a large disk drive p opulation," in Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST 2007). USENIX, Feb. 13­16 2007. [19] D. N. P. Murthy, M. Xie, and R. Jiang, Weibul l Models. Wiley Series in Probability and Statistics, Wiley-Interscience, 2003. [20] M. Rausand and A. Høyland, System Reliability Theory: Models, Statistical Methods and Applications Second Edition. Wiley-Interscience, 2003. [21] F. Petrini, "Scaling to Thousands of Processors with Buffer Coscheduling," in Scaling to New Heights Workshop, Pittsburgh, PA, Aug 2002. [22] Lublin and Feitelson, "The workload on parallel sup ercomputers: Modeling the characteristics of rigid jobs," JPDC: Journal of Paral lel and Distributed Computing, vol. 63, 2003. [23] N. H. Vaidya, "Impact of checkp oint latency on overhead ratio of a checkp ointing scheme," IEEE Trans. Computers, vol. 46, no. 8, pp. 942­947, 1997. [24] D. Nurmi, R. Wolski, and J. Brevik, "Model-based checkp oint scheduling for volatile resource environments," University of California, Santa Barbara, Computer Science, Tech. Rep. TR-2004-25, Nov. 6 2004. [25] N. Stone, J. Kochmar, R. Reddy, J. R. Scott, J. Sommerfield, and C. Vizinok, "A checkp oint and recovery system for the pittsburgh sup ercomputing center terascale computing system," Pittsburgh Sup ercomputer Center, Tech. Rep. CMU-PSC-TR-2001-0002, 2001. [26] L. P. Cox, C. D. Murray, and B. Noble, "Pastiche: Making backup cheap and easy," in Proceedings of the 5th ACM Symposium on Operating System Design and Implementation (OSDI-02), ser. Op erating Systems Review. New York: ACM Press, Dec. 9­11 2007, pp. 285­298. [27] R. A. Oldfield, "Investigating lightweight storage and overlay network for fault tolerance," Proceedings of the High Availability and Performance Computing Workshop, 2006. 53