Environmentally Responsible Middleware: An Altruistic Behavior Model for Distributed Middleware Components

Joe Meehean
Department of Computer Sciences
University of Wisconsin-Madison
1210 W. Dayton St, Madison WI 53706
jmeehean@cs.wisc.edu

Miron Livny
Department of Computer Sciences
University of Wisconsin-Madison
1210 W. Dayton St, Madison WI 53706
miron@cs.wisc.edu

Categories and Subject Descriptors
C.2.4 [Computer Systems Organization]: Computer-Communication Networks--Client/server, Distributed applications; C.4 [Computer Systems Organization]: Performance of Systems--Fault tolerance, Reliability, availability, and serviceability

General Terms
Design, Management, Reliability

Keywords
middleware, altruistic behavior model, resource exhaustion

To build a reliable distributed infrastructure, the middleware that comprises this infrastructure must follow a behavior model that is environmentally aware and responsible. Current behavior models rely on the operating system and human vigilance to prevent middleware components from exhausting system resources. Our experience with Open Science Grid (OSG) [2] and Enabling Grids for E-sciencE (EGEE) [1] has revealed the inadequacy of these behavior models for distributed infrastructure software.

The operating system is concerned with preventing full system failure over the failure of individual processes and often does not respond to resource shortages until circumstances are dire and drastic measures are required. By the time the operating system does respond, middleware components may have already failed because of denied resource requests or may appear unresponsive due to resource contention. Therefore, relying on the operating system to handle resource exhaustion only ensures that the operating system will not crash; it will not prevent failure of middleware components.

In a distributed environment, users are unable to micro-manage software execution.
Fundamental management functionality, such as process listing, may be unavailable or may offer inaccurate results in a large, distributed environment. Furthermore, to reduce dependencies among middleware components and maximize resource usage, some components are deployed on-the-fly [6] and thus cannot be managed with the same vigilance as their stationary counterparts. Distributed environments often span many administrative domains, further complicating component management. An administrator or user in one domain may not have sufficient security permissions to perform fine-grained management of the middleware components running in other domains. Therefore, the traditional mechanism of human vigilance to prevent and resolve issues of overextended resources is insufficient in a distributed environment.

Copyright is held by the author/owner(s). HPDC'07, June 25-29, 2007, Monterey, California, USA. ACM 978-1-59593-673-8/07/0006.

To compound this management problem, many middleware components often run concurrently on the same machine. For example, a typical OSG job submission node, or compute element (CE), runs several middleware components, including a grid gatekeeper, at least one job submission service, monitoring services, an OpenLDAP server, file transfer services, and certificate revocation software [3]. These nodes are further loaded because users can execute jobs directly on the CE itself. Starting a single user job can result in the creation of several processes, some by the job and others to manage the job. Any one of the administrator-installed components, user jobs, or any combination thereof can unintentionally place the CE and the grid infrastructure under high load and cause it to become unresponsive. Access to the entire compute site is jeopardized by an unresponsive CE. Additionally, unresponsive or unexpectedly terminated middleware components on the CE can result in lost work.
We argue that relying solely on human vigilance and operating system management to cope with overextended resources in a distributed infrastructure increases maintenance costs and reduces reliability. Middleware components must provide reliable service, but abrupt terminations and overloaded system resources undermine this reliability. These components require new behavior models that handle resource exhaustion before the operating system must take extreme action or components become unresponsive.

We propose an altruistic behavior model for middleware components that is based loosely on the reliability model used in electrical grids. Industrial circuit breakers increase the reliability of the electrical grid by preventing high load from damaging equipment or creating cascading failures. In the electrical grid, temporary, small-scale unavailability is preferable to long-term or large-scale unavailability. Although a tripped circuit breaker can result in a blackout for a portion of the electrical grid, it can also prevent damage to the infrastructure and isolate the fault to a small portion of the grid. Conversely, an unnoticed, unmanaged power surge can result in damaged power lines and large-scale, long-term blackouts [7].

Middleware components that follow our altruistic behavior model voluntarily shut down when local system resources are nearly exhausted to prevent failure due to resource contention and to prevent placing the system into a state of duress. Although this may seem counterintuitive, voluntary shutdown provides more reliability than unresponsive components and resource-starved machines. We have labeled our behavior model altruistic because it prevents placing the entire system into a state of duress through self-sacrifice.
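
As a rough illustration of the voluntary shutdown described above, the following Python sketch checks local resources before each unit of work and shuts the component down when a threshold is crossed. The names `DuressLimits`, `sample_system`, `under_duress`, and `altruistic_step`, the threshold values, and the choice of load average and free disk space as indicators are all assumptions made for this example; the behavior model itself does not prescribe them.

```python
import os
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class DuressLimits:
    """Hypothetical thresholds; real values would be site-specific."""
    max_load_per_cpu: float = 2.0        # 1-minute load average per CPU
    min_free_disk_fraction: float = 0.05  # fraction of disk that must stay free


def sample_system(path="/"):
    """Return (load_per_cpu, free_disk_fraction) for the local machine (Unix only)."""
    load_1min, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    usage = shutil.disk_usage(path)
    return load_1min / cpus, usage.free / usage.total


def under_duress(load_per_cpu, free_disk_fraction, limits=DuressLimits()):
    """True when any sampled resource is close to exhaustion."""
    return (load_per_cpu > limits.max_load_per_cpu
            or free_disk_fraction < limits.min_free_disk_fraction)


def altruistic_step(do_work, shut_down, sample=sample_system, limits=DuressLimits()):
    """Do one unit of work, or shut down voluntarily if the system is under duress.

    Returns True if work was done, False if the component shut itself down.
    """
    load_per_cpu, free_disk_fraction = sample()
    if under_duress(load_per_cpu, free_disk_fraction, limits):
        shut_down()  # assumed to save state and notify dependents before exiting
        return False
    do_work()
    return True
```

A component's main loop could call `altruistic_step` repeatedly and stop at the first `False`. Passing the sampler as a parameter keeps the policy separate from how resources are measured, so a deployment could substitute memory, file descriptors, or any other indicator it considers relevant.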
A system can enter a state of duress for three reasons: (1) the altruistic component is unintentionally overloading system resources, (2) another component is placing a large demand on the system resources, or (3) no individual component is misbehaving but the sum of their resource demands exceeds an acceptable limit. In the first instance, shutting down the altruistic component eliminates the problem. Although removing the altruistic component in the second instance does not resolve the problem, it prevents the altruistic component from aggravating the problem. If the machine is simply under-provisioned, as in the third instance, removing the altruistic component can lower the system load to a safe level. At the very least, the self-termination of the altruistic component prevents it from adding to the system overload.

Our altruistic behavior model for middleware components has four main functional advantages over autistic behavior models: (1) it reduces administration costs, (2) it prevents unexpected failures due to resource shortages, (3) it reduces cascading failures, and (4) it allows middleware components to act as information providers.

Unresponsive machines and crashed programs often require administrative intervention. Altruistic middleware reduces the frequency of this intervention by automatically handling system resource exhaustion without the need for administrative assistance. Machines can still require administrative assistance if they are chronically under-provisioned, but our behavior model separates these under-provisioning problems from programming errors and demand spikes. This is especially beneficial in a distributed environment where middleware is deployed on multiple machines across multiple administrative domains. Ultimately, the altruistic behavior model's response to system duress reduces administrative costs and may foster good will from administrative staff.
When system resources are exhausted, the operating system may terminate processes to prevent its own failure, or programs may fail while attempting to allocate more resources. Altruistic middleware mitigates these problems by self-terminating to reduce the strain on system resources. This ensures that an altruistic component does not indirectly cause the unexpected termination of itself or another program because of resource shortages.

Unresponsive machines and middleware components place an additional load on the distributed system; active components may become unresponsive from attempting to contact their unresponsive counterparts, or they may be forced to wait until fault-detection software determines that the unresponsive component has failed [4]. This can create a cascade effect of unresponsive components, as each unresponsive component causes a set of dependent components to become unresponsive. Altruistic middleware may reduce the number of these cascading failures; by performing a voluntary shutdown, the middleware can notify dependent components of its impending unavailability. This notification allows other components to isolate the failure and moves the distributed system closer to a fail-stop model [8].

Altruistic middleware components also act as information providers for the management layers that spawned them. The self-termination of an altruistic component notifies its management layer that the system is in a state of duress. This notification ensures that the management layer will not distribute more work to that component, and can influence future decisions about work distribution once the component is restarted.

Some possible objections to our altruistic behavior model are that it (1) appears to introduce a new class of failures, (2) reduces availability, and (3) may add to administrative costs through manual restarts. We have carefully examined each of these objections and address them in turn.
Altruistic behavior does not introduce new failures, but rather allows components to handle resource exhaustion before it causes unexpected failures. An altruistic component can perform shutdown maintenance, such as saving application state, when it receives an advance warning of resource exhaustion. This graceful shutdown prevents data loss or expensive data recovery when a component restarts.

Although our altruistic behavior model may reduce the availability of a single component, genuine high availability requires features beyond a die-hard attitude from the component. Components and machines crash for a variety of reasons beyond developer control. Thus, additional mechanisms, such as replication and fail-over, are needed to provide high availability [5]. These mechanisms can treat altruistic shutdown as any other failure, but altruistic components have the added ability to warn the distributed system of their impending unavailability, thereby reducing latency for fail-over mechanisms [4].

Another objection to the altruistic behavior model is that it may increase the workload for administrators who must continually restart middleware that has shut down because of transient increases in resource demands. This problem is mitigated by providing mechanisms to automatically restart self-terminated middleware components when conditions improve. One approach is to restart terminated altruistic middleware periodically; if the system is still under duress, the middleware will simply exit immediately. A more sophisticated approach only restarts an altruistic component once the system resource usage has fallen below a given threshold. With these techniques, altruistic middleware requires administrative intervention only if a system remains in a state of duress for an extended period of time. These extended periods of system duress may indicate that the machine is under-provisioned and requires administrative assistance beyond simply restarting failed components.
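
The threshold-based restart approach mentioned above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the function name `restart_when_calm`, its parameters, and the idea of reducing "system resource usage" to a single number returned by a caller-supplied `system_load` function are all hypothetical and not taken from any existing middleware.

```python
import time


def restart_when_calm(start_component, system_load, threshold,
                      poll_seconds=60.0, max_polls=None, sleep=time.sleep):
    """Poll until system load falls below `threshold`, then restart the component.

    `system_load` is a caller-supplied function returning a number; how it
    measures load is left open (an assumption of this sketch).
    Returns the number of polls made before restarting, or None if
    `max_polls` polls passed without the system leaving its state of duress.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if system_load() < threshold:
            start_component()
            return polls
        sleep(poll_seconds)  # system still under duress: wait and re-check
    return None
```

A `None` result corresponds to the case discussed above in which the system stays under duress for an extended period; a supervisor could treat it as the point at which to alert an administrator rather than keep retrying.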
REFERENCES

[1] Enabling Grids for E-Science. http://www.eu-egee.org.
[2] Open Science Grid. http://www.opensciencegrid.org/.
[3] Personal communications with A. Roy, OSG Software Coordinator.
[4] W. Chen, S. Toueg, and M. Aguilera. On the quality of service of failure detectors. IEEE Transactions on Computers, 51(5):561-580, 2002.
[5] J. Gray and D. P. Siewiorek. High-availability computer systems. IEEE Computer, 24(9):39-48, 1991.
[6] S. Klous, J. Frey, S.-C. Son, D. Thain, A. Roy, M. Livny, and J. van den Brand. Transparent access to grid resources for user software. Concurrency and Computation: Practice and Experience, 18(7):787-801, 2006.
[7] E. J. Lerner. What's wrong with the electric grid? The Industrial Physicist, 9(5), 2003.
[8] F. B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys, 22(4):299-319, 1990.