Charisma: Orchestrating Migratable Parallel Objects

Chao Huang and Laxmikant V. Kalé
University of Illinois at Urbana-Champaign
201 N Goodwin Ave, Urbana, IL 61801, USA
{chuang10,kale}@cs.uiuc.edu

ABSTRACT
The parallel programming paradigm based on migratable objects, as embodied in Charm++, improves programmer productivity by automating resource management. The programmer decomposes an application into a large number of parallel objects, while an intelligent run-time system assigns those objects to processors. It migrates objects among processors to effect dynamic load balance and communication optimizations. In addition, having multiple sets of objects representing distinct computations leads to improved modularity and performance. However, for complex applications involving many sets of objects, Charm++'s programming model tends to obscure the global flow of control in a parallel program: one must look at the code of multiple objects to discern how the multiple sets of objects are orchestrated in a given application. In this paper, we present Charisma, an orchestration notation that allows expression of Charm++ functionality without fragmenting the expression of control flow. Charisma separates the expression of parallelism, including control flow and macro data-flow, from the sequential components of the program. The sequential components only consume and publish data. Charisma allows expression of multiple patterns of communication among message-driven objects. A compiler generates Charm++ communication and synchronization code via static dependence analysis. As Charisma outputs standard Charm++ code, the functionality and performance benefits of the adaptive run-time system, such as automatic load balancing, are retained. In the paper, we show that Charisma programs scale up to 1024 processors without introducing undue overhead.

Categories and Subject Descriptors
D.1.3 [Concurrent Programming]: Parallel Programming; D.3.3 [Programming Languages]: Language Constructs and Features

General Terms
Design, Languages

Keywords
Adaptivity, Parallel Programming Productivity, Migratable Objects, Orchestration

1. INTRODUCTION
Our approach to parallel programming seeks an optimal division of labor between the run-time system and the programmer. In particular, it is based on the idea of migratable objects. The programmer decomposes the application into a large number of parallel computations executed on parallel objects, while the run-time system assigns those objects to processors (Figure 1). This approach gives the run-time system the flexibility to migrate objects among processors to effect load balance and communication optimizations.

Figure 1: With Charm++, the user programs with objects ("User View") and the system maps objects to processors ("System Implementation")

Charm++ is a framework that embodies this concept. Charm++ objects, or chares, execute parallel subtasks in a program and communicate via asynchronous method invocations. A method on a remote object can be invoked by a message, and the caller does not wait for the method to return. Many chares can be organized into an indexed collection called a chare array, and a single program may contain multiple chare arrays for different sets of subtasks. Applications developed with Charm++ enjoy several performance benefits, including adaptive overlap of communication and computation, automatic load balancing, system-level fault tolerance support, and communication optimizations.
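For readers unfamiliar with Charm++, the following minimal sketch illustrates this model; it is not taken from the paper, and the module, class, and method names are purely illustrative. A chare array is declared in an interface file, and entry methods on its elements are invoked asynchronously through a proxy:

// hello.ci -- hypothetical Charm++ interface file
mainmodule hello {
  mainchare Main {
    entry Main(CkArgMsg* m);
  };
  array [1D] Worker {
    entry Worker();
    entry void doWork(int step);   // asynchronous entry method
  };
};

// In the C++ code: create 1000 objects; the run-time maps them to processors
CProxy_Worker workers = CProxy_Worker::ckNew(1000);
workers[42].doWork(0);   // send a message to one element; the caller does not block
workers.doWork(0);       // broadcast to every element of the chare array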
The idea of using over-decomposition and indirection in mapping work to processors has been studied in the past, including in DRMS [1]. Using process migration for load balancing has also been investigated [2], and this broad approach has gained momentum recently [3, 4].

A large number of applications have been developed using the Charm++ framework, such as NAMD [5], a production-level molecular dynamics program which has demonstrated unprecedented speedups on several thousand processors, and LeanCP [6], a quantum-chemistry simulation application. Other examples include rocket simulation, crack propagation, space-time meshing with discontinuous Galerkin solvers, dendritic growth in solidification processes, level-set methods, computational cosmology simulations, and parallel visualization of cosmology data.

Although Charm++ has demonstrated its utility in run-time optimizations such as load balancing, and although it is more modular than MPI (see [7]), it can be challenging to clearly express the flow of control due to its local view of control, especially for complex applications that involve multiple sets of chare arrays, as seen in the motivating example in the next section. Also, in Charm++, methods clearly mark the places where data is received, but the places where data is sent (invocations) can be buried deep inside functions of the object code. This asymmetry often makes it hard to see the parallel structure of an application, which is useful for understanding performance issues.

We present Charisma, a higher-level language notation that retains the benefits of Charm++ while allowing for easy expression of the global flow of control as well as symmetric expression of communication. Charisma separates sequential code fragments (methods) from the parallel constructs, and allows the programmer to describe the global control flow with a script language. Since the script language controls the behavior of a collection of migratable objects, we also call it an orchestration language.

2. MOTIVATION
Many scientific and engineering applications have complex structures. Some may involve a large number of components with complicated interactions between them. Others may contain multiple modules, each with complex structures. Unfortunately, for these applications, conventional parallel programming models do not handle the balance between high performance and programming productivity well.

OpenMP [8] programs have a shared view of data and control. The programmer writes code for all the components of the program, with only independent loop iterations executed in parallel. This model may be easy to program for a subset of applications, but it is often incapable of taking advantage of large-scale parallelism among modules and concurrent control flows, and consequently suffers poor scalability.

MPI [9], which represents the message passing model, provides a processor-centric programming model. A parallel job is divided into subtasks according to the number of available processors, and the data needed for each subtask is localized onto its processor. The user then expresses an algorithm in the context of local MPI processes, inserting message passing calls to exchange data with other processes.
Basically, MPI provides a local view of data and a local view of control, although for SPMD programs the global flow of control is often similar to the local flow of control. Performance-wise, MPI programs can achieve high scalability, especially if the program has "regular" patterns, typically with systolic computation-communication super-steps. Some algorithms are simply too difficult to be written in such a fashion. In terms of productivity, this model is fairly easy to program when the application does not involve many modules. Otherwise the programmer has to first partition the processors between modules, losing the potential performance opportunity of overlapping communication and computation across modules, as well as of doing resource management across modules. Some programmers may choose to assign multiple roles to the same group of processors for the sake of performance. This results in complexity in writing the message passing procedures, and compromises productivity.

For a concrete example, consider the 3D molecular dynamics simulation application NAMD [10], illustrated in Figure 2 (taken from [10]). This simplified version of NAMD contains three types of components. The spatially decomposed cubes, shown by squares with rounded corners, are called patches. A patch, which holds the coordinate data for all the atoms in the cube of space corresponding to that patch, is responsible for distributing the coordinates, retrieving forces, and integrating the equations of motion. The forces used by the patches are computed by a variety of compute objects, with Angle Compute and Pairwise Compute shown in the figure as examples. There are several varieties of compute objects, responsible for computing different types of forces (bond, electrostatic, constraint, etc.). Some compute objects require data from one patch and only calculate interactions between atoms within that single patch. Others are responsible for interactions between atoms distributed among neighboring patches. PME objects implement the Particle Mesh Ewald method [11], which is used to compute the long-range electrostatic forces between atoms. PME requires two 3D Fast Fourier Transform (FFT) operations. The 3D FFT operations are parallelized through a plane decomposition, where first a 2D FFT is computed on a plane of the grid, followed by a global transpose and a 1D FFT along the third dimension.

Figure 2: Structure of a Molecular Dynamics Simulation Application: NAMD

The simulation in NAMD is iterative. At each time step, the patches send out coordinate data to compute objects and PME objects as necessary, and the compute objects and PME objects perform the force calculations in parallel. When force information is available, it is communicated back to the patches, where integration is performed. When we consider the various programming models for this relatively simple molecular dynamics application, we find it often difficult to reach a graceful balance between productivity and performance.
Programming with OpenMP, one will write code that, in effect, serializes the coordinate distribution, angle force calculation, pairwise force calculation, PME calculation, force reduction, and patch integration. The flow of the program looks clear, but it is unable to parallelize concurrent subtasks, such as angle force calculation and pairwise force calculation, unless wildcard receives with awkward cross-module flow of control are used. Performance and scalability are sacrificed for the ease of programming.

MPI allows the programmer to partition the job into groups of subtasks and assign the subtasks onto partitions of the available MPI processes. The programmer can choose to overlap several subtasks on the same set of processes to keep the CPUs busy. For example, if some patch objects and some compute objects reside on the same processor, the patches may use the CPU for the coordinate multicast, and subsequently yield the CPU to the compute objects for force calculation. Since MPI message passing is based on processors, when the programmer wants to express the intention to "send a message to subtask S", he/she needs to make the MPI call to send the message explicitly to processor rank K instead of to subtask S's ID. Therefore, the programmer has to maintain a mapping between subtask IDs and process ranks.

To achieve higher CPU utilization, we want to be able to process messages as soon as they are received. While the message passing model does in-order message processing with tag matching, the interconnect may deliver messages out of order, so the system overhead of buffering out-of-order arrivals is difficult to avoid. The programmer can take advantage of wildcard source and tag matching, accepting any incoming message and processing it accordingly. While it is possible to achieve high efficiency this way, the approach has a major productivity drawback. When there are multiple subtasks from multiple components on one processor, it is difficult to maintain a definite mapping from an arbitrary incoming message to its destination and handler function. The message passing calls will look confusing, and the flow of control cannot be expressed clearly.

Charm++, like MPI, provides a local view of control, but unlike MPI, it takes an object-based approach. The programmer writes code for various classes for different subtasks, then instantiates object arrays of arbitrary size from such classes. These objects are assigned to physical processors by the run-time system automatically, and therefore the programmer is not restricted by the concept of processor. In Charm++'s asynchronous method invocation model, each object's code specifies, in a reactive manner, what the object will do when presented with a particular message. When a message is delivered, its destination object and the method to invoke on that object are stated. Because the message contains information on what to do with it at the receiver side, this can be called an active message [12]. Such active messages ensure the prompt processing of data as they become available, and the Adaptive Run-Time System (ARTS) offers further opportunities for performance optimization. However, for complex programs with a large number of object arrays, this comes at the cost of obscuring the overall flow of control: the transfer of control is fragmented by the message sending between objects. To follow the flow of control, one often needs to dig deep into the objects' class code and hop from one class to another, while also understanding the parallel operations, such as broadcast, multicast and reduction, among the objects. This poses some difficulty in the expression of the parallel program, for both the programmer and its readers.
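The fragmentation can be seen in a schematic sketch like the following (hypothetical chare classes and proxies, echoing the pseudocode shown later in Figure 5): a single logical step, "distribute coordinates, compute forces, integrate", is spread across methods of two classes, and neither class shows the whole picture.

void Patch::sendCoords() {
  computesProxy[thisIndex].recvCoords(coords);   // control continues inside Compute
}
void Compute::recvCoords(Coords c) {
  Force f = calcForce(c);
  patchesProxy[thisIndex].recvForce(f);          // ...and hops back to Patch
}
void Patch::recvForce(Force f) {
  integrate(f);                                  // a reader must visit both classes to see the flow
}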
The above example is not an extremely complicated parallel program. Indeed, it has only three types of components and a few short-running concurrent control flows. A quantum chemistry simulation [6] under development using Charm++ involves 11 different parallel structures, together with complex concurrent flows (see Figure 3). Clearly, understanding the global control flow by looking at individual objects' code is difficult.

Figure 3: Structure of a Quantum Chemistry Simulation Application: LeanCP

The language we propose, Charisma, aims at achieving high programming productivity without losing the performance benefits of the ARTS. It describes the global view of control in a parallel application or module.

3. LANGUAGE DESIGN
Charisma employs a macro dataflow approach for productive parallel programming. The programmer writes a script-like orchestration program containing statements that produce and consume collections of values.
From analyzing such producing and consuming statements, the control flows can be organized, and messages and method invocations can be generated. This idea is similar to the macro dataflow model [13] and the hybrid dataflow architecture model [14]. In [14], the data-driven distributed control model is combined with the traditional von Neumann sequential control model to exploit fine-grain parallelism without sacrificing the performance benefits of existing optimizations such as pipelining. In contrast to instruction-level dataflow, Charisma's object-level macro dataflow mechanism takes advantage of the message-driven execution model of Charm++ and enables dynamic resource management such as automatic load balancing.

A Charisma program consists of two components: the orchestration code (in a .or file) that describes the global view of control, and the sequential code (in .h and .C files) that specifies the local behavior of individual objects. The Charisma compiler generates parallel code from the orchestration statements and integrates the sequential methods to produce the target Charm++ program. By separating parallel code from sequential code, the programmer can focus better on the local actions on the objects, such as the physics computation.

3.1 Parallel Object Array
In Charisma, a program is composed of parallel objects. A collection of such objects can be organized into an array to perform a subtask, such as the patches and the force calculators in the previous NAMD example. Although they are called "arrays", these are really collections of objects indexed by a very general indexing mechanism. In particular, the objects can be organized into 1-D or multi-dimensional arrays that can be sparse, or into collections indexed by arbitrary bit patterns or bit strings. One can also dynamically insert and delete elements in an object array. Charm++'s ARTS is responsible for adaptively mapping the object array elements onto the available physical processors efficiently. Moreover, these objects are migratable with support from the ARTS. Once created, the parallel objects report their workload at run-time to the system load balancer, and the load balancer will automatically migrate objects as necessary to achieve higher overall utilization.

class Cell : ChareArray2D;
class CellPair : ChareArray4D;

obj cells : Cell[N,N];
obj cellpairs : CellPair[N,N,N,N];

Above is an example of object array declaration in orchestration code for a 2-D molecular dynamics (MD) application. The first part is the class declaration for the classes Cell and CellPair. The second part is the instantiation of two object arrays, cells and cellpairs, from these classes. The array cells is responsible for holding the atom information in the 2-D partition that corresponds to its index, and the array cellpairs does the pairwise force calculation for a pair of cells objects.

The programmer also provides sequential code that specifies the behavior of individual objects. There is a .h file for each class, with the class member variables and methods needed for the sequential user code. Note that this header file does not contain the complete class declaration; it only declares the variables and methods used in the sequential code. The definitions of those sequential functions are provided in the .C files.

3.2 Foreach Statement
In the main body of the orchestration code, the programmer describes the behavior and interaction of the elements of the object arrays using orchestration statements. The most common kind of parallelism is the invocation of a method across all elements in an object array. Charisma provides a foreach statement for specifying such parallelism. The keywords foreach and end-foreach form an enclosure within which the parallel invocation is performed. The following code segment invokes the entry method doWork on all the elements of array myWorkers.

foreach i in myWorkers
  myWorkers[i].doWork();
end-foreach

The foreach statement looks very much like the FORALL statement in HPF [15]. Indeed, they both express the global flow of control. In HPF, FORALL provides a parallel mechanism for value assignment to elements of a distributed data array, whereas the foreach statement in Charisma specifies the parallelism among entry method invocations on parallel objects. The programmer can have multiple statements within one foreach enclosure, if those statements are invoked on the same object array with the same indexing. This is really a shorthand notation for having one foreach enclosure for each of these statements. Note also that the implementation does not need to broadcast a control message to all objects to implement this. Global control can be compiled into local control, modulated by the data dependences described below.
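As a small illustration of the shorthand just described, the two invocations below (on hypothetical entry methods of the same array, with the same indexing) are equivalent to two back-to-back foreach enclosures:

foreach i in myWorkers
  myWorkers[i].prepare();
  myWorkers[i].doWork();
end-foreach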
3.3 Producer-Consumer Model
In the MPI model, message passing is done by specifying the destination processor's rank and communicator, with a tag to be matched. As explained earlier, this mechanism does not always work well in achieving both performance and clear algorithm expression in the presence of complex parallel programs. Charm++'s message delivery specifies the destination object and the function handler. With this information, the destination object knows which function to invoke to process the incoming message. While Charm++ offers a more intuitive way to deal with communication between subtasks, the programmer still needs to worry about sending and receiving messages while writing the sequential part of the code.

To further separate the task of writing communication code for parallelism from composing the sequential computation blocks in a parallel program, Charisma supports producer-consumer communication directly. In the orchestration code, there is no function call for explicitly sending or receiving a message between objects. Instead, each object method invocation can have input and output parameters. Here is an orchestration statement that exemplifies the syntax for input and output of an object method workers.foo.

foreach i in workers
  <q[i]> := workers[i].foo(p[i+1]);
end-foreach

Here, the entry method workers[i].foo produces (or publishes, in Charisma terminology) a value q, enclosed in a pair of angle brackets before the publishing sign ":=". Meanwhile, p is the value consumed by the entry method. An entry method can have an arbitrary number of published (produced and reduced) values and consumed values. In addition to basic data types, each of these values can also be an object of arbitrary type. The values published by A[i] must have the index i, whereas values consumed can have the index e(i), an index expression of the form i±c where c is a constant. Although we have used different symbols (p and q) for the input and the output variables, they are allowed to overlap.

The variables that can be used as input and output values are declared in the parameter space in Charisma. The variables in the parameter space correspond to global data items or data arrays of a restricted shared-memory abstraction. The programmer uses them solely in the orchestration code to facilitate the producer-consumer model, and has no knowledge of them in the local-view sequential code. A parameter variable can be of an intrinsic or user-defined data type, or a data array.

param error : double;
param atoms : AtomBucket;
param p     : double [256];

When composing the sequential code in Charisma, the programmer does not need knowledge of the sources of the input data or the destinations of the output data. The input data is seen as parameters passed in, and the output data is published via a local function call. Specifically, for producing, the reserved keyword outport marks the parameter name to be produced as it appears in the orchestration code, and a produce call associates the outport parameter name with an actual local variable whose value is to be sent out. For instance, in the sequential code for WorkerClass::foo, the programmer makes a local function call produce with the outport variable q to publish the value of a local variable local_q (assuming p and q are of double precision type).

WorkerClass::foo(double p[], outport q) {
  local_q = ...;
  ...
  produce(q, local_q);
}

Fortran-M [16] is similar to Charisma in that both use the concept of a port. In Fortran-M, ports are connected to create channels from which point-to-point communications are generated; this is useful in facilitating data exchange between dissimilar subtasks. Charisma analyzes the inports and outports of data and generates communications for both point-to-point and collective operations among object arrays, by analyzing data dependences among parameters in the orchestration code.

The goal of Charisma is to provide a way of clearly expressing the global flow of control in complicated parallel programs. In addition, Charisma is built on top of a powerful adaptive run-time system which offers the generated program performance benefits at no additional cost in programming complexity.
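Putting the pieces of this subsection together, a minimal sketch might look as follows. The class Worker, its entry methods, and the parameter q are illustrative (assuming a 1-D analogue, ChareArray1D, of the array types shown in Section 3.1); the surrounding program skeleton is elided, and the consuming statement reads a neighbor's value q[i-1], ignoring boundary handling.

class Worker : ChareArray1D;
obj workers : Worker[N];
param q : double[N];

foreach i in workers
  <q[i]> := workers[i].produceVal();
end-foreach
foreach i in workers
  workers[i].consumeVal(q[i-1]);
end-foreach

Worker::produceVal(outport q) {
  local_q = ...;            // computed from local data only
  produce(q, local_q);
}
Worker::consumeVal(double v) {
  ...                       // use v; no explicit receive call is written
}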
3.4 Organizing Parallel Control Flows
The control transfer in a Charisma program is clearly expressed in the orchestration code. After the initial statements in the control chain, which typically do not consume any values, the control flow progresses in a data-driven fashion. If a statement consumes some values, then as soon as those values are available the statement can be executed, without any barrier across the object array or global synchronization. Charisma thus extends the message-driven model of Charm++, taking advantage of its high efficiency while offering clear expression of the control flow and programming productivity.

In the producer-consumer model, Charisma respects the program order in connecting producing and consuming ports. In other words, a consuming statement will look for the value produced by the latest producing statement in program order. In a legal orchestration program, each consuming statement and tagged input value has its corresponding unique producing statement. Of course, a single produced value may be consumed by multiple consuming statements. If a producing statement does not have a later consuming statement, the produced value has no effect on the program behavior.

Beyond the program-order restriction of the data flow, Charisma is consistent with Charm++'s asynchronous invocation model, in which explicit barriers or other synchronization operations are not supported. If the programmer does need to enforce a barrier, a dummy reduction can be used (see Section 3.5). This also means there is no further implicit barrier between foreach statements. For instance, during any iteration of the following code, workers[2].bar does not have to wait until workers[2].foo has completed. As soon as p[1] is published by workers[1].foo, even if workers[2].foo has not started yet, workers[2].bar can start executing before workers[2].foo.

for iter = 1 to MAX_ITER
  foreach i in workers
    <p[i]> := workers[i].foo();
  end-foreach
  foreach i in workers
    workers[i].bar(p[i-1]);
  end-foreach
end-for

Loops are supported with the for statement and the while statement. The first consuming statement will look for values produced by the last producing statement before the loop for the first iteration, and by the last producing statement within the loop body for the following iterations. After the last iteration, the last produced values are disseminated to the code segment following the loop body. Take the code segment in Figure 4 as an example: the coords produced in the first foreach statement are consumed by the first consuming statement in the for-loop. Thereafter, each iteration produces fresh coords from the integrate function at the end, to be consumed in the next iteration. The produced parameter coords is still available after the for-loop, although it is not used in that example.
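As a sketch of the dummy-reduction idiom mentioned above (the entry methods phaseOne and phaseTwo and the parameter done are hypothetical), a statement that consumes a reduced value cannot fire until every array element has published its contribution, which gives the effect of a barrier between the two phases:

foreach i in workers
  <+done> := workers[i].phaseOne();
end-foreach
Main.phaseTwo(done);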
3.5 Describing Communication Patterns
A method invocation statement in the orchestration code specifies its consumed and published values. These acts of consuming and publishing are viewed as input and output ports, and the Charisma run-time connects these ports by automatically generating efficient messages between them. Using the language and the extensions described below, the programmer is able to express various communication patterns.

• Point-to-point communication
We first introduce the mechanism that allows point-to-point communication among objects via the producer-consumer model. For example, in the code segment below, p[i] is communicated via a message in an asynchronous method invocation between elements of the object arrays A and B.

foreach i in A
  <p[i]> := A[i].f(...);
end-foreach
foreach i in B
  <...> := B[i].g(p[i]);
end-foreach

From this code segment, a point-to-point message will be generated from A[i]'s publishing port to B[i]'s consuming port. When A[i] calls the local function produce(), the message is created and sent to the destination B[i]. By this mechanism, we avoid using any global data and reduce potential synchronization overhead. For example, in the code segment above, B[2].g() does not have to wait until all A[i].f() have completed to start its execution; as soon as A[2].f() is done and the value p[2] is filled, B[2].g() can be invoked. In fact, even before A[i].f() completes, p[i] can be sent as soon as it is produced, using a callback in the implementation.

• Reduction
In Charisma, the publishing statement uses a "+" to mark a reduced parameter whose value is to be obtained by a reduction operation across the object array. Following is an example of a reduction of the value err over a 2-D object array.

foreach i,j in workers
  <..., +err> := workers[i,j].bar(..);
end-foreach
...
Main.testError(err);

In the sequential code for WorkerClass::bar, the programmer calls a local function reduce to publish its local value local_err and specifies the reduction operation ">" (for MAX). Similar to the produce call, an outport keyword indicates for which output port parameter this reduce call is publishing data. The call is almost identical to the produce primitive, only with an extra parameter specifying the reduction operation.

WorkerClass::bar(..., outport err) {
  local_err = ...;
  ...
  reduce(err, local_err, ">");
}

The dimensionality of the reduced output parameter must be a subset of that of the array publishing it. Thus reducing from a 2-D object array onto a 1-D parameter value is allowed, and the dimension(s) over which the reduction is performed are inferred by comparing the dimensions of the object array and of the reduced parameter.

• Multicast
A value produced by a single statement may be consumed by multiple object array elements. For example, in the following code segment, A[i] is a 1-D object array, B[j,k] is a 2-D object array, and points is a 1-D parameter variable. Suppose they all have the same dimensional size N.

foreach i in A
  <points[i]> := A[i].f(...);
end-foreach
foreach k,j in B
  <...> := B[k,j].g(points[k]);
end-foreach

There will be N messages to send each published value to its consuming places. For example, points[1] will be multicast to the N elements B[1,0..N-1].

• Scatter, Gather and Permutation Operation
A collection of values produced by one object may be split and consumed by multiple object array elements in a scatter operation. Conversely, a collection of values from different objects can be gathered to be consumed by one object. Combining the two, we have the permutation operation.

/* Scatter Example */
foreach i in A
  <points[i,*]> := A[i].f(...);
end-foreach
foreach k,j in B
  <...> := B[k,j].g(points[k,j]);
end-foreach

A wildcard dimension "*" in A[i].f()'s output points specifies that it will publish multiple data items.
At the consuming side, each B[k,j] consumes only one point of the data, and therefore a scatter communication will be generated from A to B. For instance, A[1] will publish the data points[1,0..N-1], to be consumed by the multiple array objects B[1,0..N-1].

Similar to the scatter example, if a wildcard dimension "*" appears in the consumed parameter and the corresponding published parameter does not have a wildcard dimension, a gather operation is generated from the publishing statement to the consuming statement. In the following code segment, each A[i,j] publishes a data point, and the data points from A[0..N-1,k] are combined to form the data consumed by B[k].

/* Gather Example */
foreach i,j in A
  <points[i,j]> := A[i,j].f(...);
end-foreach
foreach k in B
  <...> := B[k].g(points[*,k]);
end-foreach

Combining the scatter and gather operations, we get the permutation operation. Please refer to Section 6.2 for a code example.

4. CODE EXAMPLE: MD
In this section, we show, with a concrete example, how Charisma overcomes some of Charm++'s difficulty in describing the global view of control. The example is a simplified version of the NAMD simulation explained in Section 2, with only the pairwise force calculation included. Cells are the objects that hold the coordinates of the atoms in patches, and cellpairs are the objects calculating the pairwise forces between two cells. In the following comparison, definitions of sequential functions such as Cell::integrate and CellPair::calcForces are not listed, since they access only local data and are the same for both versions.

With Charisma, the MD code is listed in Figure 4.

foreach i,j,k in cells
  <coords[i,j,k]> := cells[i,j,k].produceCoords();
end-foreach
for iter := 1 to MAX_ITER
  foreach i1,j1,k1,i2,j2,k2 in cellpairs
    <+forces[i1,j1,k1], +forces[i2,j2,k2]> :=
      cellpairs[i1,j1,k1,i2,j2,k2].calcForces(coords[i1,j1,k1], coords[i2,j2,k2]);
  end-foreach
  foreach i,j,k in cells
    <coords[i,j,k], +energy> := cells[i,j,k].integrate(forces[i,j,k]);
  end-foreach
  MDMain.updateEnergy(energy);
end-for

Figure 4: MD with Charisma: Clear Expression of Global View of Control

First, elements in the object array cells produce their coordinates, providing the initial data for the first iteration. During each iteration, cellpairs calculate forces by consuming the coordinates provided by two cells elements. In the same statement, cellpairs produce forces, combined via a reduction within a cell's neighborhood. These values get consumed in the integration phase. The integration also produces the coordinates for the next iteration, and the total energy via a reduction operation across all cells. In the Charisma code, each orchestration statement specifies which pieces of data it consumes and produces, without having to know the source and destination of those data items.

Figure 5 lists the corresponding Charm++ pseudocode for the same program.

MainChare::MainChare() {
  cells.sendCoords();
}
MainChare::reduceEnergy(energy) {
  totalEnergy += energy;
  if (iter++ < MAX_ITER)
    cells.sendCoords();
  else
    CkExit();
}

Cell::sendCoords() {
  for index in 26 neighbor cellpairs
    cellpairs(index).recvCoords(coords);
}
Cell::recvForces(forces) {
  totalforces += forces;
  if not all forces from all cellpairs received
    return;
  else {                       // neighborhood reduction completed
    integrate();
    mainProxy.reduceEnergy(energy);
  }
}

CellPair::recvCoords(coords) {
  if not coords from both cells received {
    buffer(coords);
    return;
  } else {                     // all coords ready
    force = calcForces();
    for index in 2 cells
      cells(index).recvForces(forces);
  }
}

Figure 5: MD with Charm++: Overall Control Flow Buried in the Objects' Code

The three boxes of the figure hold method definitions for the three classes MainChare, Cell, and CellPair, which are typically separated into different C files. To organize the global control flow, one has to dig into the files and hop among them (represented by the arrows in the figure). Thus, the flow is fragmented and buried in the object code.
Following the control flow in such a parallel program is more complicated than in sequential object-oriented code, due to the complexity of the parallel operations among the objects. For instance, collecting force data among a cell's neighboring cellpairs through a neighborhood reduction requires nontrivial code (not shown in the pseudocode here), and this kind of code is generated automatically in the Charisma version.

We are not listing the corresponding MPI code here, because it would be much more complicated than the Charm++ version. In addition to handling the collective operations, the MPI programmer has to write code for explicitly managing the various sets of subtasks, maintaining the mapping between the subtasks' identities and their physical locations (processor numbers), and auxiliary code such as load balancing. When the programmer wants to achieve a higher degree of overlap between computation and communication, more code is needed to handle the wildcard source and tag matching discussed in Section 2.

5. IMPLEMENTATION
Charisma generates Charm++ code which can be compiled and run on the adaptive run-time system. The code generation starts with parsing the orchestration code in the .or file. Once the input and output parameters are identified for each orchestration statement, static dependence analysis is performed to find the connections between these input and output parameters. By analyzing the indices of the parameters and of the object arrays, a global graph of the control flow is created. Next, Charisma generates the appropriate method invocations and messages from this graph of parallel control flow, since program progress in Charm++ is driven by asynchronous remote invocation with messages. During this process, parallel code expressing a variety of communication patterns, including broadcast, multicast and reduction, is produced. After the parallel flow is set up, the user's sequential code is integrated into the final output Charm++ program.

An important goal in the implementation of Charisma is to ensure the high efficiency of the generated code. One technique is immediate outgoing messages. As soon as the data for an outgoing message becomes available (indicated to the ARTS by a publish statement), the message is assembled and sent out, without waiting for the enclosing function to complete. This mechanism allows a larger degree of adaptive overlap between communication and computation. Another optimization improves memory efficiency. When multiple messages are needed to drive the next link in the flow, such as in a gather operation, the naive solution of buffering already-received messages in user-allocated memory incurs the overhead of a memory copy. Charisma eliminates this unnecessary memory copy by postponing the deletion of the received Charm++ messages until all the messages have been received.

Charisma also offers the user great flexibility to customize the parallel program. The current implementation supports the creation of a sparse object array and its collective operations. The user can supply sequential functions to provide hints to Charisma on issues such as which elements to create in the sparse object array.

6. EXPERIMENTS AND RESULTS
In this section, we show the results of a few benchmarks with Charisma. We compare the productivity, in terms of source lines of code (SLOC), as well as the performance and scalability. The benchmarking platforms are PSC's Cray XT3 MPP system, with 2068 dual 2.6 GHz AMD Opteron compute nodes linked by a custom-designed interconnect, and NCSA's Tungsten Cluster, with 1280 dual 3.2 GHz Intel Xeon nodes and a Myrinet network.

6.1 Stencil Calculation
Our first benchmark is a 5-point stencil calculation. This is a multiple-timestep calculation involving a group of regions in a 2-D decomposition of a 2-D mesh. At each timestep, every region exchanges its boundary data with its immediate neighbors in 4 directions and performs local computation based on the neighbors' data. This is a simplified model of many applications, including fluid dynamics and heat dispersion simulations, and therefore it serves the purpose of demonstration.

Figure 6 compares the performance of the stencil calculation benchmark written in Charisma vs. Charm++. The total problem size is 16384², decomposed onto 4096 objects. The performance overhead introduced by Charisma is 2-6%, scaling up to 1024 processors. Because this benchmark is relatively simple, the parallel code forms a significant part of the Charm++ program; therefore we see a 45% reduction in SLOC with Charisma.

Figure 6: Performance of Stencil Calculation (execution time per step vs. number of processors, on the PSC Cray XT3 system and the NCSA Tungsten Cluster; Charm++ SLOC = 255, Charisma SLOC = 140)
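The orchestration code for this benchmark is not listed in the paper; a rough sketch in the notation of Section 3 conveys its shape (the array regions, its entry methods, and the border parameters lb, rb, tb, bb are illustrative, and boundary handling is ignored). Each region publishes its four borders and consumes the facing borders of its four neighbors via the i±1 index expressions.

foreach i,j in regions
  <lb[i,j], rb[i,j], tb[i,j], bb[i,j]> := regions[i,j].produceBorders();
end-foreach
for iter := 1 to MAX_ITER
  foreach i,j in regions
    <lb[i,j], rb[i,j], tb[i,j], bb[i,j]> :=
      regions[i,j].compute(rb[i-1,j], lb[i+1,j], bb[i,j-1], tb[i,j+1]);
  end-foreach
end-for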
6.2 3D FFT
FFT is frequently used in engineering and scientific computation. Since highly optimized sequential algorithms are available for 1-D FFTs, a multi-dimensional FFT comprising multiple 1-D FFTs along each dimension can be parallelized with a transpose-based approach [17]. Following is the main body of the orchestration code for the transpose-based 3D FFT algorithm. From this code segment, Charisma generates the transpose operation between the two sets of planes holding the data; messages are created and delivered accordingly.

foreach x in planes1
  <pencildata[x,*]> := planes1[x].fft1d();
end-foreach
foreach y in planes2
  planes2[y].fft2d(pencildata[*,y]);
end-foreach

Figure 7: Transpose-based 3D FFT (panels: a) 1D FFT in the Y direction; b) transpose with pencils; c) 2D FFT in the XZ plane)

Figure 8 compares the performance of runs with a problem size of 512³ on 256 objects, scaling up to 128 processors. From the results, we can see that Charisma in this benchmark incurs up to 5% performance overhead, which can be attributed to an additional buffer copy for parameter variables.

Figure 8: Performance of 3D FFT (execution time per step vs. number of processors, on the NCSA Tungsten Cluster; Charm++ SLOC = 199, Charisma SLOC = 126)

The reduction in SLOC is only 37%. In this specific benchmark, the sequential code dealing with the local FFT computation constitutes a bigger portion of the program, and therefore the reduction in SLOC is not as significant as in simpler programs. This percentage of SLOC reduction is expected to be even smaller for larger and more complex programs. It must be noted, however, that SLOC alone does not make a good metric of productivity, as it does not reflect the actual programming effort. In fact, in more complicated applications, expressing the parallel flow of control is far more difficult than in simpler cases, and tools such as Charisma can better help programmers code with less effort.
7. CONCLUSION
We described Charisma, a higher-level notation that allows the expression of a global view of control in parallel programs, while still allowing decomposition into multiple collections of dynamically mapped objects as in Charm++. This approach cleanly separates parallel and sequential code, strongly encourages locality-aware programming, allows expression of the global flow of control in one place, and still reaps the benefits of the run-time optimizations for migratable objects.

The language proposed here does not cover the expression of all application patterns, especially the highly asynchronous patterns supported by Charm++. Indeed, it is not even intended to be a complete language. Instead it will be used in conjunction with other paradigms where needed or appropriate. Currently, the orchestration language coexists with Charm++ modules and mechanisms, ensuring completeness and high interoperability. Our implementation of MPI, Adaptive MPI (AMPI) [18], also interoperates with Charisma. Beyond these languages, the ability to support modules written in Charisma is crucial for productivity via code reuse. We are designing language features to this end, so that we can provide user-level Charisma libraries such as a parallel 3D FFT.

Charisma supports a global view of control but a local view of data, since only an object's local variables are accessible in the sequential methods. In contrast, the Multi-phase Shared Array (MSA) [19] supports a global view of data. Integrating the two notations is an interesting future direction to explore.

Last but not least, SLOC is not necessarily a perfect metric for measuring productivity. We plan to conduct classroom experiments among parallel programming students to obtain a more realistic evaluation of Charisma's productivity advantage.

8. REFERENCES
[1] V. K. Naik, Sanjeev K. Setia, and Mark S. Squillante. Processor allocation in multiprogrammed distributed-memory parallel computer systems. Journal of Parallel and Distributed Computing, 1997.
[2] Derek L. Eager, Edward D. Lazowska, and John Zahorjan. The limited performance benefits of migrating active processes for load sharing. In Conf. on Measurement & Modelling of Comp. Syst. (ACM SIGMETRICS), pages 63-72, May 1988.
[3] K. Barker and Nikos Chrisochoides. An evaluation of a framework for the dynamic load balancing of highly adaptive and irregular applications. In Proceedings of the IEEE/ACM Supercomputing Conference (SC'03), November 2003.
[4] H. Jiang and V. Chaudhary. Process/thread migration and checkpointing in heterogeneous distributed systems. In Proceedings of the 37th Hawaii International Conference on System Sciences (HICSS-37). IEEE Computer Society, 2004.
[5] Laxmikant V. Kalé, Sameer Kumar, Gengbin Zheng, and Chee Wai Lee. Scaling molecular dynamics to 3000 processors with projections: A performance analysis case study. In Terascale Performance Analysis Workshop, International Conference on Computational Science (ICCS), Melbourne, Australia, June 2003.
[6] Ramkumar V. Vadali, Yan Shi, Sameer Kumar, L. V. Kale, Mark E. Tuckerman, and Glenn J. Martyna. Scalable fine-grained parallelization of plane-wave-based ab initio molecular dynamics for large supercomputers. Journal of Computational Chemistry, 25(16):2006-2022, Oct. 2004.
[7] L. V. Kale and Attila Gursoy. Modularity, reuse and efficiency with message-driven libraries. In Proc. 27th Conference on Parallel Processing for Scientific Computing, pages 738-743, February 1995.
[8] Leonardo Dagum and Ramesh Menon. OpenMP: An industry-standard API for shared-memory programming. IEEE Computational Science & Engineering, 5(1), January-March 1998.
[9] Message Passing Interface Forum. MPI: A message passing interface. In Proceedings of Supercomputing '93, pages 878-883. IEEE Computer Society Press, 1993.
[10] James C. Phillips, Gengbin Zheng, Sameer Kumar, and Laxmikant V. Kalé. NAMD: Biomolecular simulation on thousands of processors. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pages 1-18, Baltimore, MD, September 2002.
[11] T. A. Darden, D. M. York, and L. G. Pedersen. Particle mesh Ewald: An N·log(N) method for Ewald sums in large systems. JCP, 98:10089-10092, 1993.
[12] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active Messages: A mechanism for integrated communication and computation. In Proceedings of the 19th International Symposium on Computer Architecture, Gold Coast, Australia, May 1992.
[13] Jean-Luc Gaudiot and Liang-Teh Lee. Multiprocessor systems programming in a high-level data-flow language. In Proceedings of Parallel Architectures and Languages Europe (PARLE), Volume I: Parallel Architectures, pages 134-151, London, UK, 1987. Springer-Verlag.
[14] G. R. Gao. An efficient hybrid dataflow architecture model. Journal of Parallel and Distributed Computing, 19(4):293-307, 1993.
[15] C. H. Koelbel, D. B. Loveman, R. S. Schreiber, G. L. Steele Jr., and M. E. Zosel. The High Performance Fortran Handbook. MIT Press, 1994.
[16] I. Foster and K. M. Chandy. Fortran M: A language for modular parallel programming. Journal of Parallel and Distributed Computing, 25(1), 1995.
[17] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA, 1994.
[18] Chao Huang, Gengbin Zheng, Sameer Kumar, and Laxmikant V. Kalé. Performance evaluation of Adaptive MPI. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2006, March 2006.
[19] Jayant DeSouza and Laxmikant V. Kalé. MSA: Multiphase specifically shared arrays. In Proceedings of the 17th International Workshop on Languages and Compilers for Parallel Computing, West Lafayette, Indiana, USA, September 2004.