Feedback-Directed Thread Scheduling with Memory Considerations

Fengguang Song, University of Tennessee, Computer Science Department, Knoxville, TN (song@cs.utk.edu)
Shirley Moore, University of Tennessee, Computer Science Department, Knoxville, TN (shirley@cs.utk.edu)
Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory, Knoxville, TN (dongarra@cs.utk.edu)

ABSTRACT
This paper describes a novel approach to generating an optimized schedule for running threads on distributed shared memory (DSM) systems. The approach relies upon a binary instrumentation tool to automatically acquire the memory sharing relationship between user-level threads by analyzing their memory trace. We introduce the concept of the affinity graph to model this relationship. Expensive I/O for large trace files is completely eliminated by using an online graph creation scheme. We apply the techniques of hierarchical graph partitioning and thread reordering to the affinity graph to determine an optimal thread schedule. We have performed experiments on an SGI Altix system. The experimental results show that our approach is able to reduce the total execution time by 10% to 38% for a variety of applications through maximization of the data reuse within a single processor, minimization of the data sharing between processors, and a good load balance.

Categories and Subject Descriptors: D.3.4 [Software]: Processors - run-time environments, optimization

General Terms: Performance, Experimentation

Keywords: Distributed shared memory, shared-memory programming, affinity graph, scientific applications

1. INTRODUCTION
High performance computing platforms that support a shared-memory paradigm attain the benefits of large-scale parallel computing without surrendering much programmability [7]. On distributed shared memory (DSM) systems, a program can be written as if it were running on a symmetric multiprocessor (SMP) machine. A typical subclass of DSM systems builds on the cache-coherent non-uniform memory architecture (ccNUMA). A contemporary ccNUMA system such as the SGI Altix consists of a large number of nodes, each of which has a couple of processors and a fixed amount of memory. With the emergence of chip multiprocessors (CMP), a compute node could have a number of multi-core chips, each of which has many cores. Figure 1 depicts the hierarchy of such a system. Our previous research has shown that it is non-trivial to schedule threads on a single multi-core chip with a shared L2 cache due to resource contention and the nature of memory sharing between the threads [17]. In this paper, we focus on thread scheduling for highly scalable shared memory machines that have a multi-level memory hierarchy. The placement of threads onto the memory hierarchy often has an impact on program performance if the threads have overlapping memory footprints. Compute-intensive scientific applications usually consist of a set of threads that conduct identical computation on either overlapping or disjoint subsets of the global data.
When two threads access the same data, the location where they are launched greatly affects program performance. For instance, the worst situation would be to place them on different nodes, which induces a great number of remote memory accesses, and the best would be to place them on the same multi-core chip, because the shared L2 cache can potentially eliminate the redundant loading of the same data [9, 17]. An intermediate choice is either to place them on the same node but on different chips, or to place them on the same module but on different chips (Power4 and Power5 machines have a module level [16]). Our goal for thread scheduling is to improve the effectiveness of the memory hierarchy by identifying user-level threads and reorganizing them to enhance the program's temporal and spatial locality, as well as by placing tightly-coupled threads as close together as possible. A user-level thread may be considered a "logical task" whose granularity varies from as tiny as a single instruction to as big as an actual kernel-level thread. With the help of data dependence analysis, we could even reorder individual instructions to maximize data reuse across the entire data set. However, for simplicity and quick analysis, we identify fine-grained user-level threads as conceptual units of computation from the viewpoint of a programmer. Most of the time it is straightforward to identify the threads. For instance, an iteration of a loop nest or an update of an object may be created as a thread.

Figure 1: Memory hierarchy on a ccNUMA distributed shared memory system.

In order to improve the data locality, we must decide which user-level threads will be in the same group and in what order to execute them, as well as how to map the groups to different processors. Figure 2 illustrates the overall structure of our approach to realizing this. The approach relies upon a binary instrumentation tool to (i) obtain and analyze the memory trace of each thread and capture the nature of memory sharing between threads in an "affinity graph". An affinity graph is an undirected weighted graph where each vertex represents a thread and the weight of edge e_ij denotes the total number of distinct addresses accessed in common by threads i and j. To make the memory tracing method more practical, the instrumentation tool generates affinity graphs dynamically without storing the huge memory trace to disk. After the instrumented executable finishes, an affinity graph has been built and is written to a file. Next, (ii) we partition the graph into a number of subgraphs (in our experiments, the number is equal to the number of processors). Based on the partitions, (iii) we compute a "good" schedule that assigns threads to different processors accordingly. The schedule is written to a file which is later used as feedback to future executions. Finally, (iv) the user reruns the program taking the feedback file as input.
We have experimented with the feedback-directed thread scheduling method using several application programs, including sparse matrix-vector multiplication (stored in compressed row storage or compressed column storage format), sparse matrix-matrix multiplication, parallel radix sorting, and a kernel from computational fluid dynamics codes. Our results show that the feedback-directed method can significantly reduce the execution time.
In particular, it can successfully improve the performance of programs with dynamic memory access patterns and little compile-time information. Our work makes the following contributions:
· We present a feedback-directed approach to thread scheduling for general-purpose programs. Since it is an offline method, compute-intensive optimizations are allowed to determine an optimal schedule. Also, the overhead of executing (or following) the predetermined schedule is less than that of dynamic scheduling methods.
· By identifying independent user-level threads, we are able to parallelize the original program while maximizing the program locality.
· We develop techniques for instrumenting the executable and analyzing the memory trace without triggering any disk I/O. These techniques are more practical than traditional trace-based methods.
· We propose the concept of the affinity graph to represent the nature of memory sharing between threads. We have applied it successfully to various applications.
· Our graph partitioning technique has a hierarchical structure that corresponds to the actual architecture of a real machine. For instance, we can first partition threads to a set of nodes, then to a set of chips, and finally to a set of processor cores (we assume one kernel thread per processor core).

Figure 2: Overall structure of the thread scheduling method with memory considerations.

2. FEEDBACK-DIRECTED METHOD
A feedback-directed method strives to improve performance by using profiling information to exploit opportunities for optimization. Our work is concerned with how to maximize data locality on each processor and minimize the data sharing between processors (each processor runs one kernel thread). Since we determine a thread schedule based on the referenced addresses, collecting a memory trace is typically necessary. There are four types of approaches to obtaining the memory trace: compiler-based, run-time system, online feedback-directed optimization (FDO), and offline feedback-directed optimization. We choose the offline FDO method for the following reasons:
· The feedback-directed method can attain dynamic information about memory references, particularly for programs with irregular access patterns. This dynamic information cannot be obtained at compile time.
· The offline method has much less overhead than online methods since the optimal schedule is determined during an initial profiling run.
· Our method collects a more accurate and detailed memory trace for each thread to determine an optimal schedule. Current run-time systems usually use a couple of heuristics to schedule threads.
We now present the overall structure of our feedback-directed method. We first need to collect the memory trace of applications by using a memory trace analysis tool. The analysis tool supports binary instrumentation and is built upon Pin [8].
2.1 Memory Trace Analysis
Pin follows the model of ATOM and allows tool writers to analyze applications at the instruction level [8]. It uses a dynamic just-in-time (JIT) compiler to instrument binary code while it is running. The Pin APIs provide support for observing a process's architectural state (e.g., register contents and memory references). We wrote a memory trace analysis tool in C++ and used the Pin API to implement two types of routines: instrumentation routines and analysis routines. The instrumentation routine tells Pin to insert instrumentation into every instruction that reads or writes data. The virtual address of the referenced datum is passed as an argument to the analysis routine. In order to differentiate addresses from distinct threads, the thread ID is passed as another argument to the analysis routine. Instrumenting the binary executable also enables us to keep the original memory access pattern while reflecting various compiler optimizations.
Two analysis routines, RecordMemRead and RecordMemWrite, are implemented for read and write operations, respectively. The analysis routine works as an event handler. For each memory reference, the routine identifies the thread ID for that reference and stores the address into a buffer. Each thread owns a buffer for keeping distinct addresses (i.e., two references to the same address are stored once). We tried writing the memory trace to a file, but the size of the trace file and the I/O cost grew so fast that it soon became impractical. The space requirement of our memory trace analysis tool is equal to the sum of the numbers of distinct addresses referenced by each thread. It is possible to impose a limit on the number of distinct addresses for each thread so that all the tracing operations can be performed in memory (i.e., without disk). This diskless tracing method is much less expensive and more practical than writing the memory trace to disk.

2.2 Techniques to Process Large Graphs
While a modern computer system has no problem keeping the distinct addresses in memory for most applications, the size of the generated affinity graph can explode very quickly. For instance, 10^5 fine-grained threads (10^5 vertices) may require 40 GB of memory if the graph is totally connected (10^10 edges). The size of the graph is limited by the capacity of the memory. Given a fixed amount of memory, it is not trivial to build an appropriately sized affinity graph without exceeding the available memory. We adopt several techniques to improve the time complexity and space complexity of the graph creation process.
It rarely happens in practice that every thread has a memory sharing relationship with all other threads, if we leave aside the very small number of global variables. Thus affinity graphs are often sparse and symmetric. We represent affinity graphs by an adjacency list and only store edges e_ij where i < j to reduce the memory requirement. This is analogous to storing an upper-triangular adjacency matrix.
A simple algorithm for building a graph from the recorded memory trace is shown in Figure 3. The memory trace analysis tool has instrumented the executable and stored the addresses in thrd_addrs for each thread. The algorithm compares the memory traces of every two threads i and j where i < j; create_edge() creates an edge between thread i and thread j and assigns a proper weight to it.

map thrd_addrs;
for i = 1, num_thrds-1
  for j = i+1, num_thrds
    // Compare traces of threads i, j.
    create_edge(thrd_addrs, i, j);
  end for
end for
Figure 3: A simple algorithm to build affinity graphs.

Suppose there are T threads and each thread accesses N addresses. Even though the function create_edge() has a linear time complexity O(N) (by merging two ordered lists), the overall time complexity is O(T^2 N). In practice N is often bounded but T could be very large (e.g., 10^6 to 10^8). Furthermore, no matter how sparse a graph is, this algorithm always takes O(T^2 N) time to build the graph, which could lead to many hours of computation.
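As an illustration only (the tool's actual create_edge() is not shown in the paper; the names and container types below are our assumptions), the O(N) merge that turns two sorted per-thread address lists into an edge weight can be sketched in C++ as follows:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical per-thread trace: the distinct addresses a thread touched, sorted.
using AddrList = std::vector<std::uintptr_t>;

// Weight of edge e_ij = number of distinct addresses accessed by both threads.
// A single linear pass over the two sorted lists gives the O(N) cost cited above.
std::size_t edge_weight(const AddrList& trace_i, const AddrList& trace_j) {
    std::vector<std::uintptr_t> common;
    std::set_intersection(trace_i.begin(), trace_i.end(),
                          trace_j.begin(), trace_j.end(),
                          std::back_inserter(common));
    return common.size();
}

An edge would be created only when the returned weight is non-zero; repeating this computation for every pair of threads is what yields the O(T^2 N) total cost of the simple algorithm.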
To be more efficient in processing large graphs, we change to a different data structure and develop a new algorithm. The new algorithm employs an adjustable parameter DenseRatio ∈ [0, 1] to control the density of the graph. If an address is accessed by all threads, the graph is fully connected; however, such an address will not help the graph partitioning in the next step. Therefore we adjust DenseRatio to eliminate those edges that would form a clique of size greater than NumberThreads × DenseRatio. DenseRatio = 0.0 yields a graph with no edges and DenseRatio = 1.0 yields a graph with all possible edges.
Figure 4 lists the pseudo-code for the more efficient algorithm. The new data structure addr_thrds stores addresses as keys instead of the prior thread IDs. The algorithm takes as input the memory trace stored in addr_thrds and builds an affinity graph. For each address, we determine the number of threads that have accessed it. If that number is greater than DenseRatio times the total number of threads, we skip the analysis for this address and the creation of the corresponding edges. Otherwise, we create edges among the set of threads.

T = total number of threads;
map addr_thrds;
for each addr in addr_thrds do
  thrd_set = threads accessing addr;
  m = size of thrd_set;
  if (m > DenseRatio*T) continue;
  create edges between any pair of threads within thrd_set;
end for
Figure 4: A more efficient algorithm to build affinity graphs.

Again, let N be the total number of addresses referenced by the application and T be the total number of threads. The time complexity of the algorithm varies between O(N) and O(N T^2) depending on how sparse the graph is. The worst-case time complexity occurs when the graph is fully connected and DenseRatio = 1.0. Certainly we can use DenseRatio to reduce the graph density. Note that the previous algorithm in Figure 3 always has a time complexity of O(N T^2). The new algorithm not only has a better average-case time complexity but also provides the adjustable parameter DenseRatio to eliminate particular edges. In our experiments, we use DenseRatio = 0.9 to reduce edges and are able to create sparse graphs. In addition, most of the eliminated edges are due to a couple of global variables. Note that a single global variable can lead to a fully connected graph.
The generated affinity graphs can be written either in Graphviz DOT format for display, or in Chaco/METIS format for the next stage of graph partitioning.

Figure 5: Matrix multiplication using four threads and the corresponding affinity graph. Each thread of T0-T3 computes one block of matrix C. Matrix B is a sparse matrix.

3. AFFINITY GRAPH MODEL
We use a weighted undirected graph G = (V, E), called an "affinity graph", to model the nature of memory sharing between threads. In the graph, each vertex v ∈ V denotes a thread and each edge e_ij ∈ E indicates that there exists a sharing relationship between threads v_i and v_j. The weight w_ij associated with edge e_ij denotes that thread i and thread j access w_ij virtual addresses in common. Note that w_ij does not measure the frequency of references to those addresses. We assume that the greater the weight w_ij, the higher the probability of improved data locality if we place threads i and j on the same processor.
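As a concrete illustration of this model (the container names and types are ours, not the tool's), an affinity graph can be held as the upper-triangular adjacency map of Section 2.2 and populated from the address-keyed trace in the spirit of Figure 4:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// addr_thrds: for each distinct address, the IDs of the threads that touched it.
using AddrThreads = std::map<std::uintptr_t, std::vector<int>>;
// Affinity graph: only edges (i, j) with i < j are stored, each with weight w_ij.
using AffinityGraph = std::map<std::pair<int, int>, long>;

AffinityGraph build_graph(const AddrThreads& addr_thrds,
                          int num_threads, double dense_ratio) {
    AffinityGraph g;
    for (const auto& entry : addr_thrds) {
        const std::vector<int>& thrd_set = entry.second;
        // Skip addresses shared too widely (e.g., global variables): they would
        // only add near-clique edges that do not help the partitioner.
        if (thrd_set.size() > dense_ratio * num_threads) continue;
        for (std::size_t a = 0; a < thrd_set.size(); ++a)
            for (std::size_t b = a + 1; b < thrd_set.size(); ++b) {
                int i = std::min(thrd_set[a], thrd_set[b]);
                int j = std::max(thrd_set[a], thrd_set[b]);
                ++g[{i, j}];   // one more address shared by threads i and j
            }
    }
    return g;
}

Raising dense_ratio toward 1.0 admits the near-clique edges produced by widely shared data, mirroring the trade-off discussed in Section 2.2.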
We also assume all threads are independent. There are two types of threads: user-level threads and kernel threads. A user-level thread may be as small as a single instruction or as large as a parallel task. When the thread is a single instruction, users can derive an optimal schedule satisfying the data dependence constraints to minimize the cache miss rate. For medium to large applications, we regard a certain number of loop iterations as a user-level thread. On the other hand, if a vertex denotes a kernel thread, we can use the corresponding graph to determine how to map the kernel threads onto different processors.
Figure 5 shows an example of multiplying 200 × 200 matrices. A and C are dense matrices. Matrix B has a special structure where the top-right and bottom-left blocks are all zeros. We use four threads T0-T3 to compute the matrix multiplication. Correspondingly, T0-T3 compute the results for sub-matrices C11, C12, C21 and C22 in parallel. After instrumentation and execution of the program, an affinity graph is created as shown on the right side of Figure 5. This example demonstrates what an affinity graph is and its ability to reveal the memory sharing relationship between threads.
We conducted a few experiments on the SGI Altix 3700 machine, which is composed of dual-processor nodes. From the affinity graph generated for the four threads shown in Figure 5, we conclude that threads 0 and 2 should run on the same node while threads 1 and 3 should run on another node. Otherwise the program will incur a lot of remote memory accesses. We compare the performance of putting threads 0 and 2 together to that of putting threads 0 and 1 together (an intuitive way). Figure 6 shows the wallclock execution time of these two placements. The optimized placement is better than the original one by 20%.

Figure 6: Comparison of two different thread schedules. The revealed affinity relationship helps improve the multi-threaded program performance.

3.1 Affinity Graph Partitioning
For most programs, it is not obvious how to map threads onto different processors. In this section we give a formal definition of the affinity graph and describe our approach to using graph partitioning to derive thread groups. Figure 5 shows an example of an affinity graph.

Definition 1. An affinity graph is an undirected weighted graph G = (T, E, w_t, w_e), where
· T = { t_i is a user-level thread | t_i is independent of t_j, i ≠ j },
· E = { (t_i, t_j) | i ≠ j, ∃ an address x such that both t_i and t_j access x },
· w_t : T → Z+,
· w_e : E → Z+, and w_e(t_i, t_j) = 0 if (t_i, t_j) ∉ E.

The purpose of graph partitioning is to identify tightly-coupled threads and divide a graph into subgraphs that are as disjoint as possible. We also hope to group together threads that have large overlapping footprints. The partitioning process is hierarchical since it corresponds to the actual memory hierarchy of a real computer. For instance, on a ccNUMA DSM system, we first divide the threads into a number of groups equal to the number of nodes. Next we divide each group further into subgroups that correspond to the processors in a node.

Definition 2. Given an affinity graph G and a partition P = {T_1, ..., T_n},
Sharing(T_i, T_j) = Σ_{u ∈ T_i, v ∈ T_j} w_e(u, v) if i ≠ j, and Sharing(T_i, T_j) = 0 if i = j.
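Definition 2 translates directly into code. The sketch below is our own, reusing the assumed AffinityGraph type from the earlier sketch; it evaluates Sharing for two thread groups of a candidate partition:

#include <map>
#include <set>
#include <utility>

// Edge (i, j) with i < j mapped to its weight w_e, as in the sketch above.
using AffinityGraph = std::map<std::pair<int, int>, long>;

// Sharing(Ti, Tj) from Definition 2: total edge weight crossing the two groups.
// Passing the same group twice corresponds to i = j and yields 0.
long sharing(const AffinityGraph& g,
             const std::set<int>& ti, const std::set<int>& tj) {
    if (&ti == &tj) return 0;
    long total = 0;
    for (const auto& e : g) {
        int u = e.first.first, v = e.first.second;
        if ((ti.count(u) && tj.count(v)) || (ti.count(v) && tj.count(u)))
            total += e.second;
    }
    return total;
}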
The optimization criterion aims to minimize the sharing between groups. Assume there are n compute nodes, each of which has p processors. (a) We divide the threads in graph G into n parts N_1, N_2, ..., N_n by minimizing
Σ_{1 ≤ i,j ≤ n} Sharing(N_i, N_j).
Now each node N_i has been assigned a set of threads. Since each node also has p processors, (b) we further partition N_i into p parts P_1, P_2, ..., P_p. Similarly, this is achieved by minimizing the sharing between any pair of the p processors on node N_i:
Σ_{1 ≤ i,j ≤ p} Sharing(P_i, P_j).
We use the Chaco software package to partition affinity graphs. Chaco provides several methods for finding small edge separators: inertial, spectral, Kernighan-Lin, and multilevel methods [4]. In our experiments, we use the multilevel Kernighan-Lin method [5] with recursive bipartitioning to determine optimal subgraphs.

4. APPLICATIONS
We performed experiments with a variety of applications to evaluate our feedback-directed method of thread scheduling. These applications cover a range of domains and have been widely used.

4.1 Sparse Matrix-Vector Multiplication
Sparse matrix-vector multiplication (SpMV) is an important subroutine in many iterative methods. We use the two formats of Compressed Column Storage (CCS) and Compressed Row Storage (CRS) to implement iterative methods and try to improve their performance. The program using the CRS format is presented in Figure 7. The inner for loop is distributed to a number of processors and executed in parallel. Arrays val, col, and row store the sparse matrix A, while x and y store the column vectors for computing y = Ax. The code using the CCS format is similar to that in Figure 7 and is not shown here.

for iter = 1, NUM_ITER
  #pragma omp parallel for
  for i = 1, num_rows
    sum = 0;
    for j = row[i], row[i+1]-1
      sum += val[j]*x[col[j]];
    end for
    y[i] = sum;
  end for
end for
Figure 7: Parallel iterative method calling SpMV y = Ax. Sparse matrix A is stored in CRS format in arrays val, col, row.

For both the CRS and CCS formats, we use the technique of affinity graph partitioning to improve the program performance. Suppose matrix A is stored in CCS format; the vector y = Ax is computed as
y = Σ_{j=1}^{n} a_j x_j,
where a_j is the jth column of A. For the parallel version with p threads, each thread computes a partial sum and updates the vector y. But an update of element y_i may invalidate a set of neighboring elements of y in other processors due to false sharing. By grouping the columns that access the same elements of vector y, we are able to improve the temporal locality and reduce the chance of false sharing on vector y.
As for a sparse matrix A stored in CRS format, each thread computes a subset of vector y, where
y_i = Σ_{j=1}^{n} a_ij x_j.
Note that the only chance of data reuse lies in vector x. If two rows read the same set of elements x_s1, x_s2, ..., x_sd of vector x, running them consecutively will reuse those d elements and improve the temporal locality.
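For the CRS program, the footprint of a per-row user-level thread in vector x is exactly the set of column indices of that row, which is what the memory trace ends up recording for it. A small sketch (our own, using the col/row naming of Figure 7) makes this explicit:

#include <set>
#include <vector>

// For CRS storage, the user-level thread for row i reads x[col[j]] for
// j in [row[i], row[i+1]). The set of those column indices is the row's
// footprint in vector x; indices shared by two rows contribute to the
// weight of the edge between the corresponding threads.
std::set<int> row_footprint(const std::vector<int>& row,
                            const std::vector<int>& col, int i) {
    return std::set<int>(col.begin() + row[i], col.begin() + row[i + 1]);
}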
4.2 Sparse Matrix-Matrix Multiplication
To compute sparse matrix-matrix multiplication (SpMM), we store matrix A row by row in CRS format and store matrix B column by column in CCS format. The parallel SpMM program is described in Figure 8. The sparse matrices are distributed by rows across the p processors. Each processor computes N/p consecutive rows of matrix C. The outer two loops i and j in Figure 8 are data independent and can be executed in parallel. A user-level thread may either compute the dot product of the ith row of matrix A and the jth column of matrix B (i.e., the outer two loops are parallelized), or multiply the ith row of A by the whole matrix B (i.e., only the outermost loop i is parallelized). The former definition is more fine-grained and can identify a greater number of threads to reorder to maximize the program locality. But it is too costly to compute since the number of threads, N^2, may result in a very large graph. Instead, we use the latter coarse-grained definition in our experiments. The experimental results demonstrate that we can still speed up the program by 10-20% even with the coarse-grained threads.

struct CRS A;
struct CCS B;
double *C;
#pragma omp parallel for
for i = 1, N
  for j = 1, N
    c = C[i*N+j];
    for idx_a = A.row[i], A.row[i+1]-1
      for idx_b = B.col[j], B.col[j+1]-1
        if (A.col[idx_a] == B.row[idx_b])
          c += A.val[idx_a]*B.val[idx_b];
      end for
    end for
    C[i*N+j] = c;
  end for
end for
Figure 8: Parallel version of SpMM.

4.3 Parallel Radix Sorting
Sorting is also an important kernel for high-performance multiprocessing. Parallel radix sorting has been shown to be a simple and efficient parallel method that outperforms other parallel sorting algorithms [6, 15]. It is one of the important kernels in the NAS Parallel Benchmark and SPLASH-2 benchmark suites. The radix sorting algorithm performs one round of sorting for every r bits of the keys. For a set of 32-bit integer keys and r = 8, four rounds of iterations are needed. Suppose N keys are stored in an array and distributed to p threads t_0, t_1, ..., t_{p-1}, and that thread t_i has to perform a local radix sort on the segment of keys [i·N/p, (i+1)·N/p). During each round, a thread builds a local histogram of the occurrences of its local keys. Next it computes a global histogram by combining the other threads' histograms. Finally, each thread writes its keys back to the corresponding positions in the array. Likewise, the next round of sorting starts for another r bits of the keys.
Regarding the memory access pattern, each thread first reads keys from its assigned segment, sorts them, and then writes keys to remote positions in other segments. Therefore, if we place onto the same node two threads that write many keys to each other, the remote memory accesses become local memory accesses, leading to better performance. In our experiments, we schedule threads based on the first round of r bits. We apply the optimization to the partitioned parallel radix sorting algorithm, which sorts keys in a left-to-right fashion instead of the traditional right-to-left [6]. The algorithm uses the most significant r bits at the beginning. After the first round, the rest of the sorting is always confined locally within each thread. Hence the number of remote memory accesses is greatly reduced during the whole process of radix sorting.

4.4 IRREG
The IRREG kernel is an iterative irregular-mesh partial differential equation (PDE) solver abstracted from computational fluid dynamics (CFD) applications [14]. The irregular meshes are used to model physical structures and consist of nodes and edges. The computational kernel iterates over the edges of the mesh, computing the forces between the two end points of each edge. It then modifies the values of all nodes. Figure 9 shows the parallel version of IRREG. Each edge is a user-level thread in our experiment. We partition the threads (or edges) into p sets (for p processors) to maximize data reuse by grouping together the threads accessing the same nodes. In addition, for each group, we reorder the threads by means of a breadth-first traversal of the group's corresponding subgraph. Running the optimized schedule reduces the execution time by around 35%.

for iter = 1, NUM_ITER
  #pragma omp parallel for
  for i = 1, edges
    n1 = left[i];
    n2 = right[i];
    force = f(x[n1], x[n2]);
    y[n1] += force;
    y[n2] -= force;
  end for
end for
Figure 9: Parallel version of IRREG.

5. EXPERIMENTAL RESULTS
All of the experiments are conducted on an SGI Altix 3700 system with 256 nodes, each of which has two 1.6 GHz Itanium processors. The system has a ccNUMA distributed shared memory (DSM) architecture, and the memory is physically distributed across the nodes. Each processor can access any memory location through the SGI NUMAlink 4 interconnect. Memory access time depends on the distance between the processor and the node where the physical memory is located.
Our approach is generic and works for various types of applications; therefore we compare our programs using the feedback-directed method to the programs built with compiler optimizations.

5.1 Implementation Issues
All the parallel applications are implemented in C using Pthreads. We obtain the optimized schedule automatically through the memory trace analysis tool described in Section 2. Since we can only set thread and memory affinity on the login node of the system, to verify our prototype we run a small number of kernel threads (four) on two nodes with four processors to conduct the experiments. In all of the experiments, the fine-grained user-level threads are totally independent and can be executed in parallel.
In order to execute the program with the new schedule, we need to make a minor change to the original program manually. This step could easily be done by extending a compiler. For instance, instead of running for i = 0 to n, the compiler can wrap the fundamental computation unit with for i = mytasks[0] to mytasks[n], where mytasks is specified in the optimal schedule. After finding the affinity relationship between threads, we use the cpuset library provided by SGI Linux [18] to bind threads to different processors. Note that the binding of threads to processors happens only once, when the threads are created. The cpuset APIs cpuset_pin() and cpuset_membind() allow our C programs to place both processor and physical memory within a cpuset.
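The loop-wrapping change described above can be pictured as follows. This is our own sketch of how a per-processor feedback file might be consumed; the file format, names, and the stand-in compute routine are assumptions rather than the paper's actual interface, and the cpuset binding calls are omitted:

#include <fstream>
#include <vector>

// Stand-in for the application's unit of computation (one user-level thread),
// e.g., one row of the SpMV loop in Figure 7.
static void do_unit(int i) { (void)i; }

// Read the ordered task list assigned to one kernel thread from a hypothetical
// per-processor feedback file containing one task index per line.
static std::vector<int> load_schedule(const char* path) {
    std::vector<int> mytasks;
    std::ifstream in(path);
    for (int t; in >> t; ) mytasks.push_back(t);
    return mytasks;
}

int main() {
    // Instead of "for i = 0 to n", iterate the reordered subset for this processor.
    for (int i : load_schedule("schedule.p0.txt"))
        do_unit(i);
    return 0;
}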
5.2 SpMV
Sparse matrix-vector multiplication is called by a synchronous iterative method. For each sparse matrix, we run a fixed number of iterations. There are two programs written for SpMV: one uses the CCS storage format and the other uses the CRS format. Users often choose one of the two formats to implement their programs.
Table 1 lists the sparse matrices used in our SpMV experiments. They were downloaded from the UF Sparse Matrix Collection [1]. A matrix was selected if the amount of computation for each row is approximately equal.

Table 1: Sparse matrices used in the SpMV experiment.
    Name          Dimension        NNZ
  1 msc01440          1,440     44,998
  2 circuit_1         2,624     35,823
  3 bp_1000             822      4,661
  4 coater1           1,348     19,457
  5 msc23052         23,052  1,142,686
  6 mark3jac040      18,289    106,803
  7 utm3060           3,060     42,211

Figure 10 shows the effect of the optimization. Values less than 1 indicate a performance speedup. Our scheduling method reduces execution times by 10% to 25% for four out of the seven matrices in both CCS and CRS formats. The improvement for the CRS program comes from maximized data reuse, and the improvement for the CCS program comes from reduced false sharing. Only three matrices (circuit_1, lhr01, and coater1 in CCS format) show a slight slowdown (less than 5%). There are three possible reasons: 1) each row has too few elements, so the scheduling overhead exceeds the gain of the improved data reuse; 2) the amount of computation for each processor becomes imbalanced; 3) there is no opportunity to enhance the data reuse for certain sparse matrices (i.e., the original order is near-optimal).

Figure 10: Performance of SpMV. Both formats of CCS and CRS are used for each sparse matrix.

5.3 SpMM
We conducted sparse matrix-matrix multiplication C = AB with the four test inputs shown in Table 2. Instead of attempting various combinations of pairs of sparse matrices, we simply let B equal the transpose of A. In the experiment, matrix A is stored row by row in the CRS format and matrix B is stored by columns in the CCS format. We let a user-level thread compute the product of the ith row of A and the whole matrix B. Figure 11 indicates that the reduction of the total execution time is around 20%. Unlike the SpMV experiment above, this time the thread scheduling method is effective for all the test inputs.

Table 2: Sparse matrices used in the SpMM experiment.
    Name        Dimension      NNZ
  1 msc01440        1,440   44,998
  2 msc01050        1,050   26,198
  3 utm3060         3,060   42,211
  4 bcsstk13        2,003   83,883

Figure 11: Performance of SpMM.

5.4 Parallel Radix Sorting
The input keys array is distributed across four processors (i.e., four kernel threads) to do the parallel radix sorting. Suppose we create the four threads t_0, ..., t_3 on processors p_0, ..., p_3. Thread t_i keeps the corresponding ith block of the array. Processors p_0 and p_1 are co-located in one compute node while p_2 and p_3 are in another. We use a synthetic input to compare the effect of different placements of threads on processors. With this synthetic input, t_0 and t_3 have to swap keys, and t_1 and t_2 have to swap keys as well. A straightforward placement would be to place thread t_i on processor p_i (we call this experiment "original"). In this experiment each swap involves a pair of remote memory writes. A better way is to put t_0 and t_3 on the same node so that data swapping only involves local memory accesses (we call this experiment "optimized"). To find out what the best performance could be, we also modified the program and removed the operations performing remote memory writes (we call it "original without memory write"). The modified program provides further insight into the lower bound of the execution time, which is the best we could achieve. Figure 12 depicts the wallclock execution time of the three programs with different input sizes. Although the execution time is improved by just 10%, the reduction in remote memory accesses is equal to 30%-50% of the total memory access time (i.e., 30%-50% of the total communication time).

Figure 12: Performance of one-round parallel radix sorting.
5.5 IRREG
The IRREG program takes an irregular mesh as input and iterates through all the edges, updating the end points of each edge. Suppose a mesh has N nodes and E edges. We use a random generator to assign nodes to the end points of every edge, uniformly distributed between 0 and N-1. We analyze the memory trace of the program to determine an optimal schedule. The new schedule defines four sets of edges (for four processors), and each set has an ordered list of edges. We compare the performance of the original program to that of the optimized program with the new schedule. As shown in Table 3, our feedback-directed optimization method reduces the program execution time by 33% to 38%. The performance improvement increases with the mesh size because a larger working set often results in a greater number of cache misses, which is then reduced by using an optimized schedule.

Table 3: Performance of IRREG.
  Nodes    Edges      Time (original)   Time (optimized)   Reduced Time
  400      4,000      1.11              0.74               33.3%
  4,000    40,000     10.88             6.87               36.9%
  40,000   400,000    102.88            63.57              38.2%

6. RELATED WORK
A lot of previous work has proposed ways of reorganizing data structures and altering programs to maximize data reuse. Philbin et al. [11] describe a user-level thread library that improves cache locality using fine-grained threads. All data-independent units of computation implied in the sequential program are created as fine-grained threads all at once. When a thread is created, a hint giving the starting addresses of the accessed arrays must be provided as an argument. After thread creation, the thread library reschedules the threads at runtime to reduce the number of L2 cache misses. This method only works well for sequential programs.
Yan et al. [19] developed a runtime library to maximize data reuse. Yan's approach is more generic and can be applied to parallel programs on SMP machines. In addition, they adopt an adaptive scheduler to achieve load balance between processors. Like Philbin's approach, they also use the starting addresses of arrays as hints to determine an optimal schedule.
Pingali et al. [14] use locality groups to restructure computations for a variety of applications but require hand-coded optimizations. In contrast, we use a binary instrumentation tool to analyze the memory trace offline and automatically acquire more precise relationship information between threads. Such precise information is critical for thread scheduling on a ccNUMA distributed shared memory system. Our approach also has less overhead than the runtime approaches because it employs a feedback-directed method.
Ding [3] improves program locality through trace-driven computation regrouping. He developed a trace-driven tool to measure the exact control dependence between instructions and applied techniques of memory renaming and reallocation. Due to the expense of scheduling individual instructions, this method is limited to small fragments of a few kernels. It is also only applicable to sequential programs. Ding and Kennedy [2] introduced a compiler technique to minimize program bandwidth consumption. They propose a two-step strategy that fuses loops based on reuse distance and regroups (intermixes) several arrays based on their reference affinity.
Similar to our approach, Pichel et al. [12, 13] formulate the sparse matrix-vector product as a graph problem. In their graph, each row of the sparse matrix represents a vertex. The distances between vertices are determined by their locality model. The proposed technique tries to improve program locality by reorganizing rows of the original matrix through graph partitioning and subgraph reordering. It works effectively on ccNUMA DSM systems, but is limited to the SpMV application.
Affinity loop scheduling minimizes the cache miss rate by allocating loop iterations to the processor whose cache already contains the necessary data [10]. Affinity scheduling emphasizes loop iteration assignment and assumes that the loop iterations have an affinity to particular processors. A typical use of this method is to schedule a parallel loop that is nested within a sequential loop. Unlike affinity loop scheduling, our method tries to reorder and parallelize the inner loop iterations before assigning them to particular processors. Furthermore, we use more generic fine-grained threads as the scheduling unit rather than loop iterations.
Marathe et al. investigated how to place pages on a ccNUMA DSM system using a hardware profile-guided method [9]. They run a truncated version of a user application to decide a good page placement through a hardware monitor. Then they leverage the "first-touch" technique to allocate pages to the assigned processors. In contrast, we use a binary instrumentation tool to analyze the memory trace and determine the memory sharing relationship between threads, and we do not rely on any hardware facilities.

7. DISCUSSION
In our model, we assume each thread has the same amount of computation. Unfortunately this is not true for many applications. If two threads have heavier workloads and larger footprints than the others, they are more likely to be grouped together because the number of shared addresses might be large too. This kind of grouping may result in a load imbalance problem. To solve this problem, we are extending the current graph model by assigning weights to vertices.
The graph model does not distinguish read and write operations. On a ccNUMA DSM system, a memory write will invalidate a number of cache lines in other processors; it is also an expensive operation. Therefore, when performing graph partitioning, we hope to give a higher priority to writes than to reads.
Our model represents the affinity relationship by the number of addresses accessed in common. A more accurate way would be to use virtual cache-line numbers to reflect the real data movement (loads and stores of cache lines). This can also reduce the space complexity of our method.
We adopt DenseRatio to control the number of edges (i.e., the sparsity) of the graph. The current prototype cannot handle a huge graph with hundreds of millions of threads; such a huge graph requires a lot of memory even without edges. We could build the graph and write it to disk to solve this issue.
Since we adopt a diskless approach to analyzing the memory trace, the ability to record traces is limited by the amount of memory. To reduce the memory requirement, we hope to represent contiguous memory addresses by regions instead of individual points. Other alternative solutions might be designing an online algorithm, or collecting a partial memory trace for each thread.
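The cache-line refinement mentioned above amounts to collapsing each recorded address to a line number before it enters the per-thread address sets. A minimal sketch follows; the 64-byte line size is purely illustrative and would be a tool parameter:

#include <cstdint>

// Collapsing each traced address to a cache-line number before it enters the
// per-thread sets shrinks the trace and makes edge weights reflect
// line-granularity sharing rather than individual bytes.
constexpr std::uintptr_t kLineBytes = 64;   // illustrative; a tool parameter

inline std::uintptr_t line_of(std::uintptr_t addr) {
    return addr / kLineBytes;               // addr >> 6 for 64-byte lines
}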
Furthermore, executing the fully instrumented executable can be much slower (e.g., 10 to 100 times) than the original executable due to the cost of calling the analysis routine, storing the memory trace in the associated arrays, and creating the graph at the end. However, a partial execution of a truncated program can overcome this problem. For instance, a single outer-loop iteration of an iterative method is sufficient to build an affinity graph. Our ongoing work is implementing a mechanism to support this partial memory tracing method.

8. CONCLUSION
We present a feedback-directed framework to maximize program locality on distributed shared memory systems. We first run a binary instrumentation tool to automatically identify the nature of memory sharing between threads, which is represented by an affinity graph. The second step is to perform graph partitioning to determine an optimized schedule for assigning threads to particular processors. The optimized schedule improves the data locality of the original program and reduces TLB misses as well as the number of remote memory references on DSM systems. Experiments on four different types of applications have shown that our method is effective and can reduce the program execution time by up to 38%.
Although we use a memory tracing method, the online analysis completely eliminates expensive disk I/O operations. Our approach demonstrates that the feedback-directed method is especially good for applications with irregular computation and dynamic memory access patterns. The overhead of executing the applications with the optimal schedule (i.e., the feedback) is also low. With the affinity graph model, we are able to use a hierarchical mechanism to partition the graph corresponding to the actual memory hierarchy of a real system (e.g., nodes, chips, and processor cores). Our experiments so far involved either the node level or the processor level. After incorporating the hierarchy structure into our framework, we shall conduct more experiments on a large number of processors. It would also be possible to implement the feedback-directed strategy in commercial compilers so that programmers can automatically achieve high performance on leading-edge shared memory systems.

9. ACKNOWLEDGMENTS
The authors would like to thank the HPDC'07 reviewers for their valuable comments and suggestions on the initial draft of the paper. This research is supported by the National Science Foundation under grant No. 0444363.

10. REFERENCES
[1] T. Davis. University of Florida sparse matrix collection. http://www.cise.ufl.edu/research/sparse, 1997.
[2] C. Ding and K. Kennedy. Improving effective bandwidth through compiler enhancement of global cache reuse. J. Parallel Distrib. Comput., 64(1):108-134, 2004.
[3] C. Ding and M. Orlovich. The potential of computation regrouping for improving locality. In SC, page 13. IEEE Computer Society, 2004.
[4] B. Hendrickson and R. Leland. The Chaco user's guide: Version 2.0. Sandia Tech Report SAND94-2692, 1994.
[5] B. Hendrickson and R. Leland. A multilevel algorithm for partitioning graphs. In Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing, page 28, New York, NY, USA, 1995. ACM Press.
[6] S.-J. Lee, M. Jeon, D. Kim, and A. Sohn. Partitioned parallel radix sort. J. Parallel Distrib. Comput., 62(4):656-668, 2002.
[7] H. Lu, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. Message passing versus distributed shared memory on networks of workstations. In Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM), page 37. ACM Press, 1995.
[8] C.-K. Luk, R. S. Cohn, R. Muth, H. Patil, A. Klauser, P. G. Lowney, S. Wallace, V. J. Reddi, and K. M. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In V. Sarkar and M. W. Hall, editors, PLDI, pages 190-200. ACM, 2005.
[9] J. Marathe and F. Mueller. Hardware profile-guided automatic page placement for ccNUMA systems. In J. Torrellas and S. Chatterjee, editors, PPOPP, pages 90-99. ACM, 2006.
[10] E. P. Markatos and T. J. LeBlanc. Using processor affinity in loop scheduling on shared-memory multiprocessors. IEEE Trans. Parallel Distrib. Syst., 5(4):379-400, 1994.
[11] J. Philbin, J. Edler, O. J. Anshus, C. C. Douglas, and K. Li. Thread scheduling for cache locality. In ASPLOS, pages 60-71, 1996.
[12] J. C. Pichel, D. B. Heras, J. C. Cabaleiro, and F. F. Rivera. Improving the locality of the sparse matrix-vector product on shared memory multiprocessors. In PDP, pages 66-71. IEEE Computer Society, 2004.
[13] J. C. Pichel, D. B. Heras, J. C. Cabaleiro, and F. F. Rivera. A new technique to reduce false sharing in parallel irregular codes based on distance functions. In International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN 2005), 2005.
[14] V. K. Pingali, S. A. McKee, W. C. Hsieh, and J. B. Carter. Restructuring computations for temporal data cache locality. International Journal of Parallel Programming, 31(4):305-338, 2003.
[15] H. Shan and J. P. Singh. Parallel sorting on cache-coherent DSM multiprocessors. In Supercomputing '99: Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, page 40, New York, NY, USA, 1999. ACM Press.
[16] B. Sinharoy, R. Kalla, J. Tendler, R. Eickemeyer, and J. Joyner. Power5 system microarchitecture. IBM Journal of Research and Development, 49(4/5):505-521, 2005.
[17] F. Song, S. Moore, and J. Dongarra. Modeling of L2 cache behavior for thread-parallel scientific programs on chip multi-processors. UT Computer Science Technical Report UT-CS-06-583, 2006.
[18] SGI. Linux resource administration guide. SGI Techpubs Library 007-4413-011, 2006.
[19] Y. Yan, X. Zhang, and Z. Zhang. Cacheminer: A runtime approach to exploit cache locality on SMP. IEEE Trans. Parallel Distrib. Syst., 11(4):357-374, 2000.