Bandit-Based Optimization on Graphs with Application to Library Performance Tuning

Frédéric de Mesmay fdemesma@ece.cmu.edu
Arpad Rimmel rimmel@lri.fr
Yevgen Voronenko yvoronen@ece.cmu.edu
Markus Püschel pueschel@ece.cmu.edu

Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213 USA
TAO (Inria), LRI, UMR 8623 (CNRS - Université Paris-Sud), 91405 Orsay, France

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

The problem of choosing fast implementations for a class of recursive algorithms such as the fast Fourier transforms can be formulated as an optimization problem over the language generated by a suitably defined grammar. We propose a novel algorithm that solves this problem by reducing it to maximizing an objective function over the sinks of a directed acyclic graph. This algorithm valuates nodes using Monte-Carlo and grows a subgraph in the most promising directions by considering local maximum k-armed bandits. When used inside an adaptive linear transform library, it cuts down the search time by an order of magnitude compared to the existing algorithm. In some cases, the performance of the implementations found is also increased by up to 10%, which is of considerable practical importance since it consequently improves the performance of all applications using the library.

1. Introduction and Related Work

The computing platforms available today can differ substantially in their memory hierarchies, numbers of processors, and many other microarchitectural details. Further, these details tend to change with every new generation of processors. As a consequence, code that runs fast on one platform may perform suboptimally, if not poorly, on another: performance cannot easily be ported. This problem particularly affects the developers of high performance libraries, who often create different implementations for each platform and whenever new platforms are released (Intel, 2008).

One way to reduce the recurring development costs is automatic performance tuning. The basic idea is to use feedback-driven search or learning techniques to automatically find the fastest implementation of a given functionality on a given platform among a set of possible choices. Variants arise from different possibilities of recursion for divide-and-conquer algorithms or from implementation decisions such as unrolling and parallelizing. Covered functionalities include dense (Whaley & Petitet, 2005) and sparse (Vuduc et al., 2005) linear algebra, but we particularly focus on fast Fourier transforms (FFT) with an adaptive library such as FFTW (Frigo & Johnson, 2005) or the libraries generated by Spiral (Püschel et al., 2005; Voronenko, 2008).

In each of the above cases, the space of alternatives is extremely large, requiring efficient search methods. Various have been tried for FFTs: hill climbing, genetic algorithms, regression trees (Singer & Veloso, 2002), etc. In practice, however, the most efficient search method, denoted DP, seems to be restricting the search space and using dynamic programming. However, when the search space becomes even larger (large problem sizes or more complicated libraries), DP may take very long to terminate, and the very best solution might not even be in the restricted space: a waste, since the library may have the ability to run faster, and also of considerable practical relevance if the fastest existing code can be improved further.

Contributions. In this paper we first abstractly formulate the performance tuning problem of adaptive divide-and-conquer type libraries as an optimization problem associated with a large acyclic formal grammar. Each word in the language generated by the
grammar is a recursion strategy the library can perform. We then solve the problem using a novel online search algorithm, Threshold Ascend on Graph (TAG), that exploits the inherent structure of the problem. TAG is similar to UCT (Kocsis & Szepesvári, 2006) in that it gradually builds up the graph generating the language by considering local bandit problems and by valuating the nodes with Monte-Carlo simulations. TAG differs from UCT in that it optimizes for the best single reward over a graph instead of maximizing the cumulative reward over a tree.

We implemented TAG to be composable with the adaptive transform libraries generated by Spiral (Voronenko, 2008). These libraries are vectorized and parallelized, possess a very large search space, and are in many cases faster than any other existing code. We show that, compared to the dynamic programming search typically used for these kinds of problems, TAG can dramatically reduce the performance tuning time (i.e., a good solution is found much faster) and in some cases finds a better solution (and hence improves the library performance). Therefore, it is a suitable optimization technique in this domain. As an additional benefit, TAG is an anytime algorithm.

3. The TAG Algorithm

TAG is an anytime algorithm that determines an approximate solution to Problem 2 (stated below). Due to the size of the graph it is not meant to run until completion, in which case it would be equivalent to an exhaustive search.

TAG finds solutions by incrementally growing and exploiting the subgraph Ĝ = (V̂, Ê) of G = (V, E), with V̂ ⊆ V and Ê ⊆ E, starting with Ĝ = ({S}, {}). Evaluations are used to direct the growth of Ĝ towards the expected best sinks.

Assume the current subgraph is Ĝ. Then TAG proceeds in three high-level steps, visualized in Figure 2:

1. Descend: Ĝ is traversed starting at its root. Each choice along the way is solved by a bandit algorithm. The descent stops when it uses an arrow e that is not in Ê.
2. Evaluate: If e is incident with a vertex not in V̂, this vertex is evaluated using a Monte-Carlo expansion.
3. Backpropagate: The evaluation is stored in all ancestors of the vertex.

We proceed by describing the three steps in detail, present the pseudocode, and conclude the section with a presentation of related algorithms.

2. Formal Problem Statement

Below, we formally state the problem considered in this paper. Later, we will show that automatic tuning in the considered transform libraries is an instantiation of this problem.

Problem 1. Given is an acyclic formal grammar F = (T, N, P, S) with T the set of terminals, N the set of nonterminals, P the set of production rules or simply rules, and S the starting symbol. L(F) is the associated language and f is an objective function from L(F) into the positive reals R+. We want to compute

  w_best = argmax_{w ∈ L(F)} f(w).

Figure 1. Formal grammar F = (T, N, P, S) with N = {S, A, B}, T = {a, b, c}, and P = {S → AB, A → a, B → b, aB → ac} (left), and associated derivation graph G(F) with nodes S, AB, Ab, aB, ab, ac (right). S, A, B are nonterminals and a, b, c are terminals. The graph has two sinks (double circled), i.e., the language L(F) has two elements.

We assume the graph G(F) to be large, such that it is impossible to generate and evaluate all sinks in a reasonable time. Our goal is an algorithm that finds a "very good" sink with a small number of evaluations.

F has an associated derivation graph G = G(F) which is directed, acyclic, and weakly connected, as shown in Figure 1: S is the root, the directed edges (arrows) correspond to applications of rules in P, the nodes are partially derived words in the language, and the sinks (outdegree = 0) are precisely the elements of L(F).
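To make the derivation graph concrete, the grammar of Figure 1 is small enough to enumerate exhaustively. The following sketch is ours, not part of the paper's implementation: words are encoded as strings, and since the rule aB → ac has a two-symbol left-hand side, productions are applied as general string rewrites.

```python
# Figure 1 grammar: nonterminals {S, A, B}, terminals {a, b, c},
# productions given as string-rewriting rules (lhs -> rhs).
RULES = [("S", "AB"), ("A", "a"), ("B", "b"), ("aB", "ac")]

def successors(word):
    """All words reachable by applying one production at one position."""
    succ = set()
    for lhs, rhs in RULES:
        start = 0
        while (i := word.find(lhs, start)) != -1:
            succ.add(word[:i] + rhs + word[i + len(lhs):])
            start = i + 1
    return succ

def derivation_graph(root="S"):
    """BFS from the start symbol; sinks (no successors) form the language L(F)."""
    vertices, edges, frontier = {root}, set(), [root]
    while frontier:
        v = frontier.pop()
        for w in successors(v):
            edges.add((v, w))
            if w not in vertices:
                vertices.add(w)
                frontier.append(w)
    sinks = {v for v in vertices if not successors(v)}
    return vertices, edges, sinks

V, E, sinks = derivation_graph()
print(sorted(sinks))  # -> ['ab', 'ac']
```

The six vertices and two sinks produced match the derivation graph drawn in Figure 1; in the tuning application, a real grammar is far too large for such an enumeration, which is precisely what motivates TAG.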
Hence we can reduce Problem 1 to:

Problem 2. Given a weakly connected, acyclic, directed graph G = (V, E) and an objective function f (as above) on the sinks S(G) of G. We want to compute

  w_best = argmax_{w ∈ S(G)} f(w).

Figure 2. Visualization of the three main steps in TAG (panels: Descend, Evaluate, Backpropagate; in the Evaluate panel, Monte-Carlo paths lead to sinks whose values f(w) are stored). Note that Ĝ (shaded area) and G are not trees (e.g., see Figure 1).

3.1. Descend

The goal of the descent step is to select the next edge to add to the subgraph Ĝ. It is chosen so that Ĝ grows towards the sinks that present the best expected rewards. Starting from the root S, the most promising path is laid out by successively choosing the most promising outgoing edges. Each choice is solved using a bandit algorithm that we describe first.

Background: Max k-Armed Bandit Problem. The maximum k-armed bandit problem considers a slot machine with k arms, each one having a different pay-out distribution (Figure 3). The goal is to maximize the single best reward obtainable over n̄ trials (Cicirello & Smith, 2005). Formally, if each arm has distribution D_i and R_j(D_i) denotes the j-th reward obtained on arm i, the goal is to solve

  max_{n_1 + ... + n_k = n̄}  max_{1 ≤ i ≤ k}  max_{1 ≤ j ≤ n_i}  R_j(D_i).

In this paper, we use a variation: an anytime version of the problem where the total number of pulls n̄ is not known in advance. Only the n previous pulls and their associated rewards are known.

Figure 3. A 3-armed bandit. The choice of the arm i leads to a realization of the distribution D_i.

Streeter & Smith (2006) solve the problem using Threshold Ascend, an algorithm that makes no assumptions on the form of the distributions. Using their notations, we present here a straightforward adaptation to the anytime variation. The main idea of the algorithm is to track only the s best rewards and the arms they are coming from. Let s_i be the number of such rewards among the n_i rewards received by arm i. Also, let δ be a positive real parameter. The algorithm advises to pull the arm i_best given by

  i_best = argmax_{1 ≤ i ≤ k} h(s_i, n_i),

with

  h(s_i, n_i) = (s_i + α + sqrt(2 s_i α + α²)) / n_i    if n_i > 0,
                ∞                                        otherwise,

and α = ln(2nk/δ).

Descend. The graph descent is responsible for incrementally building the subgraph Ĝ ⊆ G, initially restricted to the root. The purpose of the descent is to select an arrow in E \ Ê that leads towards an expected good sink. It does so by tracing a path starting from the root and considering each successor choice as a max k-armed bandit problem (Figure 4). For now, assume that a table of positive real rewards R(v) has been maintained for each vertex v ∈ V̂.

Let v denote the current vertex in the descent. Starting from v, there are multiple ways to continue the path since it can follow any of the arrows originating from v (we denote these with E(v)). The arrows in E(v) that are also in Ê lead to vertices of V̂ corresponding to "arms" that have already been played (they have previous rewards attached to them). The other arrows lead to arms that have never been played. The bandit algorithm discussed above decides which arrow to follow, which has to be one that was not followed before if such an arrow exists (due to the infinite weight in h(s_i, n_i)). If the arrow belongs to Ê and the successor is not a sink, the successor becomes the new descent vertex and the descent continues. If not, the descent ends.

Figure 4. The descent in the graph is done as a cascade of multi-armed bandits. Solid arrows, circles and boxes are in Ĝ, dashed arrows and circles are in G \ Ĝ. For bandit A all arms had been played before, and A1 is chosen based on the stored rewards. Bandit B will now choose B1, since it is the only arrow not played before.
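The selection rule h used at each choice point of the descent is straightforward to implement. The following sketch is ours (the paper gives only the formula); it assumes each arm stores the pair (s_i, n_i), i.e., its count of top-s rewards and its total pull count.

```python
import math

def threshold_ascend_score(s_i, n_i, n, k, delta):
    """Score h(s_i, n_i): unplayed arms get an infinite score so they are
    tried first; otherwise an optimistic estimate of the arm's chance of
    producing a top reward, with alpha = ln(2 n k / delta)."""
    if n_i == 0:
        return math.inf
    alpha = math.log(2 * n * k / delta)
    return (s_i + alpha + math.sqrt(2 * s_i * alpha + alpha ** 2)) / n_i

def pick_arm(stats, n, delta):
    """stats: list of (s_i, n_i) pairs, one per arm; returns the arm index."""
    k = len(stats)
    return max(range(k), key=lambda i: threshold_ascend_score(*stats[i], n, k, delta))
```

For example, pick_arm([(2, 10), (0, 0), (1, 5)], n=15, delta=0.1) returns 1, the never-played arm, reflecting the forced-exploration behavior described above.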
3.2. Evaluate

Assume the descent ended on an arrow pointing to a vertex v that is not part of V̂. The arrow and vertex are then immediately added to Ĝ and v is evaluated. If v is a sink of G, then f(v) can be directly computed. Otherwise, a path to a sink of G is chosen by "Monte-Carlo," which means that in each step a (uniformly drawn) random choice is made until a sink w is obtained. The evaluation f(w) gives a value for v. Also, if the evaluation is better than f(w_best), the current best sink is replaced.

3.3. Backpropagate

After v has been evaluated, the reward is added to its reward list R(v) and to the reward lists of all its ancestors.

Note that if the descent ended on an arrow pointing to a vertex v that is already part of V̂, we just discovered a new way to connect to an already evaluated vertex. In this case, we add the new arc to Ê and propagate the rewards of v only to the vertices that would not be ancestors of v without the new arrow (since the other ancestors already have these rewards).

3.4. Pseudocode and Remark

Pseudocode. Algorithm 1, the pseudocode of TAG, summarizes the previous discussion. After initialization, the graph Ĝ = (V̂, Ê) is grown one arc at a time until the user signifies an interruption. The vertex pointed to by an arrow e is denoted head(e). BANDIT refers to the Threshold Ascend algorithm summarized in subsection 3.1. RANDOM refers to a uniform draw.

Remark. In practice, if the objective function is deterministic, it is useless to evaluate a sink twice. It is therefore possible to modify the algorithm to guarantee that it never returns to a branch where all choices have been exhausted. While we implemented this version, we do not present it due to lack of space.

3.5. Related Algorithms

The "classic" multi-armed bandit problem involves maximizing the expected sum of rewards of multiple slot machines with different pay-out distributions.
Many proposed algorithms are based on optimism in the face of uncertainty: the score of a slot machine is its current estimated value plus a term that grows with the uncertainty. For instance, Upper Confidence Bounds (UCB) proposes a term in sqrt(log(n)/n_i) (Lai & Robbins, 1985; Auer et al., 2002).

Algorithm 1 TAG
  Ĝ ← ({S}, {})
  w_best ← ∅
  R(V̂) ← ∅
  while not interrupted do
    e ← BANDIT(E(S))                          {Descend}
    while e ∈ Ê and E(head(e)) ≠ ∅ do
      e ← BANDIT(E(head(e)))
    end while
    v ← head(e)
    if v ∉ Ĝ then                             {Evaluate}
      add v and e to Ĝ
      e ← RANDOM(E(v))
      while E(head(e)) ≠ ∅ do
        e ← RANDOM(E(head(e)))
      end while
      w ← head(e)
      if f(w) > f(w_best) then
        w_best ← w
      end if
      r ← f(w)
      add r to R(v)                           {Backpropagate}
      for all ancestors a of v in Ĝ do
        add r to R(a)
      end for
    else
      for all ancestors a of v in Ĝ do
        mark a
      end for
      add e to Ĝ
      for all ancestors a of v in Ĝ do
        if a is marked then
          unmark a
        else
          add all of R(v) to R(a)
        end if
      end for
    end if
  end while
  return w_best

Other Monte-Carlo based algorithms could be used to perform an optimization on a leaves-evaluated graph but, besides the fact that they are usually designed for trees, they differ from TAG by biasing the subgraph towards zones that are good on average, which can be distinct from zones that are likely to contain maximums. Chaslot et al. (2006) proposed an algorithm derived from the central limit theorem that gives good results on the production management problem. UCT uses UCB as a local branch selector (Kocsis & Szepesvári, 2006) and is particularly efficient with huge search spaces: it is at the origin of the current best computer Go players (Coulom, 2006; Wang & Gelly, 2007; Gelly & Silver, 2007).

4. Application: Performance Tuning in Adaptive Libraries

Our target application for TAG is the automatic performance tuning in adaptive libraries based on divide-and-conquer algorithms with inherent degrees of freedom.
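To get a feel for these degrees of freedom, consider a toy model (ours, a simplification and not Spiral's actual rule set) in which a size-n transform is either computed directly or split into subproblems of sizes k and n/k for some nontrivial factorization. Counting the complete recursion strategies already shows why exhaustive search over such spaces is hopeless.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def strategies(n):
    """Number of complete recursion strategies for a toy size-n transform:
    either compute directly (one way), or split as k * (n // k) for a
    nontrivial divisor k and recurse independently on both factors."""
    count = 1  # direct (base-case) implementation
    for k in range(2, n):
        if n % k == 0:
            count += strategies(k) * strategies(n // k)
    return count

print(strategies(64))  # -> 188
```

Even in this stripped-down model the count grows rapidly with the number of prime factors (2, 5, 15, 51, 188, ... for sizes 4, 8, 16, 32, 64), and each strategy must additionally be combined with implementation choices such as unrolling and parallelization.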
Specifically, we implemented TAG to operate as a search strategy in the adaptive general-size linear transform libraries generated by Spiral (Voronenko et al., 2009). We first give brief background on transforms, transform algorithms, their implementations, and the notion of an adaptive library. Then we discuss the need for search, and finally match the performance tuning problem to Problem 1, which shows that TAG is applicable.

4.1. Background: Linear Transforms

Transforms. A linear transform is a matrix-vector product y = Mx, where x is the input vector, y the output vector, and M the fixed transform matrix. We focus on the discrete Fourier transform (DFT), defined as

  DFT_n = [e^(-2πikl/n)]_{0 ≤ k, l < n}
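The definition can be checked directly. A small sketch (ours, not the generated library code) builds the DFT_n matrix entrywise and applies it as a plain matrix-vector product:

```python
import cmath

def dft_matrix(n):
    """DFT_n = [e^(-2*pi*i*k*l/n)] for 0 <= k, l < n."""
    return [[cmath.exp(-2j * cmath.pi * k * l / n) for l in range(n)]
            for k in range(n)]

def apply_transform(M, x):
    """y = M x, the matrix-vector product defining a linear transform."""
    return [sum(m_kl * x_l for m_kl, x_l in zip(row, x)) for row in M]

y = apply_transform(dft_matrix(4), [1, 1, 1, 1])
# A constant input concentrates all its energy in the first (k = 0)
# output coefficient; the remaining coefficients vanish.
```

This direct evaluation costs O(n^2) operations; the fast algorithms discussed next reduce it by recursively factoring DFT_n, which is exactly where the degrees of freedom that TAG searches over come from.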