Bandit-Based Optimization on Graphs with Application to Library Performance Tuning

Fr´d´ric de Mesmay e e fdemesma@ece.cmu.edu Arpad Rimmel rimmel@lri.fr Yevgen Voronenko yvoronen@ece.cmu.edu Markus P¨ schel u pueschel@ece.cmu.edu  Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213 USA  TAO (Inria), LRI, UMR 8623 (CNRS - Universit´ Paris-Sud), 91405 Orsay, France e

Abstract
The problem of choosing fast implementations for a class of recursive algorithms such as the fast Fourier transforms can be formulated as an optimization problem over the language generated by a suitably defined grammar. We propose a novel algorithm that solves this problem by reducing it to maximizing an objective function over the sinks of a directed acyclic graph. This algorithm valuates nodes using Monte-Carlo and grows a subgraph in the most promising directions by considering local maximum k-armed bandits. When used inside an adaptive linear transform library, it cuts down the search time by an order of magnitude compared to the existing algorithm. In some cases, the performance of the implementations found is also increased by up to 10% which is of considerable practical importance since it consequently improves the performance of all applications using the library.

opers of high performance libraries, who often create different implementations for each platform and whenever new platforms are released (Intel, 2008). One way to reduce the recurring development costs is automatic performance tuning. The basic idea is to use feedback-driven search or learning techniques to automatically find the fastest implementation of a given functionality on a given platform among a set of possible choices. Variants arise from different possibilities of recursion for divide-and-conquer algorithms or from implementation decisions such as unrolling and parallelizing. Covered functionalities include dense (Whaley & Petitet, 2005) and sparse (Vuduc et al., 2005) linear algebra but we particularly focus on fast Fourier transforms (FFT) with an adaptive library such as FFTW (Frigo & Johnson, 2005) or the libraries generated by Spiral (P¨ schel et al., 2005; Voronenko, 2008). u In each of the above cases, the space of alternatives is extremely large, requiring efficient search methods. Various have been tried for FFTs: hill climbing, genetic algorithms, regression trees (Singer & Veloso, 2002), etc. In practice, however, the most efficient search method, denoted DP, seems to be restricting the search space and using dynamic programming. However, when the search space becomes even larger (large problem sizes or more complicated libraries), DP may take very long to terminate and the very best solution might not even be in the restricted space­a waste since the library may have the ability to run faster, and also of considerable practical relevance if the fastest existing code can be improved further. Contributions. In this paper we first abstractly formulate the performance tuning problem of adaptive divide-and-conquer type libraries as an optimization problem associated with a large acyclic formal grammar. Each word in the language generated by the

1. Introduction and Related Work
The computing platforms available today can differ substantially in their memory hierarchies, numbers of processors, and many other microarchitectural details. Further, these details tend to change with every new generation of processors. As a consequence, code that runs fast on one platform may perform suboptimally if not poorly on another: performance cannot easily be ported. This problem particularly affects the develAppearing in Proceedings of the 26 th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Bandit-Based Optimization on Graphs with Application to Library Performance Tuning

grammar is a recursion strategy the library can perform. We then solve the problem using a novel onlinesearch algorithm, Threshold Ascend on Graph (TAG), that exploits the inherent structure of the problem. TAG is similar to UCT (Kocsis & Szepesvi, 2006), in that it gradually builds up the graph generating the language by considering local bandit problems and by valuating the nodes with Monte-Carlo simulations. TAG differs from UCT in that it optimizes for the best single reward over a graph instead of maximizing the accumulative reward over a tree. We implemented TAG to be composable with adaptive transforms libraries generated by Spiral (Voronenko, 2008). These libraries are vectorized and parallelized, possess a very large search space and are in many cases faster than any other existing code. We show that, compared to the dynamic programming search (typically used for these kinds of problems), TAG can dramatically reduce the performance tuning time (i.e. a good solution is found much faster) and in some cases finds a better solution (and hence improves the library performance). Therefore, it is a suitable optimization technique in this domain. As an additional benefit, TAG is an anytime algorithm.

N = {S, A, B}

T = {a, b, ac}

S

P = {S  AB, A  a, B  b, aB  ac}

AB

Ab

aB

ab

ac

Figure 1. Formal grammar F = (T, N, P, S) (left) and associated derivation graph G(F ) (right). S, A, B are nonterminals and a, b, c are terminals. The graph has two sinks (double circled), i.e., the language L(F ) has two elements.

We assume the graph G(F ) to be large such that it is impossible to generate and evaluate all sinks in a reasonable time. Our goal is an algorithm that finds a "very good" sink with a small number of evaluations.

3. The TAG Algorithm
TAG is an anytime algorithm that determines an approximate solution to Problem 2. Due to the size of the graph it is not meant to run until completion, in which case it would be equivalent to an exhaustive search. TAG finds solutions by incrementally growing and ex^ ^ ^ ploiting the subgraph G = (V , E) of G = (V, E): ^ ^ ^ V  V , E  E, starting with G = ({S}, {}). Evalu^ ations are used to direct the growth of G towards the expected bests sinks. ^ Assume the current subgraph is G. Then TAG proceeds in three high level steps visualized in Figure 2: 1. Descend: G is traversed starting at its root. Each choice along the way is solved by a bandit algorithm. The descent stops when it uses an arrow e ^ that is not in E. 2. Evaluate: If e is incident with a vertex not in ^ V , this vertex is evaluated using a Monte-Carlo expansion. 3. Backpropagate: The evaluation is stored in all ancestors of the vertex. We proceed with describing the three steps in detail, describe the pseudocode and conclude the section with a presentation of related algorithms.

2. Formal Problem Statement
Below, we formally state the problem considered in this paper. Later, we will show that automatic tuning in the considered transform libraries is an instantiation of this problem. Problem 1 Given is an acyclic formal grammar F = (T, N, P, S) with T the set of terminals, N the set of nonterminals, P the set of production rules or simply rules, and S the starting symbol. L(F ) is the associated language and f is an objective function from L(F ) into the positive reals R+ . We want to compute wbest = argmax f (w).
wL(F )

F has an associated derivation graph G = G(F ) which is directed, acyclic and weakly connected as shown in Figure 1: S is the root, the directed edges (arrows) correspond to applications of rules in P , the nodes are partially derived words in the language, and the sinks (outdegree = 0) are precisely the elements of L(F ). Hence we can reduce Problem 1 to: Problem 2 Given a weakly connected, acyclic, directed graph G = (V, E) and an objective function f (as above) on the sinks S(G) of G. We want to compute wbest = argmax f (w).
wS(G)

Bandit-Based Optimization on Graphs with Application to Library Performance Tuning
S
G

S

S
store f(w)

G
Monte Carlo sinks f(w)

Let si be the number of such rewards among the ni rewards received by the arm i. Also, let  be a positive real parameter. The algorithm advises to pull the arm ibest given by ibest = argmax h(si , ni ),
1ik

Descend

Evaluate

Backpropagate

Figure 2. Visualization of the three main steps in TAG. ^ Note that G (shaded area) and G are not trees (e.g., see Figure 1).

with h(si , ni ) =

 si ++ 2si +2 , ni

,

if ni > 0 else

and  = ln(2nk/). Descend. The graph descent is responsible for in^ crementally building the subgraph G  G, initially restricted to the root. The purpose of the descent is ^ to select an arrow in E \ E that leads towards an expected good sink. It does so by tracing a path starting from the root and considering each successor choice as a max k-armed bandit problem (Figure 4). For now, assume that a table of positive real rewards R(v) has ^ been maintained for each vertex v  V . Let v denote the current vertex in the descent. Starting from v, there are multiple ways to continue the path since it can follow any of the arrows originating from v (we denote these with E(v)). The arrows in ^ ^ E(v) that are also in E(v) lead to vertices of V corresponding to "arms" that have already been played (they have previous rewards attached to them). The other arrows lead to arms that have never been played. The bandit algorithm discussed above decides which arrow to follow, which has to be one that was not followed before if such an arrow exists (due to the infinite ^ weight in h(si , ni )). If the arrow belongs to E(v) and the successor is not a sink, the successor becomes the new descent vertex and the descent continues. If not, the descent ends.
bandit A arm A1

3.1. Descend The goal of the descent step is to select the next edge ^ ^ to add to the subgraph G. It is chosen so that G grows towards the sinks that present the best expected rewards. Starting from the root S, the most promising path is layed out by successively chosing the most promising outgoing edges. Each choice is solved using a bandit algorithm that we describe first. Background: Max k-Armed Bandit Problem. The maximum k-armed bandit problem considers a slot machine with k arms, each one having a different pay-out distribution (Figure 3). The goal is to maximize the single best reward obtainable over n tri¯ als (Cicirello & Smith, 2005). Formally, if each arm has distribution Di and Rj (Di ) denotes the j-th reward obtained on arm i, the goal is to solve max Pkmax ¯ n 1ik i=1 ni =¯
1j¯ i n

max Rj (Di ).

In this paper, we use a variation: an anytime version of the problem where the total number of pull n is not ¯ known in advance. Only the n previous pulls and their associated rewards are known.
bandit arm 1 arm 2 D2 arm 3

arm A2 bandit B arm B1 arm B2

arm A3

D1

D3

Figure 3. A 3-armed bandit. The choice of the arm i leads to a realization of the distribution Di .

Streeter & Smith (2006) solve the problem using Threshold Ascend, an algorithm that makes no assumptions on the form of the distributions. Using their notations, we present here a straightforward adaptation to the anytime variation. The main idea of the algorithm is to track only the s best rewards and the arms they are coming from.

Figure 4. The descent in the graph is done as a cascade of multi-armed bandits. Solid arrows, circles and boxes are ^ ^ in G, dashed arrows and circles are in G \ G. For bandit A all arms had been played before, and A1 is chosen based on the stored rewards. Bandit B will now choose B1, since it is the only arrow not played before.

Bandit-Based Optimization on Graphs with Application to Library Performance Tuning

3.2. Evaluate Assume the descent ended on an arrow pointing to a ^ vertex v that is not part of V . The arrow and vertex ^ are then immediately added to G and v is evaluated. If v is a sink of G, then f (v) can be directly computed. Otherwise a path to a sink of G is chosen by "MonteCarlo," which means in each step a (uniformly drawn) random choice is made until a sink w is obtained. The evaluation f (w) gives a value for v. Also, if the evaluation is better than f (wbest ), the current best sink is replaced. 3.3. Backpropagate After v has been evaluated, the reward is added to its reward list R(v) and to the reward lists of all its ancestors. Note that if the descent ended on an arrow pointing ^ to a vertex v that is already a part of V , we just discovered a new way to connect to an already evaluated ^ vertex. In this case, we add the new arc to E and propagate the rewards of v only to the vertices that would not be ancestors of v without the new arrow (since the other ancestors already have these rewards). 3.4. Pseudocode and Remark Pseudocode. Algorithm 1, the pseudocode of TAG, summarizes the previous discussion. After initializa^ ^ ^ tion, the graph G = (V , E) is grown one arc at a time until the user signifies an interruption. The vertex pointed by an arrow e is denoted head(e). BANDIT refers to the Treshold Ascend algorithm summarized in subsection 3.1. RANDOM refers to an uniform draw. Remark. Practically, if the objective function is deterministic, it is useless to evaluate a sink twice. It is therefore possible to modify the algorithm to guarantee that it never returns in a branch where choices have been exhausted. While we implemented this version we do not present it due to lack of space. 3.5. Related Algorithms The "classic" multi-armed bandit problem involves maximizing the expected sum of rewards of multiple slot machines with different pay-out distributions. Many proposed algorithms are based on optimism in front of uncertainty: the score of a slot machine is its current estimated value plus a term that grows with the uncertainty. For instance, Upper Confidence Bounds (UCB) proposes a term in log(n)/ni (Lai & Robbins, 1985; Auer et al., 2002).

Algorithm 1 TAG
^ GS wbest   ^ R(V )   while not interrupted do e  BANDIT(E(S)) ^ while e  E & E(head(e)) =  do e  BANDIT(E(head(e)) end while v  head(e) ^ if v  G or e  G then / ^ ^ add v and e to G e  RANDOM(E(v)) while E(head(e)) =  do e  RANDOM(E(head(e)) end while w  head(e) if f (w) > f (wbest ) then wbest  w end if r  f (w) add r to R(v) ^ for a ancestor of v in G do add r to R(a) end for else ^ for a ancestor of v in G do mark a end for ^ add e to G ^ for a ancestor of v in G do if a is marked then unmark a else add all R(v) to R(a) end if end for end if end while return wbest

Descend

Evaluate

Backpropagate

Other Monte-Carlo based algorithms could be used to perform an optimization on a leaves-evaluated graph but, besides the fact that they usually are designed for trees, they differ from TAG by biasing the subgraph towards zones that are good on average which can be distinct from zones that are likely to contain maximums. Guillaume Chaslot et al. proposed an algorithm derived from the central limit theorem that gives good result on the production management problem (2006). UCT uses UCB as a local branch selector (Kocsis & Szepesvi, 2006) and is particularly efficient with huge search-spaces: it is at the origin of the current best computer Go players (Coulom, 2006; Wang & Gelly, 2007; Gelly & Silver, 2007).

Bandit-Based Optimization on Graphs with Application to Library Performance Tuning

4. Application: Performance Tuning in Adaptive Libraries
Our target application for TAG is the automatic performance tuning in adaptive libraries based on divideand-conquer algorithms with inherent degrees of freedom. Specifically, we implemented TAG to operate as a search strategy in the adaptive general-size linear transform libraries generated by Spiral (Voronenko et al., 2009). We first give brief background on transforms, transform algorithms, their implementations, and the notion of an adaptive library. Then we discuss the need for search and finally match the performance tuning problem to Problem 1, which shows that TAG is applicable. 4.1. Background: Linear Transforms Transforms. A linear transform is a matrix-vector product y = M x, where x is the input vector, y the output vector, and M the fixed transform matrix. We focus on the discrete Fourier transform (DFT) defined as  DFTn = [e-2ik/n ]0k,<n , i = -1. Na¨ computation of the matrix-vector product inive curs O(n2 ) operations, however, fast, O(n log(n)), algorithms, which exploit the particular structure of matrix M , exist for many transforms including the DFT. Fast algorithms. One way of writing transform algorithms is as sparse factorizations of the transform matrix. For example, the famous Cooley-Tukey fast Fourier transform (FFT) algorithm can be written as DFTn = (DFTk  Im ) Tn (Ik  DFTm ) Ln , n = km. m k (1) Here, In is the identity matrix of size n; Tn is a diagm onal matrix and Ln a permutation matrix, whose prek cise definition is not relevant here. Finally, the tensor (or Kronecker) product  of two matrices is defined as A  B = [ak,l B], where A = [ak,l ]. We show below a visualization of the non-zero values in the matrices for k = m = 4.
DFT16 DFT4  I4 T4 4 I4  DFT4 L16 4

Number of di erent DFT algorithms using only Cooley-Tukey 109

106

10

3

10

0

DFT size 20 25 210 215 220

Figure 5. Number of DFT algorithms based on standard Cooley-Tukey FFT, implemented na¨ ively. All algorithms for a given DFT input size have roughly the same operations count.

divide-and-conquer approach, and that there is a degree of freedom (choice of k|n). Assuming that n is a power of two1 , recursive applications of the algorithm yield O(n log(n)) computations. Implementation and search space. The above FFT suggests a library implementation using a recursive function dft. Given the input x, the function would first permute (t = Ln x), then call dft on multik ple segments of x, then scale the result with the entries of Tn , and then call again dft on segments, extracted m in a stride, of the result. The resulting library would have a simple call graph, as shown in Figure 6(a). Even such a simple implementation has a degree of freedom in the recursion due to the choice of k. Recursively compounded this yields an algorithm space of (5t /t3/2 ) that this library covers (see Figure 5) (Johnson & P¨ schel, 2000). All of these have roughly u the same operations count, yet, the performance can differ widely due to cache misses and other effects. The above implementation makes four passes through a vector of length n and has hence poor memory hierarchy performance. The performance can considerably improved as done in FFTW 2.x by replacing the explicit (and expensive) permutation Ln with a k readdressing in the subsequent smaller DFTs. Similarly, scaling by Tn can be fused with the subsequent m DFTs. However, this creates the need for additional functions--variants of the DFT with modified interfaces. The call graph of such a library is shown in Figure 6(b). The situation gets considerably more complicated with state-of-the-art libraries on current off-the-shelf computers. The reason is that to get maximal performance
Recursive application of equation (1) require to provide base cases for all prime factors of n. For simplicity, this paper will therefore only consider power-of-two sizes n = 2t . Note that these sizes also happen to be the most important usage cases of the DFT.
1

=

In both tensor products, all parts of equal gray shade constitute a single DFT4 . We observe that all four matrices are sparse, that the computation uses a

Bandit-Based Optimization on Graphs with Application to Library Performance Tuning

the libraries need to apply several restructuring transformations to (1). In particular, the algorithm must be 1) vectorized, to take advantage of vector instructions (e.g., SSE on x86 architectures); 2) parallelized, to exploit multiple processor cores using threading; 3) transformed by loop optimizations for buffering; 4) allowed to load from a precomputed table the constant elements of Tn from (1), also called "twiddle factors" m (Frigo & Johnson, 2005; P¨ schel et al., 2005; Vorou nenko, 2008; Voronenko et al., 2009). Applying the restructuring transformations described above increases the number of different mutually recursive functions that comprise the library, and also enlarge the algorithm search space. For example the Spiral-generated DFT library with all optimizations 1­3 contains 31 different functions which form the call graph in Figure 6(c).
1

The main search strategy in FFTW, UHFFT, and Spiral generated libraries is dynamic programming (DP). It is based on the assumption that the best solution to a problem is built out of optimal solutions to subproblems. Here, this means that an algorithm's performance is independent of its context which, unfortunately, does not always hold2 . However, in practice, DP has shown to work quite well except for very large transforms as we will see later in our benchmarks. Over these large search spaces, DP has another weakness which is that it is not an anytime algorithm: one has to wait for DP to solve all subproblems before it gives any solution. This waiting time is significant: for FFTW it can amount to days in the case of large transforms on multicore systems. A simple anytime strategy is Monte-Carlo (MC) which, each time there is a decision to take, chooses according to a uniform distribution. At the end of the descent, it evaluates the candidate and restarts. At any point in time the user can interrupt the search to retrieve the best known candidate. Since at each step, there is an equal chance for all branches to be picked but branches are not laid out uniformly, the overall space is not sampled uniformly. 4.3. Applicability of TAG Applying TAG in the context of adaptive libraries requires to identify the grammar G = (T, N, P, S) and the objective function f such that the performance optimization can be mapped to Problem 1. The start symbol S is the transform specification as entered by the user. The terminals T are the base cases, the set of problems that can be directly solved by the library. The non-terminals N are the set of all non base case subproblems that could be needed to solve the problem. The production rules P breakdown a problem from N into one or more subproblems by fixing a degree of freedom. The function to maximize, f , is the performance of the implementation. The acyclicity of the grammar is guaranteed by the fact that the underlying algorithms provably finish. Note that the grammar itself changes with the problem size. For instance, if a na¨ DFT library based on Cooleyive Tukey is used to compute DFT8 , we would define S = DFT8 T = DFT2 N = {DFT8 , DFT4 }
2

dft
2 6 4

3

7

5

10

8

11

9

15

13

14

12

dft

dft_scaled

19

20

21

16

17

26

28

18

25

30

33

29

dft_strided

27

32

31

24

(a)

(b)

(c)

Figure 6. Call graphs of three different recursive libraries: (a) na¨ (b) optimized scalar and (c) optimized vectorized ive, parallelized.

The above discussion holds for many other linear transforms including the discrete cosine transforms, the real discrete Fourier transform, finite impulse response filters, and the discrete wavelet transform. Further, not all algorithms decompose a transform into transforms of the same type. In this case the search space is further increased. 4.2. Adaptive Libraries and Search Consider a recursive library as discussed above. In each recursion step, the library has a degree of freedom. As a consequence, it can compute the transform in many different ways. What makes the library adaptive is the use of online search to find a fast recursion strategy. This search is part of an initialization routine (called planner in FFTW) that takes the input size n and returns a function pointer implementing the fastest known recursion strategy. After this initial overhead, the user can now compute as many transforms of size n as desired, compensating for the overhead.

P = {DFT8  (DFT2 , DFT4 ),

DFT8  (DFT4 , DFT2 ), DFT4  (DFT2 , DFT2 )}

It is fairly easy to build counter-examples: for instance, an algorithm running on one core will be slower if another core is active due to conflicts in the shared cache.

Bandit-Based Optimization on Graphs with Application to Library Performance Tuning
Sensitivity to parameter s for DFT 512k
Performance (MFlop/s) 4000 3500 3000 2500 2000

Comparison between anytime strategies for DFT 256k
Performance (MFlop/s) 5000 4500 4000 3500 3000 2500

Standard deviation of anytime strategies for DFT 256k
Performance (MFlop/s) 1400

TAG
1200

MC

1000 800 600 400 200 0

MC TAG

1500 1000 500 0 0 10 20 30 40 50 60 70 Time (seconds) s=5 s = 30 s = 100 s = 200 80 90

2000 1500 1000 500 0 10 20 30 40 50 Time (seconds) 60 70 80 90

0

10

20

30

40 50 Time (seconds)

60

70

80

90

(a)

(b)

(c)

Figure 7. (a) Parameters for TAG are optimized on DFT219 . (b) Mean performance (and standard error of the mean) for DP and Monte-Carlo on DFT218 . Data is averaged over 100 runs. (c) Standard deviation on the same experiment.

5. Experiments
Experimental Setup. We evaluate our search algorithm on a complex DFT C++ library generated by Spiral from (1). The library is vectorized using intrinsics, threaded using OpenMP, and optimized as explained in Section 4.1. We add TAG and Monte-Carlo (MC) methods to the already existing DP search infrastructure . We compile using the Intel Compiler 10.1 and benchmark on a 64bit Linux platform using two dual core 3 GHz Intel Xeon 5160 processors. We display performance using pseudo mega floatingpoint operations per second (MFlops) with the complex DFT operation count assumed to be 5n log2 n (standard practice). Parameter tuning. We tune the parameters for TAG on a specific problem, DFT219 . The sensitivity of the algorithm with variations in the s parameter of the bandit is shown on Figure 7(a). Since s is the size of the best rewards vector, a low s tweaks the bandit towards exploitation of previous good branches, while a bigger s leads to the exploration of new promising branches. We find that  = 0.1 and s = 30 work best and we use them for all following experiments. Comparison with Monte-Carlo. We compare the performance of TAG and MC on DFT218 . Figures 7(b) and 7(c) show that TAG performs better (higher mean) and more reliably (lower standard deviation) than Monte-Carlo. Note that the plots are done with respect to a fixed "wall clock" time and not to a fixed number of simulations. This is realistic in that the simpler MC algorithm performs more simulations than our more complex algorithm in the same time frame. Also it is worth remembering that, asymptotically, TAG and MC match since they both explore the full finite search space.

Comparison with dynamic programming. We compare TAG with DP on a single problem on 8(a). We observe that TAG quickly reaches the same performance as DP and then caps 10% above it. On Figure 8(b) we plot the time it takes for TAG to get results of the same quality as DP. We observe that TAG finds solutions of equal performance significantly faster on various DFT sizes. Comparison with other FFTs. Figure 8(c) shows the best performance attained by the generated library (with TAG and DP) and its competitors. We compare against FFTW 3.2 alpha 2 and Intel IPP 5.3. FFTW does platform adaptation using dynamic programming. As far as we know, IPP does not use search and branches out to a specialized implementation for each platform.

6. Conclusion
In this paper we tackled the problem of optimizing an objective function over the sinks of a directed acyclic graph. We solved it using a new anytime algorithm, TAG, that grows a subgraph towards the expected best sinks. Similarly to UCT, TAG traces the most promising path by considering local bandits and valuates nodes using Monte-Carlo simulations. In our context however, the optimization problem requires to consider the maximum variant of the k-armed bandit problem. Implementation inside a high-performance adaptive library for linear transforms considerably decreased the search time while providing a 10% increase in the quality of the solutions. One interesting feature of our problem setup is that evaluating "bad" nodes is much more costly than evaluating "good" ones since the objective function is the timer from the processor. In future work, we will try to modify the algorithm to take advantage of that fact.

Bandit-Based Optimization on Graphs with Application to Library Performance Tuning
Performance over time of DP and TAG (DFT 64k)
Performance (MFlop/s) 10000 9000 8000 7000 6000 5000 4000 3000 2000 1000 0 20 40 60 80 100 120 140 Time (seconds)
1 64k 128k 256k DFT size 512k 1M 10 100

Comparison of search time between DP and TAG
Time (seconds) 10000

DFT, double precision, up to 4 threads
Performance (MFlop/s) 14000 Spiral-TAG Spiral-DP FFTW 3.2a Intel IPP 5.3

TAG

DP
1000

DP

12000 10000 8000

TAG

6000 4000 2000 0 16k

32k

64k

128k DFT size

256k

512k

1M

(a) (b) (c) Figure 8. (a)Average performance of TAG compared with DP on a single problem size. (b)Search time of TAG and DP to achieve the same performance on different libraries. (c)Comparison with different FFT libraries.

Acknowledgments
The authors are grateful to O. Teytaud for his many valuable suggestions. The authors would also like to thank M. Streeter, S. Smith and A. de Mesmay for their help with the max k-armed bandit algorithms. This work was supported by NSF through awards 0325687, 0702386, by DARPA (DOI grant NBCH1050009), the ARO grant W911NF0710416, and by Intel.

Kocsis, L., & Szepesvi, C. (2006). Bandit based Monte-Carlo planning. Euro. Conf. on Mach. Learn. (ECML), LNCS 4212 (pp. 282­293). Springer. Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4­22. P¨ schel, M., Moura, J. M. F., Johnson, J., Padua, D., u Veloso, M., et al. (2005). SPIRAL: Code generation for DSP transforms. Proc. of the IEEE, 93, 232­ 275. Singer, B., & Veloso, M. (2002). Learning to construct fast signal processing implementations. Journal of Machine Learning Research, 3, 887­919. Streeter, M. J., & Smith, S. F. (2006). A simple distribution-free approach to the max k-armed bandit problem. Principles and Practice of Constraint Programming (CP) (pp. 560­574). Voronenko, Y. (2008). Library generation for linear transforms. Doctoral dissertation, Electrical and Computer Engineering, Carnegie Mellon University. Voronenko, Y., de Mesmay, F., & P¨ schel, M. (2009). u Computer generation of general size linear transform libraries. Int. Symp. on Code Generation and Optimization (CGO). Vuduc, R., Demmel, J. W., & Yelick, K. A. (2005). Oski: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conf. Series, 16, 521­530. Wang, Y., & Gelly, S. (2007). Modifications of UCT and sequence-like simulations for Monte-Carlo Go. IEEE Symp. on Computational Intelligence and Games (CIG) (pp. 175­182). Whaley, R. C., & Petitet, A. (2005). Minimizing development and maintenance costs in supporting persistently optimized BLAS. Software: Practice and Experience, 35, 101­121.

References
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235­256. Chaslot, G., De Jong, S., Saito, J.-T., & Uiterwijk, J. W. H. M. (2006). Monte-Carlo tree search in production management problems. Proc. 18th BeNeLux Conf. on Artificial Intel. (BNAIC) (pp. 91­98). Cicirello, V., & Smith, S. (2005). The max k-armed bandit: A new model for exploration applied to search heuristic selection. Proc. 20th National Conf. on Artificial Intelligence (AAAI) (pp. 1355­1361). Coulom, R. (2006). Efficient selectivity and backup operators in Monte-Carlo tree search. Proc. 5th Int. Conf. on Computers and Games (CG), 72­83. Frigo, M., & Johnson, S. (2005). The design and implementation of FFTW3. Proc. IEEE, 93, 216­231. Gelly, S., & Silver, D. (2007). Combining online and offline knowledge in UCT. Proc. 24th Int. Conf. on Machine Learning (ICML) (pp. 273­280). ACM. Intel (2008). Integrated performance primitives. Johnson, J., & P¨ schel, M. (2000). In search of u the optimal Walsh-Hadamard transform. Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3347­3350).