Torpid Mixing of Simulated Tempering on the Potts Model

Nayantara Bhatnagar*        Dana Randall**

Abstract

Simulated tempering and swapping are two families of sampling algorithms in which a parameter representing temperature varies during the simulation. The hope is that this will overcome bottlenecks that cause sampling algorithms to be slow at low temperatures. Madras and Zheng demonstrate that the swapping and tempering algorithms allow efficient sampling from the low-temperature mean-field Ising model, a model of magnetism, and from a class of symmetric bimodal distributions [10]. Local Markov chains fail on these distributions due to the existence of bad cuts in the state space.

Bad cuts also arise in the $q$-state Potts model, another fundamental model for magnetism that generalizes the Ising model. Glauber (local) dynamics and the Swendsen-Wang algorithm have been shown to be prohibitively slow for sampling from the Potts model at some temperatures [1, 2, 6]. It is reasonable to ask whether tempering or swapping can overcome the bottlenecks that cause these algorithms to converge slowly on the Potts model.

We answer this in the negative, and give the first example demonstrating that tempering can mix slowly. We show this for the 3-state ferromagnetic Potts model on the complete graph, known as the mean-field model. The slow convergence is caused by a first-order (discontinuous) phase transition in the underlying system. Using this insight, we define a variant of the swapping algorithm that samples efficiently from a class of bimodal distributions, including the mean-field Potts model.

1 Introduction

The standard approach to sampling via Markov chain Monte Carlo algorithms is to connect the state space of configurations via a graph called the Markov kernel. The Metropolis algorithm prescribes transition probabilities on the edges of the kernel so that the chain will converge to any desired distribution [14]. Unfortunately, for some natural choices of the Markov kernel, the Metropolis Markov chain can converge exponentially slowly. Statistical mechanics offers a wealth of sampling problems for which these methods are often applied; it is now well known that phase transitions in the underlying systems can cause local Markov chains to require exponential time to reach equilibrium [1].

A particular example of this phenomenon is observed on the Potts model. In the $q$-state Potts model, vertices of an underlying graph are colored with one of $q$ colors. In the ferromagnetic case, vertices connected by an edge in the graph prefer to have the same color. The strength of this preference is a function of the temperature: at high temperature the correlation is negligible, while at low temperatures the effect is strong. At low enough temperatures, local Markov chains that change the color of one vertex at a time will be prohibitively slow [1]. This is because, to move from a configuration that is predominantly red to one that is predominantly blue, the chain will have to go through highly unlikely configurations in which no color dominates.

When the underlying graph in the Potts model is the complete graph, it is known as the mean-field or Curie-Weiss model. Mean-field models are important because, despite their simplicity, they capture key features present in more complicated graphs. Moreover, for natural problems such as the mean-field Potts and Ising models, there remain obstacles to sampling efficiently, even on the complete graph. Gore and Jerrum showed that the Swendsen-Wang algorithm, a method for sampling that often succeeds in circumventing bottlenecks in the state space, fails on the mean-field Potts model for $q \ge 3$ near the critical temperature (where the phase transition occurs) [6]. Subsequently, Cooper et al. considered the mean-field Ising model ($q = 2$) and showed that Swendsen-Wang is fast everywhere, except possibly near the critical point, where it remains unresolved [2].

1.1 Tempering, swapping, and annealing. Simulated annealing provides the insight that varying a parameter representing temperature during a simulation can be a key to designing efficient algorithms [8]. Annealing is intended for optimization problems, when direct methods are likely to get trapped in local minima and never find the global optimum. Similarly, simulated tempering and swapping are intended for sampling, when direct methods are slow [5, 11].

For the tempering and swapping algorithms, we allow a chain to modify the temperature and interpolate between different distributions $\pi_0, \ldots, \pi_M$.

---
* College of Computing, Georgia Institute of Technology, Atlanta GA. Email: nand@cc.gatech.edu. Supported in part by NSF CCR-0105639.
** College of Computing and School of Mathematics, Georgia Institute of Technology, Atlanta GA. Email: randall@cc.gatech.edu. Supported in part by NSF CCR-0105639 and an Alfred P. Sloan research fellowship. Part of this work was done while visiting Microsoft Research and the Isaac Newton Institute for Mathematical Sciences in Cambridge, UK.

Copyright © 2004 by the Association for Computing Machinery, Inc. and the Society for Industrial and Applied Mathematics. All rights reserved.
1.2 Results. In this work, we show that for the mean-field Potts model, tempering and swapping require exponential time to converge to equilibrium. The slow convergence of the tempering chain on the Potts model is caused by a first-order (discontinuous) phase transition. In contrast, the Ising model studied by Madras and Zheng has a second-order (continuous) phase transition, which distinguishes why tempering works for one model and not the other.

In addition, we give the first Markov chain algorithm that is provably rapidly mixing on the Potts model. Traditionally, swapping is implemented by defining a set of interpolating distributions in which a parameter corresponding to temperature is varied. We make use of the fact that there is greater flexibility in how we define the set of interpolants. Finally, our analysis extends the arguments of Madras and Zheng showing that swapping is fast on symmetric distributions so as to include asymmetric generalizations.

2.1 The $q$-state Potts model. The Potts model was defined by R. B. Potts in 1952 to study ferromagnetism and anti-ferromagnetism [15]. The interactions between particles are modeled by an underlying graph, with edges between particles that influence each other.

We consider the ferromagnetic mean-field model, where the underlying graph is the complete graph $K_n$ on $n$ vertices and all pairs of particles influence each other. For the 3-state Potts model, $q = 3$. Let $a$, $b$ and $c$ be the number of vertices assigned the first, second, and third colors. Rescaling the inverse temperature by a factor of $n$ (the standard mean-field normalization), we can rewrite the Gibbs distribution for the 3-state Potts model, as a distribution over triples of color counts, as
$$\pi_\beta(a, b, c) \;=\; \frac{1}{Z(\beta)} \,\frac{n!}{a!\,b!\,c!}\; e^{\beta (a^2 + b^2 + c^2)/2n},$$
where the linear terms in the exponent have been cancelled against the normalizing constant, since $a + b + c = n$. We will use this formulation from now on.

2.2 Markov chains. To sample from a given distribution, a common approach is to design a Markov chain so that an appropriately defined random walk, run for a sufficiently long time, provides a good sample. We formalize below how long "sufficiently long" must be, as well as when a sample is "good."
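The count formulation above can be made concrete with a short script. The helper names below are our own, and the $\beta/n$ energy scaling follows the mean-field convention assumed above; this is an illustrative sketch, not code from the paper.

```python
from math import comb, exp

def potts_count_weight(a, b, c, beta):
    """Unnormalized weight of the color-count triple (a, b, c) for the
    3-state mean-field Potts model: the multinomial (entropy) term times
    the energy term exp(beta * (a^2 + b^2 + c^2) / (2n))."""
    n = a + b + c
    multinomial = comb(n, a) * comb(n - a, b)  # n! / (a! b! c!)
    return multinomial * exp(beta * (a * a + b * b + c * c) / (2 * n))

def count_distribution(n, beta):
    """Normalized induced distribution over color-count triples."""
    weights = {(a, b, n - a - b): potts_count_weight(a, b, n - a - b, beta)
               for a in range(n + 1) for b in range(n - a + 1)}
    z = sum(weights.values())
    return {abc: w / z for abc, w in weights.items()}

# At beta = 0 the model is uniform over all 3^n colorings, so the
# probability of a triple is its multinomial coefficient divided by 3^n.
dist = count_distribution(9, 0.0)
```

Increasing `beta` shifts mass from the balanced triple toward the three ordered corners, which is the phenomenon studied in Section 3.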
2 Preliminaries

Let $\mathcal{M}$ be an ergodic (i.e., irreducible and aperiodic), reversible Markov chain with finite state space $\Omega$, transition probability matrix $P$, and stationary distribution $\pi$. Let $P^t(x, y)$ denote the $t$-step transition probability from $x$ to $y$.

DEFINITION 2.1. The total variation distance at time $t$ is
$$\|P^t, \pi\| \;=\; \max_{x \in \Omega} \frac{1}{2} \sum_{y \in \Omega} \left| P^t(x, y) - \pi(y) \right|.$$

DEFINITION 2.2. For $\epsilon > 0$, the mixing time $\tau(\epsilon)$ is
$$\tau(\epsilon) \;=\; \min \{\, t : \|P^{t'}, \pi\| \le \epsilon \ \text{ for all } t' \ge t \,\}.$$

A chain is rapidly mixing if the mixing time is bounded above by a polynomial in $n$ and $\log \epsilon^{-1}$, where $n$ is the size of each configuration in the state space. When the mixing time is exponential in $n$, we say the chain is torpidly mixing.

Each of the $n$ vertices of the underlying graph is assigned one of $q$ different spins (or colors). A configuration $\sigma$ is an assignment of spins to the vertices, where $\sigma_i$ denotes the spin at vertex $i$. The energy of a configuration is a function of the Hamiltonian
$$H(\sigma) \;=\; -J \sum_{(i, j) \in E} \delta(\sigma_i, \sigma_j),$$
where $\delta$ is the Kronecker-$\delta$ function that takes the value 1 if its arguments are equal and zero otherwise. When $J > 0$, the model corresponds to the ferromagnetic case, where neighbors prefer the same color, while $J < 0$ corresponds to the anti-ferromagnetic case, where neighbors prefer to be differently colored. The state space of the $q$-state ferromagnetic Potts model is the set $\Omega$ of all $q$-colorings of the vertices; we will thus use colorings and configurations interchangeably.

2.2.1 The Metropolis algorithm. The Metropolis-Hastings algorithm is useful for sampling from non-uniform distributions [14]. Let $\pi$ be the distribution to be sampled from. A graph (the Markov kernel) is chosen so as to connect the state space.
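As a sketch of the single-site Metropolis dynamics just described (function names are ours; the complete-graph Hamiltonian with $J = 1$ is computed through color counts, since every pair of like-colored vertices is a monochromatic edge):

```python
import random
from math import exp

def hamiltonian(sigma):
    """Ferromagnetic Potts Hamiltonian on the complete graph with J = 1:
    H(sigma) = -(number of monochromatic pairs)."""
    counts = {}
    for s in sigma:
        counts[s] = counts.get(s, 0) + 1
    return -sum(c * (c - 1) // 2 for c in counts.values())

def metropolis_step(sigma, beta, q=3, rng=random):
    """One move of the single-site Metropolis chain: pick a vertex and a
    color uniformly, then accept with probability min(1, exp(-beta * dH))."""
    v = rng.randrange(len(sigma))
    proposal = list(sigma)
    proposal[v] = rng.randrange(q)
    d_h = hamiltonian(proposal) - hamiltonian(sigma)
    if d_h <= 0 or rng.random() < exp(-beta * d_h):
        return proposal
    return list(sigma)
```

At large `beta` this chain gets stuck in one near-monochromatic mode, which is exactly the bottleneck the paper analyzes.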
At the lowest temperature, $\pi_M$ is the goal distribution from which we wish to generate samples; at the highest temperature, $\pi_0$ is typically less interesting, but the rate of convergence is fast. A Markov chain that keeps modifying the distribution, interpolating between these two extremes, may produce useful samples efficiently. Despite the extensive use of simulated tempering and swapping in practice, there has been little formal analysis. A notable exception is work by Madras and Zheng [10] showing that swapping converges quickly for two simple, symmetric distributions, including the mean-field Ising model.

Configurations are weighted according to the inverse temperature $\beta = 1/(k_B T)$, where $k_B$ is Boltzmann's constant. The Gibbs distribution on configurations at inverse temperature $\beta$ is given by
$$\pi_\beta(\sigma) \;=\; \frac{e^{-\beta H(\sigma)}}{Z(\beta)},$$
where $Z(\beta) = \sum_{\sigma} e^{-\beta H(\sigma)}$ is the normalizing constant. Note that at $\beta = 0$, this is just the uniform distribution on all (not necessarily proper) $q$-colorings.

Notice that while the exponential factor is simple to calculate given $\sigma$ and $\beta$, it is not easy to compute the ratio of partition functions, since they are sums over exponentially many configurations at different temperatures. The swapping algorithm, also an aggregate chain using these temperatures, circumvents this difficulty in implementing temperature moves.

2.2.3 Swapping. The swapping algorithm of Geyer [5] is a variant of tempering. The state space is the product space $\Omega^{M+1}$, the product of $M+1$ copies of the original state space, corresponding to the inverse temperatures $\beta_0, \ldots, \beta_M$. Let $\pi_M$ be the distribution from which we wish to sample and let $\pi_0$ be the uniform distribution. A configuration in the swapping chain is an $(M+1)$-tuple $\bar{x} = (x_0, \ldots, x_M)$, where each component $x_i$ represents a configuration chosen from the distribution $\pi_i$. The probability distribution is the product measure
$$\bar{\pi}(\bar{x}) \;=\; \prod_{i=0}^{M} \pi_i(x_i).$$
A swap move applies to components $i$ and $i+1$, i.e., it interchanges the $i$-th and $(i+1)$-st components, with the appropriate Metropolis probabilities on $\bar{\pi}$.
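For Gibbs distributions $\pi_i \propto e^{-\beta_i H}$, the swap acceptance probability can be computed without any partition functions, since they cancel in the ratio; a minimal sketch (names are ours):

```python
from math import exp

def swap_acceptance(H, x_i, x_j, beta_i, beta_j):
    """Metropolis probability of swapping components i and j of the
    swapping chain: min(1, [pi_i(x_j) pi_j(x_i)] / [pi_i(x_i) pi_j(x_j)]).
    For Gibbs distributions pi_i ~ exp(-beta_i * H) this reduces to
    exp((beta_i - beta_j) * (H(x_i) - H(x_j))), capped at 1; the
    normalizing constants Z_i and Z_j cancel."""
    return min(1.0, exp((beta_i - beta_j) * (H(x_i) - H(x_j))))
```

Note the probability is exactly 1 whenever the colder component (larger $\beta$) currently holds the higher-energy configuration, so such swaps are always accepted.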
2.2.2 Simulated tempering. Simulated tempering attempts to overcome this bottleneck by introducing a temperature parameter that is varied during the simulation, effectively modifying the distribution being sampled from. Let $\beta_0 < \beta_1 < \cdots < \beta_M$ be a set of inverse temperatures. The state space of the tempering chain is $\Omega \times \{0, \ldots, M\}$, which we can think of as the union of $M+1$ copies of the original state space $\Omega$, each corresponding to a different inverse temperature.

The tempering Markov chain consists of two types of moves: level moves, which update the configuration while keeping the temperature fixed, and temperature moves, which update the temperature while remaining at the same configuration. A level move connects $(x, i)$ and $(y, i)$, where $x$ and $y$ are connected by one-step transitions of the Metropolis algorithm on $\Omega$ at inverse temperature $\beta_i$; the move is accepted with the corresponding Metropolis probability. The stationary distribution of the tempering chain is chosen to be uniform over temperatures, and the conditional distributions are the fixed-temperature Gibbs distributions:
$$\pi_{\mathrm{temp}}(x, i) \;=\; \frac{1}{M+1}\, \pi_{\beta_i}(x).$$

The swapping chain also consists of two types of moves. A level move connects $\bar{x}$ and $\bar{y}$ if $\bar{x}$ and $\bar{y}$ agree in all but the $i$-th component, and $x_i$ and $y_i$ are connected by one-step transitions of the Metropolis algorithm on $\pi_i$; the move is accepted with the appropriate Metropolis probability. A swap move connects
$$\bar{x} = (x_0, \ldots, x_i, x_{i+1}, \ldots, x_M) \quad\text{and}\quad \bar{x}' = (x_0, \ldots, x_{i+1}, x_i, \ldots, x_M).$$
In particular, the swap is accepted with probability
$$\min\left(1,\; \frac{\pi_i(x_{i+1})\,\pi_{i+1}(x_i)}{\pi_i(x_i)\,\pi_{i+1}(x_{i+1})}\right).$$
The kernel connects the state space: vertices are configurations and edges are the allowable one-step transitions. The transition probabilities on the edges of the kernel are defined as
$$P(x, y) \;=\; \frac{1}{2\Delta} \min\left(1, \frac{\pi(y)}{\pi(x)}\right)$$
for all $x \neq y$ that are neighbors in the kernel, where $\Delta$ is the maximum degree of the kernel; here $P(x, y)$ is the Metropolis probability of going from $x$ to $y$ according to the stationary probability $\pi$, and the self-loop probabilities are set so that each row sums to one. It is easy to verify that if the kernel is connected then $\pi$ is the stationary distribution. For the Potts model, a natural choice for the Markov kernel is to connect configurations at Hamming distance one. Unfortunately, for large values of $\beta$, the Metropolis algorithm converges exponentially slowly on the Potts model with this kernel [1, 2]. This is because the most probable states are largely monochromatic, and to go from a predominantly red configuration to a predominantly blue one we would have to pass through states that are highly unlikely at low temperatures.

Our choice of $\beta_0 = 0$ corresponds to infinite temperature, where the Metropolis algorithm converges rapidly to stationarity (on the uniform distribution), and $\beta_M$ is the inverse temperature at which we wish to sample. We interpolate by setting $\beta_i = i \beta_M / M$, and let the fixed-temperature distributions be $\pi_i = \pi_{\beta_i}$. A temperature move of the tempering chain connects $(x, i)$ and $(x, i \pm 1)$; the move is accepted with probability
$$\min\left(1,\; \frac{\pi_{i \pm 1}(x)}{\pi_i(x)}\right).$$

The conductance, introduced by Jerrum and Sinclair, provides a good measure of the mixing rate of a chain [7]. For $S \subseteq \Omega$, let
$$Q(S, \bar{S}) \;=\; \sum_{x \in S,\, y \in \bar{S}} \pi(x) P(x, y).$$
Then the conductance is given by
$$\Phi \;=\; \min_{S :\, \pi(S) \le 1/2} \frac{Q(S, \bar{S})}{\pi(S)}.$$

In what follows, let $\Sigma_d$ denote the set of configurations where the three color classes have nearly equal size (the disordered configurations); $\Sigma_o$, the configurations where one color class is significantly larger than the other two (the ordered configurations); and $\Sigma_m$, the intermediate configurations lying between these two regimes.
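By contrast with a swap move, a temperature move needs the ratio of partition functions, which in general is only available by brute force on toy state spaces. The sketch below (our own naming) makes that dependence explicit.

```python
from math import exp

def partition_function(states, H, beta):
    """Z(beta) = sum over all states of exp(-beta * H(x)).  Exponential
    time in general; feasible only for tiny toy state spaces."""
    return sum(exp(-beta * H(x)) for x in states)

def temperature_move_acceptance(states, H, x, beta_old, beta_new):
    """Acceptance probability min(1, pi_new(x) / pi_old(x)) for the
    tempering move (x, old) -> (x, new).  Unlike a swap move, the ratio
    Z(beta_old) / Z(beta_new) does not cancel, so implementing tempering
    requires good estimates of the partition functions."""
    z_old = partition_function(states, H, beta_old)
    z_new = partition_function(states, H, beta_new)
    ratio = exp(-(beta_new - beta_old) * H(x)) * (z_old / z_new)
    return min(1.0, ratio)
```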
The following lemmas will demonstrate that there is a critical inverse temperature $\beta_c$ at which both the disordered and the ordered configurations have very large weight, although there is a region between them that has very small weight. This will allow us to bound the conductance. (For convenience, we assume throughout that $n = 3m$ for some integer $m$.)

LEMMA 3.1. There exists $\beta_c$ such that:
(i) at $\beta_c$, the induced distribution on color counts has stationary points at the disordered mode and at the three ordered modes, and these modes have comparable weight;
(ii) the weight at the modes is exponentially larger than the weight on the region separating them.

It has been shown by Jerrum and Sinclair [7] that, for any reversible chain, the spectral gap satisfies the following relation to the conductance.

THEOREM 3.2. For any Markov chain with conductance $\Phi$ and eigenvalue gap $1 - \lambda_1$,
$$\frac{\Phi^2}{2} \;\le\; 1 - \lambda_1 \;\le\; 2\Phi.$$

(In bounding the conductance, it suffices to minimize over sets $S$ with $\pi(S) \ge 1/p(n)$ for any polynomial $p$; this decreases the conductance by at most a polynomial factor; see [16].)

Proof of Lemma 3.1. (i) First we determine the modes using Stirling's formula: up to polynomial factors, the weight of the set of colorings with counts $(a, b, c)$ behaves like the exponential of $n$ times a smooth free-energy function of $(a/n, b/n, c/n)$, and the modes are the stationary points of this function. At $\beta_c$, the balanced point $(n/3, n/3, n/3)$ and the three ordered points (one for each dominant color, by symmetry) are stationary points of comparable weight. (ii) follows by comparing the value of the free energy at the modes with its value on the region separating them.

Notice that the normalizing constants cancel out in the swap acceptance probability. Hence, implementing a move of the swapping chain is straightforward, unlike tempering, where good approximations of the partition functions are required. Zheng proved that fast mixing of the swapping chain implies fast mixing of the tempering chain [17], although the converse is unknown.

For both tempering and swapping, we must be careful about how we choose the number of distributions $M$. It is important that successive distributions $\pi_i$ and $\pi_{i+1}$ have sufficiently small variation distance, so that temperature moves are accepted with nontrivial probability. However, $M$ must be small enough that it does not blow up the running time of the algorithm. Following [10], we set $M = n$.
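The proofs in this section repeatedly replace factorials and multinomial coefficients by Stirling's formula; the approximation is already very accurate at modest $n$, as a quick check shows:

```python
from math import factorial, log, pi

def log_factorial_stirling(n):
    """Stirling's formula: ln(n!) ~ n ln n - n + (1/2) ln(2 pi n)."""
    return n * log(n) - n + 0.5 * log(2 * pi * n)

exact = log(factorial(100))
approx = log_factorial_stirling(100)
# The absolute error is about 1/(12 n), i.e. under 0.001 already at n = 100,
# which is why polynomial factors can be safely neglected in the exponents.
```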
This ensures that, for the values of $\beta$ at which we wish to sample, the ratio of $\pi_i(x)$ and $\pi_{i+1}(x)$ is bounded from above and below by a constant.

Thus, to lower bound the mixing time, it is sufficient to show that the conductance is small: if a chain converges rapidly to its stationary distribution it must have large conductance, indicating the absence of a "bad cut," i.e., a set of edges of small capacity separating some set $S$ from $\bar{S}$. The cut we will use to bound the conductance in the context of the Potts model comes from the first-order phase transition, which characterizes the following phenomenon. At high temperature (low $\beta$) we are in a disordered state and see roughly equal numbers of each color in a typical coloring, while at low temperature (high $\beta$) we are in an ordered state, where one color clearly dominates. The crucial question is how we go from the disordered to the ordered state as we slowly lower the temperature. Rather than seeing a gradual change in the size of the largest color class, the change is discontinuous, and we see an abrupt jump around some critical value $\beta_c$. To show slow mixing of the tempering chain, we show that this discontinuity translates into a bad cut, even when we take the union of Metropolis chains at many temperatures.

3 Torpid Mixing of Tempering on the Potts Model

We will show lower bounds on the mixing time of the tempering chain on the mean-field Potts model by bounding the spectral gap of the transition matrix of the chain. Let $1 = \lambda_0 > \lambda_1 \ge \cdots \ge \lambda_{|\Omega|-1} > -1$ be the eigenvalues of the transition matrix $P$, so that the spectral gap is $\mathrm{Gap}(P) = 1 - \lambda_1$. The mixing time is related to the spectral gap of the chain by the following theorem (see [16]):

THEOREM 3.1. Let $\mathrm{Gap}(P) = 1 - \lambda_1$. For all $\epsilon > 0$,
$$\tau(\epsilon) \;\ge\; \frac{\lambda_1}{2 (1 - \lambda_1)} \log\left(\frac{1}{2\epsilon}\right).$$

3.1 Slow mixing. Let $n$ be the size of the vertex set of the underlying graph being colored, and let $\Omega$ be the set of (not necessarily proper) colorings of the graph. Consider the partition of $\Omega$ into sets $\Omega_{a,b,c}$, where $\Omega_{a,b,c}$ consists of the colorings with exactly $a$, $b$ and $c$ vertices of the first, second and third colors, so that $a + b + c = n$. Since there are exactly $\frac{n!}{a!\,b!\,c!}$ colorings in $\Omega_{a,b,c}$, we have
$$\pi(\Omega_{a,b,c}) \;=\; \frac{1}{Z} \,\frac{n!}{a!\,b!\,c!}\; e^{\beta (a^2 + b^2 + c^2)/2n}.$$
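The conductance minimization ranges over exponentially many sets; on toy chains it can be evaluated exactly by brute force (a sketch with our own naming), which is a convenient way to see a bad cut numerically.

```python
from itertools import combinations

def conductance(P, pi):
    """Brute-force conductance: the minimum over sets S with pi(S) <= 1/2
    of Q(S, S-bar) / pi(S), where Q(S, S-bar) = sum over x in S, y not in
    S of pi(x) * P[x][y].  Exponential in the number of states."""
    n = len(pi)
    best = float("inf")
    for r in range(1, n):
        for S in combinations(range(n), r):
            mass = sum(pi[x] for x in S)
            if mass > 0.5:
                continue
            flow = sum(pi[x] * P[x][y]
                       for x in S for y in range(n) if y not in S)
            best = min(best, flow / mass)
    return best

# Two-state symmetric chain with flip probability p has conductance p.
p = 0.3
P = [[1 - p, p], [p, 1 - p]]
pi = [0.5, 0.5]
```

A torpidly mixing chain is one where some cut makes this minimum exponentially small in $n$.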
The first part of the lemma verifies that at the critical temperature there are 3 ordered modes (one for each color, by symmetry) and 1 disordered mode. In the next lemmas, we show that the disordered mode is separated from the ordered modes by a region of exponentially low density. To do this, we use the second part of Lemma 3.1 and show that it bounds the density of the separating region at each $\beta$. Neglecting factors not dependent on $n$ and simplifying using Stirling's formula, we need to check the stationary points of the continuous function $f_\beta$ defined, for $x + y + z = 1$, as the continuous extension of the discrete function $\frac{1}{n} \log \pi(\Omega_{a,b,c})$ at $(x, y, z) = (a/n, b/n, c/n)$.

LEMMA 3.2. For $n$ sufficiently large, the real function $f_\beta$ is unimodal on the region separating the modes and attains its maximum there at the balanced point, for all $\beta \le \beta_c$.

Proof. Examining $f_\beta$ on this region, we compare the entropy and energy terms and test the sign of the derivative. At the balanced point the derivative vanishes, giving a stationary point. As $\beta$ is decreased, the slope of the energy term decreases from its (positive) value at $\beta_c$, while the entropy term is independent of $\beta$; thus it is sufficient that the required inequalities hold at $\beta_c$ to prove the lemma for all $\beta \le \beta_c$.

LEMMA 3.3. For every inverse temperature $\beta \le \beta_c$, the weight of the separating region is exponentially smaller than the weight of the modes.

Proof. Note that only the exponential (energy) term of the weight varies with $\beta$: for some function $g$ independent of $\beta$, the weight factors as $g$ times the energy term. The claim then follows by the second part of Lemma 3.1.

We use these facts to bound the conductance of the tempering chain at the critical temperature $\beta_c$. Let $R$ be the region of color-count triples bounded by, and including, the lines separating the disordered mode from the three ordered modes, and let $\partial R$ be the boundary of $R$. The set of configurations with counts on $\partial R$ defines a bad cut in the state space of the tempering chain.

THEOREM 3.3. For $n$ sufficiently large, there is a set $S$ in the state space of the tempering chain whose conductance $\Phi(S)$ is exponentially small in $n$.

Proof. Using the definition of conductance, the capacity of the cut at $\partial R$ is at most the total weight of the boundary region, summed over all $M + 1$ temperatures, while the weight of the set it separates is large. The first inequality we need follows from Lemma 3.2, and the second from Lemma 3.3; by Stirling's formula, setting the parameters appropriately gives the desired result.

We briefly state the comparison and decomposition theorems, which will be the main tools used to prove the results in this section.

Comparison: The comparison theorem of Diaconis and Saloff-Coste is useful for bounding the mixing time of a Markov chain when the mixing time of a related chain on the same state space is known. Let $P$ and $\tilde{P}$ be two Markov chains on $\Omega$, with stationary distributions $\pi$ and $\tilde{\pi}$ respectively, and let $E(P) = \{(x, y) : P(x, y) > 0\}$ and $E(\tilde{P}) = \{(x, y) : \tilde{P}(x, y) > 0\}$ be their sets of directed edges.
For each pair $(x, y)$ with $\tilde{P}(x, y) > 0$, define a canonical path $\gamma_{xy}$, a sequence of states $x = z_0, z_1, \ldots, z_k = y$ such that each $(z_i, z_{i+1}) \in E(P)$. Let
$$\Gamma(z, w) \;=\; \{\, (x, y) : (z, w) \in \gamma_{xy} \,\}$$
denote the set of endpoints of paths that use the edge $(z, w)$.

THEOREM 4.2. (Diaconis and Saloff-Coste [3]) With canonical paths as above, let
$$A \;=\; \max_{(z, w) \in E(P)} \frac{1}{\pi(z) P(z, w)} \sum_{(x, y) \in \Gamma(z, w)} |\gamma_{xy}|\, \tilde{\pi}(x) \tilde{P}(x, y).$$
Then $\mathrm{Gap}(P) \ge \mathrm{Gap}(\tilde{P}) / A$ whenever the two chains share the stationary distribution $\pi = \tilde{\pi}$.

Decomposition: Decomposition theorems are useful for breaking a complicated Markov chain into smaller pieces that are easier to analyze [9, 12]. Let $\Omega = \cup_{i=1}^{m} \Omega_i$ be a disjoint partition of the state space. For each $i$, define the restriction $P_i$ of $P$ to $\Omega_i$ by $P_i(x, y) = P(x, y)$ if $x \ne y$ and $x, y \in \Omega_i$, and $P_i(x, x) = 1 - \sum_{y \in \Omega_i,\, y \ne x} P_i(x, y)$. The stationary distribution of $P_i$ is $\pi$ restricted to $\Omega_i$ and renormalized. Define the projection $\bar{P}$ to be the transition matrix on the state space $\{1, \ldots, m\}$ given by
$$\bar{P}(i, j) \;=\; \frac{1}{\pi(\Omega_i)} \sum_{x \in \Omega_i,\, y \in \Omega_j} \pi(x) P(x, y),$$
whose stationary distribution is $\bar{\pi}(i) = \pi(\Omega_i)$.

THEOREM 4.3. (Martin and Randall [12])
$$\mathrm{Gap}(P) \;\ge\; \frac{1}{2}\, \mathrm{Gap}(\bar{P}) \min_i \mathrm{Gap}(P_i).$$

It can be verified that the relevant quantities at neighboring temperatures are within a linear factor of each other. By Theorem 3.2, the upper bound on the conductance bounds the spectral gap of the tempering chain at the inverse temperature $\beta_c$. Applying Theorem 3.1, we find that the tempering chain for the 3-state Potts model mixes slowly. As a consequence of Zheng's result demonstrating that rapid mixing of the swapping chain implies rapid mixing of the tempering chain [17], we have also established the slow mixing of the swapping chain for the mean-field Potts model.

We now reexamine the swapping chain on two classes of distributions: one is an asymmetric exponential distribution (generalizing a symmetric distribution studied by Madras and Zheng [10]), and the other is a class of bimodal mean-field models. First, we show that swapping and tempering are fast on the exponential distribution.

Example I: Let $a > 1$ be a real constant and let $n_1$ and $n_2$ be positive integers. The bimodal exponential distribution on $\Omega = \{-n_1, \ldots, n_2\}$ is defined as
$$\pi(x) \;=\; \frac{a^{|x|}}{Z},$$
where $Z$ is the normalizing constant. Define the interpolating distributions for the swapping chain as
$$\pi_i(x) \;=\; \frac{a^{\beta_i |x|}}{Z_i}, \qquad 0 \le i \le M,$$
where $Z_i$ is the normalizing constant.
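A concrete rendering of Example I (the exact parametrization in the original may differ slightly; this form, with weight $a^{\beta|x|}$, has the stated qualitative features): at $\beta = 0$ the distribution is uniform, and for $\beta > 0$ it is bimodal with modes at the endpoints $-n_1$ and $n_2$.

```python
def bimodal_exponential(a, beta, n1, n2):
    """Distribution proportional to a**(beta * |x|) on {-n1, ..., n2}.
    For a > 1 and beta > 0 the modes sit at the endpoints -n1 and n2
    (asymmetric when n1 != n2); at beta = 0 it is uniform."""
    weights = {x: a ** (beta * abs(x)) for x in range(-n1, n2 + 1)}
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

dist = bimodal_exponential(2.0, 1.0, 5, 8)
# x = 0 is a local minimum separating the two modes -- the "bad cut".
```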
4 Modifying the Swapping Algorithm for Rapid Mixing

The proofs suggest that a key idea behind designing fast sampling algorithms for models with first-order phase transitions is to define a new set of interpolants that do not preserve the bad cut. We start with a careful examination of the exponential distribution, since the proofs generalize easily to the new swapping algorithm applied to bimodal mean-field models.

THEOREM 4.1. The swapping chain with suitably chosen inverse temperatures $\beta_0 < \cdots < \beta_M$ is rapidly mixing on the bimodal exponential distribution defined on $\{-n_1, \ldots, n_2\}$.

Intuitively, we can think about a simpler random walk RW1 on the weighted hypercube as follows. Start at some bit vector, say $(0, \ldots, 0)$. At each step we are allowed to transpose two neighboring bits, or we can flip just the lowest bit. Each of these moves is performed with the appropriate Metropolis probability. We will show that this chain is rapidly mixing for the weights that arise in the projection.

This partition of $\Omega^{M+1}$ into sets of fixed trace sets the stage for the decomposition theorem. The restrictions simulate the swapping Markov chain on regions of fixed trace. This captures the idea that, for the true projection chain, swap moves (transpositions) always have
constant probability, and at the highest temperature there is a high probability of changing sign. Of course there is a chance of flipping the bit at each higher temperature as well, but we will see that this is not even necessary for rapid mixing.

The projection is the $(M+1)$-dimensional hypercube, representing the set of allowable traces. The analysis of the restrictions follows precisely from [10], the symmetric case. Analyzing the projection, however, becomes more difficult, since in this case the stationary distribution over the hypercube is highly non-uniform. This reflects the fact that at "low temperatures" one side of the bimodal distribution becomes exponentially more favorable. We resolve this by appealing to the comparison theorem.

Bounding the mixing rate of the restricted chains: If we temporarily ignore swap moves on the restrictions, the restricted chains move independently according to the Metropolis probabilities on each of the distributions. The following lemma reduces the analysis of the restricted chains to analyzing the moves at each fixed temperature.

LEMMA 4.1. (Diaconis and Saloff-Coste [3]) For $0 \le i \le M$, let $P_i$ be a reversible Markov chain on a finite state space $\Omega_i$. Consider the product Markov chain $P$ on the product space $\Omega_0 \times \cdots \times \Omega_M$ that at each step picks a coordinate $i$ uniformly at random and updates it according to $P_i$. Then
$$\mathrm{Gap}(P) \;=\; \frac{1}{M+1} \min_i \mathrm{Gap}(P_i).$$

To analyze RW1, we can compare it to an even simpler walk, RW2, that chooses any bit at random and updates it to 0 or 1 with the correct stationary probabilities. It is easy to argue that RW2 converges very quickly, and we use this to infer the fast mixing of RW1. More precisely, let $\tilde{P}$ be a new chain on the hypercube, introduced for the purpose of the comparison. At each step it picks a coordinate $i$ uniformly at random and updates the $i$-th component by choosing it exactly according to the appropriate stationary distribution; in other words, the $i$-th component is at stationarity as soon as it is chosen. Using the coupon collector's theorem, we have:

LEMMA 4.2. The chain $\tilde{P}$ on the hypercube mixes in time $O(M \log M)$.
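Lemma 4.2 is the coupon-collector bound: once every coordinate has been chosen at least once, the resampling chain $\tilde P$ is exactly at stationarity, and the expected time for that is $m H_m$ (a sketch, with our own naming):

```python
from math import log

def expected_coverage_time(m):
    """Expected number of steps until a uniformly random coordinate
    sampler has touched all m coordinates at least once (coupon
    collector): m * H_m, where H_m is the m-th harmonic number."""
    return m * sum(1.0 / k for k in range(1, m + 1))

t = expected_coverage_time(64)
# Standard sandwich: m ln m <= m * H_m <= m (ln m + 1).
```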
We are now in a position to prove the following theorem.

THEOREM 4.4. The projection chain, defined on the product space of traces, is rapidly mixing.

Now, restricted to each of the distributions $\pi_i$, the target distribution is unimodal, suggesting that the Metropolis chain should be rapidly mixing at each temperature. Madras and Zheng formalize this in [10] and show that the Metropolis chain restricted to the positive or the negative part of the state space mixes quickly. Thus, from Lemma 4.1 and following the arguments in [10], we can conclude that each of the restricted Markov chains is rapidly mixing.

4.1 Swapping on the exponential distribution. We are now prepared to prove Theorem 4.1. The state space for the swapping chain applied to Example I is $\Omega^{M+1}$.

DEFINITION 4.1. Let $\mathrm{Tr}(\bar{x})_i = 1$ if $x_i \ge 0$ and $\mathrm{Tr}(\bar{x})_i = 0$ if $x_i < 0$. The trace of $\bar{x}$ is the bit vector $\mathrm{Tr}(\bar{x}) = (\mathrm{Tr}(\bar{x})_0, \ldots, \mathrm{Tr}(\bar{x})_M)$.

The $2^{M+1}$ possible values of the trace characterize the partition we use. Letting $\Omega_\tau$ be the set of configurations with trace $\tau$, we have the decomposition $\Omega^{M+1} = \cup_\tau \Omega_\tau$.

To apply the comparison theorem, we translate transitions in the chain $\tilde{P}$ (whose mixing time we know) into canonical paths consisting of moves in the chain $\bar{P}$. Let $(\bar{x}, \bar{y})$ be a single transition in $\tilde{P}$ from $\bar{x}$ to $\bar{y}$ that flips the $i$-th bit. The canonical path from $\bar{x}$ to $\bar{y}$ is the concatenation of three paths $\gamma_1 \gamma_2 \gamma_3$. In terms of tempering, $\gamma_1$ is a heating phase and $\gamma_3$ is a cooling phase: $\gamma_1$ consists of swap moves that carry the $i$-th bit up to the highest temperature; $\gamma_2$ consists of one step that flips the bit corresponding to the highest temperature; and $\gamma_3$ consists of swap moves back down until we reach $\bar{y}$.

Bounding the mixing rate of the projection: The graph underlying the Markov chain for the projection is an $(M+1)$-dimensional hypercube. The stationary probabilities of the projection chain are given by
$$\bar{\pi}(\tau) \;=\; \prod_{i=0}^{M} \pi_i(\Omega_{\tau_i}),$$
where $\Omega_1 = \{x \in \Omega : x \ge 0\}$ and $\Omega_0 = \{x \in \Omega : x < 0\}$.
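Definition 4.1 in code (we encode the two signs as hypercube bits, an assumption of convenience consistent with the hypercube picture above):

```python
def trace(xbar):
    """Trace of a swapping-chain configuration: bit i is 1 if the i-th
    component is nonnegative and 0 otherwise.  Configurations sharing a
    trace form one block of the decomposition, and the traces are the
    vertices of the (M+1)-dimensional hypercube."""
    return tuple(1 if x >= 0 else 0 for x in xbar)
```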
The stationary probabilities of the projection chain are given by w w w £ £¡ ¤¥ % ¢) # £ r £ £ ¥ Xs ( ! ©¨§©! ¦ y q y §§ @ ¥ @ £ Xs ( (y@ %!r &#%##$! "@! gd£y @ ( y %!%#&## ! Y£ `YyB £ ¥ Xs ( r p ! g £ ¦ ¥ ¥ ¦ ) ¥ w £ ¦ ¦ 5 ¥ ! $# $! ## ¥ for any transition in the canonical path. Second, we need to ensure that the number of paths using the transition Likewise, by taking one more term, we find that , , is at most a polynomial. These two conditions Together with equation 4.2 this implies are sufficient to give a polynomial bound on the parameter in the comparison theorem. For any we have , so it remains to establish the condition in Equation 4.1. Case 2: (The transition along ) Consider the transition from to Case 1: (Transitions along ) that flips the first bit of . Let and Repeating the argument from Case 1, it follows that . (4.2) Therefore, again we find equation 4.1 is satisfied. Case 3: (Transitions along By the comparison theorem we find that We have now established all the results necessary to apply the decomposition theorem 4.3 and show Theorem 4.1. 4.2 Bimodal mean-field spin models. We now look more closely at mean-field models to see how to modify the swapping algorithm. Consider the following very general class of mean-field models. where is the Kronecker- function that takes the value 1 if all of the arguments are equal and is 0 otherwise (when 485 !' ¨ ¨ ¦¦¦ §§§ ¨ E ¡C ¨ ¦¦¦ © §§§ Similarly, considering the next block of (i.e., the next set of bits such that ) until the first index such that , ¥ ( ( ) ) " ' ( ( ) ) where that of bits . We want to show It is useful to partition into blocks that equal 1, separated by one or more zeros. Let be the largest value such that . Then it is easy to verify that Example II: Bimodal mean-field spin models: Fix constants and let be a large integer. 
The state space of the mean-field model consists of all spin configurations on the complete graph $K_n$, namely $\Omega = \{1, \dots, q\}^n$. The probability distribution over these configurations is determined by $\beta$, the inverse temperature, and by $(J_1, \dots, J_k)$, the coupling constants for the $j$-wise interactions between particles. The Hamiltonian is given by

$H(\sigma) \;=\; \sum_{j=1}^{k} J_j \sum_{1 \le i_1 < \cdots < i_j \le n} \delta(\sigma_{i_1}, \dots, \sigma_{i_j}),$

where $\delta$ is the Kronecker-$\delta$ function that takes the value 1 if all of its arguments are equal and is 0 otherwise. The Gibbs distribution is

$\pi(\sigma) \;=\; \frac{e^{\beta H(\sigma)}}{Z},$

where $Z$ is the normalizing constant. We consider here the case when the induced distribution on the vector of color counts is bimodal (i.e., when there are exactly two local optima).
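Because $H(\sigma)$ depends on $\sigma$ only through its color counts, the distribution induced on count vectors has a closed form: a multinomial (entropy) factor times the Gibbs (energy) factor. The sketch below evaluates this marginal for the pairwise case with no unary term; the function name and the choice $\beta = 3/n$ are assumptions made for illustration.

```python
import math

def log_count_marginal(counts, beta):
    """Unnormalized log-weight of a color-count vector (n_1, ..., n_q) under
    the mean-field model with pairwise interactions:
        P(counts) proportional to multinomial(counts) * exp(beta * H),
    where H = sum_j C(n_j, 2) counts monochromatic pairs."""
    n = sum(counts)
    log_multinom = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    H = sum(c * (c - 1) // 2 for c in counts)
    return log_multinom + beta * H

# Scan a one-parameter slice for a 3-color model on n = 60 vertices:
# a vertices of color 1, the rest split evenly between colors 2 and 3.
n, beta = 60, 3.0 / 60
profile = [log_count_marginal((a, (n - a) // 2, (n - a + 1) // 2), beta)
           for a in range(n + 1)]
```

Near the critical $\beta$, such a slice shows two local maxima (a balanced, disordered mode and an ordered mode dominated by color 1) separated by an exponentially deep valley: the bad cut responsible for torpid mixing.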
An important special case of Example II is the mean-field Ising model in the presence of an external field. This model is defined by the parameters $\beta$, the inverse temperature, and $h$, the external magnetic field. The Gibbs distribution over configurations is $\pi(\sigma) = e^{\beta H(\sigma) + h\,n_+(\sigma)}/Z$, where $n_+(\sigma)$ is the number of vertices assigned $+1$ and $Z$ is the normalizing constant. This can be described by the model in Example II by taking $q = 2$ and letting the unary term play the role of the external field. It can be shown that this distribution is bimodal for all values of $\beta$ and $h$.

A second important special case included in Example II is the $q$-state Potts model, where we take $k = 2$ and restrict to the part $\widetilde{\Omega}$ of the state space in which the first color is (weakly) the most prevalent. Sampling from $\pi$ restricted to $\widetilde{\Omega}$ is sufficient, since we can randomly permute the colors once we obtain a sample and thereby get a random configuration of the Potts model on the nonrestricted state space $\Omega$. Restricting to $\widetilde{\Omega}$ provides a bimodal distribution, which is required for the arguments that follow.

4.3 A new swapping algorithm. Our results from the previous section indicate that swapping is not always fast on the models defined in Example II; it is easy to see that the arguments apply directly to the Potts model restricted to $\widetilde{\Omega}$. In contrast, the new swapping algorithm we define next can be shown to be rapidly mixing for the entire class of models defined in Example II.

In the traditional swapping algorithm, the interpolating distributions are defined as $\pi_i(\sigma) = e^{\beta_i H(\sigma)}/Z_i$, where $0 = \beta_0 < \beta_1 < \cdots < \beta_M = \beta$ and $Z_i$ normalizes each distribution. Our new algorithm stems from the observation that this is a poor choice of interpolants because they preserve the first-order phase transition. We can do much better by exploring a wider class of interpolating distributions. To see the flexibility we have in defining the set of distributions, for any partition $(n_1, \dots, n_q)$ of $n$, define $\Omega_{(n_1, \dots, n_q)}$ as the set of configurations with $n_c$ vertices assigned color $c$, and consider the total spins distribution, the measure that each $\pi_i$ induces on these classes.

The Flat-Swap Algorithm: For our variant, the Flat-Swap algorithm, we choose interpolants in which the entropy (multinomial) term is damped by the same factor as the energy term coming from the Hamiltonian; when the damping function is taken to be constant, we recover the distributions of the usual swapping algorithm. Note that these weights are easy to compute given the color counts of a configuration. We shall see that this choice gradually flattens out the total spins distributions uniformly, thus eliminating the bad cut that can occur in the standard algorithm. A simple calculation reveals that, for each $i$, the total spins distribution at inverse temperature $\beta_i$ is proportional to the $\beta_i/\beta$ power of the one at $\beta$. Thus, all the total spins distributions have the same relative shape, but get flatter as $\beta_i$ is decreased. This no longer preserves the non-analytic nature of the phase transition seen for the usual swap algorithm, and it is this property that makes this choice of distributions useful. We have the following theorem.

THEOREM 4.5. The Flat-Swap algorithm is rapidly mixing for any bimodal mean-field model.

To prove Theorem 4.5, we follow the strategy set forth for Theorem 4.1, using decomposition and comparison in a similar manner. For simplicity, we concentrate our exposition here on the Ising model in an external field. The advantage of this special case is that the total spins configurations form a one-parameter family (indexed by the number of vertices assigned $+1$), much like in Example I. The proofs for the general class of models, including the Potts model on $K_n$, are analogous.

We sketch the proof of Theorem 4.5. The total spins distribution for the Ising model is known to be bimodal, even in the presence of an external field. With our choice of interpolants, it now follows that all of the interpolating distributions are bimodal as well. Moreover, the minima of these distributions occur at the same location; let $s^*$ be the place at which the minima occur. In order to show that this swapping chain is rapidly mixing, we use decomposition. Let $\widehat{\Omega} = \Omega^{M+1}$ be the state space of the swapping chain on the Ising model. Define the trace $\mathrm{Tr}(x) = y$, where $y_i = 0$ if the number of $+1$s in $x_i$ is less than $s^*$, and $y_i = 1$ if the number of $+1$s in $x_i$ is at least $s^*$. The analysis of the restricted chains given in [10] in the context of the Ising model without an external field can be readily adapted to show that the restrictions are also rapidly mixing. The analysis of the projection is analogous to the arguments used to bound the mixing rate of the projection for Example I. Hence, we can conclude that this swapping algorithm is rapidly mixing for the mean-field Ising model at any temperature, with any external field. We leave the details, including the extension to the Potts model, for the full version of the paper.

5 Conclusions

Swapping, tempering and annealing provide a means, experimentally, of overcoming the bottlenecks that cause the slow convergence of Markov chains. However, our results offer rigorous evidence that heuristics based on these methods might be incorrect if samples are taken after only a polynomial number of steps. In recent work, we have extended the arguments presented here to show an even more surprising result: tempering can actually be slower than the fixed-temperature Metropolis algorithm by an exponential multiplicative factor. Many other future directions present themselves.
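The flattening property can be checked numerically. The formula below is a reconstruction of the idea rather than necessarily the paper's exact definition: if the interpolant at level $i$ is taken as $\mu_i(\sigma) \propto e^{\beta_i H(\sigma)} M(\sigma)^{\beta_i/\beta - 1}$, with $M(\sigma)$ the multinomial number of configurations sharing $\sigma$'s spin counts, then, since $H$ depends only on the counts, the induced total-spins marginal satisfies $P_i(s) \propto P(s)^{\beta_i/\beta}$.

```python
import math

def log_total_spins(s, n, beta, h):
    """Unnormalized log-weight of s plus-spins in the mean-field Ising model
    with external field h (a standard parameterization, assumed here):
        P(s) proportional to C(n, s) * exp(beta*(C(s,2) + C(n-s,2)) + h*s)."""
    entropy = math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
    energy = beta * (s * (s - 1) / 2 + (n - s) * (n - s - 1) / 2) + h * s
    return entropy + energy

def flat_swap_marginal(n, beta, h, beta_i):
    """Normalized Flat-Swap total-spins marginal at level beta_i:
    P_i(s) proportional to P(s) ** (beta_i / beta)."""
    logs = [(beta_i / beta) * log_total_spins(s, n, beta, h) for s in range(n + 1)]
    m = max(logs)
    w = [math.exp(v - m) for v in logs]
    Z = sum(w)
    return [v / Z for v in w]

n, beta, h = 40, 3.0 / 40, 0.01
cold = flat_swap_marginal(n, beta, h, beta)   # the target: bimodal
hot = flat_swap_marginal(n, beta, h, 0.0)     # beta_0 = 0: exactly uniform
```

Every intermediate marginal is a power of the cold one, so all of them keep their maxima and minima at the same locations, while the hottest is exactly uniform; the deep cut is gone, which is what the decomposition argument for Theorem 4.5 exploits.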
It would be worthwhile to continue to identify examples where the standard (temperature-based) interpolants fail to lead to efficient algorithms, but variants of the swapping algorithm, such as the one presented in Section 4.3, nonetheless succeed. The difficulty in extending our methods to more interesting examples, such as the Ising and Potts models on lattices, is that it is not clear how to define the interpolants. We would want a way to slowly modify the entropy term in addition to the temperature, as we did in the mean-field case, to avoid the bad cut arising from the phase transition. It would also be worthwhile to explore whether a good set of interpolants can be determined algorithmically, by bootstrapping, rather than analytically as was done here, in order to define a more robust family of tempering-like algorithms.

Acknowledgments

The authors thank Christian Borgs, Jennifer Chayes, Claire Kenyon, and Elchanan Mossel for useful discussions.

References

[1] C. Borgs, J.T. Chayes, A. Frieze, J.H. Kim, P. Tetali, E. Vigoda, and V.H. Vu. Torpid mixing of some MCMC algorithms in statistical physics. Proc. 40th IEEE Symposium on Foundations of Computer Science (FOCS 1999), 218-229, 1999.
[2] C. Cooper, M.E. Dyer, A.M. Frieze, and R. Rue. Mixing properties of the Swendsen-Wang process on the complete graph and narrow grids. J. Math. Phys. 41: 1499-1527, 2000.
[3] P. Diaconis and L. Saloff-Coste. Comparison theorems for reversible Markov chains. Annals of Applied Probability 3: 696-730, 1993.
[4] C.J. Geyer. Markov chain Monte Carlo maximum likelihood. Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface (E.M. Keramidas, ed.), 156-163. Interface Foundation, Fairfax Station, 1991.
[5] C.J. Geyer and E.A. Thompson. Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Amer. Statist. Assoc. 90: 909-920, 1995.
[6] V.K. Gore and M.R. Jerrum. The Swendsen-Wang process does not always mix rapidly. J. Statist. Phys.
97: 67-86, 1999.
[7] M.R. Jerrum and A.J. Sinclair. Approximate counting, uniform generation and rapidly mixing Markov chains. Information and Computation 82: 93-133, 1989.
[8] S. Kirkpatrick, C.D. Gelatt Jr., and M.P. Vecchi. Optimization by simulated annealing. Science 220: 671-680, 1983.
[9] N. Madras and D. Randall. Markov chain decomposition for convergence rate analysis. Annals of Applied Probability 12: 581-606, 2002.
[10] N. Madras and Z. Zheng. On the swapping algorithm. Random Structures and Algorithms 22: 66-97, 2003.
[11] E. Marinari and G. Parisi. Simulated tempering: a new Monte Carlo scheme. Europhys. Lett. 19: 451-458, 1992.
[12] R.A. Martin and D. Randall. Sampling adsorbing staircase walks using a new Markov chain decomposition method. Proc. 41st IEEE Symposium on Foundations of Computer Science (FOCS 2000), 492-502, 2000.
[13] R.A. Martin and D. Randall. Disjoint decomposition with applications to sampling circuits in some Cayley graphs. Preprint, 2003.
[14] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics 21: 1087-1092, 1953.
[15] R.B. Potts. Some generalized order-disorder transformations. Proceedings of the Cambridge Philosophical Society 48: 106-109, 1952.
[16] A.J. Sinclair. Algorithms for random generation and counting: a Markov chain approach. Birkhäuser, 1993.
[17] Z. Zheng. Analysis of swapping and tempering Monte Carlo algorithms. Ph.D. dissertation, York University, 1999.