Particle Filtered MCMC-MLE with Connections to Contrastive Divergence

Arthur U. Asuncion     ASUNCION@ICS.UCI.EDU
Qiang Liu              QLIU1@ICS.UCI.EDU
Alexander T. Ihler     IHLER@ICS.UCI.EDU
Padhraic Smyth         SMYTH@ICS.UCI.EDU
Department of Computer Science, University of California, Irvine, CA 92697, USA

Abstract

Learning undirected graphical models such as Markov random fields is an important machine learning task with applications in many domains. Since it is usually intractable to learn these models exactly, various approximate learning techniques have been developed, such as contrastive divergence (CD) and Markov chain Monte Carlo maximum likelihood estimation (MCMC-MLE). In this paper, we introduce particle filtered MCMC-MLE, which is a sampling-importance-resampling version of MCMC-MLE with additional MCMC rejuvenation steps. We also describe a unified view of (1) MCMC-MLE, (2) our particle filtering approach, and (3) a stochastic approximation procedure known as persistent contrastive divergence. We show how these approaches are related to each other and discuss the relative merits of each approach. Empirical results on various undirected models demonstrate that the particle filtering technique we propose in this paper can significantly outperform MCMC-MLE. Furthermore, in certain cases, the proposed technique is faster than persistent CD.

1. Introduction

Undirected models such as Boltzmann machines, conditional random fields, and exponential random graph models are useful in many settings, including computer vision (Li, 1994), linguistics (Lafferty et al., 2001), and social network analysis (Robins et al., 2007). These models are often hard to learn exactly, and thus various approximation techniques have been proposed, including pseudolikelihood methods (Besag, 1974), variational methods (Wainwright & Jordan, 2008), and sampling methods (Geyer & Thompson, 1992).

One such technique is to use both MCMC and importance sampling to perform approximate maximum likelihood estimation. This technique, known as Markov chain Monte Carlo maximum likelihood estimation (MCMC-MLE), has been shown to be more accurate than pseudolikelihood methods on Ising models (Geyer & Thompson, 1992) and is widely used for estimating exponential random graph models (Snijders, 2002; Handcock et al., 2008). A benefit of MCMC-MLE is that the likelihood approximation becomes exact as the number of MCMC samples goes to infinity. However, since this technique is based on importance sampling, the quality of the approximation depends on the distance between the initial and the target distributions. If they are far apart, the samples from the initial distribution may not be able to adequately cover the target space, leading to unreliable estimates (Bartz et al., 2008).

To combat this potential impoverishment of the system's samples, we propose particle filtered MCMC-MLE (or "PF"). Our PF approach applies the sampling-importance-resampling scheme (Smith & Gelfand, 1992) to MCMC-MLE and performs rejuvenating MCMC steps to maintain diversity within the set of samples (Gilks & Berzuini, 2001). We find that PF is able to outperform MCMC-MLE due to these additional steps taken to mitigate degeneracy.

Another well-known technique is contrastive divergence (CD) (Hinton, 2002), which has been successfully applied to many problems, including the learning of Boltzmann machines and deep architectures (Salakhutdinov & Hinton, 2009). One benefit of CD is computational efficiency. Unlike MCMC-MLE, CD does not wait for the Markov chains to reach equilibrium but takes only a few sampling steps to obtain an approximate gradient. When there are no hidden variables in the model, CD with a single-variable sampling step corresponds to maximum pseudolikelihood estimation (MPLE) (Hyvärinen, 2006), and CD with blocked sampling corresponds to maximum composite likelihood estimation (MCLE) (Asuncion et al., 2010). A stochastic approximation variant of CD, also known as persistent contrastive divergence (PCD), has also been developed and has been shown to improve upon CD (Younes, 1988; Tieleman, 2008).
Interestingly, PCD can be recast within the sequential Monte Carlo scheme, and we show later that PCD is a version of PF with importance weights fixed to 1. These connections suggest a unified view, where slight algorithmic choices lead to these different methods.

In the next section, we review MCMC-MLE. Then we introduce PF and highlight the link to PCD. We present experimental results showing that PF improves upon MCMC-MLE and in certain cases also outperforms PCD.

2. A Review of MCMC-MLE

Before delving into our particle filtering technique, we review MCMC-MLE (Geyer & Thompson, 1992). Let us consider models in exponential family form,¹

    p(x|\theta) = \exp(\theta' g(x)) / Z(\theta),    (1)

where g(x) is a vector of sufficient statistics, the apostrophe denotes transposition, and Z(\theta) is the partition function. Assume we have N independent observations from the model, X = {x_1, x_2, ..., x_N}, and the task is to find the parameters \theta that generated the observed data. The standard approach is maximum likelihood estimation (MLE), which maximizes the log-likelihood,

    L(\theta|X) = \frac{1}{N} \sum_{i=1}^N \theta' g(x_i) - \log Z(\theta).    (2)

While the ML estimator, \hat{\theta}_{ml} = \arg\max_\theta L(\theta|X), has appealing theoretical properties such as asymptotic consistency and statistical efficiency (Lehmann & Casella, 1998), it is typically hard to compute for models with complex dependencies, due to the intractability of Z(\theta). We can rewrite Z(\theta) in terms of an alternate distribution p(x|\theta_0) (Geyer & Thompson, 1992):

    Z(\theta) = Z(\theta_0) \sum_x \exp\{(\theta - \theta_0)' g(x)\} \, p(x|\theta_0).    (3)

If we are able to obtain S samples from the alternate distribution (x_0^s \sim p(x|\theta_0)), a Monte Carlo approximation of Z(\theta) yields the approximate likelihood (note that Z(\theta_0) does not depend on \theta and can be ignored):

    \tilde{L}(\theta|X) = \frac{1}{N} \sum_{i=1}^N \theta' g(x_i) - \log \frac{1}{S} \sum_{s=1}^S \exp\{(\theta - \theta_0)' g(x_0^s)\}.    (4)

This likelihood becomes exact when S \to \infty.

¹ It is also straightforward to apply the methods in this paper to models with hidden variables, such as RBMs.
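For concreteness, the following is a minimal NumPy sketch of the approximate likelihood in eq. 4 (an illustration of ours, not code from any released implementation). It assumes the sufficient statistics have been precomputed, with g_data holding g(x_i) for the observations and g_samples holding g(x_0^s) for the MCMC samples, both row-wise.

```python
import numpy as np
from scipy.special import logsumexp

def approx_loglik(theta, theta0, g_data, g_samples):
    """Approximate log-likelihood of eq. 4 (the Z(theta0) term is dropped, as in the text).

    theta, theta0 : (D,) parameter vectors
    g_data        : (N, D) sufficient statistics g(x_i) of the observed data
    g_samples     : (S, D) sufficient statistics g(x_0^s), with x_0^s ~ p(x|theta0)
    """
    first_term = g_data.dot(theta).mean()  # (1/N) sum_i theta' g(x_i)
    # log[(1/S) sum_s exp{(theta - theta0)' g(x_0^s)}], computed stably in the log domain
    log_z_ratio = logsumexp(g_samples.dot(theta - theta0)) - np.log(len(g_samples))
    return first_term - log_z_ratio
```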
Algorithm 1 MCMC-MLE
  Initialize \theta_0
  Sample {x_0^s} \sim p(x|\theta_0)
  for i = 1 to max-iterations (or convergence) do
    Calculate {w^s} via eq. 6, using \theta_i, \theta_0, {x_0^s}
    Calculate \nabla\tilde{L} via eq. 5, using {w^s}, {x_0^s}
    \theta_{i+1} \leftarrow \theta_i + \eta \nabla\tilde{L}
  end for

MCMC-MLE constructs the approximate likelihood in eq. 4 using MCMC samples from the alternate distribution and then finds the maximizer of that likelihood to obtain the approximate ML estimate \hat{\theta}_{approx}. Since the initial \theta_0 may be far from the true parameter (leading to unreliable estimates), in practice this procedure is iterated a number of times by running MCMC-MLE again with \theta_0 = \hat{\theta}_{approx} found in the previous iteration. For instance, in the widely used statnet software, which estimates exponential random graph models, the default number of MCMC-MLE rounds is 3 (Handcock et al., 2008). Another approach is to initialize the parameters at the MPLE (\theta_0 = \theta_{mple}) (Snijders, 2002).

To elucidate the role of importance sampling within MCMC-MLE, we derive the gradient of the likelihood in eq. 4 with respect to \theta,

    \frac{d\tilde{L}(\theta|X)}{d\theta} = \frac{1}{N} \sum_{i=1}^N g(x_i) - \frac{1}{S} \sum_{s=1}^S w^s g(x_0^s),    (5)

where

    w^s = \frac{\exp\{(\theta - \theta_0)' g(x_0^s)\}}{\frac{1}{S} \sum_{s'} \exp\{(\theta - \theta_0)' g(x_0^{s'})\}}.    (6)

These weights {w^s} are the Monte Carlo versions of the standard importance weights {w_{imp}^s},

    w_{imp}^s = \frac{p(x_0^s|\theta)}{p(x_0^s|\theta_0)} = \frac{\exp\{(\theta - \theta_0)' g(x_0^s)\}}{Z(\theta)/Z(\theta_0)}    (7)
              = \frac{\exp\{(\theta - \theta_0)' g(x_0^s)\}}{\sum_x \exp\{(\theta - \theta_0)' g(x)\} \, p(x|\theta_0)}    (8)
              \approx \frac{\exp\{(\theta - \theta_0)' g(x_0^s)\}}{\frac{1}{S} \sum_{s'} \exp\{(\theta - \theta_0)' g(x_0^{s'})\}} = w^s,    (9)

where eq. 8 is obtained by applying eq. 3, and eq. 9 is the Monte Carlo approximation of eq. 8. Thus, the second term of the gradient in eq. 5 is estimated via importance sampling. Using the gradient, one can run optimization methods such as gradient ascent to find \hat{\theta}_{approx}. Algorithm 1 shows pseudocode for one round of MCMC-MLE.
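This importance-sampling view translates directly into code. Below is a hedged sketch of one round of Algorithm 1 as gradient ascent on eq. 4, reusing the array conventions from the sketch above; the fixed learning rate eta and iteration count are illustrative choices rather than recommended settings, and the weights of eq. 6 are computed in the log domain to avoid overflow.

```python
import numpy as np
from scipy.special import logsumexp

def mcmc_mle_round(theta0, g_data, g_samples, eta=0.01, n_iters=100):
    """One round of MCMC-MLE (Algorithm 1): samples are drawn once from
    p(x|theta0) and reweighted via eq. 6 as theta moves."""
    theta = theta0.copy()
    data_mean = g_data.mean(axis=0)  # first term of eq. 5
    for _ in range(n_iters):
        log_u = g_samples.dot(theta - theta0)
        # softmax-normalized weights: identical to (1/S) sum_s w^s g(x_0^s)
        # with w^s from eq. 6, since the 1/S factors cancel
        w = np.exp(log_u - logsumexp(log_u))
        grad = data_mean - w.dot(g_samples)  # eq. 5
        theta = theta + eta * grad
    return theta
```

In practice one would monitor the change in eq. 4 for convergence and then start the next round with \theta_0 set to the returned estimate, as described in the text.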
3. Our Particle Filtering Approach

We now introduce particle filtered MCMC-MLE and show its close connection to contrastive divergence.

3.1. Particle Filtered MCMC-MLE

The realization that MCMC-MLE is performing importance sampling naturally leads to ways to improve MCMC-MLE. We propose particle filtered MCMC-MLE, which casts the algorithm within a sequential Monte Carlo framework (Doucet et al., 2001). Specifically, our PF approach adapts the sampling-importance-resampling (SIR) technique (Smith & Gelfand, 1992) to the task of estimating the second term of the gradient in eq. 5. At a high level, SIR is a way to perform importance sampling recursively using intermediate distributions, in such a way that the degeneracy of the importance weights is controlled.

In the SIR framework, each particle is initially sampled from the initial distribution p(x|\theta_0) and the weights of all particles are set to 1. Then importance sampling is performed recursively on a sequence of intermediate distributions {p(x|\theta_i)}. In our task of finding the MLE, we specify the sequence of distributions to be precisely the algorithmic trail {\theta_0, \theta_1, ...} taken by our optimization algorithm. This specification of intermediate distributions bears resemblance to the technique of Carbonetto et al. (2009). At each iteration i, each particle weight w^s is calculated recursively in the standard SIR scheme:

    w^s \propto w^s \cdot \frac{p(x^s|\theta_i)}{p(x^s|\theta_{i-1})}.    (10)

These weights can be normalized such that \sum_s w^s = 1. This procedure is equivalent to calculating the weight non-recursively (via eq. 6), based on the current \theta and the \theta in effect when the particles were last resampled. The Monte Carlo approximation of the weights is used here as well.

An important aspect of this approach is monitoring the particle set's quality. The effective sample size (ESS) reveals the number of random samples needed to match the Monte Carlo variation of the particle set. ESS can be approximately computed using the weights (Kong et al., 1994),

    \mathrm{ESS}(\{w^s\}) = \frac{(\sum_s w^s)^2}{\sum_s (w^s)^2}.    (11)

An ESS that is low (i.e. below a predetermined threshold) suggests that the importance weights have become degenerate and that resampling and rejuvenation are necessary to replenish the diversity of the particles.

Resampling is performed by sampling S particles from the set {x^s} with replacement, with probabilities proportional to {w^s}. After resampling, the weights are reset to 1. More advanced resampling techniques, such as stratified resampling, are also possible (Kitagawa, 1996). Particles with high relative weights (indicating closeness to the target) are sampled more frequently, while low-weight particles are often eliminated from the set. In the degenerate case where only a few particles have dominating weights, only those particles would be replicated and the diversity of the set would diminish greatly.

Rejuvenation combats the problem of degeneracy by applying MCMC move-steps to the particles after the resampling step (Gilks & Berzuini, 2001; Ridgeway & Madigan, 2003). Specifically, for each particle s, an MCMC chain governed by p(x|\theta_i) is initialized at x^s and advanced for n steps, leading to a new configuration for that particle. Rejuvenation increases the diversity of samples but can be computationally expensive, depending on the number of steps and distributions. To mitigate this cost, we restrict ourselves to rejuvenation steps only after resampling.

Algorithm 2 Particle Filtered MCMC-MLE
  Initialize \theta_0
  Sample {x^s} \sim p(x|\theta_0)
  for i = 1 to max-iterations (or convergence) do
    Calculate {w^s} via eq. 10, using \theta_i, \theta_{i-1}, {x^s}
    if ESS({w^s}) < threshold then
      Resample {x^s} in proportion to {w^s}
      {w^s} \leftarrow 1
      Rejuvenate {x^s} for n MCMC steps, using \theta_i
    end if
    Calculate \nabla\tilde{L} via eq. 5, using {w^s}, {x^s}
    \theta_{i+1} \leftarrow \theta_i + \eta \nabla\tilde{L}
  end for

Pseudocode for our PF method is in Algorithm 2. One needs to specify the initial parameters \theta_0, the number of particles S, the ESS threshold, the rejuvenation length n, the learning rate \eta, and the convergence criteria. While the optimization algorithm shown is gradient ascent, one could also use (quasi-)Newton methods. Note that as the number of particles goes to infinity, the likelihood gradient becomes exact.
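As a sketch of the inner loop of Algorithm 2 (ours, with hypothetical helper names), the step below updates the log weights via eq. 10, checks the ESS of eq. 11, and resamples and rejuvenates when needed. Here mcmc_step is an assumed user-supplied single MCMC move under p(x|\theta), and g maps a particle to its sufficient-statistics vector.

```python
import numpy as np

def pf_step(particles, log_w, theta_new, theta_old, g, mcmc_step,
            ess_threshold, n_rejuv):
    """Weight/resample/rejuvenate step of Algorithm 2 (illustrative sketch)."""
    S = len(particles)
    # eq. 10 in the log domain; the Z(theta) ratio cancels once weights are normalized
    log_w = log_w + np.array([g(x).dot(theta_new - theta_old) for x in particles])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    if 1.0 / np.sum(w ** 2) < ess_threshold:     # eq. 11 for normalized weights
        idx = np.random.choice(S, size=S, p=w)   # multinomial resampling
        particles = [particles[j] for j in idx]
        log_w = np.zeros(S)                      # weights reset to 1
        w = np.ones(S) / S                       # uniform after the reset
        for _ in range(n_rejuv):                 # rejuvenation under p(x|theta_new)
            particles = [mcmc_step(x, theta_new) for x in particles]
    return particles, log_w, w
```

The normalized weights w returned here plug directly into the second term of the gradient in eq. 5.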
3.2. Connection to Contrastive Divergence

Contrastive divergence (CD) is a popular learning algorithm that has been applied to a variety of models (Hinton, 2002). While CD's objective function is technically a difference of two KL divergences, the practical CD algorithm can be obtained by considering the data log-likelihood. Instead of using eq. 3, we derive the gradient of the likelihood in eq. 2 directly and apply a Monte Carlo approximation:

    \frac{dL(\theta|X)}{d\theta} = \frac{1}{N} \sum_{i=1}^N g(x_i) - \sum_x g(x) \, p(x|\theta)    (12)
                          \approx \frac{1}{N} \sum_{i=1}^N g(x_i) - \frac{1}{S} \sum_{s=1}^S g(x^s).    (13)

The samples {x^s} in eq. 13 are drawn from p(x|\theta) via MCMC. However, to obtain an accurate sample from the distribution, the MCMC chain would need to reach equilibrium. Since the CD philosophy values computational efficiency, CD-n initializes the chains from the data distribution (to get close to the true modes) and then runs the MCMC chain for only n steps, where n is often 1 (i.e. a full sweep over the variables). The rationale behind CD is that only a rough estimate of the gradient is necessary to move the parameters in the right direction. Note that as the number of steps n and the number of samples S go to infinity, the gradient calculation becomes exact. It is interesting to compare this feature to PF, which only needs the number of particles S to go to infinity to obtain an exact gradient.

Persistent contrastive divergence (PCD-n) is a version of CD-n where MCMC chains are persisted across iterations (Tieleman, 2008). At iteration i, each x^s is advanced n steps using the MCMC sampler governed by p(x|\theta_i), where \theta_i is the set of parameters at that iteration. At iteration i+1, the MCMC chain for each particle x^s is initialized with the end sample of the previous iteration's chain for x^s, and then the chain is advanced n MCMC steps under the distribution p(x|\theta_{i+1}). At each step, these configurations {x^s} are used to estimate the gradient in eq. 13. A decreasing schedule for \eta can be used to obtain a stochastic approximation guarantee of asymptotic convergence to the MLE (Younes, 1988; Salakhutdinov, 2009).

Since x^s is approximately sampled from p(x|\theta_i) at the end of iteration i, and since the goal in iteration i+1 is to compute an expectation of the sufficient statistics under the distribution p(x|\theta_{i+1}) (i.e. eq. 13), a natural extension is to augment PCD by inserting a sampling-importance-resampling step between iterations. Surprisingly, this slightly modified version of PCD is equivalent to our PF algorithm. Within our PF framework, we obtain this modified version of PCD-n if we (1) initialize \theta_0 in such a way that sampling from p(x|\theta_0) is easy, (2) set the ESS threshold to be greater than S so that resampling and rejuvenation take place every iteration, and (3) set the length of rejuvenation to be n.

The standard version of PCD can be viewed as PF with all the importance weights fixed to 1. In Algorithm 3, we show the pseudocode of PCD-n in particle filtering terminology. The resampling step is not needed since all weights are 1. The n sampling steps within each iteration of PCD can be interpreted as simply being forced rejuvenation steps. PCD does rejuvenation every iteration by construction, but if we follow the PF approach of monitoring the ESS, it is not necessary to rejuvenate every iteration. In iterations when the ESS remains high, PF essentially performs "PCD-0" (since there are 0 MCMC rejuvenation steps). This feature can potentially make PF faster than PCD.

Algorithm 3 PCD-n (in particle filtering terminology)
  Initialize \theta_0
  Sample {x^s} \sim p(x|\theta_0) using n steps of MCMC
  for i = 1 to max-iterations (or convergence) do
    Fix {w^s} \leftarrow 1
    if true then
      Rejuvenate {x^s} for n MCMC steps, using \theta_i
    end if
    Calculate \nabla\tilde{L} via eq. 5, using {w^s}, {x^s}
    \theta_{i+1} \leftarrow \theta_i + \eta \nabla\tilde{L}
  end for

An issue with PF is that it can still have weight degeneracy issues if the step size moves the parameters too far at each iteration. PCD sidesteps this issue since all particles are fixed to have weights of 1 a priori. An alternative technique to obtain weights approaching 1 is to heuristically introduce a temperature T in the importance weights:

    w^s = \left( \frac{p(x^s|\theta)}{p(x^s|\theta_0)} \right)^{1/T} = \frac{p(x^s|\theta)}{p^{1/T}(x^s|\theta_0) \, p^{1-1/T}(x^s|\theta)}.    (14)

As T \to \infty, the weights move towards 1. A way to motivate this heuristic "smoothing" of the weights is that instead of estimating the current health of the particle x^s, we are trying to estimate the future health of the particle after n steps of rejuvenation under distribution p(x|\theta), and thus we interpolate between p(x^s|\theta_0) and p(x^s|\theta) in the denominator of eq. 14. It may be possible to adaptively learn the optimal T, but as we will see in the experiment in Figure 4, even using a fixed T > 1 can help in cases where weight degeneracy occurs in PF. Nonetheless, this heuristic is not needed as long as \eta is small enough, and in most of our experiments we leave T = 1.
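In log-weight form, the temperature heuristic of eq. 14 amounts to a one-line change to the recursive update of eq. 10. The sketch below is our reading, not code from the paper: because rejuvenation only occurs at resampling, the tempered increments telescope between resets, recovering eq. 14 with \theta_0 taken as the parameters at the last reset (the normalizing-constant ratios again cancel after normalization).

```python
import numpy as np

def tempered_log_weight_update(log_w, g_stats, theta_new, theta_old, T=1.0):
    """Eq. 14 applied to the recursive update of eq. 10 (illustrative).

    g_stats : (S, D) sufficient statistics g(x^s) of the current particles
    T       : temperature; T = 1 recovers the standard weights, and as T grows
              the normalized weights flatten towards 1/S, mimicking PCD's
              fixed weights of 1
    """
    return log_w + g_stats.dot(theta_new - theta_old) / T
```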
4. Experimental Analysis

We empirically compare MCMC-MLE, PF, and PCD on undirected models such as visible Boltzmann machines, exponential random graph models, conditional random fields, and restricted Boltzmann machines. Our experimental results suggest that PF generally outperforms MCMC-MLE and in some cases is computationally faster than PCD.

4.1. Visible Boltzmann Machines

Our first model is a visible Boltzmann machine (VBM) of the form p(x|\theta) = \frac{1}{Z(\theta)} \exp\{\sum_i