The Recurrent Temporal Restricted Boltzmann Machine Ilya Sutskever, Geoffrey Hinton, and Graham Taylor University of Toronto { ilya, hinton, gwtaylor} @cs.utoronto.ca Abstract The Temporal Restricted Boltzmann Machine (TRBM) is a probabilistic model for sequences that is able to successfully model (i.e., generate nice-looking samples of) several very high dimensional sequences, such as motion capture data and the pixels of low resolution videos of balls bouncing in a box. The major disadvantage of the TRBM is that exact inference is extremely hard, since even computing a Gibbs update for a single variable of the posterior is exponentially expensive. This difficulty has necessitated the use of a heuristic inference procedure, that nonetheless was accurate enough for successful learning. In this paper we introduce the Recurrent TRBM, which is a very slight modification of the TRBM for which exact inference is very easy and exact gradient learning is almost tractable. We demonstrate that the RTRBM is better than an analogous TRBM at generating motion capture and videos of bouncing balls. 1 Introduction Modeling sequences is an important problem since there is a vast amount of natural data, such as speech and videos, that is inherently sequential. A good model for these data sources could be useful for finding an abstract representation that is helpful for solving "natural" discrimination tasks (see [4, 7] for an example of this approach for the non-sequential case). In addition, it could be also used for predicting the future of a sequence from its past, be used as a prior for denoising tasks, and be used for other applications such as tracking objects in video. The Temporal Restricted Boltzmann Machine [14, 13] is a recently introduced probabilistic model that has the ability to accurately model complex probability distributions over high-dimensional sequences. It was shown to be able to generate realistic motion capture data [14], and low resolution videos of 2 balls bouncing in a box [13], as well as complete and denoise such sequences. As a probabilistic model, the TRBM is a directed graphical model consisting of a sequence of Restricted Boltzmann Machines (RBMs) [3], where the state of one or more previous RBMs determines the biases of the RBM in next timestep. This probabilistic formulation straightforwardly implies a learning procedure where approximate inference is followed by learning. The learning consists of learning a conditional RBM at each timestep, which is easily done with Contrastive Divergence (CD) [3]. Exact inference in TRBMs, on the other hand, is highly non-trivial, since computing even a single Gibbs update requires computing the ratio of two RBM partition functions. As a result, the approximate inference procedure used in [13] was heuristic and was not even derived from a variational principle. In this paper we introduce the Recurrent TRBM (RTRBM), which is a model that is very similar to the TRBM, and just as expressive. Despite the similarity, exact inference is very easy in the RTRBM and computing the gradient of the log likelihood is feasible (up to the error introduced by the use of Contrastive Divergence). We demonstrate that the RTRBM is able to generate more realistic samples than an equivalent TRBM for the motion capture data and for the pixels of videos 1 of bouncing balls. The RTRBM's performance is better than the TRBM mainly because it learns to convey more information through its hidden-to-hidden connections. 2 Restricted Boltzmann Machines The building block of the TRBM and the RTRBM is the Restricted Boltzmann Machine [3]. An RBM defines a probability distribution over pairs of vectors, V {0, 1}NV and H {0, 1}NH (a shorthand for visible and hidden) by the equation P (v , h) = P (V = v , H = h) = exp(v bV + h bH + v W h)/Z (1 ) where bV is a vector of biases for the visible vectors, bH is a vector of biases for the hidden vectors, and W is the matrix of connection weights. The quantity Z = Z (bV , bH , W ) is the value of the partition function that ensures that Eq. 1 is a valid probability distribution. The RBM's definition implies that the conditional distributions P (H |v ) and P (V |h) are factorial (i.e., all the components of H in P (H |v ) are independent) and are given by P (H (j ) = 1|v ) = s(bH + W v )(j ) and P (V (i) = 1|h) = s(bV + W h)(i) , where s(x)(j ) = (1 + exp(-x(j ) ))-1 is the logistic function and x(j ) is the j th component of the vector x. In general, we use i to index visible vectors V and j to index hidden vectors H . 1 The RBM can be slightly modified to allow the vector V to take real values; one way of achieving this is by the definition P (v , h) = exp(- v 2 /2 + v bV + h bH + v W h)/Z. (2 ) v ~ where P (V ) = 1/|S | S v (V ) (here x (X ) is a distribution over real-valued vectors concentrated at x), and f (X ) P (X ) is the expectation of f (X ) under the distribution P . Computing the exact values of the expectations · P (H,V ) is intractable, and much work has been done on methods for computing approximate values for the expectations that are good enough for practical learning and inference tasks (e.g., [16, 19], including [15], which works well for the RBM). We will approximate the gradients with respect to the RBM's parameters using the Contrastive Divergence [3] learning procedure, CDn , whose updates are computed by the following algorithm. Algorithm 1 (CDn ) 1. 2. 3. 4. ~ Sample (v , h) P (H |V )P (V ) Set W to v · h repeat n times: sample v P (V |h), then sample h P (H |v ) Decrease W by v · h Using this equation does not change the form of the gradients and the conditional distribution P (H |v ). The only change it introduces is in the conditional distribution P (V |h), which is equal to a multivariate Gaussian with parameters N (bV + W h, I). See [18, 14] for more details and generalizations. v The gradient of the average log probability given a dataset S , L = 1/|S | S log P (v ), has the following simple form: P P V V (3 ) · H (H , V ) L/ W = · H (H | V )P (V ) - ~ Models learned by CD1 are often reasonable generative models of the data [3], but if learning is continued with CD25 , the resulting generative models are much better [12]. The RBM also plays a critical role in deep belief networks [4, 5], but we do not use this connection in this paper. 3 The TRBM It is easy to construct the TRBM with RBMs. The TRBM is a sequence of RBMs arranged in such a way that in any given timestep, the RBM's biases depend only on the state of the RBM in 1 We use uppercase variables (as in P (H |v )) to denote distributions and lowercase variables (as in P (h|v )) to denote the (real-valued) probability P (H = h|v ). 2 Figure 1: The graphical structure of a TRBM: a directed sequence of RBMs. the previous timestep. In its simplest form, the TRBM can be viewed as a Hidden Markov Model (HMM) [10] with an exponentially large state space that has an extremely compact parameterization tB of the transition and the emission probabilities. Let XtA = (XtA , . . . , XtB ) denote a sequence of T T variables. The TRBM defines a probability distribution P (V1T = v1 , H1 = hT ) by the equation 1 T P (v1 , hT ) = 1 tT P (vt , ht |ht-1 )P0 (v1 , h1 ) (4 ) =2 which is identical to the defining equation of the HMM. The conditional distribution P (Vt , Ht |ht-1 ) is that of an RBM, whose biases for Ht are a function of ht-1 . Specifically, v / P (vt , ht |ht-1 ) = exp t bV + vt W ht + h (bH + W ht-1 ) Z (ht-1 ) (5 ) t where bV , bH and W are as in Eq. 1, while W is the weight matrix of the connections from Ht-1 to Ht , making bH + W ht-1 be the bias of RBM at time t. In this equation, V {0, 1}NV and H {0, 1}NH ; it is easy to modify this definition to allow V to take real values as was done in Eq. 2. The RBM's partition function depends on ht-1 , because the parameters (i.e., the biases) of the RBM at time t depend on the value of the random variable Ht-1 . Finally, the distribution P0 is defined by an equation very similar to Eq. 5, except that the term W h0 is replaced by the term binit , so the hidden units receive a special initial bias at P0 ; we will often write P (V1 , H1 |h0 ) for P0 (V1 , H1 ) and W h0 for binit (note that h0 is undefined). It follows from these equations that the TRBM is a directed graphical model that has an (undirected) RBM at each timestep (a related directed sequence of Boltzmann Machines has been considered in [8]). As in most probabilistic models, the weight update is computed by solving the inference problem and computing the weight update by treating the inferred variables as observed. When the hidden variables are observed, equation 4 implies that the gradient of the log likelihood with respect to T the TRBM's parameters is t=1 log P (vt , ht |ht-1 ), and each term, being the gradient of the log likelihood of an RBM, can be approximated using CDn . Thus the main computational difficulty of TT learning TRBMs is in obtaining samples from a distribution approximating the posterior P (H1 |v1 ). Unfortunately, the TRBM's inference problem is harder than that of a typical undirected graphical (j ) model, because even computing the probability P (Ht = 1| everything else) involves evaluating the exact ratio of two partition functions of RBMs, which can be seen from Eq. 5. This difficulty necessitated the use of a heuristic inference procedure [13], which is based on the observation that t the distribution P (Ht |ht-1 , v1 ) = P (Ht |vt , ht-1 ) is factorial. 1 Algorithm 2 (for approximate inference from the TRBM): for 1 t T : 1. sample ht from P (Ht |vt , ht-1 ) 2 4 Recurrent TRBMs Let us start with notation. Consider an arbitrary factorial distribution P (H ). The statement h P (H ) means that h is sampled from the factorial distribution P (H ), so each h(j ) is set to 1 with 2 When t = 1, P (Ht |vt , ht-1 ) stands for P0 (H1 |v1 ), and similarly for the other conditional distributions. The same convention is used in all algorithms. 3 Figure 2: The graphical structure of the RTRBM, Q. The variables Ht are real valued while the variables Ht are binary. v The conditional distribution Q(Vt , Ht |ht-1 ) is given by the equation / Q(vt , ht |ht-1 ) = exp t W ht + vt bV + ht (bH + W ht-1 ) Z (ht-1 ), which is essentially the same as the TRBM's conditional distribution P from equation 5. We will always integrate out Ht and will work directly with the distribution Q(Vt |ht-1 ). The distribution Q(Ht |vt , ht-1 ) is s(W vt +W ht-1 ) (Ht ). Notice that when Vt is observed, Ht cannot affect Ht . probability P (H (j ) = 1), and 0 otherwise. In contrast, the statement h P (H ) means that each h(j ) is set to the real value P (H (j ) = 1), so this is a "mean-field" update [9, 17]. The symbol P stands for the distribution of some TRBM, while the symbol Q stands for the distribution defined by an RTRBM. Note that the outcome of the operation · P (Ht |vt , ht-1 ) is s(W vt + W ht-1 + bH ). T An RTRBM, Q(V1T , H1 ) (fig. 2), is defined by the equation T Q(v1 , hT ) = 1 tT Q(vt |ht-1 )Q(ht |vt , ht-1 ) · Q0 (v1 ). Q0 (h1 |v1 ) (6 ) =2 The terms appearing in this equation will be defined shortly. Let us contrast the generative process of the TRBM and the RTRBM. To sample from a TRBM P , we need to perform a directed pass, sampling from each RBM on every timestep. One way of doing this is described by the following algorithm. Algorithm 3 (for sampling from the TRBM): for 1 t T : 1. sample vt P (Vt |ht-1 ) 2. sample ht P (Ht |vt , ht-1 ) where step 1 requires sampling from the marginals of a Boltzmann Machine (by integrating out Ht ), which involves running a Markov chain. By definition, RTRBMs and TRBMs are parameterized in the same way, so from now on we will assume that P and Q have the same parameters (namely, W, W , bV , bH , and binit ). The following algorithm samples from the RTRBM Q under this assumption. Algorithm 4 (for sampling from the RTRBM) for 1 t T : 1. sample vt P (Vt |ht-1 )3 2. set ht P (Ht |vt , ht-1 ) We can infer that Q(Vt |ht-1 ) = P (Vt |ht-1 ) because of step 1 in Algorithm 4, which is also con sistent with the equation given in figure 2 where Ht is integrated out. The only difference between Algorithm 3 and Algorithm 4 is in step 2. The difference may seem small, since the operations ht P (Ht |vt , ht-1 ) and ht P (Ht |vt , ht-1 ) appear similar. However, this difference significantly alters the inference and learning procedures of the RTRBM. 3 Formally, P (Vt |ht-1 ) and P (Ht |vt , ht-1 ) are defined for only binary ht-1 . However, their defining equation (Eq. 5) is meaningful even when ht-1 is real-valued, which is the case here. 4 4.1 Inference in RTRBMs T Inference in RTRBMs given v1 is very easy, which might be surprising in light of its similarity to the TRBM. The reason inference is easy is similar to the reason inference in square ICAs is easy [1]: There is a unique and an easily computable value of the hidden variables that has a nonzero posterior T probability. Given that V1T = v1 , it follows that v1 was produced at the end of step 1 in Algorithm 4. Since step 2, the deterministic operation h1 P0 (H1 |v1 ), has been executed, the only value h1 can take is the value assigned by the operation · P0 (H1 |v1 ). Any other value for h1 is never produced by a generative process that outputs V1 = v1 and thus has posterior probability 0. In addition, by executing this operation, we can recover h1 . Thus Q0 (H1 |v1 ) = s(W v1 +bH +binit ) (H1 ). Note that T H1 's value is not influenced by v2 . Once h1 is known, we can consider the generative process that produced v2 . As before, since v2 was produced at the end of step 1, then the fact that step 2 has been executed implies that h2 can be computed by h2 P (H2 |v2 , h1 ) (recall that at this point h1 is known with absolute certainty). If the same reasoning is repeated t times, then all of ht is uniquely determined and is easily computed 1 when V1t is known. There is no need for smoothing because Vt and Ht-1 influence Ht with such strength that the knowledge of VtT 1 cannot alter the model's belief about Ht . This also implies that + Q(Ht |vt , ht-1 ) = s(W vt +bH +W ht-1 ) (Ht ). The resulting inference algorithm is simple: Algorithm 5 (inference in RTRBMs) for 1 t T : 1. ht P (Ht |vt , ht-1 ) T Let h(v )T denote the output of the inference algorithm on input v1 , in which case the posterior is 1 described by TT T Q(H1 |v1 ) = h(v)T (H1 ). (7 ) 1 4.2 Learning in RTRBMs Learning in RTRBMs may seem easy once inference is solved, since the main difficulty in learning TRBMs is the inference problem. However, the RTRBM does not allow EM-like learning because h T T T the equation log Q(v1 ) = log Q(v1 , h1 ) T Q(H T |vT ) is not meaningful. To be precise, 1 1 1 T the gradient log Q(v1 , hT ) is undefined because s(W ht-1 +bH +W vt ) (ht ) is not a differentiable 1 function of W . Thus, the gradient has to be computed differently. T t T Notice that the RTRBM's log probability satisfies log Q(v1 ) = t=1 log Q(vt |v1-1 ), so we could T t-1 try computing the sum t=1 log Q(vt |v1 ). The key observation that makes the computation feasible is the equation t Q(Vt |v1-1 ) = Q(Vt |h(v )t-1 ) (8 ) t where h(v )t-1 is the last value computedhby the RTRBM inference algorithm with inputs v1-1 . This t t equation holds because Q(vt |v1-1 ) = t-1 Q(vt |ht-1 )Q(ht-1 |v1-1 )dht-1 = Q(vt |h(v )t-1 ), as t the posterior distribution Q(Ht-1 |v1-1 ) = h(v)t-1 (Ht-1 ) is a point-mass at h(v )t-1 by Eq. 7. t The equality Q(Vt |v1-1 ) = Q(Vt |h(v )t-1 ) allows us to define a recurrent neural network (RNN) [11] whose parameters are identical to those of the RTRBM, and whose cost function is equal to the log likelihood of the RTRBM. This is useful because it is easy to compute gradients with respect to the RNN's parameters using the backpropagation through time algorithm [11]. The RNN has a pair of variables at each timestep, {(vt , rt )}T=1 , where vt are the input variables and rt are the RNN's t T hidden variables (all of which are deterministic). The hiddens r1 are computed by the equation rt = s(W vt + bH + W rt-1 ) (9 ) where, as always, W rt-1 is replaced with binit when t = 1. This definition was chosen so that the T T equation r1 = h(v )1 would hold. The RNN attempts to probabilistically predict the next timestep from its history using the marginal distribution of the RBM Q(Vt+1 |Ht = rt ), so its objective function at time t is defined to be log Q(vt+1 |Ht = rt ), where Q depends on the RNN's parameters 5 in the same way it depends on the RTRBM's parameters (recall that the two sets of parameters being T identical). This is a valid definition of an RNN whose cumulative objective for the sequence v1 is tT O= log Q(vt |Ht-1 = rt-1 ) (1 0 ) =1 T where Q(v1 |H0 = r0 ) = Q0 (v1 ). But since rt as computed in equation 9 on input v1 is identical to t-1 h(v )t , the equality log Q(vt |Ht-1 = rt-1 ) = log Q(vt |v1 ) holds. Substituting this identity into Eq. 10 yields tT tT t T O= log Q(vt |Ht-1 = rt-1 ) = log Q(vt |v1-1 ) = log Q(v1 ) (1 1 ) =1 =1 which is the log probability of the corresponding RTRBM. T This means that O = log Q(v1 ) can be computed with the backpropagation through time algorithm [11], where the contribution of the gradient from each timestep is computed with Contrastive Divergence. 4.3 Details of the backpropagation through time algorithm The backpropagation thorugh time is simply the usual backpropagation algorithm applied to RNNs. Specifically, the algorithm first computes all of rt , and then computes O/ rt , which is a sum of on the contribution from the future timesteps and the contribution from the current timestep. This is done by the recursive equation O/ rt = W (rt+1 . O/ rt+1 ) + W log Q(vt+1 |Ht = rt )/ bH (1 2 ) where a.b denotes component-wise multiplication, the term rt = rt .(1 - rt ) arises from the derivative of the logistic function s (x) = s(x).(1 - s(x)), and log Q(vt+1 |Ht = rt )/ bH is computed by CD. Once O/ rt is computed for all t, the gradients of the parameters can be computed using the following equations tT tT O = vt · (rt . O / rt ) + log Q(vt |Ht = rt-1 )/ W (1 3 ) W =1 =1 O W = tT =2 (rt . O/ rt ) · rt-1 + tT log Q(vt |Ht = rt-1 )/ W (1 4 ) =2 Each of the above derivatives consist of two summations because both W and W are used in two ways (fig. 2). The first summation is the set gradients sent from future timesteps (thanks to Q(Ht |vt , ht-1 )), and the second gradients arise from the use of W and W as the parameters of the distribution of the conditional RBM (namely, Q(Vt |Ht = rt-1 )). Finally, the gradient would be computed exactly if CD were to return the exact gradient of the RBM's log probability. 5 Experiments We report the results of experiments comparing RTRBMs to TRBMs. The results in [14, 13] were obtained using TRBMs that had several delay-taps, so each hidden unit could directly observe several previous timesteps. To demonstrate that the RTRBM learns to use the hidden units to store information, we did not use delay-taps for the RTRBM nor the TRBM, which causes the results to be worse (but not much) than in [14, 13]. If delay-taps are allowed, then the results of [14, 13] show that there is little benefit from the hidden-to-hidden connections (which are W ), making the comparison between the RTRBM and the TRBM uninteresting. In all experiments, the RTRBM and the TRBM had the same number of hidden units, their parameters were initialized in the same manner, and they were trained for the same number of weight updates. When sampling from the TRBM, we would use the sampling procedure of the RTRBM using the TRBM's parameters to eliminate the additional noise from its hidden units. If this is not done, the samples produced by the TRBM are significantly worse. Unfortunately, computing the log probability on a test set is infeasible for both the TRBM and the RTRBM, so we report the squared prediction error for the the TRBM and the RTRBM. The code for the experiments can be found in [www.cs.utoronto.ca/ilya/code/2008/rtrbm.tar.gz]. 6 Figure 3: This figure shows the receptive fields of the first 36 hidden units of the RTRBM on the left, and the corresponding hidden-to-hidden weights between these units on the right: the ith row on the right corresponds to the ith receptive field on the left, when counted left-to-right. Hidden units 18 and 19 exhibit unusually strong hidden-to-hidden connections; they are also the ones with the weakest visible-hidden connections, which effectively makes them belong to another hidden layer. 5.1 Videos of bouncing balls We used a dataset consisting of videos of 3 balls bouncing in a box. The videos are of length 100 and of resolution 30×30. Each training example is synthetically generated, so no training sequence is seen twice by the model which means that overfitting is highly unlikely. The task is to learn to generate videos at the pixel level. This problem is high-dimensional, having 900 dimensions per timestep, and the RTRBM and the TRBM are given no prior knowledge about the nature of the task (e.g., by convolutional weight matrices). Both the RTRBM and the TRBM had 400 hidden units. Samples from these models are provided as videos 1,2 (RTRBM) and videos 3,4 (TRBM). A sample training sequence is given in video 5. All the samples can be found in [www.cs.utoronto.ca/ilya/pubs/2008/rtrbm vid.tar.gz]. The realvalues in the videos are the conditional probabilities of the pixels [13]. The RTRBM's samples are noticeably better than the TRBM's samples; a difference between these samples is that the balls produced by the TRBM moved in a random walk, while those produced by the RTRBM moved in a more persistent direction. An examination of the visible-to-hidden connections of the RTRBM reveals a number of hidden units that are not connected to any visible units. These units have the most active hidden to hidden connections, which must be used to propagate information through time. In particular, these units are the only units that do not have a strong self connection (i.e., Wi,i is not large; see figure 3). No such separation of units is found in the TRBM and all its hidden units have large visible to hidden connections. The mean squared prediction error per pixel per timestep for the RTRBM is 0.007 and is 0.04 for the TRBM. 5.2 Motion capture data We used a dataset that represents human motion capture data by sequences of joint angles, translations, and rotations of the base of the spine. The data was preprocessed to be invariant to isometries [14]. The total number of timesteps in the dataset set was 3826, from which the model learned on subsequences of length 50. Each timestep has 49 dimensions, and both models have 200 hidden units. The data is real-valued, so the TRBM and the RTRBM were adapted to have Gaussian visible variables using equation 2. The samples produced by the RTRBM exhibit less sticking and footskate than those produced by the TRBM; samples from these models are provided as videos 6,7 (RTRBM) and videos 8,9 (TRBM); video 10 is a sample training sequence. Part of the Gaussian noise was removed in a manner described in [14] in both models. The mean squared prediction error per dimension per timestep for the RTRBM is 0.42 and is 0.47 for the TRBM. 5.3 Details of the learning procedures Each problem was trained for 100,000 weight updates, with a momentum of 0.9, where the gradient was normalized by the length of the sequence for each gradient computation. The weights are updated after computing the gradient on a single sequence. The learning starts with CD10 for the first 1000 weight updates, which is then switched to CD25 . The visible to hidden weights, W , were initialized with static CD5 (without using the (R)TRBM learning rules) on 30 sequences (which resulted in 30 weight updates) with learning rate of 0.01 for the first problem and 0.001 for the second, with momentum 0.9. These weights were then given to the (R)TRBM learning procedure, where the learning rates were linearly reduced towards 0. The weights W and the biases were initialized with a sample from spherical Gaussian of standard-deviation 0.005. 7 6 Conclusions In this paper we introduced the RTRBM, which is a probabilistic model as powerful as the intractable TRBM that has an exact inference and an almost exact learning procedure. The common disadvantage of the RTRBM is that it is a recurrent neural network, a type of model known to have difficulties learning to use its hidden units to their full potential [2]. However, this disadvantage is common to many other probabilistic models, and it can be partially alleviated using techniques such as the long short term memory RNN [6]. Acknowledgments This research was partially supported by the Ontario Graduate Scholarship and by the Natural Council of Research and Engineering of Canada. The mocap data used in this project was obtained from http://people.csail.mit.edu/ehsu/work/sig05stf/. For Matlab playback of motion and generation of videos, we have adapted portions of Neil Lawrence's motion capture toolbox (http://www.dcs.shef.ac.uk/neil/mocap/). References [1] A.J. Bell and T.J. Sejnowski. An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation, 7(6):1129­1159, 1995. [2] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157­166, 1994. [3] G.E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation, 14(8):1771­1800, 2002. [4] G.E. Hinton, S. Osindero, and Y.W. Teh. A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 18(7):1527­1554, 2006. [5] G.E. Hinton and R.R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504­507, 2006. [6] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735­1780, 1997. [7] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML 24, pages 473­480, 2007. [8] S. Osindero and G. Hinton. Modeling image patches with a directed hierarchy of Markov random fields. Advances Neural Information Processing Systems, 2008. [9] C. Peterson and J.R. Anderson. A mean field theory learning algorithm for neural networks. Complex Systems, 1(5):995­1019, 1987. [10] L.R. Rabiner. A tutorial on hidden Markov models and selected applications inspeech recognition. Proceedings of the IEEE, 77(2):257­286, 1989. [11] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533­536, 1986. [12] R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of the International Conference on Machine Learning, volume 25, 2008. [13] I. Sutskever and G.E. Hinton. Learning multilevel distributed representations for high-dimensional sequences. Proceeding of the Eleventh International Conference on Artificial Intelligence and Statistics, pages 544­551, 2007. [14] G.W. Taylor, G.E. Hinton, and S. Roweis. Modeling human motion using binary latent variables. Advances in Neural Information Processing Systems, 19:1345­1352, 2007. [15] T. Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. In Proceedings of the International Conference on Machine Learning, volume 25, 2008. [16] M.J. Wainwright, T.S. Jaakkola, and A.S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313­2335, 2005. [17] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. UC Berkeley, Dept. of Statistics, Technical Report, 649, 2003. [18] M. Welling, M. Rosen-Zvi, and G. Hinton. Exponential family harmoniums with an application to information retrieval. Advances in Neural Information Processing Systems, 17:1481­1488, 2005. [19] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. Exploring Artificial Intelligence in the New Millennium, pages 239­236, 2003. 8