Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style

Graham W. Taylor GWTAYLOR@CS.TORONTO.EDU
Geoffrey E. Hinton HINTON@CS.TORONTO.EDU
Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2G4, Canada

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

The Conditional Restricted Boltzmann Machine (CRBM) is a recently proposed model for time series that has a rich, distributed hidden state and permits simple, exact inference. We present a new model, based on the CRBM, that preserves its most important computational properties and includes multiplicative three-way interactions that allow the effective interaction weight between two units to be modulated by the dynamic state of a third unit. We factor the three-way weight tensor implied by the multiplicative model, reducing the number of parameters from O(N^3) to O(N^2). The result is an efficient, compact model whose effectiveness we demonstrate by modeling human motion. Like the CRBM, our model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve the model's ability to blend motion styles or to transition smoothly among them.

1. Introduction

Directed graphical models (or Bayes nets) have been a dominant paradigm in models of static data. Their temporal counterparts, dynamic Bayes nets, generalize many existing models such as the Hidden Markov Model (HMM) and its various extensions. In all but the simplest directed models, inference is made difficult by a phenomenon known as "explaining away", whereby observing a child node renders its parents dependent.

An alternative to approximate inference in directed models is to use a special type of undirected model, the Restricted Boltzmann Machine (RBM) (Smolensky, 1986), that allows efficient, exact inference. The RBM has an efficient, approximate learning algorithm called contrastive divergence (CD) (Hinton, 2002). RBMs have been used in a variety of applications (Hinton & Salakhutdinov, 2006; Salakhutdinov et al., 2007) and their properties have become better understood over the last few years (Welling et al., 2005; Carreira-Perpinan & Hinton, 2005; Salakhutdinov & Murray, 2008). The CD learning procedure has also been improved (Tieleman, 2008).

A major motivation for the use of RBMs is that they can serve as the building blocks of deep belief networks (DBNs), which are learned efficiently by training greedily, layer by layer. DBNs have been shown to learn very good generative models of handwritten digits (Hinton et al., 2006), but they fail to model patches of natural images. This is because RBMs have difficulty capturing the smoothness constraint in natural images: a single pixel can usually be predicted very accurately by simply interpolating its neighbours.

Osindero and Hinton (2008) introduced the Semi-restricted Boltzmann Machine (SRBM) to address this concern. The constraints on the connectivity of the RBM are relaxed to allow lateral connections between the visible units in order to model the pair-wise correlations between inputs, thus allowing the hidden units to focus on modeling higher-order structure. SRBMs also permit deep networks. Each time a new level is added, the previous top layer of units is given lateral connections, so, after the layer-by-layer learning is complete, all layers except the topmost contain lateral connections between units. SRBMs make it possible to learn deep belief nets that model image patches much better, but they still have strong limitations, which can be seen by considering the overall generative model. The equilibrium sample generated at each layer influences the layer below by controlling its effective biases. The model would be much more powerful if the equilibrium sample at the higher level could control the lateral interactions at the layer below through a three-way, multiplicative relationship. Memisevic and Hinton (2007) introduced the gated CRBM, which permitted such multiplicative interactions and was able to learn rich, distributed representations of image transformations.

In this paper, we explore the idea of multiplicative interactions in a different type of CRBM (Taylor et al., 2007). Instead of gating lateral interactions with hidden units, we allow a set of context variables to gate the three types of connections ("sub-models") in the CRBM shown in Fig. 1. Our modification of the CRBM architecture does not change the desirable properties related to inference and learning but makes the model context-sensitive. While our model is applicable to general time series where conditional data is available (e.g. seasonal variables for modeling rainfall occurrences, economic indicators for modeling financial instruments), we apply it to capturing aspects of style in data captured from human motion (mocap). Taylor et al. (2007) showed that a CRBM could capture many different styles with a single set of parameters. Generation of a particular style was based purely on initialization, and the model architecture allowed neither controlled transitions among styles nor style blending. By using style variables to gate the connections of a CRBM, we obtain a much more powerful generative model that permits controlled transitioning and blending. We demonstrate that in a conditional model, gating is superior to simply using labels to bias the hidden units, which is the technique most commonly applied to static models.

This paper is also part of a large body of work on the separation of style and content in motion. The ability to separately specify the style (e.g.
sad) and the content (e.g. walk to location A) is highly desirable for animators. Previous work has looked at applying user-specified style to an existing motion sequence (Hsu et al., 2005; Torresani et al., 2007). The drawback of these approaches is that the user must provide the content. We propose a generative model for content that adapts to stylistic controls. Recently, models based on the Gaussian Process Latent Variable Model (Lawrence, 2004) have been successfully applied to separating content and style in human motion (Wang et al., 2007). The advantage of our approach over such methods is that our model does not need to retain the training dataset (just a few frames for initialization) and is thus suitable for low-memory devices. Furthermore, training is linear in the number of frames, so our model can scale up to massive datasets, unlike the kernel-based methods, which are cubic in the number of frames. The rich, distributed hidden state of our model means that it does not suffer from the limited representational power of HMM-based methods (e.g. Brand & Hertzmann, 2000).

Figure 1. Architecture of the CRBM.

The CRBM has a layer of latent variables, h, connected to a collection of visible variables, v. The visible variables can use any distribution in the exponential family (Welling et al., 2005), but for mocap data, we use real-valued Gaussian units (Freund & Haussler, 1992). At each time step t, v and h receive directed connections from the visible variables at the last N time steps. To simplify the presentation, we will assume the data at t-1, ..., t-N is concatenated into a "history" vector which we call v
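The factorization mentioned in the abstract, replacing a three-way weight tensor with three factor matrices joined through a set of multiplicative factors, can be illustrated with a small NumPy sketch. The dimensions, variable names, and the symmetric three-matrix parameterization below are illustrative assumptions for this sketch, not the paper's exact architecture or training procedure; the point is only the parameter count and the contraction identity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: visible units, hidden units, context (style) units,
# and the number of multiplicative factors.
nv, nh, nz, nf = 8, 12, 3, 6

# A full three-way tensor would need nv * nh * nz parameters: O(N^3).
# Factoring it into three matrices needs (nv + nh + nz) * nf: O(N^2).
Wv = 0.01 * rng.standard_normal((nv, nf))  # visible-to-factor weights
Wh = 0.01 * rng.standard_normal((nh, nf))  # hidden-to-factor weights
Wz = 0.01 * rng.standard_normal((nz, nf))  # context-to-factor weights


def effective_weights(z):
    """Effective visible-hidden weight matrix, modulated by context z.

    Equivalent to contracting the factored three-way tensor
    W[i, j, l] = sum_f Wv[i, f] * Wh[j, f] * Wz[l, f] with z.
    """
    gains = Wz.T @ z              # per-factor gains set by the context
    return (Wv * gains) @ Wh.T    # nv x nh matrix


# Sanity check: the factored product matches the explicit tensor contraction.
W_tensor = np.einsum('if,jf,lf->ijl', Wv, Wh, Wz)
z = rng.standard_normal(nz)
assert np.allclose(effective_weights(z),
                   np.einsum('ijl,l->ij', W_tensor, z))
```

Because the context vector z only rescales the per-factor gains, changing the style variables smoothly re-weights every visible-hidden interaction at once, which is the mechanism that makes controlled blending and transitioning possible without a separate weight matrix per style.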