Unsupervised Hierarchical Modeling of Locomotion Styles

Wei Pan PWAY @ CS . DARTMOUTH . EDU Lorenzo Torresani LORENZO @ CS . DARTMOUTH . EDU Computer Science Department, Dartmouth College, 6211 Sudikoff Lab, Hanover, NH 03755 USA

Abstract
This paper describes an unsupervised learning technique for modeling human locomotion styles, such as distinct related activities (e.g. running and striding) or variations of the same motion performed by different subjects. Modeling motion styles requires identifying the common structure in the motions and detecting stylespecific characteristics. We propose an algorithm that learns a hierarchical model of styles from unlabeled motion capture data by exploiting the cyclic property of human locomotion. We assume that sequences with the same style contain locomotion cycles generated by noisy, temporally warped versions of a single latent cycle. We model these style-specific latent cycles as random variables drawn from a common "parent" cycle distribution, representing the structure shared by all motions. Given these hierarchical priors, the algorithm learns, in a completely unsupervised fashion, temporally aligned latent cycle distributions, each modeling a specific locomotion style, and computes for each example the style label posterior distribution, the segmentation into cycles, and the temporal warping with respect to the latent cycles. We demonstrate the flexibility of the model on several application problems such as style clustering, animation, style blending, and filling in of missing data.

1. Introduction
Modeling human locomotion1 is of fundamental importance for a wide range of applications including gait recognition, diagnosis of movement disorders, analysis of run1 In this paper the term locomotion is used in a wide sense to indicate any cyclic limb motion, such as walking, running, crawling, swimming.

Appearing in Proceedings of the 26 th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

ning efficiency, human tracking in video sequences, and computer animation. In this paper we focus on the specific problem of deriving computational models capable of capturing and representing distinct locomotion styles, corresponding for example to distinct related activities (e.g. walking and running) or variations of the same activity performed by different subjects. Hand-constructing style models is generally not possible due to the subtlety of the style variations and the complexity of the human dynamics. Because of such challenges, several researchers have proposed automatically learning motion style models from human motion examples. Most previously proposed approaches treat motion style modeling as a generic data fitting problem by employing generalpurpose learning models. Examples of such models include Restricted Boltzmann Machines (Taylor et al., 2007), Gaussian Processes (Wang et al., 2007), linear dynamical systems (Brand & Hertzmann, 2000; Li et al., 2002; Chiappa et al., 2009), and nonlinear manifolds (Elgammal & Lee, 2004). General-purpose models fail to exploit relevant prior information, such as the cyclic property of locomotion or the knowledge that different styles of an activity must correspond to subtle variations of a common motion. Recent work (Liu et al., 2005; Urtasun et al., 2008) has shown that incorporating domain-specific prior information in the model yields motion representations that are more accurate and intuitive, and helps reducing the risk of overfitting. This is particularly important in the motion domain, where the data is high-dimensional and training examples are scarce. There are many possible ways to encode prior knowledge in the model. Liu et al. (2005) use body physics constraints. Urtasun et al. (2008) force the learned model to satisfy a specific topological structure. In this paper we propose to encode domain knowledge via hierarchical priors and probability distributions specifically suited to the properties of human locomotion. We employ hierarchical priors to encode the knowledge that distinct locomotion styles must share a common structure. We view each motion style as a random variable drawn from an unknown distribution common to all styles. This common distribution assumption constrains the styles to represent subtle variations around an average motion. Furthermore, we exploit the cyclic nature of the data and learn mod-

Unsupervised Hierarchical Modeling of Locomotion Styles

els representing single cycles of locomotion instead of describing entire periodic sequences. By doing so we make more efficient use of the available data, reduce the number of unknowns in the model, and obtain models that are easy to interpret. We achieve this by inferring cyclic alignment distributions which temporally synchronize the observations and describe each sequence as a concatenation of motions generated by single-cycle models. A distinctive feature of our approach is that the learning is completely unsupervised: while the methods in (Liu et al., 2005; Urtasun et al., 2008) require training examples of a single style or data with user-provided style labels, our algorithm can automatically learn distinct style models from a large pool of unlabeled motion sequences. Our model learns for each style a prototypical, highresolution fusion of the cycles belonging to sequences assigned to that style (to be more precise, each learned style is a PDF describing also how cycle samples can deviate from the prototypical style cycle). We demonstrate that these cycle prototypes can be used to generate vivid animations, visually undistinguishable from real motion. Furthermore, since our algorithm yields motion style prototypes that are time-synchronized, the output of our system is directly usable by motion style blending algorithms (Rose et al., 1998; Kovar et al., 2002) to generate realistic novel motion. We demonstrate this application in our experiments. Our work builds on the Hierarchical Bayesian Continuous Profile Model (HB-CPM) proposed by Listgarten et al. (2007) for detection of differences in time series classes. We generalize this method to the fully unsupervised case, where class labels are not available as input. Thus, our algorithm simultaneously performs clustering, difference detection, and alignment of time-series. HB-CPM was originally proposed to model time-series with class-specific differences corresponding to impulses at rare but systematic locations. However, style differences in human motion are typically extended in time (see Figure 1) and thus an impulse-based model is not appropriate for our purposes. We describe a different set of hierarchical priors, specifically suited to the case of human locomotion. Inference in (Listgarten et al., 2007) is handled by means of a Markov Chain Monte Carlo (MCMC) method. However, in our case a stochastic approach is not computationally practical due to the high-dimensionality of the data. Instead we propose an efficient variational method, which is able to learn motion style models in just a couple of minutes from datasets including up to 39 distinct 62-dimensional time series (compare this performance to the several hours needed by the MCMC algorithm in (Listgarten et al., 2007) to process 21 one-dimensional time series).

2. A Hierarchical Locomotion Model
Our approach to style modeling is inspired by the observation that motions with the same style are characterized by similar body pose sequences albeit with possibly different timing (e.g. walking performed by the same individual at different paces). On the other hand, motions with distinct styles contain significant differences in body poses under any temporal warping. Therefore, we propose a model which groups within the same cluster motions that can be temporally aligned to have similar sequences of body poses. Note that in our approach the clustering, the temporal alignment, and the learning of the underlying pose trajectories are performed simultaneously. We now describe formally our Hierarchical Locomotion Model (HLM). Let Xk = (xk , xk , ..., xk k ) denote the k1 2 N th sequence in a dataset of K locomotion examples sharing a common structure but containing some stylistic variations. Nk indicates the length of the sequence and xk is i a F -dimensional vector encoding the 3D configuration of the body at time i, for example in the form of kinematic joint angles. We assume that the dataset comprises C distinct locomotion styles. We indicate with lk  {1, ..., C} the unknown style label of sequence k, which we assume to be drawn from a Multinomial distribution with parameters  = {1 , ..., C }. We model each style c by means of a hidden variable Zc = (zc , ..., zc ), which we will refer 1 M to as the latent cycle of style c. M denotes the length of the latent cycle and zc is an F -dimensional vector encodm ing the 3D body configuration at time frame m in the cycle. We assume that the cycles of an observed sequence are generated from temporally subsampled versions of a latent cycle. Consequently, we would like M to be much larger than the typical length of a cycle in the observed motions, as this would yield higher-resolution representations of the motions. However, care must be taken to avoid overfitting. Inspired by the choice of this parameter in (Listgarten et al., 2007), we select M = 2N where N is a value provided as input to the system and representing the expected length of a cycle in the training set (note that our system can handle cycle lengths in the observed motions differing considerably from N). We assume that a sequence Xk with motion style c consists of a concatenation of cycles generated by an HMM which moves cyclically and in left-toright order through time samples of latent cycle Zc , and emits noise-corrupted versions of 3D configurations zc . m k -1 In other words, we assume xk  N (zl k , lk ), where i 
k i  {1, ..., M } indicates the HMM state and lk is a diagonal, style-specific, covariance matrix. We denote with k k p(i |i-1 ; dk ) the cyclic, left-to-right transition distribution governing the HMM of sequence k, implemented as follows:
i

Unsupervised Hierarchical Modeling of Locomotion Styles
k k p(i = m|i-1 = n; dk )  k if m - n = 1 or M + m - n = 1  d1   ... =  dk if m - n = J or M + m - n = J  J  0 otherwise

s and c , where c = diag([c , ..., c ]). We regularize 1 f F parameters  and dk via hyperparameters   and  d . (1) Exact inference in our model is analytically intractable, and thus approximation methods need to be employed. Listgarten et al. (Listgarten et al., 2007) applied stochastic approximation (Markov Chain Monte Carlo) to learn a fully-Bayesian Hierarchical Continuous Profile Model in the simpler supervised setting (i.e. when class labels are provided), and for the case of one-dimensional timeseries. However, a stochastic approach in our case is not computationally practical due to the high-dimensionality of our data2 and the more complex setting deriving from the use of unlabeled data. We make the problem tractable by modeling some of the unknowns as parameters and by adopting a variational approach to estimate the distributions of the other unobservables. Specifically, we model k the HMM states {i }, the style labels {lk }, and the latent style cycles {Zc } as hidden variables, which are fully marginalized out during learning. All remaining unobservŻ Ż able   {Z, z , s , 1 , ..., C , , d1 , ..., dK } are treated as parameters estimated using a penalized maximum likelihood framework, with penalties defined via fixed hyperŻ parameters   { z ,   ,  d }. Thus we solve for  to maximize
K

where 1  m, n  M , and J is the maximum transition length expressed in number of frames. The dk denote j J probabilities satisfying the condition j=1 dk = 1. Note j that since we use J << M , the HMM valid transitions 2 are only either left-to-right or from the tail section to the head section of the latent trace (corresponding to moves from state n to state m such that (M + m) - n = j, with j  J ). This latter type of transition is used to model the periodic property of locomotion. Finally, we force the latent cycles to be aligned to one another, and to share a common structure by assuming that each style-specific Zc is a random variable drawn from a distribution encouraging the latent cycle to be temporally smooth and "similar" Ż to a parent cycle Z = (Ż1 , ..., ŻM ) common to all styles. In z z summary, we assume the following generative process for a dataset X = {X1 , ..., XK }:
Ż Ż 1. Z  N (Ż1 ; ŻM ,  z I) z z M z Ż z z m=2 N (Żm ; Żm-1 ,  I)

2. For each style c  {1, ..., C}:
M

Z 

c

N (zc ; zc , -1 I) 1 M s
m=2

N (zc ; zc , -1 I) m m-1 s
M

p(|)p(X|) = p(|)
k=1

p(Xk |)

= p(|)
-1 N (zc ; Żm , z I) m z Ż K C

×
m=1

×
k=1 lk =1  k k

p(Xk ,  k , lk , Z1 , .., ZC |)dZ1 ...dZC (2)

3.   D(  ) 4. For each sequence k  {1, ..., K}: (a) lk  Mult( ) (b) dk  D( d ) k (c)  k  Mult(1 ; 1/Nk ) (d) Xk 
Nk i=1
k i

N (xk ; zl k ,  i 

Nk k k k i=2 p(i |i-1 ; d ) -1 lk

where k denotes the set of all possible HMM paths for sequence k. In the next section we describe the variational method for maximizing this objective.

)

3. Inference and learning
Using Jensen's inequality we obtain the following lower bound LQ on the penalized log likelihood (Jordan et al., 1999): log p()p(X|) 
K C

where D() indicates a Dirichlet distribution, and Mult() a Ż Multinomial distribution. The improper prior on Z (defined Ż via hyperparameter  z ) is used to enforce temporal smoothness of the parent cycle. Note that this encourages also the last frame in the cycle to be similar to the first. The PDF of the latent cycle Zc is the product between a smoothing distribution correlating configurations of consecutive frames and a Gaussian distribution which encourages each latent cycle frame zc to be close to the corresponding parent cym cle frame Żm . The resulting PDF is a multivariate Gausz sian (Listgarten et al., 2007). s and z are precision paŻ rameters controlling the temporal smoothing and the distribution relating the latent cycle to the parent cycle, respectively. We assume uninformative priors for parameters z , Ż

log p() +
k=1 Z1 ...ZC lk =1  k k k k k 1

Q( k , lk , Z1 , ..., ZC ) × log p(X ,  , l , Z , ..., ZC |) 1 dZ ...dZC Q( k , lk , Z1 , ..., ZC ) (3)

 LQ

2 In our experiments the dimensionality of the observed configuration at each frame is 50 or higher.

Unsupervised Hierarchical Modeling of Locomotion Styles

where Q( k , lk , Z1 , ..., ZC ) is an arbitrary distribution. We now assume that this distribution factorizes as follows:
C

Q( k , lk , Z1 , ..., ZC ) = Q( k )Q(lk )
c=1

Q(Zc )

(4)

We maximize the variational bound LQ , subject to the mean field assumption in eq. 4, using the EM algorithm. In the E-step we keep  fixed and estimate the factor distributions maximizing the variational bound. This is done via variational inference as described in the following subsection. In the M-step we maximize the expected complete penalized log likelihood given the hidden variable distributions. This yields closed-form updates for each of the parameters in . The complete penalized log-likelihood for our model is given by Lp = L + P, where L is the loglikelihod term:
K

where <>h denotes expectation with respect to all hidden variables except h. Note that equations (7, 8, 9) are coupled. Thus, a closed-form optimal solution is not possible. However, convergence to the optimal distributions is guaranteed if we iteratively update the distributions by solving each equation using the current estimates of the other factor distributions. The variational update steps are obtained by expanding the expectations on the righthand side of equations (7, 8, 9). For brevity, we write k k  k (c)  Q (lk = c), i (m)  Q (i = m), and k  k k i (m, n)  Q (i = m|i-1 = n). Variational update for Q ( k ) Q ( k ) is updated by applying the forward-backward algorithm (Rabiner, 1989) to an HMM with transition probabilities given by eq. 1 and unnormalized observation loglikelihoods given by: log p(xk |i = m) ~ i k =- 1 2
C

L =
k=1

k log  l + log Mult(1 ) Nk

k

+
i=2 Nk

k k log p(i |i-1 ; dk )

 k (c) (xk - zc )T c (xk - zc ) i m i m
c=1 F

zc m

+
i=1 C

k k log N (xk ; zl k , l i i

-1

M

)


-
f =1

log c + const f

(10)

+
c=1 m=1 C

-1 log N (zc ; Żm , z I) m z Ż

+
c=1 C

-1 log N (zc ; zc , s I) 1 M Ż M -1 log N (zc ; zc , s I) m m-1 Ż c=1 m=2

+ and P is the penalty term: P = = log p(|)

(5)

Variational update for Q (Zc ) c Let zmf be the f -th entry in vector zc with f  {1, ..., F } m where F is the dimensionality of the configuration vector at each frame. It is easy to verify that, according to our c c probabilistic model, the random variables zmf and zmf  for  distinct features f, f  {1..., F } are independent. Thus, F  c,f ) where zc,f = we can write Q (Zc ) = f =1 Q (z
c c z1f , ..., zMf T

. Q (zc,f ) is a multivariate Gaussian dis-

tribution with precision S c,f and mean µc,f . The nonzero c,f entries of S c,f are the diagonal entries Sm,m = 2s +
M

Ż Ż - z ||Ż1 - ŻM ||2 -  z z z m=2 K

||Żm - Żm-1 ||2 z z

+ log D( ;   ) +
k=1

log D(dk ;  d )

(6)

k z + c / k=1 i=1  k (c)i (m) for m = 1, ..., M , f c,f c,f the off-diagonal entries Sm,m+1 = Sm+1,m = -s for c,f c,f m = 1, ..., M - 1, and S1,M = SM,1 = -s . The mean µc,f is given by µc,f = S c,f (z Żf + vc,f ) where Żf = z Żz

K

Nk

[Ż1f , ..., zMf ] and vc,f = c z Ż f

T

PK

k=1

 k (c)

PN k

PK

k=1

 k (c)

k k i=1 i (m)xif PN k k i=1 i (m)

.

3.1. Variational Inference The factor distributions {Q ( )}k=1,..,K , {Q (lk )}k=1,..,K , {Q (Zc )}c=1,..,C maximizing the lower bound LQ must satisfy the following equations (Jordan et al., 1999): log Q ( k ) log Q (lk ) log Q (Zc ) = < L >k + const = < L >lk + const = < L >Zc + const (7) (8) (9)
 k

Variational update for Q (lk ) The class label distribution Q (lk = c) is updated as C Q (lk = c) = k / c=1 k where c c  F  1 M Nk Nk k c c 2 k c =  (f ) exp - i (m)  2 m=1 i=1
f =1

× (xk - zc )T c (xk - zc ) i m i m

zc m

(11)

Unsupervised Hierarchical Modeling of Locomotion Styles

3.2. Parameter updates The parameters  are updated in the M-step. The update for each parameter is computed by setting the corresponding partial derivative of the expected penalized log-likelihood to zero. The update rules are:

Estimating missing data If some sequences contain missing entries, we fill in the unobservable data during the M-step. The idea is to optimize the expected log-likelihood with respect to the missing entries. Let () denote the rows of the missing entries in frame xk . The update rule is: i
C -1

(xk ) i z  CM F/  Ż 
C M F c (zm,f - zm,f )2 Ż c=1 m=1 f =1
c zm,f


c=1 C

 (c)( )
M

k

c 

  (12)

×
c=1

 k (c)
m=1

k i (m)(c ) (µc,m )(18)

where (c ) is the square matrix sub-block corresponding to the missing entries. s  CM F/  
C M F c c (zm,f - zm1,f )2 c=1 m=1 f =1
zc m,f zc m1,f

 
4. Experiments
Data preprocessing We evaluated our method on several sets of motion capture sequences from the CMU Graphics Lab Motion Capture Database3 . The data is represented in the form of Euler joint angles parameterized so as to avoid discontinuities. The configuration at each frame is a 62-dimensional vector. When generating animations, in addition to the joint angles, we used the 3D global translations of the body in the form of frame-to-frame 3D displacements of a root marker. In our experiments we have investigated the usefulness of PCA as a preprocessing step to reduce the dimensionality of the data. We have found that, while eliminating the last few principal components is generally beneficial, using fewer than 50 PCA dimensions results consistently in lower performance for all methods. Thus, here we report results obtained by applying PCA to each dataset and using only the first 50 dimensions. Furthermore, we show that HLM works equally well without this preprocessing by including also the results obtained by our method without the use of PCA. Comparison We consider the following algorithms in our comparison: · HLM: this is the novel hierarchical model described in this paper. We initialize the vectors Żm by linearly inz terpolating the first M/2 frames of a sequence randomly chosen from the training set. For all styles c, c was f initially set equal to the sample precision of the f -th coordinate of the data. z was initially set to 0.01 times Ż the sample precision of all the data coordinates xk , and i,f s was set to 0.1. The parameters dk were initialized j Ż to 1/J , with J = 3. The hyperparameters  z ,   ,  d  were all set equal to 0.1. We initialized Q (lk ) by adding small random noise to an equal-probability distribution over the labels. Finally, we initialized Q (zc ) by setm ting µc,m equal to Żm and sc,m equal to c . Q ( k ) was z f f then estimated from these initializations. In our experiments we kept z fixed to its initial value, since doing so Ż
3

(13)

K

K

c  f
k=1

 k (c)N k /
k=1 Nk M

 k (c)   (14)

×
i=1 m=1

k c i (m) (xk - zm,f )2 i,f

c zm,f

K

C

K  (c +

c 

 c + k=1

 k (c) /

 k (c ))
k=1

c =1

(15)

 Nk M 1 d k j + dj  q i=2 m=1

nTj (m)

k i (m, n)


(16)

where Tj (m) = {n  {1, ..., M } s.t. m - n = j or M + m - n = j}, q is a constant enforcing the constraint J k j=1 dj = 1, and m  1 = (m - 1) if m > 1, m  1 = M otherwise. In order to update the parent cycle we solve the linear system of M equations given by:  L Żm z
C

=

z Ż
c=1

(< zc > -Żm ) z m

Ż + z (2Żm-1 - 4Żm + 2Żm+1 ) = 0 (17) z z z

for m = 2, ..., M - 1, and by the two analogous equations corresponding to cases m = 1, m = M .

Available at http://mocap.cs.cmu.edu/

Unsupervised Hierarchical Modeling of Locomotion Styles

100

100

2nd component (aligned by K-CPM)

100 80 60 40 20 0 -20 -40 -60 -80 -100

100

2nd component (aligned by FAM)

80

80 60 40 20 0 -20 -40 -60 -80 -100

2nd component (aligned by HLM)
0 50 100 150 200 250

80 60 40 20 0 -20 -40 -60 -80 -100

2nd component (raw data)

60 40 20 0 -20 -40 -60 -80 -100

(a)

0

10

20

30

40

50

60

70

80

90

time

(b)

20

40

60

80

100

120

time

(c)

time

(d)

0

20

40

60

80

100

120

140

time

Figure 1. (a) Time series corresponding to the second PCA components of the 39 motions. Sequences having the same label in the CMU database are drawn with the same color. (b) The time series aligned by FAM. (c) Viterbi alignment computed using the CPM model. (d) Time-series warped according to the Viterbi alignment derived from our HLM model. Our algorithm successfully aligns all time series.

produced better results. We believe that our datasets are too small to be able to estimate this parameter reliably. We found that our variational algorithm generally converges very rapidly to a minimum. In our experiments we used 15 or fewer EM iterations, with 3 variational inference updates in each E-step. · CPM: we modified the original Continuous Profile Model (Listgarten et al., 2005) to handle the cyclic nature of our data by using the transition distribution given by eq. 1. This algorithm aligns all sequences with respect to a single learned latent cycle. Finally, for each sequence it produces a prototypical cycle of length M obtained by averaging the motion warped according to the Viterbi alignment over the multiple observed cycles. Note that HLM differs from CPM in several ways. CPM aligns the data without performing clustering. In HLM each example is aligned with respect to style-specific latent cycles, which in turn are forced to be aligned with respect to the parent cycle. As we show in our experiments, this hierarchy leads to more accurate alignment than when warping all examples with respect to a single, generic cycle. Furthermore, our approach learns for each style a full PDF rather than a single point estimate. · FAM: this algorithm uses the Functional Analysis Model described in (Ormoneit et al., 2005) to warp and segment the observed sequences into aligned cycles. As HLM, this model has the ability to fill-in missing data. · LGSSM: this implements the algorithm described in (Chiappa et al., 2009). This method performs clustering of motions using a Bayesian Mixture of Linear Gaussian State-Space Models. As in HLM, the clustering and the model learning are done at the same time. CPM and FAM produce for each sequence a prototypical cycle summarizing the motion. Thus, such methods can be naturally extended to model styles by applying a clustering algorithm to the aligned prototypical cycles. Here we evaluate these algorithmic extensions by running K-means on the prototypical cycles produced by CPM and FAM. The cluster centroids are finally used as style-prototypes. We denote with K-CPM and K-FAM the algorithms obtained by combining K-means with CPM and FAM, respectively.

4.1. Learning motion styles In this section we show that HLM can be used to discover styles from a pool of motions, and to generate novel animations for each learned style. For this experiment we used a dataset of 39 locomotion sequences, taken from CMU subject categories 07, 08, 09, and 35. This set contains regular walking sequences performed by different subjects, as well as examples of striding and running. Figure 1(a) shows the second PCA component of each sequence plotted as a function of time. Sequences having the same motion label in the CMU database are plotted with the same color (there are four distinct CMU labels in this set). Note, however, that these labels are not provided to the algorithm, and that the motion styles are learned in a fully unsupervised way. We trained HLM and our "cyclic" version of CPM on this dataset. After training, we aligned the time series using the maximum likelihood HMM state path computed by applying the Viterbi algorithm (Rabiner, 1989) to both learned models. The results are shown in Figure 1(c) and (d) for CPM and HLM, respectively. Figure 1(b) shows the sequences aligned by FAM. There are significant alignment errors with the FAM and CPM models, while HLM synchronizes the time-series successfully, even sequences having noticeably different characteristics. We also evaluated the quality of the clusterings by using the CMU style labels as ground truth data. The plot in Figure 2 reports the average cluster purity4 obtained for different values of C. The purity values are computed by averaging the results over 50 runs for each algorithm. HLM yields consistently the best clustering results. Note that our algorithm performs roughly the same when applied to raw joint angles as opposed to data obtained from PCA. This indicates robustness to noise and high-dimensionality. Here K-FAM performs slightly better than K-CPM, but worse than HLM. As illustrated in this Figure, LGSSM produces very poor clustering results.
4 The average purity is the weighted sum of the individual cluster purities, with weights proportional to the cluster sizes and normalized to sum to 1. The purity of a cluster is the fraction of motions in the majority ground truth class assigned to that cluster.

Unsupervised Hierarchical Modeling of Locomotion Styles
1

0.9

average cluster purity

0.8

0.7

0.6

0.5

HLM w/o PCA HLM w/ PCA K-CPM K-FAM LGSSM

0.4

0.3

3

4

5

6

7

number of classes

Figure 2. Average cluster purity obtained with LGSSM, K-FAM, K-CPM, and HLM (with both raw data and data processed by PCA) for different values of C. The figure includes error bars.

Figure 3. An illustration of the animation generated by blending learned latent cycles corresponding to "taichi" and "striding". Notice the varying walking style.

the course of the sequence. The result is a realistic animation containing style transitions (see video at http: //www.cs.dartmouth.edu/hlm). 4.3. Filling in missing data Despite the recent advances in technology, motion capture systems today are still prone to the problem of marker dropouts, i.e. markers that are lost by the tracker due to noise or occlusion. This problem is typically addressed in a post-processing stage via interactive software using interpolation methods. In contrast, our model can exploit the correlation among the joint angles and reconstruct missing data in a fully automatic way from the available data. For this experiment, we used the 39-sequence dataset previously introduced and trained the models on the raw joint angles. We selected a new walking sequence, not included in our original training set, from CMU subject category 7. We left the first half of this sequence unchanged. However, we eliminated 48 joint angles, corresponding to all the degrees of freedom of the upper body, including arms, the head, and the hips (note that this affects also the leg configuration) from the entire second half of the example. We use the update in eq. 18 to predict the missing entries. During inference we used the previously learned model to estimate the style label and the HMM state distributions for the new sequence. Note that the M-step update for missing data can also be used to fill-in unobservable entries in the training data during learning, although here we do not test such case. We compare our approach for handling missing entries with FAM, and a simple solution based on nearest neighbor (NN), as in (Taylor et al., 2007): for each frame containing missing entries, we find the most similar body configuration in the training set (in terms of Euclidean distance) and copy from it the data corresponding to the unobserved angles. We show reconstruction of missing joint angles using HLM and NN in Figure 4. The mean squared error per joint is 23.9 when using NN, and 9.3 with HLM. The motion filled in with HLM appears real, while the NN and FAM reconstructions look unnatural (see Figure 5). We found that the FAM model can effectively handle only cases where the number of missing entries is very small. On our challenging experiment FAM yields a reconstruction error greater than 200.

Examples of animations generated with the different models can be viewed at http://www.cs.dartmouth. edu/hlm. The animations show the means of the stylespecific latent cycles obtained when setting C = 4. The motions learned by HLM are very realistic and stylistically distinct. One of the four learned styles clearly corresponds to running, and one to striding. The remaining two motions are different styles of walking. In contrast, the motions produced with K-CPM and K-FAM are noisy and jittery, possibly due to the inaccurate alignment. Furthermore, they appear to mix together different styles. We found that motions generated with the LGSSM model deviated considerably from the original motions, particularly as time progressed. This problem occurs even when the method clusters the sequences correctly, as this model generates the configuration at each time by using the estimate at the previous time step, and thus propagates errors over time. It typically generates body poses inconsistent with the training set within 20 frames. 4.2. Style blending Several authors (Rose et al., 1998; Kovar et al., 2002; Grochow et al., 2004; Torresani et al., 2007) have proposed methods that generate novel styles by interpolating (or blending) corresponding body-poses taken from stylistically different sequences. Our algorithm can be used to establish these pose correspondences. Our approach aligns all sequences together and thus it can even support multiway (as opposed to pairwise) interpolation. Furthermore, our approach learns synchronized latent cycle distributions. Consequently, novel styles can be directly generated from them, for example by interpolating the latent cycle expectations µc of different styles. Here we demonstrate this application using a dataset of six sequences, comprising three distinct styles: taichi walking, striding, and regular walking (note that as usual the style labels have not been provided to the algorithm). Figure 3 shows the result of blending together two of the learned latent cycles. The blending is obtained by computing ((m)µc1 + (1 - (m))µc2 ), m m where (m) is a value varying smoothly in [0, 1] over

Unsupervised Hierarchical Modeling of Locomotion Styles

joint angle of the right humerus

-30 -40 -50 -60 -70 -80 -90

6.5

joint angle of the thorax

Nearest Neighbour HLM Original Data

6 5.5 5 4.5 4 3.5 3 2.5

Nearest Neighbour HLM Original Data
0 20 40 60 80

0

20

40

60

80

2

time

time

Figure 4. Filling in of right humerus (left) and thorax (right) joint angles. The original data is shown together with reconstructions computed with HLM and NN.

Figure 5. Reconstruction of motion with missing entries using FAM (left), NN (center), and HLM (right) to fill in the unobserved data. Note the jittery motion traces created by FAM and NN.

5. Conclusions
We have described a novel unsupervised method for learning locomotion styles using hierarchical priors and shown its versatility with a variety of applications. We have demonstrated that our algorithm outperforms state-of-theart methods on the tasks of animation, style clustering, and filling in missing data. Our system uses a new efficient variational method which can infer the distributions of the model in a few minutes even when applied to large datasets of motions. Currently, our algorithm requires the number of classes as input. Future research will investigate Bayesian approaches for model selection. We are also interested in extending our model to represent motions with non-cyclic properties, such as turning or bending. More complex hierarchies and distributions may be needed to model these combinations of styles. Furthermore, we would like to study how our model can be adapted to create animations that simultaneously satisfy user-specified constraints and exhibit the styles learned during training. Although in this paper we have focused on the problem of modeling locomotion styles, we believe that our hierarchical approach can be applied effectively to model timeseries in many other domains.

Kovar, L., Gleicher, M., & Pighin, F. (2002). Motion graphs. ACM Trans. on Graphics, 21, 473­482. Li, Y., Wang, T., & Shum, H.-Y. (2002). Motion texture: A two-level statistical model for character motion synthesis. ACM Trans. on Graphics, 21, 465­472. Listgarten, J., Neal, R. M., Roweis, S. T., & Emili, A. (2005). Multiple alignment of continuous time series. In Adv. in Neural Inform. Proc. Systems 17, 817­824. Listgarten, J., Neal, R. M., Roweis, S. T., Puckrin, R., & Cutler, S. (2007). Bayesian detection of infrequent differences in sets of time series with shared structure. In Adv. in Neural Inform. Proc. Systems 19, 905­912. Liu, K., Hertzmann, A., & Popovic, Z. (2005). Learning physics-based motion style with nonlinear inverse optimization. ACM Trans. on Graphics, 24, 1071­1081. Ormoneit, D., Black, M., Hastie, T., & Kjellstr¨ m, H. o (2005). Representing cyclic human motion using functional analysis. Image and Vision Comp., 1264­1276. Rabiner, L. R. (1989). A tutorial on HMMs and selected applications in speech recognition. Proc. IEEE, 77. Rose, C., Cohen, M., & Bodenheimer, B. (1998). Verbs and adverbs: multidimensional motion interpolation. IEEE Computer Graphics and Application, 18, 32­40. Taylor, G. W., Hinton, G. E., & Roweis, S. T. (2007). Modeling human motion using binary latent variables. In Adv. in Neural Inform. Proc. Systems 19, 1345­1352. Torresani, L., Hackney, P., & Bregler, C. (2007). Learning motion style synthesis from perceptual observations. In Adv. in Neural Inform. Proc. Systems 19, 1393­1400. Urtasun, R., Fleet, D. J., Geiger, A., Popovic, J., Darrell, T., & Lawrence, N. D. (2008). Topologically-constrained latent variable models. Proc. Int. Conf. Machine Learning (pp. 1080­1087). Wang, J. M., Fleet, D. J., & Hertzmann, A. (2007). Multifactor gaussian process models for style-content separation. Proc. Int. Conf. Machine Learning (pp. 975­982).

References
Brand, M., & Hertzmann, A. (2000). Style machines. Proc. of SIGGRAPH (pp. 183­192). Chiappa, S., Kober, J., & Peters, J. (2009). Using bayesian dynamical systems for motion template libraries. In Adv. in Neural Inform. Proc. Systems 21, 297­304. Elgammal, A. M., & Lee, C.-S. (2004). Separating style and content on a nonlinear manifold. Proc. of Comp. Vision Pattern Recogn. (pp. 478­485). Grochow, K., Martin, S. L., Hertzmann, A., & Popovi´ , Z. c (2004). Style-based inverse kinematics. ACM Trans. on Graphics, 23, 522­531. Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37, 183­233.