A Probabilistic Model for Generating Realistic Lip Movements from Speech

Gwenn Englebienne, School of Computer Science, University of Manchester, ge@cs.man.ac.uk
Tim F. Cootes, Imaging Science and Biomedical Engineering, University of Manchester, Tim.Cootes@manchester.ac.uk
Magnus Rattray, School of Computer Science, University of Manchester, magnus.rattray@manchester.ac.uk

Abstract

The present work aims to model the correspondence between facial motion and speech. The face and sound are modelled separately, with phonemes serving as the link between the two. We propose a sequential model and evaluate its suitability for generating facial animation from a sequence of phonemes, which we obtain from speech. We evaluate the results both by computing the error between generated sequences and real video, and with a rigorous double-blind test with human subjects. Experiments show that our model compares favourably to existing methods and that the generated sequences are comparable to real video sequences.

1 Introduction

Generative systems that model the relationship between face and speech offer a wide range of exciting prospects. Models combining speech and face information have been shown to improve automatic speech recognition [4]. Conversely, generating video-realistic animated faces from speech has immediate applications in the games and movie industries. There is a strong correlation between lip movements and speech [7, 10], and there have been multiple attempts at realistically generating an animated face to match given speech [2, 3, 9, 13]. Studies have indicated that speech may be informative not only of lip movement but also of movement in the upper regions of the face [3]. Incorporating speech therefore seems crucial to the generation of true-to-life animated faces.

Our goal is to build a generative probabilistic model capable of generating realistic facial animations in real time, given speech. We first use an Active Appearance Model (AAM [6]) to extract features from the video frames. The AAM itself is generative and allows us to produce video-realistic frames from the features. We then use a Hidden Markov Model (HMM [12]) to align phoneme labels to the audio stream of video sequences, and use this information to label the corresponding video frames. We propose a model which, when trained on these labelled video frames, is capable of generating new, realistic video from unseen phoneme sequences. Our model is a modification of Switching Linear Dynamical Systems (SLDS [1, 15]) and we show that it performs better at generation than other existing models. We compare its performance to two previously proposed models by comparing the sequences they generate to a gold standard, features from real video sequences, and by asking volunteers to select the "real" video in a forced-choice test. The results of the human evaluation of our generated sequences are extremely encouraging. Our system performs well with any speech, and since it can easily handle real-time generation of the facial animation, it brings a realistic-looking, talking avatar within reach.

2 The Data

We used sequences from the freely available on-line news broadcast Democracy Now! The show is broadcast every weekday in a high-quality MP4 format, and as such constitutes a constant source of new data. The text transcripts are available on-line, greatly facilitating the training of a speech recognition system.
We manually extracted short video sequences of the news presenter talking (removing any inserts, telephone interviews, etc.), cutting at "natural" positions in the stream, viz. during pauses for breath and silences. The sequences are all of the same person, albeit recorded on different days within a period of slightly more than a month. There was no reason to restrict the data to a single person other than the difficulty of obtaining sequences of similar quality from other sources. All usable sequences were extracted from the data, that is, those where the face of the speaker was visible and the sound was not corrupted by external sound sources. The sequences do include hesitations, corrections, incomplete words, noticeable fatigue, breath, swallowing, etc. The speaker visibly makes an effort to speak clearly, but obviously makes no effort to reduce head motion or facial expression, so the data is probably as representative of the problem as can be hoped for. In total, sequences totalling 1 hour and 7 minutes of video were extracted and annotated (the data is publicly available at http://www.cs.manchester.ac.uk/ai/public/demnow).

The data was split into independent training and test sets for a 10-fold cross-validation, based on the number of sequences in each set (rather than the total amount of data). This resulted in training sets of an average of 60 minutes of data and test sets of approximately 7 minutes. All models evaluated here were trained and tested on the same data sets.

Sound features and labelling. The sequences are split into an audio and a video stream, which are treated separately (see Figure 1). From the sound stream, we extract Mel Frequency Cepstrum Coefficients (MFCC) at a rate of 100 Hz, using tools from the HMM Tool Kit [16], resulting in 13-dimensional feature vectors. We train an HMM on these MFCC features and use it to align phonetic labels to the sound. This is an easier task than unrestricted speech recognition, and is done satisfactorily by a simple HMM with monophones as hidden states, where mixtures of Gaussian distributions model the emission densities. The sound samples are labelled with the Viterbi path through the HMM that was "unrolled" with the phonetic transcription of the text. The labels obtained from the sound stream are then used to label the corresponding video frames. The difference in rate (the video is processed at 29.97 frames per second while MFCC coefficients are computed at 100 Hz) is handled by simple voting: each video frame is labelled with the phoneme that labels most of the corresponding sound frames.

Figure 1: Combining sound and face.

Face features. The feature extraction for the video was done using an Active Appearance Model (AAM [6]). The AAM represents both the shape and the texture of an object in an image. The shape of the lower part of the face is represented by the location of 23 points on key features of the eyes, mouth and jaw-line (see Figure 2). Given the position of the points in a set of training images, we align them to a common co-ordinate frame and apply PCA to learn a low-dimensional linear model capturing the shape change [5]. The intensities across the region in each example are warped to the mean shape using a simple triangulation of the region (Figure 2), and PCA is applied to the vectors of intensities sampled from each image. This leads to a low-dimensional linear model of the intensities in the mean frame. Efficient algorithms exist for matching such models to new images [6]. By combining the shape and intensity models, a wide range of convincing synthetic faces can be generated [6]. In this case a 32-parameter model proves sufficient. This is closely related to eigenfaces [14] but gives far better results, as shape and texture are decoupled [8]. Since the AAM parameters are a low-dimensional linear projection of the original object, projecting those parameters back to the high-dimensional space allows us to reconstruct the modelled part of the original image.

Figure 2: The face was modelled with an AAM. A set of training images is manually labelled as shown in the two leftmost images. A statistical model of the shape is then combined with a model of the texture within the triangles between feature points. Applying the model to a new image results in a vector of coefficients, which can be used to reconstruct the original image.
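To make the projection step concrete, the following minimal sketch builds a linear shape model by PCA on aligned landmark coordinates and shows the projection of a shape to a low-dimensional parameter vector and back. This is an illustrative sketch, not the authors' code: the array names, the use of an SVD and the stand-in data are assumptions, and the full AAM additionally models the texture within the triangulated region and uses the matching algorithms of [6].

```python
import numpy as np

def fit_shape_model(shapes, n_components):
    """Fit a linear (PCA) shape model.

    shapes: (N, 2*K) array of N training shapes, each the x,y coordinates
            of K landmarks already aligned to a common co-ordinate frame.
    Returns the mean shape and the principal modes of shape variation.
    """
    mean_shape = shapes.mean(axis=0)
    centred = shapes - mean_shape
    # SVD of the centred data gives the principal components.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    modes = vt[:n_components]          # (n_components, 2*K)
    return mean_shape, modes

def to_params(shape, mean_shape, modes):
    """Project an aligned shape onto the model parameters."""
    return modes @ (shape - mean_shape)

def to_shape(params, mean_shape, modes):
    """Reconstruct a shape from its parameter vector."""
    return mean_shape + modes.T @ params

# Example with random stand-in data: 23 landmarks, 100 training shapes.
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 46))
mean_shape, modes = fit_shape_model(train, n_components=10)
params = to_params(train[0], mean_shape, modes)
reconstruction = to_shape(params, mean_shape, modes)
```

The same project-and-reconstruct pattern is what makes the AAM features convenient here: the low-dimensional parameter vectors are what the sequence models below operate on, and projecting them back yields video-realistic frames.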
3 Modelling the dynamics of the face

We model the face using only phoneme labels to capture the shared information between speech and face. We use 41 distinct phoneme labels, two of which are reserved for breath and silence, the rest being the generally accepted phonemes of the English language.

Most earlier techniques that use discrete labels to generate synthetic video sequences use some form of smooth interpolation between key frames [2, 9]. This requires finding the correct key frames, and lacks the flexibility of a probabilistic formulation. Brand uses an HMM where Gaussian distributions are fitted to a concatenation of the data features and "delta" features [3]. Since the distribution is fitted to both the features and the differences between features, the resulting "distribution" cannot be sampled, as doing so would produce a nonsensical mismatch between features and delta features. It is therefore not genuinely generative, and obtaining new sequences from the model requires solving an optimisation problem. Under Brand's approach, new sequences are obtained by finding the most likely sequence of observations for a set of labels. This is done by setting the first derivative of the likelihood with respect to the observations to zero, resulting in a set of linear equations involving, at each time t, the observation y_t^s and the previous observation y_{t-1}^s. Such a set of linear equations can be solved relatively efficiently thanks to its block-band-diagonal structure. This requires the storage of O(d^2 T) elements and O(d^3 T) time to solve, where d is twice the dimensionality of the face features and T is the number of frames in a sequence. This becomes non-trivial for sequences exceeding a few tens of seconds. More important, however, is that it cannot be done in real time, as the last label of the sequence must be known before the first observation can be computed.

In this work, we consider more standard probabilistic models of sequential data, which are genuinely generative. These models are shown to outperform Brand's approach for the generation of realistic sequences.

Switching Linear Dynamical Systems. Before introducing the SLDS, we introduce some notational conventions. We have a set of S video sequences, which we index with s ∈ [1 . . . S]. The feature vector of the frame at time t in video sequence s is denoted y_t^s ∈ R^d, and the complete set of feature vectors for that sequence is denoted {y}_1^{T_s}, where T_s is the length of the sequence. Continuous hidden variables are denoted x and discrete state labels are denoted ℓ, where ℓ ∈ [1 . . . L] and L is the number of distinct states.
In an SLDS, the sequence of observations {y}_1^{T_s} is modelled as a noisy version of a hidden sequence {x}_1^{T_s} which depends on a sequence of discrete labels {ℓ}_1^{T_s}. Each state ℓ is associated with a transition matrix A_ℓ and with distributions for the output noise v and the process noise w, such that

y_t^s = B_{ℓ_t} x_t^s + v_t,    x_1^s ~ N(µ_{ℓ_1}, Σ_{ℓ_1}),    x_t^s = A_{ℓ_t} x_{t-1}^s + ν_{ℓ_t} + w_t    for 2 ≤ t ≤ T_s.

Both the output noise v_t and the process noise w_t are normally distributed with zero mean: v_t ~ N(0, R_{ℓ_t}) and w_t ~ N(0, Q_{ℓ_t}). The states in our application are the phonemes, which are obtained from the sound.

Figure 3: Graphical representation of the different models: figure (a) depicts the dependencies in an SLDS when the labels are known and (b) represents our proposed DPDS, where we assume the process is noiseless. The circles are discrete and the squares are multivariate continuous quantities. The shaded elements are observed and the random variables in the dashed box are conditioned on the quantities outside of it.

Notice that in general, when the state labels are not known, computing the likelihood in an SLDS is intractable, as it requires the enumeration of all possible state sequences, which is exponential in T [1]. In our case, however, the state label ℓ_t^s of each frame is known from the sound, and the likelihood can be computed with the same algorithm as for a standard Linear Dynamical System (LDS), which is linear in T. Parameter optimisation can therefore be carried out efficiently with a standard EM algorithm. Also note that neither the SLDS nor the LDS is commonly described with an explicit state bias ν_{ℓ_t}, as it can easily be emulated by augmenting each latent vector x_t^s with a 1 and incorporating ν_ℓ into A_ℓ. However, doing so prevents us from using a diagonal matrix for A_ℓ, and experience has shown that the state mean is crucial to good prediction, while the lack of sufficient data or, as is the case with our data, the a priori known approximate independence of the data dimensions may make the reduction of the complexity of A_ℓ, Q_ℓ and R_ℓ warranted.

In this form, the model is over-parametrised; it can be simplified without any loss of generality either by fixing Q_ℓ to the identity matrix I or, if there is no reason to use a different dimensionality for x and y, by setting B_ℓ = I. We did the latter, as this makes the resulting {x}_1^T easier to interpret and compare across the different models we evaluate here.

We trained an SLDS by maximum likelihood and used the model to generate new sequences of face observations for given sequences of labels. This was done by computing the most likely sequence of observations for the given set of labels. An in-depth evaluation of the trained SLDS model, when used to generate new video sequences, is given in Section 4. This evaluation shows that the SLDS is overly flexible: it appears to explain the data well and results in a very high likelihood, but does a poor job at generating realistic new sequences.
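To make the SLDS generative process concrete, the sketch below draws one observation sequence by ancestral sampling with known labels and B_ℓ = I. It is a minimal sketch, not the authors' implementation; the parameter containers (per-state arrays or dicts indexed by label) are assumptions for illustration, and the most likely sequence used for evaluation is obtained differently, by the mean forward recursion described in the "Sequence generation" paragraph below.

```python
import numpy as np

def sample_slds(labels, mu1, Sigma1, A, nu, Q, R, rng=None):
    """Draw one observation sequence from an SLDS with known state labels.

    labels: length-T sequence of integer phoneme labels.
    mu1, Sigma1: per-state initial mean / covariance, indexed by label.
    A, nu, Q, R: per-state transition matrix, bias, process-noise and
                 output-noise covariances (B is taken to be the identity).
    """
    rng = rng or np.random.default_rng()
    T = len(labels)
    d = mu1[labels[0]].shape[0]
    x = np.empty((T, d))
    y = np.empty((T, d))
    l0 = labels[0]
    x[0] = rng.multivariate_normal(mu1[l0], Sigma1[l0])
    y[0] = rng.multivariate_normal(x[0], R[l0])
    for t in range(1, T):
        lt = labels[t]
        # Latent dynamics: x_t = A_l x_{t-1} + nu_l + process noise.
        x[t] = rng.multivariate_normal(A[lt] @ x[t - 1] + nu[lt], Q[lt])
        # Observation: y_t = x_t + output noise (B_l = I).
        y[t] = rng.multivariate_normal(x[t], R[lt])
    return y
```

Sampling like this exposes why the fitted SLDS can have a high likelihood yet generate poor sequences: the large output and process noise terms that explain the data well also dominate the samples.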
Deterministic Process Dynamical System. We reduced the complexity of the model by simplifying its covariance structure. If we set the output noise v_t of the SLDS to zero, leaving only process noise, we obtain the autoregressive hidden Markov model [11]. That model has the advantage that it can be trained using an EM algorithm when the state labels are unknown, but we find that it performs very poorly at data generation. If we instead set the process noise w_t = 0, we obtain a more useful model. The complete hidden sequence {x}_1^T is then determined exactly by the labels {ℓ}_1^T. The log-likelihood p({y} | {ℓ}) is given by

log p({y} | {ℓ}) = -1/2 Σ_{s=1}^{S} [ log |Σ_{ℓ_1}| + (y_1^s − x_1^s)^T Σ_{ℓ_1}^{-1} (y_1^s − x_1^s) + Σ_{t=2}^{T_s} ( log |R_{ℓ_t}| + (y_t^s − x_t^s)^T R_{ℓ_t}^{-1} (y_t^s − x_t^s) ) + d T_s log 2π ]    (1)

where x_1^s = µ_{ℓ_1} and x_t^s = A_{ℓ_t} x_{t-1}^s + ν_{ℓ_t} for t > 1. We will refer to this model as the Deterministic Process Dynamical System (DPDS, see Figure 3). In our implementation we model all matrices R_ℓ and Σ_ℓ as diagonal, and further reduce the complexity by sharing the output noise covariance over all states. It is reasonable to assume this because the features are the result of PCA and are therefore uncorrelated.

Figure 4: Comparison of the multiple models on the test data of 10-fold cross-validation: (a) mean L1 distance, (b) RMS error, (c) mean L∞ distance, (d) log-likelihood. Each plot shows the mean error of the generated data with respect to the real data over the 10 folds. The error bars span the 95% confidence interval of the true error.

Since in this case the labels ℓ_t^s are known, equation (1) does not contain any hidden variables, so applying EM is not necessary. Deriving a closed-form solution for the ML estimates of the parameters, however, requires solving polynomial equations of order T_s, because x_t^s = f(A_{ℓ_2} · · · A_{ℓ_t}). An efficient alternative is to use a gradient-based method. The log-likelihood of a sequence is a sum of scaled quadratic terms in (y_t^s − x_t^s), where x_t^s = f({ℓ}_1^t). The log-likelihood must thus be computed by a forward iteration over all time steps t, using x_{t-1}^s to compute x_t^s. The gradients of the likelihood with respect to A_ℓ can be computed numerically in a similar fashion, by applying the chain rule iteratively at each time step and storing the result for the next step. The same could be done for the other parameters; however, for given values of A_ℓ, the values of µ_ℓ, ν_ℓ and R that maximise the likelihood can be computed exactly by solving a set of linear equations. This markedly improves the rate of convergence. An algorithm for the computation of the gradients with respect to A_ℓ and the exact evaluation of the other parameters is given in Appendix A.

Sequence generation. Since all models parametrise the distribution of the data, we can sample from them to generate new observation sequences. In order to evaluate the performance of the models and compare it to Brand's model, however, it is useful to generate the most likely sequence of observation features for a sequence of labels and to compare it with the features of the corresponding real video sequence. For both the SLDS (when B_ℓ = I) and the DPDS, the mean for a given sequence of labels {ℓ}_1^T is found by a forward iteration starting with ŷ_1 = µ_{ℓ_1} and iterating for t > 1 with ŷ_t = A_{ℓ_t} ŷ_{t-1} + ν_{ℓ_t}. This does not require the storage of the complete sequence in memory, as the current observation only depends on the previous one. In setups where artificial speech is generated, the video sequence can therefore be generated at the same time as the audio sequence and without length limitations, with O(d) space and O(dT) time complexity, where d is the dimensionality of the face features (without delta features).
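The forward iteration above is simple enough to state directly. The following minimal sketch (illustrative only, with assumed per-state parameter arrays and a hypothetical rendering helper) emits the most likely frame features one at a time, so generation can keep pace with incoming phoneme labels.

```python
import numpy as np

def generate_mean_sequence(labels, mu1, A, nu):
    """Most likely face-feature sequence for a phoneme label sequence.

    Forward iteration y_1 = mu_{l_1}, y_t = A_{l_t} y_{t-1} + nu_{l_t}.
    Only the previous frame is kept in memory, so frames can be emitted as
    soon as their label is known; with diagonal A this costs O(d) work per
    frame, i.e. O(dT) for a whole sequence.
    """
    y_prev = None
    for t, l in enumerate(labels):
        if t == 0:
            y_prev = np.array(mu1[l], copy=True)
        else:
            y_prev = A[l] @ y_prev + nu[l]
        yield y_prev

# Hypothetical usage: stream frames to an AAM renderer as labels arrive.
# for frame_params in generate_mean_sequence(phonemes, mu1, A, nu):
#     render_aam(frame_params)   # render_aam is an assumed helper
```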
4 Evaluation against real video

We evaluated the models in two ways: (1) by computing the error between generated face features and a ground truth (the features of real video), and (2) by asking human subjects to rate how they perceived the sequences. Both tests were done on the same real-world data, but partitioned differently: the comparison to the ground truth was done using 10-fold cross-validation, while the test on humans was done using a single partitioning, due to the limited availability of unbiased test subjects.

Test error and likelihood. In order to test the models against the ground truth, we use the sound to align the labels to the video and generate the corresponding face features. We use 10-fold cross-validation and evaluate the performance of the models using different metrics; see Figure 4. Plot (a) shows, for the different models, the L1 error between the face features generated for the test sound sequences and the face features extracted from the real video. We compared the sequences generated by the DPDS, Brand's model and the SLDS to the most likely observations under a standard HMM. This last model just generates the mean face for each phoneme, resulting in very unnatural sequences. It illustrates how an obviously incorrect model can nevertheless perform very similarly to the other models in terms of generation error. Plots (b) and (c) respectively show the corresponding Root Mean Square (RMS) and L∞ errors. We can see that, except for the SLDS, which performs worse than the other methods in terms of L1, RMS and L∞ error, the generation errors of the models considered are, under all metrics, consistently not statistically significantly different. In terms of the log-likelihood of the test data under the different models, the opposite is true: the traditional HMM and the DPDS clearly perform worst, while the SLDS performs dramatically better. The model with the highest likelihood generates the sequences with the largest error. The likelihood under Brand's model cannot be compared directly, as it operates on twice as many features (the features plus their deltas). These results notwithstanding, great differences can be seen in the quality of the generated video sequences, and the models giving the lowest error or the highest likelihood are far from generating the most realistic sequences.
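For reference, the error measures reported in Figure 4 can be computed per test sequence as in the sketch below. The exact averaging conventions (per-frame distances averaged over frames) are an assumption, since the text only names the metrics; the sketch is not the authors' evaluation code.

```python
import numpy as np

def generation_errors(generated, real):
    """Error metrics between generated and ground-truth feature sequences.

    generated, real: (T, d) arrays of face-feature vectors for one sequence.
    Returns the mean L1 distance, the RMS error and the mean L-infinity
    distance over the frames.
    """
    diff = generated - real
    mean_l1 = np.abs(diff).sum(axis=1).mean()     # per-frame L1, averaged
    rms = np.sqrt((diff ** 2).mean())             # root mean square error
    mean_linf = np.abs(diff).max(axis=1).mean()   # per-frame L-inf, averaged
    return mean_l1, rms, mean_linf
```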
We have therefore performed a rigorous test where volunteers were asked to evaluate the quality of the sequences.

Psychophysical test. For this experiment, we trained the models on a training set of 642 sequences of an average of 5 seconds each. We then labelled the sequences in our test set, which consists of 80 sequences and 436 seconds of video, with phonemes obtained from the sound. These are substantial amounts of data, showing the face in a wide variety of positions. We set up a web-based test in which 33 volunteers compared 12 pairs of video sequences. All video sequences had the original sound, but the video stream was generated by any one of four methods: (1) from the face features extracted from the corresponding real video, (2) from the SLDS, (3) from Brand's model and (4) from the DPDS. A pool of 80 sequences was generated from previously unseen videos. The 12 pairs were chosen such that each generation method was pitted against each other generation method twice (once on each side, left or right, in order to eliminate bias towards a particular side), in random order. For each pair, corresponding sequences were chosen from the respective pools at random. The volunteers were only told that the sequences were either real or artificial, and were asked either to select the real video or to indicate that they could not decide. The test is kept available on-line for validation at http://www.cs.manchester.ac.uk/ai/public/dpdseval.

The results are shown in Table 1. The first row, for example, shows that when comparing Brand's model with the DPDS, people thought that the sequence generated with the former model was real in 5 cases, could not make up their mind in 7 cases, and thought the sequence generated with the DPDS was real in 54 instances.

Table 1: Raw results of the psychophysical test conducted by human volunteers. Every model is compared to every other model; the order in which models are listed in this table is meaningless. See text for details.

  A         prefer A   undecided   prefer B   B
  Brand          5          7          54     DPDS
  Brand          4          7          55     reality
  Brand         36         21           9     SLDS
  DPDS          29         11          26     reality
  DPDS          60          5           1     SLDS
  reality       58          5           3     SLDS

These results indicate that the DPDS performs quite well at generation, clearly much better than the two other models. Note, however, that this test discriminates between the models very harshly. Despite the strong down-voting of Brand's model in this test, the sequences generated with that model do not look all that bad. They are over-smoothed, however, and humans appear to be very sensitive to that. Also remember that Brand's model is the only model considered here with a closed-form solution for the parameter estimation given the labels. Contrary to the other two models, it can easily be trained in the absence of labelling, using an EM algorithm.

In order to correlate human judgement with the generation errors discussed at the start of this section, we have computed the same error measures on the data as partitioned for the psychophysical test. These confirmed the earlier conclusions: the SLDS, which humans like least, gives the highest likelihood and the worst generation errors, while the DPDS and Brand's model do not give significantly different errors.

5 Conclusion

In this work we have proposed a truly generative model, which allows real-time generation of talking faces given speech. We have evaluated it both using multiple error measures and with a thorough test of human perception. The latter test clearly shows that our method perceptually outperforms the others and is virtually indistinguishable from reality. Compared to Brand's method it is slower during training, and cannot easily be trained in the absence of labelling. This is a trade-off for the very fast generation and the visually much more appealing face animation.

In addition, we have shown that traditional metrics do not agree with human perception. The error measures do not necessarily favour our method, but the human preference for it is very significant. We believe this deserves deeper analysis. In future work, we plan to investigate different error measures, especially ones computed on the more directly interpretable video frames rather than on the extracted features. We also intend to experiment with a covariance matrix per state and an unrestricted matrix structure for the transition matrices A_ℓ.

References

[1] David Barber. Expectation correction for smoothed inference in switching linear dynamical systems. Journal of Machine Learning Research, 7:2515–2540, 2006.
[2] V. Blanz, C. Basso, T. Poggio, and T. Vetter. Reanimating faces in images and video. In Proceedings of ACM SIGGRAPH, Annual Conference Series, 2003.
[3] M. Brand. Voice puppetry. In SIGGRAPH '99: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 21–28, New York, NY, USA, 1999. ACM Press/Addison-Wesley Publishing Co.
[4] C. Bregler, H. Hild, and S. Manke. Improving letter recognition by lipreading. In Proceedings of ICASSP, 1993.
[5] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models, their training and application. Comput. Vis. Image Underst., 61(1):38–59, 1995.
[6] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685, 2001.
[7] P. Duchnowski, U. Meier, and A. Waibel. See me, hear me: Integrating automatic speech recognition and lipreading. In Proc. ICSLP 94, 1994.
[8] G. Edwards, C. Taylor, and T. Cootes. Interpreting face images using active appearance models, 1998.
[9] T. F. Ezzat, G. Geiger, and T. Poggio. Trainable videorealistic speech animation. In SIGGRAPH '02: Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pages 388–398, New York, NY, USA, 2002. ACM Press.
[10] H. McGurk and J. MacDonald. Hearing lips and seeing voices. Nature, pages 746–748, December 1976.
[11] Alan B. Poritz. Linear predictive hidden Markov models and the speech signal. Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 7:1291–1294, May 1982.
[12] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Readings in Speech Recognition, pages 267–296. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.
[13] B. Theobald, G. Cawley, I. Matthews, J. Glauert, and J. Bangham. 2.5D visual speech synthesis using appearance models. In Proceedings of the British Machine Vision Conference, 2003.
[14] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 586–591, 1991.
[15] Mike West and Jeff Harrison. Bayesian Forecasting and Dynamic Models. Springer, 1999.
[16] S. Young. The HTK hidden Markov model toolkit: Design and philosophy, 1993.

A Parameter estimation in DPDS

The log-likelihood of a sequence is given by equation (1), which is a multiplicative function of the transition matrices (x_2^s depends on A_{ℓ_2}, x_3^s on A_{ℓ_3} A_{ℓ_2}, and so on). Applying the chain rule repeatedly gives us, for diagonal matrices and using L_t to denote the log-likelihood of a single observation at time t, that

∂L_1/∂A_n = 0   and   ∂L_t/∂A_n = R_{ℓ_t}^{-1} (y_t^s − x_t^s) (∂x_t^s/∂A_n)   for 2 ≤ t ≤ T_s, where
∂x_t^s/∂A_n = x_{t-1}^s δ_{nℓ_t} + A_{ℓ_t} ∂x_{t-1}^s/∂A_n,   and   δ_{nℓ_t} = 1 iff n = ℓ_t.    (2)

Here we give the gradients for diagonal matrices for simplicity of notation and because we used diagonal matrices in this work, but the same principle applies to full matrices. The gradient of the likelihood is then ∂L/∂A_n = Σ_{s=1}^{S} Σ_{t=2}^{T_s} ∂L_t^s/∂A_n. In general the same is done for the other parameters of the model; however, when the covariance is shared by all states, the values of the other parameters can be maximised exactly, as described below. In the following, superscripts differentiate between variables by indicating which parameter the variable is a coefficient to. The covariance is

R = Σ_{s=1}^{S} Σ_{t=2}^{T_s} (y_t^s − x_t^s)(y_t^s − x_t^s)^T / Σ_{s=1}^{S} (T_s − 1),

where x_1^s = µ_{ℓ_1} and x_t^s = A_{ℓ_t} x_{t-1}^s + ν_{ℓ_t}, while the µ_ℓ and ν_ℓ are found by solving the system of linear equations

[ diag(D^{µµ}_n)   D^{µν} ] [ µ ]     [ b^µ ]
[ D^{νµ}           D^{νν} ] [ ν ]  =  [ b^ν ]    (3)

in which µ and ν stack the per-state vectors µ_1 . . . µ_L and ν_1 . . . ν_L, b^µ and b^ν stack the corresponding right-hand sides, the upper-left block is block-diagonal with blocks D^{µµ}_n, and the remaining blocks are composed of the sub-blocks D^{µν}_{n,m}, D^{νµ}_{n,m} and D^{νν}_{n,m}. These coefficients are computed by Algorithm 1, which takes {ℓ}, {y} and the current values of A_{1...L} as input.
Algorithm 1 Maximisation of L with respect to µ and ν

for n ∈ {1 . . . L} do
    b^µ_n ← 0, b^ν_n ← 0, D^{µµ}_n ← 0
    ∀m ∈ {1 . . . L}: D^{µν}_{n,m} ← 0, D^{νν}_{n,m} ← 0, D^{νµ}_{n,m} ← 0
    Compute the coefficients D^{µµ}_n, D^{µν}_{n,·} and b^µ_n of µ_n (only sequences starting in state n involve µ_n)
    for s ∈ {s | ℓ^s_1 = n} do
        D^µ ← I, D^{µµ}_n ← D^{µµ}_n + I, b^µ_n ← b^µ_n + y^s_1
        ∀m ∈ {1 . . . L}: C_m ← 0        (D^µ and C_m are temporary variables)
        for t ∈ {2 . . . T_s} do
            D^µ ← A_{ℓ^s_t} D^µ
            ∀m ∈ {1 . . . L}: C_m ← A_{ℓ^s_t} C_m
            C_{ℓ^s_t} ← C_{ℓ^s_t} + I
            D^{µµ}_n ← D^{µµ}_n + D^µ D^µ, b^µ_n ← b^µ_n + D^µ y^s_t
            ∀m ∈ {1 . . . L}: D^{µν}_{n,m} ← D^{µν}_{n,m} + D^µ C_m
        end for
    end for
    Compute the coefficients D^{νµ}_{n,·}, D^{νν}_{n,·} and b^ν_n of ν_n (all sequences involve ν_n)
    for s ∈ {1 . . . S} do
        D^ν ← 0, C^µ ← I
        ∀m ∈ {1 . . . L}: C_m ← 0        (D^ν, C^µ and C_m are temporary variables)
        for t ∈ {2 . . . T_s} do
            D^ν ← A_{ℓ^s_t} D^ν; if ℓ^s_t = n then D^ν ← D^ν + I end if
            C^µ ← A_{ℓ^s_t} C^µ
            ∀m ∈ {1 . . . L}: C_m ← A_{ℓ^s_t} C_m
            C_{ℓ^s_t} ← C_{ℓ^s_t} + I
            ∀m ∈ {1 . . . L}: D^{νν}_{n,m} ← D^{νν}_{n,m} + D^ν C_m
            D^{νµ}_{n,ℓ^s_1} ← D^{νµ}_{n,ℓ^s_1} + D^ν C^µ, b^ν_n ← b^ν_n + D^ν y^s_t
        end for
    end for
end for
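To connect equation (2) to an implementation, the following sketch accumulates the gradient of the log-likelihood with respect to the diagonal transition matrices by the same forward recursion. It is a simplified NumPy rendering under the diagonal-matrix and shared-covariance assumptions above, not the authors' code, and the storage layout (diagonals stacked in an (L, d) array) is an assumption.

```python
import numpy as np

def dpds_loglik_grad_A(y, labels, mu1, A, nu, r):
    """Gradient of the DPDS log-likelihood w.r.t. diagonal transition matrices.

    y: (T, d) observed face features for one sequence.
    labels: length-T integer phoneme labels.
    mu1, A, nu: per-state initial mean, diagonal transition matrices stored as
        an (L, d) array of diagonals, and per-state bias, respectively.
    r: shared diagonal of the output covariance R.
    Returns an (L, d) array of gradients, one row per state, following the
    forward recursion of equation (2).
    """
    T, d = y.shape
    L = A.shape[0]
    grad = np.zeros((L, d))
    dx = np.zeros((L, d))          # dx[n] = d x_t / d diag(A_n)
    x = np.array(mu1[labels[0]], copy=True)   # x_1 = mu_{l_1}
    for t in range(1, T):
        l = labels[t]
        # Recursion (2): dx_t/dA_n = x_{t-1} [n == l_t] + A_{l_t} dx_{t-1}/dA_n
        dx = A[l] * dx
        dx[l] += x
        x = A[l] * x + nu[l]       # x_t = A_{l_t} x_{t-1} + nu_{l_t}
        grad += ((y[t] - x) / r) * dx
    return grad
```

In a full training loop, one would feed these per-sequence gradients to a gradient-based optimiser for the A_ℓ and, between gradient steps, re-solve the linear system (3) for µ, ν and update R from its closed-form estimate, as described above.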