Combined discriminative and generative articulated pose and non-rigid shape estimation

Leonid Sigal, Alexandru Balan, Michael J. Black
Department of Computer Science, Brown University, Providence, RI 02912
{ls, alb, black}@cs.brown.edu

Abstract

Estimation of three-dimensional articulated human pose and motion from images is a central problem in computer vision. Much of the previous work has been limited by the use of crude generative models of humans represented as articulated collections of simple parts such as cylinders. Automatic initialization of such models has proved difficult and most approaches assume that the size and shape of the body parts are known a priori. In this paper we propose a method for automatically recovering a detailed parametric model of non-rigid body shape and pose from monocular imagery. Specifically, we represent the body using a parameterized triangulated mesh model that is learned from a database of human range scans. We demonstrate a discriminative method to directly recover the model parameters from monocular images using a mixture of regressors. The predicted pose and shape are used to initialize a generative model for more detailed pose and shape estimation. The resulting approach allows fully automatic pose and shape recovery from monocular and multi-camera imagery. Experimental results show that our method is capable of robustly recovering articulated pose, shape and biometric measurements (e.g. height and weight) in both calibrated and uncalibrated camera environments.

1 Introduction

We address the problem of marker-less articulated pose and shape estimation of the human body from images using a detailed parametric body model [3]. Most prior work on marker-less pose estimation and tracking has concentrated on the use of generative Bayesian methods [7, 14] that exploit crude models of body shape (e.g. cylinders [7, 14], voxels [6]).
We argue that a richer representation of shape is needed to make future strides in building better generative models. Discriminative methods [1, 2, 9, 12, 15, 16], more recently introduced specifically for the pose estimation task, do not address estimation of the body shape. Any real-world system must be able to estimate both body shape and pose simultaneously.

Discriminative approaches to pose estimation attempt to learn a direct mapping from image features to 3D pose from either a single image [1, 13, 16] or multiple approximately calibrated views [8]. These approaches tend to use silhouettes [1, 8, 13] and sometimes edges [15, 16] as image features and learn a probabilistic mapping in the form of Nearest Neighbor (NN) search, regression [1], mixture of regressors [2], mixture of Bayesian experts [16], or specialized mappings [13]. While effective and fast, they are inherently limited by the amount and the quality of the training data. More importantly, they currently do not address estimation of the body shape itself. Body shape estimation (independent of the pose) has many applications in biometric authentication and consumer application domains (e.g. virtual fitting rooms).

Simplified models of body shape have a long history in computer vision and provide a relatively low-dimensional description of the human form. More detailed triangulated mesh models obtained from laser range scans have been viewed as too high-dimensional for vision applications. Moreover, mesh models of individuals lack a convenient, low-dimensional parameterization to allow fitting to new subjects. In this paper we use the SCAPE model (Shape Completion and Animation of PEople) [3], which provides a low-dimensional parameterized mesh that is learned from a database of 3D range scans of different people. The SCAPE model captures correlated body shape deformations due to the identity of the person, as well as non-rigid muscle deformations due to articulation.
This model has been shown to allow tractable estimation of parameters from silhouette image features [4, 10]. In [4] the SCAPE model is projected into multiple calibrated images and an iterative importance sampling method is used for inference of the pose and shape that best explain the observed silhouettes. Alternatively, in [10] visual hulls are constructed from many silhouette images and the Iterative Closest Point (ICP) algorithm is used to extract the pose by registering the volumetric features with SCAPE. Both [4] and [10], however, require manual initialization to bootstrap estimation.

In this paper we substitute discriminative articulated pose and shape estimation in place of manual initialization. In doing so, we extend the current models for discriminative pose estimation to deal with the estimation of shape, and couple the discriminative and generative methods for more robust combined estimation. Few combined discriminative and generative pose estimation methods exist [15]; those that do typically require temporal image data and do not address shape estimation.

For discriminative pose and shape estimation we use a Mixture of Experts model, with linear regression as experts, to learn a direct probabilistic mapping between monocular silhouette contour features and the SCAPE parameters. To our knowledge this is the first work that has attempted to recover the 3D shape of the human body directly from a monocular image. While the results are typically noisy, they are appropriate as initialization for the more precise generative refinement process. For generative optimization we make use of the method proposed in [4], where the silhouettes are predicted in multiple views given the pose and shape parameters of the SCAPE model and are compared to the observed silhouettes using a Chamfer distance measure. For training data we use the SCAPE model to generate pairs of 3D body shapes and projected image silhouettes.
Evaluation is performed on sequences of two subjects performing free-style motion. We are able to predict pose, shape, and simple biometric measurements for the subjects from images captured by 4 synchronized cameras. We also show results for 3D shape estimation from monocular images.

The contributions of this paper are twofold: (1) we formulate a discriminative model for estimating the pose and shape directly from monocular image features, and (2) we couple this discriminative method with a generative stochastic optimization for detailed estimation of pose and shape.

2 SCAPE Body Model

In this section we briefly introduce the SCAPE body model; for details the reader is referred to [3]. A low-dimensional mesh model is learned using principal component analysis applied to a registered database of range scans. The SCAPE model is defined by a set of parameterized deformations that are applied to a reference mesh that consists of $T$ triangles $\{x_t \mid t \in [1, ..., T]\}$ (here $T = 25{,}000$). Each of the triangles in the reference mesh is defined by three vertices in 3D space, $(v_{t,1}, v_{t,2}, v_{t,3})$, and has a corresponding associated body part index $p_t \in [1, ..., P]$ (we work with the model that has $P = 15$ body parts corresponding to torso, pelvis, head, and 3 segments for each of the upper and lower extremities). For convenience, the triangles of the mesh are parameterized by the edges, $x_t = (v_{t,2} - v_{t,1}, v_{t,3} - v_{t,1})$, instead of the vertices themselves. Estimating the shape and articulated pose of the body amounts to estimating parameters, $Y$, of the deformations required to produce the mesh $\{y_t \mid t \in [1, ..., T]\}$, the projection of which matches the image evidence.
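The edge parameterization above can be illustrated with a minimal sketch. It assumes a single triangle and an arbitrary 3 × 3 matrix standing in for the learned deformations; the helper names (`triangle_edges`, `deform_triangle`) are hypothetical, and anchoring the first vertex is an illustrative simplification rather than how a full mesh would be reassembled from its deformed triangles.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def add(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def matvec(m: Mat3, v: Vec3) -> Vec3:
    # Plain 3x3 matrix times 3-vector product.
    return tuple(sum(m[r][c] * v[c] for c in range(3)) for r in range(3))


def triangle_edges(v1: Vec3, v2: Vec3, v3: Vec3) -> Tuple[Vec3, Vec3]:
    """Edge parameterization x_t = (v_{t,2} - v_{t,1}, v_{t,3} - v_{t,1})."""
    return sub(v2, v1), sub(v3, v1)


def deform_triangle(v1: Vec3, v2: Vec3, v3: Vec3, transform: Mat3):
    """Apply a 3x3 linear deformation to the two edges of one triangle.

    Assumption for illustration: the first vertex is kept fixed and the
    other two are rebuilt from the deformed edges.
    """
    e1, e2 = triangle_edges(v1, v2, v3)
    d1, d2 = matvec(transform, e1), matvec(transform, e2)
    return v1, add(v1, d1), add(v1, d2)
```

Because only edge vectors are deformed, the representation is independent of where the triangle sits in space; a deformation acts on the triangle's shape and orientation, not on its position.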
The state-space of the model can be expressed by a vector $Y = \{\tau, \theta, \nu\}$, where $\tau \in \mathbb{R}^3$ is the global 3D position of the body, $\theta \in \mathbb{R}^{37}$ is the joint-angle parameterization of the articulation with respect to the skeleton, encoded using Euler angles, and $\nu \in \mathbb{R}^9$ are the shape parameters encoding the identity-specific shape of the person. Given a set of estimated parameters $Y$, a new mesh $\{y_t\}$ can be produced using:

$$y_t = R_{p_t}(\theta)\, S(\nu)\, Q(R_{p_t}(\theta))\, x_t \qquad (1)$$

where $R_{p_t}(\theta)$ is the rigid 3 × 3 rotation matrix for a part $p_t$ and is a function of the joint angles $\theta$; $S(\nu)$ is the linear 3 × 3 transformation matrix modeling subject-specific shape variation as a function of the shape-space parameters $\nu$; $Q(R_{p_t}(\theta))$ is a 3 × 3 residual transformation corresponding to the non-rigid articulation-induced deformations (e.g. bulging of muscles). Notice that $Q(\cdot)$ is simply a learned linear function of the rigid rotation and has no independent parameters. To learn $Q(\cdot)$ we minimize the residual in the least-squares sense between the set of 70 registered scans of one person under different (but known) articulations. It is also worth mentioning that the linear body shape deformation sub-space, $S(\nu) = U_s \nu + \mu_s$, is learned from a set of 10 meshes of different people in full correspondence using PCA; hence $\nu$ can be interpreted as a vector of linear coefficients corresponding to eigen-directions of the shape-space that characterize a given body shape.

Figure 1: Silhouette contour descriptors. Radial Distance Function (RDF) encoding of the silhouette contour is illustrated in (a); Shape Context (SC) encoding of a contour sample point, with radial bins and orientation bins (in degrees), in (b).

3 Features

In this work we make use of silhouette features for both discriminative and generative estimation of pose and shape.
Silhouettes are commonly used for human pose estimation [1, 2, 12, 14, 16]; while limited in their representational power, they are easy to estimate from images and fast to synthesize from a mesh model. The framework introduced here, however, is general and can easily be extended to incorporate richer features (e.g. edges [14], dense region descriptors [15] such as SIFT or HOG, or hierarchical descriptors [9] like HMAX, Hyperfeatures, Spatial Pyramid). The use of such richer feature representations will likely improve both discriminative and generative estimation.

Histograms of shape context. Shape contexts (SC) [5] are rich descriptors based on local shape-based histograms of the contour points sampled from the external boundary of the silhouette. At every sampled boundary point the shape context descriptor is parameterized by the number of orientation bins, $\phi$, the number of radial-distance bins, $r$, and the minimum and maximum radial distances, denoted by $r_{in}$ and $r_{out}$ respectively. As in [1] we achieve scale invariance by making $r_{out}$ a function of the overall silhouette height and normalizing each individual shape context histogram by the sum over all histogram bins. Assuming that $N$ contour points are chosen at random to encode the silhouette, the full feature vector is a $\phi r N$-bin histogram. Even for moderate values of $N$ this produces high-dimensional feature vectors that are hard to deal with. To reduce the silhouette representation to a more manageable size, a secondary histogramming was introduced by Agarwal and Triggs in [1]. In this bag-of-words style model, the shape context space is vector quantized into a set of $K$ clusters (a.k.a. codewords). The codebook of $K = 100$ centers is learned by running k-means clustering on the combined set of shape context vectors obtained from the large set of training silhouettes.
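The per-point shape context histogram described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `shape_context`, the log-spaced radial bin edges, and the default `r_out` value are assumptions, since the text specifies only the bin counts and the radial limits.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def shape_context(points: Sequence[Point], index: int, n_theta: int = 12,
                  n_r: int = 5, r_in: float = 3.0,
                  r_out: float = 50.0) -> List[float]:
    """Histogram of the other contour points around points[index].

    Bins are indexed as r_bin * n_theta + theta_bin; the result is
    normalized by the sum over all bins.
    """
    px, py = points[index]
    hist = [0.0] * (n_theta * n_r)
    log_in, log_out = math.log(r_in), math.log(r_out)
    for j, (qx, qy) in enumerate(points):
        if j == index:
            continue
        dx, dy = qx - px, qy - py
        dist = math.hypot(dx, dy)
        if dist < r_in or dist >= r_out:
            continue  # outside the radial support of the descriptor
        # Assumption: radial bins are spaced uniformly in log-distance.
        r_bin = int((math.log(dist) - log_in) / (log_out - log_in) * n_r)
        r_bin = min(r_bin, n_r - 1)
        angle = math.atan2(dy, dx) % (2.0 * math.pi)
        t_bin = min(int(angle / (2.0 * math.pi) * n_theta), n_theta - 1)
        hist[r_bin * n_theta + t_bin] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist
```

In practice `r_out` would be tied to the silhouette height, which is what makes the descriptor insensitive to the overall scale of the silhouette.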
Once the codebook is learned, the quantized $K$-dimensional histograms are obtained by voting into the histogram bins corresponding to codebook entries. Soft voting has been shown [1] to reduce effects of spatial quantization. The final descriptor $X_{sc} \in \mathbb{R}^K$ is normalized to unit length, to ensure that silhouettes that contain different numbers of contour points can be compared. The resulting codebook shape context representation is translation and scale invariant by definition. Following prior work [1, 12] we use $\phi = 12$ orientation bins and $r = 5$ radial-distance bins, with $r_{in} = 3$ and $r_{out} = \kappa h$, where $h$ is the height of the silhouette and $\kappa$ is typically $1/4$, ensuring integration of contour points over regions roughly similar to the limb size [1]. For shape estimation, we found combining features across multiple spatial scales (e.g. $\kappa = \{1/4, 1/2, ...\}$) to be more effective.

Radial distance function. The Radial Distance Function (RDF) features are defined by a feature vector $X_{rdf} = \{p_c, \|p_1 - p_c\|, \|p_2 - p_c\|, ..., \|p_N - p_c\|\}$, where $p_c \in \mathbb{R}^2$ is the centroid of the image silhouette and $p_i$ is a point on the silhouette outer contour; hence $\|p_i - p_c\| \in \mathbb{R}$ measures the maximal object extent in the particular direction, denoted by $i$, from the centroid. For all experiments we use $N = 100$ points, resulting in $X_{rdf} \in \mathbb{R}^{102}$. We explicitly ensure that the dimensionality of the RDF descriptor is comparable to that of the shape context introduced above. Unlike the shape context descriptor, the RDF feature vector is neither scale nor translation invariant. Hence, RDF features are only suited for applications where camera calibration is known and fixed.

4 Discriminative estimation of pose and shape

To produce initial estimates for the body pose and/or shape in 3D from image features, we need to model the conditional distribution $p(Y|X)$ of the 3D body state $Y$ given the set of 2D features $X$.
Intuitively this conditional mapping should be related to the inverse of the camera projection matrix and, as with many inverse problems, is highly ambiguous. To model this non-linear relationship we use a Mixture of Experts (MoE) model to represent the conditionals [2, 16]. The parameters of the MoE model are learned by maximizing the log-likelihood of the training data set $D = \{(x^{(1)}, y^{(1)}), ..., (x^{(N)}, y^{(N)})\}$ consisting of $N$ input-output pairs $(x^{(i)}, y^{(i)})$. We use an iterative Expectation Maximization (EM) algorithm, based on type-II maximum likelihood, to learn parameters of the MoE. Our model for the conditional can be written as:

$$p(Y|X) \propto \sum_{k=1}^{M} p_{e,k}(Y|X, \Theta_{e,k})\, p_{g,k}(k|X, \Theta_{g,k}) \qquad (2)$$

where $p_{e,k}$ is the probability of choosing pose $Y$ given the input $X$ according to the $k$-th expert, and $p_{g,k}$ is the probability of that input being assigned to the $k$-th expert using an input-sensitive gating network; in both cases $\Theta$ represents the parameters of the mixture and gate distributions.

For simplicity and to reduce complexity of the experts we choose linear regression with constant offset, $Y = \beta X + \alpha$, as our expert model, which allows us to solve for the parameters $\Theta_{e,k} = \{\beta_k, \alpha_k, \Lambda_k\}$ analytically using weighted linear regression, where

$$p_{e,k}(Y|X, \Theta_{e,k}) = \frac{1}{\sqrt{(2\pi)^n |\Lambda_k|}} \exp\left(-\tfrac{1}{2} \Delta_k^T \Lambda_k^{-1} \Delta_k\right), \qquad \Delta_k = Y - \beta_k X - \alpha_k.$$

The weighted ridge regression solution for the parameters $\beta_k$ and $\alpha_k$ can be written in matrix notation as follows:

$$\begin{bmatrix} \beta_k^T \\ \alpha_k^T \end{bmatrix} = \begin{bmatrix} D_X^T \,\mathrm{diag}(Z_k)\, D_X + \mathrm{diag}(\lambda) & D_X^T Z_k \\ Z_k^T D_X & Z_k^T Z_k \end{bmatrix}^{-1} \begin{bmatrix} D_X^T \,\mathrm{diag}(Z_k)\, D_Y \\ Z_k^T D_Y \end{bmatrix} \qquad (3)$$

Pose estimation is a high-dimensional and ill-conditioned problem, so simple least squares estimation of the linear regression matrix parameters typically produces severe over-fitting and poor generalization. To reduce this, we add smoothness constraints on the learned mapping. We use a damped regularization term $R(\beta) = \lambda \|\beta\|^2$ that penalizes large values in the coefficient matrix $\beta$, where $\lambda$ is a regularization parameter.
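A single expert fit of this kind can be sketched with NumPy. This is a minimal sketch of textbook weighted ridge regression, not necessarily identical term by term to Eq. (3): examples are stored as rows (so the model reads `Y ≈ X @ beta + alpha`), the offset is handled by appending a column of ones, and only the regression coefficients are penalized. The function name `fit_expert` is hypothetical.

```python
import numpy as np


def fit_expert(X, Y, z, lam):
    """Weighted ridge regression for one expert.

    X: (N, d) inputs, Y: (N, m) outputs, z: (N,) ownership weights,
    lam: ridge penalty applied to beta only (alpha is not shrunk).
    Returns beta with shape (d, m) and alpha with shape (m,).
    """
    N, d = X.shape
    A = np.hstack([X, np.ones((N, 1))])       # column of ones models the offset
    W = z[:, None]                            # per-example weights, broadcast over columns
    penalty = np.diag([lam] * d + [0.0])      # no penalty on the offset row
    lhs = A.T @ (W * A) + penalty
    rhs = A.T @ (W * Y)
    coef = np.linalg.solve(lhs, rhs)          # (d + 1, m)
    return coef[:d], coef[d]


# Usage on synthetic data: a noiseless line y = 2x + 1 with uniform weights.
X_demo = np.linspace(-1.0, 1.0, 20).reshape(-1, 1)
Y_demo = 2.0 * X_demo + 1.0
beta_demo, alpha_demo = fit_expert(X_demo, Y_demo, np.ones(20), lam=0.0)
```

Increasing `lam` shrinks `beta` toward zero while leaving the offset free, which is the damping behaviour the regularization term is meant to provide.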
Larger values of $\lambda$ will result in overdamping, where the solution will be underestimated; small values of $\lambda$ will result in overfitting and possibly ill-conditioning. Since the solution of the ridge regressors is not symmetric under the scaling of the inputs, we normalize the inputs $\{x^{(1)}, x^{(2)}, ..., x^{(N)}\}$ by the standard deviation in each dimension respectively before solving.

In Eq. (3), $Z_k = [z_k^{(1)}, z_k^{(2)}, ..., z_k^{(N)}]^T$ is the vector of ownership weights described later in this section and $\mathrm{diag}(Z_k)$ is the diagonal matrix with $Z_k$ on the diagonal; $D_X = [x^{(1)}, x^{(2)}, ..., x^{(N)}]$ and $D_Y = [y^{(1)}, y^{(2)}, ..., y^{(N)}]$ are vectors of inputs and outputs from the training data $D$.

Maximization for the gate parameters can be done analytically as well. Given the gate model,

$$p_{g,k}(k|X, \Theta_{g,k}) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_k|}} \exp\left(-\tfrac{1}{2} (X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k)\right),$$

maximization of the gate parameters $\Theta_{g,k} = (\Sigma_k, \mu_k)$ becomes similar to mixture of Gaussians estimation, where

$$\mu_k = \frac{\sum_{n=1}^{N} z_k^{(n)} x^{(n)}}{\sum_{n=1}^{N} z_k^{(n)}}, \qquad \Sigma_k = \frac{1}{\sum_{n=1}^{N} z_k^{(n)}} \sum_{n=1}^{N} z_k^{(n)} [x^{(n)} - \mu_k][x^{(n)} - \mu_k]^T,$$

and $z_k^{(n)}$ is the estimated ownership weight of the example $n$ by the expert $k$, estimated by expectation:

$$z_k^{(n)} = \frac{p_{e,k}(y^{(n)}|x^{(n)}, \Theta_{e,k})\, p_{g,k}(k|x^{(n)}, \Theta_{g,k})}{\sum_{j=1}^{M} p_{e,j}(y^{(n)}|x^{(n)}, \Theta_{e,j})\, p_{g,j}(j|x^{(n)}, \Theta_{g,j})}. \qquad (4)$$

The above outlines the full EM procedure for the MoE model. We learn three separate models for shape, $p(\nu|X)$, articulated pose, $p(\theta|X)$, and global position, $p(\tau|X)$. Similar to [2] we initialize the EM learning by clustering the output 3D poses using the K-means procedure.

Implementation Details. For articulated pose and shape we experimented with using both RDF and SC features (global position requires RDF features since SC is location and scale invariant). SC features tend to work better for pose estimation whereas RDF features perform better for shape estimation. Hence, we learn $p(\nu|X_{rdf})$, $p(\theta|X_{sc})$ and $p(\tau|X_{rdf})$. In cases where calibration is unavailable, we estimate the shape using $p(\nu|X_{sc})$, which tends to produce reasonable results but cannot estimate the overall height. We estimate the number of mixture components, $M$, and the regularization parameter, $\lambda$, by learning a number of models and cross validating on the withheld dataset.

5 Generative stochastic optimization of pose and shape

Generative stochastic state estimation, as in [4], is handled within an iterative importance sampling framework [7]. To this end, we represent the posterior distribution over the state (that includes both pose and shape), $p(Y|I) \propto p(I|Y)\,p(Y)$, using a set of $N$ weighted samples $\{y_i, \pi_i\}_{i=1}^{N}$, where $y_i \sim q(Y)$ is a sample drawn from the importance function $q(Y)$ and $\pi_i \propto p(I|y_i)\,p(y_i)/q(y_i)$ is an associated normalized weight. As in [4] we make no rigorous probabilistic claims about the generative model, but rather use it as an effective means of performing stochastic search.

As required by the annealing framework, we define a set of importance functions $q_k(Y)$ from which we draw samples at each respective iteration $k$. We define importance functions recursively using a smoothed version of the posterior from the previous iteration, $q_{k+1}(Y) = \sum_{i=1}^{N} \pi_i^{(k)} \mathcal{N}(y_i^{(k)}, \Sigma^{(k)})$, encoded using a kernel Gaussian density with iteration-dependent bandwidth parameter $\Sigma^{(k)}$. To avoid effects of local optima, the likelihood is annealed as follows: $p_k(I|Y) = [p(I|Y)]^{T_k}$ at every iteration, where $T_k$ is the temperature parameter. As a result, effects of peaks in the likelihood are introduced slowly.

To initiate the stochastic search an initial distribution is needed. The high dimensionality of the state space requires this initial distribution to be relatively close to the solution in order to reach convergence. Here we make use of the discriminative pose and shape estimate from Section 4 to give us the initial distribution for the posterior.
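The annealed search loop of this section can be sketched on a toy one-dimensional state. This is a simplified illustration under stated assumptions: the state is a scalar, the weights are set to the annealed likelihood alone (flat prior, proposal correction omitted, in keeping with using the sampler as a stochastic search), and the names `annealed_search`, `temps` and `bandwidths` are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def annealed_search(log_lik: Callable[[float], float],
                    init: Sequence[float],
                    temps: Sequence[float],
                    bandwidths: Sequence[float],
                    seed: int = 0) -> Tuple[List[float], List[float]]:
    """Iterative importance sampling with an annealed likelihood.

    init: initial samples (given equal weights 1/N).
    temps: temperature T_k per iteration, typically increasing to 1.
    bandwidths: kernel standard deviation per iteration, typically shrinking.
    """
    rng = random.Random(seed)
    samples = list(init)
    n = len(samples)
    weights = [1.0 / n] * n
    for temp, bw in zip(temps, bandwidths):
        # Sample from the smoothed previous posterior: pick a weighted
        # sample, then perturb it with the Gaussian kernel.
        parents = rng.choices(samples, weights=weights, k=n)
        samples = [p + rng.gauss(0.0, bw) for p in parents]
        # Annealed likelihood [p(I|y)]^T, evaluated in the log domain.
        log_w = [temp * log_lik(y) for y in samples]
        top = max(log_w)
        unnorm = [math.exp(v - top) for v in log_w]
        total = sum(unnorm)
        weights = [u / total for u in unnorm]
    return samples, weights
```

Low early temperatures flatten the likelihood so that samples far from a peak still receive usable weight; as the temperature rises and the kernel narrows, the sample set concentrates around the dominant mode.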
In particular, given the discriminative model for the shape, $p(\nu|X)$, position, $p(\tau|X)$, and articulated pose, $p(\theta|X)$, of the body, we can let (with slight abuse of notation) $y_i^{(0)} \sim [p(\tau|X), p(\theta|X), p(\nu|X)]$ and $\pi_i^{(0)} = 1/N$ for $i \in [1, ..., N]$.

The outlined stochastic optimization framework also requires an image likelihood function, $p(I|Y)$, that measures how well our model under a given state $Y$ matches the image evidence, $I$, obtained from one or multiple synchronized cameras. We adopt the likelihood function introduced in [4] that measures the similarity between observed and hypothesized silhouettes. For a given camera view, a foreground silhouette is computed using a shadow-suppressing background subtraction procedure and is compared to the silhouette obtained by projecting the SCAPE model, subject to the hypothesized state, into the image plane (given calibration parameters of the camera). Pixels in the non-overlapping regions are penalized by the distance to the closest contour point of the silhouette. This is made efficient by the use of a Chamfer distance map precomputed for both silhouettes.

6 Experiments

Datasets. In this paper we make use of 3 different datasets. The training dataset, used to learn the discriminative MoE models and the codeword dictionary for SC, was generated by synthesizing 3000 silhouette images obtained by projecting corresponding SCAPE body models into an image plane using calibration parameters of the camera. Body models in turn were generated by randomly sampling the pose from a database of motion capture data (consisting of generally non-cyclic random motions) and the body shape coefficients from a uniform distribution centered at the mean shape. A similar synthetic test dataset was constructed, consisting of 597 silhouette-SCAPE body model pairs. In addition, we collected a real dataset consisting of hardware-synchronized motion capture and video collected using 4 cameras. Two subjects were captured performing roughly the same class of motions as in the training dataset.

Figure 2: Discriminative estimation of weight loss. Two images of a subject before and after weight loss are shown in (a) on the left and right respectively. The images were downloaded from the web (Google) and manually segmented (b). The estimated shape and pose obtained by our discriminative estimation procedure are shown in (c). In the bottom row, we manually rotated the model 90 degrees for better visibility of the shape variation. Since camera calibration is unavailable, we use $p(\nu|X_{sc})$ and normalize the before and after shapes to the same reference height. Our method estimated that the person in the top row lost 22 lb and the one at the bottom 32 lb; web-reported weight loss for the two subjects was 24 lb and 64 lb respectively. Notice that the natural posture assumed in the images was not present in our training data set, causing visible artifacts in the estimation of the arm pose. Also, the bottom example pushes the limits of our current shape model, which was trained using only 10 scans of people, none close to the desired body shape.

Discriminative estimation of shape. Results of using the MoE model, similar to the one introduced here, for pose estimation have previously been reported in [2] and [16]. Our experience with articulated pose estimation was similar and we omit supporting experiments due to lack of space. For discriminative estimation of shape we quantitatively compared SC and RDF features, by training two MoE models, $p(\nu|X_{sc})$ and $p(\nu|X_{rdf})$, and found the latter to perform better when camera calibration is available (on average we achieve a 19.3% performance increase over simply using the mean shape). We attribute the superior performance of RDF features to their sensitivity to the silhouette position and scale, which allows for better estimation of, for example, the overall height of the body.
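The sensitivity of the RDF descriptor to position and scale can be made concrete with a short sketch of the descriptor from Section 3. It is illustrative only: the centroid is taken as the mean of the supplied contour points (a stand-in for the centroid of the full silhouette region), directions are discretized into equal angular bins, and the function name `rdf_descriptor` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def rdf_descriptor(contour: Sequence[Point], n_dirs: int = 100) -> List[float]:
    """Return [c_x, c_y, extent_1, ..., extent_N] for a silhouette contour.

    extent_i is the largest centroid-to-contour distance among contour
    points falling in the i-th angular bin (0 if the bin is empty).
    """
    cx = sum(x for x, _ in contour) / len(contour)
    cy = sum(y for _, y in contour) / len(contour)
    extents = [0.0] * n_dirs
    for x, y in contour:
        dx, dy = x - cx, y - cy
        angle = math.atan2(dy, dx) % (2.0 * math.pi)
        b = min(int(angle / (2.0 * math.pi) * n_dirs), n_dirs - 1)
        extents[b] = max(extents[b], math.hypot(dx, dy))
    return [cx, cy] + extents
```

Translating the contour changes the first two entries, and scaling it changes every radial extent, so the descriptor carries exactly the position and size information that a normalized shape context histogram discards.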
Given the shape we can also estimate the volume of the body and, assuming a constant density equal to that of water, compute the weight of the person. To illustrate this we estimate the approximate weight loss of a person from monocular uncalibrated images (see Figure 2). Please note that this application is a proof of concept and not a rigorous experiment¹. In principle, the SCAPE model is not ideal for weight calculations, since non-rigid deformations caused by articulations of the body will result in (unnatural) variations in weight. In practice, however, we found that such variations produce relatively minor artifacts. The weight calculations are, on the other hand, very sensitive to the body shape estimate itself.

¹ The "ground truth" weight change here is self-reported and gathered from the Internet.

Combining discriminative and generative estimation. Lastly we tested the performance of the combined discriminative and generative estimation by estimating articulated pose, shape and biometric measurements for people in our real dataset. Results of biometric measurement estimates can be seen in Figure 3; a corresponding visual illustration of the results is shown in Figure 4. Notice that generative estimation significantly refines the discriminative estimates. In addition, our approach, which unlike [4] does not require manual initialization, performs comparably (and sometimes marginally better than [4]) in terms of mean performance (but has roughly twice the variance).

Subject (frames)  Biometric feature  Actual  Discriminative    Disc. + Generative  GT + Generative
                                             Mean     Std      Mean     Std        Mean     Std
A (34)            Height (mm)        1780    1716.1   41.9     1776.2   43.8       1796.9   22.9
A (34)            Arm span (mm)      1597    1553.6   39.7     1597.3   58.0       1607.7   30.7
A (34)            Weight (kg)        88      83.62    8.94     83.37    8.01       85.83    3.73
B (30)            Height (mm)        1825    1703.8   88.8     1751.0   95.2       1844.1   63.8
B (30)            Arm span (mm)      1668    1537.7   69.2     1547.5   91.4       1659.0   29.1
B (30)            Weight (kg)        63      80.63    18.53    64.98    9.27       66.33    4.69

Figure 3: Estimating basic biometric measurements. The figure illustrates basic biometric measurements (height, arm span³ and weight) computed for two subjects, A and B. Mean and standard deviation are reported over 34 and 30 frames for subjects A and B respectively. Every 25th frame from two sequences obtained using 4 synchronized cameras was chosen for estimation. The actual measured values for the two subjects are shown in the left column. Estimates obtained using the discriminative method only, and the discriminative method followed by generative shape estimation, are reported in the next two columns. The discriminative method used only one view for estimation, whereas the generative method used all 4 views to obtain a better fit. The last column reports estimates obtained using ground truth pose and mean shape as initialization for the generative fit (this is the algorithm proposed in [4]).

Analysis of errors. In rare cases our system produces poor pose and/or shape estimates. Typically these cases can be classified into two categories: (1) minor errors that only affect the pose and are artifacts of local optima, or (2) more significant errors that affect the shape and result from a poor initial distribution over the state produced by the discriminative method, which arises from 180 degree ambiguity and/or configuration symmetry in silhouettes.

7 Discussion and Conclusions

We have presented a method for automatic estimation of articulated pose and shape of people from images. Our approach goes beyond prior work in that it is able to estimate a detailed parametric model (SCAPE) directly from images without requiring manual intervention or initialization. We found that the discriminative model produced an effective initialization for the generative optimization procedure and that biometric measurements from the recovered shape were comparable to those produced by prior approaches that required manual initialization [4]. We also introduced and addressed the problem of discriminative estimation of shape from monocular calibrated and uncalibrated images.
More accurate shape estimates from monocular data will require richer image descriptors. A number of straightforward extensions to our model will likely yield immediate improvement in performance. Among these are the use of temporal consistency in the discriminative pose (and perhaps shape) estimation [16] and dense image descriptors [9]. In addition, in this work we estimated the shape space of the SCAPE model from only 10 body scans; as a result the learned shape space is rather limited in its expressive power. We believe some of the artifacts of this can be observed in Figure 2, where the weight of the heavier woman is underestimated.

³ Arm span is defined as the distance between knuckles of the left and right arm fully extended in `T'-pose [4].

Figure 4: Visualizing pose and shape estimation. Examples of simultaneous pose and shape estimation for subjects A and B are shown on top and bottom respectively. Results are obtained by discriminatively estimating the distribution over the initial state and then refining this distribution via generative local stochastic search. Left column illustrates projection of the estimated model into all 4 views. Middle column shows the projection of the model onto image silhouettes, where light blue denotes the image silhouette, dark red the projection of the model, and orange the non-silhouette regions that overlap with the projection. On the right are the two views of the estimated 3D model.

References

[1] A. Agarwal and B. Triggs. Recovering 3D human pose from monocular images, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, No. 1, pp. 44-58, 2006.
[2] A. Agarwal and B. Triggs. Monocular human motion capture with a mixture of regressors, IEEE Workshop on Vision for Human-Computer Interaction, 2005.
[3] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers and J. Davis. SCAPE: Shape Completion and Animation of PEople, ACM Trans. Graphics, 24(3):408-416, 2005.
[4] A. Balan, L. Sigal, M. Black, J. Davis and H. Haussecker. Detailed human shape and pose from images, CVPR, 2007.
[5] S. Belongie, J. Malik and J. Puzicha. Matching shapes, ICCV, pp. 454-461, 2001.
[6] K.M. Cheung, S. Baker and T. Kanade. Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture, CVPR, Vol. 1, pp. 77-84, 2003.
[7] J. Deutscher, A. Blake and I. Reid. Articulated body motion capture by annealed particle filtering, CVPR, Vol. 2, pp. 126-133, 2000.
[8] K. Grauman, G. Shakhnarovich and T. Darrell. Inferring 3D structure with a statistical image-based shape model, ICCV, pp. 641-648, 2003.
[9] A. Kanaujia, C. Sminchisescu and D. Metaxas. Semi-supervised Hierarchical Models for 3D Human Pose Reconstruction, CVPR, 2007.
[10] L. Muendermann, S. Corazza and T. Andriacchi. Accurately measuring human movement using articulated ICP with soft-joint constraints and a repository of articulated models, CVPR, 2007.
[11] R. Plankers and P. Fua. Articulated soft objects for video-based body modeling, ICCV, 2001.
[12] R.W. Poppe and M. Poel. Comparison of silhouette shape descriptors for example-based human pose recovery, IEEE Conference on Automatic Face and Gesture Recognition (FG 2006), pp. 541-546, 2006.
[13] R. Rosales and S. Sclaroff. Learning Body Pose Via Specialized Maps, NIPS, 2002.
[14] L. Sigal, S. Bhatia, S. Roth, M. J. Black and M. Isard. Tracking Loose-limbed People, CVPR, Vol. 1, pp. 421-428, 2004.
[15] C. Sminchisescu, A. Kanaujia and D. Metaxas. Learning Joint Top-Down and Bottom-up Processes for 3D Visual Inference, CVPR, Vol. 2, pp. 1743-1752, 2006.
[16] C. Sminchisescu, A. Kanaujia, Z. Li and D. Metaxas. Discriminative density propagation for 3D human motion estimation, CVPR, Vol. 1, pp. 390-397, 2005.