Learning a Discriminative Hidden Part Model for Human Action Recognition

Yang Wang, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada, V5A 1S6, ywang12@cs.sfu.ca
Greg Mori, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada, V5A 1S6, mori@cs.sfu.ca

Abstract

We present a discriminative part-based approach for human action recognition from video sequences using motion features. Our model is based on the recently proposed hidden conditional random field (hCRF) for object recognition. Similar to the hCRF for object recognition, we model a human action by a flexible constellation of parts conditioned on image observations. Different from object recognition, our model combines both large-scale global features and local patch features to distinguish various actions. Our experimental results show that our model is comparable to other state-of-the-art approaches in action recognition. In particular, our experimental results demonstrate that combining large-scale global features and local patch features performs significantly better than directly applying the hCRF on local patches alone.

1 Introduction

Recognizing human actions from videos is a task of obvious scientific and practical importance. In this paper, we consider the problem of recognizing human actions from video sequences on a frame-by-frame basis. We develop a discriminatively trained hidden part model to represent human actions. Our model is inspired by the hidden conditional random field (hCRF) model [16] in object recognition.

In object recognition, there are three major representations: global templates (rigid, e.g. [3], or deformable, e.g. [1]), bag-of-words [18], and part-based models [7, 6]. All three representations have been shown to be effective on certain object recognition tasks. In particular, recent work [6] has shown that part-based models outperform global templates and bag-of-words on challenging object recognition tasks. Many of the ideas used in object recognition can also be found in action recognition. For example, there is work [2] that treats actions as space-time shapes and reduces the problem of action recognition to 3D object recognition. In action recognition, both global template [5] and bag-of-words models [14, 4, 15] have been shown to be effective on certain tasks. Although conceptually appealing and promising, the merit of part-based models has not yet been widely recognized in action recognition. The goal of this work is to address this gap.

Our work is partly inspired by recent work in part-based event detection [10]. In that work, template matching is combined with a pictorial structure model to detect and localize actions in crowded videos. One limitation of that work is that the parts must be specified manually. Unlike Ke et al. [10], the parts in our model are initialized automatically.

Figure 1: Construction of the motion descriptor. (a) original image; (b) optical flow; (c) $x$ and $y$ components of the optical flow vectors $F_x$, $F_y$; (d) half-wave rectification of the $x$ and $y$ components to obtain four separate channels $F_x^+$, $F_x^-$, $F_y^+$, $F_y^-$; (e) final blurry motion descriptors $Fb_x^+$, $Fb_x^-$, $Fb_y^+$, $Fb_y^-$.

The major contribution of this work is that we combine the flexibility of part-based approaches with the global perspective of large-scale template features in a discriminative model. We show that the combination of part-based and large-scale template features improves the final results.
2 Our Model

The hidden conditional random field (hCRF) model [16] was originally proposed for object recognition and has also been applied to sequence labeling [19]. Objects are modeled as flexible constellations of parts conditioned on the appearances of local patches found by interest point operators. The probability of the assignment of parts to local features is modeled by a conditional random field (CRF) [11]. The advantage of the hCRF is that it relaxes the conditional independence assumption commonly used in bag-of-words approaches to object recognition.

Similarly, local patches can also be used to distinguish actions. Figure 4(a) shows some examples of human motion and the local patches that can be used to distinguish them. A bag-of-words representation can be used to model these local patches for action recognition. However, it suffers from the same restrictive conditional independence assumption, which ignores the spatial structure of the parts. In this work, we use a variant of the hCRF to model the constellation of these local patches in order to alleviate this restriction.

There are also some important differences between objects and actions. For objects, local patches may carry enough information for recognition, but for actions, we believe local patches are not sufficiently informative. In our approach, we modify the hCRF model to combine local patches and large-scale global features. The large-scale global features are represented by a root model that takes the frame as a whole. Another important difference from [16] is that we use the learned root model to find discriminative local patches, rather than using a generic interest-point operator.

2.1 Motion features

Our model is built upon the optical flow features in [5]. This motion descriptor has been shown to perform reliably with noisy image sequences, and has been applied to various tasks, such as action classification and motion synthesis.

To calculate the motion descriptor, we first need to track and stabilize the persons in a video sequence. Any reasonable tracking or human detection algorithm can be used, since the motion descriptor we use is very robust to jitter introduced by the tracking. Given a stabilized video sequence in which the person of interest appears in the center of the field of view, we compute the optical flow at each frame using the Lucas-Kanade [12] algorithm. The optical flow vector field $F$ is then split into two scalar fields $F_x$ and $F_y$, corresponding to the $x$ and $y$ components of $F$. $F_x$ and $F_y$ are further half-wave rectified into four non-negative channels $F_x^+$, $F_x^-$, $F_y^+$, $F_y^-$, so that $F_x = F_x^+ - F_x^-$ and $F_y = F_y^+ - F_y^-$. These four non-negative channels are then blurred with a Gaussian kernel and normalized to obtain the final four channels $Fb_x^+$, $Fb_x^-$, $Fb_y^+$, $Fb_y^-$ (see Fig. 1).
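The following sketch illustrates how the four blurred channels can be computed for one stabilized frame pair. It is a minimal Python illustration, assuming OpenCV's dense Farneback flow as a stand-in for the Lucas-Kanade flow used in the paper; the blur width `sigma` and the choice of normalization are illustrative assumptions, not values taken from the paper.

```python
import cv2
import numpy as np

def motion_descriptor(prev_gray, curr_gray, sigma=3.0):
    """Compute the four blurred, half-wave-rectified flow channels
    Fb_x+, Fb_x-, Fb_y+, Fb_y- for a stabilized frame pair."""
    # Dense optical flow; the paper uses Lucas-Kanade, Farneback is a stand-in.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx, fy = flow[..., 0], flow[..., 1]

    # Half-wave rectification: F = F+ - F-, with both parts non-negative.
    channels = [np.maximum(fx, 0), np.maximum(-fx, 0),
                np.maximum(fy, 0), np.maximum(-fy, 0)]

    # Blur each channel with a Gaussian kernel and normalize (the exact
    # normalization is an assumption; the paper does not spell it out).
    blurred = []
    for c in channels:
        b = cv2.GaussianBlur(c, ksize=(0, 0), sigmaX=sigma)
        blurred.append(b / (np.abs(b).sum() + 1e-8))
    return np.stack(blurred, axis=-1)  # H x W x 4 motion descriptor
```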
2.2 Hidden conditional random field (hCRF)

Now we describe how we model a frame $I$ in a video sequence. Let $x$ be the motion feature of this frame, and $y$ be the corresponding class label of this frame, ranging over a finite label alphabet $\mathcal{Y}$. Our task is to learn a mapping from $x$ to $y$. We assume each image $I$ contains a set of salient patches $\{I_1, I_2, ..., I_m\}$. We will describe how to find these salient patches in Sec. 3. Our training set consists of labeled images $(x^t, y^t)$ (as a notation convention, we use superscripts to index training images and subscripts to index patches) for $t = 1, 2, ..., n$, where $y^t \in \mathcal{Y}$ and $x^t = (x^t_1, x^t_2, ..., x^t_m)$. Here $x^t_i = x^t(I^t_i)$ is the feature vector extracted from the global motion feature $x^t$ at the location of the patch $I^t_i$.

For each image $I = \{I_1, I_2, ..., I_m\}$, we assume there exists a vector of hidden "part" variables $h = \{h_1, h_2, ..., h_m\}$, where each $h_i$ takes values from a finite set $\mathcal{H}$ of possible parts. Intuitively, each $h_i$ assigns a part label to the patch $I_i$, where $i = 1, 2, ..., m$. For example, for the action "waving-two-hands", these parts may be used to characterize the movement patterns of the left and right arms. The values of $h$ are not observed in the training set, and become the hidden variables of the model.

We assume there are certain constraints between some pairs $(h_j, h_k)$. For example, in the case of "waving-two-hands", two patches $h_j$ and $h_k$ at the left hand might have the constraint that they tend to have the same part label, since both of them are characterized by the movement of the left hand. If we consider $h_i$ ($i = 1, 2, ..., m$) to be vertices in a graph $G = (V, E)$, the constraint between $h_j$ and $h_k$ is denoted by an edge $(j, k) \in E$. See Fig. 2 for an illustration of our model. Note that the graph structure can be different for different images. We will describe how to find the graph structure $E$ in Sec. 3.

Figure 2: Illustration of the model. Each circle corresponds to a variable (the class label $y$, the hidden parts $h_i, h_j, h_k$, and the image features $x_i, x_j, x_k, x$), and each square corresponds to a factor in the model.

Given the motion feature $x$ of an image $I$, its corresponding class label $y$, and part labels $h$, a hidden conditional random field is defined as
$$p(y, h \mid x; \theta) = \frac{\exp(\Psi(y, h, x; \theta))}{\sum_{\hat{y} \in \mathcal{Y}} \sum_{\hat{h} \in \mathcal{H}^m} \exp(\Psi(\hat{y}, \hat{h}, x; \theta))}$$
where $\theta$ is the model parameter, and $\Psi(y, h, x; \theta) \in \mathbb{R}$ is a potential function parameterized by $\theta$. It follows that
$$p(y \mid x; \theta) = \sum_{h \in \mathcal{H}^m} p(y, h \mid x; \theta) = \frac{\sum_{h \in \mathcal{H}^m} \exp(\Psi(y, h, x; \theta))}{\sum_{\hat{y} \in \mathcal{Y}} \sum_{\hat{h} \in \mathcal{H}^m} \exp(\Psi(\hat{y}, \hat{h}, x; \theta))} \qquad (1)$$

We assume $\Psi(y, h, x; \theta)$ is linear in the parameters $\theta = \{\alpha, \beta, \gamma, \eta\}$:
$$\Psi(y, h, x; \theta) = \sum_{j \in V} \alpha \cdot \phi(x_j, h_j) + \sum_{j \in V} \beta \cdot \varphi(y, h_j) + \sum_{(j,k) \in E} \gamma \cdot \psi(y, h_j, h_k) + \eta \cdot \omega(y, x) \qquad (2)$$
where $\phi(\cdot)$ and $\varphi(\cdot)$ are feature vectors depending on unary $h_j$'s, $\psi(\cdot)$ is a feature vector depending on pairs $(h_j, h_k)$, and $\omega(\cdot)$ is a feature vector that does not depend on the values of the hidden variables. The details of these feature vectors are described in the following.

Unary potential $\alpha \cdot \phi(x_j, h_j)$: This potential function models the compatibility between $x_j$ and the part label $h_j$, i.e., how likely the patch $x_j$ is to be labeled as part $h_j$. It is parameterized as
$$\alpha \cdot \phi(x_j, h_j) = \sum_{c \in \mathcal{H}} \alpha_c \cdot 1_{\{h_j = c\}} \cdot [f^a(x_j)\ f^s(x_j)] \qquad (3)$$
where we use $[f^a(x_j)\ f^s(x_j)]$ to denote the concatenation of the two vectors $f^a(x_j)$ and $f^s(x_j)$. $f^a(x_j)$ is a feature vector describing the appearance of the patch $x_j$. In our case, $f^a(x_j)$ is simply the concatenation of the four channels of the motion features at patch $x_j$, i.e., $f^a(x_j) = [Fb_x^+(x_j)\ Fb_x^-(x_j)\ Fb_y^+(x_j)\ Fb_y^-(x_j)]$. $f^s(x_j)$ is a feature vector describing the spatial location of the patch $x_j$. We discretize the image locations into $l$ bins, and $f^s(x_j)$ is a length-$l$ vector of all zeros with a single one for the bin occupied by $x_j$. The parameter $\alpha_c$ can be interpreted as a measure of the compatibility between the feature vector $[f^a(x_j)\ f^s(x_j)]$ and the part label $h_j = c$. The parameter $\alpha$ is simply the concatenation of $\alpha_c$ for all $c \in \mathcal{H}$.
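As a concrete illustration of the unary feature vector $[f^a(x_j)\ f^s(x_j)]$, the sketch below builds it from the four-channel motion descriptor. The $4 \times 4$ grid used for the $l$ spatial bins is an illustrative assumption; the $5 \times 5$ patch size follows the patch initialization described in Sec. 3.

```python
import numpy as np

def patch_feature(descriptor, center, patch_size=5, grid=(4, 4)):
    """Build [f_a(x_j) f_s(x_j)] for one patch.

    descriptor : H x W x 4 blurred motion channels (Fb_x+, Fb_x-, Fb_y+, Fb_y-)
    center     : (row, col) of the patch center; the patch is assumed to lie
                 fully inside the image
    patch_size : side length of the square patch (5 x 5, as in Sec. 3)
    grid       : spatial binning of the image into l = grid[0] * grid[1] cells
                 (an assumption; the paper does not specify l)"""
    H, W, _ = descriptor.shape
    r, c = center
    half = patch_size // 2

    # f_a: concatenation of the four motion channels inside the patch.
    patch = descriptor[r - half:r + half + 1, c - half:c + half + 1, :]
    f_a = patch.reshape(-1)

    # f_s: one-hot indicator of the spatial bin occupied by the patch center.
    n_bins = grid[0] * grid[1]
    row_bin = min(r * grid[0] // H, grid[0] - 1)
    col_bin = min(c * grid[1] // W, grid[1] - 1)
    f_s = np.zeros(n_bins)
    f_s[row_bin * grid[1] + col_bin] = 1.0

    return np.concatenate([f_a, f_s])
```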
Unary potential $\beta \cdot \varphi(y, h_j)$: This potential function models the compatibility between the class label $y$ and the part label $h_j$, i.e., how likely an image with class label $y$ is to contain a patch with part label $h_j$. It is parameterized as
$$\beta \cdot \varphi(y, h_j) = \sum_{a \in \mathcal{Y}} \sum_{b \in \mathcal{H}} \beta_{a,b} \cdot 1_{\{y = a\}} \cdot 1_{\{h_j = b\}} \qquad (4)$$
where $\beta_{a,b}$ indicates the compatibility between $y = a$ and $h_j = b$.

Pairwise potential $\gamma \cdot \psi(y, h_j, h_k)$: This pairwise potential function models the compatibility between the class label $y$ and a pair of part labels $(h_j, h_k)$, i.e., how likely an image with class label $y$ is to contain a pair of patches with part labels $h_j$ and $h_k$, where $(j, k) \in E$ corresponds to an edge in the graph. It is parameterized as
$$\gamma \cdot \psi(y, h_j, h_k) = \sum_{a \in \mathcal{Y}} \sum_{b \in \mathcal{H}} \sum_{c \in \mathcal{H}} \gamma_{a,b,c} \cdot 1_{\{y = a\}} \cdot 1_{\{h_j = b\}} \cdot 1_{\{h_k = c\}} \qquad (5)$$
where $\gamma_{a,b,c}$ indicates the compatibility of $y = a$, $h_j = b$ and $h_k = c$ for the edge $(j, k) \in E$.

Root model $\eta \cdot \omega(y, x)$: The root model is a potential function that models the compatibility of the class label $y$ and the large-scale global feature of the whole image. It is parameterized as
$$\eta \cdot \omega(y, x) = \sum_{a \in \mathcal{Y}} \eta_a \cdot 1_{\{y = a\}} \cdot g(x) \qquad (6)$$
where $g(x)$ is a feature vector describing the appearance of the whole image. In our case, $g(x)$ is the concatenation of all four channels of the motion features in the image, i.e., $g(x) = [Fb_x^+\ Fb_x^-\ Fb_y^+\ Fb_y^-]$. $\eta_a$ can be interpreted as a root filter that measures the compatibility between the appearance of an image $g(x)$ and a class label $y = a$, and $\eta$ is simply the concatenation of $\eta_a$ for all $a \in \mathcal{Y}$.

The parameterization of $\Psi(y, h, x)$ is similar to that used in object recognition [16], but there are two important differences. First, our definition of the unary potential function $\phi(\cdot)$ encodes both appearance and spatial information of the patches. Second, we have a potential function $\omega(\cdot)$ describing the large-scale appearance of the whole image. The representation in Quattoni et al. [16] only models local patches extracted from the image. This may be appropriate for object recognition, but for human action recognition it is not clear that local patches are sufficiently informative. We will demonstrate this experimentally in Sec. 4.

3 Learning and Inference

The model parameters are learned by maximizing the conditional log-likelihood on the training images:
$$\theta^* = \arg\max_\theta L(\theta) = \arg\max_\theta \sum_t \log p(y^t \mid x^t; \theta) = \arg\max_\theta \sum_t \log \sum_h p(y^t, h \mid x^t; \theta) \qquad (7)$$
The objective function $L(\theta)$ in Quattoni et al. [16] also includes a regularization term $-\frac{1}{2\sigma^2}\|\theta\|^2$. In our experiments, we find that the regularization does not seem to have much effect on the final results, so we use the un-regularized version. Different from the conditional random field (CRF) [11], the objective function $L(\theta)$ of the hCRF is not concave, due to the hidden variables $h$. But we can still use gradient ascent to find a $\theta$ that is locally optimal. The gradient of the log-likelihood $L^t(\theta) = \log p(y^t \mid x^t; \theta)$ with respect to the $t$-th training image $(x^t, y^t)$ can be calculated as:
$$\frac{\partial L^t(\theta)}{\partial \alpha} = \sum_{j \in V} \left[ E_{p(h_j \mid y^t, x^t; \theta)}\, \phi(x^t_j, h_j) - E_{p(h_j, y \mid x^t; \theta)}\, \phi(x^t_j, h_j) \right]$$
$$\frac{\partial L^t(\theta)}{\partial \beta} = \sum_{j \in V} \left[ E_{p(h_j \mid y^t, x^t; \theta)}\, \varphi(y^t, h_j) - E_{p(h_j, y \mid x^t; \theta)}\, \varphi(y, h_j) \right]$$
$$\frac{\partial L^t(\theta)}{\partial \gamma} = \sum_{(j,k) \in E} \left[ E_{p(h_j, h_k \mid y^t, x^t; \theta)}\, \psi(y^t, h_j, h_k) - E_{p(h_j, h_k, y \mid x^t; \theta)}\, \psi(y, h_j, h_k) \right]$$
$$\frac{\partial L^t(\theta)}{\partial \eta} = \omega(y^t, x^t) - E_{p(y \mid x^t; \theta)}\, \omega(y, x^t) \qquad (8)$$
Assuming the edges $E$ form a tree, the expectations in Eq. 8 can be calculated in $O(|\mathcal{Y}||E||\mathcal{H}|^2)$ time using belief propagation.
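To make Eqs. (1) and (2) concrete, the sketch below evaluates the potential $\Psi$ and the class posterior $p(y \mid x; \theta)$ by brute-force enumeration over $\mathcal{H}^m$. This is only feasible for tiny $m$ and $|\mathcal{H}|$ and is meant purely as a reference (e.g., for checking gradients); the model itself relies on tree-structured belief propagation as described above. All array shapes and names are assumptions for illustration.

```python
import itertools
import numpy as np

def potential(y, h, patch_feats, g_x, edges, theta):
    """Psi(y, h, x; theta) from Eq. (2). theta = (alpha, beta, gamma, eta):
    alpha[c]       : weights over [f_a f_s] for part c
    beta[a, b]     : class/part compatibility
    gamma[a, b, c] : class/part-pair compatibility per edge
    eta[a]         : root filter weights over g(x)"""
    alpha, beta, gamma, eta = theta
    score = sum(alpha[h[j]] @ patch_feats[j] for j in range(len(h)))  # unary alpha
    score += sum(beta[y, h[j]] for j in range(len(h)))                # unary beta
    score += sum(gamma[y, h[j], h[k]] for (j, k) in edges)            # pairwise gamma
    score += eta[y] @ g_x                                             # root model
    return score

def class_posterior(patch_feats, g_x, edges, theta, n_classes, n_parts):
    """p(y | x; theta) from Eq. (1), enumerating all h in H^m (tiny models only)."""
    m = len(patch_feats)
    scores = np.array([
        [potential(y, h, patch_feats, g_x, edges, theta)
         for h in itertools.product(range(n_parts), repeat=m)]
        for y in range(n_classes)])          # |Y| x |H|^m
    scores -= scores.max()                   # numerical stability
    joint = np.exp(scores)
    return joint.sum(axis=1) / joint.sum()   # marginalize out h
```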
Now we describe several details of how the above ideas are implemented.

Learning the root filter $\eta$: Given a set of training images $(x^t, y^t)$, we first learn the root filter $\eta$ by solving the following optimization problem:
$$\eta^* = \arg\max_\eta \sum_t \log L(y^t \mid x^t; \eta) = \arg\max_\eta \sum_t \log \frac{\exp(\eta \cdot \omega(y^t, x^t))}{\sum_{y \in \mathcal{Y}} \exp(\eta \cdot \omega(y, x^t))} \qquad (9)$$
In other words, $\eta^*$ is learned by only considering the feature vector $\omega(\cdot)$. We then use $\eta^*$ as the starting point for $\eta$ in the gradient ascent (Eq. 8). The other parameters $\alpha$, $\beta$, $\gamma$ are initialized randomly.

Patch initialization: We use a simple heuristic, similar to that used in [6], to initialize ten salient patches on every training image from the root filter trained above. For each training image $I$ with class label $a$, we apply the root filter $\eta_a$ to $I$, then select a rectangular region of size $5 \times 5$ in the image that has the most positive energy. We zero out the weights in this region and repeat until ten patches are selected (see the sketch at the end of this section). Figure 4(a) shows examples of the patches found in some images. The tree $G = (V, E)$ is formed by running a minimum spanning tree algorithm over the ten patches.

Inference: During testing, we do not know the class label of a given test image, so we cannot use the patch initialization described above, since we do not know which root filter to use. Instead, we run the root filters of all the classes on a test image, then calculate the probabilities of all possible instantiations of patches under our learned model, and classify the image by picking the class label that gives the maximum of these probabilities. In other words, for a testing image with motion descriptor $x$, we first obtain $|\mathcal{Y}|$ instances $\{x^{(1)}, x^{(2)}, ..., x^{(|\mathcal{Y}|)}\}$, where each $x^{(k)}$ is obtained by initializing the patches on $x$ using the root filter $\eta_k$. The final class label $y^*$ of $x$ is obtained as $y^* = \arg\max_y \max\{p(y \mid x^{(1)}; \theta^*),\ p(y \mid x^{(2)}; \theta^*),\ ...,\ p(y \mid x^{(|\mathcal{Y}|)}; \theta^*)\}$.
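As referenced above, the greedy patch-selection heuristic can be sketched as follows. This is a minimal illustration that assumes the per-pixel root-filter response has already been aggregated over the four motion channels into a single response map; the exact energy definition and tie-breaking are not specified in the paper.

```python
import numpy as np

def init_patches(response, n_patches=10, patch_size=5):
    """Greedily pick patch centers from a root-filter response map.

    response : H x W array of per-pixel root-filter energies for the image's
               class (assumed precomputed). Returns a list of (row, col) centers."""
    resp = response.copy()
    half = patch_size // 2
    centers = []
    for _ in range(n_patches):
        # Sum the response inside every patch_size x patch_size window and
        # take the window with the most positive energy.
        best, best_rc = -np.inf, None
        for r in range(half, resp.shape[0] - half):
            for c in range(half, resp.shape[1] - half):
                e = resp[r - half:r + half + 1, c - half:c + half + 1].sum()
                if e > best:
                    best, best_rc = e, (r, c)
        centers.append(best_rc)
        # Zero out the selected region so later patches do not reuse its energy.
        r, c = best_rc
        resp[r - half:r + half + 1, c - half:c + half + 1] = 0.0
    return centers
```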
4 Experiments

We test our algorithm on two publicly available datasets that have been widely used in action recognition: the Weizmann human action dataset [2] and the KTH human motion dataset [17]. Performance on these benchmarks is saturating: state-of-the-art approaches achieve near-perfect results. We show that our method achieves results comparable to the state-of-the-art, and more importantly that our extended hCRF model significantly outperforms a direct application of the original hCRF model [16].

Weizmann dataset: The Weizmann human action dataset contains 83 video sequences showing nine different people, each performing nine different actions: running, walking, jumping-jack, jumping-forward-on-two-legs, jumping-in-place-on-two-legs, galloping-sideways, waving-two-hands, waving-one-hand, and bending. We track and stabilize the figures using the background subtraction masks that come with this dataset. We randomly choose the videos of five subjects as the training set, and the videos of the remaining four subjects as the test set. We learn three hCRF models with different numbers of possible part labels, $|\mathcal{H}| = 6, 10, 20$.

Our model classifies every frame in a video sequence (i.e., per-frame classification), but we can also obtain the class label for the whole video sequence by majority voting over the labels of its frames (i.e., per-video classification). We show the confusion matrices with $|\mathcal{H}| = 10$ for both per-frame and per-video classification in Fig. 3.

Figure 3: Confusion matrices of classification results on the Weizmann dataset (left: frame-by-frame classification; right: video classification). Horizontal rows are ground truths, and vertical columns are predictions.

We compare our system to two baseline methods. The first baseline (root model) only uses the root filter $\eta \cdot \omega(y, x)$, which is simply a discriminative version of Efros et al. [5]. The second baseline (local hCRF) is a direct application of the original hCRF model [16]. It is similar to our model, but without the root filter $\eta \cdot \omega(y, x)$, i.e., the local hCRF only uses the root filter to initialize the salient patches, but does not use it in the final model. The comparative results are shown in Table 1. Our approach significantly outperforms the two baseline methods.

Table 1: Comparison of two baseline systems with our approach on the Weizmann dataset.

method                    per-frame  per-video
root model                  0.7470     0.8889
local hCRF, |H| = 6         0.5722     0.5556
local hCRF, |H| = 10        0.6656     0.6944
local hCRF, |H| = 20        0.6383     0.6111
our approach, |H| = 6       0.8682     0.9167
our approach, |H| = 10      0.9029     0.9722
our approach, |H| = 20      0.8557     0.9444
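The per-video numbers in Table 1 are obtained by majority voting over the per-frame predictions. A minimal sketch of this voting step, assuming the per-frame labels have already been predicted by the model:

```python
from collections import Counter

def per_video_label(frame_labels):
    """Majority vote over the predicted class labels of a video's frames."""
    return Counter(frame_labels).most_common(1)[0][0]

# e.g. per_video_label(["run", "run", "walk", "run"]) -> "run"
```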
We also compare our results (with $|\mathcal{H}| = 10$) to previous work in Table 2. Note that [2] classifies space-time cubes, so it is not clear how it can be compared with methods that classify frames or videos. Our result is significantly better than [13] and comparable to [8], although we acknowledge that the comparison is not completely fair, since [13] does not use any tracking or background subtraction.

Table 2: Comparison of classification accuracy with previous work on the Weizmann dataset.

method                   per-frame(%)  per-video(%)  per-cube(%)
Our method                   90.3          97.2          N/A
Jhuang et al. [8]            N/A           98.8          N/A
Niebles & Fei-Fei [13]       55            72.8          N/A
Blank et al. [2]             N/A           N/A           99.64

We visualize the learned parts in Fig. 4(a). Each patch is represented by a color that corresponds to the most likely part label of that patch. We also visualize the root filters applied on these images in Fig. 4(b).

Figure 4: (a) Visualization of the learned parts. Patches are colored according to their most likely part labels, with each color corresponding to a part label. Some interesting observations can be made. For example, the part label represented by red seems to correspond to the "moving down" patterns mostly observed in the "bending" action, and the part label represented by green seems to correspond to the motion patterns distinctive of "hand-waving" actions. (b) Visualization of root filters applied on these images. For each image with class label $c$, we apply the root filter $\eta_c$. The results show the filter responses aggregated over the four motion descriptor channels. Bright areas correspond to positive energies, i.e., areas that are discriminative for this class.

KTH dataset: The KTH human motion dataset contains six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. We first run an automatic preprocessing step to track and stabilize the video sequences, so that all the figures appear in the center of the field of view. We split the videos roughly equally into training/test sets and randomly sample 10 frames from each video. The confusion matrices (with $|\mathcal{H}| = 10$) for both per-frame and per-video classification are shown in Fig. 5.

Figure 5: Confusion matrices of classification results on the KTH dataset (left: frame-by-frame classification; right: video classification). Horizontal rows are ground truths, and vertical columns are predictions.

The comparison with the two baseline algorithms is summarized in Table 3. Again, our approach outperforms the two baseline systems.

Table 3: Comparison of two baseline systems with our approach on the KTH dataset.

method                    per-frame  per-video
root model                  0.5377     0.7339
local hCRF, |H| = 6         0.4749     0.5607
local hCRF, |H| = 10        0.4452     0.5814
local hCRF, |H| = 20        0.4282     0.5504
our approach, |H| = 6       0.6633     0.7855
our approach, |H| = 10      0.6698     0.8760
our approach, |H| = 20      0.6444     0.7512

The comparison with other approaches is summarized in Table 4. We emphasize that we do not attempt a direct comparison, since the methods listed in Table 4 differ in many aspects of their experimental setups (e.g., the split of training/test data, whether temporal smoothing is used, whether per-frame classification can be performed, whether tracking/background subtraction is used, whether the whole dataset is used, etc.), which makes it impossible to compare them directly. We provide the results only to show that our approach is comparable to the state-of-the-art.

Table 4: Comparison of per-video classification accuracy with previous approaches on the KTH dataset.

method                 accuracy(%)
Our method                87.60
Jhuang et al. [8]         91.70
Nowozin et al. [15]       87.04
Niebles et al. [14]       81.50
Dollár et al. [4]         81.17
Schüldt et al. [17]       71.72
Ke et al. [9]             62.96

5 Conclusion

We have presented a discriminatively learned part model for human action recognition. Unlike previous work [10], our model does not require manual specification of the parts.
Instead, the parts are initialized by a learned root filter. Our model combines both the large-scale features used in global templates and the local patch features used in bag-of-words models. Our experimental results show that our model is quite effective in recognizing actions, with results comparable to state-of-the-art approaches. In particular, we show that the combination of large-scale features and local patch features performs significantly better than using either of them alone.

References

[1] A. C. Berg, T. L. Berg, and J. Malik. Shape matching and object recognition using low distortion correspondence. In IEEE CVPR, 2005.
[2] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In IEEE ICCV, 2005.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE CVPR, 2005.
[4] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS Workshop, 2005.
[5] A. A. Efros, A. C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In IEEE ICCV, 2003.
[6] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In IEEE CVPR, 2008.
[7] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55-79, 2005.
[8] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for action recognition. In IEEE ICCV, 2007.
[9] Y. Ke, R. Sukthankar, and M. Hebert. Efficient visual event detection using volumetric features. In IEEE ICCV, 2005.
[10] Y. Ke, R. Sukthankar, and M. Hebert. Event detection in crowded videos. In IEEE ICCV, 2007.
[11] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[12] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. DARPA Image Understanding Workshop, 1981.
[13] J. C. Niebles and L. Fei-Fei. A hierarchical model of shape and appearance for human action classification. In IEEE CVPR, 2007.
[14] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. In BMVC, 2006.
[15] S. Nowozin, G. Bakir, and K. Tsuda. Discriminative subsequence mining for action classification. In IEEE ICCV, 2007.
[16] A. Quattoni, M. Collins, and T. Darrell. Conditional random fields for object recognition. In NIPS 17, 2005.
[17] C. Schüldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In IEEE ICPR, 2004.
[18] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their location in images. In IEEE ICCV, 2005.
[19] S. B. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian, and T. Darrell. Hidden conditional random fields for gesture recognition. In IEEE CVPR, 2006.