Large Margin Training for Hidden Markov Models with Partially Observed States

Trinh-Minh-Tri Do  (Trinh-Minh-Tri.Do@lip6.fr)
Thierry Artières  (Thierry.Artieres@lip6.fr)
LIP6, Université Pierre et Marie Curie, 104 avenue du président Kennedy, 75016, Paris, France

Abstract

Large margin learning of Continuous Density HMMs with a partially labeled dataset has been extensively studied in the speech and handwriting recognition fields. Yet, due to the non-convexity of the optimization problem, previous works usually rely on severe approximations, so that it is still an open problem. We propose a new learning algorithm that relies on non-convex optimization and bundle methods and allows tackling the original optimization problem as is. It is proved to converge to a solution with accuracy ε at a rate O(1/ε). We provide experimental results on speech and handwriting recognition that demonstrate the potential of the method.

1. Introduction

Hidden Markov Models (HMMs) have been widely used for automatic speech recognition and handwriting recognition. Continuous Density HMMs (CDHMMs) are particularly suited for dealing with the sequences of real-valued feature vectors that one gets after typical front-end processing in signal processing tasks. CDHMMs usually exploit Gaussian mixture models to describe the variability of observations in a state. HMM parameters are learnt with the Expectation-Maximization (EM) algorithm to maximize the joint likelihood of observation and hidden state sequences. Training is performed on a partially labeled training set, that is, a set of observation sequences together with the corresponding classes (classification case) or with the corresponding unit sequences (segmentation case). This is the usual setting in speech or handwriting recognition tasks, where one never gets the complete sequence of states corresponding to an observation sequence at training time. At test time, segmentation is performed through Viterbi decoding, which maps an observation sequence into a state sequence. Based on the underlying semantics of the states (e.g. passing through the three states of the left-right HMM corresponding to a particular phone means this phone has been recognized), the sequence of states translates into a sequence of labels (e.g. phones). This typical use of HMMs is very popular since it is both simple and efficient, and it scales well to large corpora.

However, this learning strategy does not focus on what we are primarily concerned with, namely minimizing the classification (or segmentation) error rate. Hence, a number of attempts have been made to develop discriminative learning methods for HMMs. First studies, in the speech recognition field, aimed at optimizing a discriminative criterion such as the Minimum Classification Error (MCE) (Juang & Katagiri, 1992) or the Maximum Mutual Information (MMI) criterion (Woodland & Povey, 2002). Recently, a promising direction has been explored with the development of margin-based methods for sequences (Taskar et al., 2004; Tsochantaridis et al., 2004). However, these works mainly deal with fully supervised learning, and work remains to be done to extend them to the learning of CDHMMs from partially labeled data.

Building on these seminal works, a few approaches have been proposed for large margin learning of HMMs, especially in the speech recognition community (Sha & Saul, 2007; Jiang & Li, 2007) (see (Yu & Deng, 2007) for a review). However, none of these works actually handles the whole problem of max-margin learning of HMM parameters in the standard partially labeled setting. (Jiang & Li, 2007) focuses only on correctly predicted examples near the decision boundary, while (Sha & Saul, 2007) mainly focuses on the fully supervised case, where the sequences of states corresponding to the training sequences are determined by a traditionally trained HMM system via Maximum Likelihood Estimation (MLE).
Here we propose a new method for discriminative learning of CDHMMs in this partially labeled setting. The main difficulty one encounters when formulating maximum margin learning of CDHMMs with partially labeled data lies in the non-convexity of the optimization problem. While (Sha & Saul, 2007) consider a simpler, convex problem, we build on non-convex optimization ideas and investigate the direct optimization of the non-convex objective function using bundle methods (Kiwiel, 1985; Hiriart-Urruty & Lemarechal, 1993). Our contributions are:

- A new fast optimization method that can handle non-convex, non-differentiable functions, with a theoretical analysis of its convergence rate.
- Experimental results showing the method is well adapted to large margin training of CDHMMs.

We first present the maximum margin training formalization in the general case of structured outputs and for HMM learning in particular. Then we show how such a learning problem reduces to a non-convex optimization problem that may be solved with a variant of bundle methods that we propose, and for which we provide a detailed convergence analysis. Finally, we provide experimental results on speech and on handwriting recognition that show first the performance of the classifiers and second the efficiency of the optimization method.

2. Large Margin Learning and HMMs

In this section, we address the problem of learning a model for structured outputs from partially labeled data. Although our method is quite general, we mainly focus here on the special case of learning a model for sequence segmentation from partially labeled training sequences, which fits our case study of maximum margin learning for HMMs.

2.1. Large Margin Learning for Structured Outputs with Partially Labeled Data

We consider a training set of K input-output pairs (x^1, y^1), ..., (x^K, y^K) ∈ X × Y, where x^i = (x^i_1, x^i_2, ..., x^i_T) ∈ (R^d)^T is an observation sequence and y^i = (y^i_1, y^i_2, ..., y^i_L) ∈ L^L is the corresponding label sequence (with L ≤ T). We consider hidden variables (z^1, z^2, ..., z^K), where z^i stands for the missing variables corresponding to the i-th training sample. For instance, if we wish to learn a HMM for speech recognition, z^i might be the full state sequence corresponding to the i-th speech signal x^i, while y^i is the corresponding sequence of phones.

We are interested in learning a discriminant function F : X × Y → R over input-output pairs which can be used to predict the output y for an input x:

h(x, w) = \arg\max_{y \in Y} F(x, y, w)   (1)

where w denotes the parameter vector to be learned. In the case of partially labeled data, the discriminant function F(x, y, w) may take various forms, such as F(x, y, w) = max_z g(x, y, z, w), where g(x, y, z, w) stands for an elementary discriminant function. For instance, g might be a linear function g(x, y, z, w) = ⟨Φ(x, y, z), w⟩ with Φ(x, y, z) a feature vector. Following previous works (Tsochantaridis et al., 2004; Taskar et al., 2004), learning an optimal h may be done by solving the following soft margin problem:

\min_{w, \xi} \frac{\lambda}{2}\|w\|^2 + \sum_i \xi_i
\quad \text{s.t.} \quad F(x^i, y^i, w) \ge F(x^i, y, w) + \Delta(y^i, y) - \xi_i \;\; \forall i, \forall y \ne y^i, \qquad \xi_i \ge 0 \;\; \forall i   (2)

where the Δ(y^i, y) terms allow taking into account differences between labellings (cf. (Taskar et al., 2004)). The equivalent unconstrained problem is:

\min_w f(w) = \frac{\lambda}{2}\|w\|^2 + \sum_i \max_y \big( F(x^i, y, w) + \Delta(y^i, y) - F(x^i, y^i, w) \big)   (3)

where f(w) is the primal objective function. A variant consists in using a softmax instead of the max over y in (3), replacing each term by:

\max\Big( 0, \; \log \sum_{y \ne y^i} e^{F(x^i, y, w) + \Delta(y^i, y)} - F(x^i, y^i, w) \Big)   (4)
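To make the objective of Eq. (3) concrete, the following minimal Python sketch evaluates it given a generic score function and a loss-augmented decoder. The helpers score_fn, loss_aug_decode and delta are hypothetical placeholders standing for F(x, y, w), the loss-augmented argmax over y, and Δ; they are not part of the paper's code.

```python
import numpy as np

def structured_hinge_objective(w, data, score_fn, loss_aug_decode, delta, lam=1e-2):
    """Sketch of the primal objective f(w) of Eq. (3).

    Hypothetical helpers (assumptions, not the authors' implementation):
      score_fn(x, y, w)         -> F(x, y, w), e.g. a max over hidden paths
      loss_aug_decode(x, yi, w) -> argmax_y  F(x, y, w) + Delta(yi, y)
      delta(yi, y)              -> label-sequence loss, e.g. Hamming distance
    """
    obj = 0.5 * lam * np.dot(w, w)                       # (lambda/2) ||w||^2
    for x, y_true in data:
        y_hat = loss_aug_decode(x, y_true, w)            # most violating labelling
        violation = (score_fn(x, y_hat, w) + delta(y_true, y_hat)
                     - score_fn(x, y_true, w))
        obj += max(0.0, violation)                       # hinge term of Eq. (3)
    return obj
```

The max(0, ·) is redundant when the decoder searches over all y (the term is zero at y = y^i since Δ(y^i, y^i) = 0), but it keeps the sketch correct when y^i is excluded from the search.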
2.2. Application to CDHMMs

We consider standard CDHMMs with a Gaussian mixture as probability density function (pdf) within each state. The pdf over an observation frame x ∈ R^d within a state s is defined as a Gaussian mixture:

p(x|s) = \sum_{m=1}^{M} p_{s,m} \, N_{s,m}(x)   (5)

where p_{s,m} stands for the prior probability of the m-th mixture component in state s, and N_{s,m} stands for the Gaussian distribution with mean vector μ_{s,m} and covariance matrix Σ_{s,m}:

N_{s,m}(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_{s,m}|^{1/2}} \, e^{-\frac{1}{2}(x - \mu_{s,m})^\top \Sigma_{s,m}^{-1} (x - \mu_{s,m})}

An N-state HMM is defined by a set of parameters w = {π, A, μ, Σ}. Using standard notations, π stands for the initial state probabilities, A for the transition probabilities, μ for all mean vectors, μ = {μ_{s,m} | m ∈ [1, M], s ∈ [1, N]}, and Σ for all covariance matrices, Σ = {Σ_{s,m} | m ∈ [1, M], s ∈ [1, N]}.

The joint probability p(x, y|w) of an input-output pair x = (x_1, ..., x_T) and y = (y_1, ..., y_L) may be computed by summation over two sets of hidden variables: the sequence of states (called the segmentation) s = (s_1, ..., s_T), and the sequence m = (m_1, ..., m_T) of indices of the Gaussian components responsible for the observations of x. Here s runs over the set S(y) of segmentations matching y:

p(x, y|w) = \sum_{s \in S(y)} \sum_{m} p(x, y, s, m|w)   (6)

where, using the notation p_{s,m}(x) = p_{s,m} N_{s,m}(x):

p(x, y, s, m|w) = \pi_{s_1} \, p_{s_1,m_1}(x_1) \prod_{t=2}^{T} a_{s_{t-1},s_t} \, p_{s_t,m_t}(x_t)

In practice one often uses the approximation p(x, y|w) ≈ max_s p(x, y, s|w), or even:

p(x, y|w) \approx \max_{s,m} p(x, y, s, m|w)   (7)

HMM max-margin training can easily be cast in the formalism of the previous section by defining the following score function and considering z = (s, m):

g(x, y, s, m, w) = \log p(x, y, s, m|w), \qquad F(x, y, w) = \max_{s \in S(y), m} g(x, y, s, m, w)   (8)

or the following softmax variant:

F^{soft}(x, y, w) = \log \sum_{s \in S(y), m} e^{g(x, y, s, m, w)}   (9)
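As an illustration of Eq. (8), the sketch below computes F(x, y, w) with a constrained Viterbi pass. It makes simplifying assumptions not stated in the paper: one state per label, the mixture component already maximised inside the emission scores, and S(y) taken as the set of monotone alignments of the T frames onto the L labels of y.

```python
import numpy as np

def score_F(log_emis, log_trans, log_init, y):
    """Sketch of F(x, y, w) = max_{s in S(y), m} log p(x, y, s, m | w)  (Eq. 8).

    Assumed inputs (one state per label, component max folded into the emissions):
      log_emis[t, s]   : log max_m p_{s,m}(x_t)   (T x N)
      log_trans[s, s'] : log a_{s,s'}             (N x N)
      log_init[s]      : log pi_s                 (N,)
      y                : list of state indices, length L
    """
    T, L = log_emis.shape[0], len(y)
    NEG = -np.inf
    dp = np.full((T, L), NEG)                    # dp[t, l]: best path ending in label l at frame t
    dp[0, 0] = log_init[y[0]] + log_emis[0, y[0]]
    for t in range(1, T):
        for l in range(L):
            stay = dp[t - 1, l] + log_trans[y[l], y[l]]
            move = dp[t - 1, l - 1] + log_trans[y[l - 1], y[l]] if l > 0 else NEG
            dp[t, l] = max(stay, move) + log_emis[t, y[l]]
    return dp[T - 1, L - 1]                      # all L labels must be consumed
```

The loss-augmented decode needed in Eq. (3) follows the same dynamic program, with Δ folded into the per-frame scores when Δ is a Hamming distance.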
2.3. Solving the Large Margin HMM Training Problem

In the context of our study, solving the unconstrained problem (3) raises a major difficulty: f(w) is inherently non-convex. For instance, if F(x, y, w) = max_z g(x, y, z, w) and g is linear, then F(x, y, w) is a convex function, yet f(w) is not convex because of the −F(x^i, y^i, w) terms, which are concave. The constrained form (2) raises problems too, since there exists no efficient algorithm for solving such a non-convex problem with an exponential number of constraints.

Some previous works overcome the difficulty by considering a smaller-sized problem, e.g. through a heuristic selection (based on Viterbi-like decoding and N-best lists) of a subset of training samples together with a subset of candidate labellings (see (Yu & Deng, 2007) for a review). This is a first step, but further approximations are needed to solve the problem. For instance, (Jiang & Li, 2007) also relies on an additional convex relaxation before solving the problem with semi-definite programming. In the end, the impact of these successive approximations and simplifications is difficult to measure, and the robustness of the method is questionable.

A more direct solution has been proposed in (Sha, 2006), based on primal optimization (Eq. (3)) and convex relaxation ideas. The authors overcome the non-convexity of f by linearizing the function g (as in Eq. (8)) and by simplifying the concave terms −F(x^i, y^i, w) in the following way: they remove the maximization in the computation of F(x^i, y^i, w) by using an oracle (a generative system learnt with MLE) that provides a guess (s^i_{gs}, m^i_{gs}) for argmax_{s,m} g(x^i, y^i, s, m, w). Thanks to this oracle trick, the objective function becomes convex in w:

f^o(w) = \frac{\lambda}{2}\|w\|^2 + \sum_i \max_y \big( F(x^i, y, w) + \Delta(y^i, y) - g(x^i, y^i, s^i_{gs}, m^i_{gs}, w) \big)   (10)

In the end, the quality of the solution is unclear, since there is no guarantee that such an oracle provides relevant information for learning the discriminant system. This limitation has been stressed in (Sha, 2006): the more complex the HMM topology is (i.e. the stronger the oracle approximation), the smaller the benefit of discriminative training.

3. Non-Convex Optimization Algorithm

We consider the optimization problem below, which includes the maximum margin HMM learning problem as a special case (cf. Eq. (3), (8)):

\min_w f(w) = \frac{\lambda}{2}\|w\|^2 + R(w)   (11)

where R(w) is an upper bound of the empirical risk that we want to minimize. Our algorithm is inspired by a recent variant of bundle methods for minimizing convex regularized risks in machine learning (Teo et al., 2007; Joachims, 2006). This variant has two main advantages: a very good convergence rate, and a relevant stopping criterion, namely the gap between the objective and the minimum of the approximated problem. Our algorithm inherits both advantages, but is also able to handle non-convex problems.

The approach described in (Teo et al., 2007) solves the general optimization problem in Eq. (11) for any R(w), provided that it is convex. We briefly describe the method in the case where R(w) is convex. It relies on the cutting plane technique, where a cutting plane of R(w) at w' is defined as:

c_{w'}(w) = \langle a_{w'}, w \rangle + b_{w'} \quad \text{s.t.} \quad c_{w'}(w') = R(w') \;\text{ and }\; \nabla_w c_{w'}(w') \in \partial_w R(w')   (12)

Here a_{w'} ∈ ∂_w R(w') is a subgradient of R at w' and b_{w'} = R(w') − ⟨a_{w'}, w'⟩. Such a cutting plane c_{w'}(w) is a linear lower bound of the risk R(w), and (λ/2)‖w‖² + c_{w'}(w) is a quadratic lower bound of f(w).

The bundle method iteratively builds an increasingly accurate piecewise quadratic lower bound of the objective function. Starting from an initial (e.g. random) solution w_1, one first determines the solution w_2 minimizing the approximation problem g_1(w) = (λ/2)‖w‖² + c_{w_1}(w). Then a new cutting plane c_{w_2} is built, and one looks for the minimum w_3 of the more accurate approximation problem g_2(w) = (λ/2)‖w‖² + max(c_{w_1}(w), c_{w_2}(w)). More generally, at iteration t one adds a new cutting plane at the current solution w_t and looks for the solution w_{t+1} minimizing the new approximated problem:

w_{t+1} = \arg\min_w g_t(w), \quad v_t = \min_w g_t(w), \quad \text{with} \quad g_t(w) = \frac{\lambda}{2}\|w\|^2 + \max_{j=1..t} c_j(w)   (13)

where, in the convex case, c_j(w) ≡ c_{w_j}(w) is defined as in Eq. (12). We use the notation c_j(w) rather than c_{w_j}(w) to stress that c_j(w) is built from the solution w_j at iteration j but does not necessarily coincide with c_{w_j}(w), as we will see below in the non-convex case. At iteration t, the minimization of the approximated problem in Eq. (13) can be solved by quadratic programming. Note that, by construction of the quadratic lower bound g_t(w), the minimum of the approximated problem v_t = g_t(w_{t+1}) increases at every iteration. It may be shown that the gap between the minimum observed value of the objective function and the minimum of the approximated problem, f(w) − g_t(w_{t+1}), decreases towards zero, and that O(1/ε) iterations are required to reach a gap smaller than ε (Teo et al., 2007).
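The following sketch shows one such convex bundle step (Eq. (13)) solved through its dual, given later in Eq. (21). It is a minimal illustration under our own choices (SciPy's SLSQP solver for the small simplex-constrained QP), not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def bundle_step(A, B, lam):
    """One convex bundle iteration (Eq. 13): the cutting planes c_j(w) = <a_j, w> + b_j
    are stored as rows of A (with offsets in B); minimise g_t(w) = lam/2 ||w||^2 + max_j c_j(w)
    through its dual (Eq. 21): max_alpha  -||A^T alpha||^2 / (2 lam) + <B, alpha>,
    with alpha >= 0 and sum(alpha) = 1.  A sketch, assuming SciPy is acceptable here."""
    t = A.shape[0]

    def neg_dual(alpha):
        u = A.T @ alpha
        return u @ u / (2.0 * lam) - B @ alpha

    res = minimize(neg_dual, np.full(t, 1.0 / t), method="SLSQP",
                   bounds=[(0.0, 1.0)] * t,
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
    alpha = res.x
    w_next = -A.T @ alpha / lam      # primal solution  w_{t+1} = -A_t alpha_t / lam
    v_t = -neg_dual(alpha)           # minimum of the approximated problem
    return w_next, v_t
```

In practice a dedicated QP solver (such as the STPR toolbox used by the authors, see Section 3) would replace the generic SLSQP call.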
Handling a non-convex risk function. Unfortunately, the approach in (Teo et al., 2007) cannot be used for non-convex problems, since the minimization of the approximated problem may lead to a local maximum w. Figure 1b illustrates a situation where the method designed for the convex case yields such an improper solution: given the two cutting planes computed at w_1 and w_2 after two iterations, minimizing the approximated problem gives w_3, which is a local maximum. The problem comes from the fact that a cutting plane of a non-convex function is not necessarily a lower bound, which leads to a poor approximation (see Figure 1a). Indeed, the linearization error R(w) − c_{w'}(w) of a cutting plane c_{w'} at a point w may be negative, meaning that the function is overestimated at that point. In the following, we say in such a case that there is a conflict between c_{w'} and w.

[Figure 1. Cutting planes and linearization errors.]

The standard solution to such conflicts is to lower any new cutting plane c_{w_t}(w) = ⟨a_{w_t}, w⟩ + b_{w_t} that gives a negative linearization error at one of the previous solutions w_j (Kiwiel, 1985). This may be done by tuning the offset b_{w_t}. However, this approach no longer guarantees the improvement of v_t, since changing the cutting plane parameters changes the approximated problem. This is why the convergence rate of standard bundle methods is not guaranteed, and is usually slow in practice.

Instead, our algorithm handles a non-convex risk while preserving the good O(1/ε) convergence rate; it is described in Algorithm 1. A first key idea is that, rather than solving all conflicts between a new cutting plane and all previous solutions, we focus on the conflict of the new cutting plane c_{w_t} with respect to the best observed solution so far, $\bar{w}_t$ (see line 4). Fundamentally, we allow overestimation at other previous solutions and focus on the area of the best observed solution, by considering that there is a conflict if and only if condition (14) is not satisfied:

c_{w_t}(\bar{w}_t) = \langle a_{w_t}, \bar{w}_t \rangle + b_{w_t} \le R(\bar{w}_t)   (14)

A second key idea lies in the modification procedure applied to c_{w_t} in case of conflict, where c_{w_t} is adjusted into c_t. Finally, the next solution w_{t+1} is found by minimizing the approximated problem of Eq. (13) as in the convex case, except that c_j now denotes either the "true" cutting plane c_{w_j} computed at iteration j (cf. Eq. (12)), or its modified version if there was a conflict at iteration j.

Algorithm 1  Non-Convex Cutting Plane
1: Input: w_1, λ, ε
2: for t = 1 to ∞ do
3:   Define c_{w_t} according to Eq. (12)
4:   $\bar{w}_t$ = argmin_{w_j ∈ {w_1, ..., w_t}} f(w_j)
5:   if condition (14) is not satisfied then
6:     c_t = SolveConflict(w_t, $\bar{w}_t$, c_{w_t})
7:   else c_t = c_{w_t}
8:   Compute w_{t+1} and v_t according to Eq. (13)
9:   gap_t = f($\bar{w}_t$) − v_t
10:  if gap_t ≤ ε then return $\bar{w}_t$
11: end for

Algorithm 2  SolveConflict
1: Input: w_t, $\bar{w}_t$, c_{w_t} with parameters (a_{w_t}, b_{w_t})
2: Output: c_t with parameters (a_t, b_t)
3: Compute L, U according to (16)
4: if L ≤ U then [a_t, b_t] = [a_{w_t}, L]
5: else [a_t, b_t] = [−λ \bar{w}_t, \; f(\bar{w}_t) − (λ/2)‖w_t‖² − ⟨a_t, w_t⟩]

Solving conflicts. In case of conflict, we look for a modified cutting plane that satisfies both Eq. (14) and Eq. (15):

\frac{\lambda}{2}\|w_t\|^2 + c_t(w_t) \ge f(\bar{w}_t)   (15)

whose interest will appear in the convergence analysis below. Note that c_{w_t} always satisfies (15) by definition of $\bar{w}_t$ (since (λ/2)‖w_t‖² + c_{w_t}(w_t) = f(w_t) ≥ f($\bar{w}_t$)), so that c_t also satisfies (15) when there is no conflict. Algorithm 2 guarantees that the new cutting plane c_t with parameters a_t and b_t satisfies conditions (14) and (15). It first tries to solve the conflict by tuning b_t while keeping a_t = a_{w_t}. The two conditions (14) and (15) may then be rewritten as:

b_t \le R(\bar{w}_t) - \langle a_{w_t}, \bar{w}_t \rangle = U, \qquad b_t \ge f(\bar{w}_t) - \frac{\lambda}{2}\|w_t\|^2 - \langle a_{w_t}, w_t \rangle = L   (16)

which defines an upper bound U and a lower bound L on b_t. If L ≤ U, any value in [L, U] works (in our implementation we set b_t = L). It may happen that L > U; then tuning b_t is not enough, and we must adjust both b_t and the normal vector a_t to make sure that the conflict is solved (see line 5 of Algorithm 2). The chosen values of a_t and b_t define a new cutting plane that trivially satisfies condition (15) (with equality). It also satisfies condition (14), as we show now:

\langle a_t, \bar{w}_t \rangle + b_t = \langle a_t, \bar{w}_t \rangle + f(\bar{w}_t) - \frac{\lambda}{2}\|w_t\|^2 - \langle a_t, w_t \rangle
 = R(\bar{w}_t) + \Big\langle a_t + \frac{\lambda}{2}(\bar{w}_t + w_t), \; \bar{w}_t - w_t \Big\rangle   (17)

where we used the definition of the objective function, f(\bar{w}_t) = (λ/2)‖\bar{w}_t‖² + R(\bar{w}_t). Substituting a_t = −λ \bar{w}_t (cf. line 5), we obtain:

\langle a_t, \bar{w}_t \rangle + b_t = R(\bar{w}_t) - \frac{\lambda}{2}\|\bar{w}_t - w_t\|^2 \le R(\bar{w}_t)   (18)
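The sketch below puts Algorithms 1 and 2 together in Python, reusing the bundle_step sketch given earlier. The helper risk_and_subgrad is a hypothetical callable returning R(w) and a subgradient; this is a toy illustration of the procedure, not the paper's code.

```python
import numpy as np

def solve_conflict(a_wt, w_t, w_bar, f_bar, R_bar, lam):
    """Sketch of Algorithm 2: adjust the plane (a, b) so that (14) and (15) hold."""
    U = R_bar - a_wt @ w_bar                             # upper bound on b_t, Eq. (16)
    L = f_bar - 0.5 * lam * (w_t @ w_t) - a_wt @ w_t     # lower bound on b_t, Eq. (16)
    if L <= U:
        return a_wt, L                                   # tune the offset only
    a_t = -lam * w_bar                                   # otherwise tilt the plane too (line 5)
    b_t = f_bar - 0.5 * lam * (w_t @ w_t) - a_t @ w_t
    return a_t, b_t

def non_convex_cutting_plane(w1, risk_and_subgrad, lam, eps, max_iter=200):
    """Sketch of Algorithm 1.  risk_and_subgrad(w) -> (R(w), subgradient of R at w);
    bundle_step is the convex QP sketch given earlier."""
    planes_a, planes_b = [], []
    w_t = np.asarray(w1, dtype=float)
    best_w, best_f, best_R = None, np.inf, None
    for t in range(max_iter):
        R_t, a_t = risk_and_subgrad(w_t)
        f_t = 0.5 * lam * (w_t @ w_t) + R_t
        if f_t < best_f:                                  # line 4: best observed solution
            best_w, best_f, best_R = w_t.copy(), f_t, R_t
        b_t = R_t - a_t @ w_t                             # true cutting plane, Eq. (12)
        if a_t @ best_w + b_t > best_R:                   # conflict: condition (14) violated
            a_t, b_t = solve_conflict(a_t, w_t, best_w, best_f, best_R, lam)
        planes_a.append(a_t); planes_b.append(b_t)
        w_t, v_t = bundle_step(np.array(planes_a), np.array(planes_b), lam)  # Eq. (13)
        if best_f - v_t <= eps:                           # lines 9-10: gap_t <= eps
            break
    return best_w
```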
Convergence Analysis

Our algorithm improves over standard non-convex bundle methods in two ways. First, it generates a sequence w_t that converges to a solution w* with f(w*) ≤ f(w_t) for all t, which is not guaranteed with standard methods: these may generate (stationary) cluster points that are not necessarily better than previous solutions. Second, we provide below an upper bound on the convergence rate of our algorithm, which converges at rate O(1/ε); no such rate can be derived for standard bundle methods. Experimentally, once a moderate gap has been reached (which happens fast), no conflicts arise any more and our algorithm behaves like (Teo et al., 2007). One can then assume f(w) to be locally convex, which would make the algorithm converge to a local minimum.

We first recall a result from (Teo et al., 2007) which is needed in our proof (Lemma 3.1). Then Lemma 3.2 gives a lower bound on the decrease of the gap at every iteration. Finally, Theorem 3.1 proves that Algorithm 1 converges to a solution with accuracy ε at rate O(1/ε).

Lemma 3.1. (Teo et al., 2007) The minimum of \frac{1}{2} q x^2 - l x with q, l > 0 and x ∈ [0, 1] is bounded from above by -\frac{l}{2} \min(1, l/q).

Lemma 3.2. (Approximation gap decrease)

gap_{t-1} - gap_t \ge \min\Big( \frac{gap_{t-1}}{2}, \; \frac{\lambda \, gap_{t-1}^2}{8 G^2} \Big)   (19)

where the approximation gap is defined as gap_t = f(\bar{w}_t) − v_t, and G is an upper bound on the norm of the cutting plane direction parameters a_i.

Proof. The approximation problem (13) at iteration t can be rewritten as:

v_t = \min_{w, \xi} \frac{\lambda}{2}\|w\|^2 + \xi \quad \text{s.t.} \quad \xi \ge \langle a_j, w \rangle + b_j, \;\; j = 1..t   (20)

where ξ is a slack variable. The solution is given by a saddle point of the Lagrangian, which must be minimized w.r.t. the parameters (w, ξ) and maximized w.r.t. the Lagrange multipliers. One easily gets the dual form:

v_t = \max_{\alpha \in \mathbb{R}^t} D_t(\alpha) = -\frac{1}{2\lambda}\|A_t \alpha\|^2 + \langle B_t, \alpha \rangle \quad \text{s.t.} \quad \alpha_i \ge 0 \;\; \forall i, \quad \sum_{i=1}^{t} \alpha_i = 1   (21)

where A_t = [a_1; ...; a_t], B_t = [b_1; ...; b_t], and α stands for the vector of Lagrange multipliers (of length t at iteration t). Let α_t be the solution maximizing D_t(α). (We used solvers from the STPR toolbox, available at http://cmp.felk.cvut.cz/cmp/software/stprtool/.) The primal solution, obtained from the saddle point optimality conditions of Lagrange duality, is w_{t+1} = −A_t α_t / λ, with v_t = D_t(α_t).

Unfortunately, v_t cannot be computed explicitly since it is the result of a quadratic program. However, we may study the dual D_t along the segment between α^{start} = [α_{t−1}, 0] and α^{end} = [0, ..., 0, 1]. It is easy to verify that any point along this segment,

\alpha(\theta) = \theta \, \alpha^{start} + (1 - \theta) \, \alpha^{end}, \quad \theta \in [0, 1]   (22)

is feasible. Note also that D_t(α^{start}) = D_{t−1}(α_{t−1}) = v_{t−1}, which naturally implies v_t ≥ v_{t−1}. Substituting Eq. (22) into Eq. (21), noting that A_t = [A_{t−1}; a_t] and B_t = [B_{t−1}; b_t], and simplifying with w_t = −A_{t−1} α_{t−1} / λ and v_{t−1} = D_{t−1}(α_{t−1}) (since α_{t−1} solves the dual approximation problem at iteration t − 1), we get a quadratic form in (1 − θ):

D_t(\alpha(\theta)) = -\frac{1}{2} q (1 - \theta)^2 + l (1 - \theta) + v_{t-1}   (23)

with q = \frac{1}{\lambda}\|a_t + \lambda w_t\|^2 and l = \frac{\lambda}{2}\|w_t\|^2 + \langle a_t, w_t \rangle + b_t - v_{t-1}. Note that Eq. (15) implies l ≥ f(\bar{w}_t) − v_{t−1}.

We now consider two cases. If l ≤ 0 then, using v_{t−1} ≤ v_t, one gets f(\bar{w}_t) ≤ v_{t−1} ≤ v_t, which yields gap_t ≤ 0 ≤ gap_{t−1} − min(gap_{t−1}/2, λ gap²_{t−1}/(8G²)), assuming naturally that gap_{t−1} > ε > 0 since convergence was not reached at iteration t − 1 (otherwise the algorithm would have stopped). We never observed this singular case l ≤ 0 in our experiments, but its study is required to complete the proof.

The case l > 0 is not as simple, and we rely on Lemma 3.1 to bound the minimum of v_{t−1} − D_t(α(θ)) = \frac{1}{2} q (1-\theta)^2 - l (1-\theta):

\min_{\theta \in [0,1]} \; v_{t-1} - D_t(\alpha(\theta)) \le -\frac{l}{2} \min(1, l/q)

Next, since v_t ≥ D_t(α(θ)) for all θ ∈ [0, 1]:

-v_t \le -v_{t-1} - \frac{l}{2} \min(1, l/q)

Adding f(\bar{w}_t) to both sides and using l ≥ f(\bar{w}_t) − v_{t−1}:

f(\bar{w}_t) - v_t \le f(\bar{w}_t) - v_{t-1} - \frac{f(\bar{w}_t) - v_{t-1}}{2} \min\Big( 1, \frac{f(\bar{w}_t) - v_{t-1}}{q} \Big)   (24)

Now note that x \mapsto x - \frac{x}{2}\min(1, x/q) is monotonically increasing for any q > 0. Also, f(\bar{w}_t) is monotonically decreasing, so that f(\bar{w}_t) − v_{t−1} ≤ f(\bar{w}_{t−1}) − v_{t−1} = gap_{t−1}. Putting this together:

gap_t \le gap_{t-1} - \frac{gap_{t-1}}{2} \min(1, \; gap_{t-1}/q)   (25)

Finally, we show that q ≤ 4G²/λ. Indeed, q = \frac{1}{\lambda}\|a_t + \lambda w_t\|^2 = \frac{1}{\lambda}\|a_t - A_{t-1}\alpha_{t-1}\|^2, where ‖A_{t−1} α_{t−1}‖ ≤ G because ‖a_i‖ ≤ G for all i and \sum_i \alpha^i_{t-1} = 1. Hence ‖a_t − A_{t−1} α_{t−1}‖² ≤ 4G², so q ≤ 4G²/λ, and combining with (25) gives (19).
Theorem 3.1. Algorithm 1 reaches an approximation gap below ε after at most

T_0 + \frac{8 G^2}{\lambda \epsilon} - 2 \quad \text{steps, with} \quad T_0 = 2 \log_2 \frac{\lambda \|w_1 + a_1/\lambda\|}{G} - 2,

hence it converges at rate O(1/ε).

Proof. Consider the two quantities occurring in Eq. (19), gap_{t−1}/2 and λ gap²_{t−1}/(8G²). We first show that the situation where gap_{t−1}/2 > λ gap²_{t−1}/(8G²) (i.e. gap_{t−1} > 4G²/λ) may only happen during a finite number of iterations T_0. Indeed, if gap_{t−1} > 4G²/λ, Lemma 3.2 shows that gap_t ≤ gap_{t−1}/2, so the gap is at least halved at every such iteration. Hence gap_{t−1} > 4G²/λ may arise for at most T_0 = \log_2(\lambda \, gap_1 / 4G^2) + 1 iterations. Since gap_1 = \frac{\lambda}{2}\|w_1 + a_1/\lambda\|^2 (it may be obtained analytically, the approximation function at the first iteration being quadratic), we get T_0 = 2 \log_2 \frac{\lambda \|w_1 + a_1/\lambda\|}{G} - 2.

After at most T_0 iterations, the gap decrease obeys gap_t − gap_{t−1} ≤ −λ gap²_{t−1}/(8G²) ≤ 0. To estimate the number of iterations required to reach gap_t ≤ ε, we follow an idea from (Teo et al., 2007) and introduce a function u(t) that upper bounds gap_t. Solving the differential equation u'(t) = -\frac{\lambda}{8 G^2} u^2(t) with boundary condition u(T_0) = 4G²/λ gives

u(t) = \frac{8 G^2}{\lambda (t + 2 - T_0)} \ge gap_t \quad \forall t \ge T_0   (26)

Solving u(t) ≤ ε gives t ≥ 8G²/(λε) + T_0 − 2; the solution is thus reached with accuracy ε within T_0 + 8G²/(λε) − 2 iterations.
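A quick way to get intuition for this bound is to iterate the worst-case recursion of Lemma 3.2 numerically. The snippet below is a toy check under assumed values of λ, G and gap_1 (not an experiment from the paper); the iteration count it prints grows roughly like 1/ε, as Theorem 3.1 predicts.

```python
# Toy numerical check of the gap recursion of Lemma 3.2:
# iterate the worst-case bound  gap_t = gap_{t-1} - min(gap_{t-1}/2, lam*gap_{t-1}**2/(8*G**2))
# and count the iterations needed to reach a target accuracy eps.
def iterations_to_gap(gap1, lam, G, eps):
    gap, t = gap1, 1
    while gap > eps:
        gap -= min(gap / 2.0, lam * gap * gap / (8.0 * G * G))
        t += 1
    return t

if __name__ == "__main__":
    lam, G = 0.1, 1.0            # assumed values, for illustration only
    for eps in (1e-1, 1e-2, 1e-3):
        # Theorem 3.1 predicts roughly T0 + 8*G**2/(lam*eps) iterations in the worst case
        print(eps, iterations_to_gap(gap1=100.0, lam=lam, G=G, eps=eps))
```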
4. Experiments

We provide experimental results on speech recognition and on on-line handwritten digit recognition, and analyze experimentally the convergence behavior of our method.

Automatic Speech Recognition. We performed experiments on the TIMIT database with the standard split into training, development and test data. The signal was preprocessed using the procedure described in (Sha & Saul, 2007). There are 3696 utterances and over 1 million frames in the training set. A left-right HMM with one to three states and Gaussian mixture probability densities was built for each of the 48 phonetic classes. We followed standard conventions in mapping the 48 phonetic labels down to 39 broader phone categories, and error rates were computed as the sum of substitution, deletion, and insertion error rates from the alignment process.

We naturally compared our algorithms with a non-discriminant system (MLE), trained with the HTK toolkit. In addition, this MLE system is used during the training of the discriminant systems, both for initialization and for regularization. Actually, we used the regularization term (λ/2)‖w − w_MLE‖², which experimentally performs slightly better than (λ/2)‖w‖². Moreover, using ‖w − w_MLE‖² as regularization term leads to a larger optimal value of λ than using ‖w‖², which considerably reduces the number of training steps (and also the number of cutting planes required to approximate the objective function well).

We implemented two variants of our method (non-convex optimization, NCO): one uses the hard max (NCO-H) and the other (NCO-S) uses the softmax version over all possible labellings (see Eq. (4), (9)); the latter is implemented with a forward-backward procedure. We used the Hamming distance for Δ(y^i, y). Experimentally, NCO-S is about 10 times slower than NCO-H, which is itself about 2 times slower than MLE training. Indeed, the learning cost mainly decomposes into two terms, computing frame probabilities and dynamic programming. In experimental settings such as speech recognition, the first term dominates and is similar for NCO and MLE methods when training sentences include many different phones. Besides, to reach a gap below 1%, NCO-H requires about twice as many iterations as MLE requires to converge. Overall, NCO-H training is about 2 times slower than MLE.
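The MLE-centered regularizer mentioned above fits the framework of Section 3 through a simple change of variable: minimizing (λ/2)‖w − w_MLE‖² + R(w) is the same as minimizing (λ/2)‖v‖² + R(v + w_MLE) with v = w − w_MLE. The sketch below illustrates this under the same hypothetical risk_and_subgrad helper used earlier; it is our illustration, not the authors' code.

```python
import numpy as np

def make_shifted_risk(risk_and_subgrad, w_mle):
    """Regularizer swap: run the bundle machinery of Section 3 on v = w - w_mle.
    risk_and_subgrad(w) -> (R(w), subgradient) is the hypothetical helper used earlier."""
    def shifted(v):
        R, a = risk_and_subgrad(v + w_mle)   # risk evaluated at w = v + w_mle
        return R, a                          # d/dv R(v + w_mle) equals d/dw R(w)
    return shifted

# usage sketch:
#   v_star = non_convex_cutting_plane(np.zeros_like(w_mle),
#                                     make_shifted_risk(risk_and_subgrad, w_mle),
#                                     lam=0.1, eps=1e-3)
#   w_star = w_mle + v_star
```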
We compared our method to three competitive discriminant methods: the large margin convex formulation of (Sha & Saul, 2007) (named Oracle in the following), and two benchmark discriminant methods, Conditional Maximum Likelihood (CML) and Minimum Classification Error (MCE). Table 1 shows the phone error rates of all these methods for one-state phone HMMs. Note that the Oracle, CML and MCE results are taken from (Sha, 2006) and correspond to the same experimental setting. These results clearly show, first, that discriminant approaches significantly outperform MLE training, and second, that large margin approaches (NCO and Oracle) significantly outperform the two other discriminant methods. Note also that the two variants of our method, NCO-H and NCO-S, perform similarly. Since NCO-H is much faster, we report only NCO-H results in the following.

Table 1. Phone error rates (%) with single-state phone HMMs (N = 1) and mixtures of M Gaussians per state.

  M        1      2      4      8
  MLE      44.75  39.54  36.06  34.46
  NCO-H    31.44  29.70  29.13  28.29
  NCO-S    31.02  30.21  29.30  29.11
  Oracle   31.2   30.8   29.8   28.2
  CML      36.4   34.6   32.8   31.5
  MCE      35.6   34.5   32.4   30.9

Table 2 shows results with several states per left-right phone HMM, for the two most efficient techniques (NCO-H and Oracle) only. Note that (Sha, 2006) only reports results for 3-state HMMs with a small number of Gaussians. As may be seen in these experiments, the Oracle method is not able to exploit the increasing complexity of the models, while our method can take advantage of the number of states to reach lower error rates. We believe this success comes from the original non-convex formulation.

Table 2. Phone error rates (%) with multi-state phone HMMs.

  N         M   MLE    NCO-H   Oracle (Sha, 2006)
  2 states  1   38.21  29.57   not available
  2 states  2   34.14  27.99   NA
  2 states  4   32.00  27.67   NA
  2 states  8   31.25  27.58   NA
  3 states  1   36.70  28.70   37.8
  3 states  2   31.92  27.93   32.6
  3 states  4   30.28  27.40   NA
  3 states  8   29.55  27.61   NA

On-line Handwriting Recognition. On-line handwriting signals are temporal sequences of the positions of an electronic pen, captured through a digital tablet. We used a part of the Unipen international database with a total of 15k digit samples: 5k samples for training and 10k for testing. We trained a five-state left-right CDHMM for each digit. Table 3 reports the classification error rates of three systems, namely MLE, the Oracle method and NCO-H. Again, our method reaches the best results whatever the number M of Gaussians in the mixtures. NCO-H significantly outperforms the Oracle-based method, showing that our algorithm has here too been able to learn efficiently from partially labeled training samples.

Table 3. Handwritten digit recognition error rates (%) (N = 5).

  M   MLE    Oracle  NCO-H
  1   13.50  2.01    1.53
  2   10.50  1.70    1.33
  3   9.48   1.82    1.53
  4   15.02  1.74    1.56

Convergence. Finally, we analyze experimentally the convergence rate of our algorithm, on the speech recognition experiments. In these experiments, learning is performed until the approximation gap becomes less than 1% of the objective function, which is enough to reach an optimal error rate. Figure 2a plots the evolution of the error rate on the development set and on the test set against the learning iteration number; the error rate remains stable after about 50 iterations. Figure 2b shows the evolution of the approximation gap (in log scale) as a function of the iteration number, for different values of the regularization parameter λ. For clarity, we plot a normalized gap, computed by dividing the gap by the number of frames in the training set. Note that the convergence rate depends directly on the value of λ, as observed in (Teo et al., 2007), which is not the case for traditional bundle methods; this comes naturally from the use of a regularization term. More importantly, the figure shows that the observed convergence rate is actually closer to O(log(1/ε)) than to our proven O(1/ε) bound.

[Figure 2. Training LMHMM with 3 states and 4 Gaussians for speech recognition: (a) error rate vs. iteration; (b) normalized approximation gap vs. iteration.]

5. Conclusion

We described a new method for maximum margin learning of CDHMMs that allows learning with partially labeled training sets, which is still an open problem. We showed how this learning problem may be cast as a non-convex optimization problem, for which we proposed a method based on bundle methods and cutting planes. We provided a convergence proof and reported experimental results on speech and handwritten digit recognition showing improved results over state-of-the-art algorithms.
References

Hiriart-Urruty, J., & Lemarechal, C. (1993). Convex analysis and minimization algorithms, I and II. Springer-Verlag.

Jiang, H., & Li, X. (2007). Incorporating training errors for large margin HMMs under semi-definite programming framework. Proc. Intl. Conf. on Acoustics, Speech and Signal Processing, IV-629-632.

Joachims, T. (2006). Training linear SVMs in linear time. ACM International Conference On Knowledge Discovery and Data Mining (pp. 217-226).

Juang, B., & Katagiri, S. (1992). Discriminative learning for minimum error classification. IEEE Trans. Signal Processing, Vol. 40, No. 12 (pp. 3043-3054).

Kiwiel, K. C. (1985). Methods of descent for nondifferentiable optimization. Springer-Verlag.

Sha, F. (2006). Large margin training of acoustic models for speech recognition. PhD thesis.

Sha, F., & Saul, L. K. (2007). Large margin hidden Markov models for automatic speech recognition. Neural Information Processing Systems (pp. 1249-1256). Cambridge, MA: MIT Press.

Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin Markov networks. Neural Information Processing Systems (pp. 25-32). Cambridge, MA.

Teo, C. H., Le, Q. V., Smola, A., & Vishwanathan, S. V. (2007). A scalable modular convex solver for regularized risk minimization. ACM SIGKDD International Conference On Knowledge Discovery and Data Mining (pp. 727-736). New York, USA.

Tsochantaridis, I., Hofmann, T., Joachims, T., & Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. Proc. Intl. Conf. on Machine Learning (pp. 104-112).

Woodland, P., & Povey, D. (2002). Large scale discriminative training of hidden Markov models for speech recognition. Computer Speech and Language, 16, 25-47.

Yu, D., & Deng, L. (2007). Large-margin discriminative training of hidden Markov models for speech recognition. Proceedings of the International Conference on Semantic Computing (pp. 429-438).