Discriminative Keyword Selection Using Support Vector Machines

W. M. Campbell, F. S. Richardson
MIT Lincoln Laboratory
Lexington, MA 02420
wcampbell,frichard@ll.mit.edu

This work was sponsored by the Department of Homeland Security under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

Abstract

Many tasks in speech processing involve classification of long-term characteristics of a speech segment such as language, speaker, dialect, or topic. A natural technique for determining these characteristics is to first convert the input speech into a sequence of tokens such as words, phones, etc. From these tokens, we can then look for distinctive phrases, keywords, that characterize the speech. In many applications, a set of distinctive keywords may not be known a priori. In this case, an automatic method of building up keywords from short context units such as phones is desirable. We propose a method for construction of keywords based upon support vector machines. We cast the problem of keyword selection as a feature selection problem for n-grams of phones. We propose an alternating filter-wrapper method that builds successively longer keywords. Application of this method to language recognition and topic recognition tasks shows that the technique produces interesting and significant qualitative and quantitative results.

1 Introduction

A common problem in speech processing is to identify properties of a speech segment such as the language, speaker, topic, or dialect. A typical solution to this problem is to apply a detection paradigm: a set of classifiers is applied to a speech segment to produce a decision. For instance, for language recognition, we might construct detectors for English, French, and Spanish. The maximum-scoring detector on a speech segment would give the predicted language.

Two basic categories of systems have been applied to the detection problem. A first approach uses short-term spectral characteristics of the speech and models these with Gaussian mixture models (GMMs) or support vector machines (SVMs) to directly produce a decision. Although quite accurate, this type of system produces only a classification decision with no qualitative interpretation. A second approach uses high-level features of the speech such as phones and words to detect the properties. The advantage of this approach is that, in some instances, we can explain why we made a decision; for example, a particular phone or word sequence might indicate the topic. We adopt this latter approach in this paper.

SVMs have become a common method of extracting properties of high-level sequences of speech tokens [1, 2, 3]. Sequence kernels are constructed by viewing a speech segment as a document of tokens. The SVM feature space in this case is a scaling of co-occurrence probabilities of tokens in an utterance. This technique is analogous to methods for applying SVMs to text classification [4].

SVMs have been applied as detectors at many linguistic levels of tokens. Our focus in this paper is at the acoustic phone level. Our goal is to automatically derive long sequences of phones, which we call keywords, that are characteristic of a given class. Prior work, for example in language recognition [5], has shown that certain words are a significant predictor of a language.
For instance, the presence of the phrase "you know" in a conversational speech segment is a strong indicator of English. A difficulty in using words as the indicator of the language is that a speech-to-text (STT) system may not be available in all languages of interest. In this case, we would like to automatically construct keywords that are indicative of the language. Note that a similar problem can occur in other property extraction problems; for instance, in topic recognition, proper names not in the STT system dictionary may be a strong indicator of topic.

Our basic approach is to view keyword construction as a feature selection problem. Keywords are composed of sequences of phones of length n, i.e., n-grams. We would like to find the set of n-grams that best discriminates between classes. Unfortunately, this problem is difficult to solve directly, since the number of unique n-grams grows exponentially with increasing n. To alleviate this difficulty, we propose a method that starts with lower-order n-grams and successively builds higher-order n-grams.

The outline of the paper is as follows. In Section 2.1, we review the basic architecture that we use for phone recognition and how it is applied to the problem. In Section 2.2, we review the application of SVMs to determining properties. Section 3.1 describes a feature selection method for SVMs. Section 3.2 presents our method for constructing long context units of phones to automatically create keywords; we use a novel feature selection approach that attempts to find longer strings that discriminate well between classes. Finally, in Section 4, we show the application of our method to language and topic recognition problems. We show qualitatively that the method produces interesting keywords. Quantitatively, we show that the method produces keywords which are good discriminators between classes.

2 Phonotactic Classification

2.1 Phone Recognition

The high-level token extraction component of our system is a phone recognizer based upon the Brno University of Technology (BUT) design [6]. The basic architecture of this system is a monophone HMM system with a null grammar; monophones are modeled by three states. The system uses two powerful components to achieve high accuracy. First, long time-span time-frequency features (TRAPS) provide contextual cues for modeling monophones. Second, the BUT recognizer extensively uses discriminatively trained feedforward artificial neural networks (ANNs) to model HMM state posterior probabilities.

We have constructed a phone recognizer for English units using the BUT architecture and automatically generated STT transcripts on the Switchboard 2 Cell corpora [7]. Training data consisted of approximately 10 hours of speech. ANN training was accomplished using the ICSI QuickNet package [8]. The resulting system has 49 monophones including silence.

The BUT recognizer is used along with the HTK HMM toolkit [9] to produce lattices. Lattices encode multiple hypotheses with acoustic likelihoods. From a lattice, a 1-best (Viterbi) output can be produced. Alternatively, we use the lattice to produce expected counts of tokens and n-grams of tokens.

Expected counts of n-grams can be easily understood as an extension of standard counts. Suppose we have one hypothesized string of tokens, W = w_1, \ldots, w_n. Then the 2-grams are created by grouping two tokens at a time to form W_2 = w_1\_w_2, w_2\_w_3, \ldots, w_{n-1}\_w_n. Higher-order n-grams are formed by juxtaposing longer spans of tokens. The count function for a given bigram, d_i, is count(d_i | W_2), the number of occurrences of d_i in the sequence W_2.
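As a concrete illustration of the juxtaposition and counting just described, the short Python sketch below (ours, for exposition only; the token values are hypothetical) builds the bigram sequence W_2 from a 1-best token string and tallies count(d_i | W_2).

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Juxtapose n consecutive tokens at a time, joined with '_',
    e.g. [w1, w2, w3] with n=2 -> ['w1_w2', 'w2_w3']."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical 1-best phone string W and its bigram counts count(d_i | W_2).
W = ["Y", "UW", "N", "OW", "Y", "UW"]
W2 = make_ngrams(W, 2)        # ['Y_UW', 'UW_N', 'N_OW', 'OW_Y', 'Y_UW']
counts = Counter(W2)          # count(d_i | W_2) for each observed bigram d_i
print(counts["Y_UW"])         # -> 2
```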
To extend counts to a lattice, L, we find the expected count over all possible hypotheses in the lattice,

    E_W[\mathrm{count}(d_i | W)] = \sum_{W \in L} p(W | L) \, \mathrm{count}(d_i | W).    (1)

The expected counts can be computed efficiently by a forward-backward algorithm; more details can be found in Section 3.3 and [10].

A useful extension of expected counts is to find the probability of an n-gram in a lattice. For a single hypothesis, W, we can construct a joint probability using counts of n-grams,

    p(w_i, \hat{w}_i | W) = \frac{\mathrm{count}(w_i, \hat{w}_i | W)}{\sum_j \mathrm{count}(w_j, \hat{w}_j | W)}, \qquad \hat{w}_i = w_{i-(n-1)}, \ldots, w_{i-1},    (2)

where the sum in (2) is performed over all unique n-grams in the utterance, and count(w_i, \hat{w}_i | W) is the number of times the n-gram \hat{w}_i w_i occurs in the sequence W. We have introduced the notation \hat{w}_i to denote the history (or context) of w_i in the sequence W. Extending (2) to lattices is trivial, since we can replace counts by expected counts.

2.2 Discriminative Language Modeling: SVMs

Token-based class recognition using SVMs can be performed in several ways [3, 11, 12]. We focus on an approach which is similar to [2, 12]. In [2], a token sequence, W = w_1, \ldots, w_N, is modeled using a bag-of-n-grams approach. For a sequence of tokens, (joint) probabilities of the unique n-grams, \hat{w}_j w_j, are calculated on a per-conversation basis, p(w_j, \hat{w}_j | W). As an example, for words, a typical bigram might be \hat{w}_2 w_2 = the_example. Then, the probabilities are mapped to a sparse vector with entries

    D_j \, p(w_j, \hat{w}_j | W).    (3)

The selection of the weighting, D_j, in (3) is critical for good performance. A typical choice is something of the form

    D_j = \min\left( C_j, \; g_j\left( \frac{1}{p(w_j, \hat{w}_j | \mathrm{all})} \right) \right),    (4)

where g_j(\cdot) is a function which squashes the dynamic range, and C_j is a constant. The probability p(w_j, \hat{w}_j | all) in (4) is calculated from the observed probability across all classes. The squashing function should monotonically map the interval [1, \infty) to itself to suppress large inverse probabilities. Typical choices for g_j are g_j(x) = \sqrt{x} and g_j(x) = \log(x) + 1. In both cases, the squashing function g_j normalizes out the typicality of a feature across all classes. The constant C_j limits the effect of any one feature on the kernel inner product; if we set C_j = 1, then D_j = 1 for all j.

The general weighting of probabilities is then combined to form a kernel between two token sequences; see [2] for more details. For two token sequences, W and V, the kernel is

    K(W, V) = \sum_j D_j^2 \, p(w_j, \hat{w}_j | W) \, p(w_j, \hat{w}_j | V).    (5)

Intuitively, the kernel in (5) says that if the same n-grams are present in two sequences and the normalized frequencies are similar, there will be a high degree of similarity (a large inner product). If n-grams are not present, then this will reduce similarity, since one of the probabilities in (5) will be zero. The normalization D_j ensures that n-grams with large probabilities do not dominate the kernel function. The kernel can alternatively be viewed as a linearization of the log-likelihood ratio [2].

Incorporating the kernel (5) into an SVM system is straightforward. SVM training and scoring require only a method of kernel evaluation between two objects that produces positive definite kernel matrices (the Mercer condition). We use the package SVMTorch [13]. Training is performed with a one-versus-all strategy: for each target class, we group all remaining class data and then train with these two classes.
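A minimal sketch of the mapping in (3)-(5), assuming g_j(x) = \sqrt{x} and hypothetical probability values; this is an illustration of the weighting and kernel described above, not the SVMTorch-based implementation used in our experiments.

```python
import numpy as np

def weights(p_all, C=1000.0):
    """D_j = min(C_j, g_j(1 / p(w_j, w^_j | all))) as in (4), with g_j = sqrt."""
    return np.minimum(C, np.sqrt(1.0 / p_all))

def feature_vector(p_utt, p_all, C=1000.0):
    """Vector entries D_j * p(w_j, w^_j | W) as in (3)."""
    return weights(p_all, C) * p_utt

def kernel(p_W, p_V, p_all, C=1000.0):
    """K(W, V) = sum_j D_j^2 p(w_j, w^_j | W) p(w_j, w^_j | V) as in (5):
    the inner product of the two weighted vectors."""
    return float(np.dot(feature_vector(p_W, p_all, C),
                        feature_vector(p_V, p_all, C)))

# Hypothetical per-utterance and all-class n-gram probabilities (4-entry vocabulary).
p_all = np.array([0.4, 0.3, 0.2, 0.1])
p_W   = np.array([0.5, 0.25, 0.25, 0.0])
p_V   = np.array([0.4, 0.4, 0.0, 0.2])
print(kernel(p_W, p_V, p_all))
```

In the actual system the vectors are sparse and indexed by the unique n-grams observed in training; the small dense arrays above are used only for readability.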
Extending (5) to lattices is straightforward: rather than computing the probability p(w_j, \hat{w}_j | W) using counts from the 1-best sequence, expected counts from the lattice can be used.

3 Discriminative Keyword Selection

3.1 SVM Feature Selection

A first step towards an algorithm for automatic keyword generation using phones is to examine feature selection methods. Ideally, we would like to select, over all possible n-grams with n varying, the most discriminative sequences for determining a property of a speech segment. The number of features in this case is prohibitive, since it grows exponentially with n. Therefore, we have to consider alternate methods. As a first step, we examine feature selection for fixed n and look for keywords with n or fewer phones.

Suppose that we have a set of candidate keywords. Since we are already using an SVM, a natural algorithm for discriminative feature selection in this case is to use a wrapper method [14]. Suppose that the optimized SVM solution is

    f(X) = \sum_i \alpha_i K(X, X_i) + c    (6)

and

    w = \sum_i \alpha_i b(X_i),    (7)

where b(X_i) is the vector of weighted n-gram probabilities in (3). We note that the kernel given in (5) is linear. Also, the n-gram probabilities have been normalized in (3) by their probability across the entire data set. Intuitively, because of this normalization and since f(X) = w^t b(X) + c, large-magnitude entries in w correspond to significant features. A confirmation of this intuitive idea is the algorithm of Guyon et al. [15]. Guyon proposes an iterative wrapper method for feature selection for SVMs which has these basic steps:

· For a set of features, S, find the SVM solution with model w.
· Rank the features by their corresponding squared model entries, w_i^2, where w_i is the ith entry of w in (7).
· Eliminate low-ranking features.

The algorithm may be iterated multiple times.

Guyon's algorithm for feature selection can be used for picking significant n-grams as keywords. We can create a kernel which is the sum of kernels as in (5) up to the desired n. We then train an SVM and rank n-grams according to the magnitude of the entries in the SVM model vector, w.

As an example, we have looked at this feature selection method for a language recognition task with trigrams (to be described in Section 4). As motivation for the applicability of Guyon's feature selection method, see Figure 1. The figure shows two functions. First, the CDF of the SVM model values, |w_i|, is shown; it has an S-curve shape, indicating that there is only a small set of large model weights. The second curve shows the equal error rate (EER) of the task as a function of applying one iteration of the Guyon algorithm and retraining the SVM. EER is defined as the value where the miss and false alarm rates are equal. All features with |w_i| below the value on the x-axis are discarded in the first iteration. From the figure, we see that only a small fraction (< 5%) of the features is needed to obtain good error rates. This interesting result provides motivation that a small subset of keywords is significant to the task.

Figure 1: Feature selection for a trigram language recognition task using Guyon's method. (The plot shows the CDF of |w_i| and the equal error rate as functions of the threshold on |w_i|.)

3.2 Keywords via an alternating wrapper/filter method

The algorithm in Section 3.1 gives a method for n-gram selection for fixed n. Now, suppose we want to find keywords for arbitrary n. One possible hypothesis for keyword selection is that since higher-order n-grams are discriminative, the lower-order n-grams they contain will also be discriminative. Therefore, it makes sense to find distinguishing lower-order n-grams first and then construct longer units from them.
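The wrapper step of the algorithm below reuses the ranking-and-pruning procedure of Section 3.1. As a rough sketch of that step (our illustration, using scikit-learn's LinearSVC as a stand-in for the SVMTorch setup described above, with hypothetical toy data):

```python
import numpy as np
from sklearn.svm import LinearSVC

def wrapper_select(X, y, keep_fraction=0.05):
    """One Guyon-style iteration: train a linear SVM, rank features by w_i**2,
    and keep only the top-ranked ones.
    X: (num_utterances, num_ngrams) matrix of weighted probabilities as in (3);
    y: +1 for the target class, -1 for the pooled remaining classes."""
    svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
    scores = svm.coef_.ravel() ** 2                  # squared model entries w_i^2
    n_keep = max(1, int(keep_fraction * X.shape[1]))
    return np.argsort(scores)[::-1][:n_keep]         # indices of retained n-grams

# Toy example: 40 utterances, 200 candidate n-grams, arbitrary labels.
rng = np.random.default_rng(0)
X = rng.random((40, 200))
y = np.where(np.arange(40) % 2 == 0, 1, -1)
print(len(wrapper_select(X, y)), "n-grams retained")
```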
On the basis of this idea, we propose the following algorithm for keyword construction:

Keyword Building Algorithm
· Start with an initial value of n = n_s. Initialize the set, S_n, to all possible n-grams of phones including lower-order grams. By default, let S_1 be the set of all phones.
· (Wrapper Step) General n: apply the feature selection algorithm in Section 3.1 to produce a subset of distinguishing n-grams, S_n' \subseteq S_n.
· (Filter Step) Construct a new set of (n+1)-grams by juxtaposing elements from S_n' with phones. Nominally, we take this step to be juxtaposition on the right and left, S_{n+1} = \{ dp, \; qd \mid d \in S_n', \; p \in S_1, \; q \in S_1 \}.
· Iterate to the wrapper step.
· Output: S_n' at some stopping n.

A few items should be noted about the proposed keyword building algorithm. First, we call the second feature selection process a filter step, since induction has not been applied to the (n+1)-gram features. Second, note that the purpose of the filter step is to provide a candidate set of possible (n+1)-grams which can then be more systematically reduced. Third, several potential algorithms exist for the filter step. In our experiments and in the algorithm description, we nominally append one phone to the beginning and end of an n-gram. Another possibility is to try to combine overlapping n-grams. For instance, suppose the keyword is some_people, which has phone transcript s_ah_m_p_iy_p_l. Then, if we are looking at 4-grams, we might see as top features s_ah_m_p and p_iy_p_l and combine these to produce a new keyword.

3.3 Keyword Implementation

The expected N-gram counts were computed on the lattice using the forward-backward algorithm. Equation (8) gives the posterior probability of a connected sequence of arcs in the lattice, where src_nd(a) and dst_nd(a) are the source and destination nodes of arc a, \gamma(a) is the likelihood associated with arc a, \alpha(n) and \beta(n) are the forward and backward probabilities of reaching node n from the beginning or end of the lattice L respectively, and \gamma(L) is the total likelihood of the lattice (the \alpha(\cdot) of the final node or the \beta(\cdot) of the initial node of the lattice):

    p(a_j, \ldots, a_{j+N}) = \frac{\alpha(\mathrm{src\_nd}(a_j)) \, \gamma(a_j) \cdots \gamma(a_{j+N}) \, \beta(\mathrm{dst\_nd}(a_{j+N}))}{\gamma(L)}.    (8)

Now, if we define the posterior probability of a node n as p(n) = \alpha(n) \beta(n) / \gamma(L), and similarly the posterior probability of an arc as p(a) = \alpha(\mathrm{src\_nd}(a)) \, \gamma(a) \, \beta(\mathrm{dst\_nd}(a)) / \gamma(L), then equation (8) becomes

    p(a_j, \ldots, a_{j+N}) = \frac{p(a_j) \cdots p(a_{j+N})}{p(\mathrm{src\_nd}(a_{j+1})) \cdots p(\mathrm{src\_nd}(a_{j+N}))}.    (9)

Equation (9) is attractive because it provides a way of computing the path posteriors locally, using only the individual arc and node posteriors along the path. We use this computation along with a trie structure [16] to compute the posteriors of our keywords.
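As a small illustration of (9), the sketch below (ours; the posterior values are hypothetical and assumed to come from a forward-backward pass over the lattice) computes the posterior of a keyword's arc path from the arc and node posteriors along that path.

```python
import math

def path_posterior(arc_posteriors, interior_node_posteriors):
    """Equation (9): the product of the arc posteriors p(a_j), ..., p(a_{j+N})
    divided by the posteriors of the interior nodes joining consecutive arcs,
    p(src_nd(a_{j+1})), ..., p(src_nd(a_{j+N}))."""
    return math.prod(arc_posteriors) / math.prod(interior_node_posteriors)

# Hypothetical 3-arc path (a 3-phone keyword) with its two interior nodes.
print(path_posterior([0.9, 0.8, 0.7], [0.95, 0.85]))   # ~0.624
```

Summing such path posteriors over all occurrences of a keyword's phone sequence in the lattice gives the expected count of Section 2.1.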
4 Experiments

4.1 Language Recognition Experimental Setup

The phone recognizer described in Section 2.1 was used to generate lattices for a training and an evaluation data set. The training data set consists of more than 360 hours of telephone speech spanning 13 different languages and coming from a variety of sources including CallHome, CallFriend, and Fisher. The evaluation data set is the NIST 2005 Language Recognition Evaluation data, consisting of roughly 20,000 utterances (with durations of 30, 10, or 3 seconds depending on the task) coming from three collection sources: CallFriend, Mixer, and OHSU. We evaluated our system on the 30 and 10 second tasks under the NIST 2005 closed condition, which limits the evaluation data to 7 languages (English, Hindi, Japanese, Korean, Mandarin, Spanish, and Tamil) coming only from the OHSU data source.

The training and evaluation data was segmented using an automatic speech activity detector, and segments shorter than 0.5 seconds were discarded. We also sub-segmented long audio files in the training data to keep the duration of each utterance to around 5 minutes (a shorter duration would have created too many training utterances). Lattice arcs with posterior probabilities lower than 10^{-6} were removed, and lattice expected counts smaller than 10^{-3} were ignored. The top and bottom 600 ranking keywords for each language were selected after each training iteration. The support vector machine was trained using a kernel formulation which requires pre-computing all of the kernel distances between the data points and using an alternate kernel which simply indexes into the resulting distance matrix (this approach becomes difficult when the number of data points is too large).

4.2 Language Recognition Results (Qualitative and Quantitative)

To get a sense of how well our feature selection approach was working, we looked at the top-ranking keywords from the English model only (since our phone recognizer is trained using the English phone set). Table 1 summarizes a few of the more compelling phone 5-grams, along with a possible keyword corresponding to each one. Not surprisingly, we also noticed that the list of top-ranking N-grams contained many variations of, or partial N-gram matches to, the same keyword, as well as N-grams that did not correspond to any apparent keyword.

Table 1: Top ranking keywords for 5-gram SVM for English language recognition model

  Phones              Rank   Keyword
  SIL_Y_UW_N_OW        1     you know
  !NULL_SIL_Y_EH_AX    3     yeah
  !NULL_SIL_IY_M_TH    4     ???
  P_IY_P_AX_L          6     people
  R_IY_L_IY_SIL        7     really
  Y_UW_N_OW_OW         8     you know (var)
  T_L_AY_K_SIL        17     ? like
  L_AY_K_K_SIL        23     like (var)
  R_AY_T_SIL_!NULL    27     right
  HH_AE_V_AX_N        29     have an
  !NULL_SIL_W_EH_L    37     well

The equal error rates for our system on the NIST 2005 language recognition evaluation are summarized in Table 2. The 4-gram system gave a relative improvement of 12% on the 10 second task and 9% on the 30 second task, but despite the compelling keywords produced by the 5-gram system, its performance actually degraded significantly compared to the 3-gram and 4-gram systems.

Table 2: %EER for 10 and 30 second NIST language recognition tasks

  N    10sec   30sec
  1    25.3    18.3
  2    16.5     7.4
  3    11.3     4.3
  4    10.0     3.9
  5    13.6     5.6

4.3 Topic Recognition Experimental Setup

Topic recognition was performed using a subset of the phase I Fisher corpus (English) from the LDC. This corpus consists of 5851 telephone conversations. Participants were given instructions to discuss a topic, chosen from 40 different possible topics, for 10 minutes. Topics included "Education", "Hobbies", "Foreign Relations", etc. Prompts were used to elicit discussion on the topics. An example prompt is:

  Movies: Do each of you enjoy going to the movies in a theater, or would you rather rent a movie and stay home? What was the last movie that you saw? Was it good or bad and why?
For our experiments, we used 2750 conversation sides for training. We also constructed development and test sets of 1372 conversation sides each. The training set was used to find keywords and models for topic detection.

4.4 Topic Recognition Results

We first looked at the top-ranking keywords for several topics; some results are shown in Table 3. We can see that many keywords show a strong correspondence with the topic. Also, there are partial keywords which correspond to what appear to be longer keywords; e.g., "eh_t_s_ih_k" corresponds to get sick. As in the language recognition task, we used EER as the performance measure. Results in Table 4 show the performance for several N-gram orders. Performance improves going from 3-grams to 4-grams, but, as with the language recognition task, we see a degradation in performance for 5-grams.

Table 3: Top keyword for 5-gram SVM in Topic Recognition

  Topic                                                Phones         Keyword
  Professional Sports on TV                            S_P_AO_R_T     sport
  Hypothetical: Time Travel                            G_OW_B_AE_K    go back
  Affirmative Action                                   AX_V_AE_K_CH   affirmat[ive ac]tion
  US Public Schools                                    S_K_UW_L_Z     schools
  Movies                                               IY_V_IY_D_IY   DVD
  Hobbies                                              HH_OH_B_IY_Z   hobbies
  September 11                                         HH_AE_P_AX_N   happen
  Issues in the Middle East                            IH_Z_R_IY_L    Israel
  Illness                                              EH_T_S_IH_K    g[et sick]
  Hypothetical: One Million Dollars to leave the US    Y_UW_M_AY_Y    you may

Table 4: Performance of Topic Detection for Different N-gram orders

  N-gram order   EER (%)
  3              10.22
  4               8.95
  5               9.40

5 Conclusions and future work

We presented a method for automatic construction of keywords given a discriminative speech classification task. Our method was based upon successively building longer-span keywords from shorter-span keywords, using phones as a fundamental unit. The problem was cast as a feature selection problem, and an alternating filter and wrapper method was proposed. Results showed that reasonable keywords and improved performance could be achieved using this methodology.

Numerous possibilities exist for future work on this task. First, extension and experimentation on other tasks such as dialect and speaker recognition would be interesting; the method has the potential for discovery of new, interesting characteristics. Second, comparison of this method with other feature selection methods may be appropriate [17]. A third area for extension is various technical improvements. For instance, we might want to consider more general keyword models where skips are allowed (or more general finite-state transducers [18]). Also, alternate methods for the filter step for constructing higher-order n-grams are a good area for exploration.

References

[1] A. Stolcke, L. Ferrer, S. Kajarekar, E. Shriberg, and A. Venkataraman, "MLLR transforms as features in speaker recognition," in Proc. Interspeech, 2005, pp. 2425–2428.
[2] W. M. Campbell, J. P. Campbell, D. A. Reynolds, D. A. Jones, and T. R. Leek, "Phonetic speaker recognition with support vector machines," in Advances in Neural Information Processing Systems 16, Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, Eds., MIT Press, Cambridge, MA, 2004.
[3] W. M. Campbell, T. Gleason, J. Navratil, D. Reynolds, W. Shen, E. Singer, and P. Torres-Carrasquillo, "Advanced language recognition using cepstra and phonotactics: MITLL system performance on the NIST 2005 language recognition evaluation," in Proc. IEEE Odyssey, 2006.
[4] T. Joachims, Learning to Classify Text Using Support Vector Machines, Kluwer Academic Publishers, 2002.
Reynolds, "Language recognition with word lattices and support vector machines," in Proceedings of ICASSP, 2007, pp. IV­989 ­ IV­992. [6] Petr Schwarz, Matejka Pavel, and Jan Cernocky, "Hierarchical structures of neural networks for phoneme recognition," in Proceedings of ICASSP, 2006, pp. 325­328. [7] Linguistic Data Consortium, "Switchboard-2 corpora," http://www.ldc.upenn.edu. [8] "ICSI QuickNet," http://www.icsi.berkeley.edu/Speech/qn.html. [9] S. Young, Gunnar Evermann, Thomas Hain, D. Kershaw, Gareth Moore, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK book, Entropic, Ltd., Cambridge, UK, 2002. [10] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993. [11] Christopher White, Izhak Shafran, and Jean-Luc Gauvain, "Discriminative classifiers for language recognition," in Proceedings of ICASSP, 2006, pp. I­213­I­216. [12] Lu-Feng Zhai, Man hung Siu, Xi Yang, and Herbert Gish, "Discriminatively trained language models using support vector machines for language identification," in Proc. IEEE Odyssey: The Speaker and Language Recognition Workshop, 2006. [13] Ronan Collobert and Samy Bengio, "SVMTorch: Support vector machines for large-scale regression problems," Journal of Machine Learning Research, vol. 1, pp. 143­160, 2001. [14] Avrim L. Blum and Pat Langley, "Selection of relevant features and examples in machine learning," Artificial Intelligence, vol. 97, no. 1-2, pp. 245­271, Dec. 1997. [15] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1-3, pp. 389­422, 2002. [16] Konrad Rieck and Pavel Laskov, "Language models for detection of unknown attacks in network traffic," Journal of Computer Virology, vol. 2, no. 4, pp. 243­256, 2007. [17] Takaaki Hori, I. Lee Hetherington, Timothy J. Hazen, and James R. Glass, "Open-vocabulary spoken utterance retrieval using confusion neworks," in Proceedings of ICASSP, 2007. [18] C. Cortes, P. Haffner, and M. Mohri, "Rational kernels," in Advances in Neural Information Processing Systems 15, S. Thrun S. Becker and K. Obermayer, Eds., Cambridge, MA, 2003, pp. 601­608, MIT Press. 8