Spherical Discriminant Analysis in Semi-supervised Speaker Clustering

Hao Tang, Dept. of ECE, University of Illinois, Urbana, IL 61801, USA (haotang2@ifp.uiuc.edu)
Stephen M. Chu, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA (schu@us.ibm.com)
Thomas S. Huang, Dept. of ECE, University of Illinois, Urbana, IL 61801, USA (huang@ifp.uiuc.edu)

Abstract

Semi-supervised speaker clustering refers to the use of our prior knowledge of speakers in general to assist the unsupervised speaker clustering process. In the form of an independent training set, the prior knowledge helps us learn a speaker-discriminative feature transformation, a universal speaker prior model, and a discriminative speaker subspace, or equivalently a speaker-discriminative distance metric. The directional scattering patterns of Gaussian mixture model mean supervectors motivate us to perform discriminant analysis on the unit hypersphere rather than in the Euclidean space, which leads to a novel dimensionality reduction technique called spherical discriminant analysis (SDA). Our experimental results show that speaker clustering in the SDA subspace yields superior performance to that in other reduced-dimensional subspaces (e.g., PCA and LDA).

1 Introduction

Speaker clustering is a critical part of speaker diarization (a.k.a. speaker segmentation and clustering) (Barras et al., 2006; Tranter and Reynolds, 2006; Wooters and Huijbregts, 2007; Han et al., 2008). Unlike speaker recognition, where we have training data for a set of known speakers and recognition can therefore be performed in a supervised manner, speaker clustering is usually performed in a completely unsupervised manner. The output of speaker clustering consists of internal labels relative to a dataset rather than real speaker identities. An interesting question is: can we do semi-supervised speaker clustering? That is, can we make use of any available information that may be helpful to speaker clustering? Our answer to this question is positive. Here, semi-supervision refers to the use of our prior knowledge of speakers in general to assist the unsupervised speaker clustering process. In the form of an independent training set, the prior knowledge helps us learn a speaker-discriminative feature transformation, a universal speaker prior model, and a discriminative speaker subspace, or equivalently a speaker-discriminative distance metric. (This work was funded in part by DARPA contract HR001106-2-0001.)

2 Semi-supervised Speaker Clustering

A general pipeline of speaker clustering consists of four essential elements: feature extraction, utterance representation, distance metric, and clustering. We incorporate our prior knowledge of speakers into the various stages of this pipeline through an independent training set.

2.1 Feature Extraction

The most popular speech features are spectrum-based acoustic features such as mel-frequency cepstral coefficients (MFCCs) and perceptual linear predictive (PLP) coefficients. In order to account for the dynamics of spectral change over time, the basic acoustic features are often supplemented by their first and second derivatives. We pursue a different avenue in which we augment the basic acoustic features of every frame with those of its neighboring frames. Specifically, the acoustic features of the current frame and those of the K_L frames to the left and K_R frames to the right are concatenated to form a high-dimensional feature vector. In the context-expanded feature vector space, we learn a speaker-discriminative feature transformation by linear discriminant analysis (LDA) based on the known speaker labels of the independent training set. The resulting low-dimensional feature subspace is expected to provide optimal speaker separability.
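For concreteness, the following sketch illustrates how the context expansion and the LDA feature transformation described above might be computed. It is a minimal illustration under stated assumptions, not the system actually used here: the function names (expand_context, fit_lda), the frame-matrix layout, and the use of NumPy and scikit-learn's LinearDiscriminantAnalysis are assumptions made for the example, while the context sizes (K_L = K_R = 4) and the 40 output dimensions follow the settings reported in Section 4.

    # Sketch of context expansion and LDA-based feature transformation.
    # Assumptions: 'frames' is a (num_frames x dim) array of basic acoustic
    # features for one utterance, and speaker labels are available for each
    # utterance of the independent training set.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def expand_context(frames, k_left=4, k_right=4):
        """Concatenate each frame with its K_L left and K_R right neighbors."""
        num_frames, dim = frames.shape
        padded = np.pad(frames, ((k_left, k_right), (0, 0)), mode="edge")
        expanded = np.hstack([padded[i:i + num_frames]
                              for i in range(k_left + k_right + 1)])
        return expanded  # shape: (num_frames, (k_left + 1 + k_right) * dim)

    def fit_lda(train_frames, train_speakers, out_dim=40):
        """Learn the speaker-discriminative transformation on the training set.

        train_frames: list of per-utterance feature matrices.
        train_speakers: per-utterance speaker identities.
        """
        X = np.vstack([expand_context(f) for f in train_frames])
        y = np.concatenate([[spk] * len(f)
                            for spk, f in zip(train_speakers, train_frames)])
        lda = LinearDiscriminantAnalysis(n_components=out_dim)
        lda.fit(X, y)
        return lda

At clustering time, every frame of a test utterance would simply be context-expanded and projected with lda.transform(...) before any further modeling.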
2.2 Utterance Representation

Deviating from the mainstream "bag of acoustic features" representation, in which the extracted acoustic features are represented by a statistical model such as a Gaussian mixture model (GMM), we adopt the GMM mean supervector representation that has emerged in the speaker recognition area (Campbell et al., 2006). Such a representation is obtained by maximum a posteriori (MAP) adaptation of a universal background model (UBM), trained on all the data in the independent training set, to a particular utterance. The component means of the adapted GMM are stacked to form a column vector conventionally called a GMM mean supervector. In this way, we are able to represent an utterance as a point in a high-dimensional space where traditional distance metrics and clustering techniques can be naturally applied. The UBM, which can be viewed as a universal speaker prior model inferred from the independent training set, imposes generic speaker constraints on the GMM mean supervector space.

2.3 Distance Metric

In the GMM mean supervector space, a naturally arising distance metric is the Euclidean distance metric. However, we observe that the supervectors show strong directional scattering patterns: the directions of the data points appear to be more indicative than their magnitudes. This observation motivates us to favor the cosine distance metric over the Euclidean distance metric for speaker clustering. Although the cosine distance metric can be used directly in the GMM mean supervector space, it is optimal only if the data points are uniformly spread in all directions of the entire space. In a high-dimensional space, the data most often lies in or near a low-dimensional manifold or subspace, so it is advantageous to learn an optimal distance metric from the data directly. The general cosine distance between two data points x and y can be defined and manipulated as follows:

d(x, y) = 1 - \frac{x^T A y}{\sqrt{x^T A x}\,\sqrt{y^T A y}}
        = 1 - \frac{(A^{1/2} x)^T (A^{1/2} y)}{\sqrt{(A^{1/2} x)^T (A^{1/2} x)}\,\sqrt{(A^{1/2} y)^T (A^{1/2} y)}}
        = 1 - \frac{(W^T x)^T (W^T y)}{\sqrt{(W^T x)^T (W^T x)}\,\sqrt{(W^T y)^T (W^T y)}}    (1)

The general cosine distance can thus be cast as the cosine distance between two transformed data points W^T x and W^T y, where W^T = A^{1/2}. In this sense, learning an optimal distance metric is equivalent to learning an optimal linear subspace of the original high-dimensional space.
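The following sketch illustrates the two ideas of this section: mapping an utterance to a GMM mean supervector and evaluating the general cosine distance of Equation (1) under a learned transformation W. It is an illustrative sketch, not the exact procedure used here: mean-only relevance MAP adaptation with relevance factor r = 16, the use of scikit-learn's GaussianMixture as the UBM, and the function names (map_mean_supervector, general_cosine_distance) are all assumptions made for the example.

    # Sketch: GMM mean supervector extraction via mean-only relevance MAP
    # adaptation, and the general cosine distance of Equation (1).
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def map_mean_supervector(ubm: GaussianMixture, frames: np.ndarray,
                             r: float = 16.0) -> np.ndarray:
        post = ubm.predict_proba(frames)            # (T, M) component posteriors
        n = post.sum(axis=0)                        # soft counts per component
        fx = post.T @ frames                        # (M, D) first-order statistics
        ex = fx / np.maximum(n[:, None], 1e-10)     # posterior mean per component
        alpha = (n / (n + r))[:, None]              # adaptation coefficients
        adapted = alpha * ex + (1.0 - alpha) * ubm.means_
        return adapted.reshape(-1)                  # stacked adapted component means

    def general_cosine_distance(x: np.ndarray, y: np.ndarray,
                                W: np.ndarray) -> float:
        """Equation (1): cosine distance between W^T x and W^T y."""
        u, v = W.T @ x, W.T @ y
        return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

With W set to the identity matrix, general_cosine_distance reduces to the plain cosine distance in the original supervector space; learning a better W is the subject of the next section.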
3 Spherical Discriminant Analysis

Most existing linear subspace learning techniques (e.g., PCA and LDA) are based on the Euclidean distance metric. In the GMM mean supervector space, we instead seek to perform discriminant analysis in the cosine distance metric space. We coin the phrase "spherical discriminant analysis" (SDA) to denote discriminant analysis on the unit hypersphere. We define a projection from a d-dimensional hypersphere to a d'-dimensional hypersphere, where d' < d:

y = \frac{W^T x}{\|W^T x\|}    (2)

We note that such a projection is nonlinear. However, under two mild conditions, this projection can be linearized. One is that the objective function for learning the projection involves only the cosine distance. The other is that only the cosine distance is used in the projected space. In this case, the norm of the projected vector y has no impact on the objective function or on distance computation in the projected space. Thus, the denominator term of Equation 2 can be safely dropped, leading to a linear projection.

3.1 Formulation

The goal of SDA is to seek a linear transformation W such that the average within-class cosine similarity of the projected data set is maximized while the average between-class cosine similarity of the projected data set is minimized. Assuming that there are c classes, the average within-class cosine similarity can be written in terms of the unknown projection matrix W and the original data points x as

S_W = \frac{1}{c} \sum_{i=1}^{c} S_i    (3)

S_i = \frac{1}{|D_i||D_i|} \sum_{y_j, y_k \in D_i} \frac{y_j^T y_k}{\sqrt{y_j^T y_j}\,\sqrt{y_k^T y_k}}
    = \frac{1}{|D_i||D_i|} \sum_{x_j, x_k \in D_i} \frac{x_j^T W W^T x_k}{\sqrt{x_j^T W W^T x_j}\,\sqrt{x_k^T W W^T x_k}}

where |D_i| denotes the number of data points in the ith class. Similarly, the average between-class cosine similarity can be written in terms of W and x as

S_B = \frac{1}{c(c-1)} \sum_{m=1}^{c} \sum_{n=1, n \neq m}^{c} S_{mn}    (4)

S_{mn} = \frac{1}{|D_m||D_n|} \sum_{y_j \in D_m} \sum_{y_k \in D_n} \frac{y_j^T y_k}{\sqrt{y_j^T y_j}\,\sqrt{y_k^T y_k}}
       = \frac{1}{|D_m||D_n|} \sum_{x_j \in D_m} \sum_{x_k \in D_n} \frac{x_j^T W W^T x_k}{\sqrt{x_j^T W W^T x_j}\,\sqrt{x_k^T W W^T x_k}}

where |D_m| and |D_n| denote the number of data points in the mth and nth classes, respectively. The SDA criterion is to maximize S_W while minimizing S_B:

W^* = \arg\max_W (S_W - S_B)    (5)

Our SDA formulation is similar to the work of Ma et al. (2007). However, we solve it efficiently within a general dimensionality reduction framework known as graph embedding (Yan et al., 2007).

3.2 Graph Embedding Solution

In graph embedding, a weighted graph with vertex set X and similarity matrix S is used to characterize certain statistical or geometrical properties of a data set. A vertex in X represents a data point, and an entry s_{ij} in S represents the similarity between data points x_i and x_j. For a specific dimensionality reduction algorithm, there may exist two graphs: the intrinsic graph {X, S^{(i)}} characterizes the data properties that the algorithm aims to preserve, and the penalty graph {X, S^{(p)}} characterizes the data properties that the algorithm aims to avoid. The goal of graph embedding is to represent each vertex in X as a low-dimensional vector that preserves the similarities in S. The objective function is

W^* = \arg\min_W \sum_{i \neq j} \| f(x_i, W) - f(x_j, W) \|^2 \, (s^{(i)}_{ij} - s^{(p)}_{ij})    (6)

where f(x, W) is a general projection with parameters W. If we take the projection to be of the form in Equation 2, the objective function becomes

W^* = \arg\min_W \sum_{i \neq j} \left\| \frac{W^T x_i}{\|W^T x_i\|} - \frac{W^T x_j}{\|W^T x_j\|} \right\|^2 (s^{(i)}_{ij} - s^{(p)}_{ij})    (7)

It has been shown that the solution to the graph embedding problem of Equation 7 may be obtained by a steepest descent algorithm (Fu et al., 2008). If we expand the L2 norm terms of Equation 7, it is straightforward to show that Equation 7 is equivalent to Equation 5, provided that the graph weights are set to the following values:

s^{(i)}_{jk} = \frac{1}{c\,|D_i||D_i|}        if x_j, x_k \in D_i,  i = 1, ..., c    (8)
s^{(p)}_{jk} = \frac{1}{c(c-1)\,|D_m||D_n|}   if x_j \in D_m, x_k \in D_n,  m, n = 1, ..., c,  m \neq n

That is, by assigning appropriate values to the weights of the intrinsic and penalty graphs, the SDA optimization problem in Equation 5 can be solved within the elegant graph embedding framework.
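To make the graph embedding formulation concrete, the sketch below constructs the intrinsic and penalty graph weights of Equation 8 and minimizes the objective of Equation 7 on toy-sized data. It is only a sketch under stated assumptions: the function names (sda_graph_weights, sda_objective, fit_sda) are invented for the example, and scipy's generic L-BFGS-B minimizer with numerical gradients is substituted for the steepest descent algorithm of Fu et al. (2008) purely to keep the code short; it would not scale to supervector dimensions.

    # Sketch: SDA via the graph embedding objective (Equations 7 and 8), toy-sized.
    import numpy as np
    from scipy.optimize import minimize

    def sda_graph_weights(labels):
        """Intrinsic and penalty graph weights of Equation 8."""
        labels = np.asarray(labels)
        classes = np.unique(labels)
        c, n = len(classes), len(labels)
        S_i, S_p = np.zeros((n, n)), np.zeros((n, n))
        for m in classes:
            idx_m = np.flatnonzero(labels == m)
            S_i[np.ix_(idx_m, idx_m)] = 1.0 / (c * len(idx_m) ** 2)
            for q in classes:
                if q == m:
                    continue
                idx_q = np.flatnonzero(labels == q)
                S_p[np.ix_(idx_m, idx_q)] = 1.0 / (c * (c - 1) * len(idx_m) * len(idx_q))
        return S_i, S_p

    def sda_objective(w_flat, X, S_i, S_p, d_out):
        """Equation 7: sum_{i != j} ||y_i - y_j||^2 (s_ij^(i) - s_ij^(p))."""
        W = w_flat.reshape(X.shape[1], d_out)
        Y = X @ W
        Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)   # projection of Equation 2
        sq_dist = np.square(Y[:, None, :] - Y[None, :, :]).sum(axis=2)
        return np.sum(sq_dist * (S_i - S_p))

    def fit_sda(X, labels, d_out, seed=0):
        S_i, S_p = sda_graph_weights(labels)
        rng = np.random.default_rng(seed)
        w0 = rng.standard_normal((X.shape[1], d_out)).ravel()
        res = minimize(sda_objective, w0, args=(X, S_i, S_p, d_out), method="L-BFGS-B")
        return res.x.reshape(X.shape[1], d_out)

Because points on the unit hypersphere satisfy ||y_i - y_j||^2 = 2 - 2 cos(y_i, y_j), minimizing this weighted sum of squared distances is equivalent to maximizing S_W - S_B in Equation 5, which is what the weight assignment in Equation 8 encodes.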
4 Experiments

Our speaker clustering experiments are based on a test set of 630 speakers and 19024 utterances selected from the GALE database (Chu et al., 2008), which contains about 1900 hours of broadcast news speech data collected from various TV programs. An independent training set of 498 speakers and 18327 utterances is also selected from the GALE database. In either data set, there are on average 30-40 utterances per speaker, and the average duration of the utterances is about 3-4 seconds. Note that there are no overlapping speakers in the two data sets: speakers in the test set are not present in the independent training set.

The acoustic features are 13 basic PLP features with cepstral mean subtraction. In computing the LDA feature transformation using the independent training set, K_L and K_R are both set to 4, and the dimensionality of the low-dimensional feature space is set to 40. The entire independent training set is used to train a UBM via the EM algorithm, and a GMM mean supervector is obtained for every utterance in the test set via MAP adaptation. The trained UBM has 64 mixture components; thus, the dimension of the GMM mean supervectors is 2560. We employ the hierarchical agglomerative clustering technique with the "ward" linkage method.

Our experiments are carried out as follows. In each experiment, we consider four cases, each associated with a specific number of test speakers: 5, 10, 20, and 50, respectively. In each case, the corresponding number of speakers is drawn randomly from the test set, and all the utterances from the selected speakers are used for clustering. For each case, 100 trials are run, each involving a random draw of the test speakers, and the average of the clustering accuracies across the 100 trials is recorded.

First, we perform speaker clustering in the original GMM mean supervector space using the Euclidean distance metric and the cosine distance metric, respectively. The results indicate that the cosine distance metric consistently outperforms the Euclidean distance metric. Next, we perform speaker clustering in the reduced-dimensional subspaces obtained by the eigenvoice (PCA) and fishervoice (LDA) approaches, respectively. The results show that the fishervoice approach significantly outperforms the eigenvoice approach in all cases. Finally, we perform speaker clustering in the SDA subspace. The results demonstrate that speaker clustering in the SDA subspace yields superior performance to that in the other reduced-dimensional subspaces (e.g., PCA and LDA). Table 1 presents these results.

Metric   Subspace   5 spk   10 spk   20 spk   50 spk
Euc      Orig       85.0    82.6     78.1     69.4
Euc      PCA        85.5    82.9     79.3     69.9
Euc      LDA        94.0    90.8     86.6     79.6
Cos      Orig       90.7    86.5     82.2     77.7
Cos      SDA        98.0    94.7     90.0     85.9

Table 1: Average speaker clustering accuracies (unit: %) for 5, 10, 20, and 50 test speakers.
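The sketch below illustrates one way the clustering and scoring protocol of this section could be implemented. It is an illustration under stated assumptions rather than the exact setup used here: the paper does not spell out its accuracy definition, so the optimal cluster-to-speaker mapping via the Hungarian algorithm is an assumption, as are the function names (cluster_utterances, clustering_accuracy) and the choice to length-normalize the projected supervectors so that Ward's Euclidean distances track the cosine geometry.

    # Sketch: ward-linkage agglomerative clustering of utterance supervectors in a
    # learned subspace, plus a clustering-accuracy score.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.optimize import linear_sum_assignment

    def cluster_utterances(supervectors, W, num_speakers):
        Y = supervectors @ W                                # project into the SDA subspace
        Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)    # unit length -> cosine geometry
        Z = linkage(Y, method="ward")
        return fcluster(Z, t=num_speakers, criterion="maxclust")

    def clustering_accuracy(pred, truth):
        pred_ids, truth_ids = np.unique(pred), np.unique(truth)
        counts = np.zeros((len(pred_ids), len(truth_ids)))
        for p, t in zip(pred, truth):
            counts[np.searchsorted(pred_ids, p), np.searchsorted(truth_ids, t)] += 1
        rows, cols = linear_sum_assignment(-counts)         # best cluster-to-speaker mapping
        return counts[rows, cols].sum() / len(truth)

Averaging clustering_accuracy over 100 random draws of 5, 10, 20, or 50 test speakers would reproduce the structure, though not necessarily the exact definition, of the numbers reported in Table 1.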
5 Conclusion

This paper proposes semi-supervised speaker clustering, in which we learn a speaker-discriminative feature transformation, a universal speaker prior model, and a speaker-discriminative distance metric through an independent training set. Motivated by the directional scattering patterns of the GMM mean supervectors, we perform discriminant analysis on the unit hypersphere rather than in the Euclidean space, leading to a novel dimensionality reduction technique, SDA. Our experimental results indicate that speaker clustering in the SDA subspace yields superior performance to that in other reduced-dimensional subspaces (e.g., PCA and LDA).

References

C. Barras, X. Zhu, S. Meignier, and J. Gauvain. 2006. Multistage speaker diarization of broadcast news. IEEE Trans. ASLP, 14(5):1505-1512.

W. Campbell, D. Sturim, and D. Reynolds. 2006. Support vector machines using GMM supervectors for speaker verification. IEEE Signal Processing Letters, 13(5):308-311.

S. Chu, H. Kuo, L. Mangu, Y. Liu, Y. Qin, and Q. Shi. 2008. Recent advances in the IBM GALE Mandarin transcription system. Proc. ICASSP.

Y. Fu, S. Yan, and T. Huang. 2008. Correlation metric for generalized feature extraction. IEEE Trans. PAMI, 30(12):2229-2235.

K. Han, S. Kim, and S. Narayanan. 2008. Strategies to improve the robustness of agglomerative hierarchical clustering under data source variation for speaker diarization. IEEE Trans. ASLP, 16(8):1590-1601.

Y. Ma, S. Lao, E. Takikawa, and M. Kawade. 2007. Discriminant analysis in correlation similarity measure space. Proc. ICML, 577-584.

S. Tranter and D. Reynolds. 2006. An overview of automatic speaker diarization systems. IEEE Trans. ASLP, 14(5):1557-1565.

C. Wooters and M. Huijbregts. 2007. The ICSI RT07s speaker diarization system. LNCS.

S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin. 2007. Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Trans. PAMI, 29(1):40-51.