Correlated Bigram LSA for Unsupervised Language Model Adaptation

Yik-Cheung Tam, InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, yct@cs.cmu.edu
Tanja Schultz, InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, tanja@cs.cmu.edu

Abstract

We present a correlated bigram LSA approach for unsupervised LM adaptation for automatic speech recognition. The model is trained using efficient variational EM and smoothed using the proposed fractional Kneser-Ney smoothing, which handles fractional counts. We address the scalability issue for large training corpora via bootstrapping of bigram LSA from unigram LSA. For LM adaptation, unigram and bigram LSA are integrated into the background N-gram LM via marginal adaptation and linear interpolation respectively. Experimental results on the Mandarin RT04 test set show that applying unigram and bigram LSA together yields 6%-8% relative perplexity reduction and 2.5% relative character error rate reduction, which is statistically significant compared to applying only unigram LSA. On the large-scale Arabic evaluation, the word error rate reduction from bigram LSA is statistically significant compared to the unadapted baseline.

1 Introduction

Language model (LM) adaptation is crucial to automatic speech recognition (ASR) since it enables higher-level contextual information to be effectively incorporated into a background LM, improving recognition performance. Exploiting topical context for LM adaptation has been shown to be effective for ASR using latent semantic analysis (LSA), such as LSA based on singular value decomposition [1], Latent Dirichlet Allocation (LDA) [2, 3, 4] and HMM-LDA [5, 6]. One issue in LSA is the bag-of-word assumption, which ignores word ordering. For document classification, word ordering may not be important. But from the LM perspective, word ordering is crucial since a trigram LM normally performs significantly better than a unigram LM for word prediction. In this paper, we investigate whether relaxing the bag-of-word assumption in LSA helps improve ASR performance via LM adaptation. We employ bigram LSA [7], a natural extension of LDA that relaxes the bag-of-word assumption by connecting the adjacent words in a document together to form a Markov chain. There are two main challenges in bigram LSA which are not addressed properly in [7], especially for large-scale applications. Firstly, the model can be very sparse since it covers topical bigrams in O(V^2 · K), where V and K denote the vocabulary size and the number of topics. Therefore, model smoothing becomes critical. Secondly, model initialization is important for EM training, especially for bigram LSA due to the model sparsity. To tackle the first challenge, we represent bigram LSA as a set of K topic-dependent backoff LMs. We propose fractional Kneser-Ney smoothing¹, which supports fractional counts, to smooth each backoff LM.

(This work is partly supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-06-2-0001. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.)

¹ This method was briefly mentioned in [8] without detail. To the best of our knowledge, the formulation in this paper is new to the research community.

Figure 1: Graphical representation of bigram LSA. Adjacent words in a document are linked together to form a Markov chain from left to right.
We show that our formulation recovers the original Kneser-Ney smoothing [9], which supports only integral counts. To address the second challenge, we propose a bootstrapping approach for bigram LSA training using a well-trained unigram LSA as an initial model. During unsupervised LM adaptation, word hypotheses from the first-pass decoding are used to estimate the topic mixture weights of each test audio show in order to adapt both unigram and bigram LSA. The adapted unigram and bigram LSA are combined with the background LM in two stages. Firstly, marginal adaptation [10] is applied to integrate unigram LSA into the background LM. Then the intermediate LM from the first stage is combined with bigram LSA via linear interpolation, with the interpolation weights estimated by minimizing the word perplexity of the word hypotheses. The final adapted LM is employed for re-decoding.

Related work includes topic mixtures [11], which performs document clustering and trains a trigram LM for each document cluster as an initial model. Sentence-level topic mixtures are modeled so that the topic label is fixed within a sentence. The topical N-gram model [12] focuses on phrase discovery and information retrieval. We do not apply this model because the phrase-based LM does not seem to outperform the word-based LM.

The paper is organized as follows: In Section 2, we describe bigram LSA training and the fractional Kneser-Ney smoothing algorithm. In Section 3, we present the LM adaptation approach based on marginal adaptation and linear interpolation. In Section 4, we report LM adaptation results on Mandarin and Arabic ASR, followed by conclusions and future work in Section 5.

2 Correlated bigram LSA

Latent semantic analysis such as LDA makes a bag-of-word assumption that each word in a document is generated irrespective of its position in the document. To relax this assumption, bigram LSA has been proposed [7] to modify the graphical structure of LDA by connecting adjacent words in a document together to form a Markov chain. Figure 1 shows the graphical representation of bigram LSA, where the top node represents the prior distribution over the topic mixture weights and the middle layer represents the latent topic label associated with each observed word at the bottom layer. The document generation procedure of bigram LSA is similar to LDA except that the previous word is taken into consideration for generating the current word (a sampling sketch follows the procedure below):

1. Sample θ from a prior distribution p(θ)
2. For each word w_i at the i-th position of a document:
   (a) Sample the topic label: z_i ~ Multinomial(θ)
   (b) Sample w_i given the previous word w_{i-1} and the topic label z_i: w_i ~ p(·|w_{i-1}, z_i)
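The generative procedure above can be illustrated with a small simulation. The following is a minimal sketch (not the authors' code); for simplicity, a flat Dirichlet prior stands in for the Dirichlet-Tree prior introduced in Section 2.1, and the topic-dependent bigram tables are random placeholders. All sizes and hyperparameters are illustrative assumptions.

```python
# Minimal sampling sketch of the bigram LSA generative procedure (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
V, K = 50, 4                               # vocabulary size and number of topics (toy values)
alpha = np.full(K, 0.5)                    # Dirichlet pseudo-counts (assumed)

# One row-stochastic V x V matrix per topic: bigram[k, u] = p(. | u, k)
bigram = rng.dirichlet(np.ones(V), size=(K, V))

def generate_document(n_words=20, bos=0):
    """Sample one document: theta ~ p(theta), then z_i ~ Multinomial(theta)
    and w_i ~ p(. | w_{i-1}, z_i) for each position i."""
    theta = rng.dirichlet(alpha)           # step 1: topic mixture weights
    words, prev = [], bos                  # w_0 is a begin-of-sentence token
    for _ in range(n_words):
        z = rng.choice(K, p=theta)         # step 2(a): latent topic label
        w = rng.choice(V, p=bigram[z, prev])   # step 2(b): word given (w_{i-1}, z_i)
        words.append(int(w))
        prev = w
    return words

print(generate_document()[:10])
```

In the full model, p(θ) is the Dirichlet-Tree prior of Section 2.1 and each p(·|u, k) is a topic-dependent backoff LM smoothed with fractional Kneser-Ney smoothing (Section 2.3).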
Our incremental contributions for bigram LSA are three-fold: Firstly, we present a technique for topic correlation modeling using a Dirichlet-Tree prior in Section 2.1. Secondly, we propose an efficient algorithm for bigram LSA training via a variational Bayes approach and model bootstrapping, which is scalable to large settings, in Section 2.2. Thirdly, we formulate fractional Kneser-Ney smoothing, which generalizes the original Kneser-Ney smoothing supporting only integral counts, in Section 2.3.

Figure 2: Left: Dirichlet-Tree prior of depth two. Right: Variational E-step as bottom-up propagation and summation of fractional topic counts.

2.1 Topic correlation

Modeling topic correlations is motivated by the observation that documents such as newspaper articles are usually organized into a main-topic and sub-topic hierarchy for document browsing. From this perspective, a Dirichlet prior is not appropriate since it assumes topic independence. A Dirichlet-Tree prior [13, 14] is employed to capture topic correlations. Figure 2 (Left) illustrates a depth-two Dirichlet-Tree. A depth-one Dirichlet-Tree is equivalent to the Dirichlet prior in LDA. The sampling procedure for the topic mixture weights θ ~ p(θ) can be described as follows:

1. Sample a vector of branch probabilities b_j ~ Dirichlet(·; {α_jc}) for each node j = 1...J, where {α_jc} denotes the parameters of the Dirichlet distribution at node j, i.e. the pseudo-counts of the outgoing branches c at node j.
2. Compute the topic mixture weight as θ_k = ∏_{jc} b_jc^{δ_jc(k)}, where δ_jc(k) is an indicator function which is set to unity when the c-th branch of the j-th node leads to the leaf node of topic k and zero otherwise. The k-th topic weight θ_k is thus the product of the sampled branch probabilities from the root node to the leaf node corresponding to topic k.

The structure and the number of outgoing branches of each Dirichlet node can be arbitrary. In this paper, we employ a balanced binary Dirichlet-Tree.

2.2 Model training

Gibbs sampling was employed for bigram LSA training in [7]. Despite its simplicity, it can be slow and inefficient since it usually requires many sampling iterations for convergence. We present a variational Bayes approach for model training. The joint likelihood of a document w_1^N, the latent topic sequence z_1^N and θ under bigram LSA can be written as follows:

p(w_1^N, z_1^N, \theta) = p(\theta) \cdot \prod_{i=1}^{N} p(z_i|\theta) \cdot p(w_i|w_{i-1}, z_i)    (1)

By introducing a factorizable variational posterior distribution q(z_1^N, θ; Γ) = q(θ) · ∏_{i=1}^N q(z_i) over the latent variables and applying Jensen's inequality, a lower bound of the marginalized document likelihood can be derived as follows:

\log p(w_1^N; \Lambda, \Gamma) = \log \sum_{z_1...z_N} \int_\theta q(z_1^N, \theta; \Gamma) \cdot \frac{p(w_1^N, z_1^N, \theta; \Lambda)}{q(z_1^N, \theta; \Gamma)}    (2)
\ge \sum_{z_1...z_N} \int_\theta q(z_1^N, \theta; \Gamma) \cdot \log \frac{p(w_1^N, z_1^N, \theta; \Lambda)}{q(z_1^N, \theta; \Gamma)}    (by Jensen's inequality)    (3)
= E_q\left[\log \frac{p(\theta)}{q(\theta)}\right] + \sum_{i=1}^{N} E_q\left[\log \frac{p(z_i|\theta)}{q(z_i)}\right] + \sum_{i=1}^{N} E_q\left[\log p(w_i|w_{i-1}, z_i)\right]    (4)
= Q(w_1^N; \Lambda, \Gamma)    (5)

where the expectation is taken with respect to the variational posterior q(z_1^N, θ). For the E-step, we compute the partial derivatives of the auxiliary function Q(·) with respect to q(z_i) and the parameters γ_jc of the Dirichlet-Tree posterior q(θ). Setting the derivatives to zero yields:

E-step:
q(z_i = k) \propto p(w_i|w_{i-1}, k) \cdot e^{E_q[\log \theta_k; \{\gamma_{jc}\}]}    for k = 1..K    (6)
\gamma_{jc} = \alpha_{jc} + \sum_{i=1}^{N} E_q[\delta_{jc}(z_i)] = \alpha_{jc} + \sum_{i=1}^{N} \sum_{k=1}^{K} q(z_i = k) \cdot \delta_{jc}(k)    (7)
where E_q[\log \theta_k] = \sum_{jc} \delta_{jc}(k) \cdot E_q[\log b_{jc}] = \sum_{jc} \delta_{jc}(k) \cdot \left(\Psi(\gamma_{jc}) - \Psi\left(\sum_{c'} \gamma_{jc'}\right)\right)    (8)

and Ψ(·) denotes the digamma function. Eqn 7 is motivated by the conjugate property that the Dirichlet-Tree posterior given the topic sequence z_1^N has the same form as the Dirichlet-Tree prior:

p(b_1^J | z_1^N) \propto p(z_1^N | b_1^J) \cdot p(b_1^J; \{\alpha_{jc}\}) = \prod_j \prod_c b_{jc}^{\sum_{i=1}^{N} \delta_{jc}(z_i)} \cdot \prod_j \prod_c b_{jc}^{\alpha_{jc} - 1}    (9)
= \prod_j \prod_c b_{jc}^{(\alpha_{jc} + \sum_{i=1}^{N} \delta_{jc}(z_i)) - 1} = \prod_{j=1}^{J} \text{Dirichlet}(b_j; \{\gamma_{jc}\})    (10)

Figure 2 (Right) illustrates that Eqn 7 can be implemented as propagation of fractional topic counts in a bottom-up fashion, with each branch serving as an accumulator for γ_jc. Eqn 6 and Eqn 7 are applied iteratively until convergence is reached (see the sketch below).
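As a concrete illustration of Eqns 6-8, the sketch below (a reimplementation under stated assumptions, not the authors' code) runs the variational E-step for a single document with a balanced binary Dirichlet-Tree. The tree is encoded with heap-style node indices, and the topic-dependent bigram probabilities p(w_i|w_{i-1}, k) are assumed to be precomputed into an array.

```python
# Variational E-step sketch for Eqns 6-8 (illustrative; K must be a power of two).
import numpy as np
from scipy.special import digamma

def leaf_paths(K):
    """paths[k]: (internal node, branch) pairs from the root to the leaf of topic k.
    Heap indexing: children of internal node j are 2j+1 and 2j+2; leaves are K-1..2K-2."""
    paths = []
    for k in range(K):
        node, path = K - 1 + k, []
        while node > 0:
            parent = (node - 1) // 2
            path.append((parent, node - (2 * parent + 1)))  # branch 0 = left, 1 = right
            node = parent
        paths.append(list(reversed(path)))
    return paths

def e_step(bigram_probs, paths, alpha, n_iters=10):
    """bigram_probs: (N, K) array with entry [i, k] = p(w_i | w_{i-1}, k) for one document.
    Returns q(z_i = k) and the Dirichlet-Tree posterior counts gamma[j, c]."""
    N, K = bigram_probs.shape
    gamma = np.array(alpha, dtype=float)       # (K-1, 2) branch pseudo-counts
    for _ in range(n_iters):
        # Eqn 8: E_q[log theta_k] accumulated along the path to each leaf
        e_log_theta = np.zeros(K)
        for k, path in enumerate(paths):
            for j, c in path:
                e_log_theta[k] += digamma(gamma[j, c]) - digamma(gamma[j].sum())
        # Eqn 6: q(z_i = k) proportional to p(w_i | w_{i-1}, k) * exp(E_q[log theta_k])
        qz = bigram_probs * np.exp(e_log_theta)
        qz /= qz.sum(axis=1, keepdims=True)
        # Eqn 7: bottom-up propagation of fractional topic counts into each branch
        gamma = np.array(alpha, dtype=float)
        topic_counts = qz.sum(axis=0)
        for k, path in enumerate(paths):
            for j, c in path:
                gamma[j, c] += topic_counts[k]
    return qz, gamma

# Toy usage: 8 topics, 5 word positions, random stand-in bigram probabilities
K, N = 8, 5
qz, gamma = e_step(np.random.default_rng(0).random((N, K)), leaf_paths(K), np.ones((K - 1, 2)))
```

A balanced binary tree with K leaves has K-1 internal nodes, so gamma holds two branch accumulators per internal node, matching the bottom-up summation shown in Figure 2 (Right).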
For the M-step, we compute the partial derivative of the auxiliary function Q(·), summed over all training documents d, with respect to the topic bigram probability p(v|u, k) and set it to zero:

M-step (unsmoothed):
C_d(u, v|k) = \sum_{i=1}^{N_d} q(z_i = k|d) \cdot \delta(w_{i-1}, u)\, \delta(w_i, v)    (11)
p(v|u, k) = \frac{\sum_d C_d(u, v|k)}{\sum_{v'=1}^{V} \sum_d C_d(u, v'|k)} = \frac{C(u, v|k)}{\sum_{v'=1}^{V} C(u, v'|k)}    (12)

where N_d denotes the number of words in document d and δ(w_i, v) is a 0-1 Kronecker delta function which tests whether the i-th word in document d is the vocabulary item v. C_d(u, v|k) denotes the fractional count of the bigram (u, v) belonging to topic k in document d, and C(u, v|k) = Σ_d C_d(u, v|k). Intuitively, Eqn 12 simply computes the relative frequency of the bigram (u, v). However, this solution is not practical since bigram LSA then assigns zero probability to unseen bigrams. Therefore, bigram LSA should be smoothed properly. One simple approach is Laplace smoothing, which adds a small count to all bigrams. However, this approach can lead to worse performance since it biases the bigram probability towards a uniform distribution when the vocabulary size V gets large. Our approach is to represent p(v|u, k) as a standard backoff LM smoothed by fractional Kneser-Ney smoothing as described in Section 2.3.

Model initialization is crucial for variational EM training. We employ a bootstrapping approach using a well-trained unigram LSA as an initial model for bigram LSA, so that p(w_i|w_{i-1}, k) is approximated by p(w_i|k) in Eqn 6. This saves computation and avoids keeping the full initial bigram LSA in memory during EM training. To make the training procedure more practical, we apply bigram pruning during statistics accumulation in the M-step when the bigram count in a document is less than 0.1. This heuristic is reasonable since only a small number of topics are "active" for a given bigram. With this sparsity, there is no need to store K copies of accumulators for each bigram, which reduces the memory requirement significantly. The pruned bigram counts are re-assigned to the most likely topic of the current document so that the counts are conserved (a sketch of this accumulation is given below). For practical implementation, accumulators are saved to disk in batches for count merging. In the final step, each topic-dependent LM is smoothed individually using the merged count file.
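The sketch below (our illustrative interpretation, not the authors' implementation) shows the per-document statistics accumulation of Eqn 11 together with the pruning heuristic just described: fractional bigram counts below the threshold (0.1 in the paper) are re-assigned to the most likely topic of the document so that the count mass is conserved.

```python
# M-step accumulation sketch with bigram pruning (illustrative assumptions).
from collections import defaultdict
import numpy as np

def accumulate_document(words, qz, accumulators, prune_threshold=0.1):
    """words: token ids w_0..w_N for one document; qz: (N, K) array of q(z_i = k | d);
    accumulators: list of K dicts mapping (u, v) to the merged fractional count C(u, v | k)."""
    K = qz.shape[1]
    # Eqn 11: C_d(u, v | k) = sum_i q(z_i = k | d) over positions where (w_{i-1}, w_i) = (u, v)
    doc_counts = defaultdict(lambda: np.zeros(K))
    for i in range(1, len(words)):
        doc_counts[(words[i - 1], words[i])] += qz[i - 1]
    top_topic = int(qz.sum(axis=0).argmax())         # most likely topic of this document
    for (u, v), counts in doc_counts.items():
        kept = counts >= prune_threshold
        pruned_mass = counts[~kept].sum()
        for k in np.flatnonzero(kept):
            accumulators[k][(u, v)] += counts[k]
        if pruned_mass > 0:                           # conserve pruned mass in the top topic
            accumulators[top_topic][(u, v)] += pruned_mass
    return accumulators

# Toy usage
K = 4
acc = [defaultdict(float) for _ in range(K)]
words = [0, 5, 7, 5, 2]
qz = np.random.default_rng(1).dirichlet(np.ones(K), size=len(words) - 1)
accumulate_document(words, qz, acc)
```

In a large-scale run these per-topic accumulators would be flushed to disk in batches and merged before smoothing, as described above.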
2.3 Fractional Kneser-Ney smoothing

The standard backoff N-gram LM is widely used in the ASR community. The state-of-the-art smoothing for the backoff LM is based on Kneser-Ney smoothing [9], whose success is believed to be due to the preservation of marginal distributions. However, the original formulation only works for integral counts, which is not suitable for bigram LSA with fractional counts. Therefore, we propose fractional Kneser-Ney smoothing as a generalization of the original formulation. The interpolated form using absolute discounting can be expressed as follows:

p_{KN}(v|u) = \frac{\max\{C(u,v) - D, 0\}}{C(u)} + \lambda(u) \cdot p_{KN}(v)    (13)

where D is a discounting factor. In the original formulation, D lies between 0 and 1; in our formulation, D can be any positive number. Intuitively, D controls the degree of smoothing: if D is set to zero, the model is unsmoothed; if D is too big, bigrams with counts smaller than D are pruned from the LM. λ(u) ensures that the bigram probabilities sum to unity. Summing over all possible v on both sides of Eqn 13 and re-arranging terms, λ(u) becomes:

1 = \sum_v \frac{\max\{C(u,v) - D, 0\}}{C(u)} + \lambda(u)    (14)
\Rightarrow \lambda(u) = 1 - \sum_v \frac{\max\{C(u,v) - D, 0\}}{C(u)} = 1 - \frac{\sum_{v: C(u,v) > D} (C(u,v) - D)}{C(u)}    (15)
= \frac{C(u) - \sum_{v: C(u,v) > D} C(u,v) + D \sum_{v: C(u,v) > D} 1}{C(u)}    (16)
= \frac{\sum_{v: C(u,v) \le D} C(u,v) + D \sum_{v: C(u,v) > D} 1}{C(u)}    (17)
= \frac{C_{\le D}(u,\cdot) + D \cdot N_{>D}(u,\cdot)}{C(u)}    (18)

where C_{≤D}(u,·) denotes the sum of the bigram counts following u that are not bigger than D, and N_{>D}(u,·) denotes the number of word types following u with bigram counts bigger than D. In Kneser-Ney smoothing, the lower-order distribution p_{KN}(v) is treated as an unknown parameter which can be estimated using the preservation of marginal distributions:

\hat{p}(v) = \sum_u p_{KN}(v|u) \cdot \hat{p}(u)    (19)

where p̂(v) is the marginal distribution estimated from the background training data, so that p̂(v) = C(v) / Σ_{v'} C(v'). Substituting Eqn 13 into Eqn 19 (the common count normalizer cancels on both sides):

C(v) = \sum_u \left[\frac{\max\{C(u,v) - D, 0\}}{C(u)} + \lambda(u) \cdot p_{KN}(v)\right] \cdot C(u)    (20)
= \sum_u \max\{C(u,v) - D, 0\} + p_{KN}(v) \cdot \sum_u C(u) \cdot \lambda(u)    (21)
\Rightarrow p_{KN}(v) = \frac{C(v) - \sum_u \max\{C(u,v) - D, 0\}}{\sum_u C(u) \cdot \lambda(u)}    (22)
= \frac{C(v) - C_{>D}(\cdot,v) + D \cdot N_{>D}(\cdot,v)}{\sum_u C(u) \cdot \lambda(u)}    (23)
= \frac{C_{\le D}(\cdot,v) + D \cdot N_{>D}(\cdot,v)}{\sum_u \left[C_{\le D}(u,\cdot) + D \cdot N_{>D}(u,\cdot)\right]}    (using Eqn 18)    (24)
= \frac{C_{\le D}(\cdot,v) + D \cdot N_{>D}(\cdot,v)}{\sum_{v'} \left[C_{\le D}(\cdot,v') + D \cdot N_{>D}(\cdot,v')\right]}    (25)

Eqn 25 generalizes Kneser-Ney smoothing to both integral and fractional counts. In the original formulation, C_{≤D}(u,·) equals zero since each observed bigram count is at least one and D is less than one. As a result, the D terms cancel out and Eqn 25 reduces to counting the number of distinct words preceding v, recovering the original formulation. Intuitively, the numerator in Eqn 25 measures the total discount of observed bigrams ending at v. In other words, fractional Kneser-Ney smoothing estimates the lower-order probability distribution using the relative frequency over discounts instead of word counts. With this approach, each topic-dependent LM in bigram LSA can be smoothed using our formulation (see the sketch below).
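A minimal sketch of fractional Kneser-Ney smoothing for a single bigram LM follows (assumed data layout, not production code). It accumulates the C_{≤D} and N_{>D} statistics, forms the lower-order distribution of Eqn 25, the backoff weight of Eqn 18 and the interpolated probability of Eqn 13.

```python
# Fractional Kneser-Ney smoothing sketch for one bigram LM (illustrative).
# `counts` maps (u, v) to a possibly fractional count, e.g. the merged C(u, v | k)
# of one topic; D is the discounting factor (0.4 in the paper's Mandarin setup).
from collections import defaultdict

def fractional_kneser_ney(counts, D=0.4):
    c_u = defaultdict(float)        # C(u) = sum_v C(u, v)
    c_le_row = defaultdict(float)   # C_<=D(u, .): count mass following u that is not above D
    n_gt_row = defaultdict(int)     # N_>D(u, .): word types following u with count > D
    c_le_col = defaultdict(float)   # C_<=D(., v)
    n_gt_col = defaultdict(int)     # N_>D(., v)
    for (u, v), c in counts.items():
        c_u[u] += c
        if c > D:
            n_gt_row[u] += 1
            n_gt_col[v] += 1
        else:
            c_le_row[u] += c
            c_le_col[v] += c

    # Eqn 25: lower-order distribution as relative frequency over discounts
    lower = {v: c_le_col[v] + D * n_gt_col[v] for v in set(c_le_col) | set(n_gt_col)}
    Z = sum(lower.values())
    p_lower = {v: x / Z for v, x in lower.items()}

    # Eqn 18: backoff weight lambda(u)
    lam = {u: (c_le_row[u] + D * n_gt_row[u]) / c_u[u] for u in c_u}

    def p_kn(v, u):
        """Eqn 13: interpolated bigram probability p_KN(v | u)."""
        discounted = max(counts.get((u, v), 0.0) - D, 0.0) / c_u[u] if u in c_u else 0.0
        return discounted + lam.get(u, 1.0) * p_lower.get(v, 0.0)
    return p_kn

# Toy usage with fractional counts
p = fractional_kneser_ney({("a", "b"): 1.7, ("a", "c"): 0.05, ("b", "c"): 0.9, ("c", "a"): 2.3})
print(p("b", "a"), p("c", "a"))
```

With integral counts and D < 1, the C_{≤D} terms vanish and p_lower reduces to the relative frequency of the number of distinct words preceding v, i.e. the original Kneser-Ney lower-order distribution.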
3 Unsupervised LM adaptation

Unsupervised LM adaptation is performed by first inferring the topic distribution of each test audio show using the word hypotheses from the first-pass decoding via variational inference in Eqns 6-7. Relative frequency over the branch posterior counts γ_jc is then applied at each Dirichlet node j. The MAP topic mixture weight θ̂ and the adapted unigram and bigram LSA are computed as follows:

\hat{\theta}_k = \prod_{jc} \left(\frac{\gamma_{jc}}{\sum_{c'} \gamma_{jc'}}\right)^{\delta_{jc}(k)}    for k = 1...K    (26)
p_a(v) = \sum_{k=1}^{K} p(v|k) \cdot \hat{\theta}_k    and    p_a(v|u) = \sum_{k=1}^{K} p(v|u,k) \cdot \hat{\theta}_k    (27)

The unigram LSA marginals are integrated into the background N-gram LM p_bg(v|h) via marginal adaptation [10] as follows:

p_a^{(1)}(v|h) \propto \frac{p_a(v)}{p_{bg}(v)} \cdot p_{bg}(v|h)    (28)

Marginal adaptation has a close connection to maximum entropy modeling since the marginal constraints can be encoded as unigram features. Intuitively, bigram LSA could be integrated in the same fashion by introducing bigram marginal constraints. However, we found that integrating bigram features via marginal adaptation did not offer further improvement compared to integrating unigram features only. Since marginal adaptation integrates a unigram feature as a likelihood ratio between the adapted marginal p_a(v) and the background marginal p_bg(v) in Eqn 28, perhaps the unigram and bigram likelihood ratios are very similar, so the latter does not provide extra information. Another explanation is that marginal adaptation corresponds to only one iteration of generalized iterative scaling (GIS); with millions of bigram features, one GIS iteration may not be sufficient for convergence. On the other hand, simple linear LM interpolation is found to be effective in our experiments. The final LM adaptation formula combines the results of Eqn 27 and Eqn 28 as a two-stage process:

p_a^{(2)}(v|h) = \lambda \cdot p_a^{(1)}(v|h) + (1 - \lambda) \cdot p_a(v|u)    (29)

where u denotes the word preceding v in the history h and λ is tuned to optimize perplexity on the word hypotheses from the first-pass decoding on a per-audio basis. A sketch of this two-stage combination is given below.
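The two-stage combination in Eqns 26-29 can be sketched as follows (an illustrative reimplementation with assumed interfaces: the background LM is a callable, the normalization of Eqn 28 is written out explicitly over the vocabulary, and the interpolation weight is fixed here rather than tuned per audio show).

```python
# Two-stage LM adaptation sketch for Eqns 26-29 (illustrative assumptions).
import numpy as np

def map_topic_weights(gamma, paths):
    """Eqn 26: theta_k as the product of branch relative frequencies gamma_jc / sum_c' gamma_jc'
    along the path from the root to the leaf of topic k (paths as in the E-step sketch)."""
    theta = np.ones(len(paths))
    for k, path in enumerate(paths):
        for j, c in path:
            theta[k] *= gamma[j, c] / gamma[j].sum()
    return theta

def adapt_lm(theta, p_uni_lsa, p_bi_lsa, p_bg_uni, p_bg, vocab, lam=0.7):
    """p_uni_lsa: (K, V) array p(v | k); p_bi_lsa: (K, V, V) array p(v | u, k);
    p_bg_uni: (V,) background unigram marginals; p_bg(h, v): background N-gram callable.
    Returns a callable p_a2(v, u, h) implementing Eqns 27-29."""
    p_a_uni = theta @ p_uni_lsa                        # Eqn 27: adapted unigram LSA marginals
    p_a_bi = np.einsum("k,kuv->uv", theta, p_bi_lsa)   # Eqn 27: adapted bigram LSA
    ratio = p_a_uni / p_bg_uni                         # Eqn 28: unigram likelihood ratios

    def p_a2(v, u, h):
        # Stage 1 (Eqn 28): marginal adaptation of the background LM, renormalized over v'
        z = sum(ratio[v2] * p_bg(h, v2) for v2 in vocab)
        p1 = ratio[v] * p_bg(h, v) / z
        # Stage 2 (Eqn 29): linear interpolation with the adapted bigram LSA
        return lam * p1 + (1.0 - lam) * p_a_bi[u, v]
    return p_a2
```

In practice, lam would be re-estimated for each audio show by minimizing the perplexity of the first-pass hypotheses, and the stage-1 normalizer would be computed once per history h.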
4 Experimental setup

Our LM adaptation approach was evaluated using the RT04 Mandarin Broadcast News evaluation system. The system employed context-dependent Initial-Final acoustic models trained on 100 hours of broadcast news audio from the Mandarin HUB4 1997 training set and a subset of TDT4. 42-dimensional features were extracted by linear discriminant analysis projected from a window of MFCC and energy features. The system employed a two-pass decoding strategy using speaker-independent and speaker-adaptive acoustic models. For the second-pass decoding, we applied standard acoustic model adaptation such as vocal tract length normalization and maximum likelihood linear regression in the feature and model spaces. The training corpora include Xinhua News 2002 (January-September), containing 13M words and 64k documents. A background 4-gram LM was trained using modified Kneser-Ney smoothing with the SRILM toolkit [15]. The same training corpora were used for unigram and bigram LSA training with 200 topics. The vocabulary size is 108k words. The discounting factor D for fractional Kneser-Ney smoothing was set to 0.4. First-pass decoding was performed to obtain an automatic transcript for each audio show. Then unsupervised LM adaptation was applied using the automatic transcript to obtain an adapted LM for second-pass decoding, following the approach described in Section 3. Word perplexity and character error rates (CER) were measured on the Mandarin RT04 test set. The matched-pairs sentence-segment word error test was performed for significance testing using the NIST scoring tool.

Table 1: Correlated bigram topics extracted from bigram LSA (topic index topic-61 to topic-65; top bigrams sorted by p(u, v|k)). English glosses of the top bigrams include: ('s student), ('s education), (education 's), (school 's), (youth class), (quality of education), (expert cultivation), (university chancellor), (famous), (high-school), ('s student), (and social security), ('s employment), (unemployed officer), (employment position), ('s research), (expert people), (etc area), (biological technology), (research result), (Human DNA sequence), ('s DNA), (biological technology), (embryo stem cell).

Table 2: Character error rates (word perplexity in parentheses) on the RT04 test set. Bigram LSA was applied in addition to unigram LSA.

    LM (13M)                              CCTV          NTDTV         RFA           OVERALL
    background LM                         15.3% (748)   21.8 (1718)   39.5 (3655)   24.9
    +unigram LSA                          14.4 (629)    21.5 (1547)   38.9 (3015)   24.3
    +bigram LSA (Kneser-Ney, 30 topics)   14.5 (604)    20.7 (1502)   39.0 (2736)   24.1
    +bigram LSA (Witten-Bell)             14.1 (594)    20.9 (1452)   38.3 (2628)   23.8
    +bigram LSA (Kneser-Ney)              14.0 (587)    20.8 (1448)   38.2 (2586)   23.7

4.1 LM adaptation results

Table 1 shows the correlated bigram topics sorted by the joint bigram probability p(v|u, k) · p(u|k). Most of the top bigrams appear either as phrases or as words attached to a stopword (glossed as 's in English). Table 2 shows the LM adaptation results in CER and perplexity. Applying both unigram and bigram LSA yields consistent improvement over unigram LSA alone, in the range of 6.4%-8.5% relative reduction in perplexity and 2.5% relative reduction in the overall CER. The CER reduction is statistically significant at the 0.1% significance level. We compared our proposed fractional Kneser-Ney smoothing with Witten-Bell smoothing, which also supports fractional counts; the results show that Kneser-Ney smoothing performs slightly better than Witten-Bell smoothing. Increasing the number of topics in bigram LSA helps despite the model sparsity. We applied extra EM iterations on top of the bootstrapped bigram LSA but observed no further performance improvement.

4.2 Large-scale evaluation

We evaluated our approach using the CMU-InterACT vowelized Arabic transcription system trained on 1500 hours of transcribed audio for the GALE Phase-3 evaluation. A large background 4-gram LM was trained on 962M-word text corpora with a 737k-word vocabulary. Unigram and bigram LSA were trained on the same corpora and applied to lattice rescoring on the Dev07 and Eval07 test sets, 2.6 hours and 4.1 hours of audio shows covering the broadcast news (BN) and broadcast conversation (BC) genres. Table 3 shows that bigram LSA rescoring reduces the overall word error rate (WER) by 3.2% and 1.6% relative compared to the unadapted baseline on Dev07 and Eval07 respectively. The results are statistically significant at the 0.1% and 0.4% significance levels respectively. On the other hand, bigram LSA performs significantly better than unigram LSA on Dev07 but not on Eval07. The results suggest that bigram LSA is more effective on BN than on BC.

Table 3: Lattice rescoring results in word error rate on Dev07 (Eval07 in parentheses) using the CMU-InterACT Arabic transcription system for the GALE Phase-3 evaluation.

    GALE LM (962M)             BN             BC            OVERALL
    background LM              12.4% (14.0)   21.0 (22.9)   15.4 (18.4)
    +unigram LSA               12.2 (14.0)    21.0 (22.5)   15.3 (18.2)
    +bigram LSA (Kneser-Ney)   11.8 (13.6)    20.7 (22.7)   14.9 (18.1)

5 Conclusion

We present a correlated bigram LSA approach for unsupervised LM adaptation for ASR. Our contributions include efficient variational EM for model training and a fractional Kneser-Ney approach for LM smoothing with fractional counts. Bigram LSA yields additional improvement in both perplexity and recognition performance on top of unigram LSA. Increasing the number of topics for bigram LSA helps despite the model sparsity. Bootstrapping bigram LSA from unigram LSA saves computation and memory during EM training. Our approach is scalable to large training corpora and works well on different languages. The improvement from bigram LSA is statistically significant compared to the unadapted baseline. Future work includes applying the proposed approach to statistical machine translation.

Acknowledgement

We would like to thank Mark Fuhs for help in parallelizing the bigram LSA training via Condor.

References

[1] J. R. Bellegarda, "Large Vocabulary Speech Recognition with Multispan Statistical Language Models," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 1, pp. 76-84, Jan 2000.
[2] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, 2003, pp. 1107-1135.
Jordan, "Latent Dirichlet Allocation," in Journal of Machine Learning Research, 2003, pp. 1107­1135. [3] Y. C. Tam and T. Schultz, "Language model adaptation using variational Bayes inference," in Proceedings of Interspeech, 2005. [4] D. Mrva and P. C. Woodland, "Unsupervised language model adaptation for mandarin broadcast conversation transcription," in Proceedings of Interspeech, 2006. [5] T. Griffiths, M. Steyvers, D. Blei, and J. Tenenbaum, "Integrating topics and syntax," in Advances in Neural Information Processing Systems, 2004. [6] B. J. Hsu and J. Glass, "Style and topic language model adaptation using HMM-LDA," in Proceedings of Empirical Methods on Natural Language Processing (EMNLP), 2006. [7] Hanna M. Wallach, "Topic Modeling: Beyond Bag-of-Words," in International Conference on Machine Learning, 2006. [8] P. Xu, A. Emami, and F. Jelinek, "Training connectionist models for the structured language model," in Proceedings of Empirical Methods on Natural Language Processing (EMNLP), 2003. [9] R. Kneser and H. Ney, "Improved backing-off for M-gram language modeling," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1995, vol. 1, pp. 181­184. [10] R. Kneser, J. Peters, and D. Klakow, "Language model adaptation using dynamic marginals," in Proceedings of European Conference on Speech Communication and Technology (EUROSPEECH), 1997, pp. 1971­1974. [11] R. Iyer and M. Ostendorf, "Modeling long distance dependence in language: Topic mixtures versus dynamic cache models," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 1, pp. 30­39, Jan 1999. [12] X. Wang, A. McCallum, and X. Wei, "Topical N-grams: Phrase and topic discovery, with an application to information retrieval," in IEEE International Conference on Data Mining, 2007. [13] T. Minka, "The dirichlet-tree distribution," 1999. [14] Y. C. Tam and T. Schultz, "Correlated latent semantic model for unsupervised language model adaptation," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007. [15] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Proceedings of International Conference on Spoken Language Processing (ICSLP), 2002. 8