Indexing Confusion Networks for Morph-based Spoken Document Retrieval

Ville T. Turunen and Mikko Kurimo
Adaptive Informatics Research Centre, Helsinki University of Technology
PO Box 5400, FI-02015 TKK, Finland
{ville.t.turunen, mikko.kurimo}@tkk.fi

ABSTRACT

In this paper, we investigate methods for improving the performance of morph-based spoken document retrieval in Finnish by extracting relevant index terms from confusion networks. Our approach uses morpheme-like subword units ("morphs") for recognition and indexing. This alleviates the problem of out-of-vocabulary words, especially with inflectional languages like Finnish. Confusion networks offer a convenient representation of alternative recognition candidates by aligning mutually exclusive terms and by giving the posterior probability of each term. The rank of the competing terms and their posterior probability are used to estimate term frequency for indexing. Comparing against 1-best recognizer transcripts, we show that retrieval effectiveness is significantly improved. Finally, the effect of pruning in recognition is analyzed, showing that when recognition speed is increased, the reduction in retrieval performance due to the increase in the 1-best error rate can be compensated by using confusion networks.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; I.2.7 [Artificial Intelligence]: Natural Language Processing--Speech recognition and synthesis

General Terms
Algorithms, Languages, Performance

Keywords
Spoken document retrieval, Subword indexing, Morphemes, Confusion networks, Lattices

1. INTRODUCTION

As more and more spoken information is produced and archived, there is an increasing need for indexing and retrieving audio material based on the speech content. In addition to TV and radio broadcasts, an increasing amount of audio and video material is distributed on the Internet, e.g. in the form of podcasts and on video sharing web sites such as YouTube. Currently, these archives can only be searched based on human-input metadata rather than their actual content. As the available material becomes more diverse, the requirements for the retrieval systems increase. Furthermore, different languages pose different challenges for retrieval. Most research is done for English, but these results cannot always be successfully applied to other languages with different morphology and other properties.

Content-based retrieval of speech data utilizes an automatic speech recognition (ASR) system to produce a transcript of the speech for indexing. Two main approaches have commonly been used for spoken document retrieval (SDR). In phone-based retrieval, the speech is transcribed into a string of phonemes. Query words are also transformed to phoneme strings and then matched to the recognizer outputs.
The second, word-based, approach uses a large vocabulary continuous speech recognition (LVCSR) system to transcribe the speech into words and then applies standard text retrieval methods to the transcripts. This has been the most successful approach in the TREC SDR tracks [3]. However, word-based methods suffer from the limited vocabulary of the speech recognizer. Any word in the speech that is not in the vocabulary (out-of-vocabulary, OOV) will always be misrecognized and replaced by an alternative that is deemed probable by the acoustic and language models. Phoneme recognizers are not limited to any vocabulary, but their performance is hurt by higher error rates. Typically, the vocabulary consists of the most frequent words in the language model training corpus. For retrieval this is especially problematic, since the less frequent words, such as proper names, are usually the most interesting from the retrieval point of view.

The problem of limited vocabulary can be alleviated by 'backing off' to the phoneme transcription at locations where no word of the vocabulary fits. This is the basic principle of the OOV detection proposed by Hazen and Bazzi [4]. Other methods include query and document expansion, where relevant terms are extracted from a parallel text corpus. The added semantically related terms help retrieve documents with missing OOV terms [20].

Our approach is based on morpheme-like subword units learned in an unsupervised manner. We call these units statistical morphs for short. The recognizer transcribes the speech as a string of morphs, leaving almost no words out of vocabulary. This approach is especially useful for languages with rich morphology (e.g. Finnish, Turkish) that have a high number of different inflected word forms and thus cannot use traditional language modeling based on whole words. The morph language model assigns word break positions, and thus both word-level and morph-level information can be used for indexing. The data driven algorithm is easily applicable to other languages with similar properties.

Retrieval using the recognizer transcripts as if they were error free has proved successful for low error rates [3]. State-of-the-art systems can achieve low enough error rates (better than 90% accuracy) for broadcast news material in English. But as the databases grow larger, the amount of CPU power that can be used for every fixed time of speech decreases. Also, it is more demanding to index speech recordings that have less optimal properties than noiseless non-spontaneous speech, e.g. recordings of meetings, telephone conversations, etc. Thus, it is not always possible to obtain recognizer transcripts that are accurate enough for successful retrieval.

Retrieval performance can be increased by extracting alternative hypotheses from the recognizer in addition to the most probable (1-best) candidate. A lattice is a graph containing a number of the most probable hypotheses considered by the recognizer and can be used as a source for extracting additional terms. A more compact representation for the hypotheses is the word confusion network (WCN), which offers a convenient representation of competing terms along with the posterior probability of each term. Mamou et al. [11] have shown improvements in SDR performance in low accuracy conditions by indexing and weighting terms in confusion networks based on their probability and rank among competitors.
In this paper, we investigate methods for improving the performance of morph-based spoken document retrieval in Finnish by extracting relevant index terms from confusion networks. Comparing against 1-best transcripts and error-free human transcripts, we show that retrieval effectiveness is significantly improved. As far as we know, this is the first time such methods have been applied to retrieval of speech in a highly inflectional language like Finnish.

This paper is organized as follows. In Section 2, we present the methods used in this work. In particular, the morph-based retrieval scheme is described in more detail, as well as the generation and use of the confusion networks. Section 3 presents the experiments and the obtained results. An overview of related work is given in Section 4. Finally, our conclusions are presented in Section 5.

2. METHODS

2.1 Morph-based retrieval

Most research on speech retrieval has focused on English data, but different languages have different properties that make methods developed for one language less usable for others. One such property is the level of agglutination. Finnish is a highly agglutinative language, which means that words are formed by joining together morphemes; there is thus a high number of distinct inflected word forms. This affects both the recognition and the retrieval phase of the SDR process.

2.1.1 Recognition phase

Statistical language models for speech recognition are typically built by observing co-occurrence statistics such as n-grams in a large text corpus. This works for English, as a reasonably sized lexicon can cover the language well. For a highly inflective language with a huge number of distinct word forms, constructing a fixed lexicon of words becomes infeasible. Also, training an effective language model using inflected words as units becomes very hard, as the amount of training data needed to cover enough instances of all the different forms grows too large.

One solution is to use subword language model units instead of whole words. If the units are chosen well, the size of the lexicon and the amount of training data that are needed to cover the language are greatly reduced. An unsupervised algorithm for finding suitable subword units has been developed by Creutz [1]. Based on the Minimum Description Length (MDL) principle, the algorithm takes in an unsegmented training corpus and finds a set of units that is compact but models the training set effectively. An n-gram model can then be built over a corpus that is segmented using these units. The units produced by the algorithm are referred to as statistical morphs, as the algorithm chooses the units based on statistical criteria and as the boundaries between the units in segmented word forms often coincide with grammatical morpheme boundaries. Speech recognition accuracy in Finnish has been greatly improved by utilizing statistical morphs in the language model [5]. As the algorithm is completely data driven, it can easily be applied to other languages. An example transcript produced by the morph-based recognizer is shown in Figure 1.

2.1.2 Retrieval phase

A speech recognizer with morph language modeling transcribes the speech into a string of morphs with markers at word break positions. Thus, both morph-level and word-level information can be used for indexing. Typically, in retrieval of an inflectional language, a morphological analyzer is used to lemmatize each inflected word form before indexing. However, not all languages have a morphological analyzer available, as building one requires special linguistic knowledge. Furthermore, in the case of speech retrieval, errors in the transcript can cause the morphological analysis to fail and produce spurious results. This happens if a morph in a word is misrecognized and the resulting word is grammatically incorrect, thus confusing the morphological analyzer. Also, the word break positions are sometimes wrongly assigned, again leading to confusion. The language model should prevent most of these situations, but not all of these errors can be avoided.
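To make this indexing scheme concrete, the sketch below splits a recognized morph string into word-level and morph-level index terms. The word-break marker symbol is recognizer specific, so a hypothetical "<w>" token stands in for it here; the fragment is modeled on Figure 1.

```python
# A minimal sketch, assuming a hypothetical "<w>" word-break token.
# Splits a recognized morph string into word-level and morph-level terms.

def parse_morph_transcript(transcript, word_break="<w>"):
    """Return (words, morphs) extracted from a recognized morph string."""
    words, morphs, current = [], [], []
    for token in transcript.split():
        if token == word_break:
            if current:
                words.append("".join(current))  # join morphs into a word
                current = []
        else:
            morphs.append(token)
            current.append(token)
    if current:
        words.append("".join(current))
    return words, morphs

# Fragment modeled on Figure 1 (morphs within a word are concatenated):
words, morphs = parse_morph_transcript("valtio ta <w> yhteis ymmarrys")
print(words)   # ['valtiota', 'yhteisymmarrys']
print(morphs)  # ['valtio', 'ta', 'yhteis', 'ymmarrys']
```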
Because the statistical morphs resemble grammatical morphemes, they are also an appealing candidate to be used as index terms as such. For retrieval, query terms are also segmented into morphs using the same segmentation algorithm. Thus, the need for the morphological analyzer can be avoided. This resembles stemming, as it separates the affix morphs from the stems. Spoken document retrieval in Finnish using morphs as index terms produces results that are about equal to those obtained from lemmatized transcripts [8]. However, the best results have been achieved by combining both methods.

The morph-based approach also alleviates the OOV query term problem, as it is now possible to recognize almost any word in speech by recognizing its component morphs. In practice, this means that rare words, such as some proper names, get transcribed into many small morphs while the more common words are formed of bigger pieces or just one morph. This behavior is caused by the statistical nature of the segmentation algorithm. However, errors are still often made, especially with foreign names that contain foreign sounds.

[Figure 1: Transcripts of a part of a Finnish news story about the independence struggles of Lithuania. Left, the recognized 1-best transcript of morphs, with word break positions marked. Right, the manual transcript, aligned by line, with morph boundaries marked. Notice the recognition of the name Kazimiera Prunskiene. (The letter "z" is transformed to the closest Finnish pronunciation, "ts".)]

A problem similar to understemming and overstemming arises from non-ideal segmentation of inflected word forms. Sometimes different inflected forms of the same base form produce different stems, as the boundary is placed at a wrong place. In some cases, it is not even possible to find positions for the boundaries in all the different inflected forms that produce a stem that is not confused with the stems of other words. This problem can be alleviated by the use of query expansion [9] or latent semantic indexing [19].
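To illustrate the query-side segmentation, here is a minimal sketch, assuming a toy unigram morph lexicon with invented probabilities in place of the statistical model learned by the algorithm of [1]. A Viterbi-style search over split points picks the most probable segmentation, so query words can be matched against morph index terms.

```python
import math

# Hypothetical unigram morph probabilities; the real model is learned
# from a large corpus with the MDL-based algorithm of [1].
LEXICON = {"itsenaisyys": 1e-4, "julistuksen": 5e-5, "julistus": 1e-4,
           "sa": 1e-2, "ta": 1e-2}

def segment(word):
    """Return the most probable morph segmentation of `word`."""
    n = len(word)
    # best[i] = (log-probability, split point) of the best segmentation
    # of word[:i]; unreachable prefixes stay at -infinity.
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            morph = word[start:end]
            if morph in LEXICON and best[start][0] > -math.inf:
                score = best[start][0] + math.log(LEXICON[morph])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return [word]                 # unsegmentable: keep whole
    morphs, end = [], n               # backtrack through split points
    while end > 0:
        start = best[end][1]
        morphs.append(word[start:end])
        end = start
    return morphs[::-1]

print(segment("itsenaisyysjulistuksensa"))
# ['itsenaisyys', 'julistuksen', 'sa']
```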
2.2 Lattices and Confusion Networks

The speech recognizer takes as input the speech signal and generates morph lattices that represent a large number of alternative hypotheses in the form of directed acyclic graphs. Each node of the graph has a timestamp. Each edge is labeled with a morph hypothesis and its acoustic and language model likelihoods. The edge corresponds to the signal delimited by the timestamps of the start and end nodes.

A more compact representation of a lattice, called a word confusion network (WCN) or sausage, has been proposed by Mangu et al. [12]. A confusion network contains a number of alignment positions and, in each position, a set of mutually exclusive word hypotheses called the confusion set. Each word in a confusion set is associated with its posterior probability, i.e. the probability of the word given the signal at that time interval. Sentence hypotheses can be generated by freely combining hypotheses at each alignment position. The 1-best path in a confusion network is simply obtained by picking the term with the highest posterior probability at each alignment position. The error rates are in general lower for the 1-best paths obtained from the confusion networks than for the 1-best paths of the lattice [12].

For indexing, confusion networks offer a convenient source for expanding the transcript with alternative recognition candidates. Confusion networks are more compact than lattices, and they also provide alignment for all the terms in the lattice. With confusion networks, it is easy to rank locally competing terms by their posterior probability and use the information for indexing.

At a general level, the algorithm for transforming a lattice to a confusion network consists of the following steps:

1. Compute the posterior probability for all edges in the lattice.
2. Pruning: remove all edges with posterior probability below some threshold.
3. Intra-word clustering: merge together edges corresponding to the same word instance and sum their posterior probabilities.
4. Inter-word clustering: group different words which compete around the same time interval and have similar phonetic properties to form confusion sets.

For a detailed description of the algorithm, see [12]. Pruning is needed to achieve better alignment of competing terms, as it removes constraining low probability paths. This results in more accurate 1-best paths, as explained in [12]. Removing very low probability terms can also increase retrieval performance, as these terms were likely not spoken in reality. However, if the pruning threshold is too high, there is a risk of removing correct terms and thus reducing recall.

With our morph-based recognizer, the confusion networks consist of morphs instead of words. A special marker indicates word break positions. An example morph confusion network is presented in Figure 2. The network corresponds to the beginning of the transcript of Figure 1.

[Figure 2: Beginning of the confusion network for the story of Figure 1. A special marker denotes word break positions, and *DEL* marks deletions (empty hypotheses). The 1-best path is "valtio ta" (of the state). The correct result "baltia ssa" (in the Baltic) is also present in the network.]
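As a concrete illustration of this representation, the sketch below models a confusion network as a list of confusion sets (dictionaries mapping hypotheses to posterior probabilities) and extracts the 1-best path by picking the highest-posterior term at each alignment position. The posterior values and the "<w>" word-break token are invented for the example; *DEL* denotes an empty hypothesis as in Figure 2.

```python
# A minimal sketch of a morph confusion network: a list of alignment
# positions, each a dictionary of mutually exclusive hypotheses with
# posterior probabilities. All probabilities are invented.
network = [
    {"valtio": 0.54, "baltia": 0.31, "*DEL*": 0.15},
    {"ta": 0.48, "ssa": 0.45, "*DEL*": 0.07},
    {"<w>": 0.99, "*DEL*": 0.01},
]

def one_best(network):
    """Pick the highest-posterior hypothesis at each alignment position."""
    path = []
    for confusion_set in network:
        term = max(confusion_set, key=confusion_set.get)
        if term != "*DEL*":          # empty hypotheses produce no output
            path.append(term)
    return path

print(one_best(network))  # ['valtio', 'ta', '<w>']
```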
2.3 Indexing and ranking confusion networks

Retrieval performance is decreased if a relevant term that is spoken is misrecognized and is thus missing from the transcript. However, it is possible that the correct term was considered by the recognizer but was not the top choice. In that case, the term will appear in the confusion network. Adding these alternative hypothesized terms to the index is expected to increase recall. However, as most of the candidates in the confusion network were in fact not spoken, we need to be careful so that the spurious terms do not hurt precision too much.

Following the notation in [11], let D be a document modeled by a confusion network. We use two pieces of information in the confusion network for each occurrence of a term t at position o: its posterior probability Pr(t | o, D) and its rank among competitors rank(t | o, D). The posterior probability tells how confident the recognizer is that the term occurs in the signal at that position. The rank of the term reflects the importance of the term relative to the other alternatives. In retrieval, a document with a higher probability and/or rank of a term should be preferred to one with lower values.

The classical vector space model with tf-idf weights and the cosine distance relevance measure is used for ranking the search results [15]. Normally, the term frequency tf is the number of times a term occurs in a document. In our case, we need to estimate a value for term frequency based on the posterior probabilities and ranks of each occurrence of the term in the confusion network of a document. We compare two methods for the estimate. In the first method, term frequency is evaluated by summing the posterior probabilities of all of its occurrences in the confusion network. This means that if the recognizer is confident that the term at a location is correctly recognized (posterior probability close to one), the term frequency is increased by (close to) one, as in the case of indexing error-free text documents. Less weight is given to terms with less confidence. Thus, the term frequency of a term t in a document D, tf(t, D), is defined by (confidence level or CL-method):

  tf(t, D) = sum_{i=1..|occ(t,D)|} Pr(t | o_i, D)    (1)

This is the same as used by Mamou et al. [11], except that we omit the boosting vector, which would assign a boosting factor to each rank of the different hypotheses. In our case, experimenting with different values of boosting did not improve the results. Instead, we use the ranks in a different way in our second method for estimating the term frequency. Siegler [17] noted that, in the case of lattices, the probability values do not necessarily give good estimates for term frequencies. Better results were achieved by using only the ranks of locally competing terms. Motivated by this, our second method for estimating term frequency is defined by (rank-method):

  tf(t, D) = sum_{i=1..|occ(t,D)|} 1 / rank(t | o_i, D)    (2)

This means that the highest ranked terms of each alignment of the lattice, which correspond to the 1-best result, get weights of one. The 1-best result is then expanded by competing terms, which are given less and less weight as their rank increases.

The inverse document frequency idf indicates the relative importance of a term in the corpus. Traditionally, idf is a function of the number of documents in the collection the term occurs in. In our case, we simply counted the number of confusion networks that the term occurs in at any position. In other words, the term occurrence o(t, D) was estimated by:

  o(t, D) = 1 if tf(t, D) > 0, and 0 otherwise    (3)

Now, the inverse document frequency for a term t is

  idf(t) = log( N / sum_D o(t, D) )    (4)

where N is the number of documents in the collection. In the equation for o(t, D), the value of tf(t, D) could also be thresholded by a value greater than zero to eliminate the effect of terms with low estimated frequency. That resembles the method used by Siegler [17] for estimating term presence by thresholding the estimated probability of occurrence. In our case, however, thresholding did not improve the results. This may be due to pruning at the confusion network calculation, where the terms with very low probability are already removed.
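The following sketch implements the term frequency estimates of Equations (1) and (2) and the document-frequency-based idf of Equations (3) and (4) over the confusion network representation from the previous listing. Excluding deletion hypotheses from the ranking is our simplifying assumption; the text above does not spell out that detail.

```python
import math
from collections import defaultdict

network = [  # as in the previous listing; probabilities are invented
    {"valtio": 0.54, "baltia": 0.31, "*DEL*": 0.15},
    {"ta": 0.48, "ssa": 0.45, "*DEL*": 0.07},
]

def tf_estimates(network):
    """Eq. (1) and (2): CL-method and rank-method tf for one document."""
    tf_cl = defaultdict(float)    # sum of posterior probabilities
    tf_rank = defaultdict(float)  # sum of reciprocal ranks
    for confusion_set in network:
        hyps = [(t, p) for t, p in confusion_set.items() if t != "*DEL*"]
        hyps.sort(key=lambda tp: -tp[1])       # rank 1 = highest posterior
        for rank, (term, posterior) in enumerate(hyps, start=1):
            tf_cl[term] += posterior           # Eq. (1)
            tf_rank[term] += 1.0 / rank        # Eq. (2)
    return tf_cl, tf_rank

def idf(term, collection_tfs):
    """Eq. (3) and (4): idf from the number of networks containing t."""
    df = sum(1 for tf in collection_tfs if tf.get(term, 0.0) > 0.0)
    return math.log(len(collection_tfs) / df) if df else 0.0

tf_cl, tf_rank = tf_estimates(network)
print(tf_rank["baltia"])  # 0.5: rank 2 at its alignment position
```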
3. EXPERIMENTS

3.1 Experimental setup

The corpus consisted of 288 spoken news stories in Finnish read by a single female speaker [2]. Each story was about one minute long. The manual reference transcripts of the documents were also available. Each story belonged to exactly one of 17 different topics, assigned by multiple independent judges. The topic descriptions were used as queries.

An 'unlimited vocabulary' continuous speech recognizer [5] was used to recognize the speech into morph lattices. The lattices were transformed to confusion networks with the SRI Language Modeling Toolkit [18]. The decoder pruning parameters were varied to analyze the effect of the recognition running time and to obtain confusion networks with different error rates and sizes. We used speaker independent acoustic models, trained on separate speech data consisting of 26 hours of speech from 207 speakers, as in [14]. The language model was trained on a corpus consisting of 30 million words from electronic books, newspaper text and short news stories. Most of the text was similar in style to the spoken news stories but from a different time period. Before training, the corpus was segmented into morpheme-like units using the unsupervised segmentation algorithm explained in Section 2.1. The size of the lexicon was about 26000 morphs.

The confusion networks were indexed using the estimates for term frequency and inverse document frequency from Section 2.3. The 1-best path was also extracted from the confusion networks and indexed as any text, using traditional values for tf and idf based on term counts. Similarly, the retrieval experiments were also performed on the error-free reference text to analyze the decline in performance due to recognition errors. Before indexing, the reference text was segmented into morphs using the same lexicon as with the language model training corpus.

As the correct relevance information was known, we could use standard IR measures provided by the trec_eval program [3]: mean average precision (MAP), precision at 15 documents (P@15), precision at 5 documents (P@5) and precision at R documents (P@R), where R is the number of relevant documents. We also plotted average interpolated recall-precision curves.
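For readers unfamiliar with these measures, the sketch below computes P@k and average precision for a single query from a ranked result list and a set of relevant documents; MAP is the mean of average precision over all queries. The experiments themselves use the standard trec_eval implementation, so this is only an illustration.

```python
def precision_at(k, ranking, relevant):
    """P@k: fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision(ranking, relevant):
    """Average of P@k taken at each rank k where a relevant doc appears."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

# Invented toy example: two of four retrieved documents are relevant.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at(2, ranking, relevant))    # 0.5
print(average_precision(ranking, relevant))  # (1/2 + 2/4) / 2 = 0.5
```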
The amount of errors in the 1-best transcripts is measured by word error rate (WER) and term error rate (TER). WER is the total number of errors (substitutions, deletions and insertions) divided by the number of words in the reference text. For an agglutinative language like Finnish, where words are formed by joining together morphemes, the WER tends to be higher than e.g. in English. This is because an expression that takes many words in English may be expressed in just one long word in Finnish. An error in one of the morphs in the word results in the whole word being counted as wrong. In English, on the other hand, if one of the words in the expression is misrecognized, several correct words remain.

For retrieval, a more relevant measure of error is obtained by comparing how much the index terms (morphs in our case) differ. The term error rate is the difference of the term frequency histograms between the indexes [7]:

  TER = sum_t |tf_ref(t) - tf_rec(t)| / sum_t tf_ref(t)    (5)

where tf_ref(t) is the term frequency in the reference text and tf_rec(t) is the term frequency in the recognized transcript.

It is also possible to compare the confusion network to the reference transcript and count the oracle error rate, i.e. the minimum number of error counts of any path through the network. The oracle error rate indicates the upper limit for the improvements that can be obtained from the confusion network.
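As a worked example, Equation (5) can be computed directly from the two term frequency histograms; the morph lists here are invented for illustration.

```python
# A direct transcription of the term error rate of Equation (5): compare
# the term frequency histograms of the reference and recognized indexes.
from collections import Counter

def term_error_rate(ref_terms, rec_terms):
    """TER = sum_t |tf_ref(t) - tf_rec(t)| / sum_t tf_ref(t)."""
    tf_ref, tf_rec = Counter(ref_terms), Counter(rec_terms)
    diff = sum(abs(tf_ref[t] - tf_rec[t]) for t in tf_ref | tf_rec)
    return diff / sum(tf_ref.values())

ref = ["valtio", "ta", "yhteis", "ymmarrys"]
rec = ["baltia", "ta", "yhteis", "ymmarrys"]   # one substitution
print(term_error_rate(ref, rec))  # (1 + 1) / 4 = 0.5
```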
Table 1: Recognition statistics. The lattice, confusion network (CN) and index sizes are given relative to the size of the respective 1-best transcription or index.

  measure / setup       1       2       3       4
  1-best WER %          47.76   40.89   38.13   37.34
  1-best TER %          58.00   47.14   41.89   40.29
  oracle TER %          43.12   26.72   16.87   12.50
  RT factor             0.95    2.10    4.86    7.56
  lattice size ratio    24.95   60.71   204.41  526.29
  CN size ratio         3.62    6.22    12.51   20.63
  index size ratio      2.31    3.45    6.14    9.02

3.2 Results

The object of the experiments was to examine whether morph-based spoken document retrieval could be improved by extracting terms from confusion networks. Retrieval effectiveness was compared between the following indexes: (1) reference text, (2) 1-best recognition result, (3) confusion network with term frequency estimated by the CL-method of Equation 1, and (4) confusion network with term frequency estimated by the rank-method of Equation 2. It was also examined how the performance changes with confusion networks produced by different decoder parameters, to see how much the speed of recognition, the size of the lattices and the resulting 1-best error rates affect the retrieval effectiveness.

Statistical analysis of the results was performed along the lines of [6], using the MATLAB Statistics Toolbox. Performance measures were first transformed with the arcsin sqrt(x) function to make them closer to a normal distribution. The Lilliefors test and the Jarque-Bera test were used to test the normality assumption, which always held. Two-way Analysis of Variance (ANOVA) was performed to examine differences between the methods. A 5% significance level was used in all cases.

Table 1 shows the WER, TER, oracle TER and real-time (RT) factors of the different recognizer runs. The RT factor indicates the ratio between the time required for recognition and the length of the audio. Also presented are the resulting total sizes of the uncompressed morph lattices, the confusion networks and the uncompressed index using the rank-method. The CL-method produced indexes of around the same size. The sizes are given relative to the respective 1-best transcription or index sizes. The size of the 1-best index was about 1100 kB for all setups.

As the level of pruning is decreased, the search space expands and the time of recognition increases, as indicated by the increase in the RT factor. The resulting 1-best error rates decrease for the first three setups but stay around the same for the third and fourth. The increase in search space can also be seen in the size of the resulting lattice. The sizes of the confusion network and the index also increase, but by smaller factors. The bigger lattices and confusion networks offer more potential for expanding the index with competing terms, which can be seen in the decrease of the oracle TER. On the other hand, they also require more computational power, and the risk of inserting spurious terms increases.

Retrieval performance statistics for the four recognizer setups are shown in Table 2. The performance of the error-free reference index is also presented. It can be seen that both expansion methods improve the performance over the 1-best index in all cases and by all measures. Also, the rank-method outperforms the CL-method in all cases and by all measures except two (P@15 for setup 3 and P@R for setups 3 and 4, where the performance is in practice equal). As with the error rates, the performance of setups 3 and 4 is almost equal, with and without the expansion methods. This indicates that we have reached the level where the pruning no longer limits the performance.

Table 2: Retrieval performance statistics. Statistically significant improvements over the 1-best baseline are discussed in the text.

  setup      measure   1-best   CL      rank
  1          MAP       0.716    0.758   0.774
             P@R       0.657    0.701   0.713
             P@5       0.847    0.871   0.871
             P@15      0.620    0.651   0.659
  2          MAP       0.739    0.801   0.833
             P@R       0.692    0.756   0.777
             P@5       0.871    0.918   0.929
             P@15      0.627    0.698   0.714
  3          MAP       0.765    0.817   0.850
             P@R       0.723    0.781   0.787
             P@5       0.871    0.894   0.941
             P@15      0.643    0.706   0.702
  4          MAP       0.768    0.823   0.852
             P@R       0.724    0.797   0.786
             P@5       0.882    0.894   0.929
             P@15      0.651    0.706   0.710
  ref. text  MAP       0.864    N/A     N/A
             P@R       0.817
             P@5       0.929
             P@15      0.749

Statistical testing revealed that with all recognition setups the improvements in MAP are significant for the rank-method over the 1-best method. With the CL-method, significant improvements were achieved only with the first two recognizer setups, which have the highest error rates. Similar results hold for P@15, with the exception of setup 4, where the improvements over the 1-best case were not significant for either method. For P@5, significant improvements were achieved only in setup 2 for the rank-method over the 1-best. This is not surprising, because the expansion is expected to increase recall rather than precision. Thus, no improvements were expected at lower cut-off levels where the precision is already high.

Similar behavior can be seen in the interpolated averaged recall-precision curves for the different setups in Figures 3-6. For all the setups and at almost all levels of recall, the methods are again ordered from lower precision to higher: 1-best, CL-method, rank-method. The reference index performance is marked with the dashed line. As the pruning levels are decreased, both expansion methods move the performance closer to the reference. For the third and fourth setup, the performance is again almost equal. At recall levels of 20% and lower, all indexes perform around the same level and their exact ordering is dominated by chance. For higher levels of recall, the increase in performance is bigger, as expected. This indicates that the expansion helps retrieve relevant documents that were previously ranked lower and that have query terms in the confusion sets.

[Figure 3: Recall-precision curves for setup 1 (1-best WER = 47.76%), comparing the reference text, 1-best transcription, CL-method and rank-method indexes.]

[Figure 4: Recall-precision curves for setup 2 (1-best WER = 40.89%).]

[Figure 5: Recall-precision curves for setup 3 (1-best WER = 38.13%).]

[Figure 6: Recall-precision curves for setup 4 (1-best WER = 37.34%).]
4. RELATED WORK

Various subword-based methods have previously been used for retrieval. They usually consist of extracting sequences of phonemes from phoneme recognizer transcripts, as in [13]. Our morph-based method is different, however, as it utilizes variable-sized subword units in the language model, enabling more accurate recognition of inflectional languages. These morpheme-like units also work well as index terms. More similar to our work, Logan et al. [10] utilize syllable-like units called particles and report improvements in retrieval of English broadcast news with a high OOV ratio when combined with a word-based system. Like us, they use a data driven algorithm to find the subword units.

In our previous work [9], we presented a method for extracting alternative recognition candidates for Finnish SDR. The method is based on examining the hypothesis stack of the decoder during recognition and picking the most likely terms before they are pruned. The terms are then added to the index, unweighted. In comparison, confusion networks offer a much more flexible framework for term extraction and make it possible to estimate proper values for term frequencies.

Lattices have been used for improving the performance of word-based retrieval. Siegler [17] investigates methods for extracting relevant information from word lattices and n-best lists. Similarly to the confusion network method, mutually competing terms are located in the lattice and their probabilities and ranks are used for indexing, showing improvements in retrieval performance. Saraclar and Sproat [16] use phoneme and word lattices to improve word spotting accuracy in English speech. Their method uses lattice arc probabilities to derive a confidence measure for the terms in the lattice. However, as the terms in the lattice are not aligned, measures based on the ranking of competing terms cannot be used.

The approach most similar to this work is that of Mamou et al. [11]. They use information provided by word confusion networks to improve the performance of SDR from call-center conversations in English. Compared to our work, the biggest difference is that instead of words, we use morphs for indexing, which makes our approach better suited to retrieval of inflectional languages like Finnish. Also, we had human relevance judgments available for the speech documents in the corpus, whereas they compared the results against the search results from the reference manual transcripts, which might introduce bias. We also changed the method for estimating the idf, as the estimation used in [11] did not work well for our database. Their work also provides a good analysis of the effect of WER on retrieval, showing that confusion networks can improve the performance especially at high error levels. We also produced recognizer transcripts with different error levels. Our analysis is different in nature, however, as the recognizer pruning parameters have a direct effect on the size of the lattice and thus limit the best possible improvement that the expansion can offer.
5. CONCLUSIONS

In this work, we have successfully used confusion networks of morpheme-like units to improve the performance of Finnish spoken document retrieval. Confusion networks offer a convenient representation of alternative recognition candidates. Both the posterior probability and the rank of the locally competing terms can be used to weight the index terms. In our experiments, discarding the probability and using only the rank to estimate the term frequency offered the best results. However, further research is needed to find the optimal way to use the information provided by the confusion networks.

Significant improvements were obtained in mean average precision and precision at 15 documents. Precision at 5 documents was not improved, but it was not decreased either. This shows that the estimation scheme used helps to retrieve more relevant documents, while the possibly erroneous terms that are added to the index are downweighted enough so that they do not hurt the results.

Experiments were also carried out with different recognition pruning parameters. They showed, among other things, that the increase in 1-best WER, which happens when pruning is increased, can be compensated by using the confusion networks at the indexing phase. This helps the indexing of large databases, where fast recognition speed is essential.

Future work in using confusion networks for morph-based retrieval is still needed. That includes verifying the results with bigger databases and other languages. Previously, it has been noted that morph-based retrieval works best when combining both morphs and lemmatized words [8]. Thus, extracting word-level information from confusion networks for morphological analysis could possibly be used to improve the results. Also, other methods for estimating tf-idf values from posterior probabilities and ranks, as well as retrieval models other than the vector space model, should be researched. Further improvements could be obtained by combining the confusion network approach with other methods such as query expansion and latent semantic indexing.

6. ACKNOWLEDGMENTS

We thank Inger Ekman and the Department of Information Studies at the University of Tampere for the SDR evaluation data. We are also grateful to the rest of the speech recognition team for developing the speech recognizer and the morpheme discovery. We also thank the ComMIT graduate school in Computational Methods of Information Technology for funding. The work was supported by the Academy of Finland in the project New adaptive and learning methods in speech recognition. This publication reflects only the authors' views.

7. REFERENCES

[1] M. Creutz. Induction of the Morphology of Natural Language: Unsupervised Morpheme Segmentation with Application to Automatic Speech Recognition. Doctoral thesis, Helsinki University of Technology, 2006.
[2] I. Ekman. Suomenkielinen puhehaku (Finnish spoken document retrieval). Master's thesis, University of Tampere, Finland, 2003. (In Finnish.)
[3] J. S. Garofolo, C. G. P. Auzanne, and E. M. Voorhees. The TREC spoken document retrieval track: A success story. In Proceedings of the Ninth Text REtrieval Conference (TREC-9). National Institute of Standards and Technology (NIST), 2000.
[4] T. J. Hazen and I. Bazzi. A comparison and combination of methods for OOV word detection and word confidence scoring.
In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Salt Lake City, Utah, USA, 2001.
[5] T. Hirsimäki, M. Creutz, V. Siivola, M. Kurimo, S. Virpioja, and J. Pylkkönen. Unlimited vocabulary speech recognition with morph language models applied to Finnish. Computer Speech and Language, 2006.
[6] D. A. Hull. Using statistical testing in the evaluation of retrieval experiments. In SIGIR '93: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 329-338, New York, NY, USA, 1993. ACM Press.
[7] S. Johnson, P. Jourlin, G. Moore, K. Spärck Jones, and P. C. Woodland. The Cambridge University spoken document retrieval system. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing '99, volume 1, pages 49-52, Phoenix, AZ, 1999.
[8] M. Kurimo and V. Turunen. An evaluation of a spoken document retrieval baseline system in Finnish. In Proceedings of the International Conference on Spoken Language Processing (ICSLP 2004), Jeju Island, Korea, October 2004.
[9] M. Kurimo and V. Turunen. To recover from speech recognition errors in spoken document retrieval. In Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech 2005-Eurospeech), pages 605-608, Lisbon, Portugal, September 2005.
[10] B. Logan, P. Moreno, and O. Deshmukh. Word and sub-word indexing approaches for reducing the effects of OOV queries on spoken audio. In Proceedings of the HLT-2002 Human Language Technology Conference, 2002.
[11] J. Mamou, D. Carmel, and R. Hoory. Spoken document retrieval from call-center conversations. In SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 51-58, New York, NY, USA, 2006. ACM Press.
[12] L. Mangu, E. Brill, and A. Stolcke. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Computer Speech and Language, 14:373-400, 2000.
[13] K. Ng. Subword-based Approaches for Spoken Document Retrieval. PhD thesis, Massachusetts Institute of Technology, 2000.
[14] J. Pylkkönen. New pruning criteria for efficient decoding. In Proceedings of the 9th European Conference on Speech Communication and Technology, pages 581-584, Lisbon, Portugal, September 2005.
[15] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613-620, 1975.
[16] M. Saraclar and R. Sproat. Lattice-based search for spoken utterance retrieval. In HLT-NAACL 2004: Main Proceedings, pages 129-136, Boston, Massachusetts, USA, 2004.
[17] M. A. Siegler. Integration of Continuous Speech Recognition and Information Retrieval for Mutually Optimal Performance. PhD thesis, Carnegie Mellon University, 1999.
[18] A. Stolcke. SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, pages 901-904, 2002.
[19] V. T. Turunen and M. Kurimo. Using latent semantic indexing for morph-based spoken document retrieval. In Proceedings of the 9th International Conference on Spoken Language Processing (Interspeech 2006-ICSLP), pages 341-344, Pittsburgh, PA, USA, September 2006.
[20] P. C. Woodland, S. E. Johnson, P. Jourlin, and K. Spärck Jones. Effects of out of vocabulary words in spoken document retrieval.
In SIGIR '00: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 372-374, 2000.