Improving Text Classification for Oral History Archives with Temporal Domain Knowledge

J. Scott Olsson, Appl. Math. and Sci. Comp./UMIACS, University of Maryland, College Park, Maryland, olsson@math.umd.edu
Douglas W. Oard, College of Information Studies/UMIACS, University of Maryland, College Park, Maryland, oard@glue.umd.edu

ABSTRACT
This paper describes two new techniques for increasing the accuracy of topic label assignment to conversational speech from oral history interviews using supervised machine learning in conjunction with automatic speech recognition. The first, time-shifted classification, leverages local sequence information from the order in which the story is told. The second, temporal label weighting, takes the complementary perspective by using the position within an interview to bias label assignment probabilities. These methods, when used in combination, yield between 6% and 15% relative improvements in classification accuracy using a clipped R-precision measure that models the utility of label sets as segment summaries in interactive speech retrieval applications.

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: Miscellaneous
General Terms: Algorithms, Measurement
Keywords: spoken document classification, automatic topic classification, classifying with domain knowledge

1. INTRODUCTION
Interactive information retrieval systems rely heavily on the user's ability to pose good queries and to recognize relevant content. Collections of conversational speech pose unique challenges for both tasks. How is the user to know which words might be correctly indexed without understanding both the way in which individuals spoke and the limitations of speech processing components? And how can we compactly summarize spoken content in ways that permit users to select useful results from large result sets? Modern Web search engines use term sequences for both purposes, accepting query terms that will be matched with terms found in the documents, and displaying document snippets containing occurrences of the query terms. That approach does not transfer well to conversational speech (e.g., recorded meetings, telephone calls, or interviews) because the best available automatic transcription yields substantial error rates. State of the art Automatic Speech Recognition (ASR) systems achieve word error rates between 15% and 50% on conversational speech [4], with that wide variation resulting from differences in the degree to which the system has been tuned (often at significant expense) to the characteristics of a particular collection. In this paper, we experiment with a 25% word error rate transcription, the best that is presently available for any collection of oral history interviews. Even so, at that error rate, many of the most selective query terms are often misrecognized, and few of the most informative snippets would be completely correct.

When a suitable thesaurus and suitable training data are available, using automatic transcription as a basis for topic classification offers a potentially useful interaction paradigm. Automatically assigned thesaurus terms can be displayed as a "bulleted list" content summary, and iterative query refinement can be done by incorporating thesaurus terms that have been seen to describe useful content. Because topic classification algorithms that leverage broad patterns of term co-occurrence are available, this approach can yield more robust summaries that are less sensitive than snippets would be to variations in the word error rate. Word error rates in large speech collections typically vary systematically by speaker, so this might also help to minimize the natural bias that has been observed from term-based systems in favor of the clearest speakers [14]. On the other hand, implementing thesaurus-based search alone can make formulation of an initial query challenging for untrained users, and search topics that were not anticipated when the thesaurus was created can be particularly difficult to express. The natural approach is therefore to use free text and thesaurus-based techniques together. These considerations naturally raise the technical question of how accurately it is possible to assign thesaurus terms to spoken content.

That is not a question that is easily answered in the abstract, so in this paper we adopt the specific context of assigning thesaurus terms to manually partitioned segments from English oral history interviews based on a one-best ASR transcript. That formulation reveals two salient characteristics of a topic classification problem that are common to many types of sequentially-told stories (e.g., television programs, or the evolution of news reporting over time): (1) the order in which the story is told provides potentially useful evidence, and (2) different aspects of a story evolve over different time scales as it is told. As a simple example, we expect to find a review of prior work early in this paper, experiment results towards the end, and a consistent topical coverage throughout. In this paper we explore how those effects can be leveraged to improve classification accuracy in the context of a richly annotated oral history collection.

The remainder of this paper is organized as follows. Section 2 briefly reviews related work on topic classification for spoken content. Section 3 then describes the test collection, training data and evaluation measures that we used in our experiments. Sections 4, 5 and 6 present algorithms for our baseline kNN classifier, an enhancement using time-shifted classification, and an enhancement using temporal label weighting. Section 7 describes our approach to evidence combination, and Section 8 presents the results of our experiments that show improvements in classification accuracy of between 8% and 15% for leaf nodes in the thesaurus, and improvements of between 6% and 13% for interior nodes. Section 9 concludes the paper with some observations on the broader utility of these techniques beyond the collection that we used in our experiments.
2. RELATED WORK
The BBN OnTAP system appears to have been the first to use automatically assigned topic labels to describe the content of speech in an interactive information retrieval system [10]. In their approach, topic labels are presented vertically aligned with the salient sections of the transcript during full-text display so that both can be scrolled together (along with a third vertical region depicting speaker identity). Byrne et al. presented classification results on parts of the same collection used in this study, using the results of an early ASR system with a higher word error rate [1]. Olsson et al., also using parts of the same collection, later reported classification results where the training examples were taken from a second language [12]. Both showed that kNN was a reasonable approach, given that the problem is multi-label with many topic classes. Iurgel et al. reported classification results on spoken content using combinations of binary support vector machines, although their task contained many fewer classes [8].

A great deal of research has looked at incorporating domain knowledge to improve classification effectiveness for text documents. In [7], domain knowledge from topical hierarchies is used to enrich the document representation for search. Other work has focused largely on compensating for a shortage of available training data [2, 9, 15, 18], sometimes requiring significant modification to the learning algorithm (e.g., [18] developed a modified support vector machine classifier). Our work differs in the type of domain knowledge considered (temporal evidence as opposed to expert knowledge), in that we do not specifically consider the limited training data problem, and in that our application focus is on supporting search in speech collections. Our work also differs from [5] (which exploited temporal evidence for classification), in that we do not adapt to evidence from previously seen stories, but rather to evidence from within the same story (within the same interview).

3. EVALUATION FRAMEWORK
Exploring these questions requires a speech collection, ASR results, a thesaurus, and examples of how recognized words are used with different thesaurus labels. Fortunately, such a collection now exists. In 2005 and again in 2006, the Cross-Language Evaluation Forum (CLEF) Cross-Language Speech Retrieval (CL-SR) track distributed a collection of English oral history interviews with 246 Holocaust survivors, rescuers and witnesses with one-best ASR results, a rich thesaurus, and ground truth mappings between the ASR results and the thesaurus labels. We use those ground truth mappings as the answer key for evaluating classification accuracy, so at the end of this section we describe how those mappings were created and introduce the clipped R-precision measure that we use to characterize classification accuracy. In between, we describe the disjoint training set of mappings between text and thesaurus labels that we used to train our classifier.

3.1 Evaluation Set
The interviews from which the CLEF CL-SR collection was created were conducted by the Survivors of the Shoah Visual History Foundation (VHF; its successor is the USC Shoah Foundation Institute for Visual History and Education, or "VHI") late in the twentieth century and recorded on videotape. Each interview was structured by the interviewer to proceed in roughly chronological order through the interviewee's life experiences, with the first 20% or so typically addressing experiences before the Second World War, the middle 60% typically addressing experiences during the war, and the final 20% typically addressing experiences after the war. Most interviews are in the form of an extended narrative with occasional steering comments from the interviewer, but more structured question-answer formats were also sometimes used. At the end of the interview, interviewees would often hold up artifacts (e.g., photographs) for the camera to record and say a few words about them.

An initial thesaurus for indexing these materials was developed by VHF based on scholarly analysis of events during the time frame the interviewees described. Professional indexers, generally with academic training in disciplines related to the content of the collection, then manually segmented each interview into topically coherent passages that were recorded in a database as a standoff annotation to the spoken content, which at that point was still recorded on videotape. Each segment was then described by the indexer by associating several thesaurus labels with a segment. Operationally, it is useful to think of the segmentation process as having been guided in some way by the thesaurus: when a set of assigned thesaurus terms no longer described what was being discussed, insertion of a segment boundary would be appropriate. When indexers encountered concepts that were not yet present in the thesaurus, they nominated new thesaurus labels for consideration by the thesaurus maintenance team (a thesaurus extension process generally known as "literary warrant"). The resulting thesaurus thus covers the topical scope of the collection quite well.

The thesaurus itself consists of two hierarchies, one a set of part-whole relations (the "term" hierarchy) and one a set of is-a relations (the "type" hierarchy). Figures 1 and 2 show some illustrative examples. These figures illustrate a distinction that we will make throughout this paper, with Figure 1 drawn from the branch of the thesaurus in which geography and time periods appear (what we call the "geography" part) and Figure 2 drawn from the remainder of the thesaurus (which we generically refer to as the "concept" part).

[Figure 1: An example from the geography part of the CLEF CL-SR topic thesaurus (Antarctica, Antarctica (1945-2000), Continents by time). Solid lines denote part-whole ("term") relations, dashed lines denote is-a ("type") relations.]

[Figure 2: An example from the concept part of the CLEF CL-SR thesaurus (Military, Spanish Soldiers, Turkish Soldiers, Soldiers). Solid lines denote part-whole ("term") relations, dashed lines denote is-a ("type") relations.]

In parallel with the indexing process, the original videotapes were digitized by VHF and then automatically transcribed by IBM using an ASR system trained on 200 hours of manually transcribed speech from 800 held out interviewees (i.e., interviewees who do not appear in the test collection that we used) [1]. The reported mean word error rate for the one-best transcriptions that were provided by IBM is 25% for most speakers, although for logistical reasons transcriptions from an older system with a mean word error rate of 35% were used in a few cases (e.g., when glitches in the newer system that was still under development resulted in no output). The standoff annotations recorded in the database were used to automatically partition the resulting transcripts into disjoint segments (with some small automated adjustments to avoid splitting transcribed words and to align segment boundaries with pauses where possible). The resulting segments were then associated with the unique identifiers for each thesaurus term that had been manually assigned by the indexer to that segment, and the result was stored as an XML data structure that was distributed by the Evaluation and Language Resources Distribution Agency (ELDA) to participants in the CLEF 2006 CL-SR collection, version 4.0.

The CLEF-2006 CL-SR test collection was originally intended for evaluation of ranked retrieval, and thus it contains many components (e.g., topics and relevance judgments) that we have not described here. A complete description of that collection can be found in [11]. One preprocessing step used in creating that collection affects the experiments that we report on in this paper, however. When the VHF indexers segmented the collection, they typically created one short segment for each artifact that was displayed at the conclusion of an interview. This resulted in a proliferation of very short segments, each with relatively few ASR-generated words. We elected to automatically remove all very short segments from the collection because judging topical relevance for such sections without seeing the video was often impractical. As a result, those very short segments were not used in our experiments. The remaining 8,104 segments have a unimodal segment length distribution with a median of 4 minutes (about 500 words).

3.2 Training Set
The traditional structure of a topic classification problem can be formulated as: given the words produced for that segment by ASR, find the set of thesaurus labels that a human indexer would have assigned. In this paper, we adopt a more general formulation: given a sequence of segments, each with ASR-generated words, find the corresponding sequence of thesaurus label sets. In order to train a classifier, we need training data in which such associations are known. As it happens, an additional set of segments, each with sets of topic labels assigned by the same indexers using the same process, is available. These segments are not distributed with the CLEF CL-SR collection, so we obtained them on a research license from VHI for use in training our system. There were initially over 186,000 segments in this collection, but after deletion of short segments near the end of an interview 168,584 training segments remained. One important limitation of our training collection is that no ASR results are available for the words spoken in those segments. Instead, VHI provided us with three-sentence summaries written by the indexers for each segment that describe "who, what, when, where" in a fairly structured and stylized way. We therefore trained our classifiers by acting as if these summaries were representative of the words that would have been generated by ASR for those segments.
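To make the train/test asymmetry concrete, the following is a minimal sketch of one possible way to represent segments for the experiments described above. The container type, field names, and example values are our own illustrative assumptions; the paper does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical container for a segment; all field names are assumptions for illustration.
@dataclass
class Segment:
    interview_id: str
    index_in_interview: int    # position of the segment within its interview
    start_fraction: float      # segment start time as a fraction of total interview time
    text: str                  # training: indexer-written summary; test: one-best ASR transcript
    labels: List[str] = field(default_factory=list)  # thesaurus label identifiers (empty at test time)

# Training segments carry indexer summaries and gold labels;
# test segments carry ASR text, and their labels are what must be predicted.
train_seg = Segment("train-0001", 3, 0.12, "The interviewee describes ...", ["label-A", "label-B"])
test_seg = Segment("test-0042", 7, 0.85, "and then we were taken to ...", [])
```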
3.3 Evaluation Measure
In a content description task, we want to show the user only a small number of the best predicted labels. Supposing we could show a user N labels, we might choose as our evaluation measure precision at a cutoff of N. Unfortunately, this would unfairly penalize segments with only a few (say 3) correct labels placed at the top 3 ranks (giving a precision of 3/N). Alternatively, we might choose a rank based measure such as R-precision (the precision at cutoff R, where R is the number of correct labels for a segment), but this may factor in label hypotheses which can never benefit the user (i.e., if R > N). As a solution to these problems, we take as our measure the clipped R-precision. Clipped R-precision is defined as the precision at cutoff M, where

    M = \begin{cases} R, & R \le N \\ N, & R > N \end{cases}    (1)

[Figure 3: Computing clipped R-precision for concept and geography label hypotheses on three segments, A, B, C. Dashed circles indicate the label falls below the clipping level M for the segment.]

Consider Figure 3. Three segments, A, B, C, have ranked lists of both concept and geography labels. We would like to show the user 6 concept and 4 geography labels. (It happens that the median number of true concept and geography labels on segments is 3 and 2 respectively. We therefore simulate showing the user twice as many of each label type, 6 and 4, which gives a total number of 10 labels for presentation. The average thesaurus label contains four words, so these should easily fit on four display lines.) First, consider concepts (N = 6). Segments A and B have R > 6, so their clipped R-precisions are 2/6 and 1/6 respectively. Segment C has R = 3, R < N, so M = 3 and its clipped R-precision is 1/3. The calculation is the same for geography labels, now with N = 4. For segments A and B, R > N, so M = N and each has a clipped R-precision of 1/4. For segment C, R = 2, so the score is 1/2. Lastly, we average over segments, so the clipped R-precision on concepts in this example will be (2/6 + 1/6 + 1/3)/3 = 5/18. For geography, we have 1/3.

Note that this evaluation measure is very severe: we give credit to our system only when the indexer assigned exactly the same content, no credit for being close enough that a savvy user could make sense of it, and no credit for being a perfectly fine assignment (i.e., one that is useful for the purpose of description) that the indexer just did not happen to make (e.g., perhaps because of strictly standardized rules of interpretation). Cumulatively, these effects may be significant because (1) there are very many labels and the segments may have multiple topics assigned (as opposed to a single-label assignment problem in which we would not expect indexers to forget to assign the one appropriate label) and (2) the thesaurus terms often have greater specificity than a user might desire. For example, in Figure 1 we see that Antarctica (1945-2000) is a different topic than Antarctica. Accordingly, the absolute value of our measure should be interpreted generously when trying to imagine the utility of the labels to the user of an interactive information retrieval system.

4. BASELINE CLASSIFIER
Our baseline is a k-Nearest Neighbors (kNN) classifier using a symmetrized variant of Okapi term weighting [6, 13],

    w(tf, dl) = \frac{tf}{0.5 + 1.5\,(dl/avdl) + tf},

where w(tf, dl) is the computed term weight, tf is the term frequency, dl is the length of the document in which the term occurs, and avdl is the average document length. It is symmetric in the sense that both testing and training vectors use the same weighting scheme. During classification, term weights are multiplied by their inverse document frequency,

    idf(df) = \log\left(\frac{D - df + 0.5}{df + 0.5}\right),

where D is the total number of segments in training. For convenience, we represent this idf weighting as a matrix-vector product between A (a square matrix with the idf weights on the diagonal) and a document vector. For a test document with vector w_T, we first find the k nearest training vectors (neighbors) w_i, i = 1, 2, ..., k in the document space, where our distance measure is the inner product, (A w_T)^T w_i. The score for class c on test document vector w_T is then computed as the sum of inner products between A w_T and w_j for j \in K_c, where K_c = \{j \mid \text{neighbor } w_j \text{ has label } c\}. That is,

    score(w_T, c) = \sum_{j \in K_c} (A w_T)^T w_j.

For all experiments, we fixed the neighborhood size at k = 100, which was found to be roughly optimal for our baseline system.
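As a concrete sketch of the baseline scoring just described, the code below implements the symmetrized Okapi weight, the idf weights (the diagonal of A, applied as an elementwise product), and the per-label kNN scores. The function names, the dense NumPy data layout, and the dictionary output are our own simplifications for illustration, not an implementation detail taken from the paper.

```python
import numpy as np

def okapi_weight(tf, dl, avdl):
    """Symmetrized Okapi term weight: tf / (0.5 + 1.5*(dl/avdl) + tf)."""
    return tf / (0.5 + 1.5 * (dl / avdl) + tf)

def idf_vector(df, n_train_docs):
    """idf(df) = log((D - df + 0.5) / (df + 0.5)), one value per vocabulary term."""
    return np.log((n_train_docs - df + 0.5) / (df + 0.5))

def knn_label_scores(test_vec, train_vecs, train_labels, idf, k=100):
    """Score labels for one test segment with the kNN rule of Section 4.

    test_vec     : (V,) Okapi-weighted term vector for the test segment
    train_vecs   : (D, V) Okapi-weighted term vectors for training segments
    train_labels : list of label sets, one per training segment
    idf          : (V,) idf weights (the diagonal of the matrix A)
    """
    weighted = idf * test_vec                 # A w_T, with the diagonal matrix as an elementwise product
    sims = train_vecs @ weighted              # inner products (A w_T)^T w_i with every training vector
    neighbors = np.argsort(-sims)[:k]         # indices of the k nearest training segments
    scores = {}
    for j in neighbors:                       # sum similarities per label over the neighborhood
        for label in train_labels[j]:
            scores[label] = scores.get(label, 0.0) + sims[j]
    return scores
```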
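Returning to the evaluation measure of Section 3.3, a minimal sketch of clipped R-precision, assuming a ranked list of predicted labels and a gold label set per segment:

```python
def clipped_r_precision(ranked_labels, gold_labels, n_shown):
    """Precision at cutoff M, where M = min(R, N), R = |gold labels|, N = labels shown."""
    m = min(len(gold_labels), n_shown)
    if m == 0:
        return 0.0
    hits = sum(1 for label in ranked_labels[:m] if label in gold_labels)
    return hits / m

# Reported scores average this over segments, e.g. for concept labels with N = 6:
#   mean(clipped_r_precision(predictions[s], gold[s], 6) for each segment s)
```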
5. TIME SHIFTED CLASSIFICATION
One new source of information in oral history data is the set of features associated with temporally adjacent segments. Features, here terms, may be class-predictive not only for their own segment, but for subsequent segments as well. This is an example of local temporal evidence. This intuition may be easily captured by a time-shifted classification (TSC) scheme. In TSC, each training segment is labeled with the subsequent segment's labels. During classification, each test segment is used to assign labels to its subsequent segment. This is illustrated in Figure 4. Because the last segment in each interview has no associated time-shifted labels, those segments are discarded in TSC training. Likewise, the first segment from each test interview has no preceding segment which may predict its labels, and so falls back to using only the non-shifted label hypotheses. Note, this approach may easily be extended to predict labels on segments temporally farther away.

[Figure 4: A schematic view of the TSC training setup. Segments are assigned labels from their temporally adjacent segment. Likewise, the classifier predicts labels for temporally adjacent (subsequent) segments.]

Time shifted classification produces scores for labels on segments, just as traditional non-shifted classification. Naturally, we would like to combine these scores with those from the original, non-shifted classification problem. We used a simple linear combination of the scores for a class c and document d,

    S_{TSC.comb}(c, d) = \lambda\, S_{orig}(c, d) + (1 - \lambda)\, S_{TSC}(c, d),

where S_orig and S_TSC are the original and TSC scores respectively. We evaluated this combination approach on a set of 4,000 segments. For each setting of λ, we computed the clipped R-precision and then took 500 bootstrap resamplings of size 4,000. The mean and confidence intervals of the clipped R-precision are shown at each of several λ settings in Figure 5. Geography and concept labels are plotted separately. We observe that the optimal settings of λ occur at different positions for geography and concept labels. For the best setting on concepts, the time-shifted scores are only barely considered (i.e., λ is around 0.9), while for geography, they are strongly considered (i.e., λ is roughly 0.6). This conforms to our expectations, in that interviews were segmented by change in topic, while successive topics may naturally occur without a change in geography. On both label sets, we see the clipped R-precision varies smoothly with respect to λ.

[Figure 5: Clipped R-precision vs. mixing parameter λ for combining original and TSC classification scores. White boxes show results for geography labels, gray boxes show concept labels. Note, this is only a preliminary analysis to gauge the smoothness of the combination method.]
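A minimal sketch of the time-shifted relabeling and the linear score combination, assuming per-label score dictionaries like those returned by the kNN sketch above; the data layout and helper names are our own assumptions, not an interface from the paper.

```python
def make_tsc_training_data(interviews):
    """Relabel each training segment with the *next* segment's labels (the TSC setup).

    interviews: dict mapping interview_id -> list of (text, labels) in temporal order.
    The last segment of each interview has no subsequent labels and is dropped.
    """
    shifted = []
    for segments in interviews.values():
        for (text, _), (_, next_labels) in zip(segments, segments[1:]):
            shifted.append((text, next_labels))
    return shifted

def combine_tsc_scores(orig_scores, tsc_scores, lam):
    """S_TSC.comb(c, d) = lam * S_orig(c, d) + (1 - lam) * S_TSC(c, d), per label."""
    labels = set(orig_scores) | set(tsc_scores)
    return {c: lam * orig_scores.get(c, 0.0) + (1 - lam) * tsc_scores.get(c, 0.0)
            for c in labels}

# At test time the TSC classifier scores segment i using the text of segment i-1;
# the first segment of each interview falls back to the non-shifted hypotheses alone.
```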
6. TEMPORAL LABEL WEIGHTING
We can also benefit from non-local temporal information about a segment. For example, because interviewees were prompted to relate their story in chronological order, we would be less surprised to find a discussion of childhood at an interview's beginning than at its end. This chronological ordering is observed in Figure 6, which shows the years noted in the speech recognition transcripts plotted against segment time for three different interviews. The noted years ramp upwards quickly as the interviewees summarize their childhood, then progress slowly through their adult years, and finally jump about somewhat erratically as artifacts from throughout their life are introduced.

[Figure 6: Years spoken in automatic speech recognition transcripts versus the corresponding segment time (as a fraction of total interview time) for three interviews.]

Because of this structure, topics may be more likely to occur at some times than others. For example, discussions of war crime trials are considerably more likely to occur at the end of an interview than at the beginning (simply because war crime trials tend to occur after a war). We can exploit this intuition by weighting our label predictions by p(c, t), the probability of label c occurring during the interval of interview time t. We call this approach temporal label weighting (TLW). These label weights, p(c, t), may be estimated using smoothed kernel density estimators on held out data. Figure 7 shows some example time density estimates.

[Figure 7: Time density estimates for three commonly occurring labels (war crime trials, pre-war political activities, Berlin). The top and bottom rugs show where label examples occur, for war crime trials and Berlin, respectively.]

Kernel density estimators are non-parametric estimators for probability density functions, similar in purpose to histograms, except that they are smooth and do not require that a bin width be chosen. The intuition is that observations about a point x should contribute to the density, more so if they are nearby, less so if they are far away. This notion of distance is encoded in a kernel K, so that the density at a point x is estimated as

    \hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{b}\right),

for observations x_i, i = 1, 2, ..., n, where the bandwidth b parameterizes the width of the kernel (specifically, in this case, the bandwidth is the kernel's standard deviation). An applications-oriented introduction to kernel density estimators may be found in [17]. Various kernels may be used, although they are normally chosen to be smooth, unimodal, to peak at 0, and to be a probability density function, i.e., \int K(u)\,du = 1. We produce our time density estimates using a Gaussian kernel density estimator

    K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\tfrac{1}{2} u^2\right),

where the bandwidth is chosen such that (1) the distribution is unimodal for classes with few examples and (2) the distribution may have multiple modes when they are strongly supported by available examples. Our default bandwidth is computed according to Silverman's "rule of thumb" (the default in the R statistics package) [16].
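A small sketch of this kind of density estimate, using a Gaussian kernel over the fractional start times at which a label was observed in held-out data. Silverman's rule-of-thumb bandwidth is written out explicitly rather than relying on any particular statistics package, and the example start times are made-up values for illustration.

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule of thumb for a 1-D Gaussian KDE:
    b = 0.9 * min(sd, IQR/1.34) * n^(-1/5), with a small floor for degenerate data."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spread = min(x.std(ddof=1), (np.percentile(x, 75) - np.percentile(x, 25)) / 1.34)
    return max(0.9 * spread * n ** (-0.2), 1e-3)

def gaussian_kde(samples, bandwidth):
    """Return f_hat with f_hat(x) = (1/n) * sum_i K((x - x_i)/b), Gaussian K, mirroring the formula above."""
    samples = np.asarray(samples, dtype=float)
    def f_hat(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        u = (x[None, :] - samples[:, None]) / bandwidth
        k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
        vals = k.mean(axis=0)
        return vals if vals.size > 1 else float(vals[0])
    return f_hat

# Example: a density over interview time (0..1) for one label, from hypothetical held-out start times.
start_times = [0.62, 0.71, 0.74, 0.80, 0.91]
density = gaussian_kde(start_times, silverman_bandwidth(start_times))
curve = density(np.linspace(0.0, 1.0, 101))
```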
In practice, for classes with fewer than 100 examples, we iteratively increase this default smoothing bandwidth until the density function's derivative has no more than one zero crossing (i.e., the function has one maximum). This is illustrated in Figure 8 for an artificial label with only two training examples. With our default bandwidth, the density function is bimodal (Figure 8a), which cannot be strongly supported with so few examples. In Figure 8b, the bandwidth is increased slightly, and then again in Figure 8c. We terminate at this final bandwidth, which provides a unimodal density estimate.

[Figure 8: Three choices of bandwidth for smoothing a Gaussian kernel density estimate with very few examples (here 2). In (a), the density is bimodal with the default choice of bandwidth. At (c), a bandwidth is chosen providing a unimodal density function. Two tick marks on the bottom edge of the graph show the position of the training examples in time.]

Note that our weighting function is a density, so that it approximately integrates to one. This is true, of course, regardless of the number of label examples. This is made clear in Figure 7, where war crime trials has a greater mode than Berlin, despite Berlin having many more examples (as seen on Berlin's rug, the tick marks on the bottom edge showing the observations' positions). This is reasonable because the preponderance of a label's examples (i.e., its prior probability) is already modeled implicitly by kNN.

Now, to estimate p(c, t), we ought to integrate our estimated density function over the temporal extent of the test document. Because the segment durations have fairly low variance, however, we approximate our weighting, p(c, t), by the estimated density function for class c at the start time of interval t. This approximation will be at least roughly proportional to the integrated probability mass (to see this, imagine approximating the integral over a small region by drawing a box under the density function) and has the advantage of not requiring runtime numerical integration, provided the density function is fairly flat. On the other hand, this approximation will be bad where the first derivative of the density function is large. To mitigate this effect, we dampen the values logarithmically before applying the weights to our baseline classification scores. This gives our combination formula

    S_{TLW.comb}(c, d) = S_{orig}(c, d) \times \log(1 + p(c, t)),

where c is the class, d is the document, and p(c, t) is the temporal label weight for label c at the start time t. We use \log(1 + p(c, t)) because (1) it is positive for p(c, t) \in (0, \infty) and (2) for small p(c, t), \log(1 + p(c, t)) \approx p(c, t).
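The bandwidth-widening step and the final weighting can be sketched as follows, reusing the silverman_bandwidth and gaussian_kde helpers from the sketch above. The widening factor, the evaluation grid, and the uniform fallback for labels with no estimated density are our own assumptions, not values from the paper.

```python
import numpy as np

def unimodal_bandwidth(samples, grid=None, widen=1.1):
    """Widen the Silverman bandwidth until the estimated density has a single maximum,
    i.e. its numerical derivative changes sign at most once (used here for labels with
    few examples). Widening factor and grid resolution are illustrative choices."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 201)
    b = silverman_bandwidth(samples)
    while True:
        density = gaussian_kde(samples, b)(grid)
        derivative_signs = np.sign(np.diff(density))
        zero_crossings = np.sum(np.diff(derivative_signs) != 0)
        if zero_crossings <= 1:
            return b
        b *= widen

def temporal_label_weighting(baseline_scores, label_densities, start_time):
    """S_TLW.comb(c, d) = S_orig(c, d) * log(1 + p(c, t)), with p(c, t) read off the
    label's density at the segment's start time. Labels with no estimated density
    fall back to a uniform density of 1.0 over the interview (an assumption)."""
    weighted = {}
    for label, score in baseline_scores.items():
        p = label_densities[label](start_time) if label in label_densities else 1.0
        weighted[label] = score * np.log1p(p)
    return weighted
```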
7. COMBINING EVIDENCE
We may also combine the local evidence provided by TSC with the less localized evidence provided by TLW. Again, we use a simple linear combination of scores,

    S_{final}(c, d) = \gamma\, S_{TSC.comb}(c, d) + (1 - \gamma)\, S_{TLW.comb}(c, d).

As before, we evaluated this combination approach on a set of 4,000 segments. Figure 9 shows the parameter sweep. For each setting of γ, we computed the clipped R-precision and then took 500 bootstrap resamplings of size 4,000. The combination parameter λ (used to produce the TSC results which are here combined with the TLW results) was taken from the similar analysis shown in Figure 5. The mean and confidence intervals of the clipped R-precision are shown at each of several γ settings in Figure 9.

[Figure 9: Clipped R-precision vs. mixing parameter γ for combining TLW and TSC classification output. White boxes show results for geography labels, gray boxes show concept labels. Note, this is a preliminary analysis looking for smoothness. We shouldn't, for example, conclude that TSC scores are not used on concepts (we will see that they are).]

Again, we observe that the optimal settings of γ occur at different positions for geography and concept labels. On both label sets, we see the clipped R-precision varies smoothly with respect to γ. In the experimental section, we will determine γ from the held-out portion in our cross-fold validation.

8. EXPERIMENTS
Our training set is a collection of 168,584 segments, as described in Section 3. Each segment in the training collection has one or more manually assigned thesaurus terms, from a set of 13,764 unique thesaurus labels, which in turn are drawn from a larger set of about 40,000 labels in the full thesaurus. The training features are words taken from summaries of each segment written by human indexers. The classification task is to assign thesaurus labels to a set of 8,104 new segments, where features are drawn from automatic speech recognition transcripts of the words spoken in those segments. This data is available as the ASRTEXT2006B field of the CLEF 2006 version 4.0 CL-SR collection. We also know every segment's position in its interview and its temporally adjacent segments.

To facilitate statistical testing and allow our combination parameters to be tuned on fair data, we use K-fold validation (K = 10). Our testing segments are partitioned into K folds and, for each fold, the combination parameters (λ, γ) are chosen to optimize the clipped R-precision on the remaining K - 1 folds. (We emphasize that for the experiments reported in this section we use evidence combination parameters learned through cross-validation, not those learned on the 4,000-segment sets described in the previous sections. This distinction is important because those 4,000-segment sets are a part of the 8,104 set on which we now report results.) We searched for optimal mixing parameters by stepping through λ and γ in increments of 0.01.
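A sketch of that per-fold parameter sweep. The grid and fold handling follow the description above; the scoring callable is a placeholder the caller must supply (for example by rescoring segments with the combination formulas and averaging the clipped_r_precision helper from the earlier sketch), and its name is our own.

```python
import itertools
import numpy as np

def tune_mixing_parameters(folds, avg_clipped_r_precision, step=0.01):
    """For each held-out fold, pick (lambda, gamma) that maximize the averaged
    clipped R-precision on the remaining K-1 folds.

    folds: list of K disjoint lists of segment ids.
    avg_clipped_r_precision(segment_ids, lam, gamma): caller-supplied callable that
        rescores the given segments with
        S_final = gamma * S_TSC.comb(lam) + (1 - gamma) * S_TLW.comb
        and returns the segment-averaged clipped R-precision.
    """
    grid = np.round(np.arange(0.0, 1.0 + step, step), 2)   # 0.00, 0.01, ..., 1.00
    results = []
    for k, held_out in enumerate(folds):
        tuning_ids = [sid for j, fold in enumerate(folds) if j != k for sid in fold]
        best = max(itertools.product(grid, grid),
                   key=lambda lg: avg_clipped_r_precision(tuning_ids, lg[0], lg[1]))
        results.append((k, best, avg_clipped_r_precision(held_out, *best)))
    return results  # (fold index, (lambda, gamma), held-out clipped R-precision)
```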
Table 1 shows the mean and standard deviation for the mixing parameters (averaging over the K folds).

Table 1: Mean values for the mixing parameters λ, γ (averaged over the cross-validation folds) and their standard deviation.

              Geography   Concepts
  mean(λ)        0.59        0.93
  s.d.(λ)        0.05        0.0
  mean(γ)        0.44        0.18
  s.d.(γ)        0.14        0.01

Figure 10 shows the final results from our experiments. For each setting, the averaged clipped R-precision over the K validation folds is shown.

[Figure 10: Clipped R-precision for each setting (Baseline, TSC, TLW, Combination) in each of six groups (Geography leaves, Concept leaves, Geography 1LU in Terms, Concepts 1LU in Terms, Geography 1LU in Types, Concepts 1LU in Types), averaged over the cross-validation folds. Tick marks at the base of a bar indicate that, by a paired t-test with α = 0.01, the bar's clipped R-precision is significantly better than the left-adjacent bar.]

To test for statistically significant improvement, we compare the clipped R-precision across the K validation folds using paired t-tests with α = 0.01. (Our training sets overlap and thus violate an independence assumption, but the probability of Type I error nevertheless tends to be acceptably small [3]. Alternatively, Fisher sign tests comparing clipped R-precision on paired segments in one fold show the same improvements are significant.) The results of this significance testing are shown in Figure 10: bars which have clipped R-precision significantly larger than the bar to their left are marked with a tick at their base. For example, we see that TSC significantly improves upon the baseline for geography labels (at both the leaves and one level up in each of the two thesaurus hierarchies), but not for concepts. Note that each grouping of bars contains at least one tick mark: accordingly, using temporal evidence improves upon our baseline for both label sets, at both levels in both thesaurus hierarchies, with statistical significance. These improvements are tabulated in Table 2.

Table 2: Averaged clipped R-precision for each label set and thesaurus level, for both the baseline and combination approach. The relative improvement (R.I.) using the combined temporal evidence is also shown.

  part      location    baseline   TSC&TLW   R.I. (%)
  geo       leaf        0.2012     0.2322     +15.4
  concept   leaf        0.1896     0.2054     +8.3
  geo       1LU term    0.2182     0.2474     +13.4
  concept   1LU term    0.3116     0.3317     +6.4
  geo       1LU type    0.2618     0.2777     +6.1
  concept   1LU type    0.2175     0.2323     +6.8

As Table 2 shows, moving one level up ("1LU") in the "term" (i.e., part-whole) hierarchy to classify to the first interior node improves the overall accuracy of concept classification, but does little to benefit geography. Conversely, moving one level up in the "type" (i.e., is-a) hierarchy benefits geography classification more than concepts. These improvements are not surprising by themselves: the smaller number of interior nodes simply results in an easier classification problem. In both cases, however, further statistically significant improvements of about 6% are still observed even over the stronger of the two baselines when TSC and/or TLW are applied (and mean values for the combination are never lower than either used alone). This indicates that TSC and TLW, and the combination strategy that we have employed, have utility across a range of thesaurus granularities that might be important in practical applications.

This analysis also tells us something about how far the temporally informed methods are moving class hypotheses in the hierarchy to make correct class assignments. If, for example, the temporal evidence was only able to correct a class assignment having a common parent node with the correct label, we would expect classification improvements to wash away when class hypotheses were pushed up the hierarchy. As this does not occur, it appears the proposed methods are also correcting many "far misses" in the topic thesaurus.
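The significance test used above is a standard paired t-test over the K per-fold clipped R-precision values; a minimal sketch with SciPy, where the fold scores are made-up numbers for illustration only:

```python
from scipy import stats

# Per-fold averaged clipped R-precision for two adjacent bars in Figure 10
# (hypothetical values, one per cross-validation fold; K = 10).
baseline_folds = [0.198, 0.205, 0.201, 0.199, 0.203, 0.200, 0.202, 0.197, 0.204, 0.203]
combined_folds = [0.229, 0.234, 0.230, 0.228, 0.233, 0.231, 0.235, 0.227, 0.236, 0.239]

t_stat, p_value = stats.ttest_rel(combined_folds, baseline_folds)
significant = p_value < 0.01   # alpha = 0.01, as in the experiments above
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant: {significant}")
```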
9. CONCLUSION
The most obvious limitation to the techniques that we have described is the requirement for both a thesaurus (or some other source of appropriate topic labels) and a training set in which those labels have been associated with text in a way that is representative of how the classifier should behave. Of course, that same condition applies to any text classification problem based on supervised machine learning: all that we have really done is remove the document independence assumption by observing that in this collection, classification assignments do indeed depend on both the absolute and the relative position of segments within an interview.

This suggests several directions for future research. The most fundamental, perhaps best thought of as research in digital libraries rather than topic classification, is to identify other applications that exhibit similar properties and for which a suitable topic inventory is available or could affordably be constructed. A second research direction would be to raise our baseline by, for example, automatically transforming the human-written summaries from the training collection into something more like ASR output. This would amount to fundamental research in feature set transformation for topic classification with ASR input, and it seems likely that benefits could accrue from such an approach. Of course, we'd also hope to compare that approach to training on a complete set of ASR transcripts.

A third research direction, and the one most directly inspired by our results, is to explore other ways of leveraging position and sequence dependencies. One obvious approach to try would be a Hidden Markov Model (or some other form of sequence model) in which prior label assignments are used to bias classification decisions. Another approach to try would be to apply a decay function that decreases the contribution of individual words to a category as those words appear further back in time. Considering a more nuanced decomposition of the thesaurus than the geography vs. everything else approach that we tried might also yield additional insights. And, at the most basic level, a range of functions for combining evidence remains to be explored.

For time-shifted classification, features predictive for a segment are likely to be different than those predictive for adjacent segments. This may be especially important when feature selection is used. Consider, for example, that χ² feature selectors [19] are based on testing for term-class independence, and this will surely vary between the traditional and TSC case. Figure 11 shows the most predictive features, according to χ²_max, for both the original and TSC case. In this study, we considered only the all-features case. We expect future work may show additional improvements by incorporating feature selection with TSC.

[Figure 11: Features sorted by χ²_max score on both the original and time-shifted classification problem. TSC features are less informative and have a different feature ordering than the unshifted problem.]

Ultimately, the value of topic classification is revealed in the way the results are actually used, so studying the behavior of searchers presented with a system that incorporates both text-based and topic-based speech searching will be important. Machines further down a processing pipeline can also use topic classification. For example, topic classification can serve as a source of vocabulary with which to augment an index, either by using terms from the topic labels directly, or by using the topics as pivot points in a blind feedback strategy. So extrinsic evaluations in which the utility of topic classification is assessed through its influence on ranked retrieval will also be important.

So, much remains to be done. But we should emphasize here in conclusion what we have shown: that the structure of stories told in the form of oral history interviews can be leveraged to improve topic classification effectiveness. With the substantial investments now being made in ASR for conversational speech, we can reasonably anticipate the creation of new collections for which these techniques should be directly applicable.

Acknowledgments
The authors are grateful to Sam Gustman for first suggesting the idea that thesaurus labels could serve as a useful content summary in this application. This work has been supported in part by NSF IIS award 0122466 (MALACH).

10. REFERENCES
[1] W. Byrne et al. Automatic Recognition of Spontaneous Speech for Access to Multilingual Oral History Archives. IEEE Transactions on Speech and Audio Processing, Special Issue on Spontaneous Speech Processing, 12(4):420-435, July 2004.
[2] A. Dayanik et al. Constructing Informative Prior Distributions from Domain Knowledge in Text Classification. In SIGIR'06.
[3] T. G. Dietterich. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1998.
[4] J. Fiscus et al. The Rich Transcription 2006 Evaluation Overview and Speech-To-Text Results. In 3rd Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms, Recognition Workshop, 2006.
[5] G. Forman. Tackling Concept Drift by Temporal Inductive Transfer. In SIGIR'06.
[6] M. Franz. Unpublished correspondence.
[7] E. Gabrilovich and S. Markovitch. Feature Generation for Text Categorization Using World Knowledge. In IJCAI'05.
[8] U. Iurgel and G. Rigoll. Spoken Document Classification with SVMs using Linguistic Unit Weighting and Probabilistic Couplers. In Proceedings of the 17th International Conference on Pattern Recognition, 2004.
[9] R. Jones et al. Bootstrapping for Text Learning Tasks. In IJCAI'99 Workshop on Text Mining: Foundations, Techniques and Applications.
[10] F. Kubala et al. Integrated Technologies for Indexing Spoken Language. Communications of the ACM, 43(2), 2000.
[11] D. W. Oard et al. Overview of the CLEF-2006 Cross-Language Speech Retrieval Track. In CLEF CL-SR'06. http://clef-clsr.umiacs.umd.edu/.
[12] J. S. Olsson et al. Cross-Language Text Classification. In SIGIR'05.
[13] S. E. Robertson et al. Okapi at TREC-3. In Text REtrieval Conference, 1992.
[14] M. Sanderson and X. M. Shou.
Search of Spoken Documents Retrieves Well Recognized Transcripts. In ECIR'07.
[15] R. Schapire et al. Incorporating Prior Knowledge into Boosting. In Machine Learning: Proceedings of the Nineteenth International Conference, 2002.
[16] B. W. Silverman. Density Estimation. Chapman and Hall, London, 1986.
[17] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer-Verlag, New York, NY, USA, 2002.
[18] X. Wu and R. Srihari. Incorporating Prior Knowledge with Weighted Margin Support Vector Machines. In KDD'04.
[19] Y. Yang and J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. In ICML'97.