Question Answering for the Spoken Web Evaluation Task



Question Answering for the Spoken Web (QASW) is an information retrieval evaluation in which the goal is to match questions spoken in Gujarati to answers spoken in Gujarati. QASW is a task in the 2013 Forum for Information Retrieval Evaluation (FIRE). MediaEval 2013 participants are welcome to participate in the FIRE QASW task, but evaluation results will not be available in time for the MediaEval meeting.

The source of both the questions and the collection of possible answers is the IBM Spoken Web Gujarati collection. The questions are actual questions asked by users of an operational system. The collection of possible answers, drawn from the same operational system, is composed of answers that were actually given to specific questions (an answer may apply to more than one question, since some topics were asked about more than once) and of announcements on topics of general interest. The collections will be distributed under a license that permits free use for research purposes, and the resulting test collections (including relevance judgments) will be deposited in a community repository (e.g., LDC or ELDA). The questions, the collection, and the relevance judgments will be identical in the two evaluations.

We expect QASW to be of interest to researchers working on speech recognition, information retrieval (including question answering), and information and communications technology for development (ICTD).

Test Collection

We have transcribed 196 questions (from the 2,285 that were asked in the operational system on the date we captured them). From these, we have selected 50 for training (numbered 1-50) and 99 for evaluation (numbered 101-201, with the exception of 174-175). Our intent in selecting 99 evaluation questions is to make it likely that at least 50 of the questions will yield at least 5 relevant documents each.
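As a quick check on the numbering described above, the two identifier ranges can be enumerated as follows. This is only an illustrative sketch; the variable names are assumptions and are not part of any distributed tooling.

```python
# Training questions are numbered 1-50.
training_ids = list(range(1, 51))

# Evaluation questions are numbered 101-201, skipping 174 and 175.
evaluation_ids = [q for q in range(101, 202) if q not in (174, 175)]

print(len(training_ids), len(evaluation_ids))  # prints: 50 99
```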

The collection to be searched will consist of about 4,000 speech segments selected from 3,557 answers that were given in response to specific questions and 834 "announcements" (general answers that were provided to address topics of general interest). The selection was performed by removing segments that were too short to be useful or that contained no recognizable speech. The speech segments are available in two forms: (1) .wav audio files and (2) manual transcripts. A Gujarati stemmer and stopword list are also available.

Relevance will be judged on a graded scale using depth-30 pooling (or deeper, if resources allow). Participating systems will be asked to submit depth-1000 results using the full question, and also using truncated versions of the question (truncated at 5 seconds, 10 seconds, 15 seconds, etc.). The principal evaluation measure will be mean NDCG for the full questions. Participating systems will also be asked to predict which truncation point maximizes a reward function that rewards DCG@1 and penalizes duration (i.e., later truncation points). The goal of this measure is to encourage the design of systems that can determine when to "barge in" for the first time with a plausible answer to the question (in a real system, subsequent interaction would be possible, but that will not be modeled in 2013).
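The evaluation steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the official scoring code: the DCG discount (gain divided by log2 of rank plus one), the treatment of unjudged documents as non-relevant, and the linear per-second penalty in `best_truncation` are all assumptions, since the task description does not specify the exact formulas. All function and parameter names are hypothetical.

```python
import math


def pool(runs, depth=30):
    """Union of the top-`depth` documents from each submitted run for one question."""
    pooled = set()
    for ranked_docs in runs:
        pooled.update(ranked_docs[:depth])
    return pooled


def dcg(gains):
    """Discounted cumulative gain; the log2(rank + 1) discount is an assumption."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_docs, judgments, depth=1000):
    """NDCG for one question; `judgments` maps document id to a graded relevance value.

    Unjudged documents are treated as non-relevant (an assumption).
    """
    gains = [judgments.get(doc, 0) for doc in ranked_docs[:depth]]
    ideal = sorted(judgments.values(), reverse=True)[:depth]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_ndcg(results, all_judgments):
    """Mean NDCG over questions; both arguments are keyed by question id."""
    scores = [ndcg(results[q], all_judgments[q]) for q in all_judgments]
    return sum(scores) / len(scores)


def best_truncation(dcg_at_1_by_seconds, penalty_per_second=0.05):
    """Truncation point maximizing a hypothetical reward: DCG@1 minus a linear duration penalty."""
    return max(
        dcg_at_1_by_seconds,
        key=lambda t: dcg_at_1_by_seconds[t] - penalty_per_second * t,
    )
```

For example, with DCG@1 values of 0.0 at 5 seconds and 2.0 at both 10 and 15 seconds, this hypothetical reward would favor barging in at 10 seconds, because waiting longer adds penalty without improving the top answer.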



References

  1. Douglas W. Oard. Query by Babbling: A Research Agenda. In Proceedings of the CIKM Workshop on Information and Knowledge Management for Developing Regions, 2012.
  2. F. Metze et al. The Spoken Web Search Task. In Proceedings of MediaEval, 2012.
  3. Aren Jansen and Benjamin Van Durme. Indexing Raw Acoustic Features for Scalable Zero Resource Search. In Proceedings of Interspeech, 2012.
  4. Nigel G. Ward and Steven D. Werner. Thirty-Two Sample Audio Search Tasks. UTEP Technical Report UTEP-CS-12-39.


Doug Oard
Last modified: Sat Aug 24 03:15:18 2013