UMCP: iSchool: LBSC 796/INFM 718R: Spring 2011: Batch Evaluation Design

LBSC 796/INFM 718R
Information Retrieval Systems
Spring 2011
Batch Evaluation Design

The goal of this project requirement is to produce a plan for conducting a batch evaluation (using frozen relevance judgments).

Batch evaluations are designed to be conducted fully automatically. They include at least the following:

A "canned" set of information that is to be searched.
A set of requests to which the system will be expected to respond.
A set of "ground truth" responses that are expected as answers to each request.
Definitions for one or more evaluation measures that can be used to characterize the system's effectiveness.
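
To make the four components concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy documents, the topic and document identifiers, the term-overlap ranker standing in for "the system", and the choice of mean average precision as the measure. Your own design may reasonably choose different resources and measures.

```python
from typing import Callable, Dict, List, Set

# 1. A "canned" collection (hypothetical toy documents).
COLLECTION: Dict[str, str] = {
    "d1": "tobacco marketing memo",
    "d2": "quarterly sales report",
    "d3": "memo about the tobacco lawsuit schedule",
    "d4": "cigarette advertising plan",
}

# 2. A set of requests (hypothetical topics, used directly as queries).
TOPICS: Dict[str, str] = {"t1": "tobacco memo", "t2": "sales report"}

# 3. Frozen "ground truth" relevance judgments (hypothetical).
QRELS: Dict[str, Set[str]] = {"t1": {"d1", "d4"}, "t2": {"d2"}}


# 4. An evaluation measure: average precision for one ranked list.
def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def term_overlap_search(query: str, collection: Dict[str, str]) -> List[str]:
    """Stand-in system: rank documents by the number of shared query terms."""
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(text.lower().split())), doc_id)
        for doc_id, text in collection.items()
    ]
    # Highest overlap first; ties broken by document id so runs are repeatable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


def mean_average_precision(
    search: Callable[[str, Dict[str, str]], List[str]],
    collection: Dict[str, str],
    topics: Dict[str, str],
    qrels: Dict[str, Set[str]],
) -> float:
    """Run every topic through the system and average the per-topic scores."""
    scores = [
        average_precision(search(query, collection), qrels.get(topic_id, set()))
        for topic_id, query in topics.items()
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(mean_average_precision(term_overlap_search, COLLECTION, TOPICS, QRELS))
```

Because the collection, topics, and judgments are all fixed, rerunning this after changing the system yields directly comparable numbers with no human in the loop. In this toy data, d4 is judged relevant to t1 but shares no terms with the query, so the stand-in system ranks it last and the score for that topic suffers.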

The evaluation design needs to balance four desirable characteristics:

Affordable
Insightful
Repeatable
Understandable

In your evaluation design, you will need to specify each of the four components listed above (the collection, the requests, the ground truth, and the evaluation measures) in a way that, in aggregate, reasonably balances these four desirable characteristics.

In general, you can do this either by adopting someone else's evaluation design and/or evaluation resources or by creating your own. For example, the TREC Legal Track had a batch email search task in 2009 that in some ways resembles the tobacco emails search task. In some cases you may want to draw inspiration from what they did; in others you may want to go further and actually use an existing test collection rather than creating your own.

The best way to see what an evaluation design looks like is to read a TREC, CLEF, or NTCIR track overview paper. For example, here are two that I have written recently (and thus that I am intimately familiar with the details of):

Your plan need not be as detailed as these, of course, because these were written AFTER the evaluation. To see what we had before the evaluation, look at:

TREC Legal Track (scroll down for the 2009 batch task guidelines)
CLEF 2007 CL-SR Track Guidelines

Of course, you won't need to specify all the submission format issues that we did (since you will be submitting to yourself, these don't need to be standardized in advance), and you will probably choose to evaluate your system for only one task. So 3 or 4 pages should probably suffice for what you will write up.

One thing you might want to think about is how you plan to divide your evaluation resources to support both formative and summative evaluation. You need some evaluation data to support development, but testing on your training set is a cardinal sin. So you'll want to divide your available data in some way to allow you to later demonstrate your (hopefully) excellent results on a previously unseen part of the test collection.
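
One way to make such a division repeatable is to split the topics with a fixed random seed, as in the sketch below. The function name, the 50/50 default, and the seed value are all assumptions for illustration. Splitting by topic is only one option, and you may prefer a different partition (for example, by document set) depending on your collection.

```python
import random
from typing import List, Sequence, Tuple


def split_topics(
    topic_ids: Sequence[str],
    dev_fraction: float = 0.5,  # assumed default; choose what your data can afford
    seed: int = 796,            # arbitrary fixed seed so the split is repeatable
) -> Tuple[List[str], List[str]]:
    """Partition topics into a development set and a held-out test set."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be strictly between 0 and 1")
    if len(set(topic_ids)) < 2:
        raise ValueError("need at least two distinct topics to split")

    # Sort first so the result does not depend on the input order.
    shuffled = sorted(set(topic_ids))
    random.Random(seed).shuffle(shuffled)

    # Keep at least one topic on each side of the cut.
    cut = round(len(shuffled) * dev_fraction)
    cut = max(1, min(len(shuffled) - 1, cut))
    return sorted(shuffled[:cut]), sorted(shuffled[cut:])


if __name__ == "__main__":
    topics = [f"t{i}" for i in range(1, 11)]  # hypothetical topic ids
    dev, test = split_topics(topics)
    print("development:", dev)
    print("held out:   ", test)
```

The development topics support formative evaluation while you tune the system. The held-out topics should be scored only once, at the end, for the summative results.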

This assignment will be graded, but (as with all the pieces) the overall project grade will be assigned holistically rather than being determined by a fixed formula.

Doug Oard

Last modified: Jan 24 2011
