INST 734
Information Retrieval Systems
Fall 2015
Project Batch Evaluation Design (Assignment P8)


The deadline for this assignment has been extended to 11:59 PM on Wednesday, October 28, because the Virtual IR Lab was unreachable over the weekend before the original due date.
This document applies only to instructor-designed projects.

The goal of this project component is to produce a plan for conducting a batch evaluation that involves a parameter sweep to find the best parameter setting(s) for the comparison function that has been assigned to your team. Given the limited time in a semester, it might not be practical to design a new batch evaluation for the same system that is the focus of your user study. For the batch evaluation, we will therefore use the UMD15 instance of the University of Delaware Virtual IR Lab (the same system you used in Exercise E5).

I will assign each team one comparison function (which will include one or more parameters). I will send a document by email that contains details on the Virtual IR Lab implementation of each comparison function, along with a reference to the paper in which that comparison function was first introduced. The team will then design a study whose goal is to find the best parameter setting(s), where "best" is defined by at least one evaluation measure (of your choice) on at least two different test collections (of your choice). This essentially involves a series of comparisons of the same comparison function with different parameter settings. It should be reminiscent of the comparison that we did in Exercise E5, and it should be done in a similar way (e.g., reported differences should be tested for statistical significance).
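To make concrete what such a sweep looks like once per-topic scores are in hand, here is a minimal sketch in Python. The parameter values and the per-topic scores are made-up placeholders; in practice you would record one score per topic for each setting from your Virtual IR Lab runs and substitute those numbers (or read them from a file).

    # Sketch of a parameter sweep with a paired significance test.
    # The per-topic scores below are hypothetical placeholders.
    from scipy.stats import ttest_rel

    # evaluation measure (e.g., AP) per topic, one list per parameter setting
    scores = {
        0.25: [0.12, 0.40, 0.05, 0.33, 0.21],   # made-up numbers
        0.50: [0.15, 0.42, 0.07, 0.35, 0.20],
        0.75: [0.11, 0.38, 0.06, 0.30, 0.19],
    }

    # mean of the measure for each setting, then pick the best one
    means = {setting: sum(s) / len(s) for setting, s in scores.items()}
    best = max(means, key=means.get)
    baseline = 0.25  # whatever default the system documentation specifies

    # paired t-test over the per-topic scores for best vs. baseline
    t, p = ttest_rel(scores[best], scores[baseline])
    print(f"best setting = {best}, mean = {means[best]:.3f}")
    print(f"paired t-test vs {baseline}: t = {t:.2f}, p = {p:.3f}")

The pairing matters: you compare the two settings topic by topic, not just their overall means.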

Batch evaluations are designed to be conducted fully automatically (although the setup of the Virtual IR Lab will require you to do some manual bookkeeping). The evaluation design includes at least the following: the test collections (and topic sets) you will use, the relevance judgments (ground truth) that come with them, the evaluation measure(s) you will compute, the parameter settings you will sweep, and the way you will test observed differences for statistical significance.

One way to get a sense for what an evaluation design looks like is to read a TREC, CLEF, NTCIR or FIRE track overview paper. For example, here are two that I have written:

Your plan need not be as detailed as these, of course, because these were written AFTER the evaluation. To see what we had before the evaluation, look at: Of course, you won't need to specify all the submission format issues that we did (since you won't actually be submitting everything). So 3 or 4 pages should probably suffice for what you will write up as a plan.

There are four document collections available in the Virtual IR Lab system, so you should choose at least two of those four. To see the documents in a collection, select that collection and run some searches; you can then click on the results to view the documents. To read about the collections, you can refer to documentation from the time they were created:

For the AP8889 and DOE collections (which used the same topics), the actual topic set is available at http://trec.nist.gov/data/topics_eng/index.html -- the three files you want there are the TREC-1 routing topics, the TREC-1 ad hoc topics, and the TREC-2 ad hoc topics (all of which I have linked to directly from here because the TREC Web site has them compressed and named without a .txt extension).

For other topics, see http://trec.nist.gov/, then click on Data, then on the track you care about (e.g., Robust or Web -- WT10g is a Web track collection), then on Topics, and then on what you want. If you are on a PC you may need WinRAR to open .gz files, and you may need to open the files with WordPad to see the correct line breaks. Of course, you can also see the short ("title") version of each topic in the University of Delaware system, but longer queries can be created from the full topics if you wish.
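If you do build longer queries from the full topics, a small script can pull out the fields you want. Here is a minimal sketch that assumes the usual TREC topic markup (<top>, <num>, <title>, <desc>); the older topic files add label text such as "Topic:" and "Description:", and the file name shown is just a placeholder, so adjust as needed for the file you actually download.

    import re

    def parse_topics(path):
        """Very small parser for TREC-style topic files (a sketch, not robust)."""
        with open(path, encoding="latin-1") as f:
            text = f.read()
        topics = []
        for block in re.findall(r"<top>(.*?)</top>", text, flags=re.S):
            num = re.search(r"<num>\s*(?:Number:)?\s*(\d+)", block)
            title = re.search(r"<title>\s*(?:Topic:)?\s*(.*?)\s*<", block + "<", flags=re.S)
            desc = re.search(r"<desc>\s*(?:Description:)?\s*(.*?)\s*<", block + "<", flags=re.S)
            topics.append({
                "num": num.group(1) if num else None,
                "title": " ".join(title.group(1).split()) if title else "",
                "desc": " ".join(desc.group(1).split()) if desc else "",
            })
        return topics

    # Example use (the file name is a placeholder):
    # for t in parse_topics("topics.51-100.txt"):
    #     print(t["num"], "|", t["title"])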

To find the qrels (i.e., the ground truth relevance judgments I referred to above) you can do the same thing, but navigate to the relevance judgments rather than the topics. The same caveats about WinRAR and WordPad apply. Note that judgments exist only for judged documents, and that most documents were not judged (you should understand why!). These judgments are built into the Virtual IR Lab system, but for analysis purposes you may want access to the full qrels files.
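The qrels files are plain text, with one judgment per line in the form "topic iteration docno relevance". A minimal loading sketch for your own analysis (again with a placeholder file name) might look like this:

    from collections import defaultdict

    def load_qrels(path):
        """Read a TREC qrels file into {topic: {docno: relevance}}."""
        qrels = defaultdict(dict)
        with open(path, encoding="latin-1") as f:
            for line in f:
                parts = line.split()
                if len(parts) != 4:
                    continue  # skip blank or malformed lines
                topic, _iteration, docno, rel = parts
                qrels[topic][docno] = int(rel)
        return qrels

    # Example use (the file name is a placeholder):
    # qrels = load_qrels("qrels.51-100.txt")
    # for topic, judgments in sorted(qrels.items()):
    #     relevant = sum(1 for r in judgments.values() if r > 0)
    #     print(topic, len(judgments), "judged,", relevant, "relevant")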

One thing to note is that both the queries and the documents are stemmed in the Virtual IR Lab system. As an example, the full version of the title field of AP8889 Topic 7 is "U.S. Budget Deficit". You can play with an online demo of the Porter stemmer at http://9ol.es/porter_js_demo.html -- as you will see, the Porter stemmer is smart enough not to stem us to u, but the U Del system apparently does, so we can conclude that it does not use the Porter stemmer. It is quite clearly stemming somehow, though -- lots of common endings are chopped off. There are many stemmers other than the Porter stemmer, and of course any consistent way of stemming things should work reasonably well.
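If you want to experiment with a Porter stemmer outside the browser demo, NLTK provides one convenient implementation (this is just for exploration; as noted above, it is evidently not the stemmer the U Del system uses):

    # Requires: pip install nltk
    from nltk.stem import PorterStemmer

    ps = PorterStemmer()
    for word in ["us", "budget", "deficit", "searching", "documents", "normalization"]:
        print(word, "->", ps.stem(word))
    # Very short words like "us" come back unchanged, while common endings
    # are chopped off the longer words.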

For evaluation measures, you might select P@5, P@10, P@30, AP, or some other measure (e.g., NDCG). Which measure(s) you select is up to you, but you will need to justify your selection by explaining what your chosen measure(s) emphasize and why that emphasis is appropriate for your evaluation.
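To be concrete about what these measures compute, here is a minimal sketch of P@k and (uninterpolated) AP for a single topic, given a ranked list of document IDs and the set of documents judged relevant; the document IDs are made up for illustration.

    def precision_at_k(ranked, relevant, k):
        """Fraction of the top k retrieved documents that are relevant."""
        return sum(1 for d in ranked[:k] if d in relevant) / k

    def average_precision(ranked, relevant):
        """Sum of precision at each rank where a relevant document appears,
        divided by the total number of relevant documents for the topic."""
        if not relevant:
            return 0.0
        hits, total = 0, 0.0
        for rank, d in enumerate(ranked, start=1):
            if d in relevant:
                hits += 1
                total += hits / rank
        return total / len(relevant)

    # Illustrative example with made-up document IDs:
    ranked = ["d3", "d7", "d1", "d9", "d4"]
    relevant = {"d1", "d4", "d8"}
    print("P@5 =", precision_at_k(ranked, relevant, 5))   # 2/5 = 0.4
    print("AP  =", average_precision(ranked, relevant))   # (1/3 + 2/5) / 3

Notice how P@5 cares only about what is in the top 5, while AP rewards putting relevant documents near the top and penalizes relevant documents that are never retrieved.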

How should you decide which collection(s) to select? To answer this, you need to understand what the comparison functions (i.e., the ranking functions) you are testing are designed to do well. Do you care how many terms are in the queries? If so, you should check that. Do you care whether all the documents are about the same length? If so, you should check that. Do you care if the collection is large? If so, you should check that. The challenge you face in making this selection is the same challenge that anyone evaluating an IR system would face -- what document collection is representative of the content that you are most interested in searching well?
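For example, checking how many terms the title queries contain takes only a few lines. This sketch repeats the title pattern from the topic-parsing sketch above, and the file name is again a placeholder for whichever topic file you download.

    import re
    from statistics import mean

    # Count the number of terms in each title query of one topic file.
    with open("topics.51-100.txt", encoding="latin-1") as f:
        text = f.read()
    titles = [" ".join(m.split())
              for m in re.findall(r"<title>\s*(?:Topic:)?\s*(.*?)\s*<",
                                  text + "<", flags=re.S)]
    lengths = [len(t.split()) for t in titles]
    print(len(titles), "topics; title length min/mean/max =",
          min(lengths), round(mean(lengths), 1), max(lengths))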

The whole idea of the project is to drive you into a more complex set of decisions than you had to make in the exercises, but to layer that complexity on in a way that will help you master key ideas from the course. If you would like to see how a project team from a prior semester approached a batch evaluation project, check out the project video on Pivoted Document Length Normalization from Fall 2014. And, of course, I will be glad to discuss the project with you along the way. So don't hesitate to ask questions -- that's what the project is designed to cause!

Please also include with your assignment a ranked list of which additional readings you would be interested in doing for Modules 10, 11, 12, and 13. Please select at least 5 readings, include at least one from each module, and list them in most-preferred-first order. You should refer to readings by module number and reading number, using the numbering on the assigned summaries page. I will send out summary assignments by Saturday, October 31, so that those who will be doing readings in Module 10 will have time to prepare.

Submit your batch evaluation plan using ELMS.

I will send you feedback on this assignment, but (as with all the pieces) the overall project grade will be assigned holistically rather than being determined by a fixed formula.


Doug Oard
Last modified: Sun Oct 25 12:44:20 2015