INST 734
Information Retrieval Systems
Spring 2018
Exercise E5


Evaluating Systems

The purpose of this assignment is to gain experience with evaluation of information retrieval systems. We will use the University of Delaware Virtual IR Lab for this assignment.

The version of the Virtual IR Lab that you should use is the UMD18 instance, which is at https://virlab.eecis.udel.edu/UMD18/ (other versions that you might find through a Web search cannot be used for this assignment). Each student in the class has an account on the UMD18 instance of the Virtual IR Lab. Your userid is the same as the userid that you use for ELMS. Your password for the UMD18 instance of the Virtual IR Lab can be found on ELMS (and you can and should change it after you log in, but don't lose your new password!).

The first thing to do after you log in is to create the two scoring functions that we will compare. To do this, go to "Create Function" in the left-hand menu, enter "BM25a" (without the quotes) in the "function name" field, and then cut and paste the following into the content:

double okapiK1 = 2;
double okapiB = 0.75;
double okapiK3 = 1000;
for(occur)
{
        double idf = log((docN-DF[i]+0.5)/(DF[i]+0.5));
        double weight = ((okapiK1+1.0)*tf[i]) / (okapiK1*(1.0-okapiB+okapiB*docLength/docLengthAvg)+tf[i]);
        double tWeight = ((okapiK3+1)*qf[i])/(okapiK3+qf[i]);
        score+=idf*weight*tWeight;
}
and hit the save button. Then go to "Manage Function" in the left menu and you should see your new function there, listed under "Functions can be used to make search engine". There will also be two other functions that cannot be used -- ignore them. Then go to Create Function again, and create a function called "BM25b". Cut and paste the following into the content for that one:
double okapiK1 = 1.2;
double okapiB = 0.75;
double okapiK3 = 1000;
for(occur)
{
        double idf = log((docN-DF[i]+0.5)/(DF[i]+0.5));
        double weight = ((okapiK1+1.0)*tf[i]) / (okapiK1*(1.0-okapiB+okapiB*docLength/docLengthAvg)+tf[i]);
        double tWeight = ((okapiK3+1)*qf[i])/(okapiK3+qf[i]);
        score+=idf*weight*tWeight;
}
and hit the save button. These two functions are the version of BM25 that we discussed in Module 3 and a second version in which a single parameter has been changed (okapiK1 is 2 in BM25a and 1.2 in BM25b).
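
If it helps to see the formula outside the Virtual IR Lab syntax, here is a minimal Python sketch of the same per-term computation. The function name and the example numbers are mine, chosen only to illustrate the effect of changing okapiK1 (the only parameter that differs between BM25a and BM25b); they are not drawn from ap8889.

import math

def bm25_term_score(tf, df, qf, doc_n, doc_len, avg_doc_len,
                    k1=2.0, b=0.75, k3=1000.0):
    # Contribution of a single query term, mirroring the loop body above.
    idf = math.log((doc_n - df + 0.5) / (df + 0.5))
    weight = ((k1 + 1.0) * tf) / (k1 * (1.0 - b + b * doc_len / avg_doc_len) + tf)
    t_weight = ((k3 + 1.0) * qf) / (k3 + qf)
    return idf * weight * t_weight

# Illustrative (made-up) statistics: a term that occurs 3 times in a
# 300-word document, appears in 50 of 100,000 documents, and occurs
# once in the query.
for k1 in (2.0, 1.2):
    score = bm25_term_score(tf=3, df=50, qf=1, doc_n=100000,
                            doc_len=300, avg_doc_len=250, k1=k1)
    print("k1 =", k1, "-> term score =", round(score, 4))

The k1 parameter controls how quickly repeated occurrences of a term stop adding to the score, which is why the two functions rank some documents differently.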

Now go to "Compare Search Engines" in the left menu. When you do, the left menu changes, and if you want to go back you will need to select "Retrieval Function" in the top menu. If all else fails (as it will if you select About -- don't do that!) try the back button and be patient. And if that fails, start over and log in again from the URL above -- your two retrieval functions (which actually are a combination of the document representation functions and comparison function that we discussed previously) should both still be there!

In "Compare Search Engines" first select your two retrieval functions. I suggest putting BM25a on the left and BM25b on the right. Then select ap8889 as the collection -- this is a collection of news stories from the Associated Press from 1988 and 1989. Then, click on the green "Select a TREC Query" button, from the popup menu select the only entry (ap8889), and and then from the list of queries that appears select query number 1 (which is "antitrust case pend" -- these queries are already stemmed; the real TREC query was "antitrust cases pending"). Finally, click on the bluish "Compare" button. The result (perhaps after a few seconds) should be two ranked lists of document numbers. Each list has a value labeled "MAP" above the list; this is the average precision achieved for that system on that query (MAP is actually an error in the labeling; these are AP values for single queries). You should get 0.02864 on the left and 0.02568 on the right. If so, the cut and paste of your retrieval functions and your operation of the system to this point has been correct, and you are ready to actually do the rest of this exercise. If not, check with me to sort out what you (or the system!) are doing wrong.

Now look at the annotations on the right side of the rightmost ranked list. Here you can see a notation that the document that had been in position 3 in the left list is in a higher position in the right list (position 2 in this case). Under each document identifier you will see a notation of whether the document is relevant or not. If it is relevant, that is shown in green as Relevant(1) because the system is capable of recording different levels of relevance; the ap8889 collection only contains binary relevance judgments, however, so 1 is the only value you will ever see for relevant documents. A quick skim will indicate that the system on the left found one relevant document in the top 10, and the system on the right also found only one. P@10 (precision at a cutoff of 10 documents) is thus 0.1 on the left and 0.1 on the right for this query. To save you from flipping to the second and third pages, the system shows you P@30 (labeled "P30") for each list. AP is also computed by the system (and mislabeled as "MAP") by doing the AP computation down to 1000 documents, which will save you a lot of scrolling!
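
The system computes these values for you, but here is a minimal Python sketch of how P@10 and AP are defined, in case you want to check your understanding. The relevance list and the count of 25 relevant documents below are made up for illustration; the real system does the AP computation over the top 1000 retrieved documents.

def precision_at_k(rels, k):
    # rels is a list of 0/1 relevance judgments in rank order.
    return sum(rels[:k]) / k

def average_precision(rels, total_relevant):
    # Average of the precision values at each rank where a relevant
    # document appears, divided by the total number of relevant documents.
    hits, running = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            running += hits / rank
    return running / total_relevant if total_relevant else 0.0

# Made-up example: one relevant document, at rank 3, in the top 10,
# out of 25 relevant documents in the whole collection.
rels = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(precision_at_k(rels, 10))      # 0.1
print(average_precision(rels, 25))   # (1/3) / 25, about 0.0133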

Okay, now that you understand how to operate the system and what it is displaying, you are ready to do the assignment. To save some time, we will use only the first 20 ap8889 queries. Select each query, one at a time, and then select Compare to see the result from each system for that query. Use Excel (or any other system you want) to make a table on which you can record the AP value (you can ignore P@10 and P@30) for each of the first 20 queries. In each row, record three numbers: the query number, the AP for BM25a, and the AP for BM25b.
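
One convenient layout for that table (just a suggestion; everything after the first data row below is a placeholder for the values you will record) is a three-column spreadsheet or CSV file:

query,AP_BM25a,AP_BM25b
1,0.02864,0.02568
2,...
...
20,...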

Now that we have those numbers, we can use them to determine which system is "better". We can start by averaging the AP values across the queries to get a MAP (Mean Average Precision) for each system. A higher MAP value would indicate that the system might be better. Another way to look at this is to count the number of queries for which the system on the left got higher, tied, or lower values of AP than the system on the right; whichever system got more queries in the "higher" category might be better. The pattern of higher (better) : same : lower (worse) that you observe might simply result from chance, however. To check for that, you can use a sign test calculator such as https://www.graphpad.com/quickcalcs/binomial1.cfm. If you had 4 higher, 7 the same, and 9 lower, you would enter 4 in "successes" (the number of queries on which the first system wins) and 13 in "trials" (the total number of non-tied queries = 4+9), and the calculator would compute p=0.2668, which means that there is nearly a 27% chance that this outcome resulted from chance. When the p value comes out below 0.05, we say that the result is statistically significant.

You should also run a two-tailed paired Student's t-test (which is not so named because you are a student, but rather because the inventor published it under the pseudonym "Student" to avoid problems with his employer -- Wikipedia has the full story). For this you will need to enter the two columns of figures from your table, because the t-test is computed on the (paired) differences in the AP values, not just on how many times each system won. An easy way to run a t-test is to use Microsoft Excel. See the last page of https://www.rwu.edu/sites/default/files/downloads/fcas/mns/running_a_t-test_in_excel.pdf for how to do that. You can download Excel from the Office Tools link at https://terpware.umd.edu if you don't already have it. The t-test is generally more sensitive than the sign test, so it might detect a difference (again, we say a statistically significant difference was detected when p<0.05) when the sign test does not.
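
If you would rather do this arithmetic in code than with the calculator and Excel, here is a minimal Python sketch. It assumes SciPy is installed, the two AP lists are placeholders that you would replace with your own 20 values in query order, and in older SciPy versions binomtest is spelled binom_test.

from statistics import mean
from scipy import stats

# Placeholder AP values -- replace with your 20 values from the table.
ap_bm25a = [0.02864, 0.15, 0.40]
ap_bm25b = [0.02568, 0.18, 0.40]

# MAP: the mean of the per-query AP values for each system.
print("MAP BM25a:", mean(ap_bm25a))
print("MAP BM25b:", mean(ap_bm25b))

# Sign test: count wins on each side, ignore ties, then run a two-tailed
# binomial test with p=0.5 on the non-tied queries.
wins_a = sum(a > b for a, b in zip(ap_bm25a, ap_bm25b))
wins_b = sum(b > a for a, b in zip(ap_bm25a, ap_bm25b))
print("sign test p =", stats.binomtest(wins_a, wins_a + wins_b, p=0.5).pvalue)

# Two-tailed paired t-test on the per-query AP differences.
result = stats.ttest_rel(ap_bm25a, ap_bm25b)
print("paired t-test p =", result.pvalue)

With the 4 higher / 7 same / 9 lower example above, the binomial test would give the same p=0.2668 as the web calculator.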

Summarize all of your results (the table, the two MAP values, the p value from the sign test and the t-test, and your conclusion about which system was better or whether no statistically significant difference can be detected) in a text document (e.g., using Word).

Just as an aside, the system also has additional capabilities. For example, there is an Evaluation button in the Manage Function menu that can be used to compute Mean Average Precision and Mean Precision at 30 over an entire test collection at one time (but it doesn't display all the details of the approach we are using in this exercise).

If you ran into any difficulties with this assignment that you were not able to sort out with my help, please make a note of them in what you submit.

You should submit your assignment on ELMS.


Doug Oard
Last modified: Sat Mar 3 23:53:35 2018