INST 734

Information Retrieval Systems

Fall 2014

Exercise E3

The purpose of this assignment is to give you some experience with the actual calculations that are done for ranked retrieval. You may find it convenient to use a spreadsheet (e.g., Microsoft Excel) for part or all of this assgnment.

**1.** Create the term frequency matrix for the four document
titles below:

- Doc 1: The Cornell Implementation of the SMART System
- Doc 2: User's Manual for the SMART Information Retrieval System
- Doc 3: Implementation of the SMART Information Retrieval System
- Doc 4: Introduction to Modern Information Retrieval

In building this matrix you should retain every word (i.e., there is no stopword list). The indexed units should be individual words, all of which should be converted to lower case. You should not apply stemming or other morphological processes to the words when creating the index terms. Your term-by-document matrix should have one row for each unique term and one column for each document. The entry in each cell should be the number of occurrences of the term that is associated with the element's row in the document that is associated with the element's column. You may leave elements for which the value is zero blank if you like or you may place a zero there. Turn in this matrix.

Next, use the term-by-document matrix that you just created to perform boolean retrieval.

**2.** Create a new term frequency matrix from the matrix you
prepared in question 1: each cell in this matrix should have a 1 in
positions where the term frequency is not zero and a zero in places
where the term frequency is zero in the original matrix (blanks are
not a good idea here because we will need to see the zeros). Turn in
this matrix. Use this new matrix for answering the rest of the
questions in part 1.

**3.** Extract the rows for the terms "information" and
"retrieval" from the matrix and write them down, one above the other.
Each should have four binary digits (0 or 1) in them. Then draw a
horizontal line under them and compute the AND of the two rows. This
should produce a single row in which a 1 appears for each document
that has both "information" and "retrieval". What are these
documents? Show your work.

As a reminder, the truth table for (a AND b) is:

\B 0 | 1 A +------- 0 | 0 | 0 --+---+--- 1 | 0 | 1

**4.** Perform the same computation for the following queries:

- (smart AND implementation) OR (introduction AND retrieval)
- (Cornell OR SMART) AND (Implementation)
- (information NOT retrieval)

Refer to lecture notes if you need the truth tables for the OR and NOT operators. Present your computation in any matter that you find clear.

**5.** Perform the same computation for the query:

(information XOR system)

where the truth table for (a XOR b) is:

\b 0 | 1 a +------- 0 | 0 | 1 --+---+--- 1 | 1 | 0

Show the computation in the same form that you presented the answer to question 3.

For the following questions, use the term-by-document matrix provided below to perform vector space retrieval (blanks indicate zeros):

d1 d2 d3 +---+---+---+ t1 | | | 5 | +---+---+---+ t2 | 4 | 1 | 3 | +---+---+---+ t3 | 5 | | 4 | +---+---+---+ t4 | 6 | 3 | 3 | +---+---+---+ t5 | | 1 | | +---+---+---+ t6 | 3 | | 7 | +---+---+---+ t7 | | 6 | 1 | +---+---+---+ t8 | 2 | | | +---+---+---+

**6.** Build the w matrix in which each element is computed as
TF * IDF, where TF is what is specified in each element above and the
IDF for term i is computed as:

log (total_number_of_documents / number_of_documents_containing_term_i)

Use base 10 logarithms. Microsoft Windows systems have a scientific calculator built in (select "scientific" from the menu) if you want to do this by hand (and Excel also has a function for this). Turn in this matrix.

**7.** Build the w' matrix by applying cosine normalization to
the w matrix that you just created. This is found by dividing each
element in the matrix by the square root of the sum of the squares of
every element in the same column as the element in question. A
formula for this that might be clearer but which says exactly the same
thing is provided in the notes. Turn in this matrix.

**8.** Using the w' matrix computed in question 2, compute the
rank order of the documents that would be found using the vector space
method for the query:

t2 t7

In your answer, give the similarity score that is computed for each document and give the ranked list that the system would return.

**9.** Compare your results for questions 6 through 8 to those
presented in the Similarity-Based Ranking lecture video for the same
matrix (the starting point was the left three columns of the matrix
used in the lecture slides) and BRIEFLY explain why any answers you
obtained (weights, similarity scores, or rank orderings) were
different.

You should submit your assignment on ELMS. Instructions for doing this are on the course ELMS site.

Doug Oard Last modified: Sun Aug 3 06:36:14 2014