UMCP: iSchool: INST 734: Spring 2018: Software

INST 734
Information Retrieval Systems
Spring 2018
Information Retrieval Software

Available Text Retrieval Software

The following software is available for use in this course. Those with links can be downloaded freely and used anywhere. The three you are most likely to want to use are listed first; the others follow in alphabetical order for completeness. Some of these search engines are compared in an October 2007 Technical Report from Universitat Pompeu Fabra in Spain.

The Top Three

These are the three that are most often used these days.

Lucene: A freely available Java IR system, probably the easiest system to get up and running, and the most easily modified. The Solr Web front end for Lucene and the Elasticsearch distributed (sharded) version of Lucene are both also well worth considering.
Indri: Indri is optimized for efficiency, and thus is a good choice if you have a large collection and a single processor. It is built on top of the Lemur toolkit for building language modeling systems for information retrieval. There is also a simple variant of Indri written in Java called Galago that is designed for use with our textbook.
Terrier: An information retrieval system from the University of Glasgow that is optimized for efficiency. Terrier implements the divergence from randomness framework for ranked retrieval.

The Others

These are mostly of historical interest, but they are still available.

Cheshire 3: Freely available research software implementing a logistic regression model from the University of California at Berkeley. Getting it working may require some facility with Z39.50.
IRF: A Java toolkit for building IR systems for small applications. The strength of IRF is that the object oriented framework greatly simplifies tasks that require working with the source code. However, because Java is designed for platform independence rather than efficiency, the size of the collections that can be handled is quite limited.
Ivory: An information retrieval system for the Hadoop MapReduce framework. This is a good choice if you have a very large collection and at least a modest size server cluster. You can buy time from Amazon Web Services if you don't have your own cluster.
MG: Research software from RMIT University that is designed to maximize storage efficiency on very large collections. It is available under the GNU General Public License. We installed this once several years ago and it wasn't too difficult.
Xapian: An open source IR system that is designed to run under Linux. Xapian is a descendant of Omseek, which itself is a descendant of Open Muscat. Xapian is designed to handle several Western European languages, and thus might be a good choice if you want to work with languages other than English.
Zettair: Zettair is optimized for both efficiency and modifiability. It therefore occupies a part of the design space between Lucene and Indri.
WebGlimpse: Freely available software from the University of Arizona that is designed for efficient indexing (at some cost in retrieval efficiency). Glimpse is not configured for TREC-style evaluations, so that would take some extra work.
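
For systems that are not already set up for TREC-style evaluation, much of the extra work is writing ranked results in the run format that trec_eval reads: one line per retrieved document, with query ID, the literal Q0, document ID, rank, score, and a run tag. The following is only a minimal sketch of that step in Python; the function name format_trec_run, the run tag, and the sample document IDs are hypothetical, and it assumes your system can already hand you (document ID, score) pairs for each query.

```python
def format_trec_run(query_id, scored_docs, run_tag="myrun"):
    """Return TREC run-file lines for one query.

    scored_docs is a list of (docno, score) pairs in any order;
    higher scores are assumed to mean better matches.
    """
    # Sort best-first so that rank 1 is the highest-scoring document.
    ordered = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    lines = []
    for rank, (docno, score) in enumerate(ordered, start=1):
        # Columns: query ID, literal "Q0", document ID, rank, score, run tag.
        lines.append(f"{query_id} Q0 {docno} {rank} {score:.4f} {run_tag}")
    return lines


if __name__ == "__main__":
    # Hypothetical scores for a single query, for illustration only.
    for line in format_trec_run("301", [("DOC-B", 1.5), ("DOC-A", 2.25)], "demo"):
        print(line)
```

Writing these lines for every query into one text file gives a run that can be scored against a relevance judgment (qrels) file.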

See also the listing at searchtools.com.

Doug Oard

Last modified: Sun Jan 14 20:21:58 2018
