LBSC 878, Spring 2005, Week 1, Doug Oard

Overview

Course Objectives and Approach

  • See Web site

Thinking broadly about “ISAR”

  • Data-information-knowledge-wisdom
  • Analysis-retrieval-synthesis-creation
  • Content-system-process-purpose
  • Technology-organization-access-use-context

Some terminology

  • Data:        Basic elements of meaning (assertions)
  • Information: Assertions together with the context needed for their interpretation
  • Knowledge:   A basis for making decisions (Wilson’s “practical relevance”)
  • Wisdom:      A basis for guiding decisions

Defining IR (Blair)

  • Both Information Retrieval and Database Retrieval use queries to obtain information
  • Database Retrieval retrieves data and combines it to produce information
    • The query provides the context needed for interpretation of the retrieval result
  • Information Retrieval retrieves objects (e.g., documents) that contain information
    • The objects themselves provide the context needed for their interpretation
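
A minimal sketch of Blair's distinction, assuming a made-up employee table and two made-up memos (none of the names or values below come from the reading). The first function combines data elements into an answer whose meaning depends on the query; the second returns whole documents that carry their own context.

```python
# Hypothetical toy data; every name and value here is illustrative only.
employees = [
    {"name": "Ada", "dept": "Cataloging", "salary": 50000},
    {"name": "Ben", "dept": "Reference", "salary": 42000},
    {"name": "Cy", "dept": "Cataloging", "salary": 46000},
]

documents = {
    "memo-1": "Cataloging staff salaries will be reviewed in the spring.",
    "memo-2": "The reference desk has new evening hours.",
}


def database_retrieval(rows, dept):
    """Combine data elements into one answer; the query supplies the context."""
    salaries = [row["salary"] for row in rows if row["dept"] == dept]
    return sum(salaries) / len(salaries)


def information_retrieval(docs, term):
    """Return whole objects; each document provides its own context."""
    return [doc_id for doc_id, text in docs.items()
            if term.lower() in text.lower()]


# A bare number that only makes sense given the query that produced it.
average = database_retrieval(employees, "Cataloging")

# Identifiers of documents the searcher must still read and interpret.
hits = information_retrieval(documents, "salaries")

print(average)
print(hits)
```

The number returned by the first call is uninterpretable without the query (average salary in which department?), while each memo returned by the second call can be read on its own.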

Some examples of Information Retrieval (IR) applications

  • Find something (an academic paper, an email I wrote two years ago, some class notes, ...) in a collection of written text, in any natural language, in any format (ASCII text, word processor files, scanned images of typeset pages, scanned images of handwritten manuscripts, ...)
  • Find something (an object, the work of a particular artist, a street address, ...) in a collection of still images of some type (photographs, blueprints, oil paintings, maps, ...)
  • Find something (spoken words, singing, instrumental music, ...) in a collection of recorded audio
  • Find something (a person, the depiction of some event, ...) in a collection of video (video tape, motion picture film, ...) that may or may not also contain synchronized audio and text
  • Find someone (coworker, consultant, speaker, ...) with the expertise that is required to help you accomplish some goal.
  • Sift through a large stream of continuously generated materials (newswire stories, electronic mail, television programs, telephone calls, ...) to find something worthy of your attention
  • Explore a large collection of materials (documents, images, audio, video, ...) to identify some useful information (broad trends, new discoveries, unexpected events, ...)

Why is IR hard? (Blair)

  • An indexer must try to guess which terms every searcher will use to look for each document
  • A searcher must guess which terms the indexer chose
  • For full-text retrieval, the searcher must guess which terms the author chose
  • Individual variability makes it impossible to do any of this perfectly
  • Retrieval effectiveness is thus a matter of degree, rather than an absolute
  • Effective retrieval thus becomes an iterative process
  • Some known techniques help the user choose good search terms (Croft’s “magic”)
    • But they tend to make the system less predictable, so it becomes harder to iterate
  • With well-designed systems, it is often possible to eventually find useful documents
    • But there are fundamental limits on what you can know about what you missed
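
The guessing problem above can be sketched with a toy exact-match search. This is only an illustration under stated assumptions: the three-document collection, the `search` and `recall` helpers, and the relevance judgments are all hypothetical, and real systems match terms in far more sophisticated ways.

```python
# Hypothetical collection: two authors describe the same topic in different words.
collection = {
    "d1": "automobile maintenance schedule",
    "d2": "car repair manual",
    "d3": "bicycle repair guide",
}

# Assumed relevance judgments for a searcher interested in car upkeep.
# A real searcher never sees this set, which is the point of the last bullet.
relevant = {"d1", "d2"}


def search(docs, query):
    """Return ids of documents that contain every query term as a whole word."""
    terms = query.lower().split()
    return sorted(
        doc_id for doc_id, text in docs.items()
        if all(term in text.lower().split() for term in terms)
    )


def recall(retrieved, relevant_ids):
    """Fraction of the relevant documents that were retrieved."""
    return len(set(retrieved) & relevant_ids) / len(relevant_ids)


first = search(collection, "car")          # the searcher's first guess
second = search(collection, "automobile")  # a reformulated guess

print(first, recall(first, relevant))
print(second, recall(first + second, relevant))
```

The first query finds one useful document and gives no sign that another exists; only a second guess at the author's vocabulary finds it. Recall can be computed here only because the relevant set was assumed in advance.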

Desiderata (Croft, roughly in order)

  • Integration with non-IR systems
  • Source selection
  • Result merging
  • Response time
  • Handling multiple document formats
  • Overcoming vocabulary mismatch
  • Usability
  • Selective dissemination
  • Robustness
  • Multiple modalities
  • Rapid configuration for information extraction
  • Predictability
  • Handling documents written in multiple languages
  • Data mining in text databases
  • Text categorization

Meta-questions

  • How do articles get published in D-Lib?
  • Why did Blair write a book?

Focus Area Selection