1
|
- LBSC 796/INFM 718R
- Session 1, September 10, 2007
- Doug Oard
|
2
|
- Teaching theater orientation
- The structure of interactive IR systems
- Course overview
|
3
|
- Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user
|
4
|
- Information
- Retrieval
- What do we mean by “retrieval”?
- What are the different types of information needs?
- Systems
- How do computer systems fit into the human information seeking process?
|
5
|
|
6
|
- Data
- The raw material of information
- Information
- Data organized and presented in a particular manner
- Knowledge
- “Justified true belief”
- Information that can be acted upon
- Wisdom
- Distilled and integrated knowledge
- Demonstrative of high-level “understanding”
|
7
|
- How is it different from “data”?
- Information is data in context
- Databases contain data and produce information
- IR systems contain and provide information
- How is it different from “knowledge”?
- Knowledge is a basis for making decisions
- Many “knowledge bases” contain decision rules
|
8
|
- Data
- 98.6º F, 99.5º F, 100.3º F, 101º F, …
- Information
- Hourly body temperature: 98.6º F, 99.5º F, 100.3º F, 101º F, …
- Knowledge
- If you have a temperature above 100º F, you most likely have a fever
- Wisdom
- If you don’t feel well, go see a doctor
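The first three levels of the temperature example can be sketched in a few lines of Python. This is only an illustration: the hour labels and the names `hourly`, `FEVER_THRESHOLD_F`, and `likely_fever` are assumptions, not part of the slide.

```python
# Data: bare numbers with no context.
readings_f = [98.6, 99.5, 100.3, 101.0]

# Information: the same data placed in context (hypothetical hour labels).
hourly = {f"hour {i}": temp for i, temp in enumerate(readings_f)}

# Knowledge: a decision rule that can be acted upon.
FEVER_THRESHOLD_F = 100.0

def likely_fever(temp_f):
    """Return True when a reading is above the fever threshold."""
    return temp_f > FEVER_THRESHOLD_F

fever_hours = [label for label, temp in hourly.items() if likely_fever(temp)]
print(fever_hours)
```

Wisdom ("go see a doctor") is deliberately left out; it does not reduce to a rule over the readings.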
|
9
|
- Text
- Structured documents (e.g., XML)
- Images
- Audio (sound effects, songs, etc.)
- Video
- Programs
- Services
|
10
|
- Find something that you want
- The information need may or may not be explicit
- Known item search
- Answer seeking
- Is Lexington or Louisville the capital of Kentucky?
- Directed exploration
- Who makes videoconferencing systems?
|
11
|
- Relevance relates a topic and a document
- Duplicates are equally relevant, by definition
- Constant over time and across users
- Pertinence relates a task and a document
- Accounts for quality, complexity, language, …
- Utility relates a user and a document
- Accounts for prior knowledge
|
12
|
|
13
|
- Searchers often don’t clearly understand
- The problem they are trying to solve
- What information is needed to solve the problem
- How to ask for that information
- The query results from a clarification process
- Dervin’s “sense making”:
|
14
|
|
15
|
- Retrospective (“Retrieval”)
- “Searching the past”
- Different queries posed against a static collection
- Time invariant
- Prospective (“Filtering”)
- “Searching the future”
- Static query posed against a dynamic collection
- Time dependent
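The contrast between the two modes can be sketched as follows. The word-overlap matcher and the sample documents are assumptions chosen for brevity; a real system would rank rather than filter on exact word matches.

```python
from typing import Iterable, Iterator, List

def matches(query: str, document: str) -> bool:
    """Hypothetical matcher: every query word must appear in the document."""
    doc_words = set(document.lower().split())
    return all(word in doc_words for word in query.lower().split())

def retrieve(query: str, collection: List[str]) -> List[str]:
    """Retrospective: any query can be posed against a fixed collection."""
    return [doc for doc in collection if matches(query, doc)]

def filter_stream(standing_query: str, stream: Iterable[str]) -> Iterator[str]:
    """Prospective: one fixed query is applied to documents as they arrive."""
    for doc in stream:
        if matches(standing_query, doc):
            yield doc

# Retrieval: the collection is static, the query varies.
collection = ["fever symptoms in adults", "videoconferencing systems vendors"]
retrieved = retrieve("fever", collection)

# Filtering: the query is static, the collection keeps growing.
arrivals = iter(["new videoconferencing product", "weather report", "fever outbreak news"])
filtered = list(filter_stream("fever", arrivals))
```

Using a generator for `filter_stream` reflects the time dependence: results depend on what has arrived so far, and the stream never has to be complete.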
|
16
|
- Foster human-machine synergy
- Exploit complementary strengths
- Accommodate shared weaknesses
- Divide-and-conquer
- Divide task into stages with well-defined interfaces
- Continue dividing until problems are easily solved
- Co-design related components
- Iterative process of joint optimization
|
17
|
- Strategy: use encapsulation to limit complexity
- Approach:
- Define interfaces (input and output) for each component
- Define the functions performed by each component
- Study each component in isolation
- Repeat the process within components as needed
- Make sure that this decomposition makes sense
- Result: a hierarchical decomposition
|
18
|
|
19
|
- Machines are good at:
- Doing simple things accurately and quickly
- Scaling to larger collections in sublinear time
- People are better at:
- Accurately recognizing what they are looking for
- Evaluating intangibles such as “quality”
- Both are pretty bad at:
- Mapping consistently between words and concepts
|
20
|
|
21
|
|
22
|
- Study the IR black box in isolation
- Simple behavior: in goes a query, out come documents
- Optimize the choice of documents that come out
|
23
|
|
24
|
|
25
|
|
26
|
- What are examples of databases?
- Banks storing account information
- Retailers storing inventories
- Universities storing student grades
- What exactly is a (relational) database?
- Think of them as a collection of tables
- They model some aspect of “the world”
|
27
|
|
28
|
- What would you want to know from a database?
- What classes is John Arrow enrolled in?
- Who has the highest grade in LBSC 690?
- Who’s in the history department?
- Of all the non-CLIS students taking LBSC 690 who have a last name shorter than six characters and were born on a Monday, who has the longest email address?
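Two of these questions can be sketched as queries against a small relational database, here using Python's built-in `sqlite3` module. The table layout, column names, and sample rows are invented for illustration; only the questions come from the slide.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, department TEXT)")
conn.execute("CREATE TABLE enrollment (student_id INTEGER, course TEXT, grade REAL)")

# Made-up sample rows (assumptions, not real data).
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "John Arrow", "History"), (2, "Jane Doe", "CLIS")],
)
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?)",
    [(1, "LBSC 690", 88.0), (1, "HIST 200", 91.5), (2, "LBSC 690", 94.0)],
)

# What classes is John Arrow enrolled in?
classes = [row[0] for row in conn.execute(
    "SELECT e.course FROM enrollment e "
    "JOIN student s ON s.id = e.student_id "
    "WHERE s.name = ? ORDER BY e.course",
    ("John Arrow",),
)]

# Who has the highest grade in LBSC 690?
top_student = conn.execute(
    "SELECT s.name FROM enrollment e "
    "JOIN student s ON s.id = e.student_id "
    "WHERE e.course = ? ORDER BY e.grade DESC LIMIT 1",
    ("LBSC 690",),
).fetchone()[0]
```

Each question has one exact answer determined by the tables, which is the contrast with ranked retrieval that the later slides develop.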
|
29
|
|
30
|
- Bag = a “set” that can contain duplicates
- “The quick brown fox jumped over the lazy dog’s back” →
- {back, brown, dog, fox, jump, lazy, over, quick, the, the}
- Vector = values recorded in any consistent order
- {back, brown, dog, fox, jump, lazy, over, quick, the, the} →
- [1 1 1 1 1 1 1 1 2]
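The two steps on this slide can be reproduced with a short sketch. The `normalize` function is a crude stand-in (an assumption) that strips "'s" and "ed" so that "dog's" becomes "dog" and "jumped" becomes "jump"; it is not a real stemmer and would mangle many other words.

```python
from collections import Counter

def normalize(token):
    """Crude, illustrative normalization; not a real stemmer."""
    token = token.lower()
    if token.endswith("'s"):
        token = token[:-2]
    if token.endswith("ed"):
        token = token[:-2]
    return token

sentence = "The quick brown fox jumped over the lazy dog's back"

# Bag: a multiset of terms, so duplicates such as "the" are kept.
bag = Counter(normalize(token) for token in sentence.split())

# Vector: counts recorded in one consistent order (alphabetical here).
vocabulary = sorted(bag)
vector = [bag[term] for term in vocabulary]

print(sorted(bag.elements()))
print(vector)
```

Alphabetical order is only one possible choice; any order works as long as every document uses the same one.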
|
31
|
|
32
|
- Closer to the way people think
- Some documents are better than others
- Enriches browsing behavior
- Decide how far down the list to go as you read it
- Allows more flexible queries
- Long and short queries can produce useful results
|
33
|
- Terms tell us about documents
- If “rabbit” appears a lot, it may be about rabbits
- Documents tell us about terms
- “the” is in every document -- not discriminating
- Documents are most likely described well by rare terms that occur in them frequently
- Higher “term frequency” is stronger evidence
- Low “document frequency” makes it stronger still
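One common way to combine the two kinds of evidence is to multiply term frequency by the log of the inverse document frequency. The slide does not commit to a formula, so the weighting below, the function name `tf_idf`, and the sample documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    "the rabbit ate the rabbit food",
    "the dog chased the cat",
    "the cat sat",
]
weights = tf_idf(docs)
```

In this toy collection "the" appears in every document, so its weight is zero, while "rabbit" is both frequent in the first document and rare across the collection, so it gets the highest weight there.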
|
34
|
- Long documents have an unfair advantage
- They use a lot of terms
- So they get more matches than short documents
- And they use the same words repeatedly
- So they have much higher term frequencies
- Normalization seeks to remove these effects
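A minimal sketch of one such normalization, dividing each count by the Euclidean length of the document's term vector (cosine normalization). The slide does not name a specific method, so this choice and the helper names are assumptions.

```python
import math
from collections import Counter

def term_vector(text):
    """Raw term counts for a document."""
    return Counter(text.lower().split())

def cosine_normalize(vector):
    """Scale a term vector to unit Euclidean length."""
    length = math.sqrt(sum(count * count for count in vector.values()))
    if length == 0:
        return {}
    return {term: count / length for term, count in vector.items()}

short_doc = "rabbit food"
long_doc = " ".join([short_doc] * 10)  # same content repeated ten times

raw_short = term_vector(short_doc)
raw_long = term_vector(long_doc)
norm_short = cosine_normalize(raw_short)
norm_long = cosine_normalize(raw_long)
```

Before normalization the long document has ten times the term frequency for "rabbit"; afterwards both documents have identical weights, so repetition alone no longer gives the long document an advantage.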
|
35
|
- Homonymy
- Terms may have many unrelated meanings
- Polysemy (related meanings) is less of a problem
- Synonymy
- Many ways of saying (nearly) the same thing
- Anaphora
- Alternate ways of referring to the same thing
|
36
|
|
37
|
- New concepts
- Users and indexers may think differently
- Using thesauri effectively requires training
|
38
|
|
39
|
|
40
|
|
41
|
- Protecting privacy
- What absolute assurances can we provide?
- How can we make remaining risks understood?
- Scalable rating servers
- Is a fully distributed architecture practical?
- Non-cooperative users
- How can the effect of spamming be limited?
|
42
|
|
43
|
- Four Factors, working together
- User
- Process
- System
- Collection
|
44
|
- Appreciate IR system capabilities and limitations
- Understand IR system design & implementation
- For a broad range of applications and media
- Evaluate IR system performance
- Identify current IR research problems
|
45
|
- Text/readings provide background and detail
- At least one recommended reading is required
- Class provides organization and direction
- We will not cover every important detail
- Assignments and project provide experience
- The TA can help with the project
- Final exam helps focus your effort
|
46
|
- Everyone:
- LBSC 690 or INFM 603 or equivalent
- Comfortable with learning about technology
- MIM Students:
- Basic systems analysis, scripting languages
- Some programming is helpful
- MLS students:
- LBSC 650 and LBSC 670
- LBSC 750 or a subject access course is helpful
|
47
|
- Assignments (20%)
- Mastery of concepts and experience using tools
- Term project (50%)
- Options are described on course Web page
- Final exam (30%)
|
48
|
- Classes will be videotaped
- Available outside my office
- Office hours: 5 PM Mondays
- Or schedule by email, or ask after class
- Everything is on the Web
- At http://www.glue.umd.edu/~oard
- Doug is most easily reached by email
|
49
|
- Assignment 1
- Due at 6 PM next Monday!!
- At least skim the readings before class
- Explore the Web site
- Start thinking about the term project
|