1
|
- LBSC 796/INFM 718R
- Session 5, October 7, 2007
- Douglas W. Oard
|
2
|
- Questions
- Evaluation fundamentals
- System-centered strategies
- User-centered strategies
|
3
|
- Formulate a research question: the hypothesis
- Design an experiment to answer the question
- Perform the experiment
- Compare with a baseline “control”
- Does the experiment answer the question?
- Are the results significant? Or is it just luck?
- Report the results!
|
4
|
- Effectiveness
- System-only, human+system
- Efficiency
- Retrieval time, indexing time, index size
- Usability
- Learnability, novice use, expert use
|
5
|
- User-centered strategy
- Given several users, and at least 2 retrieval systems
- Have each user try the same task on both systems
- Measure which system works the “best”
- System-centered strategy
- Given documents, queries, and relevance judgments
- Try several variations on the retrieval system
- Measure which ranks more good docs near the top
|
6
|
- Capture some aspect of what the user wants
- Have predictive value for other situations
- Different queries, different document collection
- Easily replicated by other researchers
- Easily compared
- Optimally, expressed as a single number
|
7
|
- Achieve a meaningful improvement
- An application-specific judgment call
- Achieve reliable improvement in unseen cases
- Can be verified using statistical tests
|
8
|
- Evaluation by inspection of examples
- Evaluation by demonstration
- Evaluation by improvised demonstration
- Evaluation on data using a figure of merit
- Evaluation on test data
- Evaluation on common test data
- Evaluation on common, unseen test data
|
9
|
|
10
|
|
11
|
- Representative document collection
- Size, sources, genre, topics, …
- “Random” sample of representative queries
- Built somehow from “formalized” topic statements
- Known binary relevance
- For each topic-document pair (topic, not query!)
- Assessed by humans, used only for evaluation
- Measure of effectiveness
- Used to compare alternate systems
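A minimal sketch of how such topic-document judgments are often stored and loaded, assuming the four-column TREC qrels layout (topic, iteration, document id, binary relevance); the file name and variable names are illustrative only:

```python
from collections import defaultdict

def load_qrels(path):
    # Each qrels line has four whitespace-separated fields:
    #   topic  iteration  docno  relevance   (0 = not relevant, 1 = relevant)
    qrels = defaultdict(dict)              # qrels[topic][docno] -> 0 or 1
    with open(path) as f:
        for line in f:
            topic, _iteration, docno, rel = line.split()
            qrels[topic][docno] = int(rel)
    return qrels

qrels = load_qrels("qrels.txt")            # hypothetical file name
for topic, judged in sorted(qrels.items()):
    print(topic, len(judged), "judged,", sum(judged.values()), "relevant")
```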
|
12
|
|
13
|
- Relevance relates a topic and a document
- Duplicates are equally relevant by definition
- Constant over time and across users
- Pertinence relates a task and a document
- Accounts for quality, complexity, language, …
- Utility relates a user and a document
- Accounts for prior knowledge
|
14
|
|
15
|
- Precision
- How much of what was found is relevant?
- Often of interest, particularly for interactive searching
- Recall
- How much of what is relevant was found?
- Particularly important for law, patents, and medicine
- Fallout
- How much of what was irrelevant was retrieved?
- Useful when different size collections are compared
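A toy worked example of these three set-based measures; the document identifiers and collection size are made up:

```python
# 5 documents retrieved from a 100-document collection; 4 documents are relevant.
retrieved = {"d1", "d2", "d3", "d4", "d5"}
relevant = {"d2", "d5", "d7", "d9"}
collection_size = 100

hits = retrieved & relevant                          # relevant documents that were found

precision = len(hits) / len(retrieved)               # 2 / 5  = 0.40
recall = len(hits) / len(relevant)                   # 2 / 4  = 0.50
fallout = len(retrieved - relevant) / (collection_size - len(relevant))   # 3 / 96 ≈ 0.03

print(precision, recall, fallout)
```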
|
16
|
|
17
|
- Balanced F-measure
- Harmonic mean of recall and precision
- Weakness: What if no relevant documents exist?
- Cost function
- Reward relevant retrieved, Penalize non-relevant
- Weakness: Hard to normalize, so hard to average
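A minimal sketch of the balanced F-measure, including the zero-denominator case behind the weakness noted above (purely illustrative code):

```python
def f1(precision, recall):
    # Balanced F-measure: harmonic mean of precision and recall.
    # When both are zero (e.g., no relevant documents retrieved), the harmonic
    # mean is undefined; returning 0.0 here is one common convention.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.4, 0.5))   # 0.444...
```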
|
18
|
- Expected search length
- Average rank of the first relevant document
- Mean precision at a fixed number of documents
- Precision at 10 docs is often used for Web search
- Mean precision at a fixed recall level
- Adjusts for the total number of relevant docs
- Mean breakeven point
- Value at which precision = recall
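A short sketch of two of these rank-based measures (precision at a fixed cutoff, and the rank of the first relevant document) on a single made-up ranked list:

```python
ranking = ["d3", "d7", "d1", "d9", "d4", "d2", "d8", "d5", "d6", "d10"]   # retrieval order
relevant = {"d7", "d9", "d2", "d5"}                                       # judged relevant

def precision_at_k(ranking, relevant, k):
    # Precision after examining the top k documents (e.g., k = 10 for Web search).
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def rank_of_first_relevant(ranking, relevant):
    # 1-based rank of the first relevant document, the quantity averaged
    # over queries in expected-search-length style measures.
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return rank
    return None

print(precision_at_k(ranking, relevant, 10))      # 0.4
print(rank_of_first_relevant(ranking, relevant))  # 2
```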
|
19
|
|
20
|
- In TREC, a statement of information need is called a topic
|
21
|
- Example “questions”:
- Does morphological analysis improve retrieval performance?
- Does expanding the query with synonyms improve retrieval performance?
- Corresponding experiments:
- Build a “stemmed” index and compare against an “unstemmed” baseline
- Expand queries with synonyms and compare against baseline unexpanded queries
|
22
|
|
23
|
- Plot each (recall, precision) point on a graph
- Visually represent the precision/recall tradeoff
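A sketch of how the (recall, precision) points for one ranked list can be computed before plotting; the ranking and judgments are invented for illustration:

```python
ranking = ["d3", "d7", "d1", "d9", "d4"]    # retrieval order
relevant = {"d7", "d9", "d2"}               # judged relevant

points = []
hits = 0
for k, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
    points.append((hits / len(relevant), hits / k))   # (recall, precision) at rank k

print(points)
# [(0.0, 0.0), (0.33, 0.5), (0.33, 0.33), (0.67, 0.5), (0.67, 0.4)]  (rounded)
```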
|
24
|
- Average of precision at each retrieved relevant document
- Relevant documents not retrieved contribute zero to score
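A minimal implementation of uninterpolated average precision for a single topic, matching the description above (relevant documents never retrieved contribute zero); the example data is made up:

```python
def average_precision(ranking, relevant):
    # Average the precision at the rank of each retrieved relevant document;
    # dividing by the total number of relevant documents makes the ones that
    # were never retrieved contribute zero.
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d7", "d9", "d2"}                # d2 is never retrieved, so it adds 0
print(average_precision(ranking, relevant))  # (1/2 + 2/4) / 3 = 0.333...
```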
|
25
|
|
26
|
|
27
|
|
28
|
- It is easy to trade between recall and precision
- Adding related query terms improves recall
- But naive query expansion techniques kill precision
- Limiting matches by part-of-speech helps precision
- But it almost always hurts recall
- Comparisons should give some weight to both
- Average precision is a principled way to do this
- More “central” than other available measures
|
29
|
|
30
|
|
31
|
|
32
|
- Exhaustive assessment is usually impractical
- Topics × documents = a large number!
- Pooled assessment leverages cooperative evaluation
- Requires a diverse set of IR systems
- Search-guided assessment is sometimes viable
- Iterate between topic research/search/assessment
- Augment with review, adjudication, reassessment
- Known-item judgments have the lowest cost
- Tailor queries to retrieve a single known document
- Useful as a first cut to see if a new technique is viable
|
33
|
- Exhaustive assessment can be too expensive
- TREC has 50 queries for >1 million docs each year
- Random sampling won’t work
- If relevant docs are rare, none may be found!
- IR systems can help focus the sample
- Each system finds some relevant documents
- Different systems find different relevant documents
- Together, enough systems will find most of them
|
34
|
- Systems submit top 1000 documents per topic
- Top 100 documents for each are judged
- Single pool, without duplicates, arbitrary order
- Judged by the person who wrote the topic
- Treat unevaluated documents as not relevant
- Compute MAP down to 1000 documents
- Treat precision for complete misses as 0.0
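A hedged sketch of the pooling step described above: merge the top-ranked documents from each run into one duplicate-free pool for assessment. The depth of 100 follows the slide; the run names and rankings are hypothetical:

```python
POOL_DEPTH = 100    # top 100 documents from each run are judged

def build_pool(runs, depth=POOL_DEPTH):
    # runs: run name -> ranked list of document ids for one topic
    pool = set()                          # a set removes duplicates automatically
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return sorted(pool)                   # order handed to the assessor is arbitrary

runs = {
    "runA": ["d1", "d2", "d3", "d4"],
    "runB": ["d3", "d5", "d1", "d6"],
}
print(build_pool(runs, depth=3))          # ['d1', 'd2', 'd3', 'd5']
```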
|
35
|
- Judgments can’t possibly be exhaustive!
- This is only one person’s opinion about relevance
- What about hits 101 to 1000?
- We can’t possibly use judgments to evaluate a system that didn’t participate in the evaluation!
|
36
|
- Incomplete judgments are useful
- If sample is unbiased with respect to systems tested
- Different relevance judgments change absolute score
- But rarely change comparative advantages when averaged
- Evaluation technology is predictive
- Results transfer to operational settings
|
37
|
- Additional relevant documents are:
- roughly uniform across systems
- highly skewed across topics
- Systems that don’t contribute to the pool get comparable results
|
38
|
|
39
|
|
40
|
- Mean Kendall τ between system rankings produced from different qrel sets: 0.938
- Similar results held for
- Different query sets
- Different evaluation measures
- Different assessor types
- Single opinion vs. group opinion judgments
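A sketch of how such a rank correlation can be computed: score each system under two different qrel sets, rank the systems, and compare the rankings with Kendall's τ (via scipy). The MAP values below are invented:

```python
from scipy.stats import kendalltau

# MAP for four systems under the official qrels and an alternative qrel set.
map_official = {"sysA": 0.31, "sysB": 0.28, "sysC": 0.22, "sysD": 0.19}
map_alternative = {"sysA": 0.29, "sysB": 0.30, "sysC": 0.21, "sysD": 0.18}

systems = sorted(map_official)
tau, p_value = kendalltau([map_official[s] for s in systems],
                          [map_alternative[s] for s in systems])
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```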
|
41
|
- How sure can you be that an observed difference doesn’t simply result from the particular queries you chose?
|
42
|
|
43
|
- Measuring improvement
- Achieve a meaningful improvement
- Guideline: 0.05 is noticeable, 0.1 makes a difference
- Achieve reliable improvement on “typical” queries
- Wilcoxon signed rank test for paired samples (see the sketch below)
- Know when to stop!
- Inter-assessor agreement limits max precision
- Using one judge to assess the other yields about 0.8
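A sketch of the paired-by-topic significance test mentioned above, using scipy's Wilcoxon signed rank test on hypothetical per-topic average precision scores:

```python
from scipy.stats import wilcoxon

# Per-topic average precision for a baseline and an experimental system (invented).
baseline = [0.21, 0.35, 0.10, 0.44, 0.28, 0.19, 0.33, 0.25]
experimental = [0.25, 0.38, 0.12, 0.41, 0.35, 0.22, 0.40, 0.27]

statistic, p_value = wilcoxon(baseline, experimental)
print(f"W = {statistic}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Unlikely to be due to the particular topics chosen")
```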
|
44
|
- Evaluation measures focus on relevance
- Users also want utility and understandability
- Goal is to compare systems
- Values may vary, but relative differences are stable
- Mean values obscure important phenomena
- Augment with failure analysis/significance tests
|
45
|
|
46
|
|
47
|
- Goal is to account for interface issues
- By studying the interface component
- By studying the complete system
- Formative evaluation
- Provide a basis for system development
- Summative evaluation
- Designed to assess performance
|
48
|
- Example “questions”:
- Does keyword highlighting help users evaluate document relevance?
- Is letting users weight search terms a good idea?
- Corresponding experiments:
- Build two different interfaces, one with keyword highlighting, one without; run a user study
- Build two different interfaces, one with term weighting functionality, and one without; run a user study
|
49
|
- A classic study of retrieval effectiveness
- Earlier studies used unrealistically small collections
- Studied an archive of documents for a lawsuit
- 40,000 documents, ~350,000 pages of text
- 40 different queries
- Used IBM’s STAIRS full-text system
- Approach:
- Lawyers wanted at least 75% of all relevant documents
- Precision and recall evaluated only after the lawyers were satisfied with the results
|
50
|
- Mean precision: 79%
- Mean recall: 20% (!!)
- Why was recall so low?
- Users can’t anticipate terms used in relevant documents
- Differing technical terminology
- Slang, misspellings
- Other findings:
- Searches by both lawyers had similar performance
- Lawyer’s recall was not much different from paralegal’s
|
51
|
- Select independent variable(s)
- e.g., what info to display in selection interface
- Select dependent variable(s)
- e.g., time to find a known relevant document
- Run subjects in different orders (see the ordering sketch below)
- Average out learning and fatigue effects
- Compute statistical significance
- Null hypothesis: independent variable has no effect
- Rejected if p<0.05
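One simple way to run subjects in different orders is cyclic (Latin-square style) counterbalancing of condition order across subjects; the condition names below are hypothetical:

```python
def presentation_orders(conditions, n_subjects):
    # Rotate the starting condition for each successive subject so that every
    # condition appears in every position roughly equally often.
    k = len(conditions)
    return [conditions[s % k:] + conditions[:s % k] for s in range(n_subjects)]

for subject, order in enumerate(presentation_orders(["SystemA", "SystemB"], 4), start=1):
    print(f"Subject {subject}: {' then '.join(order)}")
```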
|
52
|
- System
- Topic
- Sample topic space, compute expected value
- Topic+System
- Pair by topic and compute statistical significance
- Collection
- Repeat the experiment using several collections
|
53
|
- Learning
- Vary topic presentation order
- Fatigue
- Vary system presentation order
- Topic+User (Expertise)
- Ask about prior knowledge of each topic
|
54
|
|
55
|
- Do batch (black box) and user evaluations give the same results? If not, why?
- Two different tasks:
- Instance recall (6 topics)
- Question answering (8 topics)
|
56
|
- Compared two systems:
- a baseline system
- an improved system that was provably better in batch evaluations
- Results:
|
57
|
|
58
|
- Observe user behavior
- Instrumented software, eye trackers, etc.
- Face and keyboard cameras
- Think-aloud protocols
- Interviews and focus groups
- Organize the data
- For example, group it into overlapping categories
- Look for patterns and themes
- Develop a “grounded theory”
|
59
|
|
60
|
|
61
|
- Demographic data
- For example, computer experience
- Basis for interpreting results
- Subjective self-assessment
- Which did they think was more effective?
- Often at variance with objective results!
- Preference
- Which interface did they prefer? Why?
|
62
|
- Qualitative user studies suggest what to build
- Design decomposes task into components
- Automated evaluation helps to refine components
- Quantitative user studies show how well it works
|
63
|
- If I demonstrated a new retrieval technique that achieved a statistically significant improvement in average precision on the TREC collection, what would be the most serious limitation to consider when interpreting that result?
|