1
|
- LBSC 796/CMSC 828o
- Session 8, March 15, 2004
- Douglas W. Oard
|
2
|
- Visceral
- “I’ll know it when I see it” (expected answer)
- Conscious
- Might be true for known item retrieval
- Formalized
- Used in the TREC interactive evaluations
- Compromised
- I can’t imagine a case where this would be right
|
3
|
- Questions
- Evaluation fundamentals
- System-centered strategies
- User-centered strategies
|
4
|
- Effectiveness
- System-only, human+system
- Efficiency
- Retrieval time, indexing time, index size
- Usability
- Learnability, novice use, expert use
|
5
|
- User-centered strategy
- Given several users, and at least 2 retrieval systems
- Have each user try the same task on both systems
- Measure which system works the “best”
- System-centered strategy
- Given documents, queries, and relevance judgments
- Try several variations on the retrieval system
- Measure which ranks more good docs near the top
|
6
|
- Capture some aspect of what the user wants
- Have predictive value for other situations
- Different queries, different document collection
- Easily replicated by other researchers
- Easily compared
- Optimally, expressed as a single number
|
7
|
- Achieve a meaningful improvement
- An application-specific judgment call
- Achieve reliable improvement in unseen cases
- Can be verified using statistical tests
|
8
|
- Evaluation by inspection of examples
- Evaluation by demonstration
- Evaluation by improvised demonstration
- Evaluation on data using a figure of merit
- Evaluation on test data
- Evaluation on common test data
- Evaluation on common, unseen test data
|
9
|
|
10
|
- Representative document collection
- Size, sources, genre, topics, …
- “Random” sample of representative queries
- Built somehow from “formalized” topic statements
- Known binary relevance
- For each topic-document pair (topic, not query!)
- Assessed by humans, used only for evaluation
- Measure of effectiveness
- Used to compare alternate systems
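As an aside (not on the original slide), the components above can be held in a few simple Python structures; every identifier here is purely illustrative:

```python
# Hypothetical in-memory layout for a Cranfield-style test collection.
documents = {"d1": "text of first document", "d2": "text of second document"}
topics = {"t1": "formalized statement of the first topic"}

# Binary relevance judgments, keyed by (topic, document) pair --
# note that the key is the topic, not any particular query built from it.
qrels = {("t1", "d1"): True, ("t1", "d2"): False}
```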
|
11
|
- Relevance relates a topic and a document
- Duplicates are equally relevant by definition
- Constant over time and across users
- Pertinence relates a task and a document
- Accounts for quality, complexity, language, …
- Utility relates a user and a document
- Accounts for prior knowledge
|
12
|
- Precision
- How much of what was found is relevant?
- Often of interest, particularly for interactive searching
- Recall
- How much of what is relevant was found?
- Particularly important for law, patents, and medicine
- Fallout
- How much of what was irrelevant was rejected?
- Useful when different size collections are compared
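For reference, the three measures above are conventionally written in set notation, with Ret the retrieved set, Rel the relevant set, and C the whole collection (a standard formulation, not quoted from the slide):

```latex
\mathrm{Precision} = \frac{|Ret \cap Rel|}{|Ret|}, \qquad
\mathrm{Recall} = \frac{|Ret \cap Rel|}{|Rel|}, \qquad
\mathrm{Fallout} = \frac{|Ret \setminus Rel|}{|C \setminus Rel|}
```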
|
13
|
|
14
|
- Balanced F-measure
- Harmonic mean of recall and precision
- Weakness: What if no relevant documents exist?
- Cost function
- Reward relevant retrieved, Penalize non-relevant
- Weakness: Hard to normalize, so hard to average
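For reference, the balanced F-measure is the harmonic mean of precision P and recall R (a standard form, not quoted from the slide); the weakness noted above arises because recall, and hence F, is undefined when no relevant documents exist:

```latex
F = \frac{2PR}{P + R}
```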
|
15
|
|
16
|
|
17
|
|
18
|
- Expected search length
- Average rank of the first relevant document
- Mean precision at a fixed number of documents
- Precision at 10 docs is often used for Web search
- Mean precision at a fixed recall level
- Adjusts for the total number of relevant docs
- Mean breakeven point
- Value at which precision = recall
- Mean Average Precision (MAP)
- Interpolated: Avg precision at recall = 0.0, 0.1, …, 1.0
- Uninterpolated: Avg precision at each relevant doc
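A minimal sketch of uninterpolated average precision for a single topic, taking precision at the rank of each relevant document; the function and variable names are illustrative, not from the slides:

```python
def average_precision(ranked_docs, relevant):
    """Uninterpolated AP: mean precision at each relevant document.

    ranked_docs: doc ids in ranked order (e.g., top 1000 for a TREC run)
    relevant:    set of doc ids judged relevant for this topic
    """
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    # Relevant documents that were never retrieved contribute 0.0.
    return sum(precisions) / len(relevant) if relevant else 0.0

# MAP is then the mean of average_precision over all topics.
```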
|
19
|
|
20
|
|
21
|
|
22
|
- Expected search length
- Quantization noise requires many more topics
- Mean precision at 10 documents
- Some topics don’t have 10 relevant documents
- Mean precision at constant recall
- A specific fraction is rarely the user’s goal
- Mean breakeven point
- Nobody ever searches at the breakeven point
- Mean Average Precision
- Users may care more about precision or recall
|
23
|
|
24
|
|
25
|
|
26
|
- It is easy to trade between recall and precision
- Adding related query terms improves recall
- But naive query expansion techniques kill precision
- Limiting matches by part-of-speech helps precision
- But it almost always hurts recall
- Comparisons should give some weight to both
- Average precision is a principled way to do this
- More “central” than other available measures
|
27
|
- Exhaustive assessment can be too expensive
- TREC has 50 queries for >1 million docs each year
- Random sampling won’t work
- If relevant docs are rare, none may be found!
- IR systems can help focus the sample
- Each system finds some relevant documents
- Different systems find different relevant documents
- Together, enough systems will find most of them
|
28
|
- Systems submit top 1000 documents per topic
- Top 100 documents for each are judged
- Single pool, without duplicates, arbitrary order
- Judged by the person that wrote the query
- Treat unevaluated documents as not relevant
- Compute MAP down to 1000 documents
- Treat precision for complete misses as 0.0
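A sketch of how the judgment pool described above might be assembled from several submitted runs; the run layout is an assumption made for illustration:

```python
def build_pool(runs, depth=100):
    """Pool the top `depth` documents of each run, per topic.

    runs: dict mapping run name -> {topic: ranked list of doc ids}
    Returns {topic: set of doc ids to judge}; using a set removes
    duplicates and deliberately discards ranking order.
    """
    pool = {}
    for ranking_by_topic in runs.values():
        for topic, ranked_docs in ranking_by_topic.items():
            pool.setdefault(topic, set()).update(ranked_docs[:depth])
    return pool
```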
|
29
|
- Exhaustive assessment is usually impractical
- Topics * documents = a large number!
- Pooled assessment leverages cooperative evaluation
- Requires a diverse set of IR systems
- Search-guided assessment is sometimes viable
- Iterate between topic research/search/assessment
- Augment with review, adjudication, reassessment
- Known-item judgments have the lowest cost
- Tailor queries to retrieve a single known document
- Useful as a first cut to see if a new technique is viable
|
30
|
- Incomplete judgments are useful
- If sample is unbiased with respect to systems tested
- Different relevance judgments change absolute score
- But rarely change comparative advantages when averaged
- Evaluation technology is predictive
- Results transfer to operational settings
|
31
|
- Additional relevant documents are:
- roughly uniform across systems
- highly skewed across topics
- Systems that don’t contribute to the pool get comparable results
|
32
|
|
33
|
|
34
|
- Mean Kendall τ between system rankings produced from different qrel sets: 0.938
- Similar results held for
- Different query sets
- Different evaluation measures
- Different assessor types
- Single opinion vs. group opinion judgments
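A sketch of how a Kendall τ correlation between two orderings of the same systems could be computed, assuming SciPy is available; the rank values below are invented for illustration:

```python
from scipy.stats import kendalltau

# Ranks of the same five systems under two different qrel sets (made up).
ranks_under_qrels_a = [1, 2, 3, 4, 5]
ranks_under_qrels_b = [2, 1, 3, 4, 5]

tau, p_value = kendalltau(ranks_under_qrels_a, ranks_under_qrels_b)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```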
|
35
|
- How sure can you be that an observed difference doesn’t simply result from the particular queries you chose?
|
36
|
|
37
|
|
38
|
|
39
|
- Choose some experimental variable X
- Establish a null hypothesis H0
- Identify distribution of X, assuming H0 is true
- Do experiment to get an empirical value x
- Ask: “What’s the probability I would have gotten this value of x if H0 were true?”
- If this probability p is low, reject H0
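One concrete way to carry out these steps for two IR systems is a randomization (sign-flipping permutation) test on paired per-topic scores; the choice of test here is an assumption for illustration, not something the slide prescribes:

```python
import random

def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Estimate p = P(|mean difference| >= observed) under H0: no difference.

    scores_a, scores_b: per-topic scores for two systems, paired by topic.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    extreme = 0
    for _ in range(trials):
        # Under H0 the sign of each per-topic difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped)) / len(flipped) >= observed:
            extreme += 1
    return extreme / trials
```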
|
40
|
|
41
|
- Measuring improvement
- Achieve a meaningful improvement
- Guideline: 0.05 is noticeable, 0.1 makes a difference
- Achieve reliable improvement on “typical” queries
- Wilcoxon signed rank test for paired samples
- Know when to stop!
- Inter-assessor agreement limits max precision
- Using one judge to assess the other yields about 0.8
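A sketch of the Wilcoxon signed rank test mentioned above, assuming SciPy; the per-topic average precision values are invented for illustration:

```python
from scipy.stats import wilcoxon

# Hypothetical per-topic average precision for a baseline and a new system.
baseline = [0.21, 0.35, 0.10, 0.47, 0.28, 0.33, 0.19, 0.52]
improved = [0.25, 0.36, 0.12, 0.45, 0.31, 0.40, 0.22, 0.55]

statistic, p_value = wilcoxon(baseline, improved)
print(f"Wilcoxon statistic = {statistic}, p = {p_value:.3f}")
# Reject H0 (no reliable difference on typical queries) if p < 0.05.
```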
|
42
|
- Thou shalt define insightful evaluation metrics
- Thou shalt define replicable evaluation metrics
- Thou shalt report all relevant system parameters
- Thou shalt establish upper bounds on performance
- Thou shalt establish lower bounds on performance
- Thou shalt test differences for statistical significance
- Thou shalt say whether differences are meaningful
- Thou shalt not mingle training data with test data
- Thou shalt not mingle training data with test data
- Thou shalt not mingle training data with test data
|
43
|
- Evaluation measures focus on relevance
- Users also want utility and understandability
- Goal is to compare systems
- Values may vary, but relative differences are stable
- Mean values obscure important phenomena
- Augment with failure analysis/significance tests
|
44
|
|
45
|
|
46
|
- Goal is to account for interface issues
- By studying the interface component
- By studying the complete system
- Formative evaluation
- Provide a basis for system development
- Summative evaluation
- Designed to assess performance
|
47
|
- Select independent variable(s)
- e.g., what info to display in selection interface
- Select dependent variable(s)
- e.g., time to find a known relevant document
- Run subjects in different orders
- Average out learning and fatigue effects
- Compute statistical significance
- Null hypothesis: independent variable has no effect
- Rejected if p<0.05
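One common way to run subjects in different orders is a cyclic Latin square over presentation conditions; this is an illustrative device, not a procedure taken from the slide:

```python
def latin_square(conditions):
    """Cyclic Latin square: each condition appears once in every position.

    conditions: list of condition labels (e.g., system or topic-block names).
    Returns one presentation order per subject group.
    """
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)]
            for row in range(n)]

# latin_square(["System A", "System B"]) yields the two counterbalanced
# presentation orders for a two-system study.
```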
|
48
|
- System
- Topic
- Sample topic space, compute expected value
- Topic+System
- Pair by topic and compute statistical significance
- Collection
- Repeat the experiment using several collections
|
49
|
- Learning
- Vary topic presentation order
- Fatigue
- Vary system presentation order
- Topic+User (Expertise)
- Ask about prior knowledge of each topic
|
50
|
|
51
|
|
52
|
- Query Formulation: Uninterpolated Average Precision
- Expected value of precision [over relevant document positions]
- Interpreted based on query content at each iteration
- Document Selection: Unbalanced F-Measure:
- P = precision
- R = recall
- α = 0.8 favors precision
- Models expensive human translation
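The unbalanced F-measure referred to above is conventionally written as follows, where α = 0.8 places more weight on precision than on recall (a standard form, not quoted from the slide):

```latex
F_{\alpha} = \frac{1}{\dfrac{\alpha}{P} + \dfrac{1 - \alpha}{R}}, \qquad \alpha = 0.8
```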
|
53
|
|
54
|
|
55
|
- Observe user behavior
- Instrumented software, eye trackers, etc.
- Face and keyboard cameras
- Think-aloud protocols
- Interviews and focus groups
- Organize the data
- For example, group it into overlapping categories
- Look for patterns and themes
- Develop a “grounded theory”
|
56
|
- Demographic data
- For example, computer experience
- Basis for interpreting results
- Subjective self-assessment
- Which did they think was more effective?
- Often at variance with objective results!
- Preference
- Which interface did they prefer? Why?
|
57
|
|
58
|
|
59
|
|
60
|
|
61
|
|
62
|
- Qualitative user studies suggest what to build
- Design decomposes task into components
- Automated evaluation helps to refine components
- Quantitative user studies show how well it works
|
63
|
- If I demonstrated a new retrieval technique that achieved a statistically significant improvement in average precision on the TREC collection, what would be the most serious limitation to consider when interpreting that result?
|