1
|
- LBSC 796/CMSC 828o
- Session 9, March 29, 2004
- Doug Oard
|
2
|
- Questions
- Finish up evaluation from last time
- Computational complexity
- Inverted indexes
- Project planning
|
3
|
- Goal is to account for interface issues
- By studying the interface component
- By studying the complete system
- Formative evaluation
- Provide a basis for system development
- Summative evaluation
- Designed to assess performance
|
4
|
- Select independent variable(s)
- e.g., what info to display in selection interface
- Select dependent variable(s)
- e.g., time to find a known relevant document
- Run subjects in different orders
- Average out learning and fatigue effects
- Compute statistical significance
- Null hypothesis: independent variable has no effect
- Rejected if p<0.05
|
5
|
- System
- Topic
- Sample topic space, compute expected value
- Topic+System
- Pair by topic and compute statistical significance
- Collection
- Repeat the experiment using several collections
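The pair-by-topic comparison can be sketched with a simple two-sided sign test (a paired t-test or Wilcoxon signed-rank test is more usual in IR evaluation; the sign test is used here only to keep the sketch self-contained):

```python
from math import comb

def sign_test(scores_a, scores_b):
    """Paired two-sided sign test over per-topic scores for two systems.

    Returns the p-value for the null hypothesis that neither system
    tends to beat the other; reject it if p < 0.05.
    """
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    losses = sum(1 for a, b in zip(scores_a, scores_b) if a < b)
    n = wins + losses                      # ties are dropped
    k = min(wins, losses)
    p = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p)                 # two-sided
```

For example, a system that wins on all 6 of 6 topics gives p = 2/64 ≈ 0.031, so the null hypothesis is rejected at the 0.05 level.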
|
6
|
- Learning
- Vary topic presentation order
- Fatigue
- Vary system presentation order
- Topic+User (Expertise)
- Ask about prior knowledge of each topic
|
7
|
|
8
|
|
9
|
- Query Formulation: Uninterpolated Average Precision
- Expected value of precision [over relevant document positions]
- Interpreted based on query content at each iteration
- Document Selection: Unbalanced F-Measure
- P = precision
- R = recall
- α = 0.8 favors precision
- Models expensive human translation
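A minimal sketch of the two measures, assuming rankings are lists of document identifiers and that the unbalanced F-measure takes the usual form F = 1 / (α/P + (1−α)/R):

```python
def average_precision(ranking, relevant):
    """Uninterpolated average precision: the mean of precision
    measured at the rank of each relevant document."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def f_measure(p, r, alpha=0.8):
    """Unbalanced F-measure; alpha = 0.8 weights precision
    more heavily than recall."""
    if p == 0 or r == 0:
        return 0.0
    return 1.0 / (alpha / p + (1 - alpha) / r)
```

With α = 0.8, a high-precision/low-recall result scores better than the reverse, matching the stated preference for precision.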
|
10
|
|
11
|
|
12
|
- Qualitative user studies suggest what to build
- Design decomposes task into components
- Automated evaluation helps to refine components
- Quantitative user studies show how well it works
|
13
|
|
14
|
- How long will it take to find a document?
- Is there any work we can do in advance?
- If so, how long will that take?
- How big a computer will I need?
- How much disk space? How much RAM?
- What if more documents arrive?
- How much of the advance work must be repeated?
- Will searching become slower?
- How much more disk space will be needed?
|
15
|
- Searching is easy - just ask Microsoft!
- “Find” can search my hard drive in a few minutes
- If it only looks at the file names...
- How long would it take for:
- A 100 GB disk?
- For the World Wide Web?
- Computers are getting faster, but…
- How does Google give answers in 3 seconds?
|
16
|
|
17
|
- Time complexity: how long will it take?
- Space complexity: how much memory is needed?
- Things you need to know to assess complexity:
- What is the “size” of the input? (“n”)
- What aspects of the input are we paying attention to?
- How is the input represented?
- How is the output represented?
- What are the internal data structures?
- What is the algorithm?
|
18
|
|
19
|
|
20
|
- Constant, i.e. O(1)
- n doesn’t matter
- Sublinear, e.g. O(log n)
- n = 65536 → log n = 16
- Linear, i.e. O(n)
- n = 65536 → n = 65536
- Polynomial, e.g. O(n³)
- n = 65536 → n³ = 281,474,976,710,656
- Exponential, e.g. O(2ⁿ)
- n = 65536 → beyond astronomical
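These growth rates can be checked directly for n = 65536:

```python
import math

n = 65536
print("O(1):     1")
print("O(log n):", int(math.log2(n)))      # 16
print("O(n):    ", n)                      # 65536
print("O(n^3):  ", n ** 3)                 # 281474976710656
# 2**65536 has nearly 20,000 decimal digits: beyond astronomical
print("O(2^n):   ~10^%d" % (len(str(2 ** n)) - 1))
```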
|
21
|
- Organize the bag of words matrix by terms
- You know the terms that you are looking for
- Look up terms like you search dictionaries
- For each letter, jump directly to the right spot
- For terms of reasonable length, this is very fast
- For each term, store the document identifiers
- For every document that contains that term
- At query time, use the document identifiers
- Consult a “postings file”
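A minimal dictionary-based sketch of this idea (the sample documents and whitespace tokenization are illustrative assumptions, not part of the slide):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of identifiers of the
    documents that contain it (the 'postings file')."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "the quick brown fox",
        2: "the lazy dog",
        3: "the quick dog"}
index = build_index(docs)
print(index["quick"])   # [1, 3]
print(index["dog"])     # [2, 3]
```

At query time, only the postings for the query terms are touched; the rest of the collection is never read.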
|
22
|
|
23
|
|
24
|
- Boolean retrieval
- Ranked Retrieval
- Document number and term weight (TF*IDF, ...)
- Proximity operators
- Word offsets for each occurrence of the term
- Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
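A sketch of a positional index of this kind, with a phrase check built on the stored word offsets (the sample documents are invented for illustration):

```python
def positional_index(docs):
    """Store, for each term, the word offsets of every occurrence
    in every document - what proximity operators need."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

def phrase_match(index, w1, w2):
    """Documents where w2 occurs immediately after w1."""
    hits = []
    for doc_id in index.get(w1, {}):
        if doc_id in index.get(w2, {}):
            p1 = set(index[w1][doc_id])
            if any(p - 1 in p1 for p in index[w2][doc_id]):
                hits.append(doc_id)
    return sorted(hits)

docs = {1: "new york city", 2: "york new haven"}
idx = positional_index(docs)
print(phrase_match(idx, "new", "york"))  # [1]
```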
|
25
|
- Very compact for Boolean retrieval
- About 10% of the size of the documents
- If an aggressive stopword list is used!
- Not much larger for ranked retrieval
- Enormous for proximity operators
- Sometimes larger than the documents!
|
26
|
- Simplest solution is a single sorted array
- Fast lookup using binary search
- But sorting large files on disk is very slow
- And adding one document means starting over
- Tree structures allow easy insertion
- But the worst case lookup time is linear
- Balanced trees provide the best of both
- Fast lookup and easy insertion
- But they require 45% more disk space
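The sorted-array approach can be sketched with Python's standard `bisect` module (the term list is invented for illustration):

```python
import bisect

terms = sorted(["apple", "fox", "quick", "zebra"])   # the sorted term list

def lookup(term):
    """Binary search: O(log n) probes into the sorted array."""
    i = bisect.bisect_left(terms, term)
    return i < len(terms) and terms[i] == term

print(lookup("fox"))    # True
print(lookup("dog"))    # False

# Insertion keeps the array sorted but shifts elements: O(n) per
# insert, which is why balanced trees are preferred for indexes
# that must grow.
bisect.insort(terms, "dog")
print(terms)
```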
|
27
|
|
28
|
|
29
|
- Typically smaller than the postings file
- Depends on number of terms, not documents
- Eventually, most terms will already be indexed
- But the postings file will continue to grow
- Postings dominate asymptotic space complexity
- Linear in the number of documents
|
30
|
- CPU’s are much faster than disks
- A disk can transfer 1,000 bytes in ~20 ms
- The CPU can do ~10 million instructions in that time
- Compressing the postings file is a big win
- Trade decompression time for fewer disk reads
- Key idea: reduce redundancy
- Trick 1: store relative offsets (some will be the same)
- Trick 2: use an optimal coding scheme
|
31
|
- Postings (one byte each = 7 bytes = 56 bits)
- 37, 42, 43, 48, 97, 98, 243
- Differences
- 37, 5, 1, 5, 49, 1, 145
- Optimal Huffman Code
- 0:1, 10:5, 110:37, 1110:49, 1111:145
- Compressed (17 bits)
- 110 10 0 10 1110 0 1111
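This example can be checked in a few lines; the differences and the 17-bit total follow from the postings and the code given above:

```python
postings = [37, 42, 43, 48, 97, 98, 243]

# Trick 1: store differences from the previous entry (the first
# entry is kept as-is); small, repeated gaps compress well.
deltas = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
print(deltas)  # [37, 5, 1, 5, 49, 1, 145]

# Trick 2: the Huffman code from the slide - shortest codewords
# for the most frequent gaps.
code = {1: "0", 5: "10", 37: "110", 49: "1110", 145: "1111"}
encoded = "".join(code[d] for d in deltas)
print(len(encoded))  # 17 bits, versus 7 * 8 = 56 uncompressed
```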
|
32
|
- Indexing
- Walk the inverted file, splitting if needed
- Insert into the postings file in sorted order
- Hours or days for large collections
- Query processing
- Walk the inverted file
- Read the postings file
- Manipulate postings based on query
- Seconds, even for enormous collections
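The postings manipulation for a Boolean AND can be sketched as a linear merge of two sorted postings lists (the sample lists are illustrative):

```python
def intersect(p1, p2):
    """AND of two sorted postings lists by a linear merge:
    O(len(p1) + len(p2)), reading each list once in order."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect([1, 3, 5, 9], [2, 3, 9, 11]))  # [3, 9]
```

Because postings are kept sorted, the merge never backtracks, which keeps disk reads sequential.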
|
33
|
- Slow indexing yields fast query processing
- Key fact: most terms don’t appear in most documents
- We use extra disk space to save query time
- Index space is in addition to document space
- Time and space complexity must be balanced
- Disk block reads are the critical resource
- This makes index compression a big win
|
34
|
- LBSC 796 MLS/MIM
- Option 1: TREC-like IR evaluation (team of 2)
- Option 2: Design and run a user study (team of 3)
- LBSC 796 Ph.D.
- LBSC 828o
|
35
|
- What was the muddiest point in today’s lecture?
|