1
|
- LBSC 796/CMSC 828o
- Douglas W. Oard
- April 12, 2004
- mostly adapted from
- A lecture by David Doermann
|
2
|
- Questions
- Definitions - Document, Image, Retrieval
- Document Image Analysis
- Page decomposition
- Optical character recognition
- Traditional Indexing with Conversion
- Confusion matrix
- Shape codes
- Doing things Without Conversion
- Duplicate Detection, Classification, Summarization, Abstracting
- Keyword spotting, etc
- Example: Chinese document images
|
3
|
- Expand your definition of what is a “DOCUMENT”
- To get an appreciation of the issues in document image indexing
- To look at different ways of solving the same problems with different
media
- Your job: compare/contrast with other media
|
4
|
- DOCUMENT
- Basic Medium for Recording Information
- Transient
- Multiple Forms
- Hardcopy (paper, stone, …) / Electronic (CDROM, Internet, …)
- Written/Auditory/Visual (symbolic, scenic)
- Access Requirements
|
5
|
- The Web
- Some PDF files come from scanned documents
- Arabic news stories are often GIF images
- Digital copiers
- Produce “corporate memory” as a byproduct
- Digitization projects
- Provide improved access to hardcopy documents
|
6
|
- Modality
- Linguistic modalities
- Electronic text, printed, handwritten, spoken, signed
- Nonlinguistic modalities
- Music, drawings, paintings, photographs, video
- Media
- The means by which the expression reaches you
- Internet, videotape, paper, canvas, …
|
7
|
- A collection of dots called “pixels”
- Arranged in a grid and called a “bitmap”
- Pixels often binary-valued (black, white)
- But greyscale or color is sometimes needed
- 300 dots per inch (dpi) gives the best results
- But images are quite large (1 MB per page)
- Faxes are normally about 200 dpi horizontally and 100 or 200 dpi vertically
- Usually stored in TIFF or PDF format
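As a rough check on the storage figure above, this sketch computes the uncompressed size of one scanned page. The US Letter page size (8.5 x 11 inches) and the function name `page_bytes` are assumptions for illustration, not part of the lecture.

```python
def page_bytes(dpi, bits_per_pixel=1, width_in=8.5, height_in=11.0):
    """Uncompressed size in bytes of one scanned page (US Letter assumed)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


binary_300 = page_bytes(300)                    # 1 bit per pixel (black/white)
color_300 = page_bytes(300, bits_per_pixel=24)  # 24-bit color
print(binary_300, color_300)
```

Under these assumptions a binary page at 300 dpi comes to roughly 1 MB, and a 24-bit color page is 24 times larger, which is why binary images are preferred when they suffice.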
|
8
|
- Pixel representation of intensity map
- No explicit “content”, only relations
- Image analysis
- Attempts to mimic human visual behavior
- Draw conclusions, hypothesize and verify
- DOCUMENT
|
9
|
- Scanned Pixel representation of document
- Data Intensive (100-300dpi, 1-24 bpp)
- NO EXPLICIT CONTENT
- Document image analysis or manual annotation required
- takes pixels -> contents
- automatic means are not guaranteed
- Yet we want to be able to process them like text files!
- DOCUMENT
|
10
|
- Collection of scanned images
- Need to be available for indexing and retrieval, abstracting, routing,
editing, dissemination, interpretation …
- DOCUMENT
|
11
|
|
12
|
- Document Image Databases are often influenced by traditional DB indexing
and retrieval philosophies
- We are comfortable with them
- They work
- Problem: Requires content to be accessible
- Techniques:
- Content based retrieval (keywords, natural language)
- Query by structure (logical/physical)
- Query by Functional attributes (titles, bold, …)
- Requirements:
- Ability to Browse, search and read
|
13
|
|
14
|
- General Flow:
- Obtain Image - Digitize
- Preprocessing
- Feature Extraction
- Classification
- General Tasks
- Logical and Physical Page Structure Analysis
- Zone Classification
- Language ID
- Zone Specific Processing
- Recognition
- Vectorization
|
15
|
- Skew correction
- Based on finding the primary orientation of lines
- Image and text region detection
- Based on texture and dominant orientation
- Structural classification
- Infer logical structure from physical layout
- Text region classification
- Title, author, letterhead, signature block, etc.
|
16
|
|
17
|
|
18
|
- Language-independent skew detection
- Accommodate horizontal and vertical writing
- Script class recognition
- Asian scripts have blocky characters
- Connected scripts can’t be segmented easily
- Language identification
- Shape statistics work well for western languages
- Competing classifiers work for Asian languages
|
19
|
- Pattern-matching approach
- Standard approach in commercial systems
- Segment individual characters
- Recognize using a neural network classifier
- Hidden Markov model approach
- Experimental approach developed at BBN
- Segment into sub-character slices
- Limited lookahead to find best character choice
- Useful for connected scripts (e.g., Arabic)
|
20
|
- Character segmentation errors
- In English, segmentation often changes “m” to “rn”
- Character confusion
- Characters with similar shapes often confounded
- OCR on copies is much worse than on originals
- Pixel bloom, character splitting, binding bend
- Uncommon fonts can cause problems
- If not used to train a neural network
|
21
|
- Image preprocessing
- Mathematical morphology for bloom and splitting
- Particularly important for degraded images
- “Voting” between several OCR engines helps
- Individual systems depend on specific training data
- Linguistic analysis can correct some errors
- Use confusion statistics, word lists, syntax, …
- But more harmful errors might be introduced
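A minimal sketch of voting between OCR engines, assuming the engines' outputs have already been aligned token by token (alignment is the hard part in practice and is not shown). The function name `vote` and the sample strings are hypothetical.

```python
from collections import Counter


def vote(outputs):
    """Majority vote per token position over pre-aligned OCR outputs."""
    voted = []
    for candidates in zip(*outputs):
        counts = Counter(candidates)
        best = max(counts.values())
        # Ties go to the earliest engine in the list.
        voted.append(next(c for c in candidates if counts[c] == best))
    return voted


engines = [
    "the rnodern systom".split(),
    "the modern systom".split(),
    "tbe modern system".split(),
]
result = vote(engines)
print(result)
```

In this made-up example the vote repairs “rnodern” and “tbe”, but keeps “systom” because two of the three engines agree on the same mistake. Voting only helps when the engines' errors are reasonably independent.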
|
22
|
- Neural networks take about 10 seconds a page
- Hidden Markov models are slower
- Voting can improve accuracy
- But at a substantial speed penalty
- Easy to speed things up with several machines
- For example, by batch processing - using desktop computers at night
|
23
|
- Can be hard to guess in some cases
- Newspaper columns, figure captions, appendices, …
- Sometimes there are explicit guides
- “Continued on page 4” (but page 4 may be big!)
- Structural cues can help
- Column 1 might continue to column 2
- Content analysis is also useful
- Word co-occurrence statistics, syntax analysis
|
24
|
- Typical Document Image Indexing
- Convert hardcopy to an “electronic” document
- OCR
- Page Layout Analysis
- Graphics Recognition
- Use structure to add metadata
- Manually supplement with keywords
- Use traditional text indexing and retrieval techniques?
|
25
|
- Requires robust ways of indexing
- Statistical methods with large documents work best
- Key Evaluations
- Success for high quality OCR (Croft et al 1994, Taghva 1994)
- Limited success for poor quality OCR (1996 TREC, UNLV)
- Clustering successful for > 85% accuracy (Tsuda et al, 1995)
|
26
|
- Improve OCR
- Automatic Correction
- Enhance IR techniques
- Lopresti and Zhou, 1996
- NGrams
- Applications
- Cornell CS TR Collection (Lagoze et al, 1995)
- Degraded Text Simulator (Doermann and Yao, 1995)
|
27
|
- Powerful, inexpensive statistical method for characterizing populations
- Approach
- Split up document into overlapping n-character sequences (n-grams)
- Use traditional indexing representations to perform analysis
- “DOCUMENT” -> DOC, OCU, CUM, UME, MEN, ENT
- Advantages
- Statistically robust to small numbers of errors
- Rapid indexing and retrieval
- Works from 70%-85% character accuracy where traditional IR fails
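The “DOCUMENT” example above can be sketched as follows. The function name `char_ngrams` and the misrecognized word “DOCUNENT” are illustrative assumptions.

```python
def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


clean = char_ngrams("DOCUMENT")
noisy = char_ngrams("DOCUNENT")  # hypothetical OCR error: M read as N
shared = set(clean) & set(noisy)
print(clean)
print(sorted(shared))
```

A single character error destroys an exact word match, but here three of the six trigrams still match. That partial overlap is what makes n-gram indexing tolerant of a small number of errors.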
|
28
|
- Above 80% character accuracy, use words
- With linguistic correction
- Between 75% and 80%, use n-grams
- With n somewhat shorter than usual
- And perhaps with character confusion statistics
- Below 75%, use word-length shape codes
|
29
|
- With stroke information, can be automated
- Simple things can be read without strokes
- Postal addresses, filled-in forms
- Free text requires human interpretation
- But repeated recognition is then possible
|
30
|
- Full Conversion often required
- Conversion is difficult!
- Noisy data
- Complex Layouts
- Non-text components
|
31
|
|
32
|
- Processing Converted Text
- Manipulating Images of Text
- Title Extraction
- Named Entity Extraction
- Keyword Spotting
- Abstracting and Summarization
- Indexing based on Structure
- Graphics and Drawings
- Related Work and Applications
|
33
|
- Characteristics
- Does not require expensive OCR/Conversion
- Applicable to filtering applications
- May be more robust to noise
- Possible Disadvantages
- Application domain may be very limited
- Processing time may be an issue if indexing is otherwise required
|
34
|
- Problem: Filter proper nouns in images of text
- Advantages of the Image Domain:
- Saves converting all of the text
- Allows application of word recognition approaches
- Limits post-processing to a subset of words
- Able to use features which are not available in the text
- Approach:
- Identify Word Features
- Capitalization, location, length, and syntactic categories
- Classify using rule-set
- Achieve 75-85% accuracy without conversion
|
35
|
- Techniques:
- Word Shape/HMM - (Chen et al, 1995)
- Word Image Matching - (Trenkle and Vogt, 1993; Hull et al)
- Character Stroke Features - (Decurtins and Chen, 1995)
- Shape Coding - (Tanaka and Torii; Spitz 1995; Kia, 1996)
- Applications:
- Filing System (Spitz - SPAM, 1996)
- Numerous IR
- Processing handwritten documents
- Formal Evaluation :
- Scribble vs. OCR (DeCurtins, SDIUT 1997)
|
36
|
- Approach
- Use of Generic Character Descriptors
- Make Use of Power of Language to resolve ambiguity
- Map characters based on shape features including ascenders, descenders,
punctuation and characters with holes
|
37
|
- Group all characters that have similar shapes
- {A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W,
X, Y, Z, 2, 3, 4, 5, 6, 7, 8, 9, 0}
- {a, c, e, n, o, r, s, u, v, x, z}
- {b, d, h, k}
- {f, t}
- {g, p, q, y}
- {i, j, l, 1}
- {m, w}
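The grouping above can be turned into a word-level shape code with a simple lookup table. The single-letter code names (`A`, `x`, `b`, ...) are arbitrary labels chosen for this sketch, and unknown characters map to `?` by assumption.

```python
# One arbitrary label per shape class; the class members come from the slide.
GROUPS = {
    "A": "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567890",
    "x": "acenorsuvxz",
    "b": "bdhk",
    "f": "ft",
    "g": "gpqy",
    "i": "ijl1",
    "m": "mw",
}
CODE_OF = {ch: code for code, chars in GROUPS.items() for ch in chars}


def shape_code(word):
    """Map each character to its shape class; unknown characters become '?'."""
    return "".join(CODE_OF.get(ch, "?") for ch in word)


for word in ["cat", "oat", "dog", "Document"]:
    print(word, shape_code(word))
```

Different words such as “cat” and “oat” collapse to the same code, so a query matches everything with the right shape: the true matches are kept, along with some false ones.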
|
38
|
- Can recognize shapes faster than characters
- Seconds per page, and very accurate
- Preserves recall, but with lower precision
- Useful as a first pass in any system
- Easily extracted from JBIG2 images
- Because JBIG2 uses symbol-based compression
|
39
|
- Handwritten Archival Manuscripts
- Page Classification
- (Decurtins and Chen, 1995)
- Matching Handwritten Records
- Headline Extraction
- Document Image Compression (UMD, 1996-1998)
|
40
|
- Processing Converted Text
- Manipulating Images of Text
- Indexing Based on Structure
- Logical
- Physical
- Functional
- Graphics and Drawings
- Related Work and Applications
|
41
|
- Humans process documents very robustly
- When interacting with documents, we can interpret without recognition
- We can judge relevance without reading
- We can rapidly navigate documents to find the information we want
- Claims
- We must provide basic ways to interact with documents, and
interaction often relies as much on the structure of a document as on
the content
- Traditional geometric properties and type-dependent logical models are
not sufficient
|
42
|
- The role or “function” of a document is to store data in
symbolic form which has been produced by a sender (the author) to facilitate transfer to a
receiver (the reader)
- Documents are designed to be interpreted by humans
- Authors typically tailor this design to optimize the transfer of
information
- Readers use structure to enhance interpretation
- In what ways does the design facilitate, disambiguate or enhance the
flow of information?
|
43
|
|
44
|
|
45
|
|
46
|
|
47
|
- Processing Converted Text
- Manipulating Images of Text
- Indexing based on Structure
- Graphics and Drawings
- Related Work and Applications
|
48
|
- Identify Legend on the Map Image
- Extract images of map labels and descriptions
- Identify labels in the map images
- Allow user to query based on extracted images
- Bootstraps the information extraction and interpretation problems
|
49
|
- Processing Converted Text
- Manipulating Images of Text
- Indexing based on Structure
- Graphics and Drawings
- Related Work and Applications
|
50
|
- Same content, same format
- For example, a xerox copy
- Same content, different format
- For example, as a web page or on paper
- Shared content, same format
- For example, a paper with annotations
- Shared content, different format
- For example, including text with cut-and-paste
|
51
|
|
52
|
- Use global features to restrict search
- Number of pages, number of lines, page moments
- Extract a signature
- Convert signature
- use a set of n-gram keys to index the database
- Rank and verify
- return top N documents
- visual or algorithmic refinement
- Advantages:
- Robust to noise, extracted quickly, extracted easily, efficiently
stored
|
53
|
|
54
|
- The usual approach: Model-based evaluation
- Apply confusion statistics to an existing collection
- A bit better: Print-scan evaluation
- Scanning is slow, but availability is no problem
- Best: Scan-only evaluation
- No existing IR collections have printed materials
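Model-based evaluation can be sketched as a small text degrader that applies character confusions to clean text at a chosen rate. The confusion table, the function name `degrade`, and the fixed seed are all assumptions for illustration. A real simulator would draw on measured confusion statistics.

```python
import random

# Hypothetical confusion table: clean character -> possible OCR outputs.
CONFUSIONS = {"m": ["rn"], "l": ["1", "I"], "e": ["c"], "o": ["0"]}


def degrade(text, error_rate, seed=0):
    """Replace confusable characters with probability error_rate."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < error_rate:
            out.append(rng.choice(CONFUSIONS[ch]))
        else:
            out.append(ch)
    return "".join(out)


print(degrade("modern document retrieval", 0.3))
```

Running an existing test collection through such a degrader gives a noisy collection with known ground truth, at the cost of errors that are more uniform than those in real scans.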
|
55
|
- Many applications benefit from image based indexing
- Less discriminatory features
- Features may therefore be easier to compute
- More robust to noise
- Often computationally more efficient
- Many classical IR techniques have application for DIR
- Structure as well as content are important for indexing
- Preservation of structure is essential for in-depth understanding
|
56
|
|
57
|
- 57 Title pages, 891 non-title pages
- Overall Accuracy = 906/948 = 95.57%
- Title Page Accuracy = 37/57 = 64.91%
- False Positives = 22
- False Negatives = 20
- Observations
- All without Type-Specific Information
- Need Functional (or Logical) Features
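The figures above all follow from four counts, as this sketch shows. The precision value is not reported on the slide and is derived here only for comparison.

```python
title_pages, non_title_pages = 57, 891
false_positives, false_negatives = 22, 20

total = title_pages + non_title_pages
true_positives = title_pages - false_negatives
correct = total - false_positives - false_negatives

overall_accuracy = correct / total
title_page_accuracy = true_positives / title_pages  # recall on title pages
# Derived quantity, not reported on the slide:
title_page_precision = true_positives / (true_positives + false_positives)

print(f"overall {overall_accuracy:.2%}, title pages {title_page_accuracy:.2%}, "
      f"precision {title_page_precision:.2%}")
```

The high overall accuracy is dominated by the 891 non-title pages. The title page figure shows how much harder the rare class is.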
|
58
|
- Questions
- Definitions - Document, Image, Retrieval
- Document Image Analysis
- Traditional Indexing with Conversion
- Doing things Without Conversion
- Recent work on IR with Chinese document images
|
59
|
- Full-text search based on manually re-keying the text
- Prohibitively expensive at large scale
- Search based on bibliographic metadata
- May be difficult to adequately describe the materials.
- Full text based on Optical Character Recognition (OCR)
- Inexpensive and relatively rapid
- Sensitive to OCR accuracy
|
60
|
- What to index?
- Phrases, words, characters, or shape codes
- Unigrams or n-grams
- How to weight a term in a document?
- Term frequency (TF)
- Document frequency (DF)
- Document length normalization
- (Term position)
- How to assign scores to documents?
- Boolean, vector space, and probabilistic models
|
61
|
- Words may be any number of characters (typically 2-5)
- But some contain only 1 character or more than 5 characters
- e.g., “貓” (cat), “聯合國教科文組織” (UNESCO)
- Longer words (over 2 characters) often have shorter sub-word units
- Transliteration is an exception
- Written Chinese has no word separator
- A sentence can be segmented in different ways, all of which may be legal
- Similar to the phrase detection problem in English
- Chinese character inventory is very large
- 13,500 characters in Big-5 code (traditional Chinese: Taiwan and Hong Kong)
- Over 6,000 characters in GB code (simplified Chinese: China, Singapore)
- About 3,000 commonly used characters in each character set
|
62
|
- 800,000 newspaper clippings from 1950-1976
- Scanned over 300,000 at 300 dpi
- 30 China, Hong Kong, and Taiwan news agencies
- Mostly simplified Chinese, some traditional Chinese
- Focus on diplomatic and military activities
|
63
|
- Selected 11,108 scanned document images
- OCR yielded 8,438 valid docs (Presto! OCR Pro, Big-5)
- Avg valid document had a 69% system-reported “recognition
rate”
- Computed on a sample of 1,300 documents
- Second version prepared using Big-5 to GB conversion
- GB version used in experiments
|
64
|
- Based on contemporaneous Chinese journal articles
- From 100 paper titles, 30 were selected and rewritten as Chinese topics
- Made English translations for cross-language experiments
- Translated by native speakers of Chinese
|
65
|
- Exhaustive tri-state relevance judgments
- Irrelevant (=0), partially relevant (=1), fully relevant (=2)
- Every topic-document pair judged by 3 assessors
- 2 majored in history, 1 majored in library science
- Averaged 4 minutes per document image (for all 30 topics)
- Sum of the judgments provides a final estimate
- 0=not relevant, 1…5=partially relevant, 6=fully relevant
- Threshold as desired to reflect the intended application
- In our experiments, any score > 0 is treated as
“relevant”
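The judgment scheme above amounts to summing three 0/1/2 scores and applying a threshold. A minimal sketch, with the function name `combine_judgments` and the return shape chosen for illustration:

```python
def combine_judgments(scores, threshold=1):
    """Sum three assessors' 0/1/2 judgments and apply a relevance threshold."""
    if len(scores) != 3 or any(s not in (0, 1, 2) for s in scores):
        raise ValueError("expected three judgments, each 0, 1, or 2")
    total = sum(scores)
    if total == 0:
        label = "not relevant"
    elif total == 6:
        label = "fully relevant"
    else:
        label = "partially relevant"
    # threshold=1 treats any score above 0 as relevant.
    return total, label, total >= threshold


print(combine_judgments([0, 0, 1]))
print(combine_judgments([1, 1, 1], threshold=4))
```

Raising the threshold trades recall for stricter agreement among assessors, which is the adjustment the slide leaves to the intended application.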
|
66
|
- Indexing method:
- Both 1-gram (for partial match) and 2-gram (for preserving sequence)
- Example: “ABC” will be indexed with “A”, “B”, “C”, “AB”, “BC”
- Compared to 1-gram only and 2-gram only
- Weighting scheme:
- document terms : TF*IDF = log(1 + tf) * log(N/df)
- query terms : tf * (3w-1), where w is the length of the term
- Retrieval model:
- Vector space model compared with probabilistic model
- Document length normalization:
- byte size for document terms, compared to cosine
|
67
|
- Experiments by Taghva et al showed that
- some sophisticated weighting schemes shown to be more effective for
ordinary text might lead to more unstable results for OCR degraded
text.
- Singhal, Salton, Buckley [‘96] analyzed this phenomenon by
- Vector space model (SMART system)
- Word-based indexing
- simulated OCR output of a TREC collection (2GB of 742,202 docs)
- 50 TREC queries (numbered from 151 to 200)
- Specifically, effects of cosine normalization and IDF are analyzed
- Incorrect terms like ‘systom’ have large IDF and thus
affect weights of other terms in the same document if cosine
normalization is used:
- They correct this problem by using byte size normalization:
- (byte size)^0.375
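The effect can be illustrated with made-up term weights. The numbers, the function names, and the 1000-byte document size are hypothetical. Only the 0.375 exponent comes from the slide.

```python
import math


def cosine_normalize(weights):
    """Divide each weight by the Euclidean length of the weight vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


def byte_size_normalize(weights, byte_size, exponent=0.375):
    """Divide each weight by (byte size) ** exponent."""
    norm = byte_size ** exponent
    return {t: w / norm for t, w in weights.items()}


clean = {"system": 2.0, "retrieval": 3.0}
# Same document with one garbled term that carries a large IDF-driven weight.
noisy = {"system": 2.0, "retrieval": 3.0, "systom": 9.0}

print(cosine_normalize(clean)["retrieval"], cosine_normalize(noisy)["retrieval"])
print(byte_size_normalize(clean, 1000)["retrieval"],
      byte_size_normalize(noisy, 1000)["retrieval"])
```

Under cosine normalization the garbled term inflates the norm and shrinks the weight of every correct term in the document. The byte-size norm depends only on document length, so the correct terms keep their weights.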
|
68
|
|
69
|
- The SCRC test collection is useful
- But more than 30 topics may be needed for statistical significance
- Indexing 1-grams and 2-grams together works well
- If 2-grams are given greater weight in the query
- Byte size normalization outperforms cosine normalization
- But Inquery does better than either on short queries
- OCR errors adversely affect blind relevance feedback
- A clean comparable collection would probably work better
- Pruning seems to help
- Considerable parameter tuning is needed (a, b, and k)
|