1
|
- LBSC 796/INFM 718R
- Session 13, December 3, 2007
- Douglas W. Oard
|
2
|
- Questions
- Image retrieval
- Video retrieval
- Finishing up Speech and Music
- Project presentations
|
3
|
- We have already discussed three approaches
- Controlled vocabulary indexing
- Ranked retrieval based on associated captions
- Social filtering based on other users’ ratings
- Today’s focus is on content-based retrieval
- Analogue of content-based text retrieval
|
4
|
- Retrospective retrieval applications
- Manage stock photo archives
- Index images on the web
- Organize materials for art history courses
- Information filtering applications
- Earth resources satellite image triage
- News bureau wirephoto routing
- Law enforcement at immigration facilities
|
8
|
- Represent image as a rectangular pixel raster
- e.g., 1024 columns and 768 rows
- Represent each pixel as a quantized color
- e.g., 256 colors ranging from red through violet
- Count the number of pixels in each color bin
- Produces vector representations
- Compute vector similarity
- e.g., normalized inner product (see the sketch after this slide)
|
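A minimal sketch of the color-histogram matching described above, in Python with NumPy. The 256-bin quantization and the synthetic image arrays are illustrative assumptions, not code from the lecture.

```python
import numpy as np

def color_histogram(image, n_bins=256):
    """Count the pixels that fall in each quantized color bin.
    `image` is assumed to hold bin indices in 0..n_bins-1."""
    return np.bincount(image.ravel(), minlength=n_bins).astype(float)

def normalized_inner_product(h1, h2):
    """Cosine similarity between two histogram vectors."""
    denom = np.linalg.norm(h1) * np.linalg.norm(h2)
    return float(h1 @ h2) / denom if denom else 0.0

# Two synthetic 768x1024 "images" of quantized colors (illustrative input)
rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(768, 1024))
b = rng.integers(0, 256, size=(768, 1024))
print(normalized_inner_product(color_histogram(a), color_histogram(b)))
```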
11
|
- Texture characterizes small-scale regularity
- Color describes pixels, texture describes regions
- Described by several types of features
- e.g., smoothness, periodicity, directionality
- Match region size with image characteristics
- Computed using filter banks, Gabor wavelets, … (see the sketch after this slide)
- Perform weighted vector space matching
- Usually in combination with a color histogram
|
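One way the filter-bank idea might be realized, sketched with a hand-rolled Gabor kernel in NumPy; the kernel size, wavelength, and orientations are illustrative parameters, not values from the lecture.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def gabor_kernel(size=15, wavelength=4.0, theta=0.0, sigma=3.0):
    """A single Gabor filter: a sinusoid windowed by a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotate axes to orientation theta
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength)

def texture_features(region, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Mean response energy at several orientations: a directionality cue
    that can be appended to a color histogram for weighted matching."""
    feats = []
    for theta in thetas:
        k = gabor_kernel(theta=theta)
        windows = sliding_window_view(region, k.shape)   # every k-sized patch
        response = (windows * k).sum(axis=(-2, -1))
        feats.append(float(np.mean(response ** 2)))
    return np.array(feats)

rng = np.random.default_rng(0)
print(texture_features(rng.random((64, 64))))   # four orientation energies
```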
13
|
- Global techniques alone yield low precision
- Color & texture characterize objects, not images
- Segment at color and texture discontinuities
- Like “flood fill” in Photoshop (sketched after this slide)
- Represent size, shape & orientation of objects
- e.g., Berkeley’s “Blobworld” uses ellipses
- Represent relative positions of objects
- e.g., angles between lines joining the centers
- Perform rotation- and scale-invariant matching
|
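A flood-fill-style region grower over a quantized color raster, as a minimal sketch of the segmentation step; the region labels it produces give the sizes and centroids that the shape and position features above would build on. All names here are illustrative.

```python
from collections import deque
import numpy as np

def flood_fill_regions(quantized):
    """Label connected regions of equal quantized color, flood-fill style.
    Each region's pixel set yields size and centroid; ellipse fits (as in
    Blobworld) could follow from the region's second moments."""
    h, w = quantized.shape
    labels = np.full((h, w), -1, dtype=int)
    n_labels = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy, sx] != -1:
                continue
            queue = deque([(sy, sx)])
            labels[sy, sx] = n_labels
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and labels[ny, nx] == -1
                            and quantized[ny, nx] == quantized[y, x]):
                        labels[ny, nx] = n_labels
                        queue.append((ny, nx))
            n_labels += 1
    return labels, n_labels
```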
14
|
- More sophisticated techniques are needed ☺
|
17
|
- Query
- Keywords, example, sketch
- Matching
- Caption text
- Segmentation
- Similarity (color, texture, shape)
- Spatial arrangement (orientation, position)
- Specialized techniques (e.g., face recognition)
- Selection
|
18
|
- Google Image Search (text)
- Columbia WebSeek (text, color)
- http://www.ctr.columbia.edu/webseek/
- IBM QBIC (color, location)
- http://wwwqbic.almaden.ibm.com/, select Hermitage
|
19
|
- A set of time-synchronized modalities
- Video
- Images, object motion, camera motion, scenes
- Audio
- Speech, music, other sounds
- Text
- Closed captioning, on-screen captions, signs, …
|
20
|
- Television programs
- News, sports, documentary, talk show, …
- Movies
- Drama, comedy, mystery, …
- Meeting records
- Conference, video teleconference, working group
- Others
- Surveillance cameras, personal camcorders, …
|
21
|
- Image structure
- Absolute positioning, relative positioning
- Object motion
- Camera motion
- Pan, zoom, perspective change
- Shot transitions
|
22
|
- Hypothesize objects as in image retrieval
- Segment based on color and texture
- Examine frame-to-frame pixel changes (see the sketch after this slide)
- Classify motion
- Translation
- Linear transforms model unaccelerated motion
- Rotation
- Creation & destruction, elongation & compression
- Merge or split
|
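A toy sketch of the frame-to-frame change analysis, assuming grayscale frames as NumPy arrays; the threshold and the centroid-shift translation estimate are illustrative simplifications.

```python
import numpy as np

def changed_pixels(frame_a, frame_b, threshold=25):
    """Mask of pixels whose intensity changed noticeably between frames."""
    return np.abs(frame_b.astype(int) - frame_a.astype(int)) > threshold

def centroid_shift(mask_a, mask_b):
    """Translation estimate for one hypothesized object: the shift of its
    pixel centroid. Unaccelerated motion yields a constant shift per frame,
    which is why a linear transform models it well."""
    if not mask_a.any() or not mask_b.any():
        return None   # object created or destroyed between the frames
    return np.argwhere(mask_b).mean(axis=0) - np.argwhere(mask_a).mean(axis=0)
```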
23
|
- Do global frame-to-frame pixel analysis
- Classify the resulting patterns (see the sketch after this slide)
- Central tendency -> zoom out
- Balanced exterior destruction -> zoom in
- Selective exterior destruction -> pan
- Coupled rotation and translation -> perspective
- Coupled within objects, not necessarily across them
|
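A sketch of how these patterns might be classified, assuming a precomputed optical-flow field (two arrays of per-pixel x and y displacement); the thresholds and the decision rule are illustrative.

```python
import numpy as np

def classify_camera_motion(flow_x, flow_y, eps=0.1):
    """Toy classifier: a pan moves all pixels roughly alike, while a zoom
    makes the flow diverge from (zoom in) or converge toward (zoom out)
    the frame center."""
    spread = flow_x.std() + flow_y.std()
    if spread < eps:   # near-uniform field: whole-frame translation or none
        return "pan" if abs(flow_x.mean()) + abs(flow_y.mean()) > eps else "static"
    # divergence sign separates zoom in (outward flow) from zoom out
    div = np.gradient(flow_x, axis=1) + np.gradient(flow_y, axis=0)
    return "zoom in" if div.mean() > 0 else "zoom out"
```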
24
|
- Create a color histogram for each image
- Segment at discontinuities (cuts)
- Cuts are easy; other transitions are also detectable (see the sketch after this slide)
- Cluster representative histograms for each shot
- Identifies cuts back to a prior shot
- Build a time-labeled transition graph
|
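A minimal cut detector along these lines, comparing consecutive frames' normalized color histograms; the bin count and distance threshold are illustrative.

```python
import numpy as np

def detect_cuts(frames, n_bins=64, threshold=0.3):
    """Flag a shot boundary wherever consecutive frames' histograms diverge.
    `frames` is assumed to be a sequence of 2-D arrays of quantized colors."""
    cuts, prev = [], None
    for i, frame in enumerate(frames):
        hist = np.bincount(frame.ravel(), minlength=n_bins).astype(float)
        hist /= hist.sum()
        if prev is not None and 0.5 * np.abs(hist - prev).sum() > threshold:
            cuts.append(i)   # cut between frame i-1 and frame i
        prev = hist
    return cuts
```

Clustering the per-shot representative histograms that result would then reveal cuts back to a prior shot, as the slide notes.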
25
|
- Shot-to-shot structure correlates with genre
- Reflects accepted editorial conventions
- Some substructures are informative
- Frequent cuts to and from announcers
- Periodic cuts between talk show participants
- Wide-narrow cuts in sports programming
- Simple image features can reinforce this
- Head-and-shoulders, object size, …
|
26
|
- Video rarely appears in isolation
- Sound track, closed captions, on-screen captions
- This provides synergy, not just redundancy
- Some information appears in only one modality
- Image analysis complements video analysis
- Face detection, video OCR
|
27
|
- Video often lacks easily detected boundaries
- Between programs, news stories, etc.
- Accurate segmentation improves utility
- Too large hurts effectiveness, too small is unnatural
- Multiple segmentation cues are available
- Genre shift in shot-to-shot structure
- Vocabulary shift in closed captions
- Intrusive on-screen text
- Musical segues
|
28
|
- Designed for hearing-impaired viewers
- Speech content, speaker id, non-speech audio
- Weakly synchronized with the video
- On screen simultaneously with the speech for advance (pre-recorded) productions
- Significant lag for live productions
- Missing text and significant errors are common
- Automatic spelling correction can produce nonsense
|
29
|
- Speech and closed captions are redundant, but:
- Each contains different types of errors
- Each provides unique information
- Merging the two can improve retrieval
- Start with a rough time alignment
- Synchronize at points of commonality (see the sketch after this slide)
- Speech recognition provides exact timing
- Use the words from both as a basis for retrieval
- Learn which to weight more from training data
|
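One way to synchronize at points of commonality, sketched with Python's difflib; the per-word ASR start times and the sample sentences are illustrative assumptions.

```python
from difflib import SequenceMatcher

def align_captions_to_speech(caption_words, asr_words, asr_times):
    """Anchor caption words to recognizer timestamps wherever the two
    word streams match, giving the captions the recognizer's exact timing."""
    matcher = SequenceMatcher(a=caption_words, b=asr_words, autojunk=False)
    anchors = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            anchors.append((block.a + k, asr_times[block.b + k]))
    return anchors   # (caption word index, estimated start time) pairs

captions = "the president met reporters today".split()
asr = "uh the president that reporters today".split()   # typical ASR errors
times = [0.0, 0.4, 0.7, 1.3, 1.6, 2.1]                  # per-word start times
print(align_captions_to_speech(captions, asr, times))
```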
30
|
- On-screen captions can be very useful
- Speaker names, event names, program titles, …
- They can be very challenging to extract
- Low resolution, variable background
- But some factors work in your favor
- Absolutely stable over multiple frames
- Standard locations and orientations
|
31
|
- Text area detection
- Look for long, thin horizontal regions (see the sketch after this slide)
- Bias towards classic text locations by genre
- Integrate detected regions across multiple frames
- Enhance the extracted text
- Contrast improvement, interpolation, thinning
- Optical character recognition
- Matched to the font, if known
|
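A sketch of the "long thin horizontal regions" heuristic: rows dense in vertical intensity edges (character strokes) are merged into candidate caption bands. The thresholds are illustrative.

```python
import numpy as np

def text_row_bands(gray, edge_thresh=30, density_thresh=0.2):
    """Return (top, bottom) row bands likely to contain overlaid text."""
    # vertical intensity edges respond strongly to character strokes
    edges = np.abs(np.diff(gray.astype(int), axis=1)) > edge_thresh
    density = edges.mean(axis=1)           # edge density per pixel row
    flagged = density > density_thresh
    bands, start = [], None
    for y, on in enumerate(flagged):       # merge consecutive flagged rows
        if on and start is None:
            start = y
        elif not on and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(flagged)))
    return bands
```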
32
|
- Segment from images based on shape
- Head, shoulders, and hair provide strong cues
- Track across several images
- Using optical flow techniques
- Select the most directly frontal view
- Based on eye and cheek positions, for example
- Construct feature vectors
- “Eigenface” produces 16-element vectors (see the sketch after this slide)
- Perform similarity matching
|
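A compact eigenface sketch in NumPy: PCA via SVD on centered, flattened face images, keeping 16 components as the slide suggests. The data layout (one flattened face per row, at least 16 faces) is an assumption.

```python
import numpy as np

def eigenface_vectors(faces, n_components=16):
    """Project faces onto the top principal components ('eigenfaces').
    `faces` is an (n_faces, height*width) array of flattened grayscale
    images; returns one 16-element vector per face, plus the basis."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]              # rows are the eigenfaces
    return centered @ basis.T, mean, basis

def face_distance(v1, v2):
    """Similarity matching in eigenface space: smaller is more alike."""
    return float(np.linalg.norm(v1 - v2))
```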
33
|
- Face recognition and speaker identification
- Both exploit information that is usually present
- But both require training data
- On-screen captions provide useful cues
- Confounded by OCR errors and varied spelling
- Closed captions and speech retrieval help too
- If genre-specific heuristics are used
- e.g., announcers usually introduce speakers before cuts
|
35
|
- Multimedia retrieval builds on all we know
- Controlled vocabulary retrieval & social filtering
- Text, image, speech and music retrieval
- New information sources are added
- Video structure, closed & on-screen captions
- Cross-modal alignment adds new possibilities
- One modality can make another more informative
- One modality can make another more precise
|
36
|
- Each minute of video contains 1,800 frames (30 frames/second × 60 seconds)
- Some form of compaction is clearly needed
- Two compaction techniques have been used
- Extracts select representative frames or shots
- Abstracts summarize multiple frames
- Three presentation techniques are available
- Storyboard, slide show, full motion
|
37
|
- First frame of a shot is easy to select
- But it may not be the best choice
- Genre-specific cues may be helpful
- Minimum optical flow for director’s emphasis
- Face detection for interviews
- Presence of on-screen captions
- This may produce too many frames
- Color histogram clusters can reveal duplicates
|
39
|
- Composite images that capture several scenes
- And convey a sense of space, time, and/or motion
- Exploits familiar metaphors
- Time exposures, multiple exposures, strobe, …
- Two stages
- Modeling (e.g., video structure analysis)
- Rendering
- Global operators do time exposure and variable resolution
- Segmentation supports production of composite frames
|
42
|
- Spatial arrangement of still images
- Linear arrangements depict temporal evolution
- Overlapped depictions allow denser presentations
- Graphs can be used to depict video structure
- But temporal relationships are hard to capture
- Naturally balances overview with detail
- Easily browsed at any level of detail
- Tradeoff between detail and complexity
- Further limited by image size and resolution
|
45
|
- Flip through still images in one spot
- At a rate selected by the user
- Conserves screen space
- But it is hard to process several simultaneously
- Several variations possible
- Content-sensitive dwell times
- Alternative frame transitions (cut, dissolve, …)
|
47
|
- Extracted shots, joined by cuts
- The technique used in movie advertisements
- Conveys more information using motion
- Optionally aligned with extracted sound as well
- Hard to build a coherent extract
- Movie ads are constructed by hand
|
48
|
- Six 20-minute slots:
- 15-minute presentation
- 5 minutes for questions
- First slot will be Mai (from overseas)
- Two projectors
- Laptop or second instructor console for system
- Primary instructor console for slides
|
49
|
- What you did
- Why you did it
- Overview of how you did it
- What you know about how well it works
- Batch evaluation
- User study
- Big things you learned
|