1
|
- LBSC 796/CMSC 828o
- Session 6 – March 1, 2004
- Douglas W. Oard
|
2
|
- Questions
- Controlled vocabulary retrieval
- Generating metadata
- Metadata standards
- Putting the pieces together
|
3
|
|
4
|
- Homonymy
- Terms may have many unrelated meanings
- Polysemy (related meanings) is less of a problem
- Synonymy
- Many ways of saying (nearly) the same thing
- Anaphora
- Alternate ways of referring to the same thing
|
5
|
- Privacy limits access to observations
- Queries based on behavior are hard to craft
- Explicit queries are rarely used
- Query by example requires behavior history
- “Cold start” problem limits applicability
|
6
|
- Develop a concept inventory
- Uniquely identify concepts using “descriptors”
- Concept labels form a “controlled vocabulary”
- Organize concepts using a “thesaurus”
- Assign concept descriptors to documents
- Craft queries using the controlled vocabulary
|
7
|
|
8
|
- Canine AND Fox
- Canine AND Political action
- Canine OR Political action
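
The three queries above can be read as set operations over the descriptors assigned to each document. A minimal sketch, assuming a toy collection; the document ids and descriptor assignments are hypothetical and not from the lecture:

```python
# Hypothetical toy collection: document id -> set of assigned descriptors.
index = {
    "doc1": {"Canine", "Fox"},
    "doc2": {"Canine", "Political action"},
    "doc3": {"Political action"},
}

def docs_with(descriptor):
    """Return the ids of all documents indexed with the given descriptor."""
    return {doc for doc, descriptors in index.items() if descriptor in descriptors}

# Boolean AND is set intersection; Boolean OR is set union.
canine_and_fox = docs_with("Canine") & docs_with("Fox")
canine_and_politics = docs_with("Canine") & docs_with("Political action")
canine_or_politics = docs_with("Canine") | docs_with("Political action")
```

AND narrows the result to documents carrying both descriptors, while OR returns every document carrying either one.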
|
9
|
- When implied concepts must be captured
- Political action, volunteerism, …
- When terminology selection is impractical
- Searching foreign language materials
- When no words are present
- Photos w/o captions, videos w/o transcripts, …
- When user needs are easily anticipated
- Weather reports, yellow pages, …
|
10
|
|
11
|
- Goal: fully automatic descriptor assignment
- Machine learning approach
- Assign descriptors manually for a “training set”
- Design a learning algorithm to find and use patterns
- Bayesian classifier, neural network, genetic algorithm, …
- Present new documents
- System assigns descriptors like those in training set
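
The train-then-assign loop above can be sketched with a small multinomial naive Bayes classifier, one of the Bayesian options the slide names. This is an illustration under stated assumptions, not the system used in the course: the training texts, the `train` and `assign` function names, and the add-one smoothing are all choices made for this sketch.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, descriptor) pairs labeled by hand."""
    doc_counts = Counter()               # documents seen per descriptor
    word_counts = defaultdict(Counter)   # word frequencies per descriptor
    vocab = set()
    for text, label in examples:
        doc_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab

def assign(text, model):
    """Return the single most probable descriptor for a new document."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical hand-labeled training set.
training_set = [
    ("the fox chased the hound", "Canine"),
    ("wolves and dogs hunt in packs", "Canine"),
    ("voters signed the petition", "Political action"),
    ("volunteers joined the campaign rally", "Political action"),
]
model = train(training_set)
```

Calling `assign("the dogs chased a fox", model)` picks the descriptor whose training documents share the most vocabulary with the new text. A real system would assign several descriptors per document and need far more training data.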
|
12
|
|
13
|
|
14
|
- Goal: Automatically suggest descriptors
- Better consistency with lower cost
- Approach: rule-based expert system
- Design thesaurus by hand in the usual way
- Design an expert system to process text
- String matching, proximity operators, …
- Write rules for each thesaurus/collection/language
- Try it out and fine tune the rules by hand
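
A minimal sketch of what such hand-written rules might look like, using plain string matching and a simple proximity operator. The rule set, the `near` helper, and the window size are hypothetical; a production rule base would be written and tuned per thesaurus, collection, and language as the slide says.

```python
import re

def tokens(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def near(words, a, b, window):
    """Proximity operator: True if a and b occur within `window` tokens of each other."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

# Hypothetical rules: (descriptor to suggest, test applied to the token list).
RULES = [
    ("Canine", lambda w: "dog" in w or "fox" in w),
    ("Political action", lambda w: near(w, "petition", "signed", 3)),
]

def suggest(text):
    """Return the descriptors whose rules fire on the text, in rule order."""
    words = tokens(text)
    return [descriptor for descriptor, rule in RULES if rule(words)]
```

The output is a list of suggestions for a human indexer to accept or reject, which matches the goal of better consistency at lower cost rather than fully automatic assignment.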
|
15
|
|
16
|
- Thesaurus must match the document collection
- Thesaurus must match the information needs
- Thesaurus can help to guide the searcher
- Broader term (“is-a”), narrower term, used for, …
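
One way to picture how these relations guide a searcher is a tiny thesaurus that maps non-preferred entry terms to descriptors and expands a query to narrower terms. The terms and relations below are hypothetical examples, and the `BROADER` and `USE` tables are an assumed representation rather than a standard format.

```python
# Hypothetical thesaurus fragment.
# BROADER maps each descriptor to its broader term ("is-a").
BROADER = {"Fox": "Canine", "Dog": "Canine", "Canine": "Mammal"}
# USE maps non-preferred entry terms to the preferred descriptor
# (the inverse of a "used for" note).
USE = {"Vixen": "Fox", "Hound": "Dog"}

def preferred(term):
    """Map an entry term to its descriptor; descriptors map to themselves."""
    return USE.get(term, term)

def narrower(term):
    """Return the descriptors whose broader term is `term`."""
    return sorted(t for t, broader in BROADER.items() if broader == term)

def expand(term):
    """Return the descriptor for `term` followed by all of its narrower terms."""
    term = preferred(term)
    result = [term]
    for child in narrower(term):
        result.extend(expand(child))
    return result
```

Here `expand("Canine")` also retrieves documents indexed under the narrower descriptors, and `expand("Vixen")` redirects the searcher to the preferred descriptor.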
|
17
|
- Changing concept inventories
- Literary warrant and user needs are hard to predict
- Accurate concept indexing is expensive
- Machines are inaccurate, humans are inconsistent
- Users and indexers may think differently
- Diverse user populations add to the complexity
- Using thesauri effectively requires training
- Meta-knowledge and thesaurus-specific expertise
|
18
|
- Machine learning techniques can find:
- Two types of features are useful
- Orthography
- e.g., Paired or non-initial capitalization
- Trigger words
- e.g., Mr., Professor, said, …
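
The two feature types above could be computed per token roughly as follows before being handed to a learning algorithm. This is a sketch: the feature names, the reading of "paired" as "adjacent to another capitalized token", and the split of trigger words into before and after lists are assumptions made for illustration.

```python
# Trigger words taken from the slide; the before/after split is an assumption.
TRIGGERS_BEFORE = {"Mr.", "Professor"}   # typically precede a name
TRIGGERS_AFTER = {"said"}                # typically follow a name

def features(tokens, i):
    """Return orthographic and trigger-word features for tokens[i]."""
    tok = tokens[i]
    prev_tok = tokens[i - 1] if i > 0 else ""
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
    capitalized = tok[:1].isupper()
    return {
        # Capitalized somewhere other than the start of the sentence.
        "non_initial_cap": i > 0 and capitalized,
        # Capitalized and adjacent to another capitalized token.
        "paired_cap": capitalized and (prev_tok[:1].isupper() or next_tok[:1].isupper()),
        "trigger_before": prev_tok in TRIGGERS_BEFORE,
        "trigger_after": next_tok in TRIGGERS_AFTER,
    }

sentence = "Yesterday Professor Oard said hello".split()
name_features = features(sentence, 2)    # features for "Oard"
other_features = features(sentence, 4)   # features for "hello"
```

For the token "Oard" all four features fire, while for "hello" none do, which is the kind of contrast a classifier can learn from.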
|
19
|
|
20
|
- Variant forms of names (“name authority”)
- Pseudonyms, partial names, citation styles
- Acronyms and abbreviations
- Organizations, political entities, projects, …
- Co-reference resolution
- References to roles or objects rather than names
- Anaphoric pronouns for an antecedent name
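
Name authority control can be sketched as a lookup from normalized variant forms to one authorized heading. The authority entry, the variant list, and the order-insensitive normalization below are hypothetical simplifications; they do not handle pseudonyms or co-reference, which need more than string normalization.

```python
import re

# Hypothetical authority file: authorized heading -> known variant forms.
AUTHORITY = {
    "Oard, Douglas W.": ["Douglas W. Oard", "D. W. Oard", "Doug Oard"],
}

def key(name):
    """Normalize a name: lowercase, drop punctuation, ignore word order."""
    return " ".join(sorted(re.findall(r"[a-z]+", name.lower())))

# Build the lookup table from every known form to its authorized heading.
LOOKUP = {}
for heading, variants in AUTHORITY.items():
    for form in [heading] + variants:
        LOOKUP[key(form)] = heading

def resolve(name):
    """Return the authorized heading for a name, or None if it is unknown."""
    return LOOKUP.get(key(name))
```

Ignoring word order lets inverted citation forms such as "Oard, D. W." match the variant "D. W. Oard" without listing both.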
|
21
|
|
22
|
- What can we describe?
- How can we convey it?
- Resource Description Framework (RDF)
- What can we say?
- What does it mean?
|
23
|
- Goals:
- Easily understood, implemented and used
- Broadly applicable to many applications
- Approach:
- Intersect several standards (e.g., MARC)
- Suggest only “best practices” for element content
- Implementation:
- 16 optional and repeatable “elements”
- Refined using a growing set of “qualifiers”
- “Best practice” suggestions for content standards
|
24
|
- Content
- Title
- Subject
- Description
- Type
- Audience
- Coverage
- Related resource
- Rights
- Instantiation
- Date
- Format
- Language
- Identifier
- Responsibility
- Creator
- Contributor
- Source
- Publisher
|
25
|
- XML schema for describing resources
- Can integrate multiple metadata standards
- Dublin Core, P3P, PICS, vCARD, …
- Dublin Core provides an XML “namespace”
- DC Elements are XML “properties”
- DC Refinements are RDF “subproperties”
- Values are XML “content”
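
As an illustration of the mapping above, a Dublin Core description can be serialized as RDF/XML with Python's standard `xml.etree.ElementTree` module. The two namespace URIs are the published RDF and Dublin Core element-set namespaces; the `describe` helper, the `example.org` resource URI, and the subject values are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)

def describe(resource_uri, elements):
    """Serialize (element name, value) pairs as one rdf:Description."""
    root = ET.Element(f"{{{RDF}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": resource_uri}
    )
    # Elements are optional and repeatable, so a list of pairs is used
    # rather than a dict keyed by element name.
    for name, value in elements:
        ET.SubElement(description, f"{{{DC}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")

record = describe(
    "http://example.org/session6",  # hypothetical resource URI
    [
        ("title", "Session 6"),
        ("creator", "Douglas W. Oard"),
        ("date", "2004-03-01"),
        ("subject", "Controlled vocabulary retrieval"),
        ("subject", "Metadata standards"),
    ],
)
```

Each Dublin Core element becomes a property in the `dc` namespace, and the element value becomes the XML content of that property.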
|
26
|
|
27
|
- RDF provides the schema for interchange
- Ontologies support automated inference
- Similar to thesauri supporting human reasoning
- Ontology mapping permits distributed creation
- This is where the magic happens :-)
|
28
|
- Search is user-controlled suppression
- Everything is known to the search system
- Goal: avoid showing things the user doesn’t want
- Other stakeholders have different goals
- Authors risk little by wasting your time
- Marketers hope for serendipitous interest
- Metadata from trusted sources is more reliable
|
29
|
- Goal: Manipulate rankings of an IR system
- Multiple strategies:
- Create bogus user-assigned metadata
- Add invisible text (font in background color, …)
- Alter your text to include desired query terms
- “Link exchanges” create links to your page
|
30
|
|
31
|
- On a sheet of paper, please briefly answer the following question (no names):
- What was the muddiest point in today’s lecture?
|