Schedule of Topics
This is the schedule of topics for an advanced seminar in natural
language processing, focused on the foundations of cutting-edge methods
in NLP. Foundations, as used here, are the things that the cutting-edge
research papers assume you already understand.
To the extent possible, the readings here are available on the Web.
"Additional readings" are optional links pointing either to material
you should already know (but might want to review) or to related
material you might be interested in.
Some topic areas may take longer than expected, so
keep an eye on the class mailing list or e-mail me for "official"
dates.
[September 2]
Topic Area 1
- Topics:
- Administrivia and course overview
- Maximum likelihood estimation
- Optimization and search: big picture
Reading(s):
Assignment 1 (due next class)
- Version for less experienced programmers: Write a program that
reads in a text file and produces a file containing unigram, bigram, and
trigram MLE probability estimates. Write a second program that
reads that data file, accepts text strings one per line, and
outputs the unigram, bigram, or trigram model probability
for each string. Feel free to exploit Pedersen's N-gram Statistics
Package. (A rough sketch of the MLE estimation appears after this list.)
- Exercise for more experienced programmers: same exercise but using
character N-grams up to arbitrary N. (You can hard-code N.)
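Not part of the official assignment handout, but as a rough illustration of what the less experienced programmers' version asks for, the following Python fragment computes word-level MLE n-gram estimates. The whitespace tokenization, file handling, and output format here are assumptions, not requirements.

```python
from collections import Counter

def mle_ngram_probs(tokens, n):
    """Return MLE estimates P(w_n | w_1..w_{n-1}) = c(w_1..w_n) / c(w_1..w_{n-1})
    as a dict keyed by the n-gram tuple."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if n == 1:
        total = sum(ngrams.values())
        return {g: c / total for g, c in ngrams.items()}
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return {g: c / contexts[g[:-1]] for g, c in ngrams.items()}

if __name__ == "__main__":
    import sys
    # Training text is read as one whitespace-tokenized stream; for the
    # character-n-gram version, replace .split() with list(text).
    tokens = open(sys.argv[1]).read().split()
    for n in (1, 2, 3):
        for gram, p in sorted(mle_ngram_probs(tokens, n).items()):
            print(n, " ".join(gram), p)
```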
Additional reading
[September 9]
Topic Area 2
Reading(s):
Assignment 2 (due next class)
- Extend the previous assignment so your language model estimates
use smoothed probabilities. Use any smoothing method you like
except add-K. Feel free to exploit Dan Melamed's Simple
Good-Turing smoothing software. (One possible choice is sketched after this list.)
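The assignment leaves the smoothing method open (anything but add-K). As one possible, not prescribed, choice, here is a minimal sketch of Witten-Bell-style interpolation of a bigram model with its unigram backoff; a real solution would also need to handle unseen words (e.g., via an UNK token), which this sketch ignores.

```python
from collections import Counter, defaultdict

def witten_bell_bigram(tokens):
    """Return a probability function p(w2 | w1) that interpolates the MLE bigram
    estimate with the unigram distribution, Witten-Bell style."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    followers = defaultdict(set)          # distinct word types observed after each context
    for w1, w2 in bigrams:
        followers[w1].add(w2)
    total = sum(unigrams.values())

    def prob(w1, w2):
        n_tokens = unigrams[w1]
        n_types = len(followers[w1])
        if n_tokens == 0 or n_types == 0:
            return unigrams[w2] / total   # unseen context: fall back to the unigram model
        # Contexts followed by many distinct types reserve more mass for the backoff.
        lam = n_tokens / (n_tokens + n_types)
        return lam * bigrams[(w1, w2)] / n_tokens + (1 - lam) * unigrams[w2] / total

    return prob
```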
Additional reading
[September 16 and 23]
Topic Area 3
- Topics:
- EM algorithms
Note: this topic area is scheduled for two meetings.
Reading(s):
Assignment 3 (Due Monday, Oct 11)
- Option 1. Define a language model that uses linear interpolation to
combine evidence from models at the word level and the character
level. (The previous assignments provide you with
implementations of these component models. Clever, huh?) Train
the coefficients of the interpolated model using EM. Evaluate the
perplexity of this model versus the individual component models
on unseen test data. (A minimal sketch of the EM update appears after these options.)
- Option 2. Consider an "aggregate bigram model" where p(w2|w1) =
sum_{c in C} p(w2|c) x p(c|w1). (C is a hidden variable ranging
over a set of classes {c1,...,cN} with, say, N=16.) Come
up with (do not look up!) an EM algorithm for estimating the
parameters of this model. Implement your algorithm, train on a
large sample of English, and compare the probabilities of
"colorless green ideas sleep furiously" versus "furiously sleep
ideas green colorless".
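For Option 1, here is a minimal sketch of the EM update for a single interpolation weight and of the perplexity comparison. The word_prob and char_prob functions stand in for the component models from the earlier assignments, and the (history, word) event representation is an assumption made only for illustration.

```python
import math

def em_interpolation(heldout, word_prob, char_prob, iters=20):
    """Estimate lambda in  p(w | h) = lambda*p_word(w | h) + (1-lambda)*p_char(w | h)
    by EM on held-out (history, word) events."""
    lam = 0.5
    for _ in range(iters):
        expected = 0.0
        for h, w in heldout:
            pw, pc = word_prob(h, w), char_prob(h, w)
            mix = lam * pw + (1 - lam) * pc
            expected += lam * pw / mix      # E-step: posterior that the word model produced this event
        lam = expected / len(heldout)       # M-step: renormalize the expected counts
    return lam

def perplexity(test, word_prob, char_prob, lam):
    """Per-word perplexity of the interpolated model on unseen (history, word) events."""
    logp = sum(math.log2(lam * word_prob(h, w) + (1 - lam) * char_prob(h, w))
               for h, w in test)
    return 2.0 ** (-logp / len(test))
```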
Additional reading
- Dan Klein's Lagrange
Multipliers without Permanent Scarring
- L. R. Rabiner,
"A tutorial on hidden Markov models and selected applications in speech
recognition", Proc. IEEE, Vol. 77, No. 2, pp. 257-286,
Feb. 1989.
- Jason Eisner's Interactive
Spreadsheet for Teaching the Forward-Backward Algorithm (TNLP 2002)
- A
cute movie showing the Viterbi algorithm in action (link stolen
from Dan
Jurafsky's speech course)
- And if the Viterbi movie wasn't enough petty larceny for you, how about
Dan Jurafsky's nice lecture
notes on the forward-backward algorithm?
- Mark Liberman, Colorless
Green Probability Estimates
- Gliozzo, Magnini, and Strapparava, Unsupervised
Domain Relevance Estimation for Word Sense Disambiguation
[September 30]
Topic Area 4
- Topics:
- Weighted finite state transducers
Reading(s):
- Mehryar Mohri and Michael Riley (2002). Tutorial on Weighted
Finite-State Transducers in Speech Recognition, (Part
1), International Conference on Spoken Language Processing 2002
(ICSLP '02), Denver, Colorado, September 2002.
- Hobbs et al. (1996), FASTUS: A
Cascaded Finite-State Transducer for Extracting Information from
Natural-Language Text.
- Fernando C. N. Pereira and Michael Riley. Speech
Recognition by Composition of Weighted Finite Automata. In
E. Roche and Y. Schabes, editors, Finite-State Language
Processing. MIT Press, Cambridge, Massachusetts. 1997.
Assignment (for extra credit, due Nov 16)
Additional reading
[October 7]
Topic Area 5
- Guest leader: David Chiang
- Topics:
- Parsing as inference
- Semirings and parsing
Reading(s):
- Jason Eisner,
Parameter estimation for probabilistic finite-state
transducers, ACL 2002. (Focus is on Section 4.)
- Shieber, Schabes, and Pereira (1995),
"Principles and Implementation of Deductive Parsing",
J. Logic Programming 24(1-2), pp. 3-36. (You can skip section 4
if you're not interested.)
- Joshua Goodman (1999),
Semiring Parsing, CL 25(4), 573-605.
(Note: there is an
erratum on p. 592, Fig. 7. You can skip section 4, as we'll just
discuss Eisner's method for calculating expectations.)
- What is it with "Section 4"???! :-)
Assignment
Additional reading (and David's comments)
- Lari and Young, 1990. Applications of stochastic context-free
grammars using the Inside-Outside algorithm. Computer Speech and
Language 4, 35-56. (For some reason this, and not the original
Baker paper, is the standard reference. The derivation of the
update equations is not as simple as it could be, I think.)
- Nederhof, 2003.
Weighted deductive parsing and Knuth's
algorithm.
CL
29(1), 135-143. (On best-first parsing. There is a similar paper
by Klein and Manning (IWPT, 2001), but this is much
simpler. However, the proofs in section 4 are spotty.)
- Goodman (1997), "Global
thresholding and multiple-pass parsing", EMNLP.
- Chiang (2003), Mildly context-sensitive grammars for estimating
maximum entropy models. Conference on Formal Grammar.
(Only tangentially related. Implicitly uses
the idea of semirings to generalize Inside-Outside and
conditional random fields to a large class of grammar formalisms
beyond CFG.)
[October 14]
Interlude: alignment templates for translation: models, decoding, and evaluation
- Topics and leaders:
- Background on IBM statistical MT models
- ATTM modeling and decoding
- Evaluation
Reading(s):
Assignment (due next class)
Additional reading
- Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent
J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer,
Paul S. Roossin (1990),
A Statistical Approach to Machine Translation
- Brown et al. (1993),
The Mathematics of Statistical Machine Translation: Parameter Estimation
- Franz Och (2002), Section 5.1 (phrase extraction) and
Chapter 6 (search, especially Section 6.3) of
Statistical Machine Translation: From Single Word Models to
Alignment Templates. Ph.D. thesis, RWTH Aachen, Germany.
- Franz Och (2003), Minimum Error Rate Training in
Statistical Machine Translation.
- Chin-Yew Lin and Eduard Hovy (2003),
Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics
- Lavie, A., K. Sagae and S. Jayaraman (2004), The
Significance of Recall in Automatic Metrics for MT Evaluation
- JHU summer workshop 2003 statistical
MT bibliography
- Lin and Och (2004), Automatic
Evaluation of Machine Translation Quality Using Longest Common
Subsequence and Skip-Bigram Statistics
- Babych and Hartley (2004),
Extending the BLEU MT Evaluation Method with Frequency Weightings
- More summarization evaluation references
[October 21]
Topic Area 6
- Topics:
- MT evaluation discussion (carried over from last week)
- Supervised classification: Naive Bayes
Reading(s):
Assignment (due next class)
- Start putting together your class project idea.
Additional reading
Related Talk on October 25
Paul Smith, "Statistics, Machine Learning and Data Mining", Monday,
October 25, 2004, 4:00 pm, MATH 3206. The abstract is at the Mathematics
Graduate Minicourse Series page, along with links to relevant
papers.
[October 28]
Topic Area 7
- Topics:
- Decision lists
- Support vector machines
Reading(s):
Assignment
- Class project sketch/proposal due Monday Nov 1
Additional reading
- Yarowsky (1999),
"Hierarchical Decision Lists for Word Sense Disambiguation"
- Goodman, "An
incremental decision list learner"
- Marti Hearst (1998) "Support
Vector Machines"
- T. Joachims (1997), "Text
Categorization with Support Vector Machines: Learning with Many
Relevant Features"
- Schoelkopf and Smola, Learning with
Kernels: Support Vector Machines, Regularization, Optimization
and Beyond, Chapter 7; see also the appendix on mathematical
background.
- Andrew Ng, Lecture
notes on support vector machines
- Burges (1998), A
Tutorial on Support Vector Machines for Pattern Recognition
- Martin Law, An
introduction to support vector machines (slides)
[November 4]
Topic Area 8
- Topics:
- Supervised classification with special attention to unlabeled or noisy examples
Reading(s):
Assignment (for extra credit, due Nov 16)
Additional reading
- Hwa et al., Corrected
Co-training for Statistical Parsers
- Robert Schapire (1999), A Brief
Introduction to Boosting
- Steedman et al. (2003), Bootstrapping
Statistical Parsers from Small Datasets
- Avrim Blum and Tom Mitchell (1998), Combining
Labeled and Unlabeled Data with Co-Training
- Schapire (2002), The
Boosting Approach to Machine Learning: An Overview
- Leo Breiman (1996), Bagging
Predictors
- Abney, Understanding
the Yarowsky Algorithm
- Rudin et al.,
Dynamics of Boosting (shows that AdaBoost can converge to a
non-maximum margin)
- Rudin et al.,
Analysis of Boosting Algorithms using the Smooth Margin
function (introduces two new boosting algorithms that maximize
the margin, plus more analysis of AdaBoost)
Related Talk on November 8
Judea Pearl, "Reasoning with Cause and Effect", Monday, November 8,
2004, 11:00am, 1402 Chemistry Building (note location change!).
His abstract is at the Logic and AI
Seminar Series, along with links to relevant papers.
[November 11]
Topic Area 9
Reading(s):
Assignment
Additional reading
[November 18]
Topic Area 10
Reading(s):
- Chapter from draft book on graphical models
(available only in hardcopy)
Assignment
Additional reading
[November 25]
No class. Happy Thanksgiving!
[December 2]
Topic Area 11
- Guest leader: Noah Smith
- Topics
- Log-linear and maximum entropy models and conditional random fields
Reading(s):
Assignment (due next class)
Additional reading
- Highly recommended: Adwait Ratnaparkhi. A Simple
Introduction to Maximum Entropy Models for Natural Language
Processing. Technical Report 97-08, Institute for Research in
Cognitive Science, University of Pennsylvania.
- Abney (1997), Stochastic
Attribute-Value Grammars (?)
- Mark Johnson, Joint and
conditional estimation of tagging and parsing models
- Klein and Manning, "Conditional
Structure versus Conditional Estimation in NLP Models"
- Stephen Della Pietra, Vincent Della Pietra, and John Lafferty (1997),
Inducing features of random fields
- Hanna Wallach (2004), Conditional
Random Fields: An Introduction
- John Lafferty, Andrew McCallum, Fernando Pereira (2001),
Conditional Random Fields: Probabilistic Models for Segmenting and
Labeling Sequence Data
[December 9]
- Last meeting: Project presentations.
Final projects are due by December 19 at 11:59pm.
Some topics of potential interest that we're not covering
- Eisner's Dyna programming language
- Evaluation and significance testing
- Clustering
- Suffix trees
- Using evidence from the Web
- Minimum Bayes risk
- Minimum description length (MDL)
- Osborne notes
Many thanks to Bill Byrne, David Chiang, Bonnie Dorr, Jason Eisner, Christof Monz,
David Smith, Noah Smith, and undoubtedly others I'm forgetting, for
discussions about the syllabus. Responsibility for the outcome is, of
course, completely indeterminate. :-)