My thesis committee is made up of Daniel Marcu (CS), Kevin Knight (CS), Eduard Hovy (CS), Stefan Schaal (CS), Gareth James (Statistics), Andrew McCallum (UMass).

The thesis is available in PDF or PostScript format (warning: it's a big file!). BibTeX is also available. You can also download my defense slides in either OpenOffice format or PDF (warning: animations don't come through in PDF).

The thesis abstract is:

Natural language processing is replete with problems whose outputs are highly complex and structured. The current state-of-the-art in machine learning is not yet sufficiently general to be applied to general problems in NLP. In this thesis, I present Searn (for "search-learn"), an approach to learning for structured outputs that is applicable to the wide variety of problems encountered in natural language (and, hopefully, to problems in other domains, such as vision and biology). To demonstrate Searn's general applicability, I present applications in such diverse areas as automatic document summarization and entity detection and tracking. In these applications, Searn is empirically shown to achieve state-of-the-art performance.

Searn is based on an integration of learning and search. This contrasts with standard approaches that define a model, learn parameters for that model, and then use the model and the learned parameters to produce new outputs. In most NLP problems, the "produce new outputs" step includes an intractable computation. One must therefore employ a heuristic search function for the production step. Instead of shying away from search, Searn attacks it head on and considers structured prediction to be defined by a search problem. The corresponding learning problem is then made natural: learn parameters so that search succeeds.
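To make "learn parameters so that search succeeds" concrete, here is a minimal sketch in Python. It is not Searn itself, which is described in the thesis. It is a simplified stand-in under stated assumptions: the task is toy sequence labeling, the search is greedy and left to right, and the learner is a plain perceptron update. The feature templates, the function names (`features`, `score`, `greedy_search`, `train`), and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def features(words, i, prev_label):
    """Hypothetical feature templates for the decision at position i."""
    return [
        f"word={words[i]}",
        f"prev={prev_label}",
        f"word={words[i]}|prev={prev_label}",
    ]


def score(weights, feats, label):
    """Linear score of choosing `label` given the active features."""
    return sum(weights[(f, label)] for f in feats)


def greedy_search(weights, words, labels):
    """Build the output one decision at a time; each decision sees the previous one."""
    output, prev = [], "<s>"
    for i in range(len(words)):
        feats = features(words, i, prev)
        prev = max(labels, key=lambda y: score(weights, feats, y))
        output.append(prev)
    return output


def train(data, labels, epochs=10):
    """Perceptron-style training: correct each search decision that disagrees with the reference."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:
            prev = "<s>"
            for i, y_gold in enumerate(gold):
                feats = features(words, i, prev)
                y_hat = max(labels, key=lambda y: score(weights, feats, y))
                if y_hat != y_gold:
                    for f in feats:
                        weights[(f, y_gold)] += 1.0
                        weights[(f, y_hat)] -= 1.0
                # Continue from the search's own decision, so training
                # visits the states that search reaches at test time.
                prev = y_hat
    return weights


# Toy data, invented for this example.
labels = ["O", "PER", "LOC"]
data = [
    (["john", "smith", "spoke"], ["PER", "PER", "O"]),
    (["mary", "visited", "paris"], ["PER", "O", "LOC"]),
]
weights = train(data, labels)
for words, gold in data:
    print(words, greedy_search(weights, words, labels), gold)
```

The point of the sketch is the shape of the computation: prediction is a sequence of search decisions, and the parameters are adjusted only where a search decision goes wrong, rather than being fit to a separate model whose outputs must then be decoded.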

The two application domains I study most closely in this thesis are entity detection and tracking (EDT) and automatic document summarizatio