NAACL HLT 2009 Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics Proceedings of the Conference May 31 – June 5, 2009 Boulder, Colorado Production and Manufacturing by Omnipress Inc. 2600 Anderson Street Madison, WI 53707 USA Sponsors: · Rosetta Stone · CNGL · Microsoft Research · Google · AT&T · Language Weaver · J.D. Power · IBM Research · The Linguistic Data Consortium · The Human Language Technology Center of Excellence at the Johns Hopkins University · The Computational Language and Education Research Center at the University of Colorado at Boulder © 2009 The Association for Computational Linguistics Order copies of this and other ACL proceedings from: Association for Computational Linguistics (ACL) 209 N. Eighth Street Stroudsburg, PA 18360 USA Tel: +1-570-476-8006 Fax: +1-570-476-0860 acl@aclweb.org ISBN: 978-1-932432-41-1

Preface: General Chair I am honored that the North American Chapter of the Association for Computational Linguistics (NAACL) has given me the opportunity, as General Conference Chair, to continue the NAACL HLT tradition of covering topics from all areas of Human Language Technology, a breadth that makes it possible for researchers to discuss algorithms and applications that cut across the fields of natural language processing (NLP), speech processing, and information retrieval (IR). I have been very fortunate to work with a terrific group of Technical Program Co-Chairs: Michael Collins (NLP), Shri Narayanan (speech), Douglas W. Oard (IR), and Lucy Vanderwende (NLP). This year the technical program emphasizes the breadth and interdisciplinary nature of human language processing research. The plenary talks will stretch our thinking about how language is used by considering the application of language to vision in one case, and language as it relates to food in another. There are two special sessions with themes that cut across multiple sub-areas of HLT: Large Scale Language Processing and Speech Information Retrieval. We also recognize the increasing importance of industry in our field with a lunchtime panel discussion on the Next Big Applications in Industry, with thanks to Bill Dolan for organizing and moderating the discussion. Finally, we have a breadth of excellent technical papers in lecture and poster sessions, thanks to the efforts of our Senior Program Committee members, the many reviewers on the Program Committee who helped us keep to our schedule, and the Paper Awards Committee. Together they have done a great job in putting together an interesting technical program. It has also been a pleasure to work with Local Organizers Martha Palmer and Jim Martin, who have done a terrific job in hosting a meeting that shows us Colorado's character as well as offering a great technical program. I hope you enjoy your stay in beautiful Boulder as you learn about new ideas and network with valued colleagues. The tradition of NAACL HLT is that it incorporates many events, including tutorials and workshops, which have expanded in scope to the point that they are almost as big as the main conference. As a result, many other people have played important roles in making the overall conference a success and representative of the breadth of HLT.
Specifically, I thank Matthew Stone, Gokhan Tur and Diana Inkpen for their work as Publicity Chairs; Christy Doran and Eric Ringger for their work as Publications Chairs; Fred Popowich and Michael Johnston for serving as Demo Chairs; Tutorial Chairs Ciprian Chelba, Paul Kantor and Brian Roark for bringing us an outstanding slate of tutorials; Workshop Chairs Nizar Habash and Mark Hasegawa-Johnson for their efforts in choosing and supporting the 12 workshops that extend our program by two days; and the Doctoral Consortium Student Co-Chairs Svetlana Stoyanchev, Ulrich Germann and Chirag Shah, working with faculty advisors Carolyn Rosé and Anoop Sarkar. Thanks also to Nicolas Nicolov for his efforts as NAACL HLT Sponsorship Chair, working in coordination with Sponsorship Chairs from other ACL regions. Of course, we greatly appreciate the support of our sponsors: Rosetta Stone, CNGL, Microsoft Research, Google, AT&T, Language Weaver, J.D. Power, IBM Research, the Linguistic Data Consortium, the Human Language Technology Center of Excellence at the Johns Hopkins University, and the Computational Language and Education Research Center at the University of Colorado at Boulder. In organizing this conference, we have had a lot of support from the NAACL Board and the HLT Advisory Board. I would particularly like to thank Owen Rambow, Jennifer Chu-Carroll, Chris Manning and Graeme Hirst for their help and advice. Last, but certainly not least, we are indebted to Priscilla Rasmussen for her expertise and support in running the conference. Mari Ostendorf, University of Washington

Preface: Program Chairs We welcome you to NAACL HLT 2009! The NAACL HLT program continues to include high-quality work in the areas of computational linguistics, information retrieval, and speech technology. This year, 260 full papers were submitted, of which 75 were accepted (a 29% acceptance rate), and 178 short papers were submitted, of which 71 were accepted (a 40% acceptance rate). Two best paper awards were given at the conference, to "Unsupervised Morphological Segmentation with Log-Linear Models" by Hoifung Poon, Colin Cherry and Kristina Toutanova (this paper also received the best student paper award), and "11,001 New Features for Statistical Machine Translation" by David Chiang, Kevin Knight and Wei Wang. The senior program committee members for the conference nominated an initial set of papers that were candidates for the awards; the final decisions were then made by a committee chaired by Candace Sidner, with Hal Daumé III, Roland Kuhn, Ryan McDonald, and Mark Steedman as its other members. We would like to congratulate the authors, and thank the committee for their work in choosing these papers. NAACL HLT 2009 consists of oral presentations of all full papers, oral or poster presentations of short papers, and tutorials and software demonstrations. We are delighted to have two keynote speakers: Antonio Torralba, with a talk "Understanding Visual Scenes", and Dan Jurafsky, with a talk "The Language of Food". In addition, we have a panel on emerging application areas in computational linguistics, chaired by Bill Dolan. We would like to thank the authors for submitting a remarkable set of papers to the conference. The review process was organized through a two-tier system, with eighteen senior program committee (SPC) members and 352 reviewers.
The SPC members managed the review process for both the full and short paper submissions: each full paper received at least three reviews, and each short paper received at least two reviews. We are thoroughly indebted to the reviewers for all their work, and to the SPC members for the long hours they spent in evaluating the submissions. In addition, we would like to thank Rich Gerber and the START team for their help with the system that managed paper submissions and reviews; the local arrangements chairs, James Martin and Martha Palmer, for their help with organizing the program; and the publications chairs, Christy Doran and Eric Ringger, for putting together these proceedings. Finally, we are incredibly grateful to the general chair, Mari Ostendorf, for the invaluable advice and support that she provided throughout every step of the process. We hope that you enjoy the conference! Michael Collins, Massachusetts Institute of Technology Shri Narayanan, University of Southern California Douglas W. Oard, University of Maryland Lucy Vanderwende, Microsoft Research

Organizers General Chair: Mari Ostendorf, University of Washington Local Arrangements: James Martin, University of Colorado Martha Palmer, University of Colorado Program Committee Chairs: Michael Collins, Massachusetts Institute of Technology Shri Narayanan, University of Southern California Douglas W. Oard, University of Maryland Lucy Vanderwende, Microsoft Research Publicity Chairs: Matthew Stone, Rutgers University Gokhan Tur, SRI International Diana Inkpen, University of Ottawa Publications Chairs: Christy Doran, MITRE Eric Ringger, Brigham Young University Tutorials Chairs: Ciprian Chelba, Google Paul Kantor, Rutgers University Brian Roark, Oregon Health and Science University Workshops Chairs: Nizar Habash, Columbia University Mark Hasegawa-Johnson, University of Illinois Doctoral Consortium Organizers: Carolyn Rosé, Faculty Chair, CMU Anoop Sarkar, Faculty Chair, Simon Fraser University Svetlana Stoyanchev, Student Co-Chair, Stony Brook University Ulrich Germann, Student Co-Chair, University of Toronto Chirag Shah, Student Co-Chair, University of North Carolina Demo Chairs: Fred Popowich, Simon Fraser University Michael Johnston, AT&T Sponsorship Committee: Nicolas Nicolov (Local Chair) Hitoshi Isahara and Kim-Teng Lua (Asian ACL Representatives) Philipp Koehn and Josef van Genabith (European ACL Representatives) Srinivas Bangalore and Christy Doran (American ACL Representatives)

Program Committee Senior Program Committee Members: Michiel Bacchiani, Google Regina Barzilay, Massachusetts Institute of Technology Kenneth W. Church, Microsoft Research Charles L. A. Clarke, University of Waterloo Eric Fosler-Lussier, Ohio State University Sharon Goldwater, University of Edinburgh Julia Hirschberg, Columbia University Jimmy Huang, York University Mark Johnson, Brown University Philipp Koehn, University of Edinburgh Roland Kuhn, National Research Council of Canada, IIT Gina-Anne Levow, University of Manchester Dekang Lin, Google Ryan McDonald, Google Premkumar Natarajan, BBN Technologies Patrick Pantel, Yahoo! Labs Kristina Toutanova, Microsoft Research Geoff Zweig, Microsoft Research Paper Award Committee: Candace Sidner, Chair, BAE Systems AIT Hal Daumé III, University of Utah Roland Kuhn, NRC Institute for Information Technology Ryan McDonald, Google Inc.
Mark Steedman, University of Edinburgh Program Committee Members: Stephen Abney Meni Adler Eugene Agichtein Eneko Agirre Lars Ahrenberg Adam Albright Enrique Alfonseca Afra Alishahi Sophia Ananiadou Shankar Ananthakrishnan Bill Andreopoulos Galen Andrew ix Walter Andrews Masayuki Asahara Necip Fazil Ayan Mark Baillie Timothy Baldwin Roberto Basili Ron Bekkerman Sabine Bergler Shane Bergsma Rahul Bhagat Dan Bikel Mikhail Bilenko Alexandra Birch Alan Black Sasha Blair-Goldensohn John Blitzer Paul Boersma Johan Bos Alexandre Bouchard-C^ t´ oe S.R.K. Branavan Chris Brew Ted Briscoe Chris Brockett Stefan Buettcher Razvan Bunescu Jill Burstein Cory Butz William Byrne Chris Callison-Burch Claire Cardie Giuseppe Carenini Marine Carpuat Xavier Carreras Francisco Casacuberta Joyce Chai Yllias Chali Nate Chambers Jason Chang Eugene Charniak Ciprian Chelba Harr Chen Colin Cherry David Chiang Tat-Seng Chua Grace Chung Massimiliano Ciaramita Stephen Clark Peter Clark Mark Craven Mathias Creutz Aron Culotta James Cussens Robert Dale Cristian Danescu Niculescu-Mizil Hal Daum´ III e Guy De Pauw John DeNero Barbara Di Eugenio x Mona Diab Bill Dolan Christy Doran Doug Downey Mark Dredze Markus Dreyer Rebecca Dridan Kevin Duh Chris Dyer Andreas Eisele Jacob Eisenstein Jason Eisner Michael Elhadad Noemie Elhadad Mark Ellison Micha Elsner Dominique Estival Oren Etzioni Hui Fang Marcello Federico Paolo Ferragina Jenny Finkel Erin Fitzgerald Radu Florian George Foster Dayne Freitag Pascale Fung Robert Gaizauskas Michael Gamon Kuzman Ganchev Jianfeng Gao Claire Gardent Stuart Geman Ulrich Germann Shlomo Geva Mazin Gilbert Daniel Gildea Jesus Gimenez Roxana Girju Randy Goebel John Goldsmith Ralph Grishman Asela Gunawardana Gholamreza Haffari Aria Haghighi Udo Hahn Dilek Hakkani-T¨ r u Keith Hall Hyoil Han Mary Harper Saa Hasan s Mark Hasegawa-Johnson Timothy J. 
Hazen Xiaodong He William Headden Peter Heeman James Henderson Iris Hendrickx Graeme Hirst Hieu Hoang Kristy Hollingshead Mark Hopkins Vronique Hoste Chu-Ren Huang Liang Huang Rebecca Hwa Diana Inkpen Abe Ittycheriah Gaja Jarosz Heng Ji Richard Johansson Howard Johnson Rie Johnson Doug Jones Gareth Jones Aravind Joshi Min-Yen Kan Chia-lin Kao Nikiforos Karamanis Rohit Kate Vlado Keselj Shahram Khadivi Sanjeev Khudanpur Adam Kilgarriff Jin-Dong Kim Owen Kimball Dan Klein Kevin Knight Mamoru Komachi Grzegorz Kondrak Terry Koo Anna Korhonen xi Kimmo Koskenniemi Emiel Krahmer Jonas Kuhn Shankar Kumar Christian K¨ nig o Philippe Langlais Mirella Lapata Alex Lascarides Alon Lavie Claudia Leacock Lillian Lee Yoong Keok Lee James Lester Gregor Leusch Roger Levy David Lewis Wei Li Xiao Li Haizhou Li Hang Li Ping Li Percy Liang Hank Liao Jimmy Lin Chin-Yew Lin Bing Liu Yang Liu Tie-Yan Liu Andrej Ljolje Adam Lopez Alex Lopez-Ortiz Bill MacCartney Nitin Madnani Bernardo Magnini Jonathan Mamou Suresh Manandhar Lidia Mangu Gideon Mann Chris Manning Daniel Marcu Evgeny Matusov Arne Mauser David McAllester Andrew McCallum Diana McCarthy David McClosky Kathy McCoy Kathleen McKeown Susan McRoy Qiaozhu Mei Paola Merlo Rada Mihalcea Yusuke Miyao Saif Mohammad Dan Moldovan Bob Moore Richard Moot Pedro Moreno Dragos Munteanu Smaranda Muresan Muthu Muthukrishnan Tetsuji Nakagawa Preslav Nakov Ani Nenkova Hermann Ney Hwee Tou Ng Vincent Ng Raymond Ng Patrick Nguyen Jian-Yun Nie Joakim Nivre Franz Och Kemal Oflazer Scott Olsson Luca Onnis Miles Osborne Tim Paek Bo Pang Marius Pasca Rebecca Passonneau Matthias Paulik Ted Pedersen Marco Pennacchiotti Mati Pentus Amy Perfors Slav Petrov Joseph Picone Janet Pierrehumbert Livia Polanyi Hoifung Poon Ana-Maria Popescu Maja Popovic xii Fred Popowich John Prager Rohit Prasad Partha Pratim Talukdar Matthew Purver Chris Quirk Drago Radev Rajat Raina Daniel Ramage Owen Rambow Vivek Kumar Rangarajan Sridhar Deepak Ravichandran Stefan Riezler Ellen Riloff Eric Ringger Brian Roark Barbara Rosario Dan Roth Alex Rudnicky Marta Ruiz Anton Rytting Kenji Sagae Johan Schalkwyk David Schlangen Tanja Schultz Petr Schwarz Holger Schwenk Satoshi Sekine Mike Seltzer Stephanie Seneff Wade Shen Stuart Shieber Luo Si Michel Simard Olivier Siohan Kevin Small David Smith Noah Smith Mark Smucker Rion Snow Ben Snyder Radu Soricut Richard Sproat Amit Srivastava David Stallard Mark Steedman Mark Stevenson Michael Strube Amarnag Subramanya Torsten Suel Eiichiro Sumita Charles Sutton David Talbot Ben Taskar Yee Whye Teh Simone Teufel Joerg Tiedemann Christoph Tillmann Ivan Titov Isabel Trancoso David Traum Andrew Trotman Peter Turney Nicola Ueffing Jay Urbain Antal van den Bosch Benjamin van Durme Olga Vechtomova Dimitra Vergyri Evelyne Viegas David Vilar Ye-Yi Wang Qin Wang Nigel Ward Taro Watanabe Bonnie Webber MIchael White Richard Wicentowski Jason Williams Shuly Wintner Dekai Wu Mingfang Wu Peng Xu Roman Yangarber Alex Yates Zheng Ye Scott Wen-tau Yih Chen Yu Dong Yu Fabio Massimo Zanzotto Richard Zens Luke Zettlemoyer Hao Zhang Ming Zhou Wei Zhou Bowen Zhou Jerry Zhu Jianhan Zhu Andreas Zollmann xiii Table of Contents Subjectivity Recognition on Word Senses via Semi-supervised Mincuts Fangzhong Su and Katja Markert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Integrating Knowledge for Subjectivity Sense Labeling Yaw Gyamfi, Janyce Wiebe, Rada Mihalcea and Cem Akkaya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. 10 A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca and Aitor Soroa . 19 A Fully Unsupervised Word Sense Disambiguation Method Using Dependency Knowledge Ping Chen, Wei Ding, Chris Bowes and David Brown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Learning Phoneme Mappings for Transliteration without Parallel Data Sujith Ravi and Kevin Knight . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 A Corpus-Based Approach for the Prediction of Language Impairment in Monolingual English and Spanish-English Bilingual Children Keyur Gabani, Melissa Sherman, Thamar Solorio, Yang Liu, Lisa Bedore and Elizabeth Pe~ a . 46 n A Discriminative Latent Variable Chinese Segmenter with Hybrid Word/Character Information Xu Sun, Yaozhong Zhang, Takuya Matsuzaki, Yoshimasa Tsuruoka and Jun'ichi Tsujii . . . . . . . 56 Improved Reconstruction of Protolanguage Word Forms Alexandre Bouchard-C^ t´ , Thomas L. Griffiths and Dan Klein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 oe Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction Shay Cohen and Noah A. Smith . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Adding More Languages Improves Unsupervised Multilingual Part-of-Speech Tagging: a Bayesian Non-Parametric Approach Benjamin Snyder, Tahira Naseem, Jacob Eisenstein and Regina Barzilay . . . . . . . . . . . . . . . . . . . . 83 Efficiently Parsable Extensions to Tree-Local Multicomponent TAG Rebecca Nesson and Stuart Shieber. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .92 Improving Unsupervised Dependency Parsing with Richer Contexts and Smoothing William P. Headden III, Mark Johnson and David McClosky . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Context-Dependent Alignment Models for Statistical Machine Translation Jamie Brunning, Adri` de Gispert and William Byrne . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 a Graph-based Learning for Statistical Machine Translation Andrei Alexandrescu and Katrin Kirchhoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 Intersecting Multilingual Data for Faster and Better Statistical Translations Yu Chen, Martin Kay and Andreas Eisele . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 xv Without a 'doubt'? Unsupervised Discovery of Downward-Entailing Operators Cristian Danescu-Niculescu-Mizil, Lillian Lee and Richard Ducott . . . . . . . . . . . . . . . . . . . . . . . . 137 The Role of Implicit Argumentation in Nominal SRL Matthew Gerber, Joyce Chai and Adam Meyers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 Jointly Identifying Predicates, Arguments and Senses using Markov Logic Ivan Meza-Ruiz and Sebastian Riedel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 Structured Generative Models for Unsupervised Named-Entity Clustering Micha Elsner, Eugene Charniak and Mark Johnson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
164 Hierarchical Dirichlet Trees for Information Retrieval Gholamreza Haffari and Yee Whye Teh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 Phrase-Based Query Degradation Modeling for Vocabulary-Independent Ranked Utterance Retrieval J. Scott Olsson and Douglas W. Oard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 Japanese Query Alteration Based on Lexical Semantic Similarity Masato Hagiwara and Hisami Suzuki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 Context-based Message Expansion for Disentanglement of Interleaved Text Conversations Lidan Wang and Douglas W. Oard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200 Unsupervised Morphological Segmentation with Log-Linear Models Hoifung Poon, Colin Cherry and Kristina Toutanova . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 11,001 New Features for Statistical Machine Translation David Chiang, Kevin Knight and Wei Wang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 Efficient Parsing for Transducer Grammars John DeNero, Mohit Bansal, Adam Pauls and Dan Klein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 Preference Grammars: Softening Syntactic Constraints to Improve Statistical Machine Translation Ashish Venugopal, Andreas Zollmann, Noah A. Smith and Stephan Vogel . . . . . . . . . . . . . . . . . . 236 Using a Dependency Parser to Improve SMT for Subject-Object-Verb Languages Peng Xu, Jaeho Kang, Michael Ringgaard and Franz Och . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245 Learning Bilingual Linguistic Reordering Model for Statistical Machine Translation Han-Bin Chen, Jian-Cheng Wu and Jason S. Chang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254 May All Your Wishes Come True: A Study of Wishes and How to Recognize Them Andrew B. Goldberg, Nathanael Fillmore, David Andrzejewski, Zhiting Xu, Bryan Gibson and Xiaojin Zhu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263 Predicting Risk from Financial Reports with Regression Shimon Kogan, Dimitry Levin, Bryan R. Routledge, Jacob S. Sagi and Noah A. Smith . . . . . . 272 xvi Domain Adaptation with Latent Semantic Association for Named Entity Recognition Honglei Guo, Huijia Zhu, Zhili Guo, Xiaoxun Zhang, Xian Wu and Zhong Su . . . . . . . . . . . . . . 281 Semi-Automatic Entity Set Refinement Vishnu Vyas and Patrick Pantel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290 Unsupervised Constraint Driven Learning For Transliteration Discovery Ming-Wei Chang, Dan Goldwasser, Dan Roth and Yuancheng Tu . . . . . . . . . . . . . . . . . . . . . . . . . 299 On the Syllabification of Phonemes Susan Bartlett, Grzegorz Kondrak and Colin Cherry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308 Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars Mark Johnson and Sharon Goldwater . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
317 Joint Parsing and Named Entity Recognition Jenny Rose Finkel and Christopher D. Manning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326 Minimal-length linearizations for mildly context-sensitive dependency trees Y. Albert Park and Roger Levy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335 Positive Results for Parsing with a Bounded Stack using a Model-Based Right-Corner Transform William Schuler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344 Hierarchical Text Segmentation from Multi-Scale Lexical Cohesion Jacob Eisenstein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353 Exploring Content Models for Multi-Document Summarization Aria Haghighi and Lucy Vanderwende . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362 Global Models of Document Structure using Latent Permutations Harr Chen, S.R.K. Branavan, Regina Barzilay and David R. Karger. . . . . . . . . . . . . . . . . . . . . . . .371 Assessing and Improving the Performance of Speech Recognition for Incremental Systems Timo Baumann, Michaela Atterer and David Schlangen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380 Geo-Centric Language Models for Local Business Voice Search Amanda Stent, Ilija Zeljkovic, Diamantino Caseiro and Jay Wilpon . . . . . . . . . . . . . . . . . . . . . . . . 389 Improving the Arabic Pronunciation Dictionary for Phone and Word Recognition with LinguisticallyBased Pronunciation Rules Fadi Biadsy, Nizar Habash and Julia Hirschberg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397 Using a maximum entropy model to build segmentation lattices for MT Chris Dyer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406 Active Learning for Statistical Phrase-based Machine Translation Gholamreza Haffari, Maxim Roy and Anoop Sarkar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415 xvii Semi-Supervised Lexicon Mining from Parenthetical Expressions in Monolingual Web Pages Xianchao Wu, Naoaki Okazaki and Jun'ichi Tsujii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424 Hierarchical Phrase-Based Translation with Weighted Finite State Transducers Gonzalo Iglesias, Adri` de Gispert, Eduardo R. Banga and William Byrne . . . . . . . . . . . . . . . . . 433 a Improved pronunciation features for construct-driven assessment of non-native spontaneous speech Lei Chen, Klaus Zechner and Xiaoming Xi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442 Performance Prediction for Exponential Language Models Stanley Chen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450 Tied-Mixture Language Modeling in Continuous Space Ruhi Sarikaya, Mohamed Afify and Brian Kingsbury . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459 Shrinking Exponential Language Models Stanley Chen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
468 Predicting Response to Political Blog Posts with Topic Models Tae Yano, William W. Cohen and Noah A. Smith . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477 An Iterative Reinforcement Approach for Fine-Grained Opinion Mining Weifu Du and Songbo Tan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486 For a few dollars less: Identifying review pages sans human labels Luciano Barbosa, Ravi Kumar, Bo Pang and Andrew Tomkins . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494 More than Words: Syntactic Packaging and Implicit Sentiment Stephan Greene and Philip Resnik. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .503 Streaming for large scale NLP: Language Modeling Amit Goyal, Hal Daume III and Suresh Venkatasubramanian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512 The Effect of Corpus Size on Case Frame Acquisition for Discourse Analysis Ryohei Sasano, Daisuke Kawahara and Sadao Kurohashi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521 Semantic-based Estimation of Term Informativeness Kirill Kireyev . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530 Optimal Reduction of Rule Length in Linear Context-Free Rewriting Systems Carlos G´ mez-Rodr´guez, Marco Kuhlmann, Giorgio Satta and David Weir . . . . . . . . . . . . . . . . 539 o i Inducing Compact but Accurate Tree-Substitution Grammars Trevor Cohn, Sharon Goldwater and Phil Blunsom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548 Hierarchical Search for Parsing Adam Pauls and Dan Klein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557 An effective Discourse Parser that uses Rich Linguistic Information Rajen Subba and Barbara Di Eugenio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566 xviii Graph-Cut-Based Anaphoricity Determination for Coreference Resolution Vincent Ng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575 Using Citations to Generate surveys of Scientific Paradigms Saif Mohammad, Bonnie Dorr, Melissa Egan, Ahmed Hassan, Pradeep Muthukrishan, Vahed Qazvinian, Dragomir Radev and David Zajic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584 Non-Parametric Bayesian Areal Linguistics Hal Daume III . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593 Hierarchical Bayesian Domain Adaptation Jenny Rose Finkel and Christopher D. Manning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602 Online EM for Unsupervised Models Percy Liang and Dan Klein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611 Unsupervised Approaches for Automatic Keyword Extraction Using Meeting Transcripts Feifan Liu, Deana Pennell, Fei Liu and Yang Liu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620 A Finite-State Turn-Taking Model for Spoken Dialog Systems Antoine Raux and Maxine Eskenazi . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629 Extracting Social Meaning: Identifying Interactional Style in Spoken Conversation Dan Jurafsky, Rajesh Ranganath and Dan McFarland. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .638 Linear Complexity Context-Free Parsing Pipelines via Chart Constraints Brian Roark and Kristy Hollingshead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647 Improved Syntactic Models for Parsing Speech with Repairs Tim Miller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656 A model of local coherence effects in human sentence processing as consequences of updates from bottom-up prior to posterior beliefs Klinton Bicknell and Roger Levy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 665 xix Conference Program Overview Monday, June 1, 2009 9:00­10:10 10:40­11:20 Plenary Session ­ Invited Talk by Antonio Torralba: Understanding Visual Scenes Session 1A: Semantics Session 1B: Multilingual Processing / Morphology and Phonology Session 1C: Syntax and Parsing Student Research Workshop Session 1 Short Paper Presentations: Session 2A: Machine Translation Session 2B: Information Retrieval / Information Extraction / Sentiment Session 2C: Dialog / Speech / Semantics Student Research Workshop Session 2 Session 3A: Machine Translation Session 3B: Semantics Session 3C: Information Retrieval Student Research Workshop Session 3 Poster and Demo Session Student Research Workshop Poster Session 2:00­3:30 4:00­5:40 6:30­9:30 Tuesday, June 2, 2009 9:00-10:10 10:10­11:40 Plenary Session: Paper Award Presentations Session 4A: Machine Translation Session 4B: Sentiment Analysis / Information Extraction Session 4C: Machine Learning / Morphology and Phonology Short Paper Presentations: Session 5A: Machine Translation / Generation / Semantics Session 5B: Machine Learning / Syntax Session 5C: SPECIAL SESSION ­ Speech Indexing and Retrieval Session 6A: Syntax and Parsing Session 6B: Discourse and Summarization Session 6C: Spoken Language Systems 2:00­3:30 4:00­5:15 xxi Wednesday, June 3, 2009 9:00­10:10 Plenary Session ­ Invited Talk by Dan Jurafsky: Ketchup, Espresso, and Chocolate Chip Cookies: Travels in the Language of Food Session 7A: Machine Translation Session 7B: Speech Recognition and Language Modeling Session 7C: Sentiment Analysis Panel Discussion: Emerging Application Areas in Computational Linguistics NAACL Business Meeting Session 8A: Large-scale NLP Session 8B: Syntax and Parsing Session 8C: Discourse and Summarization Session 9A: Machine Learning Session 9B: Dialog Systems Session 9C: Syntax and Parsing 10:40­12:20 12:40-1:40 1:40­2:30 2:30­3:45 4:15­5:30 xxii Conference Program Monday, June 1, 2009 Plenary Session 9:00­10:10 Welcome and Invited Talk: Understanding Visual Scenes Antonio Torralba Break Session 1A: Semantics 10:40­11:05 Subjectivity Recognition on Word Senses via Semi-supervised Mincuts Fangzhong Su and Katja Markert Integrating Knowledge for Subjectivity Sense Labeling Yaw Gyamfi, Janyce Wiebe, Rada Mihalcea and Cem Akkaya A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca and Aitor Soroa A Fully Unsupervised Word Sense Disambiguation Method Using Dependency Knowledge Ping Chen, 
Wei Ding, Chris Bowes and David Brown Session 1B: Multilingual Processing / Morphology and Phonology 10:40­11:05 Learning Phoneme Mappings for Transliteration without Parallel Data Sujith Ravi and Kevin Knight A Corpus-Based Approach for the Prediction of Language Impairment in Monolingual English and Spanish-English Bilingual Children Keyur Gabani, Melissa Sherman, Thamar Solorio, Yang Liu, Lisa Bedore and Elizabeth Pe~ a n A Discriminative Latent Variable Chinese Segmenter with Hybrid Word/Character Information Xu Sun, Yaozhong Zhang, Takuya Matsuzaki, Yoshimasa Tsuruoka and Jun'ichi Tsujii Improved Reconstruction of Protolanguage Word Forms Alexandre Bouchard-C^ t´ , Thomas L. Griffiths and Dan Klein oe 10:10­10:40 11:05­11:30 11:30­11:55 11:55­12:20 11:05­11:30 11:30­11:55 11:55­12:20 xxiii Monday, June 1, 2009 (continued) Session 1C: Syntax and Parsing 10:40­11:05 Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction Shay Cohen and Noah A. Smith Adding More Languages Improves Unsupervised Multilingual Part-of-Speech Tagging: a Bayesian Non-Parametric Approach Benjamin Snyder, Tahira Naseem, Jacob Eisenstein and Regina Barzilay Efficiently Parsable Extensions to Tree-Local Multicomponent TAG Rebecca Nesson and Stuart Shieber Improving Unsupervised Dependency Parsing with Richer Contexts and Smoothing William P. Headden III, Mark Johnson and David McClosky Student Research Workshop Session 1: Note: all student research workshop papers are located in the Companion volume of the proceedings 10:40­11:10 Classifier Combination Techniques Applied to Coreference Resolution Smita Vemulapalli, Xiaoqiang Luo, John F. Pitrelli and Imed Zitouni Solving the "Who's Mark Johnson Puzzle": Information Extraction Based Cross Document Coreference Jian Huang, Sarah M. Taylor, Jonathan L. Smith, Konstantinos A. Fotiadis and C. 
Lee Giles Exploring Topic Continuation Follow-up Questions using Machine Learning Manuel Kirschner and Raffaella Bernardi Lunch Break 11:05­11:30 11:30­11:55 11:55­12:20 11:15­11:45 11:50­12:20 12:20­2:00 xxiv Monday, June 1, 2009 (continued) Session 2A: Short Paper Presentations: Machine Translation Note: all short papers are located in the Companion volume of the proceedings 2:00­2:15 Cohesive Constraints in A Beam Search Phrase-based Decoder Nguyen Bach, Stephan Vogel and Colin Cherry Revisiting Optimal Decoding for IBM Machine Translation Model 4 James Clarke and Sebastian Riedel Efficient Extraction of Oracle-best Translations from Hypergraphs Zhifei Li and Sanjeev Khudanpur Semantic Roles for SMT: A Hybrid Two-Pass Model Dekai Wu and Pascale Fung Comparison of Extended Lexicon Models in Search and Rescoring for SMT Saa Hasan and Hermann Ney s Simplex Armijo Downhill Algorithm for Optimizing Statistical Machine Translation System Parameters Bing Zhao and Shengyuan Chen Session 2B: Short Paper Presentations: Information Retrieval / Information Extraction / Sentiment Note: all short papers are located in the Companion volume of the proceedings 2:00­2:15 Translation Corpus Source and Size in Bilingual Retrieval Paul McNamee, James Mayfield and Charles Nicholas Large-scale Computation of Distributional Similarities for Queries Enrique Alfonseca, Keith Hall and Silvana Hartmann Text Categorization from Category Name via Lexical Reference Libby Barak, Ido Dagan and Eyal Shnarch Identifying Types of Claims in Online Customer Reviews Shilpa Arora, Mahesh Joshi and Carolyn Rose 2:15­2:30 2:30­2:45 2:45­3:00 3:00­3:15 3:15­3:30 2:15­2:30 2:30­2:45 2:45­3:00 xxv Monday, June 1, 2009 (continued) 3:00­3:15 Towards Automatic Image Region Annotation - Image Region Textual Coreference Resolution Emilia Apostolova and Dina Demner-Fushman TESLA: A Tool for Annotating Geospatial Language Corpora Nate Blaylock, Bradley Swain and James Allen Session 2C: Short Paper Presentations: Dialog / Speech / Semantics Note: all short papers are located in the Companion volume of the proceedings 2:00­2:15 Modeling Dialogue Structure with Adjacency Pair Analysis and Hidden Markov Models Kristy Elizabeth Boyer, Robert Phillips, Eun Young Ha, Michael Wallis, Mladen Vouk and James Lester Towards Natural Language Understanding of Partial Speech Recognition Results in Dialogue Systems Kenji Sagae, Gwen Christian, David DeVault and David Traum Spherical Discriminant Analysis in Semi-supervised Speaker Clustering Hao Tang, Stephen Chu and Thomas Huang Learning Bayesian Networks for Semantic Frame Composition in a Spoken Dialog System Marie-Jean Meurs, Fabrice Lefvre and Renato De Mori Evaluation of a System for Noun Concepts Acquisition from Utterances about Images (SINCA) Using Daily Conversation Data Yuzu Uchida and Kenji Araki Web and Corpus Methods for Malay Count Classifier Prediction Jeremy Nicholson and Timothy Baldwin 3:15­3:30 2:15­2:30 2:30­2:45 2:45­3:00 3:00­3:15 3:15­3:30 xxvi Monday, June 1, 2009 (continued) Student Research Workshop Session 2 Note: all student research workshop papers are located in the Companion volume of the proceedings 2:00­2:30 Sentence Realisation from Bag of Words with Dependency Constraints Karthik Gali and Sriram Venkatapathy Using Language Modeling to Select Useful Annotation Data Dmitriy Dligach and Martha Palmer Break Session 3A: Machine Translation 4:00­4:25 Context-Dependent Alignment Models for Statistical Machine Translation Jamie Brunning, Adri` de Gispert and William Byrne a 
Graph-based Learning for Statistical Machine Translation Andrei Alexandrescu and Katrin Kirchhoff Intersecting Multilingual Data for Faster and Better Statistical Translations Yu Chen, Martin Kay and Andreas Eisele No Presentation Session 3B: Semantics 4:00­4:25 Without a 'doubt'? Unsupervised Discovery of Downward-Entailing Operators Cristian Danescu-Niculescu-Mizil, Lillian Lee and Richard Ducott The Role of Implicit Argumentation in Nominal SRL Matthew Gerber, Joyce Chai and Adam Meyers Jointly Identifying Predicates, Arguments and Senses using Markov Logic Ivan Meza-Ruiz and Sebastian Riedel 2:35­3:05 3:30­4:00 4:25­4:50 4:50­5:15 5:15­5:40 4:25­4:50 4:50­5:15 xxvii Monday, June 1, 2009 (continued) 5:15­5:40 Structured Generative Models for Unsupervised Named-Entity Clustering Micha Elsner, Eugene Charniak and Mark Johnson Session 3C: Information Retrieval 4:00­4:25 Hierarchical Dirichlet Trees for Information Retrieval Gholamreza Haffari and Yee Whye Teh Phrase-Based Query Degradation Modeling for Vocabulary-Independent Ranked Utterance Retrieval J. Scott Olsson and Douglas W. Oard Japanese Query Alteration Based on Lexical Semantic Similarity Masato Hagiwara and Hisami Suzuki Context-based Message Expansion for Disentanglement of Interleaved Text Conversations Lidan Wang and Douglas W. Oard Student Research Workshop Session 3 Note: all student research workshop papers are located in the Companion volume of the proceedings 4:00­4:30 Pronunciation Modeling in Spelling Correction for Writers of English as a Foreign Language Adriane Boyd Building a Semantic Lexicon of English Nouns via Bootstrapping Ting Qian, Benjamin Van Durme and Lenhart Schubert Multiple Word Alignment with Profile Hidden Markov Models Aditya Bhargava and Grzegorz Kondrak Poster and Demo Session Note: all short papers and demo abstracts are located in the Companion volume of the proceedings Minimum Bayes Risk Combination of Translation Hypotheses from Alternative Morphological Decompositions Adri de Gispert, Sami Virpioja, Mikko Kurimo and William Byrne 4:25­4:50 4:50­5:15 5:15­5:40 4:35­5:05 5:10­5:40 6:30­9:30 xxviii Monday, June 1, 2009 (continued) Generating Synthetic Children's Acoustic Models from Adult Models Andreas Hagen, Bryan Pellom and Kadri Hacioglu Detecting Pitch Accents at the Word, Syllable and Vowel Level Andrew Rosenberg and Julia Hirschberg Shallow Semantic Parsing for Spoken Language Understanding Bonaventura Coppola, Alessandro Moschitti and Giuseppe Riccardi Automatic Agenda Graph Construction from Human-Human Dialogs using Clustering Method Cheongjae Lee, Sangkeun Jung, Kyungduk Kim and Gary Geunbae Lee A Simple Sentence-Level Extraction Algorithm for Comparable Data Christoph Tillmann and Jian-ming Xu Learning Combination Features with L1 Regularization Daisuke Okanohara and Jun'ichi Tsujii Multi-scale Personalization for Voice Search Daniel Bolanos, Geoffrey Zweig and Patrick Nguyen The Importance of Sub-Utterance Prosody in Predicting Level of Certainty Heather Pon-Barry and Stuart Shieber Using Integer Linear Programming for Detecting Speech Disfluencies Kallirroi Georgila Contrastive Summarization: An Experiment with Consumer Reviews Kevin Lerman and Ryan McDonald Topic Identification Using Wikipedia Graph Centrality Kino Coursey and Rada Mihalcea Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity Kun Yu and Junichi Tsujii Domain Adaptation with Artificial Data for Semantic Parsing of Speech Lonneke van der Plas, James Henderson and Paola Merlo xxix 
Monday, June 1, 2009 (continued) Extending Pronunciation Lexicons via Non-phonemic Respellings Lucian Galescu A Speech Understanding Framework that Uses Multiple Language Models and Multiple Understanding Models Masaki Katsumaru, Mikio Nakano, Kazunori Komatani, Kotaro Funakoshi, Tetsuya Ogata and Hiroshi G. Okuno Taking into Account the Differences between Actively and Passively Acquired Data: The Case of Active Learning with Support Vector Machines for Imbalanced Datasets Michael Bloodgood and Vijay Shanker Faster MT Decoding Through Pervasive Laziness Michael Pust and Kevin Knight Evaluating the Syntactic Transformations in Gold Standard Corpora for Statistical Sentence Compression Naman K Gupta, Sourish Chaudhuri and Carolyn P Rose Incremental Adaptation of Speech-to-Speech Translation Nguyen Bach, Roger Hsiao, Matthias Eck, Paisarn Charoenpornsawat, Stephan Vogel, Tanja Schultz, Ian Lane, Alex Waibel and Alan Black Name Perplexity Octavian Popescu Answer Credibility: A Language Modeling Approach to Answer Validation Protima Banerjee and Hyoil Han Exploiting Named Entity Classes in CCG Surface Realization Rajakrishnan Rajkumar, Michael White and Dominic Espinosa Search Engine Adaptation by Feedback Control Adjustment for Time-sensitive Query Ruiqiang zhang, yi Chang, Zhaohui Zheng, Donald Metzler and Jian-yun Nie A Local Tree Alignment-based Soft Pattern Matching Approach for Information Extraction Seokhwan Kim, Minwoo Jeong and Gary Geunbae Lee Classifying Factored Genres with Part-of-Speech Histograms Sergey Feldman, Marius Marin, Julie Medero and Mari Ostendorf Towards Effective Sentence Simplification for Automatic Processing of Biomedical Text Siddhartha Jonnalagadda, Luis Tari, Jrg Hakenberg, Chitta Baral and Graciela Gonzalez xxx Monday, June 1, 2009 (continued) Improving SCL Model for Sentiment-Transfer Learning Songbo Tan and Xueqi Cheng MICA: A Probabilistic Dependency Parser Based on Tree Insertion Grammars (Application Note) Srinivas Bangalore, Pierre Boullier, Alexis Nasr, Owen Rambow and Benot Sagot Lexical and Syntactic Adaptation and Their Impact in Deployed Spoken Dialog Systems Svetlana Stoyanchev and Amanda Stent Analysing Recognition Errors in Unlimited-Vocabulary Speech Recognition Teemu Hirsim¨ ki and Mikko Kurimo a The independence of dimensions in multidimensional dialogue act annotation Volha Petukhova and Harry Bunt Improving Coreference Resolution by Using Conversational Metadata Xiaoqiang Luo, Radu Florian and Todd Ward Using N-gram based Features for Machine Translation System Combination Yong Zhao and Xiaodong He Language Specific Issue and Feature Exploration in Chinese Event Extraction Zheng Chen and Heng Ji Improving A Simple Bigram HMM Part-of-Speech Tagger by Latent Annotation and SelfTraining Zhongqiang Huang, Vladimir Eidelman and Mary Harper 6:30­9:30 Student Research Workshop Poster Session Note: all student research workshop papers are located in the Companion volume of the proceedings Also: All papers presented in the morning and afternoon sessions of the student research workshop will also be shown as posters. 
Using Emotion to Gain Rapport in a Spoken Dialog System Jaime Acosta Interactive Annotation Learning with Indirect Feature Voting Shilpa Arora and Eric Nyberg xxxi Monday, June 1, 2009 (continued) Loss-Sensitive Discriminative Training of Machine Transliteration Models Kedar Bellare, Koby Crammer and Dayne Freitag Syntactic Tree-based Relation Extraction Using a Generalization of Collins and Duffy Convolution Tree Kernel Mahdy Khayyamian, Seyed Abolghasem Mirroshandel and Hassan Abolhassani Towards Building a Competitive Opinion Summarization System: Challenges and Keys Elena Lloret, Alexandra Balahur, Manuel Palomar and Andres Montoyo Domain-Independent Shallow Sentence Ordering Thade Nahnsen Towards Unsupervised Recognition of Dialogue Acts Nicole Novielli and Carlo Strapparava Modeling Letter-to-Phoneme Conversion as a Phrase Based Statistical Machine Translation Problem with Minimum Error Rate Training Taraka Rama, Anil Kumar Singh and Sudheer Kolachina Disambiguation of Preposition Sense Using Linguistically Motivated Features Stephen Tratz and Dirk Hovy xxxii Tuesday, June 2, 2009 Plenary Session 9:00­9:10 9:10­9:40 Paper Awards Unsupervised Morphological Segmentation with Log-Linear Models Hoifung Poon, Colin Cherry and Kristina Toutanova 11,001 New Features for Statistical Machine Translation David Chiang, Kevin Knight and Wei Wang Break Session 4A: Machine Translation 10:10­10:35 Efficient Parsing for Transducer Grammars John DeNero, Mohit Bansal, Adam Pauls and Dan Klein Preference Grammars: Softening Syntactic Constraints to Improve Statistical Machine Translation Ashish Venugopal, Andreas Zollmann, Noah A. Smith and Stephan Vogel Using a Dependency Parser to Improve SMT for Subject-Object-Verb Languages Peng Xu, Jaeho Kang, Michael Ringgaard and Franz Och Learning Bilingual Linguistic Reordering Model for Statistical Machine Translation Han-Bin Chen, Jian-Cheng Wu and Jason S. Chang Session 4B: Sentiment Analysis / Information Extraction 10:10­10:35 May All Your Wishes Come True: A Study of Wishes and How to Recognize Them Andrew B. Goldberg, Nathanael Fillmore, David Andrzejewski, Zhiting Xu, Bryan Gibson and Xiaojin Zhu Predicting Risk from Financial Reports with Regression Shimon Kogan, Dimitry Levin, Bryan R. Routledge, Jacob S. Sagi and Noah A. Smith Domain Adaptation with Latent Semantic Association for Named Entity Recognition Honglei Guo, Huijia Zhu, Zhili Guo, Xiaoxun Zhang, Xian Wu and Zhong Su Semi-Automatic Entity Set Refinement Vishnu Vyas and Patrick Pantel 9:40­10:10 10:10-10:40 10:35­10:50 10:50­11:15 11:15­11:40 10:35­10:50 10:50­11:15 11:15­11:40 xxxiii Tuesday, June 2, 2009 (continued) Session 4C: Machine Learning / Morphology and Phonology 10:10­10:35 Unsupervised Constraint Driven Learning For Transliteration Discovery Ming-Wei Chang, Dan Goldwasser, Dan Roth and Yuancheng Tu On the Syllabification of Phonemes Susan Bartlett, Grzegorz Kondrak and Colin Cherry Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars Mark Johnson and Sharon Goldwater No Presentation Lunch Break Session 5A: Short Paper Presentations: Machine Translation / Generation / Semantics Note: all short papers are located in the Companion volume of the proceedings 2:00­2:15 Statistical Post-Editing of a Rule-Based Machine Translation System Antonio-L. 
Lagarda, Vicent Alabau, Francisco Casacuberta, Roberto Silva and Enrique Daz-de-Liao On the Importance of Pivot Language Selection for Statistical Machine Translation Michael Paul, Hirofumi Yamamoto, Eiichiro Sumita and Satoshi Nakamura Tree Linearization in English: Improving Language Model Based Approaches Katja Filippova and Michael Strube Determining the position of adverbial phrases in English Huayan Zhong and Amanda Stent Estimating and Exploiting the Entropy of Sense Distributions Peng Jin, Diana McCarthy, Rob Koeling and John Carroll Semantic classification with WordNet Kernels Diarmuid Saghdha 10:35­10:50 10:50­11:15 11:15­11:40 12:20­2:00 2:15­2:30 2:30­2:45 2:45­3:00 3:00­3:15 3:15­3:30 xxxiv Tuesday, June 2, 2009 (continued) Session 5B: Short Paper Presentations: Machine Learning / Syntax Note: all short papers are located in the Companion volume of the proceedings 2:00­2:15 Sentence Boundary Detection and the Problem with the U.S. Dan Gillick Quadratic Features and Deep Architectures for Chunking Joseph Turian, James Bergstra and Yoshua Bengio Active Zipfian Sampling for Statistical Parser Training Onur Cobanoglu Combining Constituent Parsers Victoria Fossum and Kevin Knight Recognising the Predicate-argument Structure of Tagalog Meladel Mistica and Timothy Baldwin Reverse Revision and Linear Tree Combination for Dependency Parsing Giuseppe Attardi and Felice Dell'Orletta Session 5C: Short Paper Presentations: SPECIAL SESSION ­ Speech Indexing and Retrieval Note: all short papers are located in the Companion volume of the proceedings 2:00­2:15 2:15­2:30 Introduction to the Special Session on Speech Indexing and Retrieval Anchored Speech Recognition for Question Answering Sibel Yaman, Gokan Tur, Dimitra Vergyri, Dilek Hakkani-Tur, Mary Harper and Wen Wang Score Distribution Based Term Specific Thresholding for Spoken Term Detection Dogan Can and Murat Saraclar Automatic Chinese Abbreviation Generation Using Conditional Random Field Dong Yang, Yi-Cheng Pan and Sadaoki Furui 2:15­2:30 2:30­2:45 2:45­3:00 3:00­3:15 3:15­3:30 2:30­2:45 2:45­3:00 xxxv Tuesday, June 2, 2009 (continued) 3:00­3:15 Fast decoding for open vocabulary spoken term detection Bhuvana Ramabhadran, Abhinav Sethy, Jonathan Mamou, Brian Kingsbury and Upendra Chaudhari Tightly coupling Speech Recognition and Search Taniya Mishra and Srinivas Bangalore Break Session 6A: Syntax and Parsing 4:00­4:25 Joint Parsing and Named Entity Recognition Jenny Rose Finkel and Christopher D. Manning Minimal-length linearizations for mildly context-sensitive dependency trees Y. Albert Park and Roger Levy Positive Results for Parsing with a Bounded Stack using a Model-Based Right-Corner Transform William Schuler Session 6B: Discourse and Summarization 4:00­4:25 Hierarchical Text Segmentation from Multi-Scale Lexical Cohesion Jacob Eisenstein Exploring Content Models for Multi-Document Summarization Aria Haghighi and Lucy Vanderwende Global Models of Document Structure using Latent Permutations Harr Chen, S.R.K. Branavan, Regina Barzilay and David R. 
Karger 3:15­3:30 3:30­4:00 4:25­4:50 4:50­5:15 4:25­4:50 4:50­5:15 xxxvi Tuesday, June 2, 2009 (continued) Session 6C: Spoken Language Systems 4:00­4:25 Assessing and Improving the Performance of Speech Recognition for Incremental Systems Timo Baumann, Michaela Atterer and David Schlangen Geo-Centric Language Models for Local Business Voice Search Amanda Stent, Ilija Zeljkovic, Diamantino Caseiro and Jay Wilpon Improving the Arabic Pronunciation Dictionary for Phone and Word Recognition with Linguistically-Based Pronunciation Rules Fadi Biadsy, Nizar Habash and Julia Hirschberg 4:25­4:50 4:50­5:15 Wednesday, June 3, 2009 Plenary Session 9:00­10:10 Invited Talk: Ketchup, Espresso, and Chocolate Chip Cookies: Travels in the Language of Food Dan Jurafsky Break Session 7A: Machine Translation 10:40­11:05 Using a maximum entropy model to build segmentation lattices for MT Chris Dyer Active Learning for Statistical Phrase-based Machine Translation Gholamreza Haffari, Maxim Roy and Anoop Sarkar Semi-Supervised Lexicon Mining from Parenthetical Expressions in Monolingual Web Pages Xianchao Wu, Naoaki Okazaki and Jun'ichi Tsujii Hierarchical Phrase-Based Translation with Weighted Finite State Transducers Gonzalo Iglesias, Adri` de Gispert, Eduardo R. Banga and William Byrne a 10:10­10:40 11:05­11:30 11:30­11:55 11:55­12:20 xxxvii Wednesday, June 3, 2009 (continued) Session 7B: Speech Recognition and Language Modeling 10:40­11:05 Improved pronunciation features for construct-driven assessment of non-native spontaneous speech Lei Chen, Klaus Zechner and Xiaoming Xi Performance Prediction for Exponential Language Models Stanley Chen Tied-Mixture Language Modeling in Continuous Space Ruhi Sarikaya, Mohamed Afify and Brian Kingsbury Shrinking Exponential Language Models Stanley Chen Session 7C: Sentiment Analysis 10:40­11:05 Predicting Response to Political Blog Posts with Topic Models Tae Yano, William W. Cohen and Noah A. 
Smith An Iterative Reinforcement Approach for Fine-Grained Opinion Mining Weifu Du and Songbo Tan For a few dollars less: Identifying review pages sans human labels Luciano Barbosa, Ravi Kumar, Bo Pang and Andrew Tomkins More than Words: Syntactic Packaging and Implicit Sentiment Stephan Greene and Philip Resnik Lunch Break Panel Discussion: Emerging Application Areas in Computational Linguistics Chaired by Bill Dolan, Microsoft Panelists: Jill Burstein, Educational Testing Service; Joel Tetreault, Educational Testing Service; Patrick Pantel, Yahoo; Andy Hickl, Language Computer Corporation + Swingly NAACL Business Meeting 11:05­11:30 11:30­11:55 11:55­12:20 11:05­11:30 11:30­11:55 11:55­12:20 12:20­1:40 12:40-1:40 1:40­2:30 xxxviii Wednesday, June 3, 2009 (continued) Session 8A: Large-scale NLP 2:30­2:55 Streaming for large scale NLP: Language Modeling Amit Goyal, Hal Daume III and Suresh Venkatasubramanian The Effect of Corpus Size on Case Frame Acquisition for Discourse Analysis Ryohei Sasano, Daisuke Kawahara and Sadao Kurohashi Semantic-based Estimation of Term Informativeness Kirill Kireyev Session 8B: Syntax and Parsing 2:30­2:55 Optimal Reduction of Rule Length in Linear Context-Free Rewriting Systems Carlos G´ mez-Rodr´guez, Marco Kuhlmann, Giorgio Satta and David Weir o i Inducing Compact but Accurate Tree-Substitution Grammars Trevor Cohn, Sharon Goldwater and Phil Blunsom Hierarchical Search for Parsing Adam Pauls and Dan Klein Session 8C: Discourse and Summarization 2:30­2:55 An effective Discourse Parser that uses Rich Linguistic Information Rajen Subba and Barbara Di Eugenio Graph-Cut-Based Anaphoricity Determination for Coreference Resolution Vincent Ng Using Citations to Generate surveys of Scientific Paradigms Saif Mohammad, Bonnie Dorr, Melissa Egan, Ahmed Hassan, Pradeep Muthukrishan, Vahed Qazvinian, Dragomir Radev and David Zajic Break 2:55­3:20 3:20­3:45 2:55­3:20 3:20­3:45 2:55­3:20 3:20­3:45 3:45­4:15 xxxix Wednesday, June 3, 2009 (continued) Session 9A: Machine Learning 4:15­4:40 Non-Parametric Bayesian Areal Linguistics Hal Daume III Hierarchical Bayesian Domain Adaptation Jenny Rose Finkel and Christopher D. Manning Online EM for Unsupervised Models Percy Liang and Dan Klein Session 9B: Dialog Systems 4:15­4:40 Unsupervised Approaches for Automatic Keyword Extraction Using Meeting Transcripts Feifan Liu, Deana Pennell, Fei Liu and Yang Liu A Finite-State Turn-Taking Model for Spoken Dialog Systems Antoine Raux and Maxine Eskenazi Extracting Social Meaning: Identifying Interactional Style in Spoken Conversation Dan Jurafsky, Rajesh Ranganath and Dan McFarland Session 9C: Syntax and Parsing 4:15­4:40 Linear Complexity Context-Free Parsing Pipelines via Chart Constraints Brian Roark and Kristy Hollingshead Improved Syntactic Models for Parsing Speech with Repairs Tim Miller A model of local coherence effects in human sentence processing as consequences of updates from bottom-up prior to posterior beliefs Klinton Bicknell and Roger Levy 4:40­5:05 5:05­5:30 4:40­5:05 5:05­5:30 4:40­5:05 5:05­5:30 xl Subjectivity Recognition on Word Senses via Semi-supervised Mincuts Fangzhong Su School of Computing University of Leeds fzsu@comp.leeds.ac.uk Katja Markert School of Computing University of Leeds markert@comp.leeds.ac.uk Abstract We supplement WordNet entries with information on the subjectivity of its word senses. 
Supervised classifiers that operate on word sense definitions in the same way that text classifiers operate on web or newspaper texts need large amounts of training data. The resulting data sparseness problem is aggravated by the fact that dictionary definitions are very short. We propose a semi-supervised minimum cut framework that makes use of both WordNet definitions and its relation structure. The experimental results show that it outperforms supervised minimum cut as well as standard supervised, non-graph classification, reducing the error rate by 40%. In addition, the semi-supervised approach achieves the same results as the supervised framework with less than 20% of the training data.

1 Introduction

There is considerable academic and commercial interest in processing subjective content in text, where subjective content refers to any expression of a private state such as an opinion or belief (Wiebe et al., 2005). Important strands of work include the identification of subjective content and the determination of its polarity, i.e. whether a favourable or unfavourable opinion is expressed. Automatic identification of subjective content often relies on word indicators, such as unigrams (Pang et al., 2002) or predetermined sentiment lexica (Wilson et al., 2005). Thus, the word positive in the sentence "This deal is a positive development for our company." gives a strong indication that the sentence contains a favourable opinion. However, such word-based indicators can be misleading for two reasons. First, contextual indicators such as irony and negation can reverse subjectivity or polarity indications (Polanyi and Zaenen, 2004). Second, different word senses of a single word can actually be of different subjectivity or polarity. A typical subjectivity-ambiguous word, i.e. a word that has at least one subjective and at least one objective sense, is positive, as shown by the two example senses given below.1

(1) positive, electropositive--having a positive electric charge; "protons are positive" (objective)
(2) plus, positive--involving advantage or good; "a plus (or positive) factor" (subjective)

1 All examples in this paper are from WordNet 2.0.

We concentrate on this latter problem by automatically creating lists of subjective senses, instead of subjective words, via adding subjectivity labels for senses to electronic lexica, using the example of WordNet. This is important as the problem of subjectivity-ambiguity is frequent: We (Su and Markert, 2008) find that over 30% of words in our dataset are subjectivity-ambiguous. Information on subjectivity of senses can also improve other tasks such as word sense disambiguation (Wiebe and Mihalcea, 2006). Moreover, Andreevskaia and Bergler (2006) show that the performance of automatic annotation of subjectivity at the word level can be hurt by the presence of subjectivity-ambiguous words in the training sets they use.

We propose a semi-supervised approach based on minimum cut in a lexical relation graph to assign subjectivity (subjective/objective) labels to word senses.2 Our algorithm outperforms supervised minimum cuts and standard supervised, non-graph classification algorithms (like SVM), reducing the error rate by up to 40%.
In addition, the semi-supervised approach achieves the same results as the supervised framework with less than 20% of the training data. Our approach also outperforms prior approaches to the subjectivity recognition of word senses and performs well across two different data sets.

2 It can be argued that subjectivity labels are maybe rather more graded than the clear-cut binary distinction we assign. However, in Su and Markert (2008a) as well as Wiebe and Mihalcea (2006) we find that humans can assign the binary distinction to word senses with a high level of reliability.

The remainder of this paper is organized as follows. Section 2 discusses previous work. Section 3 describes our proposed semi-supervised minimum cut framework in detail. Section 4 presents the experimental results and evaluation, followed by conclusions and future work in Section 5.

2 Related Work

There has been a large and diverse body of research in opinion mining, with most research at the text (Pang et al., 2002; Pang and Lee, 2004; Popescu and Etzioni, 2005; Ounis et al., 2006), sentence (Kim and Hovy, 2005; Kudo and Matsumoto, 2004; Riloff et al., 2003; Yu and Hatzivassiloglou, 2003) or word (Hatzivassiloglou and McKeown, 1997; Turney and Littman, 2003; Kim and Hovy, 2004; Takamura et al., 2005; Andreevskaia and Bergler, 2006; Kaji and Kitsuregawa, 2007) level. An up-to-date overview is given in Pang and Lee (2008).

Graph-based algorithms for classification into subjective/objective or positive/negative language units have been mostly used at the sentence and document level (Pang and Lee, 2004; Agarwal and Bhattacharyya, 2005; Thomas et al., 2006), instead of aiming at dictionary annotation as we do. We also cannot use prior graph construction methods for the document level (such as physical proximity of sentences, used in Pang and Lee (2004)) at the word sense level.

At the word level, Takamura et al. (2005) use a semi-supervised spin model for word polarity determination, where the graph is constructed using a variety of information such as gloss co-occurrences and WordNet links. Apart from using a different graph-based model from ours, they assume that subjectivity recognition has already been achieved prior to polarity recognition and test against word lists containing subjective words only. However, Kim and Hovy (2004) and Andreevskaia and Bergler (2006) show that subjectivity recognition might be the harder problem with lower human agreement and automatic performance. In addition, we deal with classification at the word sense level, treating also subjectivity-ambiguous words, which goes beyond the work in Takamura et al. (2005).

Word Sense Level: There are three prior approaches addressing word sense subjectivity or polarity classification. Esuli and Sebastiani (2006) determine the polarity (positive/negative/objective) of word senses in WordNet. However, there is no evaluation as to the accuracy of their approach. They then extend their work (Esuli and Sebastiani, 2007) by applying the PageRank algorithm to rank the WordNet senses in terms of how strongly a sense possesses a given semantic property (e.g., positive or negative). Apart from us tackling subjectivity instead of polarity, their PageRank graph is also constructed focusing on WordNet glosses (linking glosses containing the same words), whereas we concentrate on the use of WordNet relations.
Both Wiebe and Mihalcea (2006) and our prior work (Su and Markert, 2008) present an annotation scheme for word sense subjectivity and algorithms for automatic classification. Wiebe and Mihalcea (2006) use an algorithm relying on distributional similarity and an independent, large manually annotated opinion corpus (MPQA) (Wiebe et al., 2005). One of the disadvantages of their algorithm is that it is restricted to senses that have distributionally similar words in the MPQA corpus, excluding 23% of their test data from automatic classification. Su and Markert (2008) present supervised classifiers, which rely mostly on WordNet glosses and do not effectively exploit WordNet's relation structure.

3 Semi-Supervised Mincuts

3.1 Minimum Cuts: The Main Idea

Binary classification with minimum cuts (Mincuts) in graphs is based on the idea that similar items should be grouped in the same cut. All items in the training/test data are seen as vertices in a graph with undirected weighted edges between them specifying how strong the similarity/association between two vertices is. We use minimum s-t cuts: the graph contains two particular vertices s (source, corresponds to subjective) and t (sink, corresponds to objective) and each vertex u is connected to s and t via a weighted edge that can express how likely u is to be classified as s or t in isolation. Binary classification of the vertices is equivalent to splitting the graph into two disconnected subsets of all vertices, S and T, with s ∈ S and t ∈ T. This corresponds to removing a set of edges from the graph. As similar items should be in the same part of the split, the best split is one which removes edges with low weights. In other words, a minimum cut problem is to find a partition of the graph which minimizes the following formula, where w(u, v) expresses the weight of an edge between two vertices:

W(S, T) = Σ_{u ∈ S, v ∈ T} w(u, v)

Globally optimal minimum cuts can be found in polynomial time and near-linear running time in practice, using the maximum flow algorithm (Pang and Lee, 2004; Cormen et al., 2002).

3.2 Why might Semi-supervised Minimum Cuts Work?

We propose semi-supervised mincuts for subjectivity recognition on senses for several reasons. First, our problem satisfies two major conditions necessary for using minimum cuts. It is a binary classification problem (subjective vs. objective senses) as is needed to divide the graph into two components. Our dataset also lends itself naturally to s-t Mincuts as we have two different views on the data. Thus, the edges of a vertex (=sense) to the source/sink can be seen as the probability of a sense being subjective or objective without taking similarity to other senses into account, for example via considering only the sense gloss. In contrast, the edges between two senses can incorporate the WordNet relation hierarchy, which is a good source of similarity for our problem as many WordNet relations are subjectivity-preserving, i.e. if two senses are connected via such a relation they are likely to be both subjective or both objective.3 An example here is the antonym relation, where two antonyms such as good--morally admirable and evil, wicked--morally bad or wrong are both subjective.

Second, Mincuts can be easily expanded into a semi-supervised framework (Blum and Chawla, 2001). This is essential as the existing labeled datasets for our problem are small. In addition, glosses are short, leading to sparse high dimensional vectors in standard feature representations. Also, WordNet connections between different parts of the WordNet hierarchy can also be sparse, leading to relatively isolated senses in a graph in a supervised framework. Semi-supervised Mincuts allow us to import unlabeled data that can serve as bridges to isolated components. More importantly, as the unlabeled data can be chosen to be related to the labeled and test data, they might help pull test data to the right cuts (categories).

3 See Kamps et al. (2004) for an early indication of such properties for some WordNet relations.

3.3 Formulation of Semi-supervised Mincuts

The formulation of our semi-supervised Mincut for sense subjectivity classification involves the following steps, which we later describe in more detail.

1. We define two vertices s (source) and t (sink), which correspond to the "subjective" and "objective" category, respectively. Following the definition in Blum and Chawla (2001), we call the vertices s and t classification vertices, and all other vertices (labeled, test, and unlabeled data) example vertices. Each example vertex corresponds to one WordNet sense and is connected to both s and t via a weighted edge. The latter guarantees that the graph is connected.

2. For the test and unlabeled examples, we see the edges to the classification vertices as the probability of them being subjective/objective disregarding other example vertices. We use a supervised classifier to set these edge weights. For the labeled training examples, they are connected by edges with a high constant weight to the classification vertices that they belong to.

3. WordNet relations are used to construct the edges between two example vertices. Such edges can exist between any pair of example vertices, for example between two unlabeled examples.

4. After graph construction we then employ a maximum-flow algorithm to find the minimum s-t cuts of the graph. The cut in which the source vertex s lies is classified as "subjective", and the cut in which the sink vertex t lies is "objective".
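Taken together, steps 1-4 amount to a standard maximum-flow construction. The code below is a minimal sketch of that construction under stated assumptions, not the authors' implementation: networkx is assumed as the graph library, and the inputs svm_prob_subj (per-sense subjective probabilities from the classifier of step 2) and relation_edges (weighted WordNet relation links from step 3) are hypothetical stand-ins.

```python
# Minimal sketch of steps 1-4 above. networkx is an assumption (the paper does
# not name a graph toolkit); svm_prob_subj and relation_edges are hypothetical
# stand-ins for the SVM probability estimates of step 2 and the weighted
# WordNet relation links of step 3.
import networkx as nx

def mincut_label(labeled, unlabeled_and_test, svm_prob_subj, relation_edges):
    G = nx.DiGraph()
    # Steps 1-2: connect every example vertex to the classification vertices
    # s (subjective) and t (objective).
    for sense, label in labeled.items():
        subj_w, obj_w = (5.0, 0.01) if label == "S" else (0.01, 5.0)
        G.add_edge("s", sense, capacity=subj_w)
        G.add_edge(sense, "t", capacity=obj_w)
    for sense in unlabeled_and_test:
        p = svm_prob_subj[sense]               # P(subjective) from the supervised classifier
        G.add_edge("s", sense, capacity=p)
        G.add_edge(sense, "t", capacity=1.0 - p)
    # Step 3: WordNet relation edges between example vertices; added in both
    # directions because the underlying graph is undirected.
    for u, v, w in relation_edges:
        G.add_edge(u, v, capacity=w)
        G.add_edge(v, u, capacity=w)
    # Step 4: a maximum-flow computation gives the globally optimal s-t cut;
    # senses left on the s side of the cut are labeled subjective.
    _, (s_side, _) = nx.minimum_cut(G, "s", "t")
    return {n: ("S" if n in s_side else "O") for n in G if n not in ("s", "t")}
```

Because every example vertex keeps an edge to both s and t, the graph stays connected even for senses that have no WordNet relation to any other vertex, which is exactly the role the terminal edges play in step 1.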
We now describe the above steps in more detail.

Selection of unlabeled data: Random selection of unlabeled data might hurt the performance of Mincuts, as they might not be related to any sense in our training/test data (denoted by A). Thus a basic principle is that the selected unlabeled senses should be related to the training/test data by WordNet relations. We therefore simply scan each sense in A, and collect all senses related to it via one of the WordNet relations in Table 1. All such senses that are not in A are collected in the unlabeled data set.

Weighting of edges to the classification vertices: The edge weight to s and t represents how likely it is that an example vertex is initially put in the cut in which s (subjective) or t (objective) lies. For unlabeled and test vertices, we use a supervised classifier (SVM)4 with the labeled data as training data to assign the edge weights. The SVM is also used as a baseline and its features are described in Section 4.3. As we do not wish the Mincut to reverse labels of the labeled training data, we assign a high constant weight of 5 to the edge between a labeled vertex and its corresponding classification vertex, and a low weight of 0.01 to the edge to the other classification vertex.

4 We employ LIBSVM, available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/. Linear kernel and probability estimates are used in this work.

Assigning weights to WordNet relations: We connect two vertices that are linked by one of the ten WordNet relations in Table 1 via an edge.
Not all WordNet relations we use are subjectivity-preserving to the same degree: for example, hyponyms (such as simpleton) of objective senses (such as person) do not have to be objective. However, we aim for high graph connectivity and we can assign different weights to different relations to reflect the degree to which they are subjectivity-preserving. Therefore, we experiment with two methods of weight assignment.

Method 1 (NoSL) assigns the same constant weight of 1.0 to all WordNet relations. Method 2 (SL) reflects different degrees of preserving subjectivity. To do this, we adapt an unsupervised method of generating a large noisy set of subjective and objective senses from our previous work (Su and Markert, 2008). This method uses a list of subjective words (SL)5 to classify each WordNet sense with at least two subjective words in its gloss as subjective and all other senses as objective. We then count how often two senses related via a given relation have the same or a different subjectivity label. The weight is computed by #same/(#same+#different). Results are listed in Table 1.

5 Available at http://www.cs.pitt.edu/mpqa

Table 1: Relation weights (Method 2)

Relation           #Same    #Different   Weight
Antonym             2,808       309       0.90
Similar-to          6,887     1,614       0.81
Derived-from        4,630       947       0.83
Direct-Hypernym    71,915     8,600       0.89
Direct-Hyponym     71,915     8,600       0.89
Attribute             350       109       0.76
Also-see            1,037       337       0.75
Extended-Antonym    6,917     1,651       0.81
Domain              4,387       892       0.83
Domain-member       4,387       892       0.83
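To make the Method 2 (SL) weighting concrete, the following minimal sketch reproduces the counting scheme under stated assumptions: noisy_label implements the two-subjective-words-in-the-gloss heuristic, and related_pairs is a hypothetical list of (relation, sense, sense) links extracted from WordNet. It is an illustration, not the authors' code.

```python
# Sketch of the Method 2 (SL) relation-weight estimation described above.
from collections import Counter

def noisy_label(gloss_tokens, subjective_words):
    """Label a sense 'S' if its gloss contains at least two words from the subjective list."""
    hits = sum(1 for tok in gloss_tokens if tok in subjective_words)
    return "S" if hits >= 2 else "O"

def relation_weights(related_pairs, labels):
    """related_pairs: iterable of (relation_name, sense_id_1, sense_id_2) links.
    labels: dict mapping sense ids to the noisy 'S'/'O' labels above."""
    same, diff = Counter(), Counter()
    for rel, s1, s2 in related_pairs:
        if labels[s1] == labels[s2]:
            same[rel] += 1
        else:
            diff[rel] += 1
    # weight = #same / (#same + #different), as in Table 1
    return {rel: same[rel] / (same[rel] + diff[rel])
            for rel in set(same) | set(diff)}
```

As a sanity check against Table 1, the antonym links alone would give 2,808 / (2,808 + 309) ≈ 0.90.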
Example graph: An example graph is shown in Figure 1. The three example vertices correspond to the senses religious--extremely scrupulous and conscientious; scrupulous--having scruples, arising from a sense of right and wrong, principled; and flicker, spark, glint--a momentary flash of light, respectively. The vertex "scrupulous" is unlabeled data derived from the vertex "religious" (a test item) by the relation "similar-to".

[Figure 1: Graph of Word Senses. The example vertices religious (0.24 subjective / 0.76 objective), scrupulous (0.83 / 0.17) and flicker (0.16 / 0.84) are each connected to the subjective and objective classification vertices, with a similar-to edge of weight 0.81 between religious and scrupulous.]

4 Experiments and Evaluation

4.1 Datasets

We conduct the experiments on two different gold standard datasets. One is the Micro-WNOp corpus, which is representative of the part-of-speech distribution in WordNet.6 It includes 298 words with 703 objective and 358 subjective WordNet senses. The second one is the dataset created by Wiebe and Mihalcea (2006).7 It only contains noun and verb senses, and includes 60 words with 236 objective and 92 subjective WordNet senses. As the Micro-WNOp set is larger and also contains adjective and adverb senses, we describe our results in more detail on that corpus in Sections 4.3 and 4.4. In Section 4.5, we briefly discuss results on Wiebe&Mihalcea's dataset.

6 Available at http://www.comp.leeds.ac.uk/markert/data. This dataset was first used with a different annotation scheme in Esuli and Sebastiani (2007) and we also used it in Su and Markert (2008).
7 Available at http://www.cs.pitt.edu/~wiebe/pubs/papers/goldstandard.total.acl06.

4.2 Baseline and Evaluation

We compare to a baseline that assigns the most frequent category objective to all senses, which achieves an accuracy of 66.3% and 72.0% on Micro-WNOp and Wiebe&Mihalcea's dataset respectively. We use the McNemar test at the significance level of 5% for significance statements. All evaluations are carried out by 10-fold cross-validation.

4.3 Standard Supervised Learning

We use an SVM classifier to compare our proposed semi-supervised Mincut approach to a reasonable baseline.8 Three different feature types are used.

Lexical Features (L): a bag-of-words representation of the sense glosses with stop word filtering.

Relation Features (R): First, we use two features for each of the ten WordNet relations in Table 1, describing how many relations of that type the sense has to senses in the subjective or objective part of the training set, respectively. This provides a non-graph summary of subjectivity-preserving links. Second, we manually collected a small set (denoted by SubjSet) of seven subjective verb and noun senses which are close to the root in WordNet's hypernym tree. A typical example element of SubjSet is psychological feature--a feature of the mental life of a living organism, which indicates subjectivity for its hyponyms such as hope--the general feeling that some desire will be fulfilled. A binary feature describes whether a noun/verb sense is a hyponym of an element of SubjSet.

Monosemous Feature (M): for each sense, we scan if a monosemous word is part of its synset. If so, we further check if the monosemous word is collected in the subjective word list (SL). The intuition is that if a monosemous word is subjective, obviously its (single) sense is subjective. For example, the sense uncompromising, inflexible--not making concessions is subjective, as "uncompromising" is a monosemous word and also in SL.

We experiment with different combinations of features and the results are listed in Table 2, prefixed by "SVM". All combinations perform significantly better than the more frequent category baseline and similarly to the supervised Naive Bayes classifier (see S&M in Table 2) we used in Su and Markert (2008). However, improvements by adding more features remain small. In addition, we compare to a supervised classifier (see Lesk in Table 2) that just assigns each sense the subjectivity label of its most similar sense in the training data, using Lesk's similarity measure from Pedersen's WordNet similarity package.9 We use Lesk as it is one of the few measures applicable across all parts-of-speech.

8 This SVM is also used to provide the edge weights to the classification vertices in the Mincut approach.
9 Available at http://www.d.umn.edu/~tpederse/similarity.html.

Table 2: Results of SVM and Mincuts with different feature settings

Method      Subj-P   Subj-R   Subj-F   Obj-P   Obj-R   Obj-F   Accuracy
Baseline     N/A      0        N/A     66.3%   100%    79.7%   66.3%
S&M          66.2%    64.5%    65.3%   82.2%   83.2%   82.7%   76.9%
Lesk         65.6%    50.3%    56.9%   77.5%   86.6%   81.8%   74.4%
SVM-L        69.6%    37.7%    48.9%   74.3%   91.6%   82.0%   73.4%
L-SL         82.0%    43.3%    56.7%   76.7%   95.2%   85.0%   77.7%
L-NoSL       80.8%    43.6%    56.6%   76.7%   94.7%   84.8%   77.5%
SVM-LM       68.9%    42.2%    52.3%   75.4%   90.3%   82.2%   74.1%
LM-SL        83.2%    44.4%    57.9%   77.1%   95.4%   85.3%   78.2%
LM-NoSL      83.6%    44.1%    57.8%   77.1%   95.6%   85.3%   78.2%
SVM-LR       68.4%    45.3%    54.5%   76.2%   89.3%   82.3%   74.5%
LR-SL        82.7%    65.4%    73.0%   84.1%   93.0%   88.3%   83.7%
LR-NoSL      82.4%    65.4%    72.9%   84.0%   92.9%   88.2%   83.6%
SVM-LRM      69.8%    47.2%    56.3%   76.9%   89.6%   82.8%   75.3%
LRM-SL       85.5%    65.6%    74.2%   84.4%   94.3%   89.1%   84.6%
LRM-NoSL     84.6%    65.9%    74.1%   84.4%   93.9%   88.9%   84.4%

Notes: Subj and Obj columns give precision (P), recall (R) and F-score (F) for the subjective and objective class. L, R and M correspond to the lexical, relation and monosemous features respectively. SVM-L corresponds to using lexical features only for the SVM classifier. Likewise, SVM-LRM corresponds to using a combination of lexical, relation, and monosemous features for the SVM classifier. L-SL corresponds to the Mincut that uses only lexical features for the SVM classifier, and the subjective list (SL) to infer the weight of WordNet relations. Likewise, LM-NoSL corresponds to the Mincut algorithm that uses lexical and monosemous features for the SVM, and predefined constants for WordNet relations (without subjective list).
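As a concrete illustration of one of these feature types, here is a minimal sketch of the monosemous feature (M). NLTK's WordNet interface is an assumption made purely for illustration (the paper works with WordNet 2.0 and does not prescribe a toolkit), and subjective_list is a hypothetical stand-in for the subjective word list (SL).

```python
# Minimal sketch of the monosemous feature (M); NLTK's WordNet interface is an
# assumption, and subjective_list is a hypothetical stand-in for the list SL.
from nltk.corpus import wordnet as wn

def monosemous_feature(synset, subjective_list):
    """Return 1 if the synset contains a monosemous word found in the subjective list."""
    for lemma in synset.lemmas():
        word = lemma.name()
        # A word is monosemous if it belongs to exactly one synset.
        if len(wn.synsets(word)) == 1 and word in subjective_list:
            return 1
    return 0
```

Following the intuition in the text, a synset containing a word such as "uncompromising", which is both monosemous and on the subjective list, would receive the feature value 1.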
4.4 Semi-supervised Graph Mincuts

Using our formulation in Section 3.3, we import 3,220 senses linked by the ten WordNet relations to any senses in Micro-WNOp as unlabeled data. We construct edge weights to classification vertices using the SVM discussed above and use WordNet relations for links between example vertices, weighted by either constants (NoSL) or via the method illustrated in Table 1 (SL). The results are also summarized in Table 2. Semi-supervised Mincuts always significantly outperform the corresponding SVM classifiers, regardless of whether the subjectivity list is used for setting edge weights. We can also see that we achieve good results without using any other knowledge sources (setting LR-NoSL).

The example in Figure 1 explains why semi-supervised Mincuts outperforms the supervised approach. The vertex "religious" is initially assigned the subjective/objective probabilities 0.24/0.76 by the SVM classifier, leading to a wrong classification. However, in our graph-based Mincut framework, the vertex "religious" might link to other vertices (for example, it links to the vertex "scrupulous" in the unlabeled data by the relation "similar-to"). The mincut algorithm will put the vertices "religious" and "scrupulous" in the same cut (subjective category) as this results in the least cost of 0.93: cutting the two edges to the objective vertex costs 0.76 + 0.17 = 0.93, whereas placing both senses in the objective cut would cost 0.24 + 0.83 = 1.07, and separating them 0.24 + 0.17 + 0.81 = 1.22 (ignoring the cost of assigning the unrelated sense of "flicker"). In other words, the edges between the vertices are likely to correct some initially wrong classification and pull the vertices into the right cuts.

In the following we will analyze the best minimum cut algorithm LRM-SL in more detail. We measure its accuracy for each part-of-speech in the Micro-WNOp dataset. The number of noun, adjective, adverb and verb senses in Micro-WNOp is 484, 265, 31 and 281, respectively. The result is listed in Table 3. The significantly better performance of semi-supervised mincuts holds across all parts-of-speech but the small set of adverbs, where there is no significant difference between the baseline, SVM and the Mincut algorithm.

Table 3: Accuracy for different parts-of-speech

Method     Noun    Adjective   Adverb   Verb
Baseline   76.9%   61.1%       77.4%    72.6%
SVM        81.4%   63.4%       83.9%    75.1%
Mincut     88.6%   78.9%       77.4%    84.0%

We will now investigate how LRM-SL performs with different sizes of labeled and unlabeled data. All learning curves are generated via averaging 10 learning curves from 10-fold cross-validation.

Performance with different sizes of labeled data: we randomly generate subsets of labeled data A1, A2, ..., An, and guarantee that A1 ⊂ A2 ⊂ ... ⊂ An. Results for the best SVM (LRM) and the best minimum cut (LRM-SL) are listed in Table 4, and the corresponding learning curve is shown in Figure 2. As can be seen, the semi-supervised Mincut is consistently better than the SVM. Moreover, the semi-supervised Mincut with only 200 labeled data items performs even better than the SVM with 954 training items (78.9% vs 75.3%), showing that our semi-supervised framework allows for a training data reduction of more than 80%.
Table 4: Accuracy with different sizes of labeled data

# labeled data   SVM     Mincuts
100              69.1%   72.2%
200              72.6%   78.9%
400              74.4%   82.7%
600              75.5%   83.7%
800              76.0%   84.1%
900              75.6%   84.8%
954 (all)        75.3%   84.6%

[Figure 2: Learning curve (accuracy in %) for different sizes of labeled data, comparing Mincuts and SVM]

Performance with different sizes of unlabeled data: We propose two different settings. Option 1: Use a subset of the ten relations to generate the unlabeled data (and edges between example vertices). For example, we first use {antonym, similar-to} only to obtain an unlabeled dataset U1, then use a larger subset of the relations like {antonym, similar-to, direct-hyponym, direct-hypernym} to generate another unlabeled dataset U2, and so forth. Obviously, Ui is a subset of Ui+1. Option 2: Use all ten relations to generate the unlabeled data U. We then randomly select subsets of U, such as subsets U1, U2 and U3, and guarantee that U1 ⊂ U2 ⊂ U3 ⊂ ... ⊂ U.

The results are listed in Table 5 and Table 6 respectively. The corresponding learning curves are shown in Figure 3. We see that performance improves with the increase of unlabeled data. In addition, the curves seem to converge when the size of unlabeled data is larger than 3,000. From the results in Table 5 one can also see that hyponymy is the relation accounting for the largest increase.

Table 5: Accuracy with different sizes of unlabeled data from WordNet relations

Relation subset                                                  # unlabeled data   Accuracy
{}                                                               0                  75.3%
{similar-to}                                                     418                79.1%
{similar-to, antonym}                                            514                79.5%
{similar-to, antonym, direct-hypernym, direct-hyponym}           2,721              84.4%
{similar-to, antonym, direct-hypernym, direct-hyponym,
 also-see, extended-antonym}                                     3,004              84.4%
{similar-to, antonym, direct-hypernym, direct-hyponym,
 also-see, extended-antonym, derived-from, attribute,
 domain, domain-member}                                          3,220              84.6%

Table 6: Accuracy with different sizes of unlabeled data (random selection)

# unlabeled data   Accuracy
0                  75.9%
200                76.5%
500                78.6%
1000               80.2%
2000               82.8%
3000               84.0%
3220               84.6%

[Figure 3: Learning curve (accuracy in %) for different sizes of unlabeled data, for Option 1 and Option 2]

Furthermore, these results also show that a supervised mincut without unlabeled data performs only on a par with other supervised classifiers (75.9%). The reason is that if we exclude the unlabeled data, there are only 67 WordNet relations/edges between senses in the small Micro-WNOp dataset. In contrast, the use of unlabeled data adds many more edges (4,586) to the graph, which strongly affects the graph cut partition (see also Figure 1).

4.5 Comparison to Prior Approaches

In our previous work (Su and Markert, 2008), we report 76.9% as the best accuracy on the same Micro-WNOp dataset used in the previous sections, using a supervised Naive Bayes classifier (S&M in Table 2). Our best result from Mincuts is significantly better at 84.6% (see LRM-SL in Table 2).

For comparison to Wiebe and Mihalcea (2006), we use their dataset for testing, henceforth called Wiebe (see Section 4.1 for a description). Wiebe and Mihalcea (2006) report their results in precision and recall curves for subjective senses, such as a precision of about 55% at a recall of 50% for subjective senses. Their F-score for subjective senses seems to remain relatively static at 0.52 throughout their precision/recall curve. We run our best Mincut LRM-SL algorithm with two different settings on Wiebe. Using Micro-WNOp as training set and Wiebe as test set, we achieve an accuracy of 83.2%, which is similar to the results on the Micro-WNOp dataset. At the recall of 50% we achieve a precision of 83.6% (in comparison to their precision of 55% at the same recall). Our F-score is 0.63 (vs. 0.52). To check whether the high performance is just due to our larger training set, we also conduct 10-fold cross-validation on Wiebe. The accuracy achieved is 81.1% and the F-score 0.56 (vs. 0.52), suggesting that our algorithm performs better. Our algorithm can be used on all WordNet senses whereas theirs is restricted to senses that have distributionally similar words in the MPQA corpus (see Section 2). However, they use an unsupervised algorithm, i.e. they do not need labeled word senses, although they do need a large, manually annotated opinion corpus.

5 Conclusion and Future Work

We propose a semi-supervised minimum cut algorithm for subjectivity recognition on word senses. The experimental results show that our proposed approach is significantly better than a standard supervised classification framework as well as a supervised Mincut. Overall, we achieve a 40% reduction in error rates (from an error rate of about 25% to an error rate of 15%). To achieve the results of standard supervised approaches with our model, we need less than 20% of their training data. In addition, we compare our algorithm to previous state-of-the-art approaches, showing that our model performs better on the same datasets.

Future work will explore other graph construction methods, such as the use of morphological relations as well as thesaurus and distributional similarity measures. We will also explore other semi-supervised algorithms.

References

Alekh Agarwal and Pushpak Bhattacharyya. 2005. Sentiment Analysis: A New Approach for Effective Use of Linguistic Knowledge and Exploiting Similarities in a Set of Documents to be Classified. Proceedings of ICON'05.

Alina Andreevskaia and Sabine Bergler. 2006. Mining WordNet for Fuzzy Sentiment: Sentiment Tag Extraction from WordNet Glosses. Proceedings of EACL'06.

Avrim Blum and Shuchi Chawla. 2001. Learning from Labeled and Unlabeled Data using Graph Mincuts. Proceedings of ICML'01.

Thomas Cormen, Charles Leiserson, Ronald Rivest and Clifford Stein. 2002. Introduction to Algorithms. Second Edition, The MIT Press.

Kushal Dave, Steve Lawrence, and David Pennock. 2003. Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews. Proceedings of WWW'03.

Andrea Esuli and Fabrizio Sebastiani. 2006. SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining. Proceedings of LREC'06.

Andrea Esuli and Fabrizio Sebastiani. 2007. PageRanking WordNet Synsets: An Application to Opinion Mining. Proceedings of ACL'07.

Vasileios Hatzivassiloglou and Kathleen McKeown. 1997. Predicting the Semantic Orientation of Adjectives. Proceedings of ACL'97.

Nobuhiro Kaji and Masaru Kitsuregawa. 2007. Building Lexicon for Sentiment Analysis from Massive Collection of HTML Documents. Proceedings of EMNLP'07.

Jaap Kamps, Maarten Marx, Robert Mokken, and Maarten de Rijke. 2004. Using WordNet to Measure Semantic Orientation of Adjectives. Proceedings of LREC'04.

Soo-Min Kim and Eduard Hovy. 2004. Determining the Sentiment of Opinions. Proceedings of COLING'04.

Soo-Min Kim and Eduard Hovy. 2005. Automatic Detection of Opinion Bearing Words and Sentences. Proceedings of IJCNLP'05.

Taku Kudo and Yuji Matsumoto. 2004. A Boosting Algorithm for Classification of Semi-structured Text. Proceedings of EMNLP'04.
Iadh Ounis, Maarten de Rijke, Craig Macdonald, Gilad Mishne and Ian Soboroff. 2006. Overview of the TREC-2006 Blog Track. Proceedings of TREC'06.

Bo Pang and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. Proceedings of ACL'04.

Bo Pang and Lillian Lee. 2008. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval 2(1-2).

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. Proceedings of EMNLP'02.

Livia Polanyi and Annie Zaenen. 2004. Contextual Valence Shifters. Proceedings of the AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications.

Ana-Maria Popescu and Oren Etzioni. 2005. Extracting Product Features and Opinions from Reviews. Proceedings of EMNLP'05.

Ellen Riloff, Janyce Wiebe, and Theresa Wilson. 2003. Learning Subjective Nouns using Extraction Pattern Bootstrapping. Proceedings of CoNLL'03.

Fangzhong Su and Katja Markert. 2008. From Words to Senses: A Case Study in Subjectivity Recognition. Proceedings of COLING'08.

Fangzhong Su and Katja Markert. 2008a. Eliciting Subjectivity and Polarity Judgements on Word Senses. Proceedings of the COLING'08 Workshop on Human Judgements in Computational Linguistics.

Hiroya Takamura, Takashi Inui, and Manabu Okumura. 2005. Extracting Semantic Orientations of Words using Spin Model. Proceedings of ACL'05.

Matt Thomas, Bo Pang and Lillian Lee. 2006. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. Proceedings of EMNLP'06.

Peter Turney. 2002. Thumbs up or Thumbs down? Semantic Orientation Applied to Unsupervised Classification of Reviews. Proceedings of ACL'02.

Peter Turney and Michael Littman. 2003. Measuring Praise and Criticism: Inference of Semantic Orientation from Association. ACM Transactions on Information Systems.

Janyce Wiebe and Rada Mihalcea. 2006. Word Sense and Subjectivity. Proceedings of ACL'06.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating Expressions of Opinions and Emotions in Language. Language Resources and Evaluation.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proceedings of HLT/EMNLP'05.

Hong Yu and Vasileios Hatzivassiloglou. 2003. Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying the Polarity of Opinion Sentences. Proceedings of EMNLP'03.

Integrating Knowledge for Subjectivity Sense Labeling

Yaw Gyamfi and Janyce Wiebe, University of Pittsburgh, {anti,wiebe}@cs.pitt.edu
Rada Mihalcea, University of North Texas, rada@cs.unt.edu
Cem Akkaya, University of Pittsburgh, cem@cs.pitt.edu

Abstract

This paper introduces an integrative approach to automatic word sense subjectivity annotation. We use features that exploit the hierarchical structure and domain information in lexical resources such as WordNet, as well as other types of features that measure the similarity of glosses and the overlap among sets of semantically related words. Integrated in a machine learning framework, the entire set of features is found to give better results than any individual type of feature.

1 Introduction

Automatic extraction of opinions, emotions, and sentiments in text (subjectivity analysis) to support applications such as product review mining, summarization, question answering, and information extraction is an active area of research in NLP.
Many approaches to opinion, sentiment, and subjectivity analysis rely on lexicons of words that may be used to express subjectivity. However, words may have both subjective and objective senses, which is a source of ambiguity in subjectivity and sentiment analysis. We show that even words judged in previous work to be reliable clues of subjectivity have significant degrees of subjectivity sense ambiguity. To address this ambiguity, we present a method for automatically assigning subjectivity labels to word senses in a taxonomy, which uses new features and integrates more diverse types of knowledge than in previous work. We focus on nouns, which are challenging and have received less attention in automatic subjectivity and sentiment analysis.

A common approach to building lexicons for subjectivity analysis is to begin with a small set of seeds which are prototypically subjective (or positive/negative, in sentiment analysis), and then follow semantic links in WordNet-like resources. By far, the emphasis has been on horizontal relations, such as synonymy and antonymy. Exploiting vertical links opens the door to taking into account the information content of ancestor concepts of senses with known and unknown subjectivity. We develop novel features that measure the similarity of a target word sense with a seed set of senses known to be subjective, where the similarity between two concepts is determined by the extent to which they share information, measured by the information content associated with their least common subsumer (LCS). Further, particularizing the LCS features to domain greatly reduces calculation while still maintaining effective features.

We find that our new features do lead to significant improvements over methods proposed in previous work, and that the combination of all features gives significantly better performance than any single type of feature alone.

We also ask, given that there are many approaches to finding subjective words, if it would make sense for word- and sense-level approaches to work in tandem, or should we best view them as competing approaches? We give evidence suggesting that first identifying subjective words and then disambiguating their senses would be an effective approach to subjectivity sense labeling.

There are several motivations for assigning subjectivity labels to senses. First, (Wiebe and Mihalcea, 2006) provide evidence that word sense labels, together with contextual subjectivity analysis, can be exploited to improve performance in word sense disambiguation. Similarly, given subjectivity sense labels, word-sense disambiguation may potentially help contextual subjectivity analysis. In addition, as lexical resources such as WordNet are developed further, subjectivity labels would provide principled criteria for refining word senses, as well as for clustering similar meanings to create more coarse-grained sense inventories.

For many opinion mining applications, polarity (positive, negative) is also important. The overall framework we envision is a layered approach: classifying instances as objective or subjective, and further classifying the subjective instances by polarity. Decomposing the problem into subproblems has been found to be effective for opinion mining. This paper addresses the first of these subproblems.
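As a concrete reading of the LCS idea sketched above, the following minimal example scores a target noun sense by its maximum Resnik similarity to a subjective seed set. NLTK's WordNet interface and its SemCor information-content file are assumptions made for illustration (the paper relies on the WordNet::Similarity package), and the seed and target synsets chosen are arbitrary stand-ins.

```python
# Minimal sketch of the LCS-based subjectivity score: the maximum Resnik
# similarity between a target synset and a seed set of subjective synsets.
# NLTK and the SemCor information-content file are assumptions; the paper
# itself uses the WordNet::Similarity package.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

ic = wordnet_ic.ic("ic-semcor.dat")   # information content estimated over SemCor

def lcs_score(target, seed_synsets):
    """Highest information content shared with any seed, via the least common subsumer."""
    return max(target.res_similarity(seed, ic) for seed in seed_synsets)

# Hypothetical usage with arbitrary noun senses (illustration only):
seeds = [wn.synset("criticism.n.01")]
target = wn.synsets("attack", pos=wn.NOUN)[0]
print(lcs_score(target, seeds))
```

A high score indicates that the target sense shares a highly informative ancestor with the seed set, which is the intuition behind the LCS features developed later in the paper.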
2 Background We adopt the definitions of subjective and objective from Wiebe and Mihalcea (2006) (hereafter WM). Subjective expressions are words and phrases being used to express opinions, emotions, speculations, etc. WM give the following examples: His alarm grew. He absorbed the information quickly. UCC/Disciples leaders roundly condemned the Iranian President's verbal assault on Israel. What's the catch? Polarity (also called semantic orientation) is also important to NLP applications in sentiment analysis and opinion extraction. In review mining, for example, we want to know whether an opinion about a product is positive or negative. Even so, we believe there are strong motivations for a separate subjective/objective (S/O) classification as well. First, expressions may be subjective but not have any particular polarity. An example given by (Wilson et al., 2005) is Jerome says the hospital feels no different than a hospital in the states. An NLP application system may want to find a wide range of private states attributed to a person, such as their motivations, thoughts, and speculations, in addition to their positive and negative sentiments. Second, distinguishing S and O instances has often proven more difficult than subsequent polarity classification. Researchers have found this at various levels of analysis, including the manual annotation of phrases (Takamura et al., 2006), sentiment classification of phrases (Wilson et al., 2005), sentiment tagging of words (Andreevskaia and Bergler, 2006b), and sentiment tagging of word senses (Esuli and Sebastiani, 2006a). Thus, effective methods for S/O classification promise to improve performance for sentiment classification. In fact, researchers in sentiment analysis have realized benefits by decomposing the problem into S/O and polarity classification (Yu and Hatzivassiloglou, 2003; Pang and Lee, 2004; Wilson et al., 2005; Kim and Hovy, 2006). One reason is that different features may be relevant for the two subproblems. For example, negation features are more important for polarity classification than for subjectivity classification. Note that some of our features require vertical links that are present in WordNet for nouns and verbs but not for other parts of speech. Thus we address nouns (leaving verbs to future work). There are other motivations for focusing on nouns. Relatively little work in subjectivity and sentiment analysis has focused on subjective nouns. Also, a study (Bruce and Wiebe, 1999) showed that, of the major parts of speech, nouns are the most ambiguous with respect to the subjectivity of their instances. Turning to word senses, we adopt the definitions from WM. First, subjective: "Classifying a sense as S means that, when the sense is used in a text or conversation, we expect it to express subjectivity; we also expect the phrase or sentence containing it to be subjective [WM, pp. 2-3]." In WM, it is noted that sentences containing objective senses may not be objective, as in the sentence Will someone shut that darn alarm off? Thus, objective senses are defined as follows: "Classifying a sense as O means that, when the sense is used in a text or conversation, we do not expect it to express subjectivity and, if the phrase or sentence containing it is subjective, the subjectivity is due to something else [WM, p 3]." The following subjective examples are given in 11 WM: His alarm grew. 
alarm, dismay, consternation ­ (fear resulting from the awareness of danger) => fear, fearfulness, fright ­ (an emotion experienced in anticipation of some specific pain or danger (usually accompanied by a desire to flee or fight)) What's the catch? catch ­ (a hidden drawback; "it sounds good but what's the catch?") => drawback ­ (the quality of being a hindrance; "he pointed out all the drawbacks to my plan") The following objective examples are given in WM: The alarm went off. alarm, warning device, alarm system ­ (a device that signals the occurrence of some undesirable event) => device ­ (an instrumentality invented for a particular purpose; "the device is small enough to wear on your wrist"; "a device intended to conserve water") He sold his catch at the market. catch, haul ­ (the quantity that was caught; "the catch was only 10 fish") => indefinite quantity ­ (an estimated quantity) WM performed an agreement study and report that good agreement (=0.74) can be achieved between human annotators labeling the subjectivity of senses. For a similar task, (Su and Markert, 2008) also report good agreement. 3 Related Work Many methods have been developed for automatically identifying subjective (opinion, sentiment, attitude, affect-bearing, etc.) words, e.g., (Turney, 2002; Riloff and Wiebe, 2003; Kim and Hovy, 2004; Taboada et al., 2006; Takamura et al., 2006). Five groups have worked on subjectivity sense labeling. WM and Su and Markert (2008) (hereafter SM) assign S/O labels to senses, while Esuli and Sebastiani (hereafter ES) (2006a; 2007), Andreevskaia and Bergler (hereafter AB) (2006b; 2006a), and (Valitutti et al., 2004) assign polarity labels. WM, SM, and ES have evaluated their systems against manually annotated word-sense data. WM's annotations are described above; SM's are similar. In the scheme ES use (Cerini et al., 2007), senses are assigned three scores, for positivity, negativity, and neutrality. There is no unambiguous mapping between the labels of WM/SM and ES, first because WM/SM use distinct classes and ES use numerical ratings, and second because WM/SM distinguish between objective senses on the one hand and neutral subjective senses on the other, while those are both neutral in the scheme used by ES. WM use an unsupervised corpus-based approach, in which subjectivity labels are assigned to word senses based on a set of distributionally similar words in a corpus annotated with subjective expressions. SM explore methods that use existing resources that do not require manually annotated data; they also implement a supervised system for comparison, which we will call SMsup. The other three groups start with positive and negative seed sets and expand them by adding synonyms and antonyms, and traversing horizontal links in WordNet. AB, ES, and SMsup additionally use information contained in glosses; AB also use hyponyms; SMsup also uses relation and POS features. AB perform multiple runs of their system to assign fuzzy categories to senses. ES use a semi-supervised, multiple-classifier learning approach. In a later paper, (Esuli and Sebastiani, 2007), ES again use information in glosses, applying a random walk ranking algorithm to a graph in which synsets are linked if a member of the first synset appears in the gloss of the second. Like ES and SMsup, we use machine learning, but with more diverse sources of knowledge. Further, several of our features are novel for the task. 
The LCS features (Section 6.1) detect subjectivity by measuring the similarity of a candidate word sense with a seed set. WM also use a similarity measure, but as a way to filter the output of a measure of distributional similarity (selecting words for a given word sense), not as we do to cumulatively calculate the subjectivity of a word sense. Another novel aspect of our similarity features is that they are particularized to domain, which greatly reduces calculation. The domain subjectivity LCS features (Section 6.2) are also novel for our task. So is augmenting seed sets with monosemous words, for greater coverage without requiring human intervention or sacrificing quality. Note that none of our features as we specifically define them has been used in previous work; combining them together, our approach outperforms previous approaches. 12 4 Lexicon and Annotations We use the subjectivity lexicon of (Wiebe and Riloff, 2005)1 both to create a subjective seed set and to create the experimental data sets. The lexicon is a list of words and phrases that have subjective uses, though only word entries are used in this paper (i.e., we do not address phrases at this point). Some entries are from manually developed resources, including the General Inquirer, while others were derived from corpora using automatic methods. Through manual review and empirical testing on data, (Wiebe and Riloff, 2005) divided the clues into strong (strongsubj) and weak (weaksubj) subjectivity clues. Strongsubj clues have subjective meanings with high probability, and weaksubj clues have subjective meanings with lower probability. To support our experiments, we annotated the senses2 of polysemous nouns selected from the lexicon, using WM's annotation scheme described in Section 2. Due to time constraints, only some of the data was labeled through consensus labeling by two annotators; the rest was labeled by one annotator. Overall, 2875 senses for 882 words were annotated. Even though all are senses of words from the subjectivity lexicon, only 1383 (48%) of the senses are subjective. The words labeled strongsubj are in fact less ambiguous than those labeled weaksubj in our analysis, thus supporting the reliability classifications in the lexicon. 55% (1038/1924) of the senses of strongsubj words are subjective, while only 36% (345/951) of the senses of weaksubj words are subjective. For the analysis in Section 7.3, we form subsets of the data annotated here to test performance of our method on different data compositions. then expand the set with their hyponyms, as they were found useful in previous work by AB (2006b; 2006a). This yields a subjective seed set of 645 senses. After removing the word senses that belong to the same synset, so that only one word sense per synset is left, we ended up with 603 senses. To create the objective seed set, two annotators manually annotated 800 random senses from WordNet, and selected for the objective seed set the ones they both agreed are clearly objective. This creates an objective seed set of 727. Again we removed multiple senses from the same synset leaving us with 722. The other 73 senses they annotated are added to the mixed data set described below. As this sampling shows, WordNet nouns are highly skewed toward objective senses, so finding an objective seed set is not difficult. 6 6.1 Features Sense Subjectivity LCS Feature 5 Seed Sets Both subjective and objective seed sets are used to define the features described below. 
For seeds, a large number is desirable for greater coverage, although high quality is also important. We begin to build our subjective seed set by adding the monosemous strongsubj nouns of the subjectivity lexicon (there are 397 of these). Since they are monosemous, they pose no problem of sense ambiguity. We 1 2 Available at http://www.cs.pitt.edu/mpqa In WordNet 2.0 This feature measures the similarity of a target sense with members of the subjective seed set. Here, similarity between two senses is determined by the extent to which they share information, measured by using the information content associated with their least common subsumer. For an intuition behind this feature, consider this example. In WordNet, the hypernym of the "strong criticism" sense of attack is criticism. Several other negative subjective senses are descendants of criticism, including the relevant senses of fire, thrust, and rebuke. Going up one more level, the hypernym of criticism is the "expression of disapproval" meaning of disapproval, which has several additional negative subjective descendants, such as the "expression of opposition and disapproval" sense of discouragement. Our hypothesis is that the cases where subjectivity is preserved in the hypernym structure, or where hypernyms do lead from subjective senses to others, are the ones that have the highest least common subsumer score with the seed set of known subjective senses. We calculate similarity using the informationcontent based measure proposed in (Resnik, 1995), as implemented in the WordNet::Similarity package (using the default option in which LCS values are computed over the SemCor corpus).3 Given a 3 http://search.cpan.org/dist/WordNet-Similarity/ 13 taxonomy such as WordNet, the information content associated with a concept is determined as the likelihood of encountering that concept, defined as -log(p(C)), where p(C) is the probability of seeing concept C in a corpus. The similarity between two concepts is then defined in terms of information content as: LCSs (C1 , C2 ) = max[-log(p(C))], where C is the concept that subsumes both C1 and C2 and has the highest information content (i.e., it is the least common subsumer (LCS)). For this feature, a score is assigned to a target sense based on its semantic similarity to the members of a seed set; in particular, the maximum such similarity is used. For a target sense t and a seed set S, we could have used the following score: Score(t, S) = max LCSs (t, s) sS cluded only a subjective feature to put more emphasis on the subjective senses. In the future, features could be defined with respect to objectivity, as well as polarity and other properties of subjectivity. 6.2 Domain Subjectivity LCS Score We also include a feature reflecting the subjectivity of the domain of the target sense. Domains are assigned scores as follows. For domain D and seed set S: DomainLCSscore(D, S) = avedDS M emLCSscore(d, D, S) where: M emLCSscore(d, D, S) = max LCSs (d, di ) However, several researchers have noted that subjectivity may be domain specific. A version of WordNet exists, WordNet Domains (Gliozzo et al., 2005), which associates each synset with one of the domains in the Dewey Decimal library classification. After sorting our subjective seed set into different domains, we observed that over 80% of the subjective seed senses are concentrated in six domains (the rest are distributed among 35 domains). 
Thus, we decided to particularize the semantic similarity feature to domain, such that only the subset of the seed set in the same domain as the target sense is used to compute the feature. This involves much less calculation, as LCS values are calculated only with respect to a subset of the seed set. We hypothesized that this would still be an effective feature, while being more efficient to calculate. This will be important when this method is applied to large resources such as the entire WordNet. Thus, for seed set S and target sense t which is in domain D, the feature is defined as the following score: SenseLCSscore(t, D, S) = max LCSs (t, d) dDS di DS,di =d The value of this feature for a sense is the score assigned to that sense's domain. 6.3 Common Related Senses This feature is based on the intersection between the set of senses related (via WordNet relations) to the target sense and the set of senses related to members of a seed set. First, for the target sense and each member of the seed set, a set of related senses is formed consisting of its synonyms, antonyms and direct hypernyms as defined by WordNet. For a sense s, R(s) is s together with its related senses. Then, given a target sense t and a seed set S we compute an average percentage overlap as follows: RelOverlap(t, S) = si S |R(t)R(si )| max (|R(t)|,|R(si )|) The value of a feature is its score. Two features are included in the experiments below, one for each of the subjective and objective seed sets. 6.4 Gloss-based features |S| The seed set is a parameter, so we could have defined a feature reflecting similarity to the objective seed set as well. Since WordNet is already highly skewed toward objective noun senses, any naive classifier need only guess the majority class for high accuracy for the objective senses. We in- These features are Lesk-style features (Lesk, 1986) that exploit overlaps between glosses of target and seed senses. We include two types in our work. 6.4.1 Average Percentage Gloss Overlap Features For a sense s, gloss(s) is the set of stems in the gloss of s (excluding stop words). Then, given a tar- 14 get sense t and a seed set S, we compute an average percentage overlap as follows: GlOverlap(t, S) = si S gloss(t)rR(s ) gloss(r) i max (|gloss(t)|,|rR(s ) gloss(r)|) i | | |S| As above, R(s) is considered for each seed sense s, but now only the target sense t is considered, not R(t). We did this because we hypothesized that the gloss can provide sufficient context for a given target sense, so that the addition of related words is not necessary. We include two features, one for each of the subjective and objective seed sets. 6.4.2 Vector Gloss Overlap Features For this feature we also consider overlaps of stems in glosses (excluding stop words). The overlaps considered are between the gloss of the target sense t and the glosses of R(s) for all s in a seed set (for convenience, we will refer to these as seedRelationSets). A vector of stems is created, one for each stem (excluding stop words) that appears in a gloss of a member of seedRelationSets. If a stem in the gloss of the target sense appears in this vector, then the vector entry for that stem is the total count of that stem in the glosses of the target sense and all members of seedRelationSets. A feature is created for each vector entry whose value is the count at that position. Thus, these features consider counts of individual stems, rather than average proportions of overlaps, as for the previous type of gloss feature. 
Two vectors of features are used, one where the seed set is the subjective seed set, and one where it is the objective seed set. 6.5 Summary Features Acc P R F All 77.3 72.8 74.3 73.5 Standalone Ablation Results All 77.3 72.8 74.3 73.5 LCS 68.2 69.3 44.2 54.0 Gloss vector 74.3 71.2 68.5 69.8 69.4 75.8 40.6 52.9 Overlaps Leave-One-Out Ablation Results All 77.3 72.8 74.3 73.5 LCS 75.2 70.9 70.6 70.7 Gloss vector 75.0 74.4 61.8 67.5 Overlaps 74.8 71.9 73.8 72.8 Table 1: Results for the mixed corpus (2354 senses, 57.82% O)) 7. Vector of gloss words (SS) 8. Vector of gloss words (OS) 7 Experiments In summary, we use the following features (here, SS is the subjective seed set and OS is the objective one). 1. 2. 3. 4. 5. 6. SenseLCSscore(t, D, SS) DomainLCSscore(D, SS) RelOverlap(t, SS) RelOverlap(t, OS) GlOverlap(t, SS) GlOverlap(t, OS) We perform 10-fold cross validation experiments on several data sets, using SVM light (Joachims, 1999)4 under its default settings. Based on our random sampling of WordNet, it appears that WordNet nouns are highly skewed toward objective senses. (Esuli and Sebastiani, 2007) argue that random sampling from WordNet would yield a corpus mostly consisting of objective (neutral) senses, which would be "pretty useless as a benchmark for testing derived lexical resources for opinion mining [p. 428]." So, they use a mixture of subjective and objective senses in their data set. To create a mixed corpus for our task, we annotated a second random sample from WordNet (which is as skewed as the previously mentioned one). We added together all of the senses of words in the lexicon which we annotated, the leftover senses from the selection of objective seed senses, and this new sample. We removed duplicates, multiple senses from the same synset, and any senses belonging to the same synset in either of the seed sets. This resulted in a corpus of 2354 senses, 993 (42.18%) of which are subjective and 1361 (57.82%) of which are objective. The results with all of our features on this mixed corpus are given in Row 1 of Table 1. In Table 1, the 4 http://svmlight.joachims.org/ 15 first column identifies the features, which in this case is all of them. The next three columns show overall accuracy, and precision and recall for finding subjective senses. The baseline accuracy for the mixed data set (guessing the more frequent class, which is objective) is 57.82%. As the table shows, the accuracy is substantially above baseline.5 7.1 Analysis and Discussion Data (#senses) mixed (2354 57.8% O) strong+weak (1132) weaksubj (566) strongsubj (566) Acc 77.3 77.7 71.3 78.6 P 72.8 76.8 70.3 78.8 R 74.3 78.9 71.1 78.6 F 73.5 77.8 70.7 78.7 Table 2: Results for different data sets (all are 50% S, unless otherwise notes) In this section, we seek to gain insights by performing ablation studies, evaluating our method on different data compositions, and comparing our results to previous results. 7.2 Ablation Studies These results provide evidence that LCS and Gloss vector are better together than either of them alone. 7.3 Results on Different Data Sets Since there are several features, we divided them into sets for the ablation studies. The vector-ofgloss-words features are the most similar to ones used in previous work. Thus, we opted to treat them as one ablation group (Gloss vector). The Overlaps group includes the RelOverlap(t, SS), RelOverlap(t, OS), GlOverlap(t, SS), and GlOverlap(t, OS) features. Finally, the LCS group includes the SenseLCSscore and the DomainLCSscore features. 
There are two types of ablation studies. In the first, one group of features at a time is included. Those results are in the middle section of Table 1. Thus, for example, the row labeled LCS in this section is for an experiment using only the LCS features. In comparison to performance when all features are used, F-measure for the Overlaps and LCS ablations is significantly different at the p < .01 level, and, for the Gloss vector ablation, it is significantly different at the p = .052 level (one-tailed t-test). Thus, all of the features together have better performance than any single type of feature alone. In the second type of ablation study, we use all the features minus one group of features at a time. The results are in the bottom section of Table 1. Thus, for example, the row labeled LCS in this section is for an experiment using all but the LCS features. F-measures for LCS and Gloss vector are significantly different at the p = .056 and p = .014 levels, respectively. However, F-measure for the Overlaps ablation is not significantly different (p = .39). These results provide evidence that LCS and Gloss vector are better together than either of them alone.
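A minimal sketch of the significance comparison behind these p-values is given below. The paper reports one-tailed t-tests; pairing the folds and the specific per-fold F-measures used here are assumptions for illustration.

```python
# Sketch of a one-tailed significance test over per-fold F-measures (Section 7.2):
# compare the full feature set against an ablated configuration.
from scipy import stats

# Hypothetical per-fold F-measures from the 10-fold cross validation (made-up numbers).
f_all     = [0.74, 0.72, 0.75, 0.73, 0.74, 0.71, 0.76, 0.73, 0.72, 0.75]
f_ablated = [0.70, 0.69, 0.72, 0.70, 0.71, 0.68, 0.73, 0.70, 0.69, 0.71]

# Paired, one-tailed test of "full set > ablated set".
# The `alternative` argument requires SciPy >= 1.6.
t, p = stats.ttest_rel(f_all, f_ablated, alternative="greater")
print(f"t = {t:.3f}, one-tailed p = {p:.4f}")
```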
7.3 Results on Different Data Sets

Several methods have been developed for identifying subjective words. Perhaps an effective strategy would be to begin with a word-level subjectivity lexicon, and then perform subjectivity sense labeling to sort the subjective from objective senses of those words. We also wondered about the relative effectiveness of our method on strongsubj versus weaksubj clues. To answer these questions, we apply the full model (again in 10-fold cross validation experiments) to data sets composed of senses of polysemous words in the subjectivity lexicon. To support comparison, all of the data sets in this section have a 50%-50% objective/subjective distribution. (As with the mixed data set, we removed from these data sets multiple senses from the same synset and any senses in the same synset in either of the seed sets.)

The results are presented in Table 2. For comparison, the first row repeats the results for the mixed corpus from Table 1. The second row shows results for a corpus of senses of a mixture of strongsubj and weaksubj words. The corpus was created by selecting a mixture of strongsubj and weaksubj words, extracting their senses and the S/O labels applied to them in Section 4, and then randomly removing senses of the more frequent class until the distribution is uniform. We see that the results on this corpus are better than on the mixed data set, even though the baseline accuracy is lower and the corpus is smaller. This supports the idea that an effective strategy would be to first identify opinion-bearing words, and then apply our method to those words to sort out their subjective and objective senses. The third row shows results for a weaksubj subset of the strong+weak corpus, and the fourth shows results for a strongsubj subset that is of the same size. As expected, the results for the weaksubj senses are lower while those for the strongsubj senses are higher, as weaksubj clues are more ambiguous.

Table 2: Results for different data sets (all are 50% S, unless otherwise noted)

Data (#senses)             Acc    P     R     F
mixed (2354, 57.8% O)      77.3   72.8  74.3  73.5
strong+weak (1132)         77.7   76.8  78.9  77.8
weaksubj (566)             71.3   70.3  71.1  70.7
strongsubj (566)           78.6   78.8  78.6  78.7

7.4 Comparisons with Previous Work

WM and SM address the same task as we do. To compare our results to theirs, we apply our full model (in 10-fold cross validation experiments) to their data sets. (The WM data set is available at http://www.cs.pitt.edu/~wiebe. ES applied their method in (2006b) to WordNet, and made the results available as SentiWordNet at http://sentiwordnet.isti.cnr.it/.)

Table 3 has the WM data set results. WM rank their senses and present their results in the form of precision recall curves. The second row of Table 3 shows their results at the recall level achieved by our method (66%). Their precision at that level is substantially below ours. Turning to ES, to create S/O annotations, we applied the following heuristic mapping (which is also used by SM for the purpose of comparison): any sense for which the sum of positive and negative scores is greater than or equal to 0.5 is S, otherwise it is O. We then evaluate the mapped tags against the gold standard of WM. The results are in Row 3 of Table 3. Note that this mapping is not fair to SentiWordNet, as the tasks are quite different, and we do not believe any conclusions can be drawn. We include the results to eliminate the possibility that their method is as good as ours on our task, despite the differences between the tasks.

Table 3: Results for WM Corpus (212 senses, 76% O)

Method                    P     R     F
Our method                56.8  66.0  61.1
WM, 60% recall            44.0  66.0  52.8
SentiWordNet mapping      60.0  17.3  26.8

Table 4 has the results for the noun subset of SM's data set, which is the data set used by ES, reannotated by SM. CV* is their supervised system and SL* is their best non-supervised one. Our method has higher F-measure than the others. Note that the focus of SM's work is not supervised machine learning. (We performed the same type of evaluation as in SM's paper. That is, we assign a subjectivity label to one word sense for each synset, which is the same as applying a subjectivity label to a synset as a whole as done by SM.)

Table 4: Results for SM Corpus (484 senses, 76.9% O)

Method       A      P      R      F
Our Method   81.3%  60.3%  63.3%  61.8%
SM CV*       82.4%  70.8%  41.1%  52.0%
SM SL*       78.3%  53.0%  57.4%  54.9%

8 Conclusions

In this paper, we introduced an integrative approach to automatic subjectivity word sense labeling which combines features exploiting the hierarchical structure and domain information of WordNet, as well as similarity of glosses and overlap among sets of semantically related words. There are several contributions. First, we learn several things. We found (in Section 4) that even reliable lists of subjective (opinion-bearing) words have many objective senses. We asked if word- and sense-level approaches could be used effectively in tandem, and found (in Section 7.3) that an effective strategy is to first identify opinion-bearing words, and then apply our method to sort out their subjective and objective senses. We also found (in Section 7.2) that the entire set of features gives better results than any individual type of feature alone. Second, several of the features are novel for our task, including those exploiting the hierarchical structure of a lexical resource, domain information, and relations to seed sets expanded with monosemous senses. Finally, the combination of our particular features is effective. For example, on senses of words from a subjectivity lexicon, accuracies range from 20 to 29 percentage points above baseline. Further, our combination of features outperforms previous approaches.

Acknowledgments

This work was supported in part by National Science Foundation awards #0840632 and #0840608. The authors are grateful to Fangzhong Su and Katja Markert for making their data set available, and to the three paper reviewers for their helpful suggestions.

References

Alina Andreevskaia and Sabine Bergler. 2006a. Mining wordnet for a fuzzy sentiment: Sentiment tag extraction from wordnet glosses.
In Proceedings of the 11rd Conference of the European Chapter of the Association for Computational Linguistics. Alina Andreevskaia and Sabine Bergler. 2006b. Sentiment tag extraction from wordnet glosses. In Proceedings of 5th International Conference on Language Resources and Evaluation. Rebecca Bruce and Janyce Wiebe. 1999. Recognizing subjectivity: A case study of manual tagging. Natural Language Engineering, 5(2):187­205. S. Cerini, V. Campagnoni, A. Demontis, M. Formentelli, and C. Gandini. 2007. Micro-wnop: A gold standard for the evaluation of automatically compiled lexical resources for opinion mining. In Language resources and linguistic theory: Typology, second language acquisition, English linguistics. Milano. Andrea Esuli and Fabrizio Sebastiani. 2006a. Determining term subjectivity and term orientation for opinion mining. In 11th Meeting of the European Chapter of the Association for Computational Linguistics. Andrea Esuli and Fabrizio Sebastiani. 2006b. SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of the 5th Conference on Language Resources and Evaluation, Genova, IT. Andrea Esuli and Fabrizio Sebastiani. 2007. PageRanking wordnet synsets: An application to opinion mining. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 424­ 431, Prague, Czech Republic, June. A. Gliozzo, C. Strapparava, E. d'Avanzo, and B. Magnini. 2005. Automatic acquisition of domain specific lexicons. Tech. report, IRST, Italy. T. Joachims. 1999. Making large-scale SVM learning practical. In B. Scholkopf, C. Burgess, and A. Smola, editors, Advances in Kernel Methods ­ Support Vector Learning, Cambridge, MA. MIT-Press. Soo-Min Kim and Eduard Hovy. 2004. Determining the sentiment of opinions. In Proceedings of the Twentieth International Conference on Computational Linguistics, pages 1267­1373, Geneva, Switzerland. Soo-Min Kim and Eduard Hovy. 2006. Identifying and analyzing judgment opinions. In Proceedings of Empirical Methods in Natural Language Processing, pages 200­207, New York. M.E. Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the SIGDOC Conference 1986, Toronto, June. Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the Annual Meeting of the Association for Computational Linguistics , pages 271­278, Barcelona, ES. Association for Computational Linguistics. Philip Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proc. International Joint Conference on Artificial Intelligence. E. Riloff and J. Wiebe. 2003. Learning extraction patterns for subjective expressions. In Conference on Empirical Methods in Natural Language Processing, pages 105­112. Fangzhong Su and Katja Markert. 2008. From word to sense: a case study of subjectivity recognition. In Proceedings of the 22nd International Conference on Computational Linguistics, Manchester. M. Taboada, C. Anthony, and K. Voll. 2006. Methods for creating semantic orientation databases. In Proceedings of 5th International Conference on Language Resources and Evaluation . Hiroya Takamura, Takashi Inui, and Manabu Okumura. 2006. Latent variable models for semantic orientations of phrases. In Proceedings of the 11th Meeting of the European Chapter of the Association for Computational Linguistics , Trento, Italy. P. Turney. 2002. 
Thumbs up or thumbs down? semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 417­424, Philadelphia. Alessandro Valitutti, Carlo Strapparava, and Oliviero Stock. 2004. Developing affective lexical resources. PsychNology Journal, 2(1):61­83. J. Wiebe and R. Mihalcea. 2006. Word sense and subjectivity. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Sydney, Australia. Janyce Wiebe and Ellen Riloff. 2005. Creating subjective and objective sentence classifiers from unannotated texts. In Proceedings of the 6th International Conference on Intelligent Text Processing and Computational Linguistics , pages 486­497, Mexico City, Mexico. Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Human Language Technologies Conference/Conference on Empirical Methods in Natural Language Processing , pages 347­354, Vancouver, Canada. Hong Yu and Vasileios Hatzivassiloglou. 2003. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In Conference on Empirical Methods in Natural Language Processing , pages 129­136, Sapporo, Japan. 18 A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches Eneko Agirre Enrique Alfonseca Keith Hall Jana Kravalova§ Marius Pasca Aitor Soroa ¸ IXA NLP Group, University of the Basque Country Google Inc. § Institute of Formal and Applied Linguistics, Charles University in Prague {e.agirre,a.soroa}@ehu.es {ealfonseca,kbhall,mars}@google.com kravalova@ufal.mff.cuni.cz Abstract This paper presents and compares WordNetbased and distributional similarity approaches. The strengths and weaknesses of each approach regarding similarity and relatedness tasks are discussed, and a combination is presented. Each of our methods independently provide the best results in their class on the RG and WordSim353 datasets, and a supervised combination of them yields the best published results on all datasets. Finally, we pioneer cross-lingual similarity, showing that our methods are easily adapted for a cross-lingual task with minor losses. 1 Introduction Measuring semantic similarity and relatedness between terms is an important problem in lexical semantics. It has applications in many natural language processing tasks, such as Textual Entailment, Word Sense Disambiguation or Information Extraction, and other related areas like Information Retrieval. The techniques used to solve this problem can be roughly classified into two main categories: those relying on pre-existing knowledge resources (thesauri, semantic networks, taxonomies or encyclopedias) (Alvarez and Lim, 2007; Yang and Powers, 2005; Hughes and Ramage, 2007) and those inducing distributional properties of words from corpora (Sahami and Heilman, 2006; Chen et al., 2006; Bollegala et al., 2007). In this paper, we explore both families. For the first one we apply graph based algorithms to WordNet, and for the second we induce distributional similarities collected from a 1.6 Terabyte Web corpus. Previous work suggests that distributional similarities suffer from certain limitations, which make 19 them less useful than knowledge resources for semantic similarity. 
For example, Lin (1998b) finds similar phrases like captive-westerner which made sense only in the context of the corpus used, and Budanitsky and Hirst (2006) highlight other problems that stem from the imbalance and sparseness of the corpora. Comparatively, the experiments in this paper demonstrate that distributional similarities can perform as well as the knowledge-based approaches, and a combination of the two can exceed the performance of results previously reported on the same datasets. An application to cross-lingual (CL) similarity identification is also described, with applications such as CL Information Retrieval or CL sponsored search. A discussion on the differences between learning similarity and relatedness scores is provided.

The paper is structured as follows. We first present the WordNet-based method, followed by the distributional methods. Section 4 is devoted to the evaluation and results on the monolingual and cross-lingual tasks. Section 5 presents some analysis, including learning curves for distributional methods, the use of distributional similarity to improve WordNet similarity, the contrast between similarity and relatedness, and the combination of methods. Section 6 presents related work, and finally, Section 7 draws the conclusions and mentions future work.

2 WordNet-based method

WordNet (Fellbaum, 1998) is a lexical database of English, which groups nouns, verbs, adjectives and adverbs into sets of synonyms (synsets), each expressing a distinct concept. Synsets are interlinked with conceptual-semantic and lexical relations, including hypernymy, meronymy, causality, etc. Given a pair of words and a graph-based representation of WordNet, our method has basically two steps: we first compute the personalized PageRank over WordNet separately for each of the words, producing a probability distribution over WordNet synsets; we then compare how similar these two discrete probability distributions are by encoding them as vectors and computing the cosine between the vectors.

We represent WordNet as a graph G = (V, E) as follows: graph nodes represent WordNet concepts (synsets) and dictionary words; relations among synsets are represented by undirected edges; and dictionary words are linked to the synsets associated to them by directed edges. For each word in the pair we first compute a personalized PageRank vector of graph G (Haveliwala, 2002). Basically, personalized PageRank is computed by modifying the random jump distribution vector in the traditional PageRank equation; in our case, we concentrate all probability mass in the target word. Regarding PageRank implementation details, we chose a damping value of 0.85 and finish the calculation after 30 iterations. These are default values, and we did not optimize them. Our similarity method is similar to, but simpler than, the one used by Hughes and Ramage (2007), who report very good results on similarity datasets. More details of our algorithm can be found in Agirre and Soroa (2009). The algorithm and needed resources are publicly available at http://ixa2.si.ehu.es/ukb/.
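The two steps can be seen compactly in the sketch below: a power-iteration personalized PageRank with all restart mass on each target word, followed by the cosine between the resulting distributions. The tiny hand-built graph and its node names are illustrative stand-ins for the real WordNet graph and the UKB implementation.

```python
# Sketch of the two-step similarity of Section 2: personalized PageRank per word,
# then cosine between the two distributions, over a toy word/synset graph.
import numpy as np

graph = {                                   # toy adjacency lists (hypothetical nodes)
    "car":        ["car.n.01"],
    "automobile": ["car.n.01"],
    "boat":       ["boat.n.01"],
    "car.n.01":   ["motor_vehicle.n.01"],
    "boat.n.01":  ["vehicle.n.01"],
    "motor_vehicle.n.01": ["car.n.01", "vehicle.n.01"],
    "vehicle.n.01":       ["motor_vehicle.n.01", "boat.n.01"],
}
nodes = sorted(graph)
index = {n: i for i, n in enumerate(nodes)}

M = np.zeros((len(nodes), len(nodes)))      # column-stochastic transition matrix
for n, neigh in graph.items():
    for m in neigh:
        M[index[m], index[n]] = 1.0 / len(neigh)

def personalized_pagerank(word, damping=0.85, iters=30):
    """All restart mass on `word`; 30 iterations and damping 0.85, as in the paper."""
    v = np.zeros(len(nodes)); v[index[word]] = 1.0
    pr = np.full(len(nodes), 1.0 / len(nodes))
    for _ in range(iters):
        pr = (1 - damping) * v + damping * (M @ pr)
    return pr

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(personalized_pagerank("car"), personalized_pagerank("automobile")))
print(cosine(personalized_pagerank("car"), personalized_pagerank("boat")))
```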
2.1 WordNet relations and versions

The WordNet versions that we use in this work are the Multilingual Central Repository or MCR (Atserias et al., 2004), which includes English WordNet version 1.6 and wordnets for several other languages like Spanish, Italian, Catalan and Basque, and WordNet version 3.0 (available from http://wordnet.princeton.edu/). We used all the relations in MCR (except cooccurrence relations and selectional preference relations) and in WordNet 3.0. Given the recent availability of the disambiguated gloss relations for WordNet 3.0 (http://wordnet.princeton.edu/glosstag), we also used a version which incorporates these relations. We will refer to the three versions as MCR16, WN30 and WN30g, respectively. Our choice was mainly motivated by the fact that MCR contains tightly aligned wordnets of several languages (see below).

2.2 Cross-linguality

MCR follows the EuroWordNet design (Vossen, 1998), which specifies an InterLingual Index (ILI) that links the concepts across wordnets of different languages. The wordnets for other languages in MCR use the English WordNet synset numbers as ILIs. This design allows a decoupling of the relations between concepts (which can be taken to be language independent) and the links from each content word to its corresponding concepts (which is language dependent). As our WordNet-based method uses the graph of the concepts and relations, we can easily compute the similarity between words from different languages. For example, consider an English-Spanish pair like car – coche. Given that the Spanish WordNet is included in MCR, we can use MCR as the common knowledge-base for the relations. We can then compute the personalized PageRank for each of car and coche on the same underlying graph, and then compare the similarity between both probability distributions. As an alternative, we also tried to use publicly available mappings for wordnets (Daude et al., 2000; available at http://www.lsi.upc.es/nlp/tools/download-map.php) in order to create a 3.0 version of the Spanish WordNet. The mapping was used to link Spanish variants to 3.0 synsets. We used the English WordNet 3.0, including glosses, to construct the graph. The two Spanish WordNet versions are referred to as MCR16 and WN30g.

3 Context-based methods

In this section, we describe the distributional methods used for calculating similarities between words, profiting from the use of a large Web-based corpus. This work is motivated by previous studies that make use of search engines in order to collect cooccurrence statistics between words. Turney (2001) uses the number of hits returned by a Web search engine to calculate the Pointwise Mutual Information (PMI) between terms, as an indicator of synonymy. Bollegala et al. (2007) calculate a number of popular relatedness metrics based on page counts, like PMI, the Jaccard coefficient, the Simpson coefficient and the Dice coefficient, which are combined with lexico-syntactic patterns as model features. The model parameters are trained using Support Vector Machines (SVM) in order to later rank pairs of words. A different approach is the one taken by Sahami and Heilman (2006), who collect snippets from the results of a search engine and represent each snippet as a vector, weighted with the tf·idf score. The semantic similarity between two queries is calculated as the inner product between the centroids of the respective sets of vectors. To calculate the similarity of two words w1 and w2, Ruiz-Casado et al.
(2005) collect snippets containing w1 from a Web search engine, extract a context around it, replace it with w2 and check for the existence of that modified context in the Web.

Using a search engine to calculate similarities between words has the drawback that the data used will always be truncated. So, for example, the numbers of hits returned by search engines nowadays are always approximate and rounded up. The systems that rely on collecting snippets are also limited by the maximum number of documents returned per query, typically around a thousand. We hypothesize that by crawling a large corpus from the Web and doing standard corpus analysis to collect precise statistics for the terms we should improve over other unsupervised systems that are based on search engine results, and should yield results that are competitive even when compared to knowledge-based approaches.

In order to calculate the semantic similarity between the words in a set, we have used a vector space model, with the following three variations.

In the bag-of-words approach, for each word w in the dataset we collect every term t that appears in a window centered in w, and add it to the vector together with its frequency.

In the context window approach, for each word w in the dataset we collect every window W centered in w (removing the central word), and add it to the vector together with its frequency (the total number of times we saw window W around w in the whole corpus). In this case, all punctuation symbols are replaced with a special token, to unify patterns that differ only in the punctuation symbol (e.g., ", the said to" and "' the said to"). Throughout the paper, when we mention a context window of size N it means N words at each side of the phrase of interest.

In the syntactic dependency approach, we parse the entire corpus using an implementation of an Inductive Dependency parser as described in Nivre (2006). For each word w we collect a template of the syntactic context. We consider sequences of governing words (e.g. the parent, grand-parent, etc.) as well as collections of descendants (e.g., immediate children, grandchildren, etc.). This information is then encoded as a contextual template. For example, the context template "cooks delicious *" could be a context for nouns such as food, meals, pasta, etc. This captures both syntactic preferences as well as selectional preferences. Contrary to Pado and Lapata (2007), we do not use the labels of the syntactic dependencies.

Once the vectors have been obtained, the frequency for each dimension in every vector is weighted using the other vectors as a contrast set, with the χ² test, and finally the cosine similarity between vectors is used to calculate the similarity between each pair of terms. Except for the syntactic dependency approach, where closed-class words are needed by the parser, in the other cases we have removed stopwords (pronouns, prepositions, determiners and modal and auxiliary verbs).

3.1 Corpus used

We have used a corpus of four billion documents, crawled from the Web in August 2008. An HTML parser is used to extract text, the language of each document is identified, and non-English documents are discarded. The final corpus remaining at the end of this process contains roughly 1.6 Terawords. All calculations are done in parallel, sharding by dimension, and it is possible to calculate all pairwise similarities of the words in the test sets very quickly on this corpus using the MapReduce infrastructure. A complete run takes around 15 minutes on 2,000 cores.
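The two simpler representations are easy to illustrate on a toy corpus, as in the sketch below: bag-of-words contexts versus ordered context windows, compared with the cosine. The toy sentences and stop-word list are made up, and the χ² reweighting against the other vectors is deliberately omitted to keep the sketch short.

```python
# Sketch of the bag-of-words and context-window vectors of Section 3 on a toy corpus.
from collections import Counter
from math import sqrt

corpus = [
    "the red car sped down the road".split(),
    "a fast automobile sped down the highway".split(),
    "the boat sailed across the bay".split(),
]
STOP = {"the", "a", "down", "across"}

def bag_of_words(word, window=4):
    vec = Counter()
    for sent in corpus:
        for i, tok in enumerate(sent):
            if tok == word:
                ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
                vec.update(t for t in ctx if t not in STOP)
    return vec

def context_windows(word, window=2):
    vec = Counter()
    for sent in corpus:
        for i, tok in enumerate(sent):
            if tok == word:
                left = sent[max(0, i - window):i]
                right = sent[i + 1:i + 1 + window]
                vec[" ".join(left) + " * " + " ".join(right)] += 1  # order preserved
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    return dot / (sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values())) or 1.0)

print(cosine(bag_of_words("car"), bag_of_words("automobile")))        # topical overlap
print(cosine(context_windows("car"), context_windows("automobile")))  # stricter, exact windows
```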
3.2 Cross-linguality

In order to calculate similarities in a cross-lingual setting, where some of the words are in a language l other than English, the following algorithm is used:

1. Replace each non-English word in the dataset with its 5-best translations into English using state-of-the-art machine translation technology.

2. The vector corresponding to each Spanish word is calculated by collecting features from all the contexts of any of its translations.

3. Once the vectors are generated, the similarities are calculated in the same way as before.
4 Experimental results

4.1 Gold-standard datasets

We have used two standard datasets. The first one, RG, consists of 65 pairs of words collected by Rubenstein and Goodenough (1965), who had them judged by 51 human subjects in a scale from 0.0 to 4.0 according to their similarity, but ignoring any other possible semantic relationships that might appear between the terms. The second dataset, WordSim353 (Finkelstein et al., 2002; available at http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.html), contains 353 word pairs, each associated with an average of 13 to 16 human judgements. In this case, both similarity and relatedness are annotated without any distinction. Several studies indicate that the human scores consistently have very high correlations with each other (Miller and Charles, 1991; Resnik, 1995), thus validating the use of these datasets for evaluating semantic similarity. For the cross-lingual evaluation, the two datasets were modified by translating the second word in each pair into Spanish. Two humans translated simultaneously both datasets, with an inter-tagger agreement of 72% for RG and 84% for WordSim353.

4.2 Results

Table 1 shows the Spearman correlation obtained on the RG and WordSim353 datasets, including the interval at 0.95 of confidence. (To calculate the Spearman correlations, values are transformed into ranks, and we calculate the Pearson correlation on them; the confidence intervals refer to the Pearson correlations of the rank vectors.) Overall the distributional context-window approach performs best in the RG, reaching 0.89 correlation, and both WN30g and the combination of context windows and syntactic context perform best on WordSim353. Note that the confidence intervals are quite large in both RG and WordSim353, and few of the pairwise differences are statistically significant.

Table 1: Spearman correlation results for the various WordNet-based models and distributional models. CW=Context Windows, BoW=bag of words, Syn=syntactic vectors. For Syn, the window size is actually the tree-depth for the governors and descendants. For example, G1 indicates that the contexts include the parents and D2 indicates that both the children and grandchildren make up the contexts. The final grouping includes both contextual windows (at width 4) and syntactic contexts in the template vectors. Max scores are bolded.

Method     Window size   RG dataset          WordSim353 dataset
MCR16                    0.83 [0.73, 0.89]   0.53 (0.56) [0.45, 0.60]
WN30                     0.79 [0.67, 0.86]   0.56 (0.58) [0.48, 0.63]
WN30g                    0.83 [0.73, 0.89]   0.66 (0.69) [0.59, 0.71]
CW         1             0.83 [0.73, 0.89]   0.63 [0.57, 0.69]
CW         2             0.83 [0.74, 0.90]   0.60 [0.53, 0.66]
CW         3             0.85 [0.76, 0.91]   0.59 [0.52, 0.65]
CW         4             0.89 [0.82, 0.93]   0.60 [0.53, 0.66]
CW         5             0.80 [0.70, 0.88]   0.58 [0.51, 0.65]
CW         6             0.75 [0.62, 0.84]   0.58 [0.50, 0.64]
CW         7             0.72 [0.58, 0.82]   0.57 [0.49, 0.63]
BoW        1             0.81 [0.70, 0.88]   0.64 [0.57, 0.70]
BoW        2             0.80 [0.69, 0.87]   0.64 [0.58, 0.70]
BoW        3             0.79 [0.67, 0.86]   0.64 [0.58, 0.70]
BoW        4             0.78 [0.66, 0.86]   0.65 [0.58, 0.70]
BoW        5             0.77 [0.64, 0.85]   0.64 [0.58, 0.70]
BoW        6             0.76 [0.63, 0.85]   0.65 [0.58, 0.70]
BoW        7             0.75 [0.62, 0.84]   0.64 [0.58, 0.70]
Syn        G1,D0         0.81 [0.70, 0.88]   0.62 [0.55, 0.68]
Syn        G2,D0         0.82 [0.72, 0.89]   0.55 [0.48, 0.62]
Syn        G3,D0         0.81 [0.71, 0.88]   0.62 [0.56, 0.68]
Syn        G1,D1         0.82 [0.72, 0.89]   0.62 [0.55, 0.68]
Syn        G2,D1         0.82 [0.73, 0.89]   0.62 [0.55, 0.68]
Syn        G3,D1         0.82 [0.72, 0.88]   0.62 [0.55, 0.68]
CW+Syn     4; G1,D0      0.88 [0.81, 0.93]   0.66 [0.59, 0.71]
CW+Syn     4; G2,D0      0.87 [0.80, 0.92]   0.64 [0.57, 0.70]
CW+Syn     4; G3,D0      0.86 [0.77, 0.91]   0.63 [0.56, 0.69]
CW+Syn     4; G1,D1      0.83 [0.73, 0.89]   0.48 [0.40, 0.56]
CW+Syn     4; G2,D1      0.83 [0.73, 0.89]   0.49 [0.40, 0.56]
CW+Syn     4; G3,D1      0.82 [0.72, 0.89]   0.48 [0.40, 0.56]

Regarding WordNet-based approaches, the use of the glosses and WordNet 3.0 (WN30g) yields the best results in both datasets. While MCR16 is close to WN30g for the RG dataset, it lags well behind on WordSim353. This discrepancy is further analyzed in Section 5.3. Note that the performance of WordNet in the WordSim353 dataset suffers from unknown words. In fact, there are nine pairs which returned null similarity for this reason. The number in parenthesis in Table 1 for WordSim353 shows the results for the 344 remaining pairs. Section 5.2 shows a proposal to overcome this limitation.

The bag-of-words approach tends to group together terms that can have a similar distribution of contextual terms. Therefore, terms that are topically related can appear in the same textual passages and will get high values using this model. We see this as an explanation why this model performed better than the context window approach for WordSim353, where annotators were instructed to provide high ratings to related terms. On the contrary, the context window approach tends to group together words that are exchangeable in exactly the same context, preserving order. Table 2 illustrates a few examples of context collected. Therefore, true synonyms and hyponyms/hyperonyms will receive high similarities, whereas terms related topically or based on any other semantic relation (e.g. movie and star) will have lower scores. This explains why this method performed better for the RG dataset. Section 5.3 confirms these observations.

Table 2: Sample of context windows for the terms in the RG dataset. Example contexts (the * marks the position of the target word): "ll never forget the * on his face when"; "he had a giant * on his face and"; "room with a huge * on her face and"; "the state of every * will be updated every"; "repair or replace the * if it is stolen"; "located on the north * of the Bay of"; "areas on the eastern * of the Adriatic Sea"; "Thesaurus of Current English * The Oxford Pocket Thesaurus"; "be understood that the * 10 may be designed"; "a fight between a * and a snake and". RG terms and frequencies: grin,2,smile,10; grin,3,smile,2; grin,2,smile,6; automobile,2,car,3; automobile,2,car,2; shore,14,coast,2; shore,3,coast,2; slave,3,boy,5,shore,3,string,2; wizard,4,glass,4,crane,5,smile,5; implement,5,oracle,2,lad,2; food,3,car,2,madhouse,3,jewel,3; asylum,4,tool,8,journey,6,etc.; crane,3,tool,3; bird,3,crane,5.
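The correlation measure used throughout Table 1 is, as noted above, the Pearson correlation of the rank vectors. A minimal sketch of that computation is given below; the Fisher z-transformation used for the 0.95 interval is a standard choice and an assumption here, since the exact interval procedure is not spelled out.

```python
# Sketch of the evaluation measure: Spearman correlation as Pearson over ranks,
# with an (assumed) Fisher-transform 0.95 confidence interval.
import numpy as np
from scipy import stats

def spearman_with_ci(gold, system, confidence=0.95):
    gold_ranks = stats.rankdata(gold)
    sys_ranks = stats.rankdata(system)
    rho, _ = stats.pearsonr(gold_ranks, sys_ranks)   # Pearson over ranks = Spearman
    n = len(gold)
    z = np.arctanh(rho)                              # Fisher transform
    half = stats.norm.ppf(0.5 + confidence / 2) / np.sqrt(n - 3)
    return rho, (float(np.tanh(z - half)), float(np.tanh(z + half)))

# Toy usage with made-up gold and system scores for a handful of word pairs.
gold   = [3.92, 3.84, 3.76, 1.68, 0.42, 0.08]
system = [0.91, 0.80, 0.85, 0.30, 0.12, 0.05]
print(spearman_with_ci(gold, system))
```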
4.3 Cross-lingual similarity

Table 3 shows the results for the English-Spanish cross-lingual datasets. For RG, MCR16 and the context windows methods drop only 5 percentage points, showing that cross-lingual similarity is feasible, and that both cross-lingual strategies are robust. The results for WordSim353 show that WN30g is the best for this dataset, with the rest of the methods falling over 10 percentage points relative to the monolingual experiment. A closer look at the WordNet results showed that most of the drop in performance was caused by out-of-vocabulary words, due to the smaller vocabulary of the Spanish WordNet. Though not totally comparable, if we compute the correlation over pairs covered in WordNet alone, the correlation would drop only 2 percentage points. In the case of the distributional approaches, the fall in performance was caused by the translations, as only 61% of the words were translated into the original word in the English datasets.

Table 3: Results obtained by the different methods on the Spanish/English cross-lingual datasets. The last column shows the performance difference with respect to the results on the original dataset.

Dataset  Method            overall         interval       diff.
RG       MCR16             0.78            [0.66, 0.86]   -0.05
RG       WN30g             0.74            [0.61, 0.84]   -0.09
RG       Bag of words      0.68            [0.53, 0.79]   -0.23
RG       Context windows   0.83            [0.73, 0.89]   -0.05
WS353    MCR16             0.42 (0.53)     [0.34, 0.51]   -0.11 (-0.03)
WS353    WN30g             0.58 (0.67)     [0.51, 0.64]   -0.07 (-0.02)
WS353    Bag of words      0.53            [0.45, 0.61]   -0.12
WS353    Context windows   0.52            [0.44, 0.59]   -0.11

5 Detailed analysis and system combination

In this section we present some analysis, including learning curves for distributional methods, the use of distributional similarity to improve WordNet similarity, the contrast between similarity and relatedness, and the combination of methods.

5.1 Learning curves for distributional methods

Figure 1: Effect of the size of the training corpus, for the best distributional similarity model in each dataset. Left: WordSim353 with bag-of-words, Right: RG with context windows.

Figure 1 shows that the correlation improves with the size of the corpus, as expected. For the results using the WordSim353 corpus, we show the results of the bag-of-words approach with context size 10. Results improve from 0.5 Spearman correlation up to 0.65 when increasing the corpus size three orders of magnitude, although the effect decays at the end, which indicates that we might not get further gains going beyond the current size of the corpus. With respect to results for the RG dataset, we used a context-window approach with context radius 4. Here, results improve even more with data size, probably due to the sparse data problem collecting 8-word context windows if the corpus is not large enough. Correlation improves linearly right to the end, where results stabilize around 0.89.

Table 4: Results obtained replacing unknown words with their most similar three words (WordSim353 dataset)

Method   Without similar words       With similar words
WN30     0.56 (0.58) [0.48, 0.63]    0.58 [0.51, 0.65]
WN30g    0.66 (0.69) [0.59, 0.71]    0.68 [0.62, 0.73]

Table 5: Results obtained on the WordSim353 dataset and on the two similarity and relatedness subsets

Method   overall            Similarity         Relatedness
MCR16    0.53 [0.45, 0.60]  0.65 [0.56, 0.72]  0.33 [0.21, 0.43]
WN30     0.56 [0.48, 0.63]  0.73 [0.65, 0.79]  0.38 [0.27, 0.48]
WN30g    0.66 [0.59, 0.71]  0.72 [0.64, 0.78]  0.56 [0.46, 0.64]
BoW      0.65 [0.59, 0.71]  0.70 [0.63, 0.77]  0.62 [0.53, 0.69]
CW       0.60 [0.53, 0.66]  0.77 [0.71, 0.82]  0.46 [0.36, 0.55]

5.2 Combining both approaches: dealing with unknown words in WordNet

Although the vocabulary of WordNet is very extensive, applications are bound to need the similarity between words which are not included in WordNet. This is exemplified in the WordSim353 dataset, where 9 pairs contain words which are unknown to WordNet. In order to overcome this shortcoming, we could use similar words instead, as provided by the distributional thesaurus.
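A minimal sketch of this back-off idea follows: when a word is missing from WordNet, its distributional nearest neighbours stand in for it. Taking the maximum over the substitutes is an assumption for illustration only, and the thesaurus, similarity function and vocabulary test below are hypothetical stand-ins.

```python
# Sketch of the back-off of Section 5.2: score out-of-vocabulary words through
# their distributionally most similar words.
def backoff_similarity(w1, w2, wn_similarity, in_wordnet, nearest_neighbors, k=3):
    """wn_similarity(a, b): WordNet-based score; in_wordnet(w): vocabulary test;
    nearest_neighbors(w, k): top-k neighbours from the distributional thesaurus."""
    cands1 = [w1] if in_wordnet(w1) else nearest_neighbors(w1, k)
    cands2 = [w2] if in_wordnet(w2) else nearest_neighbors(w2, k)
    scores = [wn_similarity(a, b) for a in cands1 for b in cands2
              if in_wordnet(a) and in_wordnet(b)]
    return max(scores) if scores else 0.0   # max-aggregation is an assumption

# Toy usage with stand-in components.
thesaurus = {"defeating": ["beating", "defeat", "winning"]}
print(backoff_similarity(
    "defeating", "win",
    wn_similarity=lambda a, b: 0.8 if {a, b} == {"winning", "win"} else 0.2,
    in_wordnet=lambda w: w != "defeating",
    nearest_neighbors=lambda w, k: thesaurus.get(w, [])[:k]))
```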
We used the distributional thesaurus defined in Section 3, using context windows of width 4, to provide three similar words for each of the unknown words in WordNet. Results improve for both WN30 and WN30g, as shown in Table 4, attaining our best results for WordSim353. 5.3 Similarity vs. relatedness We mentioned above that the annotation guidelines of WordSim353 did not distinguish between similar and related pairs. As the results in Section 4 show, different techniques are more appropriate to calculate either similarity or relatedness. In order to study this effect, ideally, we would have two versions of the dataset, where annotators were given precise instructions to distinguish similarity in one case, and relatedness in the other. Given the lack of such datasets, we devised a simpler approach in 24 order to reuse the existing human judgements. We manually split the dataset in two parts, as follows. First, two humans classified all pairs as being synonyms of each other, antonyms, identical, hyperonym-hyponym, hyponym-hyperonym, holonym-meronym, meronym-holonym, and noneof-the-above. The inter-tagger agreement rate was 0.80, with a Kappa score of 0.77. This annotation was used to group the pairs in three categories: similar pairs (those classified as synonyms, antonyms, identical, or hyponym-hyperonym), related pairs (those classified as meronym-holonym, and pairs classified as none-of-the-above, with a human average similarity greater than 5), and unrelated pairs (those classified as none-of-the-above that had average similarity less than or equal to 5). We then created two new gold-standard datasets: similarity (the union of similar and unrelated pairs), and relatedness (the union of related and unrelated)7 . Table 5 shows the results on the relatedness and similarity subsets of WordSim353 for the different methods. Regarding WordNet methods, both WN30 and WN30g perform similarly on the similarity subset, but WN30g obtains the best results by far on the relatedness data. These results are congruent with our expectations: two words are similar if their synsets are in close places in the WordNet hierarchy, and two words are related if there is a connection between them. Most of the relations in WordNet are of hierarchical nature, and although other relations exist, they are far less numerous, thus explaining the good results for both WN30 and WN30g on similarity, but the bad results of WN30 on relatedness. The disambiguated glosses help find connections among related concepts, and allow our method to better model relatedness with respect to WN30. The low results for MCR16 also deserve some comments. Given the fact that MCR16 performed very well on the RG dataset, it comes as a surprise that it performs so poorly for the similarity subset of WordSim353. In an additional evaluation, we attested that MCR16 does indeed perform as well as MCR30g on the similar pairs subset. We believe that this deviation could be due to the method used to construct the similarity dataset, which includes some pairs of loosely related pairs labeled as unrelated. 
7 Available at http://alfonseca.org/eng/research/wordsim353.html Methods combined in the SVM WN30g, bag of words WN30g, context windows WN30g, syntax WN30g, bag of words, context windows, syntax RG dataset 0.88 [0.82, 0.93] 0.90 [0.84, 0.94] 0.89 [0.83, 0.93] 0.96 [0.93, 0.97] WordSim353 dataset 0.78 [0.73, 0.81] 0.73 [0.68, 0.79] 0.75 [0.70, 0.79] 0.78 [0.73, 0.82] WordSim353 similarity 0.81 [0.76, 0.86] 0.83 [0.78, 0.87] 0.83 [0.78, 0.87] 0.83 [0.78, 0.87] WordSim353 relatedness 0.72 [0.65, 0.77] 0.64 [0.56, 0.71] 0.67 [0.60, 0.74] 0.71 [0.65, 0.77] Table 6: Results using a supervised combination of several systems. Max values are bolded for each dataset. Method (Sahami et al., 2006) (Chen et al., 2006) (Wu and Palmer, 1994) (Leacock et al., 1998) (Resnik, 1995) (Lin, 1998a) (Bollegala et al., 2007) (Jiang and Conrath, 1997) (Jarmasz, 2003) (Patwardhan et al., 2006) (Alvarez and Lim, 2007) (Yang and Powers, 2005) (Hughes et al., 2007) Personalized PageRank Bag of words Context window Syntactic contexts SVM Source Web snippets Web snippets WordNet WordNet WordNet WordNet Web snippets WordNet Roget's WordNet WordNet WordNet WordNet WordNet Web corpus Web corpus Web corpus Web, WN Spearman (MC) 0.62 [0.32, 0.81] 0.69 [0.42, 0.84] 0.78 [0.59, 0.90] 0.79 [0.59, 0.90] 0.81 [0.62, 0.91] 0.82 [0.65, 0.91] 0.82 [0.64, 0.91] 0.83 [0.67, 0.92] 0.87 [0.73, 0.94] n/a n/a 0.87 [0.73, 0.91] 0.90 0.89 [0.77, 0.94] 0.85 [0.70, 0.93] 0.88 [0.76, 0.95] 0.76 [0.54, 0.88] 0.92 [0.84, 0.96] Pearson (MC) 0.58 [0.26, 0.78] 0.69 [0.42, 0.85] 0.78 [0.57, 0.89] 0.82 [0.64, 0.91] 0.80 [0.60, 0.90] 0.83 [0.67, 0.92] 0.83 [0.67, 0.92] 0.85 [0.69, 0.93] 0.87 [0.74, 0.94] 0.91 0.91 0.92 [0.84, 0.96] n/a n/a 0.84 [0.69, 0.93] 0.89 [0.77, 0.95] 0.74 [0.51, 0.87] 0.93 [0.85, 0.97] Concerning the techniques based on distributional similarities, the method based on context windows provides the best results for similarity, and the bagof-words representation outperforms most of the other techniques for relatedness. 5.4 Supervised combination In order to gain an insight on which would be the upper bound that we could obtain when combining our methods, we took the output of three systems (bag of words with window size 10, context window with size 4, and the WN30g run). Each of these outputs is a ranking of word pairs, and we implemented an oracle that chooses, for each pair, the rank that is most similar to the rank of the pair in the gold-standard. The outputs of the oracle have a Spearman correlation of 0.97 for RG and 0.92 for WordSim353, which gives as an indication of the correlations that could be achieved by choosing for each pair the rank output by the best classifier for that pair. The previous results motivated the use of a supervised approach to combine the output of the different systems. We created a training corpus containing pairs of pairs of words from the datasets, having as features the similarity and rank of each pair involved as given by the different unsupervised systems. A classifier is trained to decide whether the first pair is more similar than the second one. For example, a training instance using two unsupervised classifiers is 0.001364, 31, 0.327515, 64, 0.084805, 57, 0.109061, 59, negative Table 7: Comparison with previous approaches for MC. not have a held-out set, so we used the standard settings of Weka, without trying to modify parameters, e.g. C. Each word pair is scored with the number of pairs that were considered to have less similarity using the SVM. 
The results using 10-fold crossvalidation are shown in Table 6. A combination of all methods produces the best results reported so far for both datasets, statistically significant for RG. 6 Related work meaning that the similarities given by the first classifier to the two pairs were 0.001364 and 0.327515 respectively, which ranked them in positions 31 and 64. The second classifier gave them similarities of 0.084805 and 0.109061 respectively, which ranked them in positions 57 and 59. The class negative indicates that in the gold-standard the first pair has a lower score than the second pair. We have trained a SVM to classify pairs of pairs, and use its output to rank the entries in both datasets. It uses a polynomial kernel with degree 4. We did 25 Contrary to the WordSim353 dataset, common practice with the RG dataset has been to perform the evaluation with Pearson correlation. In our believe Pearson is less informative, as the Pearson correlation suffers much when the scores of two systems are not linearly correlated, something which happens often given due to the different nature of the techniques applied. Some authors, e.g. Alvarez and Lim (2007), use a non-linear function to map the system outputs into new values distributed more similarly to the values in the gold-standard. In their case, the mapping function was exp ( -x ), which was chosen 4 empirically. Finding such a function is dependent on the dataset used, and involves an extra step in the similarity calculations. Alternatively, the Spearman correlation provides an evaluation metric that is independent of such data-dependent transformations. Most similarity researchers have published their Word pair automobile, car journey, voyage gem, jewel boy, lad coast, shore asylum, madhouse magician, wizard midday, noon furnace, stove food, fruit bird, cock bird, crane implement, tool brother, monk M&C 3.92 3.84 3.84 3.76 3.7 3.61 3.5 3.42 3.11 3.08 3.05 2.97 2.95 2.82 SVM 62 54 61 57 53 45 49 61 50 47 46 38 55 42 Word pair crane, implement brother, lad car, journey monk, oracle food, rooster coast, hill forest, graveyard monk, slave lad, wizard coast, forest cord, smile glass, magician rooster, voyage noon, string M&C 1.68 1.66 1.16 1.1 0.89 0.87 0.84 0.55 0.42 0.42 0.13 0.11 0.08 0.08 SVM 26 39 37 32 3 34 27 17 13 18 5 10 1 5 Table 8: Our best results for the MC dataset. Method (Strube and Ponzetto, 2006) (Jarmasz, 2003) (Jarmasz, 2003) (Hughes and Ramage, 2007) (Finkelstein et al., 2002) (Gabrilovich and Markovitch, 2007) (Gabrilovich and Markovitch, 2007) SVM Source Wikipedia WordNet Roget's WordNet Web corpus, WN ODP Wikipedia Web corpus, WN Spearman 0.19­0.48 0.33­0.35 0.55 0.55 0.56 0.65 0.75 0.78 other corpus-based methods. The most similar approach to our distributional technique is Finkelstein et al. (2002), who combined distributional similarities from Web documents with a similarity from WordNet. Their results are probably worse due to the smaller data size (they used 270,000 documents) and the differences in the calculation of the similarities. The only method which outperforms our non-supervised methods is that of (Gabrilovich and Markovitch, 2007) when based on Wikipedia, probably because of the dense, manually distilled knowledge contained in Wikipedia. All in all, our supervised combination gets the best published results on this dataset. 7 Conclusions and future work Table 9: Comparison with previous work for WordSim353. 
complete results on a smaller subset of the RG dataset containing 30 word pairs (Miller and Charles, 1991), usually referred to as MC, making it possible to compare different systems using different correlation. Table 7 shows the results of related work on MC that was available to us, including our own. For the authors that did not provide the detailed data we include only the Pearson correlation with no confidence intervals. Among the unsupervised methods introduced in this paper, the context window produced the best reported Spearman correlation, although the 0.95 confidence intervals are too large to allow us to accept the hypothesis that it is better than all others methods. The supervised combination produces the best results reported so far. For the benefit of future research, our results for the MC subset are displayed in Table 8. Comparison on the WordSim353 dataset is easier, as all researchers have used Spearman. The figures in Table 9) show that our WordNet-based method outperforms all previously published WordNet methods. We want to note that our WordNetbased method outperforms that of Hughes and Ramage (2007), which uses a similar method. Although there are some differences in the method, we think that the main performance gain comes from the use of the disambiguated glosses, which they did not use. Our distributional methods also outperform all 26 This paper has presented two state-of-the-art distributional and WordNet-based similarity measures, with a study of several parameters, including performance on similarity and relatedness data. We show that the use of disambiguated glosses allows for the best published results for WordNet-based systems on the WordSim353 dataset, mainly due to the better modeling of relatedness (as opposed to similarity). Distributional similarities have proven to be competitive when compared to knowledgebased methods, with context windows being better for similarity and bag of words for relatedness. Distributional similarity was effectively used to cover out-of-vocabulary items in the WordNet-based measure providing our best unsupervised results. The complementarity of our methods was exploited by a supervised learner, producing the best results so far for RG and WordSim353. Our results include confidence values, which, surprisingly, were not included in most previous work, and show that many results over RG and WordSim353 are indistinguishable. The algorithm for WordNet-base similarity and the necessary resources are publicly available8 . This work pioneers cross-lingual extension and evaluation of both distributional and WordNet-based measures. We have shown that closely aligned wordnets provide a natural and effective way to compute cross-lingual similarity with minor losses. A simple translation strategy also yields good results for distributional methods. 8 http://ixa2.si.ehu.es/ukb/ References E. Agirre and A. Soroa. 2009. Personalizing pagerank for word sense disambiguation. In Proc. of EACL 2009, Athens, Greece. M.A. Alvarez and S.J. Lim. 2007. A Graph Modeling of Semantic Similarity between Words. Proc. of the Conference on Semantic Computing, pages 355­362. J. Atserias, L. Villarejo, G. Rigau, E. Agirre, J. Carroll, B. Magnini, and P. Vossen. 2004. The meaning multilingual central repository. In Proc. of Global WordNet Conference, Brno, Czech Republic. D. Bollegala, Matsuo Y., and M. Ishizuka. 2007. Measuring semantic similarity between words using web search engines. In Proceedings of WWW'2007. A. Budanitsky and G. Hirst. 2006. 
Evaluating WordNetbased Measures of Lexical Semantic Relatedness. Computational Linguistics, 32(1):13­47. H. Chen, M. Lin, and Y. Wei. 2006. Novel association measures using web search with double checking. In Proceedings of COCLING/ACL 2006. J. Daude, L. Padro, and G. Rigau. 2000. Mapping WordNets using structural information. In Proceedings of ACL'2000, Hong Kong. C. Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database and Some of its Applications. MIT Press, Cambridge, Mass. L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. 2002. Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems, 20(1):116­131. E. Gabrilovich and S. Markovitch. 2007. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Proc of IJCAI, pages 6­12. T. H. Haveliwala. 2002. Topic-sensitive pagerank. In WWW '02: Proceedings of the 11th international conference on World Wide Web, pages 517­526. T. Hughes and D. Ramage. 2007. Lexical semantic relatedness with random graph walks. In Proceedings of EMNLP-CoNLL-2007, pages 581­589. M. Jarmasz. 2003. Roget's Thesuarus as a lexical resource for Natural Language Processing. J.J. Jiang and D.W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of International Conference on Research in Computational Linguistics, volume 33. Taiwan. C. Leacock and M. Chodorow. 1998. Combining local context and WordNet similarity for word sense identification. WordNet: An Electronic Lexical Database, 49(2):265­283. D. Lin. 1998a. An information-theoretic definition of similarity. In Proc. of ICML, pages 296­304, Wisconsin, USA. D. Lin. 1998b. Automatic Retrieval and Clustering of Similar Words. In Proceedings of ACL-98. G.A. Miller and W.G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1­28. J. Nivre. 2006. Inductive Dependency Parsing, volume 34 of Text, Speech and Language Technology. Springer. S. Pado and M. Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161­199. S. Patwardhan and T. Pedersen. 2006. Using WordNetbased Context Vectors to Estimate the Semantic Relatedness of Concepts. In Proceedings of the EACL Workshop on Making Sense of Sense: Bringing Computational Linguistics and Pycholinguistics Together, pages 1­8, Trento, Italy. P. Resnik. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Proc. of IJCAI, 14:448­453. H. Rubenstein and J.B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627­633. M Ruiz-Casado, E. Alfonseca, and P. Castells. 2005. Using context-window overlapping in Synonym Discovery and Ontology Extension. In Proceedings of RANLP-2005, Borovets, Bulgaria,. M. Sahami and T.D. Heilman. 2006. A web-based kernel function for measuring the similarity of short text snippets. Proc. of WWW, pages 377­386. M. Strube and S.P. Ponzetto. 2006. WikiRelate! Computing Semantic Relatedness Using Wikipedia. In Proceedings of the AAAI-2006, pages 1419­1424. P.D. Turney. 2001. Mining the Web for Synonyms: PMIIR versus LSA on TOEFL. Lecture Notes in Computer Science, 2167:491­502. P. Vossen, editor. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers. Z. Wu and M. Palmer. 1994. Verb semantics and lexical selection. In Proc. of ACL, pages 133­138, Las Cruces, New Mexico. D. Yang and D.M.W. Powers. 2005. 
Measuring semantic similarity in the taxonomy of WordNet. Proceedings of the Australasian conference on Computer Science. 27 A Fully Unsupervised Word Sense Disambiguation Method Using Dependency Knowledge Ping Chen Dept. of Computer and Math. Sciences University of Houston-Downtown chenp@uhd.edu Chris Bowes Dept. of Computer and Math. Sciences University of Houston-Downtown bowesc@uhd.edu Abstract Word sense disambiguation is the process of determining which sense of a word is used in a given context. Due to its importance in understanding semantics of natural languages, word sense disambiguation has been extensively studied in Computational Linguistics. However, existing methods either are brittle and narrowly focus on specific topics or words, or provide only mediocre performance in real-world settings. Broad coverage and disambiguation quality are critical for a word sense disambiguation system. In this paper we present a fully unsupervised word sense disambiguation method that requires only a dictionary and unannotated text as input. Such an automatic approach overcomes the problem of brittleness suffered in many existing methods and makes broad-coverage word sense disambiguation feasible in practice. We evaluated our approach using SemEval 2007 Task 7 (Coarse-grained English All-words Task), and our system significantly outperformed the best unsupervised system participating in SemEval 2007 and achieved the performance approaching top-performing supervised systems. Although our method was only tested with coarse-grained sense disambiguation, it can be directly applied to fine-grained sense disambiguation. Wei Ding Department of Computer Science University of Massachusetts-Boston ding@cs.umb.edu David Brown Dept. of Computer and Math. Sciences University of Houston-Downtown brownd@uhd.edu is the process of determining which sense of a homograph is used in a given context. WSD is a long-standing problem in Computational Linguistics, and has significant impact in many real-world applications including machine translation, information extraction, and information retrieval. Generally, WSD methods use the context of a word for its sense disambiguation, and the context information can come from either annotated/unannotated text or other knowledge resources, such as WordNet (Fellbaum, 1998), SemCor (SemCor, 2008), Open Mind Word Expert (Chklovski and Mihalcea, 2002), eXtended WordNet (Moldovan and Rus, 2001), Wikipedia (Mihalcea, 2007), parallel corpora (Ng, Wang, and Chan, 2003). In (Ide and V´ ronis, e 1998) many different WSD approaches were described. Usually, WSD techniques can be divided into four categories (Agirre and Edmonds, 2006), · Dictionary and knowledge based methods. These methods use lexical knowledge bases such as dictionaries and thesauri, and hypothesize that context knowledge can be extracted from definitions of words. For example, Lesk disambiguated two words by finding the pair of senses with the greatest word overlap in their dictionary definitions (Lesk, 1986). · Supervised methods. Supervised methods mainly adopt context to disambiguate words. A supervised method includes a training phase and a testing phase. In the training phase, a sense-annotated training corpus is required, from which syntactic and semantic features are extracted to create a classifier using machine 1 Introduction In many natural languages, a word can represent multiple meanings/senses, and such a word is called a homograph. 
Word sense disambiguation(WSD) 28 Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 28­36, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics learning techniques, such as Support Vector Machine (Novischi et al., 2007). In the following testing phase, a word is classified into senses (Mihalcea, 2002) (Ng and Lee, 1996). Currently supervised methods achieve the best disambiguation quality (about 80% precision and recall for coarse-grained WSD in the most recent WSD evaluation conference SemEval 2007 (Navigli et al., 2007)). Nevertheless, since training corpora are manually annotated and expensive, supervised methods are often brittle due to data scarcity, and it is hard to annotate and acquire sufficient contextual information for every sense of a large number of words existing in natural languages. · Semi-supervised methods. To overcome the knowledge acquisition bottleneck problem suffered by supervised methods, these methods make use of a small annotated corpus as seed data in a bootstrapping process (Hearst, 1991) (Yarowsky, 1995). A word-aligned bilingual corpus can also serve as seed data (Ng, Wang, and Chan, 2003). · Unsupervised methods. These methods acquire contextual information directly from unannotated raw text, and senses can be induced from text using some similarity measure (Lin, 1997). However, automatically acquired information is often noisy or even erroneous. In the most recent SemEval 2007 (Navigli et al., 2007), the best unsupervised systems only achieved about 70% precision and 50% recall. Disambiguation of a limited number of words is not hard, and necessary context information can be carefully collected and hand-crafted to achieve high disambiguation accuracy as shown in (Yarowsky, 1995). However, such approaches suffer a significant performance drop in practice when domain or vocabulary is not limited. Such a "cliff-style" performance collapse is called brittleness, which is due to insufficient knowledge and shared by many techniques in Artificial Intelligence. The main challenge of a WSD system is how to overcome the knowledge acquisition bottleneck and efficiently collect the huge amount of context knowledge. More precisely, a practical WSD need figure out how to create 29 and maintain a comprehensive, dynamic, and up-todate context knowledge base in a highly automatic manner. The context knowledge required in WSD has the following properties: 1. The context knowledge need cover a large number of words and their usage. Such a requirement of broad coverage is not trivial because a natural language usually contains thousands of words, and some popular words can have dozens of senses. For example, the Oxford English Dictionary has approximately 301,100 main entries (Oxford, 2003), and the average polysemy of the WordNet inventory is 6.18 (Fellbaum, 1998). Clearly acquisition of such a huge amount of knowledge can only be achieved with automatic techniques. 2. Natural language is not a static phenomenon. New usage of existing words emerges, which creates new senses. New words are created, and some words may "die" over time. It is estimated that every year around 2,500 new words appear in English (Kister, 1992). Such dynamics requires a timely maintenance and updating of context knowledge base, which makes manual collection even more impractical. 
Taking into consideration the large amount and dynamic nature of context knowledge, we only have limited options when choosing knowledge sources for WSD. WSD is often an unconscious process to human beings. With a dictionary and sample sentences/phrases an average educated person can correctly disambiguate most polysemous words. Inspired by human WSD process, we choose an electronic dictionary and unannotated text samples of word instances as context knowledge sources for our WSD system. Both sources can be automatically accessed, provide an excellent coverage of word meanings and usage, and are actively updated to reflect the current state of languages. In this paper we present a fully unsupervised WSD system, which only requires WordNet sense inventory and unannotated text. In the rest of this paper, section 2 describes how to acquire and represent the context knowledge for WSD. We present our WSD algorithm in section 3. Our WSD system is evaluated with SemEval-2007 Task 7 (Coarse-grained English Figure 1: Context Knowledge Acquisition and Representation Process to provide broad and up-to-date context knowledge for WSD. The major concern about Web documents is inconsistency of their quality, and many Web pages are spam or contain erroneous information. However, factual errors in Web pages will not hurt the performance of WSD. Nevertheless, the quality of context knowledge is affected by broken sentences of poor linguistic quality and invalid word usage, e.g., sentences like "Colorless green ideas sleep furiously" that violate commonsense knowledge. Based on our experience these kind of errors are negligible when using popular Web search engines to retrieve relevant Web pages. To start the acquisition process, words that need to be disambiguated are compiled and saved in a text file. Each single word is submitted to a Web search engine as a query. Several search engines provide API's for research communities to automatically retrieve large number of Web pages. In our experiments we used both Google and Yahoo! API's to retrieve up to 1,000 Web pages for each tobe-disambiguated word. Collected Web pages are cleaned first, e.g., control characters and HTML tags are removed. Then sentences are segmented simply based on punctuation (e.g., ?, !, .). Sentences that contain the instances of a specific word are extracted and saved into a local repository. 2.2 Parsing Sentences organized according to each word are sent to a dependency parser, Minipar. Dependency parsers have been widely used in Computational Linguistics and natural language processing. An evaluation with the SUSANNE corpus shows that Minipar achieves 89% precision with respect to dependency relations (Lin, 1998). After parsing sentences are converted to parsing trees and saved in files. Neither our simple sentence segmentation approach nor Minipar parsing is 100% accurate, so a small number of invalid dependency relations may exist in parsing trees. The impact of these erroneous relations will be minimized in our WSD algorithm. Comparing with tagging or chunking, parsing is relatively expensive and time-consuming. However, in our method parsing is not performed in real time when we disambiguate words. Instead, sentences All-words Task) data set, and the experiment results are discussed in section 4. We conclude in section 5. 2 Context Knowledge Acquisition and Representation Figure 1 shows an overview of our context knowledge acquisition process, and collected knowledge is saved in a local knowledge base. 
Here are some details about each step. 2.1 Corpus building through Web search The goal of this step is to collect as many as possible valid sample sentences containing the instances of to-be-disambiguated words. Preferably these instances are also diverse and cover many senses of a word. We have considered two possible text sources, 1. Electronic text collection, e.g., Gutenberg project (Gutenberg, 1971). Such collections often include thousands of books, which are often written by professionals and can provide many valid and accurate usage of a large number of words. Nevertheless, books in these collections are usually copyright-free and old, hence are lack of new words or new senses of words used in modern English. 2. Web documents. Billions of documents exist in the World Wide Web, and millions of Web pages are created and updated everyday. Such a huge dynamic text collection is an ideal source 30 Figure 3: WSD Procedure Figure 2: Merging two parsing trees. The number beside each edge is the number of occurrences of this dependency relation existing in the context knowledge base. process as our context knowledge base. As a fully automatic knowledge acquisition process, it is inevitable to include erroneous dependency relations in the knowledge base. However, since in a large text collection valid dependency relations tend to repeat far more times than invalid ones, these erroneous edges only have minimal impact on the disambiguation quality as shown in our evaluation results. are parsed only once to extract dependency relations, then these relations are merged and saved in a local knowledge base for the following disambiguation. Hence, parsing will not affect the speed of disambiguation at all. 2.3 Merging dependency relations After parsing, dependency relations from different sentences are merged and saved in a context knowledge base. The merging process is straightforward. A dependency relation includes one head word/node and one dependent word/node. Nodes from different dependency relations are merged into one as long as they represent the same word. An example is shown in Figure 2, which merges the following two sentences: "Computer programmers write software." "Many companies hire computer programmers." In a dependency relation "word1 word2 ", word1 is the head word, and word2 is the dependent word. After merging dependency relations, we will obtain a weighted directed graph with a word as a node, a dependency relation as an edge, and the number of occurrences of dependency relation as weight of an edge. This weight indicates the strength of semantic relevancy of head word and dependent word. This graph will be used in the following WSD 31 3 WSD Algorithm Our WSD approach is based on the following insight: If a word is semantically coherent with its context, then at least one sense of this word is semantically coherent with its context. Assume that the text to be disambiguated is semantically valid, if we replace a word with its glosses one by one, the correct sense should be the one that will maximize the semantic coherence within this word's context. Based on this idea we set up our WSD procedure as shown in Figure 3. First both the original sentence that contains the to-be-disambiguated word and the glosses of to-bedisambiguated word are parsed. Then the parsing tree generated from each gloss is matched with the parsing tree of original sentence one by one. The gloss most semantically coherent with the original sentence will be chosen as the correct sense. 
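As a rough illustration of Sections 2.1-2.3, the knowledge base amounts to counting head-dependent pairs over the parsed sentences and normalizing the counts to [0, 1]. The sketch below is a minimal Python rendering under that reading; parse_dependencies stands in for a wrapper around a dependency parser (the paper uses Minipar), and the toy parses are hand-written for illustration, not the authors' output.

from collections import defaultdict

def build_knowledge_base(sentences, parse_dependencies):
    """Merge dependency relations from many sentences into a weighted
    directed graph: (head, dependent) -> number of occurrences."""
    kb = defaultdict(int)
    for sentence in sentences:
        # parse_dependencies is assumed to wrap a dependency parser
        # and yield (head, dependent) pairs for one sentence.
        for head, dependent in parse_dependencies(sentence):
            kb[(head, dependent)] += 1
    return kb

def normalize(kb):
    """Scale raw counts to [0, 1] so frequent relations get weights near 1."""
    if not kb:
        return {}
    max_count = max(kb.values())
    return {edge: count / max_count for edge, count in kb.items()}

# Toy example: hand-written dependencies for the two sentences merged in
# Figure 2 ("Computer programmers write software." and
# "Many companies hire computer programmers.").
toy_parses = {
    "Computer programmers write software.":
        [("write", "programmers"), ("write", "software"),
         ("programmers", "computer")],
    "Many companies hire computer programmers.":
        [("hire", "companies"), ("hire", "programmers"),
         ("programmers", "computer"), ("companies", "many")],
}
kb = build_knowledge_base(toy_parses, lambda s: toy_parses[s])
print(normalize(kb)[("programmers", "computer")])  # repeated edge -> weight 1.0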
How to measure the semantic coherence is critical. Our idea is based on the following hypotheses (assume word1 is the to-be-disambiguated word): · In a sentence if word1 is dependent on word2 , and we denote the gloss of the correct sense of word1 as g1i , then g1i contains the most semantically coherent words that are dependent on word2 ; · In a sentence if a set of words DEP1 are dependent on word1 , and we denote the gloss of the correct sense of word1 as g1i , then g1i contains the most semantically coherent words that DEP1 are dependent on. For example, we try to disambiguate "company" in "A large company hires many computer programmers", after parsing we obtain the dependency relations "hire company" and "company large". The correct sense for the word "company" should be "an institution created to conduct business". If in the context knowledge base there exist the dependency relations "hire institution" or "institution large", then we believe that the gloss "an institution created to conduct business" is semantically coherent with its context - the original sentence. The gloss with the highest semantic coherence will be chosen as the correct sense. Obviously, the size of context knowledge base has a positive impact on the disambiguation quality, which is also verified in our experiments (see Section 4.2). Figure 4 shows our detailed WSD algorithm. Semantic coherence score is generated by the function T reeM atching, and we adopt a sentence as the context of a word. We illustrate our WSD algorithm through an example. Assume we try to disambiguate "company" in the sentence "A large software company hires many computer programmers". "company" has 9 senses as a noun in WordNet 2.1. Let's pick the following two glosses to go through our WSD process. · an institution created to conduct business · small military unit First we parse the original sentence and two glosses, and get three weighted parsing trees as shown in Figure 5. All weights are assigned to nodes/words in these parsing trees. In the parsing tree of the original sentence the weight of a node is reciprocal of the distance between this node and tobe-disambiguated node "company" (line 12 in Figure 4). In the parsing tree of a gloss the weight of a node is reciprocal of the level of this node in the parsing tree (line 16 in Figure 4). Assume that our context knowledge base contains relevant dependency relations shown in Figure 6. 32 Input: Glosses from WordNet; S: the sentence to be disambiguated; G: the knowledge base generated in Section 2; 1. Input a sentence S, W = {w| w's part of speech is noun, verb, adjective, or adverb, w S}; 2. Parse S with a dependency parser, generate parsing tree TS ; 3. For each w W { 4. Input all w's glosses from WordNet; 5. For each gloss wi { 6. Parse wi , get a parsing tree Twi ; 7. score = TreeMatching(TS , Twi ); } 8. If the highest score is larger than a preset threshold, choose the sense with the highest score as the correct sense; 9. Otherwise, choose the first sense. 10. } TreeMatching(TS , Twi ) 11. For each node nSi TS { 1 12. Assign weight wSi = lSi , lSi is the length between nSi and wi in TS ; 13. } 14. For each node nwi Twi { 15. Load its dependent words Dwi from G; 1 16. Assign weight wwi = lwi , lwi is the level number of nwi in Twi ; 17. For each nSj { 18. If nSj Dwi 19. calculate connection strength sji between nSj and nwi ; 20. score = score + wSi × wwi × sji ; 21. } 22. } 23. 
Return score;

Figure 4: WSD Algorithm

The weights in the context knowledge base are assigned to dependency relation edges. These weights are normalized to [0, 1] based on the number of dependency relation instances obtained in the acquisition and merging process. A large number of occurrences will be normalized to a high value (close to 1), and a small number of occurrences will be normalized to a low value (close to 0).

Now we load the dependent words of each word in gloss 1 from the knowledge base (lines 14 and 15 in Figure 4), and we get {small, large} for "institution" and {large, software} for "business". Among the dependent words of "company", "large" belongs to the dependent word sets of "institution" and "business", and "software" belongs to the dependent word set of "business", so the coherence score of gloss 1 is calculated as (lines 19 and 20 in Figure 4):

1.0 × 1.0 × 0.7 + 1.0 × 0.25 × 0.8 + 1.0 × 0.25 × 0.9 = 1.125

We go through the same process with the second gloss, "small military unit". "Large" is the only dependent word of "company" appearing in the dependent word set of "unit" in gloss 2, so the coherence score of gloss 2 in the current context is:

1.0 × 1.0 × 0.8 = 0.8

After comparing the coherence scores of the two glosses, we choose sense 1 of "company" as the correct sense (line 9 in Figure 4). This example illustrates that a strong dependency relation between a head word and a dependent word has a powerful disambiguation capability, and that disambiguation quality is also significantly affected by the quality of dictionary definitions.

Figure 5: Weighted parsing trees of the original sentence and two glosses of "company". (Trees omitted here.)

Figure 6: A fragment of the context knowledge base. (Omitted here.)

In Figure 4 the TreeMatching function matches the dependent words of the to-be-disambiguated word (line 15 in Figure 4); we call this matching strategy dependency matching. This strategy will not work if a to-be-disambiguated word has no dependent words at all, for example, when the word "company" in "Companies hire computer programmers" has no dependent words. In this case, we developed a second matching strategy, which is to match the head words that the to-be-disambiguated word is dependent on, such as matching "hire" (the head word of "company") in Figure 5(a); we call this strategy backward matching. Using the dependency relation "hire -> company", we can correctly choose sense 1 since there is no such relation as "hire -> unit" in the knowledge base. This strategy is also helpful when disambiguating adjectives and adverbs, since they usually only depend on other words, and rarely are any other words dependent on them. The third matching strategy is to consider synonyms as a match besides the exact matching words. Synonyms can be obtained through the synsets in WordNet. For example, when we disambiguate "company" in "Big companies hire many computer programmers", "big" can be considered as a match for "large". We call this matching strategy synonym matching. The three matching strategies can be combined and applied together, and in Section 4.1 we show the experiment results of 5 different matching strategy combinations.

4 Experiments

We have evaluated our method using the SemEval-2007 Task 07 (Coarse-grained English All-words Task) test set (Navigli et al., 2007). The task organizers provide a coarse-grained sense inventory created with the SSI algorithm (Navigli and Velardi, 2005), training data, and test data. Since our method does not need any training or special tuning, neither the coarse-grained sense inventory nor the training data was used.
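Before turning to the experiments, the scoring performed by TreeMatching in Figure 4 can be condensed as below. The flat node lists with precomputed distances and levels, and the small knowledge-base fragment, are simplifying assumptions made for illustration; this is not the authors' implementation.

def tree_matching(sentence_nodes, gloss_nodes, kb):
    """Score one gloss against the context sentence (cf. Figure 4).

    sentence_nodes: list of (word, distance to the target word) in the sentence parse
    gloss_nodes:    list of (word, level in the gloss parse tree)
    kb:             dict mapping (head, dependent) -> strength in [0, 1]
    """
    score = 0.0
    for gloss_word, level in gloss_nodes:
        w_gloss = 1.0 / level                       # line 16: weight = 1 / level
        # line 15: dependent words of gloss_word according to the knowledge base
        dependents = {dep: s for (head, dep), s in kb.items() if head == gloss_word}
        for sent_word, distance in sentence_nodes:
            w_sent = 1.0 / distance                 # line 12: weight = 1 / distance
            if sent_word in dependents:             # line 18: match found
                strength = dependents[sent_word]    # line 19: connection strength
                score += w_sent * w_gloss * strength   # line 20
    return score

# Reproducing the worked example for gloss 1 of "company":
kb = {("institution", "large"): 0.7,
      ("business", "large"): 0.8,
      ("business", "software"): 0.9}
sentence = [("large", 1.0), ("software", 1.0)]   # both at distance 1 from "company"
gloss1 = [("institution", 1), ("business", 4)]   # levels chosen so 1/level gives 1.0 and 0.25
print(tree_matching(sentence, gloss1, kb))       # 1.0*1.0*0.7 + 1.0*0.25*0.8 + 1.0*0.25*0.9 = 1.125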
The test data includes: a news article about "homeless" (including totally 951 words, 368 words are annotated and need to be disambiguated), a review of the book "Feeding Frenzy" (including totally 987 words, 379 words are annotated and need to be disambiguated), an article about some traveling experience in France (including totally 1311 words, 500 words are annotated and need to be disambiguated), computer programming(including totally 1326 words, 677 words are annotated and need to be disambiguated), and a biography of the painter Masaccio (including totally 802 words, 345 words are annotated and need to be disambiguated). Two authors of (Navigli et al., 2007) independently and manually annotated part of the test set (710 word instances), and the pairwise agreement was 93.80%. This inter-annotator agreement is usually considered an upper-bound for WSD systems. We followed the WSD process described in Section 2 and 3 using the WordNet 2.1 sense repository that is adopted by SemEval-2007 Task 07. All experiments were performed on a Pentium 2.33GHz dual core PC with 3GB memory. Among the 2269 tobe-disambiguated words in the five test documents, 1112 words are unique and submitted to Google API as queries. The retrieved Web pages were cleaned, and 1945189 relevant sentences were extracted. On average 1749 sentences were obtained for each word. The Web page retrieval step took 3 days, and the cleaning step took 2 days. Parsing was very time-consuming and took 11 days. The merging step took 3 days. Disambiguation of 2269 words in the 5 test articles took 4 hours. All these steps can be parallelized and run on multiple computers, and the whole process will be shortened accordingly. The overall disambiguation results are shown in Table 1. For comparison we also listed the results of the top three systems and three unsuper34 vised systems participating in SemEval-2007 Task 07. All of the top three systems (UoR-SSI, NUSPT, NUS-ML) are supervised systems, which used annotated resources (e.g., SemCor, Defense Science Organization Corpus) during the training phase. Our fully unsupervised WSD system significantly outperforms the three unsupervised systems (SUSSZFR, SUSSX-C-WD, SUSSX-CR) and achieves performance approaching the top-performing supervised WSD systems. 4.1 Impact of different matching strategies to disambiguation quality To test the effectiveness of different matching strategies discussed in Section 3, we performed some additional experiments. Table 2 shows the disambiguation results by each individual document with the following 5 matching strategies: 1. Dependency matching only. 2. Dependency and backward matching. 3. Dependency and synonym backward matching. 4. Dependency and synonym dependency matching. 5. Dependency, backward, synonym backward, and synonym dependency matching. As expected combination of more matching strategies results in higher disambiguation quality. By analyzing the scoring details, we verified that backward matching is especially useful to disambiguate adjectives and adverbs. Adjectives and adverbs are often dependent words, so dependency matching itself rarely finds any matched words. Since synonyms are semantically equivalent, it is reasonable that synonym matching can also improve disambiguation performance. 
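Synonym matching, used in strategies 3-5 above, can be approximated with WordNet synsets. A small sketch, assuming NLTK with its WordNet corpus is available; it illustrates the idea rather than reproducing the authors' code.

from nltk.corpus import wordnet as wn   # assumes nltk.download('wordnet') has been run

def synonyms(word):
    """All lemma names that share a WordNet synset with `word`."""
    names = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemma_names():
            names.add(lemma.lower().replace("_", " "))
    return names

def is_match(context_word, kb_word):
    """Exact match, or synonym match through shared WordNet synsets."""
    if context_word == kb_word:
        return True
    return kb_word in synonyms(context_word) or context_word in synonyms(kb_word)

print(is_match("big", "large"))  # True: "big" and "large" share a synset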
Table 1: Overall disambiguation scores (our system "TreeMatch" is marked in bold).

System       Attempted   Precision   Recall   F1
UoR-SSI        100.0       83.21      83.21   83.21
NUS-PT         100.0       82.50      82.50   82.50
NUS-ML         100.0       81.58      81.58   81.58
TreeMatch      100.0       73.65      73.65   73.65
SUSSZ-FR        72.8       71.73      52.23   60.44
SUSSX-C-WD      72.8       54.54      39.71   45.96
SUSSX-CR        72.8       54.30      39.53   45.75

Table 2: Disambiguation scores by article with 5 matching strategies (precision and recall are identical in every cell, so a single value is shown).

Matching strategy   d001    d002    d003    d004    d005    Overall
1                   72.28   66.23   63.20   66.47   56.52   65.14
2                   70.65   70.98   65.20   72.23   58.84   68.18
3                   79.89   75.20   69.00   71.94   64.64   72.01
4                   80.71   78.10   72.80   71.05   67.54   73.65
5                   80.16   78.10   69.40   72.82   66.09   73.12

4.2 Impact of knowledge base size to disambiguation quality

To test the impact of knowledge base size on disambiguation quality, we randomly selected 1,339,264 sentences (about two thirds of all sentences) from our text collection and built a smaller knowledge base. Table 3 shows the experiment results. Overall disambiguation quality has dropped slightly, which shows a positive correlation between the amount of context knowledge and disambiguation quality. It is reasonable to assume that our disambiguation performance can be improved further by collecting and incorporating more context knowledge.

Table 3: Disambiguation scores by article with a smaller knowledge base.

Matching strategy   Overall P/R
1                   65.36
2                   67.78
3                   68.09
4                   70.69
5                   67.78

5 Conclusion and Future Work

Broad coverage and disambiguation quality are critical for WSD techniques to be adopted in practice. This paper proposed a fully unsupervised WSD method. We have evaluated our approach with the SemEval-2007 Task 7 (Coarse-grained English All-words Task) data set, and we achieved F-scores approaching the top performing supervised WSD systems. By using widely available unannotated text and a fully unsupervised disambiguation approach, our method may provide a viable solution to the problem of WSD. The future work includes:

1. Continue to build the knowledge base, enlarge the coverage and improve the system performance. The experiment results in Section 4.2 clearly show that more word instances can improve the disambiguation accuracy and recall scores;
2. WSD is often an unconscious process for human beings. It is unlikely that a reader examines all surrounding words when determining the sense of a word, which calls for a smarter and more selective matching strategy than what we have tried in Section 4.1;
3. Test our WSD system on the fine-grained SemEval-2007 WSD Task 17. Although we only evaluated our approach with coarse-grained senses, our method can be directly applied to fine-grained WSD without any modifications.

Acknowledgments

This work is partially funded by NSF grant 0737408 and the Scholar Academy at the University of Houston-Downtown. This paper contains proprietary information protected under a pending U.S. patent.

References

Agirre, Eneko and Philip Edmonds (eds.). 2006. Word Sense Disambiguation: Algorithms and Applications. Springer.
Chklovski, T. and Mihalcea, R. 2002. Building a sense tagged corpus with Open Mind Word Expert. In Proceedings of the ACL-02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Morristown, NJ, 116-122.
C.
Fellbaum, WordNet: An Electronic Lexical Database, MIT press, 1998 Project Gutenberg, available at www.gutenberg.org Hearst, M. (1991) Noun Homograph Disambiguation Using Local Context in Large Text Corpora, Proc. 7th Annual Conference of the University of Waterloo Center for the New OED and Text Research, Oxford. Nancy Ide and Jean V´ ronis. 1998. Introduction to the e special issue on word sense disambiguation: the state of the art. Comput. Linguist., 24(1):2­40. Kister, Ken. "Dictionaries defined", Library Journal, Vol. 117 Issue 11, p43, 4p, 2bw Lesk, M. 1986. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual international Conference on Systems Documentation (Toronto, Ontario, Canada). V. DeBuys, Ed. SIGDOC '86. Dekang Lin. 1998. Dependency-based evaluation of minipar. In Proceedings of the LREC Workshop on the Evaluation of Parsing Systems, pages 234­241, Granada, Spain. Lin, D. 1997. Using syntactic dependency as local context to resolve word sense ambiguity. In Proceedings of the 35th Annual Meeting of the Association For Computational Linguistics and Eighth Conference of the European Chapter of the Association For Computational Linguistics (Madrid, Spain, July 07 - 12, 1997). Rada Mihalcea, Using Wikipedia for Automatic Word Sense Disambiguation, in Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL 2007), Rochester, April 2007. Rada Mihalcea. 2002. Instance based learning with automatic feature selection applied to word sense disambiguation. In Proceedings of the 19th international conference on Computational linguistics, pages 1­7, Morristown, NJ. Dan Moldovan and Vasile Rus, Explaining Answers with Extended WordNet, ACL 2001. Roberto Navigli, Kenneth C. Litkowski, and Orin Hargraves. 2007. Semeval-2007 task 07: Coarsegrained english all-words task. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 30­35, Prague, Czech Republic. Roberto Navigli and Paola Velardi. 2005. Structural semantic interconnections: a knowledge-based approach to word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 27(7):10631074. Hwee Tou Ng, Bin Wang, and Yee Seng Chan. Exploiting Parallel Texts for Word Sense Disambiguation: An Empirical Study. ACL, 2003. Hwee Tou Ng and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: an exemplar-based approach. In Proceedings of the 34th annual meeting on Association for Computational Linguistics, pages 40­47, Morristown, NJ. Adrian Novischi, Muirathnam Srikanth, and Andrew Bennett. 2007. Lcc-wsd: System description for English coarse grained all words task at semeval 2007. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 223­ 226, Prague, Czech Republic. Catherine Soanes and Angus Stevenson, editors. 2003. Oxford Dictionary of English. Oxford University Press. Rada Mihalcea, available at http://www.cs.unt.edu/ rada/downloads.html Yarowsky, D. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting on Association For Computational Linguistics (Cambridge, Massachusetts, June 26 - 30, 1995). 
36 Learning Phoneme Mappings for Transliteration without Parallel Data Sujith Ravi and Kevin Knight University of Southern California Information Sciences Institute Marina del Rey, California 90292 {sravi,knight}@isi.edu Abstract We present a method for performing machine transliteration without any parallel resources. We frame the transliteration task as a decipherment problem and show that it is possible to learn cross-language phoneme mapping tables using only monolingual resources. We compare various methods and evaluate their accuracies on a standard name transliteration task. We explore the third dimension, where we see several techniques in use: · Manually-constructed transliteration models, e.g., (Hermjakob et al., 2008). · Models constructed from bilingual dictionaries of terms and names, e.g., (Knight and Graehl, 1998; Huang et al., 2004; Haizhou et al., 2004; Zelenko and Aone, 2006; Yoon et al., 2007; Li et al., 2007; Karimi et al., 2007; Sherif and Kondrak, 2007b; Goldwasser and Roth, 2008b). · Extraction of parallel examples from bilingual corpora, using bootstrap dictionaries e.g., (Sherif and Kondrak, 2007a; Goldwasser and Roth, 2008a). · Extraction of parallel examples from comparable corpora, using bootstrap dictionaries, and temporal and word co-occurrence, e.g., (Sproat et al., 2006; Klementiev and Roth, 2008). · Extraction of parallel examples from web queries, using bootstrap dictionaries, e.g., (Nagata et al., 2001; Oh and Isahara, 2006; Kuo et al., 2006; Wu and Chang, 2007). · Comparing terms from different languages in phonetic space, e.g., (Tao et al., 2006; Goldberg and Elhadad, 2008). In this paper, we investigate methods to acquire transliteration mappings from non-parallel sources. We are inspired by previous work in unsupervised learning for natural language, e.g. (Yarowsky, 1995; 1 Introduction Transliteration refers to the transport of names and terms between languages with different writing systems and phoneme inventories. Recently there has been a large amount of interesting work in this area, and the literature has outgrown being citable in its entirety. Much of this work focuses on backtransliteration, which tries to restore a name or term that has been transported into a foreign language. Here, there is often only one correct target spelling--for example, given jyon.kairu (the name of a U.S. Senator transported to Japanese), we must output "Jon Kyl", not "John Kyre" or any other variation. There are many techniques for transliteration and back-transliteration, and they vary along a number of dimensions: · phoneme substitution vs. character substitution · heuristic vs. generative vs. discriminative models · manual vs. automatic knowledge acquisition 37 Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 37­45, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics WFSA - A English word sequence WFST - B ( SPENCER ABRAHAM ) English sound sequence WFST - C Japanese sound sequence WFST - D Japanese katakana sequence () ( S P EH N S ER EY B R AH HH AE M ) ( S U P E N S A A E E B U R A H A M U ) Figure 1: Model used for back-transliteration of Japanese katakana names and terms into English. The model employs a four-stage cascade of weighted finite-state transducers (Knight and Graehl, 1998). Goldwater and Griffiths, 2007), and we are also inspired by cryptanalysis--we view a corpus of foreign terms as a code for English, and we attempt to break the code. 
2 Background We follow (Knight and Graehl, 1998) in tackling back-transliteration of Japanese katakana expressions into English. Knight and Graehl (1998) developed a four-stage cascade of finite-state transducers, shown in Figure 1. · WFSA A - produces an English word sequence w with probability P(w) (based on a unigram word model). · WFST B - generates an English phoneme sequence e corresponding to w with probability P(e|w). · WFST C - transforms the English phoneme sequence into a Japanese phoneme sequence j according to a model P(j|e). · WFST D - writes out the Japanese phoneme sequence into Japanese katakana characters according to a model P(k|j). Using the cascade in the reverse (noisy-channel) direction, they are able to translate new katakana names and terms into English. They report 36% error in translating 100 U.S. Senators' names, and they report exceeding human transliteration performance in the presence of optical scanning noise. The only transducer that requires parallel training data is WFST C. Knight and Graehl (1998) take several thousand phoneme string pairs, automatically align them with the EM algorithm (Dempster et al., 1977), and construct WFST C from the aligned phoneme pieces. We re-implement their basic method by instantiating a densely-connected version of WFST C with 38 all 1-to-1 and 1-to-2 phoneme connections between English and Japanese. Phoneme bigrams that occur fewer than 10 times in a Japanese corpus are omitted, and we omit 1-to-3 connections. This initial WFST C model has 15320 uniformly weighted parameters. We then train the model on 3343 phoneme string pairs from a bilingual dictionary, using the EM algorithm. EM immediately reduces the connections in the model to those actually observed in the parallel data, and after 14 iterations, there are only 188 connections left with P(j|e) 0.01. Figure 2 shows the phonemic substitution table learnt from parallel training. We use this trained WFST C model and apply it to the U.S. Senator name transliteration task (which we update to the 2008 roster). We obtain 40% error, roughly matching the performance observed in (Knight and Graehl, 1998). 3 Task and Data The task of this paper is to learn the mappings in Figure 2, but without parallel data, and to test those mappings in end-to-end transliteration. We imagine our problem as one faced by monolingual English speaker wandering around Japan, reading a multitude of katakana signs, listening to people speak Japanese, and eventually deciphering those signs into English. To mis-quote Warren Weaver: "When I look at a corpus of Japanese katakana, I say to myself, this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode." Our larger motivation is to move toward easily-built transliteration systems for all language pairs, regardless of parallel resources. While Japanese/English transliteration has its own particular features, we believe it is a reasonable starting point. 
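WFST C, the only transducer retrained in this work, behaves like a channel in which each English phoneme emits one or two Japanese phonemes. The sketch below scores a Japanese phoneme sequence given an English one by summing over all such segmentations; it is a toy stand-in for the finite-state machinery, and the substitution probabilities are hypothetical values, not the learnt table.

def channel_prob(eng, jap, table):
    """P(jap | eng) when each English phoneme maps to 1 or 2 Japanese
    phonemes, summing over all segmentations (a toy view of WFST C)."""
    n, m = len(eng), len(jap)
    # dp[i][k] = probability of producing jap[:k] from eng[:i]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 1.0
    for i in range(1, n + 1):
        for k in range(m + 1):
            for width in (1, 2):                        # fertility 1 or 2
                if k >= width:
                    piece = " ".join(jap[k - width:k])
                    p = table.get((eng[i - 1], piece), 0.0)
                    dp[i][k] += dp[i - 1][k - width] * p
    return dp[n][m]

# Hypothetical fragment of a substitution table (English phoneme -> Japanese piece).
table = {("K", "k"): 0.5, ("K", "ku"): 0.2,
         ("AY", "ai"): 0.8,
         ("L", "r"): 0.6, ("L", "r u"): 0.4}
# English K AY L ("Kyl") producing Japanese k ai r u ("kairu"):
print(channel_prob(["K", "AY", "L"], ["k", "ai", "r", "u"], table))  # 0.5 * 0.8 * 0.4 = 0.16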
Figure 2: Phonemic substitution table learnt from 3343 parallel English/Japanese phoneme string pairs. English phonemes are in uppercase, Japanese in lowercase. Mappings with P(j|e) > 0.01 are shown. (Table contents omitted here.)

Figure 3: Some Japanese phoneme sequences generated from the monolingual katakana corpus using WFST D. (Examples omitted here.)

Our monolingual resources are:

· 43717 unique Japanese katakana sequences collected from web newspaper data. We split multi-word katakana phrases on the center-dot ("·") character, and select a final corpus of 9350 unique sequences. We add monolingual Japanese versions of the 2008 U.S. Senate roster. (We use "open" EM testing, in which unlabeled test data is allowed to be part of unsupervised training. However, no parallel data is allowed.)
· The CMU pronunciation dictionary of English, with 112,151 entries.
· The English gigaword corpus. Knight and Graehl (1998) already use frequently-occurring capitalized words to build the WFSA A component of their four-stage cascade.

We seek to use our English knowledge (derived from 2 and 3) to decipher the Japanese katakana corpus (1) into English. Figure 3 shows a portion of the Japanese corpus, which we transform into Japanese phoneme sequences using the monolingual resource of WFST D.
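The corpus preparation just described, splitting phrases on the katakana middle dot and expanding katakana into phoneme sequences via WFST D, can be approximated as below. The character-to-phoneme table is a tiny illustrative subset and the helper functions are ours; they are not the actual WFST D, which covers the full syllabary and multi-letter phonemes such as "sh" and "ch".

# Tiny illustrative katakana-to-romanized-syllable table.
KANA = {"ア": "a", "イ": "i", "ウ": "u", "エ": "e", "オ": "o",
        "カ": "ka", "ク": "ku", "サ": "sa", "ス": "su", "ト": "to",
        "ミ": "mi", "ラ": "ra", "ン": "n"}

def split_phrases(katakana_phrase):
    """Split a multi-word katakana phrase on the middle dot."""
    return [w for w in katakana_phrase.split("・") if w]

def to_phonemes(word):
    """Map a katakana word to a flat phoneme sequence (toy stand-in for WFST D)."""
    phonemes = []
    for ch in word:
        if ch == "ー" and phonemes:          # long-vowel mark: repeat the last vowel
            phonemes.append(phonemes[-1])
        else:
            # Each toy syllable is at most consonant + vowel, so splitting it
            # into single letters is enough here.
            phonemes.extend(list(KANA.get(ch, "?")))
    return phonemes

corpus = {w for phrase in ["スミス・サトウ"] for w in split_phrases(phrase)}
for word in sorted(corpus):
    print(word, to_phonemes(word))   # e.g. スミス -> ['s', 'u', 'm', 'i', 's', 'u']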
We note that the Japanese phoneme inventory contains 39 unique ("ciphertext") symbols, compared to the 40 English ("plaintext") phonemes.

Our goal is to compare and evaluate the WFST C model learnt under two different scenarios--(a) using parallel data, and (b) using monolingual data. For each experiment, we train only the WFST C model and then apply it to the name transliteration task--decoding 100 U.S. Senator names from Japanese to English using the automata shown in Figure 1. For all experiments, we keep the rest of the models in the cascade (WFSA A, WFST B, and WFST D) unchanged. We evaluate on whole-name error-rate (maximum of 100/100) as well as normalized word edit distance, which gives partial credit for getting the first or last name correct.

4 Acquiring Phoneme Mappings from Non-Parallel Data

Our main data consists of 9350 unique Japanese phoneme sequences, which we can consider as a single long sequence j. As suggested by Knight et al (2006), we explain the existence of j as the result of someone initially producing a long English phoneme sequence e, according to P(e), then transforming it into j, according to P(j|e). The probability of our observed data P(j) can be written as:

P(j) = Σ_e P(e) · P(j|e)

We take P(e) to be some fixed model of monolingual English phoneme production, represented as a weighted finite-state acceptor (WFSA). P(j|e) is implemented as the initial, uniformly-weighted WFST C described in Section 2, with 15320 phonemic connections.

We next maximize P(j) by manipulating the substitution table P(j|e), aiming to produce a result such as shown in Figure 2. We accomplish this by composing the English phoneme model P(e) WFSA with the P(j|e) transducer. We then use the EM algorithm to train just the P(j|e) parameters (inside the composition that predicts j), and guess the values for the individual phonemic substitutions that maximize the likelihood of the observed data P(j). In our experiments, we use the Carmel finite-state transducer package (Graehl, 1997), a toolkit with an algorithm for EM training of weighted finite-state transducers.

We allow EM to run until the P(j) likelihood ratio between subsequent training iterations reaches 0.9999, and we terminate early if 200 iterations are reached. Finally, we decode our test set of U.S. Senator names. Following Knight et al (2006), we stretch out the P(j|e) model probabilities after decipherment training and prior to decoding our test set, by cubing their values.

Decipherment under the conditions of transliteration is substantially more difficult than solving letter-substitution ciphers (Knight et al., 2006; Ravi and Knight, 2008; Ravi and Knight, 2009) or phoneme-substitution ciphers (Knight and Yamada, 1999). This is because the target table contains significant non-determinism, and because each symbol has multiple possible fertilities, which introduces uncertainty about the length of the target string.

4.1 Baseline P(e) Model

Clearly, we can design P(e) in a number of ways. We might expect that the more the system knows about English, the better it will be able to decipher the Japanese. Our baseline P(e) is a 2-gram phoneme model trained on phoneme sequences from the CMU dictionary. The second row (2a) in Figure 4 shows results when we decipher with this fixed P(e). This approach performs poorly and gets all the Senator names wrong.

4.2 Consonant Parity
When training under non-parallel conditions, we find that we would like to keep our WFST C model small, rather than instantiating a fully-connected model. In the supervised case, parallel training allows the trained model to retain only those connections which were observed from the data, and this helps eliminate many bad connections from the model. In the unsupervised case, there is no parallel data available to help us make the right choices. We therefore use prior knowledge and place a consonant-parity constraint on the WFST C model. Prior to EM training, we throw out any mapping from the P(j|e) substitution model that does not have the same number of English and Japanese consonant phonemes. This is a pattern that we observe across a range of transliteration tasks. Here are examples of mappings where consonant parity is violated:

K => a      EH => s a
N => e e    EY => n

Modifying the WFST C in this way leads to better decipherment tables and slightly better results for the U.S. Senator task. Normalized edit distance drops from 100 to just under 90 (row 2b in Figure 4).

Figure 4: Results on name transliteration obtained when using the phonemic substitution model trained under different scenarios--(1) parallel training data, (2a-e) using only monolingual resources. Each row lists whole-name error / normalized edit distance.

1.  e -> j = {1-to-1, 1-to-2} + EM aligned with parallel data: 40 / 25.9
2a. e -> j = {1-to-1, 1-to-2} + decipherment training with 2-gram English P(e): 100 / 100.0
2b. e -> j = {1-to-1, 1-to-2} + decipherment training with 2-gram English P(e) + consonant-parity: 98 / 89.8
2c. e -> j = {1-to-1, 1-to-2} + decipherment training with 3-gram English P(e) + consonant-parity: 94 / 73.6
2d. e -> j = {1-to-1, 1-to-2} + decipherment training with a word-based English model + consonant-parity: 77 / 57.2
2e. e -> j = {1-to-1, 1-to-2} + decipherment training with a word-based English model + consonant-parity + initialization of mappings having consonant matches with higher probability weights: 73 / 54.2

4.3 Better English Models

Row 2c in Figure 4 shows decipherment results when we move to a 3-gram English phoneme model for P(e). We notice considerable improvements in accuracy. On the U.S. Senator task, normalized edit distance drops from 89.8 to 73.6, and whole-name error decreases from 98 to 94. When we analyze the results from deciphering with a 3-gram P(e) model, we find that many of the Japanese phoneme test sequences are decoded into English phoneme sequences (such as "IH K R IH N" and "AE G M AH N") that are not valid words. This happens because the models we used for decipherment so far have no knowledge of what constitutes a globally valid English sequence. To help the phonemic substitution model learn this information automatically, we build a word-based P(e) from English phoneme sequences in the CMU dictionary and use this model for decipherment training. The word-based model produces complete English phoneme sequences corresponding to 76,152 actual English words from the CMU dictionary. The English phoneme sequences are represented as paths through a WFSA, and all paths are weighted equally. We represent the word-based model in compact form, using determinization and minimization techniques applicable to weighted finite-state automata. This allows us to perform efficient EM training on the cascade of P(e) and P(j|e) models. Under this scheme, English phoneme sequences resulting from decipherment are always analyzable into actual words.
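The property that deciphered outputs are analyzable into actual words can be checked with a simple dynamic program over dictionary phoneme sequences, which is essentially the constraint that restricting P(e) to whole-word paths enforces. A sketch under the assumption that the dictionary is given as tuples of phonemes; this is an illustration, not the finite-state implementation used in the paper.

def analyzable(phonemes, word_pronunciations):
    """True if `phonemes` can be segmented into a sequence of dictionary words.

    phonemes:            tuple of phoneme symbols, e.g. ("JH", "AA", "N")
    word_pronunciations: set of phoneme tuples, one per dictionary entry
    """
    n = len(phonemes)
    reachable = [False] * (n + 1)
    reachable[0] = True
    max_len = max((len(w) for w in word_pronunciations), default=0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if reachable[start] and phonemes[start:end] in word_pronunciations:
                reachable[end] = True
                break
    return reachable[n]

# Hypothetical mini-dictionary of CMU-style pronunciations (stress markers dropped).
lexicon = {("JH", "AA", "N"), ("K", "AY", "L")}
print(analyzable(("JH", "AA", "N", "K", "AY", "L"), lexicon))  # True: two dictionary words
print(analyzable(("IH", "K", "R", "IH", "N"), lexicon))        # False: not decomposable into words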
Row 2d in Figure 4 shows the results we obtain when training our WFST C with a word-based English phoneme model. Using the word-based model produces the best result so far on the phonemic substitution task with non-parallel data. On the U.S. Senator task, word-based decipherment outperforms the other methods by a large margin. It gets 23 out of 100 Senator names exactly right, with a much lower normalized edit distance (57.2). We have managed to achieve this performance using only monolingual data. This also puts us within reach of the parallel-trained system's performance (40% whole-name errors, and 25.9 word edit distance error) without using a single English/Japanese pair for training.

To summarize, the quality of the English phoneme model used in decipherment training has a large effect on the learnt P(j|e) phonemic substitution table (i.e., probabilities for the various phoneme mappings within the WFST C model), which in turn affects the quality of the back-transliterated English output produced when decoding Japanese.

Figure 5: Phonemic substitution table learnt from non-parallel corpora. For each English phoneme, only the top ten mappings with P(j|e) > 0.01 are shown. (Table contents omitted here.)

Figure 5 shows the phonemic substitution table learnt using word-based decipherment. The mappings are reasonable, given the lack of parallel data. They are not entirely correct--for example, the mapping "S => s u" is there, but "S => s" is missing. Sample end-to-end transliterations are illustrated in Figure 6. The figure shows how the transliteration results from non-parallel training improve steadily as we use stronger decipherment techniques. We note that in one case (LAUTENBERG), the decipherment mapping table leads to a correct answer where the mapping table derived from parallel data does not. Because parallel data is limited, it may not contain all of the necessary mappings.

Figure 6: Results for end-to-end name transliteration. This figure shows the correct answer, the answer obtained by training mappings on parallel data (Knight and Graehl, 1998), and various answers obtained by deciphering non-parallel data. Method 1 uses a 2-gram P(e), Method 2 uses a 3-gram P(e), and Method 3 uses a word-based P(e). (Figure contents omitted here.)

4.4 Size of Japanese Training Data

Monolingual corpora are more easily available than parallel corpora, so we can use increasing amounts of monolingual Japanese training data during decipherment training. The table below shows that using more Japanese training data produces better transliteration results when deciphering with the word-based English model.

Japanese training data (# of phoneme sequences)   Whole-name error   Normalized word edit distance
4,674                                                    87                   69.7
9,350                                                    77                   57.2

4.5 P(j|e) Initialization

So far, the P(j|e) connections within the WFST C model were initialized with uniform weights prior to EM training. The EM algorithm does not necessarily find a global optimum of its objective function, and when the search space is bumpy and non-convex, as is the case in our problem, EM can get stuck in a local optimum depending on what weights were used to initialize the search. Different sets of initialization weights can lead to different convergence points during EM training; in other words, depending on how the P(j|e) probabilities are initialized, the final P(j|e) substitution table learnt by EM can vary.

We can use some prior knowledge to initialize the probability weights in our WFST C model, so as to give EM a good starting point to work with. Instead of using uniform weights, in the P(j|e) model we set higher weights for the mappings where the English and Japanese sounds share common consonant phonemes. For example, mappings such as:

N => n      D => d
N => a n    D => d o

are weighted X (a constant) times higher than other mappings such as:

N => b      D => B
N => r      EY => a a

in the P(j|e) model. In our experiments, we set the value X to 100. Initializing the WFST C in this way results in EM learning better substitution tables and yields slightly better results for the Senator task. Normalized edit distance drops from 57.2 to 54.2, and the whole-name error is also reduced from 77% to 73% (row 2e in Figure 4).
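The row 2e initialization can be sketched as below: mappings whose English and Japanese sides appear to share a consonant receive X times the weight of the remaining mappings before per-phoneme normalization. X = 100 follows the text; the crude string-based consonant test and the helper names are our assumptions, not the authors' code.

VOWELS_EN = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
             "IH", "IY", "OW", "OY", "UH", "UW"}
VOWELS_JA = set("aeiou")

def consonant_match(e_phoneme, j_piece):
    """Crude test: does the Japanese piece contain a phoneme starting with the
    same consonant letter as the English phoneme? (Stand-in for a real
    phonetic comparison.)"""
    if e_phoneme in VOWELS_EN:
        return False
    e_cons = e_phoneme[0].lower()
    j_consonants = {p[0] for p in j_piece.split() if p[0] not in VOWELS_JA}
    return e_cons in j_consonants

def initial_weights(mappings, x=100.0):
    """Give weight x to consonant-matching mappings and 1 to the rest, then
    normalize per English phoneme so each row of P(j|e) sums to 1."""
    raw = {(e, j): (x if consonant_match(e, j) else 1.0) for e, j in mappings}
    totals = {}
    for (e, j), w in raw.items():
        totals[e] = totals.get(e, 0.0) + w
    return {(e, j): w / totals[e] for (e, j), w in raw.items()}

mappings = [("N", "n"), ("N", "a n"), ("N", "b"), ("N", "r"),
            ("D", "d"), ("D", "d o"), ("EY", "a a")]
weights = initial_weights(mappings)
print(round(weights[("N", "n")], 3), round(weights[("N", "b")], 3))  # 0.495 0.005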
4.6 Size of English Training Data We saw earlier (in Section 4.4) that using more monolingual Japanese training data yields improvements in decipherment results. Similarly, we hypothesize that using more monolingual English data can drive the decipherment towards better transliteration results. On the English side, we build different word-based P(e) models, each trained on different amounts of data (English phoneme sequences from the CMU dictionary). The table below shows that deciphering with a word-based English model built from more data produces better transliteration results. English training data (# of phoneme sequences) 76,152 97,912 Error on name transliteration task whole-name error normalized word edit distance 73 54.2 66 49.3 This yields the best transliteration results on the Senator task with non-parallel data, getting 34 out of 100 Senator names exactly right. 4.7 Re-ranking Results Using the Web It is possible to improve our results on the U.S. Senator task further using external monolingual resources. Web counts are frequently used to automatically re-rank candidate lists for various NLP tasks (Al-Onaizan and Knight, 2002). We extract the top 10 English candidates produced by our wordbased decipherment method for each Japanese test name. Using a search engine, we query the entire English name (first and last name) corresponding to each candidate, and collect search result counts. We then re-rank the candidates using the collected Web counts and pick the most frequent candidate as our choice. For example, France Murkowski gets only 1 hit on Google, whereas Frank Murkowski gets 135,000 hits. Re-ranking the results in this manner lowers the whole-name error on the Senator task from 66% to 61%, and also lowers the normalized edit distance from 49.3 to 48.8. However, we do note that re-ranking using Web counts produces similar improvements in the case of parallel training as well and lowers the whole-name error from 40% to 24%. So, the re-ranking idea, which is simple and requires only monolingual resources, seems like a nice strategy to apply at the end of transliteration experiments (during decoding), and can result in further gains on the final transliteration performance. quence, the correct back-transliterated phoneme sequence is present somewhere in the English data) and apply the same decipherment strategy using a word-based English model. The table below compares the transliteration results for the U.S. Senator task, when using comparable versus non-parallel data for decipherment training. While training on comparable corpora does have benefits and reduces the whole-name error to 59% on the Senator task, it is encouraging to see that our best decipherment results using only non-parallel data comes close (66% error). English/Japanese Corpora (# of phoneme sequences) Comparable Corpora (English = 2,608 Japanese = 2,455) Non-Parallel Corpora (English = 98,000 Japanese = 9,350) Error on name transliteration task whole-name error normalized word edit distance 59 41.8 66 49.3 6 Conclusion 5 Comparable versus Non-Parallel Corpora We have presented a method for attacking machine transliteration problems without parallel data. We developed phonemic substitution tables trained using only monolingual resources and demonstrated their performance in an end-to-end name transliteration task. 
We showed that consistent improvements in transliteration performance are possible with the use of strong decipherment techniques, and our best system achieves significant improvements over the baseline system. In future work, we would like to develop more powerful decipherment models and techniques, and we would like to harness the information available from a wide variety of monolingual resources, and use it to further narrow the gap between parallel-trained and non-parallel-trained approaches. We also present decipherment results when using comparable corpora for training the WFST C model. We use English and Japanese phoneme sequences derived from a parallel corpus containing 2,683 phoneme sequence pairs to construct comparable corpora (such that for each Japanese phoneme se44 7 Acknowledgements This research was supported by the Defense Advanced Research Projects Agency under SRI International's prime Contract Number NBCHD040058. References Y. Al-Onaizan and K. Knight. 2002. Translating named entities using monolingual and bilingual resources. In Proc. of ACL. A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series, 39(4):1­38. Y. Goldberg and M. Elhadad. 2008. Identification of transliterated foreign words in Hebrew script. In Proc. of CICLing. D. Goldwasser and D. Roth. 2008a. Active sample selection for named entity transliteration. In Proc. of ACL/HLT Short Papers. D. Goldwasser and D. Roth. 2008b. Transliteration as constrained optimization. In Proc. of EMNLP. S. Goldwater and L. Griffiths, T. 2007. A fully Bayesian approach to unsupervised part-of-speech tagging. In Proc. of ACL. J. Graehl. 1997. Carmel finite-state toolkit. http://www.isi.edu/licensed-sw/carmel. L. Haizhou, Z. Min, and S. Jian. 2004. A joint sourcechannel model for machine transliteration. In Proc. of ACL. U. Hermjakob, K. Knight, and H. Daume. 2008. Name translation in statistical machine translation--learning when to transliterate. In Proc. of ACL/HLT. F. Huang, S. Vogel, and A. Waibel. 2004. Improving named entity translation combining phonetic and semantic similarities. In Proc. of HLT/NAACL. S. Karimi, F. Scholer, and A. Turpin. 2007. Collapsed consonant and vowel models: New approaches for English-Persian transliteration and backtransliteration. In Proc. of ACL. A. Klementiev and D. Roth. 2008. Named entity transliteration and discovery in multilingual corpora. In Learning Machine Translation. MIT press. K. Knight and J. Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599­612. K. Knight and K. Yamada. 1999. A computational approach to deciphering unknown scripts. In Proc. of the ACL Workshop on Unsupervised Learning in Natural Language Processing. K. Knight, A. Nair, N. Rathod, and K. Yamada. 2006. Unsupervised analysis for decipherment problems. In Proc. of COLING/ACL. J. Kuo, H. Li, and Y. Yang. 2006. Learning transliteration lexicons from the web. In Proc. of ACL/COLING. H. Li, C. Sim, K., J. Kuo, and M. Dong. 2007. Semantic transliteration of personal names. In Proc. of ACL. M. Nagata, T. Saito, and K. Suzuki. 2001. Using the web as a bilingual dictionary. In Proc. of the ACL Workshop on Data-driven Methods in Machine Translation. J. Oh and H. Isahara. 2006. Mining the web for transliteration lexicons: Joint-validation approach. In Proc. of the IEEE/WIC/ACM International Conference on Web Intelligence. S. Ravi and K. Knight. 2008. 
Attacking decipherment problems optimally with low-order n-gram models. In Proc. of EMNLP. S. Ravi and K. Knight. 2009. Probabilistic methods for a Japanese syllable cipher. In Proc. of the International Conference on the Computer Processing of Oriental Languages (ICCPOL). T. Sherif and G. Kondrak. 2007a. Bootstrapping a stochastic transducer for arabic-english transliteration extraction. In Proc. of ACL. T. Sherif and G. Kondrak. 2007b. Substring-based transliteration. In Proc. of ACL. R. Sproat, T. Tao, and C. Zhai. 2006. Named entity transliteration with comparable corpora. In Proc. of ACL. T. Tao, S. Yoon, A. Fister, R. Sproat, and C. Zhai. 2006. Unsupervised named entity transliteration using temporal and phonetic correlation. In Proc. of EMNLP. J. Wu and S. Chang, J. 2007. Learning to find English to Chinese transliterations on the web. In Proc. of EMNLP/CoNLL. D. Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proc. of ACL. S. Yoon, K. Kim, and R. Sproat. 2007. Multilingual transliteration using feature based phonetic method. In Proc. of ACL. D. Zelenko and C. Aone. 2006. Discriminative methods for transliteration. In Proc. of EMNLP. 45 A Corpus-Based Approach for the Prediction of Language Impairment in Monolingual English and Spanish-English Bilingual Children Keyur Gabani and Melissa Sherman and Thamar Solorio and Yang Liu Department of Computer Science The University of Texas at Dallas keyur,mesh,tsolorio,yangl@hlt.utdallas.edu ~ Lisa M. Bedore and Elizabeth D. Pena Department of Communication Sciences and Disorders The University of Texas at Austin lbedore,lizp@mail.utexas.edu Abstract In this paper we explore a learning-based approach to the problem of predicting language impairment in children. We analyzed spontaneous narratives of children and extracted features measuring different aspects of language including morphology, speech fluency, language productivity and vocabulary. Then, we evaluated a learning-based approach and compared its predictive accuracy against a method based on language models. Empirical results on monolingual English-speaking children and bilingual Spanish-English speaking children show the learning-based approach is a promising direction for automatic language assessment. 1 Introduction The question of how best to identify children with language disorders is a topic of ongoing debate. One common assessment approach is based on cutoff scores from standardized, norm-referenced language assessment tasks. Children scoring at the lower end of the distribution, typically more than 1.25 or 1.5 Standard Deviations (SD) below the mean, are identified as having language impairment (Tomblin et al., 1997). This cutoff-based approach has several well-documented weaknesses that may result in both over- and under-identification of children as language impaired (Plante and Vance, 1994). Recent studies have suggested considerable overlap between children with language impairment and their typically developing cohorts on many of these tasks (e.g., (Pe~ a et al., 2006b; Spaulding et n 46 al., 2006)). In addition, scores and cutoffs on standardized tests depend on the distribution of scores from the particular samples used in normalizing the measure. Thus, the validity of the measure for children whose demographic and other socioeconomic characteristics are not well represented in the test's normative sample is a serious concern. 
Finally, most norm-referenced tests of language ability rely heavily on exposure to mainstream language and experiences, and have been found to be biased against children from families with low parental education and socioeconomic status, as well as children from different ethnic backgrounds (Campbell et al., 1997; Dollaghan and Campbell, 1998). This paper aims to develop a reliable and automatic method for identifying the language status of children. We propose the use of different lexico-syntactic features, typically used in computational linguistics, in combination with features inspired by current assessment practices in the field of language disorders, to train Machine Learning (ML) algorithms. The two main contributions of this paper are: 1) It is one step towards developing a reliable and automatic approach for language status prediction in English-speaking children; 2) It provides evidence showing that the same approach can be adapted to predict language status in Spanish-English bilingual children.

2 Related Work

2.1 Monolingual English-Speaking Children Several hypotheses exist that try to explain the grammatical deficits of children with Language Impairment (LI). Young children normally go through a stage where they use non-finite forms of verbs in grammatical contexts where finite forms are required (Wexler, 1994). This is referred to as the optional infinitive stage. The Extended Optional Infinitive (EOI) theory (Rice and Wexler, 1996) suggests that children with LI exhibit the use of a "young" grammar for an extended period of time, where tense, person, and number agreement markers are omitted. In contrast to the EOI theory, the surface account theory (Leonard et al., 1997) assumes that children with LI have reduced processing capabilities. This deficit affects the perception of low-stress morphemes, such as -ed, -s, be and do, resulting in an inconsistent use of these verb morphemes. Spontaneous narratives are considered one of the most ecologically valid ways to measure communicative competence (Botting, 2002). They represent various aspects involved in children's everyday communication. Typical measures for spontaneous language samples include Mean Length of Utterance (MLU) in words, Number of Different Words (NDW), and errors in grammatical morphology. Assessment approaches compare children's performance on these measures against expected performance. As mentioned in Section 1, these cutoff-based methods raise questions concerning accuracy and bias. Manually analyzing the narratives is also a very time-consuming task. After transcribing the sample, clinicians need to code for the different clinical markers and other morphosyntactic information. This can take up to several hours for each child, making it infeasible to analyze a large number of samples.

2.2 Bilingual Spanish-English Speaking Children Bilingual children face even more identification challenges due to their dual language acquisition. They can be mistakenly labeled as LI due to: 1) the inadequate use of translations of assessment tools; 2) an over-reliance on features specific to English; 3) a lack of appropriate expectations about how the languages of a bilingual child should develop (Bedore and Peña, 2008); or 4) the use of standardized tests where the normal distribution used to compare language performance is composed of monolingual children (Restrepo and Gutiérrez-Clellen, 2001). Spanish-speaking children with LI show different clinical markers than English-speaking children with LI. As mentioned above, English speakers have problems with verb morphology. In contrast, Spanish speakers have been found to have problems with noun morphology, in particular in the use of articles and clitics (Restrepo and Gutiérrez-Clellen, 2001; Jacobson and Schwartz, 2002; Bedore and Leonard, 2005).
Bedore and Leonard (2005) also found differences in the error patterns of Spanish and related languages such as Italian. Spanish speakers tend to both omit and substitute articles and clitics, while the dominant errors for Italian speakers are omissions.

3 Our Approach We use language models (LMs) in our initial investigation, and later explore more complex ML algorithms to improve the results. Our ultimate goal is to discover a highly accurate ML method that can be used to assist clinicians in the task of LI identification in children.

3.1 Language Models for Predicting Language Impairment LMs are statistical models used to estimate the probability of a given sequence of words. They have been explored previously for clinical purposes. Roark et al. (2007) proposed cross-entropy of LMs trained on Part-of-Speech (POS) sequences as a measure of syntactic complexity with the aim of determining mild cognitive impairment in adults. Solorio and Liu (2008) evaluated LMs on a small data set in a preliminary trial on LI prediction. The intuition behind using LMs is that they can identify atypical grammatical patterns and help discriminate the population with potential LI from the Typically Developing (TD) one. We use LMs trained on POS tags rather than on words. Using POS tags can address the data sparsity issue in LMs, and place less emphasis on the vocabulary and more emphasis on the syntactic patterns. We trained two separate LMs using POS tags from the transcripts of TD and LI children, respectively. The language status of a child is predicted using the following criterion:

d(s) = \begin{cases} \text{LI} & \text{if } PP_{TD}(s) > PP_{LI}(s) \\ \text{TD} & \text{otherwise,} \end{cases}

where s represents a transcript from a child, and PP_{TD}(s) and PP_{LI}(s) are the perplexity values from the TD and LI LMs, respectively. We used the SRI Language Modeling Toolkit (Stolcke, 2002) for training the LMs and calculating perplexities.

3.2 Machine Learning for Predicting Language Impairment Although LMs have been used successfully on different human language processing tasks, they are typically trained and tested on language samples larger than what is usually collected by clinicians for the purpose of diagnosing a child with potential LI. Clinicians make use of additional information beyond children's speech, such as parent and teacher questionnaires and test scores on different language assessment tasks. Therefore, in addition to using LMs for predicting children's language status, we explore a machine learning classification approach that can incorporate more information for better prediction. We aim to identify effective features for this task and expect this information will help clinicians in their assessment. We consider various ML algorithms for the classification task, including Naive Bayes, Artificial Neural Networks (ANNs), Support Vector Machines (SVM), and Boosting with Decision Stumps. Weka (Witten and Frank, 1999) was used in our experiments due to its known reliability and the availability of a large number of algorithms.
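The decision criterion of Section 3.1 reduces to a one-line comparison once the two perplexities have been computed for a transcript (for instance, with SRILM's ngram -ppl on the transcript's POS-tag sequence). The sketch below, with function and variable names of our own choosing, illustrates it.

```python
# Sketch of the LM-based decision rule: label a transcript LI when the
# TD-trained language model is more "surprised" by it than the LI-trained one.

def predict_language_status(pp_td, pp_li):
    """pp_td, pp_li: perplexities of the transcript's POS sequence under the
    TD-trained and LI-trained language models, respectively."""
    return "LI" if pp_td > pp_li else "TD"

# Example: a transcript whose POS patterns fit the LI-trained model better.
print(predict_language_status(pp_td=310.5, pp_li=268.2))  # -> "LI"
```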
Below we provide a comprehensive list of features that we explored for both English and Spanish-English transcripts. We group these features according to the aspect of language they focus on. Features specific to Spanish are discussed in Section 5.2. 1. Language productivity (a) Mean Length of Utterance (MLU) in words Due to a general deficit of language ability, children with LI have been found to produce language samples with a shorter MLU in words because they produce 48 grammatically simpler sentences when compared to their TD peers. (b) Total number of words This measure is widely used when building language profiles of children for diagnostic and treatment purposes. (c) Degree of support In spontaneous samples of children's speech, it has been pointed out that children with potential LI need more encouragement from the investigator (Wetherell et al., 2007) than their TD peers. A support prompt can be a question like "What happened next?" We count the number of utterances, or turns, of the investigator interviewing the child for this feature. 2. Morphosyntactic skills (a) Ratio of number of raw verbs to the total number of verbs As mentioned previously, children with LI omit tense markers in verbs more often than their TD cohorts. For example: ...the boy look into the hole but didn't find... Hence, we include the ratio of the number of raw verbs to the total number of verbs as a feature. (b) Subject-verb agreement Research has shown that English-speaking children with LI have difficulties marking subject-verb agreement (Clahsen and Hansen, 1997; Sch¨ tze and Wexler, 1996). u An illustration of subject-verb disagreement is the following: ...and he were looking behind the rocks As a way of capturing this information in the machine learning setting, we consider various bigrams of POS tags: noun and verb, noun and auxiliary verb, pronoun and verb, and pronoun and auxiliary verb. These features are included in a bagof-words fashion using individual counts. Also, we allow a window between these pairs to capture agreement between sub- ject and verb that may have modifiers in between. (c) Number of different POS tags This feature is the total number of different POS tags in each transcript. 3. Vocabulary knowledge We use the Number of Different Words (NDW) to represent vocabulary knowledge of a child. Although such measures can be biased against children from different backgrounds, we expect this possible negative effect to decrease as a result of having a richer pool of features. 4. Speech fluency Repetitions, revisions, and filled pauses have been considered indicators of language learning difficulties (Thordardottir and Weismer, 2002; Wetherell et al., 2007). In this work we include as features (a) the number of fillers, such as uh, um, er; and (b) the number of disfluencies (abandoned words) found in each transcript. 5. Perplexities from LMs As mentioned in Section 3.1 we trained LMs of order 1, 2, and 3 on POS tags extracted from TD and LI children. We use the perplexity values from these models as features. Additionally, differences in perplexity values from LI and TD LMs for different orders are used as features. 6. Standard scores A standard score, known as a z-score, is the difference between an observation and the mean relative to the standard deviation. For this feature group, we first find separate distributions for the MLU in words, NDW and total number of utterances for the TD and LI populations. 
Then, for each transcript, we compute the standard scores based on each of these six distributions. This represents how well the child is performing relative to the TD and LI populations. Note that a cross-validation setup was used to obtain the distributions for the TD and LI children for training. This is also required for the LM features above.

4 Experiments with Monolingual Children

4.1 The Monolingual English Data Set Our target population for this work is children with an age range of 3 to 6 years old. However, currently we do not have any monolingual data sets readily available to test our approach in this age range. In the field of communication disorders, data sharing is not a common practice, due to the sensitive content of the material in the language samples of children, and also due to the large amount of effort and time it takes researchers to collect, transcribe, and code the data before they can begin their analysis. To evaluate our approach we used a dataset from CHILDES (MacWhinney, 2000) that includes narratives from English-speaking adolescents with and without LI, with ages ranging between 13 and 16 years old. Even though the age range is outside the range we are interested in, we believe that this data set can still be helpful in exploring the feasibility of our approach as a first step. This data set contains 99 TD adolescents and 19 adolescents who met the LI profile at one point in the duration of the study. There are transcripts from each child for two tasks: a story telling task and a spontaneous personal narrative. The first task is a picture-prompted story telling task using the wordless picture book "Frog, Where Are You?" (Mayer, 1969). In this story telling task children first look at the story book, to develop a story in memory, and then are asked to narrate the story. This type of elicitation task encourages the use of past tense constructions, providing plenty of opportunities for extracting clinical markers. In the spontaneous personal narrative task, the child is asked to talk about a person who annoys him/her the most and describe the most annoying features of that person. This kind of spontaneous personal narrative encourages the participant to use third person singular forms (-s). Detailed information on this data set can be found in Wetherell et al. (2007). We processed the transcripts using the CLAN toolkit (MacWhinney, 2000). MOR and POST from CLAN are used for morphological analysis and POS tagging of the children's speech. We decided to use these analyzers since they are customized for children's speech.

Table 1: Evaluation of language models on the monolingual English data set.
                Story telling                  Personal narrative
Method          P (%)   R (%)   F1 (%)         P (%)   R (%)   F1 (%)
Baseline        28.57   10.53   15.38          33.33   15.79   21.43
1-gram LMs      41.03   84.21   55.17          34.21   68.42   45.61
2-gram LMs      75.00   47.37   58.06          55.56   26.32   35.71
3-gram LMs      80.00   21.05   33.33          87.50   36.84   51.85

Table 2: Evaluation of machine learning algorithms on the monolingual English data set.
                   Story telling                  Personal narrative
Algorithm          P (%)   R (%)   F1 (%)         P (%)   R (%)   F1 (%)
Naive Bayes        38.71   63.16   48.00          34.78   42.11   38.10
Bayesian Network   58.33   73.68   65.12          28.57   42.11   34.04
SVM                76.47   68.42   72.22          47.06   42.11   44.44
ANNs               62.50   52.63   57.14          50.00   47.37   48.65
Boosting           70.59   63.16   66.67          69.23   47.37   56.25

4.2 Results with Monolingual English-Speaking Children The performance measures we use are: precision (P), recall (R), and F-measure (F1).
Here the LI category is the positive class and the TD category is the negative class. Table 1 shows the results of leave-one-out cross-validation (LOOCV) obtained from the LM approach for the story telling and spontaneous personal narrative tasks. It also shows results from a baseline method that predicts language status by using standard scores on measures that have been associated with LI in children (Dollaghan, 2004). The three measures we used for the baseline are: MLU in words, NDW, and total number of utterances produced. To compute this baseline we estimate the mean and standard deviation of these measures using LOOCV with the TD population as our normative sample. The baseline predicts that a child has LI if the child scores more than 1.25 SD below the mean on at least two out of the three measures. Although LMs yield different results for the story telling and personal narrative tasks, they both provide consistently better results than the baseline. For the story telling task the best results, in terms of the F1 measure, are achieved by a bigram LM (F1 = 58.06%), while for the personal narrative the highest F1 measure (51.85%) is from the trigram LM. If we consider precision, both tasks show the same increasing pattern with increasing LM order. However, for recall that is not the case. In the story telling task, recall decreases at the expense of higher precision, but for the personal narrative task, the trigram LM reaches a better trade-off between precision and recall, which yields a high F1 measure. We also evaluated 4-gram LMs, but results did not improve, most likely because we do not have enough data to train higher-order LMs. The results for different ML algorithms are shown in Table 2, obtained by using all features described in Section 3.2. The feature-based approach using ML algorithms outperformed using only LMs on both tasks. For the story telling task, SVM with a linear kernel achieves the best results (F1 = 72.22%), while Boosting with Decision Stumps provides the best performance (F1 = 56.25%) for the personal narrative task.

4.3 Feature and Error Analysis The ML results shown above use the entire feature set described in Subsection 3.2. The next question we ask concerns the effectiveness of different features for this task. The datasets we are using in our evaluation are very small, especially considering the number of positive instances. This prevents us from having a separate subset of the data for parameter tuning or feature selection. Therefore, we performed additional experiments to evaluate the usefulness of individual features. Figure 1 shows the F1 measures when using different feature groups.

(Figure 1: Discriminating power of different groups of features. The y-axis is the F1 measure, plotted separately for the story telling and personal narrative tasks; the numbers on the x-axis correspond to the feature groups in Section 3.2.)

The numbers on the x-axis correspond to the feature groups described in Section 3.2. The F1 measure value for each of the features is the highest value obtained by running different ML algorithms for classification. We noticed that for the story telling task, using perplexity values from LMs (group 5) as a feature in the ML setting outperforms the LM threshold approach by a large margin. It seems that having the perplexity values as well as the perplexity differences from all the LMs of different orders in the ML algorithm provides a better estimation of the target concept.
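For concreteness, the sketch below shows one way the group-5 features could be assembled from the perplexity values of the order-1, 2, and 3 TD and LI LMs. The feature names and the exact set of differences are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative construction of the group-5 features: perplexities from the
# TD and LI POS-tag LMs of orders 1-3, plus their differences.

def perplexity_features(pp_td, pp_li):
    """pp_td, pp_li: dicts mapping LM order (1, 2, 3) to the perplexity of one
    transcript's POS sequence under the TD-trained and LI-trained LMs."""
    feats = {}
    for order in (1, 2, 3):
        feats[f"pp_td_{order}"] = pp_td[order]
        feats[f"pp_li_{order}"] = pp_li[order]
        feats[f"pp_diff_{order}"] = pp_li[order] - pp_td[order]
    return feats

example = perplexity_features(pp_td={1: 40.1, 2: 25.3, 3: 22.8},
                              pp_li={1: 38.7, 2: 27.9, 3: 26.4})
```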
Only the standard scores (group 6) yield a higher F1 measure for the personal narrative task than the story telling one. The majority of the features (5 out of 6 groups) provide higher F1 measures for the story telling task, which explains the significantly better results on this task over the personal narrative in our learning approach. This is consistent with previous work contrasting narrative genre stating that the restrictive setting of a story retell is more revealing of language difficulties than spontaneous narratives, where the subjects have more control on the content and style (Wetherell et al., 2007). We also performed some error analysis for some of the transcripts that were consistently misidentified by different ML algorithms. In the story telling task, we find that some LI transcripts are misclassified as TD because they (1) have fewer fillers, disfluencies, and degree of support; (2) are similar to 51 the TD transcripts, which is depicted by the perplexity values for these transcripts; or (3) contain higher MLU in words as compared to their LI peers. Some of the reasons for classifying transcripts in the TD category as LI are shorter MLU in words as compared to other TD peers, large number of fillers, and excessive repetitions of words and phrases unlike the other TD children. These factors are consistent with the effective features that we found from Figure 1. For the personal narrative task, standard scores (group 6) and language productivity (group 1) have an important role in classification, as shown in Figure 1. The TD transcripts that are misidentified have lower standard scores and MLU in words than those of their TD peers. We believe that another source of noise in the transcripts comes from the POS tags themselves. For instance, we found that many verbs in present tense for third person singular are tagged as plural nouns, which results in a failure to capture subjectverb agreement. Lastly, according to the dataset description, children in the LI category met the LI criteria at one stage in their lifetime and some of these children also had, or were receiving, some educational support in the school environment at the time of data collection. This support for children with LI is meant to improve their performance on language related tasks, making the automatic classification problem more complicated. This also raises the question about the reference label (TD or LI) for each child in the data set we used. The details about which children received interventions are not specified in the dataset description. 5 Experiments with Bilingual Children In this section we generalize the approach to a Spanish-English bilingual population. In adapting the approach to our bilingual population we face two challenges: first, what shows to be promising for a monolingual and highly heterogeneous population may not be as successful in a bilingual setting where we expect to have a large variability of exposure to each language; second, there is a large difference in the mean age of the monolingual setting and that of our bilingual one. This age difference will result in different speech patterns. Younger children pro- duce more ill-formed sentences since they are still in a language acquisition phase. Lastly, the clinical markers in adolescents are geared towards problems at the pragmatic and discourse levels, while at younger ages they focus more on syntax and morphology. 
For dealing with the first challenge we are extracting language-specific features and hope that by looking at both languages we can reach a good discrimination performance. For the second challenge, our feature engineering approach has been focused on younger children from the beginning. We are aiming to capture the type of morphosyntactic patterns that can identify LI in young children. In addition, the samples in the bilingual population are story retells, and our feature setting proved to be a good match for this task. Therefore, we expect our approach to capture relevant classification patterns, even in the presence of noisy utterances.

5.1 The Bilingual Data Set The transcripts for the bilingual LI task come from an on-going longitudinal study of language impairment in Spanish-English speaking children (Peña et al., 2006a). The children in this study were enrolled in kindergarten with a mean age of about 70 months. Of the 59 children, 6 were identified as having a possible LI by an expert in communication disorders, while 53 were identified as TD. Six of the TD children were excluded due to missing information, yielding a total of 47 TD children. Each child told a series of stories based on Mercer Mayer's wordless picture books (Mayer, 1969). Two stories were told in English and two were told in Spanish, for a total of four transcripts per child. The books used for English were "A Boy, A Dog, and A Frog" and "Frog, Where Are You?" The books used for Spanish retelling were "Frog on His Own" and "Frog Goes to Dinner." The transcripts for each separate language were combined, yielding one instance per language for each child. An interesting aspect of the bilingual data is that the children mix languages in their narratives. This phenomenon is called code-switching. At the beginning of a retelling session, the interviewer encourages the child to speak the target language if he/she is not doing so. Once the child begins speaking the correct language, any code-switching thereafter is not corrected by the interviewer. Due to this, the English transcripts contain Spanish utterances and vice versa. We believe that words in the non-target language help contribute to a more accurate language development profile. Therefore, in our work we decided to keep these code-switched elements. A combined lexicon approach was used to tag the mixed-language fragments. If a word does not appear in the target language lexicon, we apply the POS tag from the non-target language.

5.2 Spanish-Specific Features Many structural differences exist between Spanish, a Romance language, and English, a Germanic language. Spanish is morphologically richer than English. It contains a larger number of different verb conjugations and it uses a two-gender system for nouns, adjectives, determiners, and participles. A Spanish-speaking child with LI will have difficulties with different grammatical elements than an English-speaking child, such as articles and clitics (Bedore and Peña, 2008). These differences indicate that the Spanish feature set will need to be tailored towards the Spanish language. To account for Spanish-specific patterns we included new POS bigrams as features. To capture the use of correct and incorrect gender and number marking morphology, we added noun-adjective, determiner-noun, and number-noun bigrams to the list of morphosyntactic features.
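The sketch below illustrates how such gender- and number-sensitive POS-bigram counts might be collected from a POS-tagged utterance. The coarse tag names, the allowed gap between the two tags, and the feature naming are our assumptions, not the exact feature extraction used in the study.

```python
from collections import Counter

# Sketch of counting the Spanish-specific POS-bigram features (determiner-noun,
# noun-adjective, number-noun), allowing a small gap between the two tags so
# that intervening modifiers do not hide the pair.

SPANISH_PAIRS = {("det", "noun"), ("noun", "adj"), ("num", "noun")}

def pos_bigram_features(pos_tags, pairs=SPANISH_PAIRS, max_gap=2):
    counts = Counter()
    for i, left in enumerate(pos_tags):
        # look at the adjacent tag plus up to `max_gap` intervening positions
        for j in range(i + 1, min(i + 2 + max_gap, len(pos_tags))):
            pair = (left, pos_tags[j])
            if pair in pairs:
                counts[f"{pair[0]}-{pair[1]}"] += 1
    return counts

# e.g. "la casa blanca" -> det noun adj
print(pos_bigram_features(["det", "noun", "adj"]))
```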
5.3 Results on Bilingual Children Results for the baseline and the LM threshold approach on the bilingual data set are shown in Table 3. The baseline is computed from the same measures as for the monolingual dataset (MLU in words, NDW, and total utterances). Compared to Table 1, the values in Table 3 are generally lower than on the monolingual story telling task. In this inherently difficult task, the bilingual transcripts are more disfluent than the monolingual ones. This could be due to the age of the children or their bilingual status. Recent studies on psycholinguistics and language production have shown that bilingual speakers have both languages active at speech production time (Kroll et al., 2008), and it is possible that this may cause interference, especially in children still in the phase of language acquisition.

Table 3: Evaluation of language models on the bilingual Spanish-English data set.
                English                        Spanish
Method          P (%)    R (%)   F1 (%)        P (%)   R (%)   F1 (%)
Baseline        20.00    16.66   18.18         16.66   16.66   16.66
1-gram LMs      40.00    33.33   36.36         17.64   50.00   26.08
2-gram LMs      50.00    33.33   40.00         33.33   16.66   22.22
3-gram LMs      100.00   33.33   50.00         0.00    0.00    -

In addition, the LMs in the monolingual task were trained using more instances per class, possibly yielding better results. There are some different patterns between using the English and Spanish transcripts. In English, the unigram models provide the least discriminative value, and the bigram and trigram models improve discrimination. We also evaluated higher-order n-grams, but did not obtain any further improvement. We found that the classification accuracy of the LM approach was influenced by two children with LI who were consistently marked as LI due to a greater perplexity value from the TD LM. A further analysis shows that these children spoke mostly Spanish on the "English" tasks, yielding larger perplexities from the TD LM, which was trained on mostly English. In contrast, the LI LM was created with transcripts containing more Spanish than the TD one, and thus test transcripts with a lot of Spanish do not inflate perplexity values that much. For Spanish, unigram LMs provide some discriminative usefulness, and then the bigram performance decreases while the trigram model provides no discriminative value. One reason for this may be that the Spanish LMs have a larger vocabulary. In the Spanish LMs, there are 2/3 more POS tags than in the English LM. This size difference dramatically increases the possible bigrams and trigrams, therefore increasing the number of parameters to estimate. In addition, we are using an "off the shelf" POS tagger (provided by CLAN) and this may add noise to the feature extraction process. Since we do not have gold standard annotations for these transcripts, we cannot measure the POS tagging accuracy. A rough estimate based on manually revising one transcript in each language showed a POS tagging accuracy of 90% for English and 84% for Spanish. Most of the POS tagger errors involve verbs, nouns and pronouns. Thus while the accuracy might not seem that low, it can still have a major impact on our approach since it involves the POS categories that are more relevant for this task.

(Figure 2: Discriminating power of different groups of features for the bilingual population. The y-axis is the F1 measure, shown for the English, Spanish, and Combined settings; the numbers on the x-axis correspond to the feature groups in Section 3.2.)

Table 4 shows the results from various ML algorithms.
In addition to predicting the language status with the English and Spanish samples separately, we also combined the English and Spanish transcripts for each child, and used all the features from both languages in order to allow a prediction based on both samples. The best F1 measure for this task (60%) is achieved by the Naive Bayes algorithm with the combined Spanish-English feature set. This is an improvement over both the separate English and Spanish trials. The Naive Bayes algorithm provided the best discrimination for the English (54%) and Combined data sets, and Boosting and SVM provided the best discrimination for the Spanish set (18%).

Table 4: Evaluation of machine learning algorithms on the bilingual Spanish-English data set.
                      English                     Spanish                     Combined
Algorithm             P (%)   R (%)   F1 (%)      P (%)   R (%)   F1 (%)      P (%)    R (%)   F1 (%)
ANNs                  66.66   33.33   44.44       0.00    0.00    -           100.00   16.66   28.57
SVM                   14.28   16.66   15.38       20.00   16.66   18.18       66.66    33.33   44.44
Naive Bayes           60.00   50.00   54.54       0.00    0.00    -           75.00    50.00   60.00
Logistic Regression   25.00   16.66   20.00       -       0.00    -           50.00    33.33   40.00
Boosting              50.00   33.33   40.00       20.00   16.66   18.18       66.66    33.33   44.44

5.4 Feature Analysis Similar to the monolingual dataset, we performed additional experiments exploring the contribution of different groups of features. We tested the six groups of features described in Section 3.2 separately. Overall, the combined LM perplexity values (group 5) provided the best discriminative value (F1 = 66%). The LM perplexity values performed the best for English; they even outperformed using all the features in the ML algorithm, suggesting some feature selection is needed for this task. The morphosyntactic skills (group 2) provided the best discriminative value for the Spanish language features, and performed better than the complete feature set for Spanish. Within group 2, we evaluated different POS bigrams for the Spanish and English sets and observed that most of the bigram combinations by themselves are usually weak predictors of language status. In the Spanish set, out of all of the lexical combinations, only the determiner-noun, noun-verb, and pronoun-verb categories provided some discriminative value. The determiner-noun category captured the correct and incorrect gender marking between the two POS tags. The noun-verb and pronoun-verb categories covered the correct and incorrect usage of subject-verb combinations. Interestingly enough, the pronoun-verb category performed well by itself, yielding an F1 measure of 54%. There are also some differences in the frequencies of bigram features in the English and Spanish data sets. For example, there is no noun-auxiliary POS pattern in Spanish, and the pronoun-auxiliary bigram appears less frequently in Spanish than in English, because in Spanish the use of personal pronouns is not mandatory since the verb inflection disambiguates the subject of the sentence. The vocabulary knowledge feature (group 3) did not provide any discriminative value for any of the language tasks. This may be because bilingual children receive less input for each language than a monolingual child learning one language, or due to the varied vocabulary acquisition rate in our bilingual population.

6 Conclusions and Future Work In this paper we present results on the use of LMs and ML techniques trained on features representing different aspects of language gathered from spontaneous speech samples for the task of assisting clinicians in determining language status in children.
First, we evaluate our approach on a monolingual English-speaking population. Next, we show that this ML approach can be successfully adapted to a bilingual Spanish-English population. ML algorithms provide greater discriminative power than only using a threshold approach with LMs. Our current efforts are devoted to improving prediction accuracy by refining our feature set. We are working on creating a gold standard corpus of children's transcripts annotated with POS tags. This data set will help us improve accuracy on our POSbased features. We are also exploring the use of socio-demographic features such as the educational level of parents, the gender of children, and enrollment status on free lunch programs. Acknowledgments This work was supported by NSF grant 0812134, and by grant 5 UL1 RR024982 from NCRR, a component of NIH. We also thank the three NAACL reviewers for insightful comments on the submitted version of this paper. References Lisa M. Bedore and Laurence B. Leonard. 2005. Verb inflections and noun phrase morphology in the spontaneous speech of Spanish-speaking children with specific language impairment. Applied Psycholinguistics, 26(2):195­225. Lisa M. Bedore and Elizabeth D. Pe~ a. 2008. Assessn ment of bilingual children for identification of language impairment: Current findings and implications for practice. International Journal of Bilingual Education and Bilingualism, 11(1):1­29. Nicola Botting. 2002. Narrative as a tool for the assessment of linguistic and pragmatic impairments. Child Language Teaching and Therapy, 18(1):1­21. Thomas Campbell, Chris Dollaghan, Herbert Needleman, and Janine Janosky. 1997. Reducing bias in language assessment: Processing-dependent measures. Journal of Speech, Language, and Hearing Research, 40(3):519­525. Harald Clahsen and Detlef Hansen. 1997. The grammatical agreement deficit in specific language impairment: Evidence from therapy experiments. In Myrna Gopnik, editor, The Inheritance and Innateness of Grammar, chapter 7. Oxford University Press, New York. Christine A. Dollaghan and Thomas F. Campbell. 1998. Nonword repetition and child language impairment. Journal of Speech, Language, and Hearing Research, 41(5):1136­1146. Christine A. Dollaghan. 2004. Taxometric analyses of specific language impairment in 3- and 4-year-old children. Journal of Speech, Language, and Hearing Research, 47(2):464­475. Peggy F. Jacobson and Richard G. Schwartz. 2002. Morphology in incipient bilingual Spanish-speaking preschool children with specific language impairment. Applied Psycholinguistics, 23(1):23­41. Judith F. Kroll, Chip Gerfen, and Paola E. Dussias. 2008. Laboratory designs and paradigms: Words, sounds, sentences. In L. Wei and M. G. Moyer, editors, The Blackwell Guide to Research Methods in Bilingualism and Multilingualism, chapter 7. Blackwell Pub. Laurence B. Leonard, Julia A. Eyer, Lisa M. Bedore, and Bernard G. Grela. 1997. Three accounts of the grammatical morpheme difficulties of Englishspeaking children with specific language impairment. Journal of Speech, Language, and Hearing Research, 40(4):741­753. Brian MacWhinney. 2000. The CHILDES project: Tools for analyzing talk. Lawrence Erlbaum, Mahwah, NJ. Mercer Mayer. 1969. Frog, where are you? Dial Press. Elizabeth D. Pe~ a, Lisa M. Bedore, Ronald B. Gillam, n and Thomas Bohman. 2006a. Diagnostic markers of language impairment in bilingual children. Grant awarded by the NIDCD, NIH. Elizabeth D. Pe~ a, Tammie J. Spaulding, and Elena n Plante. 2006b. 
The composition of normative groups and diagnostic decision making: Shooting ourselves in the foot. American Journal of Speech-Language Pathology, 15(3):247­254. Elena Plante and Rebecca Vance. 1994. Selection of preschool language tests: A data-based approach. Language, Speech, and Hearing Services in Schools, 25(1):15­24. Mar´a Adelaida Restrepo and Vera F. Guti´ rrez-Clellen. i e 2001. Article use in Spanish-speaking children with specific language impairment. Journal of Child Language, 28(2):433­452. Mabel L. Rice and Kenneth Wexler. 1996. Toward tense as a clinical marker of specific language impairment in English-speaking children. Journal of Speech and Hearing Research, 39(6):1239­1257. Brian Roark, Margaret Mitchell, and Kristy Hollingshead. 2007. Syntactic complexity measures for detecting mild cognitive impairment. In Proceedings of the Workshop on BioNLP 2007, pages 1­8. ACL. Carson T. Sch¨ tze and Kenneth Wexler. 1996. Subject u case licensing and English root infinitives. In Proceedings of the 20th Annual Boston University Conference on Language Development. Cascadilla Press. Thamar Solorio and Yang Liu. 2008. Using language models to identify language impairment in SpanishEnglish bilingual children. In Proceedings of the Workshop on BioNLP 2008, pages 116­117. ACL. Tammie J. Spaulding, Elena Plante, and Kimberly A. Farinella. 2006. Eligibility criteria for language impairment: Is the low end of normal always appropriate? Language, Speech, and Hearing Services in Schools, 37(1):61­72. Andreas Stolcke. 2002. SRILM ­ an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, volume 2, pages 901­904. Elin T. Thordardottir and Susan Ellis Weismer. 2002. Content mazes and filled pauses on narrative language samples of children with specific language impairment. Brain and Cognition, 48(2-3):587­592. J. Bruce Tomblin, Nancy L. Records, Paula Buckwalter, Xuyang Zhang, Elaine Smith, and Marlea O'Brien. 1997. Prevalence of specific language impairment in kindergarten children. Journal of Speech, Language, and Hearing Research, 40(6):1245­1260. Danielle Wetherell, Nicola Botting, and Gina ContiRamsden. 2007. Narrative in adolescent specific language impairment (SLI): a comparison with peers across two different narrative genres. International Journal of Language and Communication Disorders, 42:583­605(23). Kenneth Wexler. 1994. Optional infinitives. In David Lightfoot and Norbert Hornstein, editors, Verb Movement. Cambridge University Press. Ian H. Witten and Eibe Frank. 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann. 55 A Discriminative Latent Variable Chinese Segmenter with Hybrid Word/Character Information Xu Sun Department of Computer Science University of Tokyo sunxu@is.s.u-tokyo.ac.jp Takuya Matsuzaki Department of Computer Science University of Tokyo matuzaki@is.s.u-tokyo.ac.jp Yaozhong Zhang Department of Computer Science University of Tokyo yaozhong.zhang@is.s.u-tokyo.ac.jp Yoshimasa Tsuruoka School of Computer Science University of Manchester yoshimasa.tsuruoka@manchester.ac.uk Jun'ichi Tsujii Department of Computer Science, University of Tokyo, Japan School of Computer Science, University of Manchester, UK National Centre for Text Mining, UK tsujii@is.s.u-tokyo.ac.jp Abstract Conventional approaches to Chinese word segmentation treat the problem as a characterbased tagging task. 
Recently, semi-Markov models have been applied to the problem, incorporating features based on complete words. In this paper, we propose an alternative, a latent variable model, which uses hybrid information based on both word sequences and character sequences. We argue that the use of latent variables can help capture long range dependencies and improve the recall on segmenting long words, e.g., named entities. Experimental results show that this is indeed the case. With this improvement, evaluations on the data of the second SIGHAN CWS bakeoff show that our system is competitive with the best ones in the literature.

1 Introduction For most natural language processing tasks, words are the basic units to process. Since Chinese sentences are written as continuous sequences of characters, segmenting a character sequence into a word sequence is the first step for most Chinese processing applications. In this paper, we study the problem of Chinese word segmentation (CWS), which aims to find these basic units (words; following previous work, in this paper, words can also refer to multi-word expressions, including proper names, long named entities, idioms, etc.) for a given sentence in Chinese. Chinese character sequences are normally ambiguous, and out-of-vocabulary (OOV) words are a major source of the ambiguity. Typical examples of OOV words include named entities (e.g., organization names, person names, and location names). Those named entities may be very long, and a difficult case occurs when a long word W (|W| >= 4) consists of some words which can be separate words on their own; in such cases an automatic segmenter may split the OOV word into individual words. For example, (Computer Committee of International Federation of Automatic Control) is one of the organization names in the Microsoft Research corpus. Its length is 13 and it contains more than 6 individual words, but it should be treated as a single word. Proper recognition of long OOV words is meaningful not only for word segmentation, but also for a variety of other purposes, e.g., full-text indexing. However, as is illustrated, recognizing long words (without sacrificing the performance on short words) is challenging. Conventional approaches to Chinese word segmentation treat the problem as a character-based labeling task (Xue, 2003). Labels are assigned to each character in the sentence, indicating whether the character x_i is the start of a word (Label_i = B) or the middle or end of a multi-character word (Label_i = C). A popular discriminative model that has been used for this task is the conditional random field (CRF) (Lafferty et al., 2001), starting with the model of Peng et al. (2004). In the Second International Chinese Word Segmentation Bakeoff (the second SIGHAN CWS bakeoff) (Emerson, 2005), two of the highest-scoring systems in the closed track competition were based on a CRF model (Tseng et al., 2005; Asahara et al., 2005). While the CRF model is quite effective compared with other models designed for CWS, it may be limited by its restrictive independence assumptions on non-adjacent labels. Although the window can in principle be widened by increasing the Markov order, this may not be a practical solution, because the complexity of training and decoding a linear-chain CRF grows exponentially with the Markov order (Andrew, 2006).
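The sketch below illustrates this two-label character tagging scheme (B for the first character of a word, C for a continuation character). The helper functions are ours, and placeholder Latin strings stand in for Chinese characters.

```python
# Sketch of the two-label character tagging scheme used for CWS:
# B marks the first character of a word, C marks any continuation character.

def words_to_labels(words):
    """['word1', 'word2', ...] -> flat character list and B/C label list."""
    chars, labels = [], []
    for w in words:
        for k, ch in enumerate(w):
            chars.append(ch)
            labels.append("B" if k == 0 else "C")
    return chars, labels

def labels_to_words(chars, labels):
    """Invert the encoding: cut the sequence before every B label."""
    words = []
    for ch, lab in zip(chars, labels):
        if lab == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words

chars, labels = words_to_labels(["AB", "C", "DEF"])   # toy "words"
assert labels_to_words(chars, labels) == ["AB", "C", "DEF"]
```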
To address this difficulty, a choice is to relax the Markov assumption by using the semi-Markov conditional random field model (semi-CRF) (Sarawagi and Cohen, 2004). Despite the theoretical advantage of semi-CRFs over CRFs, however, some previous studies (Andrew, 2006; Liang, 2005) exploring the use of a semi-CRF for Chinese word segmentation did not find significant gains over the CRF ones. As discussed in Andrew (2006), the reason may be that despite the greater representational power of the semi-CRF, there are some valuable features that could be more naturally expressed in a character-based labeling model. For example, in a CRF model, one might use the feature "the current character x_i is X and the current label Label_i is C". This feature may be helpful in CWS for generalizing to new words. For example, it may rule out certain word boundaries if X were a character that normally occurs only as a suffix but that combines freely with some other basic forms to create new words. This type of feature is slightly less natural in a semi-CRF, since in that case local features (y_i, y_{i+1}, x) are defined on pairs of adjacent words. That is to say, information about which characters are not on boundaries is only implicit. Notably, except for the hybrid Markov/semi-Markov system in Andrew (2006), which was also used in Gao et al. (2007) with an improved performance in CWS, no other studies using the semi-CRF (Sarawagi and Cohen, 2004; Liang, 2005; Daumé III and Marcu, 2005) experimented with features of segmenting non-boundaries. In this paper, instead of using semi-Markov models, we describe an alternative, a latent variable model, to learn long range dependencies in Chinese word segmentation. We use discriminative probabilistic latent variable models (DPLVMs) (Morency et al., 2007; Petrov and Klein, 2008), which use latent variables to carry additional information that may not be expressed by the original labels, and therefore try to build more complicated or longer dependencies. This is especially meaningful in CWS, because the labels used are quite coarse: Label(y) ∈ {B, C}, where B signifies beginning a word and C signifies the continuation of a word (in practice, one may add a few extra labels based on linguistic intuitions (Xue, 2003)). For example, by using a DPLVM, the aforementioned feature may turn into "the current character x_i is X, Label_i = C, and LatentVariable_i = LV". The current latent variable LV may strongly depend on the previous one or many latent variables, and therefore we can model the long range dependencies which may not be captured by those very coarse labels. Also, since character and word information have their different advantages in CWS, in our latent variable model we use hybrid information based on both character and word sequences.

2 A Latent Variable Segmenter

2.1 Discriminative Probabilistic Latent Variable Model Given data with latent structures, the task is to learn a mapping between a sequence of observations x = x_1, x_2, ..., x_m and a sequence of labels y = y_1, y_2, ..., y_m. Each y_j is a class label for the j-th character of an input sequence, and is a member of a set Y of possible class labels. For each sequence, the model also assumes a sequence of latent variables h = h_1, h_2, ..., h_m, which is unobservable in training examples. The DPLVM is defined as follows (Morency et al., 2007):

P(y|x, \theta) = \sum_{h} P(y|h, x, \theta) P(h|x, \theta),   (1)

where \theta are the parameters of the model. DPLVMs can be seen as a natural extension of CRF models, and CRF models can be seen as a special case of DPLVMs that have only one latent variable for each label. To make the training and inference efficient, the model is restricted to have disjoint sets of latent variables associated with each class label. Each h_j is a member of a set H_{y_j} of possible latent variables for the class label y_j. H is defined as the set of all possible latent variables, i.e., the union of all H_{y_j} sets. Since sequences which have any h_j \notin H_{y_j} will by definition have P(y|x, \theta) = 0, the model can be further defined as

P(y|x, \theta) = \sum_{h \in H_{y_1} \times \ldots \times H_{y_m}} P(h|x, \theta),   (2)

that is, Eq. (2) is Eq. (1) with this additional restriction. Here P(h|x, \theta) is defined by the usual conditional random field formulation:

P(h|x, \theta) = \frac{\exp(\theta \cdot f(h, x))}{\sum_{h'} \exp(\theta \cdot f(h', x))},   (3)

in which f(h, x) is a feature vector. Given a training set consisting of n labeled sequences (x_i, y_i), for i = 1 ... n, parameter estimation is performed by optimizing the objective function

L(\theta) = \sum_{i=1}^{n} \log P(y_i|x_i, \theta) - R(\theta).   (4)

The first term of this equation is the conditional log-likelihood of the training data. The second term is a regularizer that is used for reducing overfitting in parameter estimation. For decoding in the test stage, given a test sequence x, we want to find the most probable label sequence y*:

y* = \mathrm{argmax}_{y} P(y|x, \theta).   (5)

For latent conditional models like DPLVMs, the best label path y* cannot directly be produced by the Viterbi algorithm because of the incorporation of hidden states. In this paper, we use a technique based on A* search and dynamic programming, described in Sun and Tsujii (2009), for producing the most probable label sequence y* for the DPLVM. In detail, an A* search algorithm (Hart et al., 1968) with a Viterbi heuristic function is adopted to produce the top-n latent paths h_1, h_2, ..., h_n. (A* search and its variants, like beam search, are widely used in statistical machine translation; compared to other search techniques, an interesting property of A* search is that it can produce top-n results one by one in an efficient manner.) In addition, a forward-backward-style algorithm is used to compute the exact probabilities of their corresponding label paths y_1, y_2, ..., y_n. The model then tries to determine the optimal label path based on the top-n statistics, without enumerating the remaining low-probability paths, which could be exponentially numerous. The optimal label path y* is ready when the following "exact condition" is achieved:

P(y_1|x, \theta) - \Big(1 - \sum_{y_k \in LP_n} P(y_k|x, \theta)\Big) \geq 0,   (6)

where y_1 is the most probable label sequence in the current stage. It is straightforward to prove that y* = y_1, and further search is unnecessary, because the remaining probability mass, 1 - \sum_{y_k \in LP_n} P(y_k|x, \theta), cannot beat the current optimal label path in this case. For more details of the inference, refer to Sun and Tsujii (2009).

2.2 Hybrid Word/Character Information
· Whether x_j and x_{j+1} are identical, for j = (i-2) ... (i+1)
· Whether x_j and x_{j+2} are identical, for j = (i-3) ... (i+1)
The latter two feature templates are designed to detect character or word reduplication, a morphological phenomenon that can influence word segmentation in Chinese. The word-based features are indicator functions that fire when the local character sequence matches a word or a word bigram. A dictionary containing word and bigram information was collected from the training data. For each latent variable label unigram h_i, we use the following set of predicate templates for the word-based features:
· The identity of the string x_j ... x_i, if it matches a word A from the word dictionary of the training data, with the constraint i-6 < j < i; multiple features will be generated if there are multiple strings satisfying the condition.
· The identity of the string x_i ... x_k, if it matches a word A from the word dictionary of the training data, with the constraint i < k < i+6; multiple features could be generated.
· The identity of the word bigram (x_j ... x_{i-1}, x_i ... x_k), if it matches a word bigram in the bigram dictionary and satisfies the aforementioned constraints on j and k; multiple features could be generated.
· The identity of the word bigram (x_j ... x_i, x_{i+1} ... x_k), if it matches a word bigram in the bigram dictionary and satisfies the aforementioned constraints on j and k; multiple features could be generated.
All feature templates were instantiated with values that occur in positive training examples. We found that using low-frequency features that occur only a few times in the training set improves performance on the development set. We hence do not do any thresholding of the DPLVM features: we simply use all the generated features. The aforementioned word-based features can incorporate word information naturally. In addition, following Wang et al. (2006), we found that a very simple heuristic can further improve the segmentation quality slightly. More specifically, two operations, merge and split, are performed on the DPLVM/CRF outputs: if a bigram A B was not observed in the training data, but the merged form AB was, then A B is simply merged into AB; on the other hand, if AB was not observed but A B appeared, then it is split into A B. We found this simple heuristic on word information slightly improved the performance (e.g., for the PKU corpus, +0.2% on the F-score).

3 Experiments We used the data provided by the Second International Chinese Word Segmentation Bakeoff to test the approaches described in the previous sections. The data contains three corpora from different sources: Microsoft Research Asia (MSR), City University of Hong Kong (CU), and Peking University (PKU). Since the purpose of this work is to evaluate the proposed latent variable model, we did not use extra resources such as common surnames, lexicons, parts-of-speech, and semantics. For the generation of word-based features, we extracted a word list from the training data as the vocabulary. Four metrics were used to evaluate segmentation results: recall (R, the percentage of gold-standard words that are correctly segmented by the decoder), precision (P, the percentage of words in the decoder output that are segmented correctly), balanced F-score (F), defined by 2PR/(P+R), and recall of OOV words (R-oov). For more detailed information on the corpora and these metrics, refer to Emerson (2005).
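The sketch below computes these four metrics for a single segmented sentence by comparing character spans. It follows the definitions above but is not the official bakeoff scoring script, and the toy example uses Latin letters in place of Chinese characters.

```python
# Sketch of word-level P/R/F and OOV recall for segmentation output: a word is
# counted as correct when its character span matches a gold-standard span.

def word_spans(words):
    """Return (word, (start, end)) pairs over the flat character sequence."""
    out, start = [], 0
    for w in words:
        out.append((w, (start, start + len(w))))
        start += len(w)
    return out

def evaluate(gold_words, sys_words, train_vocab):
    gold = word_spans(gold_words)
    sys_spans = {s for _, s in word_spans(sys_words)}
    gold_spans = {s for _, s in gold}
    correct = gold_spans & sys_spans
    p = len(correct) / len(sys_spans) if sys_spans else 0.0
    r = len(correct) / len(gold_spans) if gold_spans else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    # OOV recall: gold words not in the training vocabulary
    oov = [(w, s) for w, s in gold if w not in train_vocab]
    r_oov = (sum(s in sys_spans for _, s in oov) / len(oov)) if oov else 0.0
    return p, r, f, r_oov

# toy example with Latin letters standing in for characters
print(evaluate(["AB", "C", "DE"], ["AB", "CD", "E"], train_vocab={"AB", "C"}))
```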
3.1 Training the DPLVM Segmenter We implemented DPLVMs in C++ and optimized the system to cope with large scale problems, in which the feature dimension is beyond millions. We employ the feature templates defined in Section 2.2, taking into account those 3,069,861 features for the MSR data, 2,634,384 features for the CU data, and 1,989,561 features for the PKU data. As for numerical optimization, we performed gradient decent with the Limited-Memory BFGS 59 (L-BFGS)6 optimization technique (Nocedal and Wright, 1999). L-BFGS is a second-order QuasiNewton method that numerically estimates the curvature from previous gradients and updates. With no requirement on specialized Hessian approximation, L-BFGS can handle large-scale problems in an efficient manner. Since the objective function of the DPLVM model is non-convex, we randomly initialized parameters for the training.7 To reduce overfitting, we employed an L2 Gaussian weight prior8 (Chen and Rosenfeld, 1999). During training, we varied the L2 regularization term (with values 10k , k from -3 to 3), and finally set the value to 1. We use 4 hidden variables per label for this task, compromising between accuracy and efficiency. 3.2 Comparison on Convergence Speed 1800K 1500K Obj. Func. Value 1200K 900K 600K 300K 0 100 200 300 400 500 600 700 800 900 Forward-Backward Passes DPLVM CRF Figure 1: The value of the penalized loss based on the number of iterations: DPLVMs vs. CRFs on the MSR data. First, we show a comparison of the convergence speed between the objective function of DPLVMs and CRFs. We apply the L-BFGS optimization algorithm to optimize the objective function of DPLVM and CRF models, making a comparison between them. We find that the number of iterations required for the convergence of DPLVMs are fewer than for CRFs. Figure 1 illustrates the convergence-speed comparison on the MSR data. The DPLVM model arrives at the plateau of convergence in around 300 iterations, with the penalized loss of 95K when #passes = 300; while CRFs require 900 iterations, with the penalized loss of 98K when #passes = 900. However, we should note that the time cost of the DPLVM model in each iteration is around four times higher than the CRF model, because of the incorporation of hidden variables. In order to speed up the For numerical optimization on latent variable models, we also experimented the conjugate-gradient (CG) optimization algorithm and stochastic gradient decent algorithm (SGD). We found the L-BFGS with L2 Gaussian regularization performs slightly better than the CG and the SGD. Therefore, we adopt the L-BFGS optimizer in this study. 7 For a non-convex objective function, different parameter initializations normally bring different optimization results. Therefore, to approach closer to the global optimal point, it is recommended to perform multiple experiments on DPLVMs with random initialization and then select a good start point. 8 We also tested the L-BFGS with L1 regularization, and we found the L-BFGS with L2 regularization performs better in this task. 6 MSR CU PKU Style S.C. T.C. S.C. #W.T. 88K 69K 55K #Word 2,368K 1,455K 1,109K #C.T. 5K 5K 5K #Char 4,050K 2,403K 1,826K Table 1: Details of the corpora. W.T. represents word types; C.T. represents character types; S.C. represents simplified Chinese; T.C. represents traditional Chinese. training speed of the DPLVM model in the future, one solution is to use the stochastic learning technique9 . Another solution is to use a distributed version of L-BFGS to parallelize the batch training. 
4 Results and Discussion

Since the CRF model is one of the most successful models for Chinese word segmentation, we compared DPLVMs with CRFs. To make the experimental results comparable between the DPLVM and CRF models, we employed the same feature set, optimizer, and fine-tuning strategy for the two. We also compared DPLVMs with semi-CRFs and other successful systems reported in previous work.

4.1 Evaluation Results

Three training and test corpora were used in the test: the MSR Corpus, the CU Corpus, and the PKU Corpus (see Table 1 for details). The results are shown in Table 2. They are grouped into three sub-tables according to the corpora, and each row represents a CWS model. For each group, the rows marked by (*) represent our models with hybrid word/character information. Best05 represents the best system of the Second International Chinese Word Segmentation Bakeoff on the corresponding data; A06 represents the semi-CRF model in Andrew (2006)10, which was also used in Gao et al. (2007) (denoted G07) with improved performance; Z06-a and Z06-b represent the pure subword CRF model and the confidence-based combination of CRF and rule-based models, respectively (Zhang et al., 2006); ZC07 represents the word-based perceptron model in Zhang and Clark (2007); T05 represents the CRF model in Tseng et al. (2005); and C05 represents the system in Chen et al. (2005). The best F-score and the best recall of OOV words in each group are shown in bold.

Table 2: Results from DPLVMs, CRFs, semi-CRFs, and other systems.

MSR data          P     R     F     R-oov
DPLVM (*)         97.3  97.3  97.3  72.2
CRF (*)           97.1  96.8  97.0  72.0
semi-CRF (A06)    N/A   N/A   96.8  N/A
semi-CRF (G07)    N/A   N/A   97.2  N/A
CRF (Z06-a)       96.5  96.3  96.4  71.4
Z06-b             97.2  96.9  97.1  71.2
ZC07              N/A   N/A   97.2  N/A
Best05 (T05)      96.2  96.6  96.4  71.7

CU data           P     R     F     R-oov
DPLVM (*)         94.7  94.4  94.6  68.8
CRF (*)           94.3  93.9  94.1  65.8
CRF (Z06-a)       95.0  94.2  94.6  73.6
Z06-b             95.2  94.9  95.1  74.1
ZC07              N/A   N/A   95.1  N/A
Best05 (T05)      94.1  94.6  94.3  69.8

PKU data          P     R     F     R-oov
DPLVM (*)         95.6  94.8  95.2  77.8
CRF (*)           95.2  94.2  94.7  76.8
CRF (Z06-a)       94.3  94.6  94.5  75.4
Z06-b             94.7  95.5  95.1  74.8
ZC07              N/A   N/A   94.5  N/A
Best05 (C05)      95.3  94.6  95.0  63.6

10 It is a hybrid Markov/semi-Markov CRF model which outperforms conventional semi-CRF models (Andrew, 2006). However, in general, as discussed in Andrew (2006), it is essentially still a semi-CRF model.

As shown in the table, we achieved the best F-score on two of the three corpora, and we also achieved the best recall of OOV words on those two corpora. Both the MSR and PKU Corpora use simplified Chinese, while the CU Corpus uses traditional Chinese. On the MSR Corpus, the DPLVM model reduced the error rate by more than 10% relative to the CRF model using exactly the same feature set. We also compared our DPLVM model with the semi-CRF models in Andrew (2006) and Gao et al. (2007), and found that the DPLVM model achieved slightly better performance than the semi-CRF models; Andrew (2006) and Gao et al. (2007) only reported results on the MSR Corpus.

In summary, tests on the data of the Second International Chinese Word Segmentation Bakeoff showed competitive results for our method compared with the best results in the literature. Our discriminative latent variable models achieved the best F-scores on the MSR Corpus (97.3%) and the PKU Corpus (95.2%); the latent variable models also achieved the best recalls of OOV words on those two corpora. We analyze the results by varying the word length in the following subsection.
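The relative error reduction quoted for the MSR Corpus can be checked directly from the F-scores in Table 2; with the published (rounded) values it comes out at roughly 10%:

```python
f_crf, f_dplvm = 97.0, 97.3                  # MSR F-scores from Table 2
err_crf, err_dplvm = 100 - f_crf, 100 - f_dplvm
print((err_crf - err_dplvm) / err_crf)       # ~0.10, i.e. about a 10% relative error reduction
```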
4.2 Effect on Long Words

One motivation for using a latent variable model for CWS is to use latent variables to more adequately learn long-range dependencies, as we argued in Section 1. In the test data of the MSR Corpus, 19% of the words are longer than 3 characters; the corresponding figures are 8% for the CU Corpus and 11% for the PKU Corpus. The MSR Corpus also contains some extremely long words (length > 10), while the CU and PKU corpora do not contain such extreme cases.

Figure 2 shows the recall rate on different groups of words categorized by their lengths (the number of characters). As we expected, the DPLVM model performs much better on long words (length ≥ 4) than the CRF model, which used exactly the same feature set. Compared with the CRF model, the DPLVM model exhibited almost the same level of performance on short words. Both models perform best on segmenting words of length two. The performance of the CRF model deteriorates rapidly as the word length increases, which demonstrates the difficulty of modeling long-range dependencies in CWS. Compared with the CRF model, the DPLVM model performed quite well in dealing with long words, without sacrificing the performance on short words. All in all, we conclude that the improvement of the DPLVM model came from improved modeling of long-range dependencies in CWS.

Figure 2: The recall rate on words grouped by their length (recall in % against word length, for the MSR, CU, and PKU corpora; DPLVM vs. CRF).

Table 3: Error analysis on the latent variable segmenter, contrasting the gold segmentation with the segmenter output. The errors are grouped into four types: over-generalization, errors on named entities (e.g., the person names Chen Yao and Chen Fei, and the transliterated name Vasillis), errors on idioms, and errors from data inconsistency (e.g., words glossed as propagandist and desertification).

4.3 Error Analysis

Table 3 lists the major errors collected from the latent variable segmenter. We examined the collected errors and found that many of them can be grouped into four types: over-generalization (the top row), errors on named entities (the following three rows), errors on idioms (the following three rows), and errors from inconsistency (the two rows at the bottom).

Our system performed reasonably well on very complex OOV words, such as the words glossed as (Agricultural Bank of China, Shijiazhuang-city Branch, the second sales department) and (Science and Technology Commission of China, National Institution on Scientific Information Analysis). However, it sometimes over-generalized to long words. For example, as shown in the top row, the names glossed as (National Department of Environmental Protection) and (The Central Propaganda Department) are two organization names, but they are incorrectly merged into a single word.

As for the following three rows, (Chen Yao) and (Chen Fei) are person names. They are wrongly segmented because we lack the features to capture the information of person names (such useful knowledge, e.g., a common surname list, is currently not used in our system).
In the future, such errors may be solved by integrating open resources into our system. (Vasillis) is a transliterated foreign location name and is also wrongly segmented. For the corpora that considered 4 character idioms as a word, our system successfully combined most of new idioms together. This differs greatly from the results of CRFs. However, there are still a number of new idioms that failed to be correctly segmented, as listed from the fifth row to the seventh row. Finally, some errors are due to inconsistencies in // (prothe gold segmentation. For example, pagandist) is two words, but a word with similar 62 structure, (theorist), is one word. (desertification) is one word, but its synonym, // (desertification), is two words in the gold segmentation. 5 Conclusion and Future Work We presented a latent variable model for Chinese word segmentation, which used hybrid information based on both word and character sequences. We discussed that word and character information have different advantages, and could be complementary to each other. Our model is an alternative to the existing word based models and character based models. We argued that using latent variables can better capture long range dependencies. We performed experiments and demonstrated that our model can indeed improve the segmentation accuracy on long words. With this improvement, tests on the data of the Second International Chinese Word Segmentation Bakeoff show that our system is competitive with the best in the literature. Since the latent variable model allows a wide range of features, so the future work will consider how to integrate open resources into our system. The latent variable model handles latent-dependencies naturally, and can be easily extended to other labeling tasks. Acknowledgments We thank Kun Yu, Galen Andrew and Xiaojun Lin for the enlightening discussions. We also thank the anonymous reviewers who gave very helpful comments. This work was partially supported by Grantin-Aid for Specially Promoted Research (MEXT, Japan). References Galen Andrew. 2006. A hybrid markov/semi-markov conditional random field for sequence segmentation. Proceedings of EMNLP'06, pages 465­472. Masayuki Asahara, Kenta Fukuoka, Ai Azuma, ChooiLing Goh, Yotaro Watanabe, Yuji Matsumoto, and Takahashi Tsuzuki. 2005. Combination of machine learning methods for optimum chinese word segmentation. Proceedings of the fourth SIGHAN workshop, pages 134­137. Stanley F. Chen and Ronald Rosenfeld. 1999. A gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, CMU. Aitao Chen, Yiping Zhou, Anne Zhang, and Gordon Sun. 2005. Unigram language model for chinese word segmentation. Proceedings of the fourth SIGHAN workshop. Hal Daum´ III and Daniel Marcu. 2005. Learne ing as search optimization: approximate large margin methods for structured prediction. Proceedings of ICML'05. Thomas Emerson. 2005. The second international chinese word segmentation bakeoff. Proceedings of the fourth SIGHAN workshop, pages 123­133. Jianfeng Gao, Galen Andrew, Mark Johnson, and Kristina Toutanova. 2007. A comparative study of parameter estimation methods for statistical natural language processing. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL'07), pages 824­831. P.E. Hart, N.J. Nilsson, and B. Raphael. 1968. A formal basis for the heuristic determination of minimum cost path. IEEE Trans. On System Science and Cybernetics, SSC-4(2):100­107. 
John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of ICML'01, pages 282­289. Percy Liang. 2005. Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology. Louis-Philippe Morency, Ariadna Quattoni, and Trevor Darrell. 2007. Latent-dynamic discriminative models for continuous gesture recognition. Proceedings of CVPR'07, pages 1­8. Jorge Nocedal and Stephen J. Wright. 1999. Numerical optimization. Springer. F. Peng and A. McCallum. 2004. Chinese segmentation and new word detection using conditional random fields. Proceedings of COLING'04. Slav Petrov and Dan Klein. 2008. Discriminative loglinear grammars with latent variables. Proceedings of NIPS'08. Sunita Sarawagi and William Cohen. 2004. Semimarkov conditional random fields for information extraction. Proceedings of ICML'04. Xu Sun and Jun'ichi Tsujii. 2009. Sequential labeling with latent variables: An exact inference algorithm and its efficient approximation. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL'09). Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky, and Christopher Manning. 2005. A conditional random field word segmenter for sighan bakeoff 63 2005. Proceedings of the fourth SIGHAN workshop, pages 168­171. S.V.N. Vishwanathan, Nicol N. Schraudolph, Mark W. Schmidt, and Kevin P. Murphy. 2006. Accelerated training of conditional random fields with stochastic meta-descent. Proceedings of ICML'06, pages 969­ 976. Xinhao Wang, Xiaojun Lin, Dianhai Yu, Hao Tian, and Xihong Wu. 2006. Chinese word segmentation with maximum entropy and n-gram language model. In Proceedings of the fifth SIGHAN workshop, pages 138­141, July. Nianwen Xue. 2003. Chinese word segmentation as character tagging. International Journal of Computational Linguistics and Chinese Language Processing, 8(1). Yue Zhang and Stephen Clark. 2007. Chinese segmentation with a word-based perceptron algorithm. Proceedings of ACL'07. Ruiqiang Zhang, Genichiro Kikui, and Eiichiro Sumita. 2006. Subword-based tagging by conditional random fields for chinese word segmentation. Proceedings of HLT/NAACL'06 companion volume short papers. 64 Improved Reconstruction of Protolanguage Word Forms Alexandre Bouchard-C^ t´ Thomas L. Griffiths Dan Klein oe Computer Science Division Department of Psychology University of California at Berkeley Berkeley, CA 94720 Abstract We present an unsupervised approach to reconstructing ancient word forms. The present work addresses three limitations of previous work. First, previous work focused on faithfulness features, which model changes between successive languages. We add markedness features, which model well-formedness within each language. Second, we introduce universal features, which support generalizations across languages. Finally, we increase the number of languages to which these methods can be applied by an order of magnitude by using improved inference methods. Experiments on the reconstruction of ProtoOceanic, Proto-Malayo-Javanic, and Classical Latin show substantial reductions in error rate, giving the best results to date. Kondrak, 2002) or all of the process (Oakes, 2000; Bouchard-C^ t´ et al., 2008). However, previous auoe tomated methods have been unable to leverage three important ideas a linguist would employ. 
We address these omissions here, resulting in a more powerful method for automatically reconstructing ancient protolanguages. First, linguists triangulate reconstructions from many languages, while past work has been limited to small numbers of languages. For example, Oakes (2000) used four languages to reconstruct Proto-Malayo-Javanic (PMJ) and Bouchard-C^ t´ et oe al. (2008) used two languages to reconstruct Classical Latin (La). We revisit these small datasets and show that our method significantly outperforms these previous systems. However, we also show that our method can be applied to a much larger data set (Greenhill et al., 2008), reconstructing ProtoOceanic (POc) from 64 modern languages. In addition, performance improves with more languages, which was not the case for previous methods. Second, linguists exploit knowledge of phonological universals. For example, small changes in vowel height or consonant place are more likely than large changes, and much more likely than change to arbitrarily different phonemes. In a statistical system, one could imagine either manually encoding or automatically inferring such preferences. We show that both strategies are effective. Finally, linguists consider not only how languages change, but also how they are internally consistent. Past models described how sounds do (or, more often, do not) change between nodes in the tree. To borrow broad terminology from the Optimality Theory literature (Prince and Smolensky, 1993), such models incorporated faithfulness features, capturing the ways in which successive forms remained similar to one another. However, each language has certain regular phonotactic patterns which con- 1 Introduction A central problem in diachronic linguistics is the reconstruction of ancient languages from their modern descendants (Campbell, 1998). Here, we consider the problem of reconstructing phonological forms, given a known linguistic phylogeny and known cognate groups. For example, Figure 1 (a) shows a collection of word forms in several Oceanic languages, all meaning to cry. The ancestral form in this case has been presumed to be /taNis/ in Blust (1993). We are interested in models which take as input many such word tuples, each representing a cognate group, along with a language tree, and induce word forms for hidden ancestral languages. The traditional approach to this problem has been the comparative method, in which reconstructions are done manually using assumptions about the relative probability of different kinds of sound change (Hock, 1986). There has been work attempting to automate part (Durham and Rogers, 1969; Eastlack, 1977; Lowe and Mazaudon, 1994; Covington, 1998; 65 Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 65­73, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics strain these changes. We encode such patterns using markedness features, characterizing the internal phonotactic structure of each language. Faithfulness and markedness play roles analogous to the channel and language models of a noisy-channel system. We show that markedness features improve reconstruction, and can be used efficiently. tion and increased scale. 3 Model 2 Related work Our focus in this section is on describing the properties of the two previous systems for reconstructing ancient word forms to which we compare our method. 
Citations for other related work, such as similar approaches to using faithfulness and markedness features, appear in the body of the paper. In Oakes (2000), the word forms in a given protolanguage are reconstructed using a Viterbi multialignment between a small number of its descendant languages. The alignment is computed using handset parameters. Deterministic rules characterizing changes between pairs of observed languages are extracted from the alignment when their frequency is higher than a threshold, and a proto-phoneme inventory is built using linguistically motivated rules and parsimony. A reconstruction of each observed word is first proposed independently for each language. If at least two reconstructions agree, a majority vote is taken, otherwise no reconstruction is proposed. This approach has several limitations. First, it is not tractable for larger trees, since the time complexity of their multi-alignment algorithm grows exponentially in the number of languages. Second, deterministic rules, while elegant in theory, are not robust to noise: even in experiments with only four daughter languages, a large fraction of the words could not be reconstructed. In Bouchard-C^ t´ et al. (2008), a stochastic model oe of sound change is used and reconstructions are inferred by performing probabilistic inference over an evolutionary tree expressing the relationships between languages. The model does not support generalizations across languages, and has no way to capture phonotactic regularities within languages. As a consequence, the resulting method does not scale to large phylogenies. The work we present here addresses both of these issues, with a richer model and faster inference allowing improved reconstruc66 We start this section by introducing some notation. Let be a tree of languages, such as the examples in Figure 3 (c-e). In such a tree, the modern languages, whose word forms will be observed, are the leaves of . All internal nodes, particularly the root, are languages whose word forms are not observed. Let L denote all languages, modern and otherwise. All word forms are assumed to be strings in the International Phonological Alphabet (IPA).1 We assume that word forms evolve along the branches of the tree . However, it is not the case that each cognate set exists in each modern language. Formally, we assume there to be a known list of C cognate sets. For each c {1, . . . , C} let L(c) denote the subset of modern languages that have a word form in the c-th cognate set. For each set c {1, . . . , C} and each language L(c), we denote the modern word form by wc . For cognate set c, only the minimal subtree (c) containing L(c) and the root is relevant to the reconstruction inference problem for that set. From a high-level perspective, the generative process is quite simple. Let c be the index of the current cognate set, with topology (c). First, a word is generated for the root of (c) using an (initially unknown) root language model (distribution over strings). The other nodes of the tree are drawn incrementally as follows: for each edge in (c) use a branch-specific distribution over changes in strings to generate the word at node . In the remainder of this section, we clarify the exact form of the conditional distributions over string changes, the distribution over strings at the root, and the parameterization of this process. 
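To make this setup concrete, the sketch below shows one simple way to represent the phylogeny and to extract the minimal subtree τ(c) for a cognate set, namely the union of the paths from the root to the observed leaves in L(c). The parent-dictionary encoding, the function name, and the internal node label are our own illustrative assumptions, not structures from the paper.

```python
def minimal_subtree(parent, root, observed_leaves):
    """parent maps each language to its parent (the root maps to None).
    Returns the set of languages in the minimal subtree containing the root
    and every leaf in observed_leaves, i.e. the union of root-to-leaf paths."""
    keep = {root}
    for leaf in observed_leaves:
        node = leaf
        while node is not None and node not in keep:
            keep.add(node)
            node = parent[node]
    return keep

# A toy phylogeny: POc at the root, with a hypothetical "Solomonic" internal node.
parent = {"POc": None, "Solomonic": "POc",
          "Lau": "Solomonic", "Kwara'ae": "Solomonic", "Taiof": "POc"}
print(minimal_subtree(parent, "POc", {"Lau", "Kwara'ae"}))
# -> {'POc', 'Solomonic', "Kwara'ae", 'Lau'} (set order may vary)
```

Per-cognate-set inference then only needs to touch the languages returned here, which is what keeps sparsely attested cognate sets cheap to handle.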
3.1 Markedness and Faithfulness

In Optimality Theory (OT) (Prince and Smolensky, 1993), two types of constraints influence the selection of a realized output given an input form: faithfulness and markedness constraints. Faithfulness encourages similarity between the input and output, while markedness favors well-formed output. Viewed from this perspective, previous computational approaches to reconstruction are based almost exclusively on faithfulness, expressed through a mutation model. Only the words in the language at the root of the tree, if any, are explicitly encouraged to be well-formed. In contrast, we incorporate constraints on markedness for each language, with both general and branch-specific constraints on faithfulness. This is done using a lexicalized stochastic string transducer (Varadarajan et al., 2008).

Figure 1: (a) A cognate set from the Austronesian dataset (Proto-Oceanic /taNis/, Lau /aNi/, Kwara'ae /angi/, Taiof /taNis/); all word forms mean to cry. (b-d) The mutation model used in this paper. (b) The mutation of POc /taNis/ to Kw. /angi/. (c) Graphical model depicting the dependencies among variables in one step of the mutation Markov chain. (d) Active features for one step of this process. (e-f) Comparison of two inference procedures on trees: single sequence resampling (e) draws one sequence at a time, conditioned on its parent and children, while ancestry resampling (f) draws an aligned slice from all words simultaneously. In large trees, the latter is more efficient than the former.

We now make precise the conditional distributions over pairs of evolving strings, referring to Figure 1 (b-d). Consider a language ℓ evolving to ℓ′ for cognate set c. Assume we have a word form x = w_{cℓ}. The generative process for producing y = w_{cℓ′} works as follows. First, we consider x to be composed of characters x1 x2 . . . xn, with the first and last being a special boundary symbol # which is never deleted, mutated, or created. The process generates y = y1 y2 . . . yn in n chunks yi, i ∈ {1, . . . , n}, one for each xi. The yi's may be a single character, multiple characters, or even empty. In the example shown, all three of these cases occur.

To generate yi, we define a mutation Markov chain that incrementally adds zero or more characters to an initially empty yi. First, we decide whether the current phoneme in the top word, t = xi, will be deleted, in which case yi is empty, as in the example of /s/ being deleted. If t is not deleted, we choose a single substitution character in the bottom word. This is the case both when /a/ is unchanged and when /N/ substitutes to /n/. We write S = Σ ∪ {∅} for this set of outcomes, where ∅ is the special outcome indicating deletion. Importantly, the probabilities of this multinomial can depend on both the previous character generated so far (i.e., the rightmost character p of y_{i−1}) and the current character in the previous generation string (t). As we will see shortly, this allows modelling markedness and faithfulness at every branch, jointly. This multinomial decision acts as the initial distribution of the mutation Markov chain.

We consider insertions only if a deletion was not selected in the first step. Here, we draw from a multinomial over S, where this time the special outcome ∅ corresponds to stopping insertions, and the other elements of S correspond to symbols that are appended to yi. In this case, the conditioning environment is t = xi and the current rightmost symbol p in yi. Insertions continue until ∅ is selected. In the example, we follow the substitution of /N/ to /n/ with an insertion of /g/, followed by a decision to stop that yi. We will use θ_{S,t,p,ℓ} and θ_{I,t,p,ℓ} to denote the probabilities over the substitution and insertion decisions in the current branch ℓ.

A similar process generates the word at the root of a tree, treating this word as a single string y1 generated from a dummy ancestor t = x1. In this case, only the insertion probabilities matter, and we separately parameterize these probabilities with θ_{R,t,p,ℓ}. There is no actual dependence on t at the root, but this formulation allows us to unify the parameterization, with each θ_{ω,t,p,ℓ} ∈ R^{|Σ|+1}, where ω ∈ {R, S, I}.

3.2 Parameterization

Instead of directly estimating the transition probabilities of the mutation Markov chain (as the parameters of a collection of multinomial distributions), we express them as the output of a log-linear model. We used the following feature templates:

OPERATION identifies whether an operation in the mutation Markov chain is an insertion, a deletion, a substitution, a self-substitution (i.e., of the form x → y with x = y), or the end of an insertion event. Examples in Figure 1 (d): 1[Subst] and 1[Insert].

MARKEDNESS consists of language-specific n-gram indicator functions for all symbols in Σ. Only unigram and bigram features are used for computational reasons, but we show in Section 5 that this already captures important constraints. Examples in Figure 1 (d): the bigram indicator 1[(n g)@Kw] (Kw stands for Kwara'ae, a language of the Solomon Islands) and the unigram indicators 1[(n)@Kw] and 1[(g)@Kw].

FAITHFULNESS consists of indicators for mutation events of the form 1[x → y], where x ∈ Σ, y ∈ S. Examples: 1[N → n], 1[N → n@Kw].

Feature templates similar to these can be found, for instance, in Dreyer et al. (2008) and Chen (2003), in the context of string-to-string transduction. Note also the connection with stochastic OT (Goldwater and Johnson, 2003; Wilson, 2006), where a log-linear model mediates markedness and faithfulness in the production of an output form from an underlying input form.

1 The choice of a phonemic representation is motivated by the fact that most of the data available comes in this form. Diacritics are available in a smaller number of languages and may vary across dialects, so we discarded them in this work.
3.3 Parameter sharing

Data sparsity is a significant challenge in protolanguage reconstruction. While the experiments we present here use an order of magnitude more languages than previous computational approaches, the increase in observed data also brings with it additional unknowns in the form of intermediate protolanguages. Since there is one set of parameters for each language, adding more data is not sufficient for increasing the quality of the reconstruction: we show in Section 5.2 that adding extra languages can actually hurt reconstruction using previous methods. It is therefore important to share parameters across different branches in the tree in order to benefit from having observations from more languages.

As an example of useful parameter sharing, consider the faithfulness features 1[/p/ → /b/] and 1[/p/ → /r/], which are indicator functions for the appearance of two substitutions for /p/. We would like the model to learn that the former event (a simple voicing change) should be preferred over the latter. In Bouchard-Côté et al. (2008), this has to be learned for each branch in the tree. The difficulty is that not all branches will have enough information to learn this preference, meaning that we need to define the model in such a way that it can generalize across languages.

We used the following technique to address this problem: we augment the sufficient statistics of Bouchard-Côté et al. (2008) to include the current language ℓ (the language at the bottom of the current branch) and use a single, global weight vector instead of a set of branch-specific weights. Generalization across branches is then achieved by using features that ignore ℓ, while branch-specific features depend on ℓ. For instance, in Figure 1 (d), 1[N → n] is an example of a universal (global) feature shared across all branches, while 1[N → n@Kw] is branch-specific. Similarly, all of the features in OPERATION, MARKEDNESS and FAITHFULNESS have universal and branch-specific versions.

3.4 Objective function

Concretely, the transition probabilities of the mutation and root generation are given by:

  θ_{ω,t,p,ℓ}(σ) = exp{⟨λ, f(ω, t, p, ℓ, σ)⟩} × µ(ω, t, σ) / Z(ω, t, p, ℓ, λ),

where σ ∈ S, f : {S, I, R} × Σ × Σ × L × S → R^k is the sufficient statistics or feature function, ⟨·, ·⟩ denotes the inner product, and λ ∈ R^k is a weight vector. Here, k is the dimensionality of the feature space of the log-linear model. In the terminology of exponential families, Z and µ are the normalization function and the reference measure, respectively:

  Z(ω, t, p, ℓ, λ) = Σ_{σ ∈ S} exp{⟨λ, f(ω, t, p, ℓ, σ)⟩},

and µ(ω, t, σ) ∈ {0, 1} takes the value 0 for disallowed boundary configurations (for example, when ω = S and t = # but σ ≠ #) and 1 otherwise. Here, µ is used to handle boundary conditions.

We will also need the following notation: let P(·) and P(·|·) denote the root and branch probability models described in Section 3.1 (with transition probabilities given by the above log-linear model); I(c), the set of internal (non-leaf) nodes in τ(c); pa(ℓ), the parent of language ℓ; r(c), the root of τ(c); and W(c) = (Σ*)^{|I(c)|}. We can summarize our objective function as follows:

  Σ_{c=1}^{C} log Σ_{w ∈ W(c)} P(w_{c,r(c)}) Π_{ℓ ∈ I(c)} P(w_{c,ℓ} | w_{c,pa(ℓ)}) − ||λ||₂² / (2σ²).

The second term is a standard L2 regularization penalty (we used σ² = 1).

4 Learning algorithm

Learning is done using a Monte Carlo variant of the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The M step is convex and computed using L-BFGS (Liu et al., 1989), but the E step is intractable (Lunter et al., 2003), so we used a Markov chain Monte Carlo (MCMC) approximation (Tierney, 1994). At E step t = 1, 2, . . ., we simulated the chain for O(t) iterations; this regime is necessary for convergence (Jank, 2005).

In the E step, the inference problem is to compute an expectation under the posterior over strings in a protolanguage given observed word forms at the leaves of the tree. The typical approach in biology or historical linguistics (Holmes and Bruno, 2001; Bouchard-Côté et al., 2008) is to use Gibbs sampling, where the entire string at a single node in the tree is sampled, conditioned on its parent and children. This sampling domain is shown in Figure 1 (e), where the middle word is completely resampled but adjacent words are fixed. We will call this method Single Sequence Resampling (SSR). While conceptually simple, this approach suffers from problems in large trees (Holmes and Bruno, 2001). Consequently, we use a different MCMC procedure, called Ancestry Resampling (AR), that alleviates the mixing problems (Figure 1 (f)). This method was originally introduced for biological applications (Bouchard-Côté et al., 2009), but commonalities between the biological and linguistic cases make it possible to use it in our model.

Concretely, the problem with SSR arises when the tree under consideration is large or unbalanced. In this case, it can take a long time for information from the observed languages to propagate to the root of the tree. Indeed, samples at the root will initially be independent of the observations. AR addresses this problem by resampling one thin vertical slice of all sequences at a time, called an ancestry. For the precise definition, see Bouchard-Côté et al. (2009). Slices condition on observed data, avoiding the problems mentioned above, and can propagate information rapidly across the tree.

5 Experiments

We performed a comprehensive set of experiments to test the new method for reconstruction outlined above. In Section 5.1, we analyze in isolation the effects of varying the set of features, the number of observed languages, the topology, and the number of iterations of EM. In Section 5.2, we compare performance to an oracle and to three other systems. Evaluation of all methods was done by computing the Levenshtein distance (Levenshtein, 1966) between the reconstruction produced by each method and the reconstruction produced by linguists. We averaged this distance across reconstructed words to report a single number for each method. We show in Table 2 the average word length in each corpus; note that the Latin average is much larger, which partly explains the higher error on the Romance dataset. The statistical significance of all performance differences is assessed using a paired t-test with a significance level of 0.05.
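For reference, the evaluation metric is plain edit distance averaged over the reconstructed word forms; a self-contained version of the computation (our own illustration, with hypothetical argument names) is:

```python
def edit_distance(a, b):
    """Levenshtein distance with unit insertion, deletion and substitution costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[-1]

def mean_error(predictions, gold):
    """Average edit distance between predicted and linguist reconstructions."""
    pairs = list(zip(predictions, gold))
    return sum(edit_distance(p, g) for p, g in pairs) / len(pairs)

print(edit_distance("tanis", "tani"))   # 1
```

The ORACLE and CENTROID baselines discussed below are defined with respect to the same distance.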
5.1 Evaluating system performance

We used the Austronesian Basic Vocabulary Database (Greenhill et al., 2008) as the basis for a series of experiments used to evaluate the performance of our system and the factors relevant to its success. The database includes partial cognacy judgments and IPA transcriptions, as well as a few reconstructed protolanguages; as of November 2008 it contained lexical items from 587 languages in the Austronesian language family. A reconstruction of Proto-Oceanic (POc) originally developed by Blust (1993) using the comparative method was the basis for evaluation. We used the cognate information provided in the database, automatically constructing a global tree2 and a set of subtrees from the cognate set indicator matrix M(ℓ, c) = 1[ℓ ∈ L(c)], c ∈ {1, . . . , C}, ℓ ∈ L.

2 The dataset included a tree, but it was out of date as of November 2008 (Greenhill et al., 2008).

For constructing the global tree, we used the implementation of neighbor joining in the Phylip package (Felsenstein, 1989). We used a distance based on cognate overlap, d_c(ℓ1, ℓ2) = Σ_{c=1}^{C} M(ℓ1, c) M(ℓ2, c). We bootstrapped 1000 samples and formed an accurate (90%) consensus tree. The tree obtained is not binary, but the AR inference algorithm scales linearly in the branching factor of the tree (in contrast, SSR scales exponentially (Lunter et al., 2003)).

Figure 2: Left: Mean distance to the target reconstruction of POc as a function of the number of modern languages used by the inference procedure. Right: Mean distance and confidence intervals as a function of the EM iteration, averaged over 20 random seeds and run on 4 languages.

The first claim we verified experimentally is that having more observed languages aids reconstruction of protolanguages. To test this hypothesis we added observed modern languages in increasing order of distance d_c to the target reconstruction of POc, so that the languages that are most useful for POc reconstruction are added first. This prevents the effect of adding a close language after several distant ones from being confused with an improvement produced by increasing the number of languages. The results are reported in Figure 2 (a). They confirm that large-scale inference is desirable for automatic protolanguage reconstruction: reconstruction improved statistically significantly with each increase except from 32 to 64 languages, where the average edit distance improvement was 0.05.

We then conducted a number of experiments intended to assess the robustness of the system and to identify the contribution made by the different factors it incorporates. First, we ran the system with 20 different random seeds to assess the stability of the solutions found. In each case, learning was stable and accuracy improved during training; see Figure 2 (b). Next, we found that all of the following ablations significantly hurt reconstruction: using a flat tree (in which all languages are equidistant from the reconstructed root and from each other) instead of the consensus tree, dropping the markedness features, dropping the faithfulness features, and disabling sharing across branches. The results of these experiments are shown in Table 1.

Table 1: Effects of ablation of various aspects of our unsupervised system on mean edit distance to POc. -Sharing corresponds to the restriction to the subset of the features in OPERATION, FAITHFULNESS and MARKEDNESS that are branch-specific; -Topology corresponds to using a flat topology where the only edges in the tree connect modern languages to POc. The semi-supervised system is described in the text. All differences (compared to the unsupervised full system) are statistically significant.

Condition                  Edit dist.
Unsupervised full system   1.87
-FAITHFULNESS              2.02
-MARKEDNESS                2.18
-Sharing                   1.99
-Topology                  2.06
Semi-supervised system     1.75

For comparison, we also included in the same table the performance of a semi-supervised system trained by K-fold validation. The system was run K = 5 times, with 1 − 1/K of the POc words given to the system as observations in the graphical model for each run.
It is semi-supervised in the sense that gold reconstruction for many internal nodes are not available in the dataset (for example the common ancestor of Kwara'ae (Kw.) and Lau in Figure 3 (b)), so they are still not filled.3 Figure 3 (b) shows the results of a concrete run over 32 languages, zooming in to a pair of the Solomonic languages and the cognate set from Figure 1 (a). In the example shown, the reconstruction is as good as the ORACLE (described in Section 5.2), though off by one character (the final /s/ is not present in any of the 32 inputs and therefore is not reconstructed). In (a), diagrams show, for both the global and the local (Kwara'ae) features, the expectations of each substitution superimposed on an IPA sound chart, as well as a list of the top changes. Darker lines indicate higher counts. This run did not use natural class constraints, but it can We also tried a fully supervised system where a flat topology is used so that all of these latent internal nodes are avoided; but it did not perform as well--this is consistent with the -Topology experiment of Table 1. 3 be seen that linguistically plausible substitutions are learned. The global features prefer a range of voicing changes, manner changes, adjacent vowel motion, and so on, including mutations like /s/ to /h/ which are common but poorly represented in a naive attribute-based natural class scheme. On the other hand, the features local to the language Kwara'ae pick out the subset of these changes which are active in that branch, such as /s//t/ fortition. 5.2 Comparisons against other methods Comparison Protolanguage Heldout (prop.) Modern languages Cognate sets Observed words Mean word length CENTROID PRAGUE BCLKG POc 243 (1.0) 70 1321 10783 4.5 PMJ 79 (1.0) 4 179 470 5.0 La 293 (0.5) 2 583 1463 7.4 Table 2: Experimental setup: number of held-out protoword from (absolute and relative), of modern languages, cognate sets and total observed words. The split for BCLKG is the same as in Bouchard-C^ t´ et al. (2008). oe ACLE . This is superior to picking a single closest language to be used for all word forms, but it is possible for systems to perform better than the oracle since it has to return one of the observed word forms. We performed the comparison against Oakes (2000) and Bouchard-C^ t´ et al. (2008) on the same oe dataset and experimental conditions as those used in the respective papers (see Table 2). Note that the setup of Bouchard-C^ t´ et al. (2008) provides superoe vision (half of the Latin word forms are provided); all of the other comparisons are performed in a completely unsupervised manner. The PMJ dataset was compiled by Nothofer (1975), who also reconstructed the corresponding protolanguage. Since PRAGUE is not guaranteed to return a reconstruction for each cognate set, only 55 word forms could be directly compared to our system. We restricted comparison to this subset of the data. This favors PRAGUE since the system only proposes a reconstruction when it is certain. Still, our system outperformed PRAGUE, with an average distance of 1.60 compared to 2.02 for PRAGUE. The difference is marginally significant, p = 0.06, partly due to the small number of word forms involved. We also exceeded the performance of BCLKG on the Romance dataset. Our system's reconstruction had an edit distance of 3.02 to the truth against 3.10 for BCLKG. However, this difference was not significant (p = 0.15). 
We think this is because of the high level of noise in the data (the Romance dataset is the only dataset we consider that was automatically constructed rather than curated by linguists). A second factor contributing to this small difference may be that the the experimental setup of BCLKG used very few languages, while the performance of our system improves markedly with more languages. The first two competing methods, PRAGUE and BCLKG , are described in Oakes (2000) and Bouchard-C^ t´ et al. (2008) respectively and sumoe marized in Section 1. Neither approach scales well to large datasets. In the first case, the bottleneck is the complexity of computing multi-alignments without guide trees and the vanishing probability that independent reconstructions agree. In the second case, the problem comes from the unregularized proliferation of parameters and slow mixing of the inference algorithm. For this reason, we built a third baseline that scales well in large datasets. This third baseline, CENTROID, computes the centroid of the observed word forms in Levenshtein distance. Let L(x, y) denote the Levenshtein distance between word forms x and y. Ideally, we would like the baseline to return argminx yO L(x, y), where O = {y1 , . . . , y|O| } is the set of observed word forms. Note that the optimum is not changed if we restrict the minimization to be taken on x (O) such that m |x| M where m = mini |yi |, M = maxi |yi | and (O) is the set of characters occurring in O. Even with this restriction, this optimization is intractable. As an approximation, we considered only strings built by at most k contiguous substrings taken from the word forms in O. If k = 1, then it is equivalent to taking the min over x O. At the other end of the spectrum, if k = M , it is exact. This scheme is exponential in k, but since words are relatively short, we found that k = 2 often finds the same solution as higher values of k. The difference was in all the cases not statistically significant, so we report the approximation k = 2 in what follows. We also compared against an oracle, denoted OR ACLE , which returns argminyO L(y, x ), where x is the target reconstruction. We will denote it by OR 71 Universal a - e Universal (a) s - a - k - l - r - s - l - r Snd m pb =# f v & m pb =# f v & !C 8 h e g r l h 8 n t d !C sz < r n > t d sz < r > ) 2 6 % A ) 2 *1 ; 6 @ % A *1 ; 7 cB ç, j 7 cB ç, j ? kg x4 9 ? kg x4 9 : q3 ' .0 $ +" / h5 (b) /ta!i/ (POc) (c) .... PMJ Jv Mad Mal (d) POc (e) It La Es Pt Kwa k - g r - l Kwa g - k s - t N - n e i g o - k a N - n ( : q3 ' .0 $ +" / /a!i/ Nggela Bugotu Tape Avava Neveei Naman Nese SantaAna Nahavaq Nati KwaraaeSol Lau Kwamera Tolo Marshalles PuloAnna ChuukeseAK SaipanCaro Puluwatese Woleaian PuloAnnan Carolinian Woleai Chuukese Nauna PaameseSou Anuta VaeakauTau Takuu Tokelau Tongan Samoan IfiraMeleM Tikopia Tuvalu Niue FutunaEast UveaEast Rennellese Emae Kapingamar Sikaiana Nukuoro Luangiua Hawaiian Marquesan Tahitianth Rurutuan Maori Tuamotu Mangareva Rarotongan Penrhyn RapanuiEas Pukapuka Mwotlap Mota FijianBau Namakir Nguna ArakiSouth Saa Raga PeteraraMa h5 /angi/ (Kw.) ( /a!i/ (Lau) Figure 3: (a) A visualization of two learned faithfulness parameters: on the top, from the universal features, on the bottom, for one particular branch. Each pair of phonemes have a link with grayscale value proportional to the expectation of a transition between them. The five strongest links are also included at the right. (b) A sample taken from our POc experiments (see text). 
(c-e) Phylogenetic trees for three language families: Proto-Malayo-Javanic, Austronesian and Romance. o - a e - i s - t @ We conducted another experiment to verify this by running both systems in larger trees. Because the 1 Romance dataset had only three modern languages transcribed in IPA, we used the Austronesian dataset 1 to perform the test. The results were all significant in this setup: while our method went from an edit distance of 2.01 to 1.79 in the 4-to-8 languages experiment described in Section 5.1, BCLKG went from 3.30 to 3.38. This suggests that more languages can actually hurt systems that do not support parameter sharing. Since we have shown evidence that PRAGUE and BCLKG do not scale well to large datasets, we also compared against ORACLE and CENTROID in a large-scale setting. Specifically, we compare to the experimental setup on 64 modern languages used to reconstruct POc described before. Encouragingly, while the system's average distance (1.49) does not attain that of the ORACLE (1.13), we significantly outperform the CENTROID baseline (1.79). 5.3 Incorporating prior linguistic knowledge resentation of Kondrak (2000). We compared the performance of the system with and without STRUCT- FAITHFULNESS to check if the algorithm can recover the structure of natural classes in an unsupervised fashion. We found that with 2 or 4 observed languages, FAITHFULNESS underperformed STRUCT- FAITHFULNESS, but for larger trees, the difference was not significant. FAITH FULNESS even slightly outperformed its structured cousin with 16 observed languages. 6 Conclusion The model also supports the addition of prior linguistic knowledge. This takes the form of feature templates with more internal structure. We performed experiments with an additional feature template: STRUCT-FAITHFULNESS is a structured version of FAITHFULNESS , replacing x and y with their natural classes N (x) and N (y) where indexes types of classes, ranging over {manner, place, phonation, isOral, isCentral, height, backness, roundedness}. This feature set is reminiscent of the featurized rep72 By enriching our model to include important features like markedness, and by scaling up to much larger data sets than were previously possible, we obtained substantial improvements in reconstruction quality, giving the best results on past data sets. While many more complex phenomena are still unmodeled, from reduplication to borrowing to chained sound shifts, the current approach significantly increases the power, accuracy, and efficiency of automatic reconstruction. Acknowledgments We would like to thank Anna Rafferty and our reviewers for their comments. This work was supported by a NSERC fellowship to the first author and NSF grant number BCS-0631518 to the second author. References R. Blust. 1993. Central and central-Eastern MalayoPolynesian. Oceanic Linguistics, 32:241­293. A. Bouchard-C^ t´ , P. Liang, D. Klein, and T. L. Griffiths. oe 2008. A probabilistic approach to language change. In Advances in Neural Information Processing Systems 20. A. Bouchard-C^ t´ , M. I. Jordan, and D. Klein. 2009. oe Efficient inference in phylogenetic InDel trees. In Advances in Neural Information Processing Systems 21. L. Campbell. 1998. Historical Linguistics. The MIT Press. S. F. Chen. 2003. Conditional and joint models for grapheme-to-phoneme conversion. In Proceedings of Eurospeech. M. A. Covington. 1998. Alignment of multiple languages for historical comparison. In Proceedings of ACL 1998. A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. 
Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38.
M. Dreyer, J. R. Smith, and J. Eisner. 2008. Latent-variable modeling of string transductions with finite-state methods. In Proceedings of EMNLP 2008.
S. P. Durham and D. E. Rogers. 1969. An application of computer programming to the reconstruction of a proto-language. In Proceedings of the 1969 Conference on Computational Linguistics.
C. L. Eastlack. 1977. Iberochange: A program to simulate systematic sound change in Ibero-Romance. Computers and the Humanities.
J. Felsenstein. 1989. PHYLIP - PHYLogeny Inference Package (Version 3.2). Cladistics, 5:164–166.
S. Goldwater and M. Johnson. 2003. Learning OT constraint rankings using a maximum entropy model. In Proceedings of the Workshop on Variation within Optimality Theory.
S. J. Greenhill, R. Blust, and R. D. Gray. 2008. The Austronesian basic vocabulary database: From bioinformatics to lexomics. Evolutionary Bioinformatics, 4:271–283.
H. H. Hock. 1986. Principles of Historical Linguistics. Walter de Gruyter.
I. Holmes and W. J. Bruno. 2001. Evolutionary HMM: a Bayesian approach to multiple alignment. Bioinformatics, 17:803–820.
W. Jank. 2005. Stochastic variants of EM: Monte Carlo, quasi-Monte Carlo and more. In Proceedings of the American Statistical Association.
G. Kondrak. 2000. A new algorithm for the alignment of phonetic sequences. In Proceedings of NAACL 2000.
G. Kondrak. 2002. Algorithms for Language Reconstruction. Ph.D. thesis, University of Toronto.
V. I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10, February.
D. C. Liu, J. Nocedal, and C. Dong. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503–528.
J. B. Lowe and M. Mazaudon. 1994. The reconstruction engine: a computer implementation of the comparative method. Computational Linguistics, 20(3):381–417.
G. A. Lunter, I. Miklós, Y. S. Song, and J. Hein. 2003. An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. Journal of Computational Biology, 10:869–889.
B. Nothofer. 1975. The reconstruction of Proto-Malayo-Javanic. M. Nijhoff.
M. P. Oakes. 2000. Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. Journal of Quantitative Linguistics, 7(3):233–244.
A. Prince and P. Smolensky. 1993. Optimality theory: Constraint interaction in generative grammar. Technical Report 2, Rutgers University Center for Cognitive Science.
L. Tierney. 1994. Markov chains for exploring posterior distributions. The Annals of Statistics, 22(4):1701–1728.
A. Varadarajan, R. K. Bradley, and I. H. Holmes. 2008. Tools for simulating evolution of aligned genomic regions with integrated parameter estimation. Genome Biology, 9:R147.
C. Wilson. 2006. Learning phonology with substantive bias: An experimental and computational study of velar palatalization. Cognitive Science, 30(5):945–982.
Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction
Shay B. Cohen and Noah A. Smith
Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
{scohen,nasmith}@cs.cmu.edu
Abstract
We present a family of priors over probabilistic grammar weights, called the shared logistic normal distribution.
This family extends the partitioned logistic normal distribution, enabling factored covariance between the probabilities of different derivation events in the probabilistic grammar, providing a new way to encode prior knowledge about an unknown grammar. We describe a variational EM algorithm for learning a probabilistic grammar based on this family of priors. We then experiment with unsupervised dependency grammar induction and show significant improvements using our model for both monolingual learning and bilingual learning with a non-parallel, multilingual corpus. 1 Introduction Probabilistic grammars have become an important tool in natural language processing. They are most commonly used for parsing and linguistic analysis (Charniak and Johnson, 2005; Collins, 2003), but are now commonly seen in applications like machine translation (Wu, 1997) and question answering (Wang et al., 2007). An attractive property of probabilistic grammars is that they permit the use of well-understood parameter estimation methods for learning--both from labeled and unlabeled data. Here we tackle the unsupervised grammar learning problem, specifically for unlexicalized context-free dependency grammars, using an empirical Bayesian approach with a novel family of priors. There has been an increased interest recently in employing Bayesian modeling for probabilistic grammars in different settings, ranging from putting priors over grammar probabilities (Johnson et al., 74 2007) to putting non-parametric priors over derivations (Johnson et al., 2006) to learning the set of states in a grammar (Finkel et al., 2007; Liang et al., 2007). Bayesian methods offer an elegant framework for combining prior knowledge with data. The main challenge in Bayesian grammar learning is efficiently approximating probabilistic inference, which is generally intractable. Most commonly variational (Johnson, 2007; Kurihara and Sato, 2006) or sampling techniques are applied (Johnson et al., 2006). Because probabilistic grammars are built out of multinomial distributions, the Dirichlet family (or, more precisely, a collection of Dirichlets) is a natural candidate for probabilistic grammars because of its conjugacy to the multinomial family. Conjugacy implies a clean form for the posterior distribution over grammar probabilities (given the data and the prior), bestowing computational tractability. Following work by Blei and Lafferty (2006) for topic models, Cohen et al. (2008) proposed an alternative to Dirichlet priors for probabilistic grammars, based on the logistic normal (LN) distribution over the probability simplex. Cohen et al. used this prior to softly tie grammar weights through the covariance parameters of the LN. The prior encodes information about which grammar rules' weights are likely to covary, a more intuitive and expressive representation of knowledge than offered by Dirichlet distributions.1 The contribution of this paper is two-fold. First, from the modeling perspective, we present a generalization of the LN prior of Cohen et al. (2008), showing how to extend the use of the LN prior to Although the task, underlying model, and weights being tied were different, Eisner (2002) also showed evidence for the efficacy of parameter tying in grammar learning. 1 Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 74­82, Boulder, Colorado, June 2009. 
tie between any grammar weights in a probabilistic grammar (instead of only allowing weights within the same multinomial distribution to covary). Second, from the experimental perspective, we show how such flexibility in parameter tying can help in unsupervised grammar learning in the well-known monolingual setting and in a new bilingual setting where grammars for two languages are learned at once (without parallel corpora). Our method is based on a distribution which we call the shared logistic normal distribution, which is a distribution over a collection of multinomials from different probability simplexes. We provide a variational EM algorithm for inference. The rest of this paper is organized as follows. In §2, we give a brief explanation of probabilistic grammars and introduce some notation for the specific type of dependency grammar used in this paper, due to Klein and Manning (2004). In §3, we present our model and a variational inference algorithm for it. In §4, we report on experiments for both monolingual settings and a bilingual setting and discuss them. We discuss future work (§5) and conclude in §6.
2 Probabilistic Grammars and Dependency Grammar Induction
A probabilistic grammar defines a probability distribution over grammatical derivations generated through a step-by-step process. HMMs, for example, can be understood as a random walk through a probabilistic finite-state network, with an output symbol sampled at each state. Each "step" of the walk and each symbol emission corresponds to one derivation step. PCFGs generate phrase-structure trees by recursively rewriting nonterminal symbols as sequences of "child" symbols (each itself either a nonterminal symbol or a terminal symbol analogous to the emissions of an HMM). Each step or emission of an HMM and each rewriting operation of a PCFG is conditionally independent of the other rewriting operations given a single structural element (one HMM or PCFG state); this Markov property permits efficient inference for the probability distribution defined by the probabilistic grammar. In general, a probabilistic grammar defines the joint probability of a string x and a grammatical derivation y:
p(x, y \mid \theta) = \prod_{k=1}^{K} \prod_{i=1}^{N_k} \theta_{k,i}^{f_{k,i}(x,y)} = \exp \sum_{k=1}^{K} \sum_{i=1}^{N_k} f_{k,i}(x, y) \log \theta_{k,i}    (1)
where f_{k,i} is a function that "counts" the number of times the kth distribution's ith event occurs in the derivation. The θ are a collection of K multinomials θ_1, ..., θ_K, the kth of which includes N_k events. Note that there may be many derivations y for a given string x--perhaps even infinitely many in some kinds of grammars.
2.1 Dependency Model with Valence
HMMs and PCFGs are the best-known probabilistic grammars, but there are many others. In this paper, we use the "dependency model with valence" (DMV), due to Klein and Manning (2004). DMV defines a probabilistic grammar for unlabeled, projective dependency structures. Klein and Manning (2004) achieved their best results with a combination of DMV with a model known as the "constituent-context model" (CCM). We do not experiment with CCM in this paper, because it does not fit directly in a Bayesian setting (it is highly deficient) and because state-of-the-art unsupervised dependency parsing results have been achieved with DMV alone (Smith, 2006). Using the notation above, DMV defines x = x_1, x_2, ..., x_n to be a sentence. x_0 is a special "wall" symbol, $, on the left of every sentence. A tree y is defined by a pair of functions y_left and y_right (both {0, 1, 2, ..., n} → 2^{1,2,...,n}) that map each word to its sets of left and right dependents, respectively. Here, the graph is constrained to be a projective tree rooted at x_0 = $: each word except $ has a single parent, and there are no cycles or crossing dependencies. y_left(0) is taken to be empty, and y_right(0) contains the sentence's single head. Let y(i) denote the subtree rooted at position i. The probability P(y(i) | x_i, θ) of generating this subtree, given its head word x_i, is defined recursively, as described in Fig. 1 (Eq. 2). The probability of the entire tree is given by p(x, y | θ) = P(y(0) | $, θ). The θ are the multinomial distributions θ_s(· | ·, ·, ·) and θ_c(· | ·, ·). To follow the general setting of Eq. 1, we index these distributions as θ_1, ..., θ_K. Headden et al. (2009) extended DMV so that the distributions θ_c condition on the valence as well, with smoothing, and showed significant improvements for short sentences. Our experiments found that these improvements do not hold on longer sentences. Here we experiment only with DMV, but note that our techniques are also applicable to richer probabilistic grammars like that of Headden et al.
P(y(i) \mid x_i, \theta) = \prod_{D \in \{left, right\}} \theta_s(stop \mid x_i, D, [y_D(i) = \emptyset]) \prod_{j \in y_D(i)} \theta_s(\neg stop \mid x_i, D, first_y(j)) \times \theta_c(x_j \mid x_i, D) \times P(y(j) \mid x_j, \theta)    (2)
Figure 1: The "dependency model with valence" recursive equation. first_y(j) is a predicate defined to be true iff x_j is the closest child (on either side) to its parent x_i. The probability of the tree is p(x, y | θ) = P(y(0) | $, θ).
2.2 Learning DMV
Klein and Manning (2004) learned the DMV probabilities from a corpus of part-of-speech-tagged sentences using the EM algorithm. EM manipulates θ to locally optimize the likelihood of the observed portion of the data (here, x), marginalizing out the hidden portions (here, y). The likelihood surface is not globally concave, so EM only locally optimizes the surface. Klein and Manning's initialization, though reasonable and language-independent, was an important factor in performance. Various alternatives to EM were explored by Smith (2006), achieving substantially more accurate parsing models by altering the objective function. Smith's methods did require substantial hyperparameter tuning, and the best results were obtained using small annotated development sets to choose hyperparameters. In this paper, we consider only fully unsupervised methods, though the Bayesian ideas explored here might be merged with the biasing approaches of Smith (2006) for further benefit.
3 Parameter Tying in the Bayesian Setting
As stated above, θ comprises a collection of multinomials that weights the grammar. Taking the Bayesian approach, we wish to place a prior on those multinomials, and the Dirichlet family is a natural candidate for such a prior because of its conjugacy, which makes inference algorithms easier to derive. For example, if we make a "mean-field assumption," with respect to hidden structure and weights, the variational algorithm for approximately inferring the distribution over θ and trees y resembles the traditional EM algorithm very closely (Johnson, 2007). In fact, variational inference in this case takes an action similar to smoothing the counts using the exp-Ψ function during the E-step. Variational inference can be embedded in an empirical Bayes setting, in which we optimize the variational bound with respect to the hyperparameters as well, repeating the process until convergence.
3.1 Logistic Normal Distributions
While Dirichlet priors over grammar probabilities make learning algorithms easy, they are limiting. In particular, as noted by Blei and Lafferty (2006), there is no explicit flexible way for the Dirichlet's parameters to encode beliefs about covariance between the probabilities of two events. To illustrate this point, we describe how a multinomial θ of dimension d is generated from a Dirichlet distribution with parameters α = ⟨α_1, ..., α_d⟩:
1. Generate η_j ~ Γ(α_j, 1) independently for j ∈ {1, ..., d}.
2. θ_j ← η_j / Σ_i η_i.
where Γ(α, 1) is a Gamma distribution with shape α and scale 1. Correlation among θ_i and θ_j, i ≠ j, cannot be modeled directly, only through the normalization in step 2. In contrast, LN distributions (Aitchison, 1986) provide a natural way to model such correlation. The LN draws a multinomial θ as follows:
1. Generate η ~ Normal(µ, Σ).
2. θ_j ← exp(η_j) / Σ_i exp(η_i).
Figure 2: An example of a shared logistic normal distribution, illustrating Def. 1. N = 4 experts are used to sample K = 3 multinomials; L_1 = 3, L_2 = 2, L_3 = 2, L_4 = 1, ℓ_1 = 9, ℓ_2 = 6, ℓ_3 = 7, ℓ_4 = 2, N_1 = 2, N_2 = 4, and N_3 = 3. This figure is best viewed in color.
Blei and Lafferty (2006) defined correlated topic models by replacing the Dirichlet in latent Dirichlet allocation models (Blei et al., 2003) with a LN distribution. Cohen et al. (2008) compared Dirichlet and LN distributions for learning DMV using empirical Bayes, finding substantial improvements for English using the latter. In that work, we obtained improvements even without specifying exactly which grammar probabilities covaried. While empirical Bayes learning permits these covariances to be discovered without supervision, we found that by initializing the covariance to encode beliefs about which grammar probabilities should covary, further improvements were possible. Specifically, we grouped the Penn Treebank part-of-speech tags into coarse groups based on the treebank annotation guidelines and biased the initial covariance matrix for each child distribution θ_c(· | ·, ·) so that the probabilities of child tags from the same coarse group covaried. For example, the probability that a past-tense verb (VBD) has a singular noun (NN) as a right child may be correlated with the probability that it has a plural noun (NNS) as a right child. Hence linguistic knowledge--specifically, a coarse grouping of word classes--can be encoded in the prior. A per-distribution LN distribution only permits probabilities within a multinomial to covary. We will generalize the LN to permit covariance among any probabilities in θ, throughout the model. For example, the probability of a past-tense verb (VBD) having a noun as a right child might correlate with the probability that other kinds of verbs (VBZ, VBN, etc.) have a noun as a right child.
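To make the contrast above concrete, the following toy sketch (our own illustration in NumPy, with made-up µ and Σ values; it is not code from the paper) draws a three-event multinomial first from a Dirichlet and then from a logistic normal whose covariance ties the first two event probabilities together:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_dirichlet(alpha):
    # Gamma-then-normalize recipe described in Section 3.1.
    eta = rng.gamma(shape=alpha, scale=1.0)
    return eta / eta.sum()

def draw_logistic_normal(mu, sigma):
    # Draw a Gaussian vector, then map it onto the simplex with a softmax.
    eta = rng.multivariate_normal(mu, sigma)
    e = np.exp(eta - eta.max())  # subtract the max for numerical stability
    return e / e.sum()

# Toy three-event multinomial: the 0.8 off-diagonal entries tie events 0 and 1,
# so their probabilities tend to rise and fall together across draws.
mu = np.zeros(3)
sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

print(draw_dirichlet(np.ones(3)))
print(draw_logistic_normal(mu, sigma))
```

Across repeated LN draws, the first two event probabilities move together, which is exactly the kind of belief the Dirichlet recipe above cannot encode directly.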
The partitioned logistic normal distribution (PLN) is a generalization of the LN distribution that takes the first step towards our goal (Aitchison, 1986). Generating from PLN involves drawing a random vector from a multivariate normal distribution, but the logistic transformation is applied to different parts of the vector, leading to sampled multinomial distributions of the required lengths from different probability simplices. This is in principle what is required for arbitrary covariance between grammar probabilities, except that DMV has O(t2 ) weights for a part-of-speech vocabulary of size t, requiring a very large multivariate normal distribution with O(t4 ) covariance parameters. 3.2 Shared Logistic Normal Distributions To solve this problem, we suggest a refinement of the class of PLN distributions. Instead of using a single normal vector for all of the multinomials, we use several normal vectors, partition each one and then recombine parts which correspond to the same multinomial, as a mixture. Next, we apply the logisitic transformation on the mixed vectors (each of which is normally distributed as well). Fig. 2 gives an example of a non-trivial case of using a SLN distribution, where three multinomials are generated from four normal experts. We now formalize this notion. For a natural number N , we denote by 1:N the set {1, ..., N }. For a vector in v RN and a set I 1:N , we denote by vI to be the vector created from v by using the coordinates in I. Recall that K is the number of multinomials in the probabilistic grammar, and Nk is the number of events in the kth multinomial. Definition 1. We define a shared logistic normal distribution with N "experts" over a collection of K multinomial distributions. Let n Normal(µn , n ) be a set of multivariate normal variables for n 1:N , where the length of n Ln is denoted n . Let In = {In,j }j=1 be a partition of 1: n into Ln sets, such that Ln In,j = j=1 1: n and In,j In,j = for j = j . Let Jk for k 1:K be a collection of (disjoint) subsets of {In,j | n 1:N, j 1: n , |In,j | = Nk }, such that all sets in Jk are of the same size, ~ Nk . Let k = |J1k | In,j Jk n,In,j , and k,i = exp(~k,i ) i exp(~k,i ) . We then say distributes according to the shared logistic normal distribution with partition structure S = ({In }N , {Jk }K ) n=1 k=1 and normal experts {(µn , n )}N and denote it by n=1 SLN(µ, , S). The partitioned LN distribution in Aitchison (1986) can be formulated as a shared LN distribution where N = 1. The LN collection used by Cohen et al. (2008) is the special case where N = K, each Ln = 1, each k = Nk , and each Jk = {Ik,1 }. The covariance among arbitrary k,i is not defined directly; it is implied by the definition of the normal experts n,In,j , for each In,j Jk . We note that a SLN can be represented as a PLN by relying on the distributivity of the covariance operator, and merging all the partition structure into one (perhaps 78 sparse) covariance matrix. However, if we are interested in keeping a factored structure on the covariance matrices which generate the grammar weights, we cannot represent every SLN as a PLN. It is convenient to think of each i,j as a weight associated with a unique event's probability, a certain outcome of a certain multinomial in the probabilistic grammar. By letting different i,j covary with each other, we loosen the relationships among k,j and permit the model--at least in principle-- to learn patterns from the data. Def. 
1 also implies that we multiply several multinomials together in a product-of-experts style (Hinton, 1999), because the exponential of a mixture of normals becomes a product of (unnormalized) probabilities. Our extension to the model in Cohen et al. (2008) follows naturally after we have defined the shared LN distribution. The generative story for this model is as follows: 1. Generate SLN(µ, , S), where is a collection of vectors k , k = 1, ..., K. 2. Generate x and y from p(x, y | ) (i.e., sample from the probabilistic grammar). 3.3 Inference In this work, the partition structure S is known, the sentences x are observed, the trees y and the grammar weights are hidden, and the parameters of the shared LN distribution µ and are learned.2 Our inference algorithm aims to find the posterior over the grammar probabilities and the hidden structures (grammar trees y). To do that, we use variational approximation techniques (Jordan et al., 1999), which treat the problem of finding the posterior as an optimization problem aimed to find the best approximation q(, y) of the posterior p(, y | x, µ, , S). The posterior q needs to be constrained to be within a family of tractable and manageable distributions, yet rich enough to represent good approximations of the true posterior. "Best approximation" is defined as the KL divergence between q(, y) and p(, y | x, µ, , S). Our variational inference algorithm uses a meanfield assumption: q(, y) = q()q(y). The distribution q() is assumed to be a LN distribution with 2 In future work, we might aim to learn S. log p(x | µ, , S) ~ fk,i ~ k,i µC ~k (~k )2 C y N n=1 Eq [log p( k | µk , k )] + B K k=1 Nk ~ ~ i=1 fk,i k,i + H(q) (3) (4) q(y)fk,i (x, y) 1 ~ k Nk i =1 exp ~ µC - log k + 1 - ~k,i 1 |Jk | 1 |Jk |2 Ir,j Jk µC + ~k,i (~k,i )2 C 2 (5) (6) (7) µr,Ir,j ~ r,Ir,j ~2 Ir,j Jk Figure 3: Variational inference bound. Eq. 3 is the bound itself, using notation defined in Eqs. 4­7 for clarity. Eq. 4 defines expected counts of the grammar events under the variational distribution q(y), calculated using dynamic programming. Eq. 5 describes the weights for the weighted grammar defined by q(y). Eq. 6 and Eq. 7 describe the mean and the variance, respectively, for the multivariate normal eventually used with the weighted grammar. These values ~ are based on the parameterization of q() by µi,j and i,j . An additional set of variational parameters is k , which ~ ~2 helps resolve the non-conjugacy of the LN distribution through a first order Taylor approximation. all off-diagonal covariances fixed at zero (i.e., the variational parameters consist of a single mean µk,i ~ and a single variance k,i for each k,i ). There is ~2 ~ an additional variational parameter, k per multinomial, which is the result of an additional variational approximation because of the lack of conjugacy of the LN distribution to the multinomial distribution. The distribution q(y) is assumed to be defined by a ~ DMV with unnormalized probabilities . Inference optimizes the bound B given in Fig. 3 (Eq. 3) with respect to the variational parameters. Our variational inference algorithm is derived similarly to that of Cohen et al. (2008). Because we wish to learn the values of µ and , we embed variational inference as the E step within a variational EM algorithm, shown schematically in Fig. 4. In our experiments, we use this variational EM algorithm on a training set, and then use the normal experts' means to get a point estimate for , the grammar weights. This is called empirical Bayesian estimation. 
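As a concrete illustration of this final point-estimation step, here is a minimal sketch (our own, using an assumed dictionary-based encoding of the partition structure; not the authors' implementation) that averages the learned experts' mean vectors over each multinomial's segments, following the combination in Definition 1, and then applies the softmax:

```python
import numpy as np

def sln_point_estimate(expert_means, partition, combiners):
    """Combine normal experts' mean vectors into multinomial point estimates.

    expert_means: dict expert index n -> learned mean vector mu_n
    partition:    dict (n, j) -> list of coordinate indices I_{n,j} of expert n
    combiners:    dict multinomial index k -> list of (n, j) segments J_k that
                  are averaged to form that multinomial (equal segment lengths)
    """
    theta = {}
    for k, segments in combiners.items():
        parts = [expert_means[n][partition[(n, j)]] for (n, j) in segments]
        zeta = np.mean(parts, axis=0)   # average the aligned segments
        e = np.exp(zeta - zeta.max())   # softmax onto the probability simplex
        theta[k] = e / e.sum()
    return theta

# Tiny example: two experts contribute to one shared two-event multinomial.
mu = {0: np.array([0.2, -0.1, 1.0]), 1: np.array([0.4, 0.0])}
partition = {(0, 0): [0, 1], (1, 0): [0, 1]}
combiners = {0: [(0, 0), (1, 0)]}
print(sln_point_estimate(mu, partition, combiners))
```

When each multinomial has a single expert of its own, the averaging step is a no-op and this reduces to the plain per-distribution LN point estimate.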
Our approach differs from maximum a posteriori (MAP) estimation, since we re-estimate the parameters of the normal experts. Exact MAP estimation is probably not feasible; a variational algorithm like ours might be applied, though better performance is expected from adjusting the SLN to fit the data. al., 1993) and the Chinese treebank (Xue et al., 2004). In both cases, following standard practice, sentences were stripped of words and punctuation, leaving part-of-speech tags for the unsupervised induction of dependency structure. For English, we train on §2­21, tune on §22 (without using annotated data), and report final results on §23. For Chinese, we train on §1­270, use §301­1151 for development and report testing results on §271­300.3 To evaluate performance, we report the fraction of words whose predicted parent matches the gold standard corpus. This performance measure is also known as attachment accuracy. We considered two parsing methods after extracting a point estimate for the grammar: the most probable "Viterbi" parse (argmaxy p(y | x, )) and the minimum Bayes risk (MBR) parse (argminy Ep(y |x,) [ (y; x, y )]) with dependency attachment error as the loss function (Goodman, 1996). Performance with MBR parsing is consistently higher than its Viterbi counterpart, so we report only performance with MBR parsing. 4.1 Nouns, Verbs, and Adjectives In this paper, we use a few simple heuristics to decide which partition structure S to use. Our heurisUnsupervised training for these datasets can be costly, and requires iteratively running a cubic-time inside-outside dynamic programming algorithm, so we follow Klein and Manning (2004) in restricting the training set to sentences of ten or fewer words in length. Short sentences are also less structurally ambiguous and may therefore be easier to learn from. 3 4 Experiments Our experiments involve data from two treebanks: the Wall Street Journal Penn treebank (Marcus et 79 Input: initial parameters µ(0) , (0) , partition structure S, observed data x, number of iterations T Output: learned parameters µ, t1; while t T do E-step (for = 1, ..., M ) do: repeat ,(t) optimize B w.r.t. µr , r = 1, ..., N ; ~ ,(t) optimize B w.r.t. r , r = 1, ..., N ; ~ ~ ,(t) update r , r = 1, ..., N ; ~ ,(t) update r , r = 1, ..., N ; ,(t) compute counts ~r , r = 1, ..., N ; f until convergence of B ; M-step: optimize B w.r.t. µ(t) and (t) ; t t + 1; end return µ(T ) , (T ) Figure 4: Main details of the variational inference EM algorithm with empirical Bayes estimation of µ and . B is the bound defined in Fig. 3 (Eq. 3). N is the number of normal experts for the SLN distribution defining the prior. M is the number of training examples. The full algorithm is given in Cohen and Smith (2009). of the predicate firsty (·), the set of multinomials s (· | x, D, v), for v V share a normal expert. · T IE N: This is the same as T IE V, only for nominal parents. · T IE V&N: Tie both verbs and nouns (in separate partitions). This is equivalent to taking the union of the partition structures of the above two settings. · T IE A: This is the same as T IE V, only for adjectival parents. Since inference for a model with parameter tying can be computationally intensive, we first run the inference algorithm without parameter tying, and then add parameter tying to the rest of the inference algorithm's execution until convergence. Initialization is important for the inference algorithm, because the variational bound is a nonconcave function. 
For the expected values of the normal experts, we use the initializer from Klein and Manning (2004). For the covariance matrices, we follow the setting in Cohen et al. (2008) in our experiments also described in §3.1. For each treebank, we divide the tags into twelve disjoint tag families.4 The covariance matrices for all dependency distributions were initialized with 1 on the diagonal, 0.5 between tags which belong to the same family, and 0 otherwise. This initializer has been shown to be more successful than an identity covariance matrix. 4.2 Monolingual Experiments We begin our experiments with a monolingual setting, where we learn grammars for English and Chinese (separately) using the settings described above. The attachment accuracy for this set of experiments is described in Table 1. The baselines include right attachment (where each word is attached to the word to its right), MLE via EM (Klein and Manning, 2004), and empirical Bayes with Dirichlet and LN priors (Cohen et al., 2008). We also include a "ceiling" (DMV trained using supervised MLE from the training sentences' trees). For English, we see that tying nouns, verbs or adjectives improves performance compared to the LN baseline. Tying both nouns and verbs improves performance a bit more. These are simply coarser tags: adjective, adverb, conjunction, foreign word, interjection, noun, number, particle, preposition, pronoun, proper noun, verb. 4 tics rely mainly on the centrality of content words: nouns, verbs, and adjectives. For example, in the English treebank, the most common attachment errors (with the LN prior from Cohen et al., 2008) happen with a noun (25.9%) or a verb (16.9%) parent. In the Chinese treebank, the most common attachment errors happen with noun (36.0%) and verb (21.2%) parents as well. The errors being governed by such attachments are the direct result of nouns and verbs being the most common parents in these data sets. Following this observation, we compare four different settings in our experiments (all SLN settings include one normal expert for each multinomial on its own, equivalent to the regular LN setting from Cohen et al.): · T IE V: We add normal experts that tie all probabilities corresponding to a verbal parent (any parent, using the coarse tags of Cohen et al., 2008). Let V be the set of part-of-speech tags which belong to the verb category. For each direction D (left or right), the set of multinomials of the form c (· | v, D), for v V , all share a normal expert. For each direction D and each boolean value B 80 Attach-Right EM (K&M, 2004) Dirichlet LN (CG&S, 2008) SLN, T IE V SLN, T IE N SLN, T IE V&N SLN, T IE A Biling. SLN, T IE V Biling. SLN, T IE N Biling. SLN, T IE V&N Biling. SLN, T IE A Supervised MLE Attach-Right EM (K&M, 2004) Dirichlet LN SLN, T IE V SLN, T IE N SLN, T IE V&N SLN, T IE A Biling. SLN, T IE V Biling. SLN, T IE N Biling. SLN, T IE V&N Biling. SLN, T IE A Supervised MLE attachment acc. (%) 10 20 all 38.4 33.4 31.7 46.1 39.9 35.9 46.1 40.6 36.9 59.4 45.9 40.5 60.2 46.2 40.0 60.2 46.7 40.9 61.3 47.4 41.4 59.9 45.8 40.9 47.6 41.7 61.6 48.1 42.1 61.8 62.0 48.0 42.2 61.3 47.6 41.7 84.5 74.9 68.8 34.9 34.6 34.6 38.3 36.1 32.7 38.3 35.9 32.4 50.1 40.5 35.8 42.0 35.8 51.9 43.0 38.4 33.7 45.0 39.2 34.2 47.4 40.4 35.2 51.9 42.0 35.8 48.0 38.9 33.8 51.5 41.7 35.3 52.0 41.3 35.2 84.3 66.1 57.6 ing information between those two models is done by softly tying grammar weights in the two hidden grammars. 
We first merge the models for English and Chinese by taking a union of the multinomial families of each and the corresponding prior parameters. We then add a normal expert that ties between the parts of speech in the respective partition structures for both grammars together. Parts of speech are matched through the single coarse tagset (footnote 4). For example, with T IE V, let V = V Eng V Chi be the set of part-of-speech tags which belong to the verb category for either treebank. Then, we tie parameters for all part-of-speech tags in V . We tested this joint model for each of T IE V, T IE N, T IE V&N, and T IE A. After running the inference algorithm which learns the two models jointly, we use unseen data to test each learned model separately. Table 1 includes the results for these experiments. The performance on English improved significantly in the bilingual setting, achieving highest performance with T IE V&N. Performance with Chinese is also the highest in the bilingual setting, with T IE A and T IE V&N. Chinese English Table 1: Attachment accuracy of different models, on test data from the Penn Treebank and the Chinese Treebank of varying levels of difficulty imposed through a length filter. Attach-Right attaches each word to the word on its right and the last word to $. Bold marks best overall accuracy per length bound, and marks figures that are not significantly worse (binomial sign test, p < 0.05). 5 Future Work In future work we plan to lexicalize the model, including a Bayesian grammar prior that accounts for the syntactic patterns of words. Nonparametric models (Teh, 2006) may be appropriate. We also believe that Bayesian discovery of cross-linguistic patterns is an exciting topic worthy of further exploration. 4.3 Bilingual Experiments 6 Conclusion Leveraging information from one language for the task of disambiguating another language has received considerable attention (Dagan, 1991; Smith and Smith, 2004; Snyder and Barzilay, 2008; Burkett and Klein, 2008). Usually such a setting requires a parallel corpus or other annotated data that ties between those two languages.5 Our bilingual experiments use the English and Chinese treebanks, which are not parallel corpora, to train parsers for both languages jointly. SharHaghighi et al. (2008) presented a technique to learn bilingual lexicons from two non-parallel monolingual corpora. 5 We described a Bayesian model that allows soft parameter tying among any weights in a probabilistic grammar. We used this model to improve unsupervised parsing accuracy on two different languages, English and Chinese, achieving state-of-the-art results. We also showed how our model can be effectively used to simultaneously learn grammars in two languages from non-parallel multilingual data. Acknowledgments This research was supported by NSF IIS-0836431. The authors thank the anonymous reviewers and Sylvia Rebholz for helpful comments. 81 References J. Aitchison. 1986. The Statistical Analysis of Compositional Data. Chapman and Hall, London. D. M. Blei and J. D. Lafferty. 2006. Correlated topic models. In Proc. of NIPS. D. M. Blei, A. Ng, and M. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993­1022. D. Burkett and D. Klein. 2008. Two languages are better than one (for syntactic parsing). In Proc. of EMNLP. E. Charniak and M. Johnson. 2005. Coarse-to-fine nbest parsing and maxent discriminative reranking. In Proc. of ACL. S. B. Cohen and N. A. Smith. 2009. 
Inference for probabilistic grammars with shared logistic normal distributions. Technical report, Carnegie Mellon University. S. B. Cohen, K. Gimpel, and N. A. Smith. 2008. Logistic normal priors for unsupervised probabilistic grammar induction. In NIPS. M. Collins. 2003. Head-driven statistical models for natural language processing. Computational Linguistics, 29:589­637. I. Dagan. 1991. Two languages are more informative than one. In Proc. of ACL. J. Eisner. 2002. Transformational priors over grammars. In Proc. of EMNLP. J. R. Finkel, T. Grenager, and C. D. Manning. 2007. The infinite tree. In Proc. of ACL. J. Goodman. 1996. Parsing algorithms and metrics. In Proc. of ACL. A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proc. of ACL. W. P. Headden, M. Johnson, and D. McClosky. 2009. Improving unsupervised dependency parsing with richer contexts and smoothing. In Proc. of NAACLHLT. G. E. Hinton. 1999. Products of experts. In Proc. of ICANN. M. Johnson, T. L. Griffiths, and S. Goldwater. 2006. Adaptor grammars: A framework for specifying compositional nonparameteric Bayesian models. In NIPS. M. Johnson, T. L. Griffiths, and S. Goldwater. 2007. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Proc. of NAACL. M. Johnson. 2007. Why doesn't EM find good HMM POS-taggers? In Proc. EMNLP-CoNLL. M. I. Jordan, Z. Ghahramani, T. S. Jaakola, and L. K. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning, 37(2):183­ 233. D. Klein and C. D. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proc. of ACL. K. Kurihara and T. Sato. 2006. Variational Bayesian grammar induction for natural language. In Proc. of ICGI. P. Liang, S. Petrov, M. Jordan, and D. Klein. 2007. The infinite PCFG using hierarchical Dirichlet processes. In Proc. of EMNLP. M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19:313­330. D. A. Smith and N. A. Smith. 2004. Bilingual parsing with factored estimation: Using English to parse Korean. In Proc. of EMNLP, pages 49­56. N. A. Smith. 2006. Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text. Ph.D. thesis, Johns Hopkins University. B. Snyder and R. Barzilay. 2008. Unsupervised multilingual learning for morphological segmentation. In Proc. of ACL. Y. W. Teh. 2006. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proc. of COLING-ACL. M. Wang, N. A. Smith, and T. Mitamura. 2007. What is the Jeopardy model? a quasi-synchronous grammar for question answering. In Proc. of EMNLP. D. Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Comp. Ling., 23(3):377­404. N. Xue, F. Xia, F.-D. Chiou, and M. Palmer. 2004. The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 10(4):1­30. 82 Adding More Languages Improves Unsupervised Multilingual Part-of-Speech Tagging: A Bayesian Non-Parametric Approach Benjamin Snyder, Tahira Naseem, Jacob Eisenstein, and Regina Barzilay Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology {bsnyder, tahira, jacobe, regina}@csail.mit.edu Abstract We investigate the problem of unsupervised part-of-speech tagging when raw parallel data is available in a large number of languages. 
Patterns of ambiguity vary greatly across languages and therefore even unannotated multilingual data can serve as a learning signal. We propose a non-parametric Bayesian model that connects related tagging decisions across languages through the use of multilingual latent variables. Our experiments show that performance improves steadily as the number of languages increases. 1 Introduction In this paper we investigate the problem of unsupervised part-of-speech tagging when unannotated parallel data is available in a large number of languages. Our goal is to develop a fully joint multilingual model that scales well and shows improved performance for individual languages as the total number of languages increases. Languages exhibit ambiguity at multiple levels, making unsupervised induction of their underlying structure a difficult task. However, sources of linguistic ambiguity vary across languages. For example, the word fish in English can be used as either a verb or a noun. In French, however, the noun poisson (fish) is entirely distinct from the verbal form p^ cher (to fish). Previous work has leveraged this e idea by building models for unsupervised learning from aligned bilingual data (Snyder et al., 2008). However, aligned data is often available for many languages. The benefits of bilingual learning vary 83 markedly depending on which pair of languages is selected, and without labeled data it is unclear how to determine which supplementary language is most helpful. In this paper, we show that it is possible to leverage all aligned languages simultaneously, achieving accuracy that in most cases outperforms even optimally chosen bilingual pairings. Even in expressing the same meaning, languages take different syntactic routes, leading to variation in part-of-speech sequences. Therefore, an effective multilingual model must accurately model common linguistic structure, yet remain flexible to the idiosyncrasies of each language. This tension only becomes stronger as additional languages are added to the mix. From a computational standpoint, the main challenge is to ensure that the model scales well as the number of languages increases. Care must be taken to avoid an exponential increase in the parameter space as well as the time complexity of inference procedure. We propose a non-parametric Bayesian model for joint multilingual tagging. The topology of our model connects tagging decisions within a language as well as across languages. The model scales linearly with the number of languages, allowing us to incorporate as many as are available. For each language, the model contains an HMM-like substructure and connects these substructures to one another by means of cross-lingual latent variables. These variables, which we refer to as superlingual tags, capture repeated multilingual patterns and thus reduce the overall uncertainty in tagging decisions. We evaluate our model on a parallel corpus of eight languages. The model is trained once using all Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 83­91, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics languages, and its performance is tested separately for each on a held-out monolingual test set. When a complete tag lexicon is provided, our unsupervised model achieves an average accuracy of 95%, in comparison to 91% for an unsupervised monolingual Bayesian HMM and 97.4% for its supervised counterpart. 
Thus, on average, the gap between unsupervised and supervised monolingual performance is cut by nearly two thirds. We also examined scenarios where the tag lexicon is reduced in size. In all cases, the multilingual model yielded substantial performance gains. Finally, we examined the performance of our model when trained on all possible subsets of the eight languages. We found that performance improves steadily as the number of available languages increases. guages. Beyond Bilingual Learning While most work on multilingual learning focuses on bilingual analysis, some models operate on more than one pair of languages. For instance, Genzel (2005) describes a method for inducing a multilingual lexicon from a group of related languages. His model first induces bilingual models for each pair of languages and then combines them. Our work takes a different approach by simultaneously learning from all languages, rather than combining bilingual results. A related thread of research is multi-source machine translation (Och and Ney, 2001; Utiyama and Isahara, 2006; Cohn and Lapata, 2007) where the goal is to translate from multiple source languages to a single target language. Rather than jointly training all the languages together, these models train bilingual models separately, and then use their output to select a final translation. The selection criterion can be learned at training time since these models have access to the correct translation. In unsupervised settings, however, we do not have a principled means for selecting among outputs of different bilingual models. By developing a joint multilingual model we can automatically achieve performance that rivals that of the best bilingual pairings. 2 Related Work Bilingual Part-of-Speech Tagging Early work on multilingual tagging focused on projecting annotations from an annotated source language to a target language (Yarowsky and Ngai, 2001; Feldman et al., 2006). In contrast, we assume no labeled data at all; our unsupervised model instead symmetrically improves performance for all languages by learning cross-lingual patterns in raw parallel data. An additional distinction is that projection-based work utilizes pairs of languages, while our approach allows for continuous improvement as languages are added to the mix. In recent work, Snyder et al. (2008) presented a model for unsupervised part-of-speech tagging trained from a bilingual parallel corpus. This bilingual model and the model presented here share a number of similarities: both are Bayesian graphical models building upon hidden Markov models. However, the bilingual model explicitly joins each aligned word-pair into a single coupled state. Thus, the state-space of these joined nodes grows exponentially in the number of languages. In addition, crossing alignments must be removed so that the resulting graph structure remains acyclic. In contrast, our multilingual model posits latent cross-lingual tags without explicitly joining or directly connecting the part-of-speech tags across languages. Besides permitting crossing alignments, this structure allows the model to scale gracefully with the number of lan84 3 Model We propose a non-parametric directed Bayesian graphical model for multilingual part-of-speech tagging using a parallel corpus. We perform a joint training pass over the corpus, and then apply the parameters learned for each language to a held-out monolingual test set. 
The core idea of our model is that patterns of ambiguity vary across languages and therefore even unannotated multilingual data can serve as a learning signal. Our model is able to simultaneously harness this signal from all languages present in the corpus. This goal is achieved by designing a single graphical model that connects tagging decisions within a language as well as across languages. The model contains language-specific HMM substructures connected to one another by cross-lingual latent variables spanning two or more languages. These variables, which we refer to as superlingual tags, capture repeated cross-lingual patterns and Figure 1: Model structure for parallel sentences in English, French, Hebrew, and Urdu. In this example, there are three superlingual tags, each connected to the part-of-speech tag of a word in each of the four languages. thus reduce the overall uncertainty in tagging decisions. To encourage the discovery of a compact set of such cross-lingual patterns, we place a Dirichlet process prior on the superlingual tag values. 3.1 Model Structure For each language, our model includes an HMMlike substructure with observed word nodes, hidden part-of-speech nodes, and directed transition and emission edges. For each set of aligned words in parallel sentences, we add a latent superlingual variable to capture the cross-lingual context. A set of directed edges connect this variable to the partof-speech nodes of the aligned words. Our model assumes that the superlingual tags for parallel sentences are unordered and are drawn independently of one another. Edges radiate outward from superlingual tags to language-specific part-of-speech nodes. Thus, our model implicitly assumes that superlingual tags are drawn prior to the part-of-speech tags of all languages and probabilistically influence their selection. See Figure 1 for an example structure. The particular model structure for each set of parallel sentences (i.e. the configuration of superlingual tags and their edges) is determined by bilingual lexical alignments and -- like the text itself -- is considered an observed variable. In practice, these lexical alignments are obtained using standard techniques from machine translation. 85 Our model design has several benefits. Crossing and many-to-many alignments may be used without creating cycles in the graph, as all cross-lingual information emanates from the hidden superlingual tags. Furthermore, the model scales gracefully with the number of languages, as the number of new edges and nodes will be proportional to the number of words for each additional language. 3.2 Superlingual Tags Each superlingual tag value specifies a set of distributions -- one for each language's part-of-speech tagset. In order to learn repeated cross-lingual patterns, we need to constrain the number of superlingual tag values and thus the number of distributions they provide. For example, we might allow the superlingual tags to take on integer values from 1 to K, with each integer value indexing a separate set of distributions. Each set of distributions should correspond to a discovered cross-lingual pattern in the data. For example, one set of distributions might favor nouns in each language and another might favor verbs. Rather than fixing the number of superlingual tag values to an arbitrary and predetermined size 1, . . . , K, we allow them to range over the entire set of integers. 
In order to encourage the desired multilingual clustering behavior, we use a Dirichlet process prior for the superlingual tags. This prior allows high posterior probability only when a small number n i o a s h s i o d p n a s a s e p l i l h c e h r c o a d a m e ' h j J u M m i h s g i a f d v e e v h o l o i n I a of values are used repeatedly. The actual number of sampled values will be dictated by the data and the number of languages. More formally, suppose we have n languages, 1 , . . . , n . According to our generative model, a countably infinite sequence of sets 11 , . . . , 1n , 21 , . . . , 2n , . . . is drawn from some base distribution. Each i is a distribution over the parts-of-speech in language . In parallel, an infinite sequence of mixing components 1 , 2 , . . . is drawn from a stick-breaking process (Sethuraman, 1994). These components define a distribution over the integers with most probability mass placed on some initial set of values. The two sequences 11 , . . . , 1n , 21 , . . . , 2n . . . and 1 , 2 . . . now define the distribution over superlingual tags and their associated distributions on parts-of-speech. That is, each superlingual tag z N is drawn with probability z , and indexes the set of distributions z1 , . . . , zn . 3.3 Part-of-Speech Tags Finally, we need to define the generative probabilities of the part-of-speech nodes. For each such node there may be multiple incoming edges. There will always be an incoming transition edge from the previous tag (in the same language). In addition, there may be incoming edges from zero or more superlingual tags. Each edge carries with it a distribution over parts-of-speech and these distributions must be combined into the single distribution from which the tag is ultimately drawn. We choose to combine these distributions as a product of experts. More formally: for language and tag position i, the part-of-speech tag yi is drawn according to yi yi-1 (yi ) Z z z (yi ) That is, any expert can "veto" a potential tag by assigning it low probability, generally leading to consensus decisions. We now formalize this description by giving the stochastic process by which the observed data (raw parallel text) is generated, according to our model. 3.4 Generative Process For n languages, we assume the existence of n tagsets T 1 , . . . , T n and vocabularies, W 1 , . . . , W n , one for each language. For clarity, the generative process is described using only bigram transition dependencies, but our experiments use a trigram model. 1. Transition and Emission Parameters: For each language and for each tag t T , draw a transition distribution over tags T and t an emission distribution t over words W , all from symmetric Dirichlet priors of appropriate dimension. 2. Superlingual Tag Parameters: Draw an infinite sequence of sets from 11 , . . . , 1n , 21 , . . . , 2n , . . . base distribution G0 . Each i is a distribution over the tagset T . The base distribution G0 is a product of n symmetric Dirichlets, where the dimension of the ith such Dirichlet is the size of the corresponding tagset T i . At the same time, draw an infinite sequence of mixture weights GEM (), where GEM () indicates the stick-breaking distribution (Sethuraman, 1994), and = 1. These parameters together define a prior distribution over superlingual tags, p(z) = k (1) k k=z , (2) Where yi-1 indicates the transition distribution, and the z's range over the values of the incoming superlingual tags. 
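As a small illustration of the product-of-experts combination just described (a toy sketch of our own, not the authors' code), the transition factor and every incoming superlingual factor are multiplied and renormalized, so a tag receives high probability only if all experts allow it:

```python
import numpy as np

def tag_distribution(transition_row, superlingual_rows):
    """Product-of-experts distribution over one word's part-of-speech tag.

    transition_row:    probabilities of each tag given the previous tag in the
                       same language (the transition expert).
    superlingual_rows: one tag distribution per superlingual tag with an edge
                       into this word (the cross-lingual experts).
    """
    scores = np.array(transition_row, dtype=float)
    for row in superlingual_rows:
        scores *= np.array(row, dtype=float)  # any expert can veto a tag
    return scores / scores.sum()              # normalize by Z

# Toy tagset (NOUN, VERB, ADJ): the transition expert prefers VERB, but both
# superlingual experts put most of their mass on NOUN, so NOUN wins.
print(tag_distribution([0.2, 0.6, 0.2],
                       [[0.7, 0.2, 0.1], [0.8, 0.1, 0.1]]))
```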
The normalization term Z is obtained by summing the numerator over all part-ofspeech tags yi in the tagset. This parameterization allows for a relatively simple and small parameter space. It also leads to a desirable property: for a tag to have high probability each of the incoming distributions must allow it. 86 or equivalently over the part-of-speech distributions 1 , . . . , n that they index: k k k1 ,...,kn = 1 ,..., n . (3) In both cases, v=v is defined as one when v = v and zero otherwise. Distribution 3 is said to be drawn from a Dirichlet process, conventionally written as DP (, G0 ). 3. Data: For each multilingual parallel sentence, (a) Draw an alignment a specifying sets of aligned indices across languages. Each such set may consist of indices in any subset of the languages. We leave the distribution over alignments undefined, as we consider alignments observed variables. (b) For each set of indices in a, draw a superlingual tag value z according to Distribution 2. (c) For each language , for i = 1, . . . (until end-tag reached): i. Draw a part-of-speech tag yi T according to Distribution 1 ii. Draw a word wi W according to the emission distribution yi . To perform Bayesian inference under this model we use a combination of sampling techniques, which we describe in detail in the next section. 3.5 Inference Ideally we would like to predict the part-of-speech tags which have highest marginal probability given the observed words x and alignments a. More specifically, since we are evaluating our accuracy per tag-position, we would like to predict, for language index and word index i, the single part-of-speech tag: argmax P yi = t x, a tT which we wish to marginalize but for which we cannot compute closed-form integrals, where each sample samplek is drawn from P (samplek |x, a). We then approximate the tag marginals as: P yi = t x, a k P yi = t samplek , x, a N (4) We employ closed forms for integrating out the emission parameters , transition parameters , and superlingual tag parameters and . We explicitly sample only part-of-speech tags y, superlingual tags z, and the hyperparameters of the transition and emission Dirichlet priors. To do so, we apply standard Markov chain sampling techniques: a Gibbs sampler for the tags and a within-Gibbs MetropolisHastings subroutine for the hyperparameters (Hastings, 1970). Our Gibbs sampler samples each part-of-speech and superlingual tag separately, conditioned on the current value of all other tags. In each case, we use standard closed forms to integrate over all parameter values, using currently sampled counts and hyperparameter pseudo-counts. We note that conjugacy is technically broken by our use of a product form in Distribution 1. Nevertheless, we consider the sampled tags to have been generated separately by each of the factors involved in the numerator. Thus our method of using count-based closed forms should be viewed as an approximation. 3.6 Sampling Part-of-Speech Tags To sample the part-of-speech tag for language at position i we draw from P (yi |y-(,i) , x, a, z) P (yi+1 |yi , y-(,i) , a, z) P (yi |y-(,i) , a, z)· which we can rewrite as the argmaxtT of the integral, P yi = t y-(,i) , , , z, , x, a · dy-(,i) d d dz d d, P (x |x , y ) , i -i P y-(,i) , , , z, , , x, a in which we marginalize over the settings of all tags other than yi (written as y-(,i) ), the transition distributions = , emission distrit butions = t , superlingual tags z, and superlingual tag parameters = {1 , 2 , . . .} and = 11 , . . . , 1n , 21 , . . . , 2n . . 
.} (where t ranges over all part-of-speech tags). As these integrals are intractable to compute exactly, we resort to the standard Monte Carlo approximation. We collect N samples of the variables over 87 where the first two terms are the generative probabilities of (i) the current tag given the previous tag and superlingual tags, and (ii) the next tag given the current tag and superlingual tags. These two quantities are similar to Distribution 1, except here we integrate over the transition parameter yi-1 and the superlingual tag parameters z . We end up with a product of integrals. Each integral can be computed in closed form using multinomial-Dirichlet conjugacy (and by making the above-mentioned simplifying assumption that all other tags were generated separately by their transition and superlingual parameters), just as in the monolingual Bayesian HMM of (Goldwater and Griffiths, 2007). For example, the closed form for integrating over the parameter of a superlingual tag with value z is given by: z (yi )P (z |0 )dz = assume an improper uniform prior and use a Gaussian proposal distribution with mean set to the previous value, and variance to one-tenth of the mean. 4 Experimental Setup count(z, yi , ) + 0 count(z, ) + T 0 where count(z, yi , ) is the number of times that tag yi is observed together with superlingual tag z in language , count(z, ) is the total number of times that superlingual tag z appears with an edge into language , and 0 is a hyperparameter. The third term in the sampling formula is the emission probability of the current word x given i the current tag and all other words and sampled tags, as well as a hyperparameter which is suppressed for the sake of clarity. This quantity can be computed exactly in closed form in a similar way. 3.7 Sampling Superlingual Tags For each set of aligned words in the observed alignment a we need to sample a superlingual tag z. Recall that z is an index into an infinite sequence 11 , . . . , 1n , 21 , . . . , 2n . . ., where each z is . The generative disa distribution over the tagset T tribution over z is given by equation 2. In our sampling scheme, however, we integrate over all possible settings of the mixing components using the standard Chinese Restaurant Process (CRP) closed form (Antoniak, 1974): P zi z-i , y P yi z, y-(,i) · 1 k+ count(zi ) k+ if zi z-i otherwise The first term is the product of closed form tag probabilities of the aligned words, given z. The final term is the standard CRP closed form for posterior sampling from a Dirichlet process prior. In this term, k is the total number of sampled superlingual tags, count(zi ) is the total number of times the value zi occurs in the sampled tags, and is the Dirichlet process concentration parameter (see Step 2 in Section 3.4). Finally, we perform standard hyperparameter reestimation for the parameters of the Dirichlet distribution priors on and (the transition and emission distributions) using Metropolis-Hastings. We 88 We test our model in an unsupervised framework where only raw parallel text is available for each of the languages. In addition, we assume that for each language a tag dictionary is available that covers some subset of words in the text. The task is to learn an independent tagger for each language that can annotate non-parallel raw text using the learned parameters. All reported results are on non-parallel monolingual test data. 
Data: For our experiments we use the Multext-East parallel corpus (Erjavec, 2004), which has been used before for multilingual learning (Feldman et al., 2006; Snyder et al., 2008). The tagged portion of the corpus includes a 100,000-word English text, Orwell's novel "Nineteen Eighty Four", and its translation into seven languages: Bulgarian, Czech, Estonian, Hungarian, Romanian, Slovene and Serbian. The corpus also includes a tag lexicon for each of these languages. We use the first 3/4 of the text for learning and the last 1/4 as held-out non-parallel test data.

The corpus provides sentence-level alignments. To obtain word-level alignments, we run GIZA++ (Och and Ney, 2003) on all 28 pairings of the 8 languages. Since we want each latent superlingual variable to span as many languages as possible, we aggregate the pairwise lexical alignments into larger sets of aligned words. These sets of aligned words are generated as a preprocessing step. During sampling they remain fixed and are treated as observed data.

We use the set of 14 basic part-of-speech tags provided by the corpus. In our first experiment, we assume that a complete tag lexicon is available, so that for each word, its set of possible parts-of-speech is known ahead of time. In this setting, the average number of possible tags per token is 1.39. We also experimented with incomplete tag dictionaries, where entries are only available for words appearing more than five or ten times in the corpus. For other words, the entire tagset of 14 tags is considered. In these two scenarios, the average per-token tag ambiguity is 4.65 and 5.58, respectively.

        Lexicon: Full                     Lexicon: Frequency > 5            Lexicon: Frequency > 10
        MONO  BI avg  BI best  MULTI      MONO  BI avg  BI best  MULTI      MONO  BI avg  BI best  MULTI
  BG    88.8   91.3    94.7     92.6      73.5   80.2    82.7     81.3      71.9   77.8    80.2     78.8
  CS    93.7   97.0    97.7     98.2      72.2   79.0    79.7     83.0      66.7   75.3    76.7     79.4
  EN    95.8   95.9    96.1     95.0      87.3   90.4    90.7     88.1      84.4   88.8    89.4     86.1
  ET    92.5   93.4    94.3     94.6      72.5   76.5    77.5     80.6      68.3   72.9    74.9     77.9
  HU    95.3   96.8    96.9     96.7      73.5   77.3    78.0     80.8      69.0   73.8    75.2     76.4
  RO    90.1   91.8    94.0     95.1      77.1   82.7    84.4     86.1      73.0   80.5    82.1     83.1
  SL    87.4   89.3    94.8     95.8      75.7   78.7    80.9     83.6      70.4   76.1    77.6     80.0
  SR    84.5   90.2    94.5     92.3      66.3   75.9    79.4     78.8      63.7   72.4    76.1     75.9
  Avg.  91.0   93.2    95.4     95.0      74.7   80.1    81.7     82.8      70.9   77.2    79.0     79.7

Table 1: Tagging accuracy for Bulgarian, Czech, English, Estonian, Hungarian, Romanian, Slovene, and Serbian. In the first scenario, a complete tag lexicon is available for all the words. In the other two scenarios the tag lexicon only includes words that appear more than five or ten times. Results are given for a monolingual Bayesian HMM (Goldwater and Griffiths, 2007), a bilingual model (Snyder et al., 2008), and the multilingual model presented here. In the case of the bilingual model, we present both the average accuracy over all pairings as well as the result from the best performing pairing for each language.

Training and testing: In the full lexicon experiment, each word is initialized with a random part-of-speech tag from its dictionary entry. In the two reduced lexicon experiments, we initialize the tags with the result of our monolingual baseline (see below) to reduce sampling time. In both cases, we begin with 14 superlingual tag values -- corresponding to the parts-of-speech -- and initially assign them based on the most common initial part-of-speech of words in each alignment.
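The aggregation of pairwise GIZA++ alignments into larger multilingual sets is only described briefly above. The union-find sketch below shows one plausible way to carry out that preprocessing step, assuming a word token is identified by a (language, position) pair and that transitively linked tokens are grouped together; the actual procedure may well apply further constraints (for example, limiting a set to one token per language), and the function names here are illustrative only.

def aggregate_alignments(pairwise_links):
    """Group word tokens into multilingual aligned sets.

    pairwise_links: iterable of ((lang_a, pos_a), (lang_b, pos_b)) pairs, one
    per word-level link produced for some language pair (e.g. by GIZA++).
    Returns a list of sets of (lang, pos) tokens, one per aligned group;
    tokens linked directly or transitively end up in the same set."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:                 # path-halving union-find
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairwise_links:
        union(a, b)

    groups = {}
    for token in parent:
        groups.setdefault(find(token), set()).add(token)
    return list(groups.values())

# Example: English word 3 links to Czech word 2 in one pairing, and the same
# Czech word links to Slovene word 4 in another; all three end up together.
links = [(("en", 3), ("cs", 2)), (("cs", 2), ("sl", 4))]
print(aggregate_alignments(links))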
We run our Gibbs sampler for 1,000 iterations, and store the conditional tag probabilities for the last 100 iterations. We then approximate marginal tag probabilities on the training data using Equation 4 and predict the highest probability tags. Finally, we compute maximum likelihood transition and emission probabilities using these tag counts, and then apply smoothed Viterbi decoding to each held-out monolingual test set. All reported results are averaged over five runs of the sampler.

Monolingual and bilingual baselines: We reimplemented the Bayesian HMM model of Goldwater and Griffiths (2007) (BHMM1) as our monolingual baseline. It has a standard HMM structure with conjugate Bayesian priors over transitions and emissions. We note that our model, in the absence of any superlingual tags, reduces to this Bayesian HMM.

As an additional baseline we use a bilingual model (Snyder et al., 2008). It is a directed graphical model that jointly tags two parallel streams of text aligned at the word level. The structure of the model consists of two parallel HMMs, one for each language. The aligned words form joint nodes that are shared by both HMMs. These joint nodes are sampled from a probability distribution that is a product of the transition and emission distributions in the two languages and a coupling distribution. We note that the numbers reported here for the bilingual model differ slightly from those reported by Snyder et al. (2008) for two reasons: we use a slightly larger set of sentences, and an improved sampling scheme. The new sampling scheme marginalizes over the transition and coupling parameters by using the same count-based approximation that we utilize for our multilingual model. This leads to higher performance, and thus a stronger baseline.1

1 Another difference is that we use the English lexicon provided with the Multext-East corpus, whereas Snyder et al. (2008) augment this lexicon with tags found in the WSJ.

5 Results

Table 1 shows the tagging accuracy of our multilingual model on the test data, when training is performed on all eight languages together. Results from both baselines are also reported. In the case of the bilingual baseline, seven pairings are possible for each language, and the results vary by pair. Therefore, for each language, we present the average accuracy over all seven pairings, as well as the accuracy of its highest performing pairing.

We provide results for three scenarios. In the first case, a tag dictionary is provided for all words, limiting them to a restricted set of possible tags. In the other two scenarios, dictionary entries are limited to words that appear more than five or ten times in the corpus. All other words can be assigned any tag, increasing the overall difficulty of the task.

In the full lexicon scenario, our model achieves an average tagging accuracy of 95%, compared to 91% for the monolingual baseline and 93.2% for the bilingual baseline when averaged over all pairings. This accuracy (95%) comes close to the performance of the bilingual model when the best pairing for each language is chosen by an oracle (95.4%). This demonstrates that our multilingual model is able to effectively learn from all languages. In the two reduced lexicon scenarios, the gains are even more striking: in both cases the average multilingual performance outpaces even the best performing pairs. Looking at individual languages, we see that in all three scenarios, Czech, Estonian, Romanian, and Slovene show their best performance with the multilingual model.
Bulgarian and Serbian, on the other hand, give somewhat better performance with their optimal pairings under the bilingual model, but their multilingual performance remains higher than their average bilingual results. The performance of English under the multilingual model is somewhat lower, especially in the full lexicon scenario, where it drops below monolingual performance. One possible explanation for this decrease lies in the fact that English, by far, has the lowest trigram tag entropy of all eight languages (Snyder et al., 2008). It is possible, therefore, that the signal it should be getting from its own transitions is being drowned out by less reliable information from other languages.

In order to test the performance of our model as the number of languages increases, we ran the full lexicon experiment with all possible subsets of the eight languages. Figure 2 plots the average accuracy as the number of languages varies. For comparison, the monolingual and average bilingual baseline results are given, along with supervised monolingual performance. Our multilingual model steadily gains in accuracy as the number of available languages increases.

Figure 2: Performance of the multilingual model as the number of languages varies. Performance of the monolingual and average bilingual baselines as well as supervised monolingual performance are given for comparison.

Interestingly, it even outperforms the bilingual baseline (by a small margin) when only two languages are available, which may be attributable to the more flexible non-parametric dependencies employed here. Finally, notice that the gap between monolingual supervised and unsupervised performance is cut by nearly two thirds under the unsupervised multilingual model.

6 Conclusion

In this paper we have demonstrated that the benefits of unsupervised multilingual learning increase steadily with the number of available languages. Our model scales gracefully as languages are added and effectively incorporates information from them all, leading to substantial performance gains. In one experiment, we cut the gap between unsupervised and supervised performance by nearly two thirds. A future challenge lies in incorporating constraints from additional languages even when parallel text is unavailable.

Acknowledgments

The authors acknowledge the support of the National Science Foundation (CAREER grant IIS-0448168 and grant IIS-0835445). Thanks to Tommi Jaakkola and members of the MIT NLP group for helpful discussions. Any opinions, findings, or recommendations expressed above are those of the authors and do not necessarily reflect the views of the NSF.

References

C. E. Antoniak. 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2:1152–1174, November.

Trevor Cohn and Mirella Lapata. 2007. Machine translation by triangulation: Making effective use of multi-parallel corpora. In Proceedings of ACL.

T. Erjavec. 2004. MULTEXT-East version 3: Multilingual morphosyntactic specifications, lexicons and corpora. In Fourth International Conference on Language Resources and Evaluation, LREC, volume 4, pages 1535–1538.

Anna Feldman, Jirka Hana, and Chris Brew. 2006. A cross-language approach to rapid creation of new morpho-syntactically annotated resources. In Proceedings of LREC, pages 549–554.

Dmitriy Genzel. 2005. Inducing a multilingual dictionary from a parallel multitext in related languages. In Proceedings of HLT/EMNLP, pages 875–882.

Sharon Goldwater and Thomas L. Griffiths.
2007. A fully Bayesian approach to unsupervised part-of-speech tagging. In Proceedings of the ACL, pages 744–751.

W. K. Hastings. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109.

Franz Josef Och and Hermann Ney. 2001. Statistical multi-source translation. In MT Summit 2001, pages 253–258.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

J. Sethuraman. 1994. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650.

Benjamin Snyder, Tahira Naseem, Jacob Eisenstein, and Regina Barzilay. 2008. Unsupervised multilingual learning for POS tagging. In Proceedings of EMNLP, pages 1041–1050.

Masao Utiyama and Hitoshi Isahara. 2006. A comparison of pivot methods for phrase-based statistical machine translation. In Proceedings of NAACL/HLT, pages 484–491.

David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the NAACL, pages 1–8.

Efficiently Parsable Extensions to Tree-Local Multicomponent TAG

Rebecca Nesson, School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, nesson@seas.harvard.edu
Stuart M. Shieber, School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, shieber@seas.harvard.edu

Abstract

Recent applications of Tree-Adjoining Grammar (TAG) to the domain of semantics as well as new attention to syntactic phenomena have given rise to increased interest in more expressive and complex multicomponent TAG formalisms (MCTAG). Although many constructions can be modeled using tree-local MCTAG (TL-MCTAG), certain applications require even more flexibility. In this paper we suggest a shift in focus from constraining locality and complexity through tree- and set-locality to constraining locality and complexity through restrictions on the derivational distance between trees in the same tree set in a valid derivation. We examine three formalisms, restricted NS-MCTAG, restricted Vector-TAG and delayed TL-MCTAG, that use notions of derivational distance to constrain locality and demonstrate how they permit additional expressivity beyond TL-MCTAG without increasing complexity to the level of set-local MCTAG.

have substantial costs in terms of efficient processing. Much work in TAG semantics makes use of tree-local MCTAG (TL-MCTAG) to model phenomena such as quantifier scoping, Wh-question formation, and many other constructions (Kallmeyer and Romero, 2004; Romero et al., 2004). Certain applications, however, appear to require even more flexibility than is provided by TL-MCTAG. Scrambling is one well-known example (Rambow, 1994). In addition, in the semantics domain, a new TAG operation, flexible composition, is used to perform certain semantic operations that seemingly cannot be modeled with TL-MCTAG alone (Chiang and Scheffler, 2008), and in work in synchronous TAG semantics, constructions such as nested quantifiers require a set-local MCTAG (SL-MCTAG) analysis (Nesson and Shieber, 2006). In this paper we suggest a shift in focus from constraining locality and complexity through restrictions that all trees in a tree set must adjoin within a single tree or tree set to constraining locality and complexity through restrictions on the derivational distance between trees in the same tree set in a valid derivation.
We examine three formalisms, two of them introduced in this work for the first time, that use derivational distance to constrain locality and demonstrate by construction of parsers their relationship to TL-MCTAG in both expressivity and complexity. In Section 2 we give a very brief introduction to TAG. In Section 3 we elaborate further the distinction between these two types of locality restrictions using TAG derivation trees. Section 4 briefly addresses the simultaneity requirement present in MCTAG formalisms but not in Vector- 1 Introduction Tree-Adjoining Grammar (TAG) has long been popular for natural language applications because of its ability to naturally capture syntactic relationships while also remaining efficient to process. More recent applications of TAG to the domain of semantics as well as new attention to syntactic phenomena such as scrambling have given rise to increased interested in multicomponent TAG formalisms (MCTAG), which extend the flexibility, and in some cases generative capacity of the formalism but also 92 Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 92­100, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics S X X a Y b S S Y b X a Y b Y S X a X Y a Y Z b c Y Z c Figure 1: An example of the TAG operations substitution and adjunction. TAG formalisms and argues for dropping the requirement. In Sections 5 and 6 we introduce two novel formalisms, restricted non-simultaneous MCTAG and restricted Vector-TAG, respectively, and define CKY-style parsers for them. In Section 7 we recall the delayed TL-MCTAG formalism introduced by Chiang and Scheffler (2008) and define a CKY-style parser for it as well. In Section 8 we explore the complexity of all three parsers and the relationship between the formalisms. In Section 9 we discuss the linguistic applications of these formalisms and show that they permit analyses of some of the hard cases that have led researchers to look beyond TL-MCTAG. order to clarify the presentation of our extended TLMCTAG parsers below, we briefly review the algorithm of Shieber et al. (1995) using the inference rule notation from that paper. As shown in Figure 2, items in CKY-style TAG parsing consist of a node in an elementary tree and the indices that mark the edges of the span dominated by that node. Nodes, notated @a , are specified by three pieces of information: the identifier of the elementary tree the node is in, the Gorn address a of the node in that tree1 , and a diacritic, , indicating that an adjunction or substitution is still available at that node or ·, indicating that one has already taken place. Each item has four indices, indicating the left and right edges of the span covered by the node as well as any gap in the node that may be the result of a foot node dominated by the node. Nodes that do not dominate a foot node will have no gap in them, which we indicate by the use of underscores in place of the indices for the gap. To limit the number of inference rules needed, we define the following function i j for combining indices: ij = i j 2 Background j= i= i i=j undefined otherwise A tree-adjoining grammar consists of a set of elementary tree structures of arbitrary depth, which are combined by operations of adjunction and substitution. Auxiliary trees are elementary trees in which the root and a frontier node, called the foot node and distinguished by the diacritic , are labeled with the same nonterminal A. 
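The index-combining operation introduced above for the CKY deduction rules is easiest to see in code. The following small sketch uses None for the unbound index written as an underscore in the rules; the representation is an assumption made for the example, not the paper's.

def combine(i, j):
    """The index-combining operation i ⊔ j: defined when at most one of
    the two indices is bound, or when they agree. None plays the role of
    the unbound index written '_' in the deduction rules."""
    if i is None:
        return j
    if j is None:
        return i
    if i == j:
        return i
    raise ValueError("indices %r and %r cannot be combined" % (i, j))

# e.g. combine(3, None) == 3, combine(None, 7) == 7, combine(2, 2) == 2;
# combine(2, 5) is undefined and raises an error.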
The adjunction operation entails splicing an auxiliary tree in at an internal node in an elementary tree also labeled with nonterminal A. Trees without a foot node, which serve as a base for derivations and may combine with other trees by substitution, are called initial trees. Examples of the adjunction and substitution operations are given in Figure 1. For further background, refer to the survey by (Joshi and Schabes, 1997). Shieber et al. (1995) and Vijay-Shanker (1987) apply the Cocke-Kasami-Younger (CKY) algorithm first introduced for use with context-free grammars in Chomsky normal form (Kasami, 1965; Younger, 1967) to the TAG parsing problem to generate parsers with a time complexity of O(n6 |G|2 ). In 93 The side conditions Init() and Aux() hold if is an initial tree or an auxiliary tree, respectively. Label(@a) specifies the label of the node in tree at address a. Ft() specifies the address of the foot node of tree . Adj(@a, ) holds if tree may adjoin into tree at address a. Subst(@a, ) holds if tree may substitute into tree at address a. These conditions fail if the adjunction or substitution is prevented by constraints such as mismatched node labels. Multi-component TAG (MCTAG) generalizes TAG by allowing the elementary items to be sets of trees rather than single trees (Joshi and Schabes, 1997). The basic operations are the same but all trees in a set must adjoin (or substitute) into another tree or tree set in a single step in the derivation. An MCTAG is tree-local if tree sets are required to adjoin within a single elementary tree (Weir, A Gorn address uniquely identifies a node within a tree. The Gorn address of the root node is . The jth child of the node with address i has address i · j. 1 Goal Item: @· , 0, , , n @a· , i - 1, , , i @a· , i, , , i @Ft() , p, p, q, q @(a · 1)· , i, j, k, l @a , i, j, k, l @(a · 1)· , i, j, k, l , @(a · 2)· , l, j , k , m @a , i, j j , k k , m @· , i, p, q, l , @a , p, j, k, q @a· , i, j, k, l @a , i, j, k, l @a· , i, j, k, l @· , i, , , l @a· , i, , , l Init() Label(@) = S Label(@a) = wi Label(@a) = Aux() Terminal Axiom: Empty Axiom: Foot Axiom: Unary Complete: @(a · 2) undefined Binary Complete: Adjoin: Adj(@a, ) No Adjoin: Substitute: Subst(@a, ) Figure 2: The CKY algorithm for TAG 1988). Although tree-local MCTAG (TL-MCTAG) has the same generative capacity as TAG (Weir, 1988), the conversion to TAG is exponential and the TL-MCTAG formalism is NP-hard to recognize (Sřgaard et al., 2007). An MCTAG is set-local if tree sets required to adjoin within a single elementary tree set (Weir, 1988). Set-local MCTAG (SL-MCTAG) has equivalent expressivity to linear context-free rewriting systems and recognition is provably PSPACE complete (Nesson et al., 2008). relationships allowed between members of the same tree set in the derivation tree. TAG derivation trees provide the information about how the elementary structures of the grammar combine that is necessary to construct the derived tree. Nodes in a TAG derivation tree are labeled with identifiers of elementary structures. One elementary structure is the child of another in the derivation tree if it adjoins or substitutes into it in the derivation. Arcs in the derivation tree are labeled with the address in the target elementary structure at which the operation takes place. 
In MCTAG the derivation trees are often drawn with identifiers of entire tree sets as the nodes of the tree because the lexical locality constraints require that each elementary tree set be the derivational child of only one other tree set. However, if we elaborate the derivation tree to include a node for each tree in the grammar rather than only for each tree set we can see a stark contrast in the derivational 3 Domains of Locality and Derivation Trees The domains of locality of TL-MCTAG and SLMCTAG (and trivially, TAG) can be thought of as lexically defined. That is, all locations at which the adjunction of one tree set into another may occur must be present within a single lexical item. However, we can also think of locality derivationally. In a derivationally local system the constraint is on the 94 1: A 2: S B A A A A b B a B B b B a 2a 3a ··· 1 2b 3b ··· 3: { { } } 3a 2a 3b 2b Figure 3: An example SL-MCTAG grammar that generates the language ww and associated derivation tree that demonstrating an arbitrarily long derivational distance between the trees of a given tree set and their nearest common ancestor. Note that if this grammar is interpreted as a TL-MCTAG grammar only two derivations are possible (for the strings aa and bb). join simultaneously. In terms of well-formed derivation trees, this amounts to disallowing derivations in which a tree from a given set is the ancestor of a tree from the same tree set. For most linguistic applications of TAG, this requirement seems natural and is strictly obeyed. There are a few applications, including flexible composition and scrambling in free-word order languages that benefit from TAG-based grammars that drop the simultaneity requirement (Chiang and Scheffler, 2008; Rambow, 1994). From a complexity perspective, however, checking the simultaneity requirement is expensive (Kallmeyer, 2007). As a result, it can be advantageous to select a base formalism that does not require simultaneity even if the grammars implemented with it do not make use of that additional freedom. 5 Restricted Non-simultaneous MCTAG The simplest version of a derivationally local TAGbased formalism is most similar to non-local MCTAG. There is no lexical locality requirement at all. In addition, we drop the simultaneity requirement. Thus the only constraint on elementary tree sets is the limit, d, on the derivational distance between the trees in a given set and their nearest common ancestor. We call this formalism restricted nonsimultaneous MCTAG. Note that if we constrain d to be one, this happens to enforce both the derivational delay limit and the lexical locality requirement of TL-MCTAG. A CKY-style parser for restricted NS-MCTAG with a restriction of d is given in Figure 4. The items of this parser contain d lists, 1 , . . . , d , called histories that record the identities of the trees that have already adjoined in the derivation in order to enforce the locality constraints. The identities of the trees in a tree set that have adjoined in a given derivation are maintained in the histories until all the trees from that set have adjoined. Once the locality constraint is checked for a tree set, the Filter side condition expunges those trees from the histories. A tree is recorded in this history list with superscript i, where i is the derivational distance between the location where the recorded tree adjoined and the location of the current item. 
The locality constraint is enforced at the point of adjunction or substitution where the locality of these two formalisms. In TL-MCTAG all trees in a set must adjoin to the same tree. This means that they must all be siblings in the derivation tree. In SL-MCTAG, on the other hand, it is possible to generate derivations with arbitrarily long distances before the nearest common ancestor of two trees from the same elementary tree set is reached. An example SL-MCTAG grammar that can produce an arbitrarily long derivational distance to the nearest common ancestor of the trees in a given tree set is given in Figure 3. Chiang and Scheffler (2008) recently introduced one variant of MCTAG, delayed Tree-Local MCTAG (delayed TL-MCTAG) that uses a derivational notion of locality. In this paper we introduce two additional derivationally local TAG-based formalisms, restricted non-simultaneous MCTAG (restricted NSMCTAG) and restricted Vector TAG (restricted VTAG) and demonstrate by construction of parsers how each gives rise to a hierarchy of derivationally local formalisms with a well-defined efficiency penalty for each step of derivational distance permitted. 4 The Simultaneity Requirement In addition to lexical locality constraints the definition of MCTAG requires that all trees from a set ad95 Goal Item 0 @· , 0, , , n, , . . . , x @a· , i - 1, , , i, , . . . , x @a· , i, , , i, , . . . , x @Ft(x ) , p, p, q, q, , . . . , x @(a · 1)· , i, j, k, l, 1 , . . . , d x @a , i, j, k, l, 1 , . . . , d Init(1 ) Label(0 @) = S Terminal Axiom Empty Axiom Foot Axiom Unary Complete || = 1 Label(x @a) = wi Label(x @a) = Aux(x ) x @(a · 2) undefined Filter(1 1 , . . . , 1 2 Binary Complete x @(a · 1)· , i, j, k, l, 1 , . . . , d x @(a · 2)· , l, j , k , m, 1 , . . . , d 1 1 2 2 x @a , i, j j , k k , m, 1 , . . . , d Adjoin: y @· , i, p, q, l, 1 , . . . , d-1 , x @a , p, j, k, q, 1 , . . . , d 1 2 2 1 · , i, j, k, l, 1 , . . . , d x @a Substitute: y @· , i, , , l, 1 , . . . , d-1 , 1 1 x @a· , i, , , l, 1 , . . . , d No Adjoin: x @a , i, j, k, l, 1 , . . . , d x @a· , i, j, k, l, 1 , . . . , d d d ) = 1 2 1 , . . . , d Adj(x @a, y ) Filter(1 {y }, 2 1 , 2 2 1 d d-1 ) = . . . , 2 1 1 , . . . , d Subst(x @a, y ) Filter({y }, 1 , . . . , d-1 ) 1 1 = 1 , . . . , d Figure 4: Axioms and inference rules for the CKY algorithm for restricted NS-MCTAG with a restriction of d. history at the limit of the permissible delay must be empty for the operation to succeed. 6 Restricted V-TAG A Vector-TAG (V-TAG) (Rambow, 1994) is similar to an MCTAG in that the elementary structures are sets (or vectors) of TAG trees. A derivation in a VTAG is defined as in TAG. There is no locality requirement or other restriction on adjunction except that if one tree from a vector is used in a derivation, all trees from that vector must be used in the derivation. The trees in a vector may be connected by dominance links between the foot nodes of auxiliary trees and any node in other trees in the vector. All adjunctions must respect the dominance relations in that a node 1 that dominates a node 2 must appear on the path from 2 to the root of the derived tree. The definition of V-TAG is very similar to that of 96 non-local MCTAG as defined by Weir (1988) except that in non-local MCTAG all trees from a tree set are required to adjoin simultaneously. Restricted V-TAG constrains V-TAG in several ways. First, the dominance chain in each elementary tree vector is required to define a total order over the trees in the vector. 
This means there is a single base tree in each vector. Note also that all trees other than the base tree must be auxiliary trees in order to dominate other trees in the vector. The base tree may be either an initial tree or an auxiliary tree. Second, a restricted V-TAG has a restriction level, d, that determines the largest derivational distance that may exists between the base tree and the highest tree in a tree vector in a derivation. Restricted V-TAG differs from restricted NS-MCTAG in one important respect: the dominance requirements of restricted V-TAG require that trees from the same set must appear along a single path in the derived tree, whereas in restricted NS-MCTAG trees from the same set need not adhere to any dominance relationship in the derived tree. A CKY-style parser for restricted V-TAG with restriction level d is given in Figure 5. Parsing is similar to delayed TL-MCTAG in that we have a set of histories for each restriction level. However, because of the total order over trees in a vector, the parser only needs to maintain the identity of the highest tree from a vector that has been used in the derivation along with its distance from the base tree from that vector. The Filter side condition accordingly expunges trees that are the top tree in the dominance chain of their tree vector. The side conditions for the Adjoin non-base rule enforce that the dominance constraints are satisfied and that the derivational distance from the base of a tree vector to its currently highest adjoined tree is maintained accurately. We note that in order to allow a non-total ordering of the trees in a vector we would simply have to record all trees in a tree vector in the histories as is done in the delayed TL-MCTAG parser. Figure 7: Examples of 1-delay (top) and 2-delay (bottom) taken from Chiang and Scheffler (2008). The delays are marked with dashed boxes on the derivation trees. 7 Delayed TL-MCTAG Chiang and Scheffler (2008) introduce the delayed TL-MCTAG formalism which makes use of a derivational distance restriction in a somewhat different way. Rather than restricting the absolute distance between the trees of a set and their nearest common ancestor, given a node in a derivation tree, delayed TL-MCTAG restricts the number of tree sets that are not fully dominated by . Borrowing directly from Chiang and Scheffler (2008), Figure 7 gives two examples. Parsing for delayed TL-MCTAG is not discussed by Chiang and Scheffler (2008) but can be accomplished using a similar CKY-style strategy as in the two parsers above. We present a parser in Figure 6. Rather than keeping histories that record derivational distance, we keep an active delay list for each item that records the delays that are active (by recording the identities of the trees that have adjoined) for the tree of which the current node is a part. At the root of each tree the active delay list is filtered using the Filter side condition to remove all tree sets that are fully dominated and the resulting 97 list is checked using the Size to ensure that it contains no more than d distinct tree sets where d is the specified delay for the grammar. The active delays for a given tree are passed to its derivational parent when it adjoins or substitutes. Delayed TL-MCTAG differs from both of the previous formalisms in that it places no constraint on the length of a delay. 
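As a rough illustration of the bookkeeping just described, the sketch below checks a completed derivation tree and reports the largest number of tree sets that have begun attaching but are not yet fully dominated at any node. The DNode class, the traversal, and the exact notion of a "pending" set are assumptions made for this example; the parsers given in Figures 4, 5 and 6 enforce the corresponding constraints incrementally during parsing rather than after the fact.

from collections import Counter

class DNode:
    """A node in a derivation tree: records which elementary tree set the
    tree used at this node came from."""
    def __init__(self, tree_set, children=()):
        self.tree_set = tree_set
        self.children = list(children)

def _subtree_counts(node, out):
    """Count, for each tree set, how many of its members occur in the
    subtree rooted at `node` (including `node` itself)."""
    c = Counter([node.tree_set])
    for child in node.children:
        c += _subtree_counts(child, out)
    out[id(node)] = c
    return c

def max_pending_sets(root):
    """Largest number of tree sets that are 'pending' at any derivation
    node: sets with some, but not all, of their members inside that
    node's subtree."""
    totals = Counter()
    stack = [root]
    while stack:
        n = stack.pop()
        totals[n.tree_set] += 1
        stack.extend(n.children)

    per_node = {}
    _subtree_counts(root, per_node)

    worst = 0
    stack = [root]
    while stack:
        n = stack.pop()
        pending = sum(1 for s, k in per_node[id(n)].items() if k < totals[s])
        worst = max(worst, pending)
        stack.extend(n.children)
    return worst

# Small example: two two-tree sets A and B; the second member of each
# attaches below the first member of A.
root = DNode("A", [DNode("B", [DNode("A"), DNode("B")])])
print(max_pending_sets(root))   # 1: at most one set is pending at any node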
On the other hand while the previous formalisms allow unlimited short delays to be pending at the same time, in delayed TLMCTAG, only a restricted number of delays may be active at once. Similar to restricted V-TAG, there is no simultaneity requirement, so a tree may have another tree from the same set as an ancestor. 8 Complexity The complexity of the restricted NS-MCTAG and restricted V-TAG parsers presented above depends on the number of possible histories that may appear in an item. For each step of derivational distance permitted between trees of the same set, the corresponding history permits many more entries. History 1 may contain trees that have adjoined into the same tree as the node of the current item. The number of entries is therefore limited by the number of adjunction sites in that tree, which is in turn limited by the number of nodes in that tree. We will call the maximum number of nodes in a tree in the grammar t. Theoretically, any tree in the grammar could adjoin at any of these adjunction sites, meaning that the number of possible values for each entry in the history is bounded by the size of the grammar |G|. Thus the size of 1 is O(|G|t ). For 2 the en- Unary Complete Binary Complete x @(a · 1)· , i, j, k, l, 1 , . . . , d x @(a · 2)· , l, j , k , m, 1 , . . . , d 1 1 2 2 x @a , i, j j , k k , m, 1 1 , . . . , d d 2 1 2 1 Adjoin base: 1 @· , i, p, q, l, 1 , . . . , d-1 , x @a , p, j, k, q, 1 , . . . , d 1 2 2 1 x @a· , i, j, k, l, 1 , . . . , d Adjoin non-base: y @· , i, p, q, l, 1 , . . . , d-1 , x @a , p, j, k, q, 1 , . . . , d 1 2 2 1 x @a· , i, j, k, l, 1 , . . . , d for unique i s.t. y-1 i , i = (i i-1 {y }) - {y-1 } 2 2 2 1 2 for i s.t. y-1 i , i = i i-1 / 2 2 2 2 1 Substitute: 1 @· , i, , , l, 1 , . . . , d-1 , 1 1 x @a· , i, , , l, 1 , . . . , d No Adjoin: x @a , i, j, k, l, 1 , . . . , d x @a· , i, j, k, l, 1 , . . . , d x @(a · 1)· , i, j, k, l, 1 , . . . , d x @a , i, j, k, l, 1 , . . . , d x @(a · 2) undefined Adj(x @a, 1 ) Filter(1 {1 }, 2 1 , 2 2 1 . . . , d d-1 ) = 2 1 1 , . . . , d Adj(x @a, y ) Filter(1 , 2 1 , . . . , 1 2 2 d-1 d 1 ) = 2 1 , . . . , d Subst(x @a, 1 ) Filter({1 }, 1 , . . . , d-1 ) 1 1 = 1 , . . . , d Figure 5: Inference rules for the CKY algorithm for restricted V-TAG with a restriction of d. Item form, goal item and axioms are omitted because they are identical to those in restricted NS-MCTAG parser. tries correspond to tree that have adjoined into a tree that has adjoined into the tree of the current item. Thus, for each of the t trees that may have adjoined at a derivational distance of one, there are t more trees that may have adjoined at a derivational dis2 tance of two. The size of 2 is therefore |G|t . The combined size of the histories for a grammar with a delay or restriction of d is therefore O(|G| i=1 t ). Replacing the sum with its closed form solution, we have O(|G| t-1 -1 ) histories. Using the reasoning about the size of the histories given above, the restricted NS-MCTAG parser presented here has a complexity of O(n6 |G|1+ t-1 ), where t is as defined above and d is the limit on delay of adjunction. For a tree-local MCTAG, the complexity reduces to O(n6 |G|2+t ). For the linguistic applications that motivate this chapter no delay greater than two is needed, resulting in a complexity 2 of O(n6 |G|2+t+t ). The same complexity analysis applies for re98 td+1 -1 td+1 -1 d d stricted V-TAG. 
However, we can provide a somewhat tighter bound by noting that the rank, r, of the grammar--how many tree sets adjoin in a single tree--and the fan out, f of the grammar--how many trees may be in a single tree set--are limited by t. That is, a complete derivation containing |D| tree sets can contain no more than t |D| individual trees and also no more than rf |D| individual trees. In the restricted V-TAG algorithm we maintain only one tree from a tree set in the history at a time, so rather than maintaining O(t) entries in each history, we only need to maintain the smaller O(r) entries. The complexity of the delayed TL-MCTAG parser depends on the number of possible active delay lists. As above, each delay list may have a maximum of t entries for trees that adjoin directly into it. The restriction on the number of active delays means that the active delay lists passed up from these child nodes at the point of adjunction or substitution can have size no more than d. This results in an additional td(f - 1) possible entries in the active de- Goal Item: 0 @· , 0, , , n, , . . . , x @a· , i - 1, , , i, , . . . , {x } x @a· , i, , , i, , . . . , {x } x @Ft(x ) , p, p, q, q, , . . . , {x } x @(a · 1)· , i, j, k, l, x @a , i, j, k, l, Init(1 ) Label(0 @) = S Terminal Axiom Empty Axiom Foot Axiom Unary Complete || = 1 Label(x @a) = wi Label(x @a) = Aux(x ) x @(a · 2) undefined Binary Complete x @(a · 1)· , i, j, k, l, 1 x @(a · 2)· , l, j , k , m, 2 x @a , i, j j , k k , m, 1 2 Adjoin: y @· , i, p, q, l, x @a , p, j, k, q, x @a· , i, j, k, l, Substitute: y @· , i, , , l, x @a· , i, , , l, {x } x @a , i, j, k, l, x @a· , i, j, k, l, Adj(x @a, y ) Filter( , ) Size( ) d Subst(x @a, y ) Filter( , ) Size( ) d No Adjoin: Figure 6: Axioms and inference rules for the CKY algorithm for delayed TL-MCTAG with a delay of d. lay list, giving a total number of active delay lists of O(|G|t(1+d(f -1)) ). Thus the complexity of the parser is O(n6 |G|2+t(1+d(f -1)) ). tive way to add flexibility to MCTAG without losing computational tractability. 9 Conclusion Acknowledgments This material is based upon work supported by the National Science Foundation under Grant No. BCS0827979. Each of the formalisms presented above extends the flexibility of MCTAG beyond that of TL-MCTAG while maintaining, as we have shown herein, complexity much less than that of SL-MCTAG. All three formalisms permit modeling of flexible composition (because they permit one member of a tree set to be a derivational ancestor of another tree in the same set), at least restricted NS-MCTAG and restricted V-TAG permit analyses of scrambling, and all three permit analyses of the various challenging semantic constructions mentioned in the introduction. We conclude that extending locality by constraining derivational distance may be an effec99 References David Chiang and Tatjana Scheffler. 2008. Flexible composition and delayed tree-locality. In The Ninth International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+9). Aravind K. Joshi and Yves Schabes. 1997. Treeadjoining grammars. In G. Rozenberg and A. Salomaa, editors, Handbook of Formal Languages, pages 69­124. Springer. Laura Kallmeyer and Maribel Romero. 2004. LTAG semantics with semantic unification. In Proceedings of the 7th International Workshop on Tree-Adjoining Grammars and Related Formalisms (TAG+7), pages 155­162, Vancouver, May. Laura Kallmeyer. 2007. A declarative characterization of different types of multicomponent tree adjoining grammars. 
In Andreas Witt Georg Rehm and Lothar Lemnitzer, editors, Datenstrukturen f¨ r linguistische u Ressourcen und ihre Anwendungen, pages 111­120. T. Kasami. 1965. An efficient recognition and syntax algorithm for context-free languages. Technical Report AF-CRL-65-758, Air Force Cambridge Research Laboratory, Bedford, MA. Rebecca Nesson and Stuart M. Shieber. 2006. Simpler TAG semantics through synchronization. In Proceedings of the 11th Conference on Formal Grammar, Malaga, Spain, 29­30 July. Rebecca Nesson, Giorgio Satta, and Stuart M. Shieber. 2008. Complexity, parsing, and factorization of treelocal multi-component tree-adjoining grammar. Technical report, Harvard University. Owen Rambow. 1994. Formal and computational aspects of natural language syntax. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA. Maribel Romero, Laura Kallmeyer, and Olga BabkoMalaya. 2004. LTAG semantics for questions. In Proceedings of the 7th International Workshop on Tree-Adjoining Grammars and Related Formalisms (TAG+7), pages 186­193, Vancouver, May. Stuart M. Shieber, Yves Schabes, and Fernando C. N. Pereira. 1995. Principles and implementation of deductive parsing. Journal of Logic Programming, 24(1­2):3­36, July­August. Also available as cmplg/9404008. Anders Sřgaard, Timm Lichte, and Wolfgang Maier. 2007. On the complexity of linguistically motivated extensions of tree-adjoining grammar. In Recent Advances in Natural Language Processing 2007. K. Vijay-Shanker. 1987. A study of tree-adjoining grammars. PhD Thesis, Department of Computer and Information Science, University of Pennsylvania. David Weir. 1988. Characterizing mildly contextsensitive grammar formalisms. PhD Thesis, Department of Computer and Information Science, University of Pennsylvania. D.H. Younger. 1967. Recognition and parsing of context-free languages in time n3 . Information and Control, 10(2):189­208. 100 Improving Unsupervised Dependency Parsing with Richer Contexts and Smoothing William P. Headden III, Mark Johnson, David McClosky Brown Laboratory for Linguistic Information Processing (BLLIP) Brown University Providence, RI 02912 {headdenw,mj,dmcc}@cs.brown.edu Abstract Unsupervised grammar induction models tend to employ relatively simple models of syntax when compared to their supervised counterparts. Traditionally, the unsupervised models have been kept simple due to tractability and data sparsity concerns. In this paper, we introduce basic valence frames and lexical information into an unsupervised dependency grammar inducer and show how this additional information can be leveraged via smoothing. Our model produces state-of-theart results on the task of unsupervised grammar induction, improving over the best previous work by almost 10 percentage points. The big dog barks Figure 1: Example dependency parse. 1 Introduction The last decade has seen great strides in statistical natural language parsing. Supervised and semisupervised methods now provide highly accurate parsers for a number of languages, but require training from corpora hand-annotated with parse trees. Unfortunately, manually annotating corpora with parse trees is expensive and time consuming so for languages and domains with minimal resources it is valuable to study methods for parsing without requiring annotated sentences. In this work, we focus on unsupervised dependency parsing. Our goal is to produce a directed graph of dependency relations (e.g. Figure 1) where each edge indicates a head-argument relation. 
Since the task is unsupervised, we are not given any examples of correct dependency graphs and only take words and their parts of speech as input. Most of the recent work in this area (Smith, 2006; Cohen et al., 2008) has focused on variants of the 101 Dependency Model with Valence (DMV) by Klein and Manning (2004). DMV was the first unsupervised dependency grammar induction system to achieve accuracy above a right-branching baseline. However, DMV is not able to capture some of the more complex aspects of language. Borrowing some ideas from the supervised parsing literature, we present two new models: Extended Valence Grammar (EVG) and its lexicalized extension (L-EVG). The primary difference between EVG and DMV is that DMV uses valence information to determine the number of arguments a head takes but not their categories. In contrast, EVG allows different distributions over arguments for different valence slots. L-EVG extends EVG by conditioning on lexical information as well. This allows L-EVG to potentially capture subcategorizations. The downside of adding additional conditioning events is that we introduce data sparsity problems. Incorporating more valence and lexical information increases the number of parameters to estimate. A common solution to data sparsity in supervised parsing is to add smoothing. We show that smoothing can be employed in an unsupervised fashion as well, and show that mixing DMV, EVG, and L-EVG together produces state-ofthe-art results on this task. To our knowledge, this is the first time that grammars with differing levels of detail have been successfully combined for unsupervised dependency parsing. A brief overview of the paper follows. In Section 2, we discuss the relevant background. Section 3 presents how we will extend DMV with additional Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 101­109, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics features. We describe smoothing in an unsupervised context in Section 4. In Section 5, we discuss search issues. We present our experiments in Section 6 and conclude in Section 7. that satisfies the following properties: 1. Tied rules have the same probability. 2. Rules expanding the same nonterminal are never tied. 3. If N1 1 and N2 2 are tied then the tying relation defines a one-to-one mapping between rules in RN1 and RN2 , and we say that N1 and N2 are tied nonterminals. As we see below, we can estimate tied PCFGs using standard techniques. Clearly, the tying relation also defines an equivalence class over nonterminals. The tying relation allows us to formulate the distributions over trees in terms of rule equivalence classes Ż and nonterminal equivalence classes. Suppose R is Ż is the set the set of rule equivalence classes and N of nonterminal equivalence classes. Since all rules in an equivalence class r have the same probability Ż (condition 1), and since all the nonterminals in an Ż Ż equivalence class N N have the same distribution over rule equivalence classes (condition 1 and 3), we can define the set of rule equivalence classes ŻŻ RN associated with a nonterminal equivalence class Ż Ż N , and a vector of probabilities, indexed by rule Ż ŻŻ equivalence classes r R . N refers to the subŻ Ż associated with nonterminal equivalence vector of Ż ŻŻ class N , indexed by r RN . Since rules in the Ż same equivalence class have the same probability, ŻŻ we have that for each r r, r = r . 
Ż Let f (t, r) denote the number of times rule r appears in tree t, and let f (t, r ) = rŻ f (t, r). We Ż r see that the complete data likelihood is P (s, t|) = r r R rŻ Ż Ż f r (t,r) = r R Ż Ż f r ŻŻ (t,Ż) r 2 Background In this paper, the observed variables will be a corpus of n sentences of text s = s1 . . . sn , and for each word sij an associated part-of-speech ij . We denote the set of all words as Vw and the set of all parts-ofspeech as V . The hidden variables are parse trees Ż t = t1 . . . tn and parameters which specify a distribution over t. A dependency tree ti is a directed acyclic graph whose nodes are the words in si . The graph has a single incoming edge for each word in each sentence, except one called the root of ti . An edge from word i to word j means that word j is an argument of word i or alternatively, word i is the head of word j. Note that each word token may be the argument of at most one head, but a head may have several arguments. If parse tree ti can be drawn on a plane above the sentence with no crossing edges, it is called projective. Otherwise it is nonprojective. As in previous work, we restrict ourselves to projective dependency trees. The dependency models in this paper will be formulated as a particular kind of Probabilistic Context Free Grammar (PCFG), described below. 2.1 Tied Probabilistic Context Free Grammars In order to perform smoothing, we will find useful a class of PCFGs in which the probabilities of certain rules are required to be the same. This will allow us to make independence assumptions for smoothing purposes without losing information, by giving analogous rules the same probability. Let G = (N , T , S, R, ) be a Probabilistic Context Free Grammar with nonterminal symbols N , terminal symbols T , start symbol S N , set of productions R of the form N , N N , (N T ) . Let RN indicate the subset of R whose left-hand sides are N . is a vector of length |R|, indexed by productions N R. N specifies the probability that N rewrites to . We will let N indicate the subvector of corresponding to RN . A tied PCFG constrains a PCFG G with a tying relation, which is an equivalence relation over rules 102 That is, the likelihood is a product of multinomials, one for each nonterminal equivalence class, and there are no constraints placed on the parameters of these multinomials besides being positive and summing to one. This means that all the standard estimation methods (e.g. Expectation Maximization, Variational Bayes) extend directly to tied PCFGs. Maximum likelihood estimation provides a point Ż estimate of . However, often we want to incorpoŻ rate information about by modeling its prior distriŻ Ż bution. As a prior, for each N N we will specify a ŻŻ Dirichlet distribution over N with hyperparameters N . The Dirichlet has the density function: Ż ŻŻ Ż P (N |N ) = ( r RN Ż ŻŻ r ) Ż r RN Ż ŻŻ and Q(t). Kurihara and Sato (2004) show that each ŻŻ Q(N ) is a Dirichlet distribution with parameters r = r + EQ(t) f (t, r). ^ 2.2 Split-head Bilexical CFGs r RN Ż ŻŻ (r ) Ż ŻŻ Ż r r -1 , Ż Thus the prior over is a product of Dirichlets,which is conjugate to the PCFG likelihood function (JohnŻ son et al., 2007). 
That is, the posterior P (|s, t, ) is also a product of Dirichlets, also factoring into a Ż Dirichlet for each nonterminal N , where the parameters r are augmented by the number of times rule Ż r is observed in tree t: Ż Ż Ż Ż P (|s, t, ) P (s, t|)P (|) r Ż Żf (t,Ż)+r -1 r Ż r R Ż Ż The most successful recent work on dependency induction has focused on the Dependency Model with Valence (DMV) by Klein and Manning (2004). DMV is a generative model in which the head of Ż P (s, t, |) the sentence is generated and then each head recurŻ log P (s|) Q(t, ) log =F Ż Ż Q(t, ) sively generates its left and right dependents. The t arguments of head H in direction d are generated The negative of the lower bound, -F, is sometimes by repeatedly deciding whether to generate another called the free energy. new argument or to stop and then generating the As is typical in variational approaches, Kuri- argument if required. The probability of deciding hara and Sato (2004) make certain independence as- whether to generate another argument is conditioned sumptions about the hidden variables in the vari- on H, d and whether this would be the first argument ational posterior, which will make estimating it (this is the sense in which it models valence). When Ż Ż simpler. It factors Q(t, ) = Q(t)Q() = DMV generates an argument, the part-of-speech of n Ż Ż ). The goal is to recover that argument A is generated given H and d. Ż Ż i=1 Qi (ti ) N N Q(N Ż Q(), the estimate of the posterior distribution over 1 Efficiently parsable versions of split-head bilexical CFGs parameters and Q(t), the estimate of the posterior for the models described in this paper can be derived using the distribution over trees. Finding a local maximum of fold-unfold grammar transform (Eisner and Blatz, 2007; JohnŻ F is done via an alternating maximization of Q() son, 2007). 103 We can see that r acts as a pseudocount of the numŻ ber of times r is observed prior to t. Ż To make use of this prior, we use the Variational Bayes (VB) technique for PCFGs with Dirichlet Priors presented by Kurihara and Sato (2004). VB esŻ timates a distribution over . In contrast, Expectation Maximization estimates merely a point estiŻ Ż mate of . In VB, one estimates Q(t, ), called the variational distribution, which approximates the Ż posterior distribution P (t, |s, ) by minimizing the KL divergence of P from Q. Minimizing the KL divergence, it turns out, is equivalent to maximizing a lower bound F of the log marginal likelihood log P (s|). In the sections that follow, we frame various dependency models as a particular variety of CFGs known as split-head bilexical CFGs (Eisner and Satta, 1999). These allow us to use the fast Eisner and Satta (1999) parsing algorithm to compute the expectations required by VB in O(m3 ) time (Eisner and Blatz, 2007; Johnson, 2007) where m is the length of the sentence.1 In the split-head bilexical CFG framework, each nonterminal in the grammar is annotated with a terminal symbol. For dependency grammars, these annotations correspond to words and/or parts-ofspeech. Additionally, split-head bilexical CFGs require that each word sij in sentence si is represented in a split form by two terminals called its left part sijL and right part sijR . The set of these parts constitutes the terminal symbols of the grammar. This split-head property relates to a particular type of dependency grammar in which the left and right dependents of a head are generated independently. Note that like CFGs, split-head bilexical CFGs can be made probabilistic. 
2.3 Dependency Model with Valence Rule S YH YH LH RH LH HL LH L1 H L HL H L L1 H H L1 YA L H H Description Select H as root Move to split-head representation STOP CONT STOP CONT Lbarks L1 barks S Ybarks Rbarks barksR | dir = L, head = H, val = 0 | dir = L, head = H, val = 0 | dir = L, head = H, val = 1 | dir = L, head = H, val = 1 Ldog L1 dog Ydog L barks barksL Rdog dogR Arg A | dir = L, head = H YT he LT he TheL RT he TheR Ybig Lbig bigL Rbig bigR L dog L1 dog L dog dogL Figure 2: Rule schema for DMV. For brevity, we omit the portion of the grammar that handles the right arguments since they are symmetric to the left (all rules are the same except for the attachment rule where the RHS is reversed). val {0, 1} indicates whether we have made any attachments. The grammar schema for this model is shown in Figure 2. The first rule generates the root of the sentence. Note that these rules are for H, A V so there is an instance of the first schema rule for each part-of-speech. YH splits words into their left and right components. LH encodes the stopping decision given that we have not generated any arguments so far. L encodes the same decision after generatH ing one or more arguments. L1 represents the distriH bution over left attachments. To extract dependency relations from these parse trees, we scan for attachment rules (e.g., L1 YA L ) and record that H H A depends on H. The schema omits the rules for right arguments since they are symmetric. We show a parse of "The big dog barks" in Figure 3.2 Much of the extensions to this work have focused on estimation procedures. Klein and Manning (2004) use Expectation Maximization to estimate the model parameters. Smith and Eisner (2005) and Smith (2006) investigate using Contrastive Estimation to estimate DMV. Contrastive Estimation maximizes the conditional probability of the observed sentences given a neighborhood of similar unseen sequences. The results of this approach vary widely based on regularization and neighborhood, but often outperforms EM. Note that our examples use words as leaf nodes but in our unlexicalized models, the leaf nodes are in fact parts-of-speech. 2 Figure 3: DMV split-head bilexical CFG parse of "The big dog barks." Smith (2006) also investigates two techniques for maximizing likelihood while incorporating the locality bias encoded in the harmonic initializer for DMV. One technique, skewed deterministic annealing, ameliorates the local maximum problem by flattening the likelihood and adding a bias towards the Klein and Manning initializer, which is decreased during learning. The second technique is structural annealing (Smith and Eisner, 2006; Smith, 2006) which penalizes long dependencies initially, gradually weakening the penalty during estimation. If hand-annotated dependencies on a held-out set are available for parameter selection, this performs far better than EM; however, performing parameter selection on a held-out set without the use of gold dependencies does not perform as well. Cohen et al. (2008) investigate using Bayesian Priors with DMV. The two priors they use are the Dirichlet (which we use here) and the Logistic Normal prior, which allows the model to capture correlations between different distributions. They initialize using the harmonic initializer of Klein and Manning (2004). They find that the Logistic Normal distribution performs much better than the Dirichlet with this initialization scheme. 
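Returning to the DMV generative story described at the start of this subsection, the following sketch samples a dependency tree by repeatedly deciding, for each head and direction, whether to stop or to generate another argument, with the stop decision conditioned on whether an argument has already been generated. The probability tables are toy values invented for illustration, and the depth cap exists only to keep the example finite; none of this is the authors' code or their estimated parameters.

import random

# Toy parameter tables for illustration only.
P_STOP = {            # P(stop | head, direction, has_generated_argument)
    ("V", "L", False): 0.4, ("V", "L", True): 0.7,
    ("V", "R", False): 0.3, ("V", "R", True): 0.8,
    ("N", "L", False): 0.5, ("N", "L", True): 0.8,
    ("N", "R", False): 0.9, ("N", "R", True): 0.95,
    ("D", "L", False): 0.99, ("D", "L", True): 0.99,
    ("D", "R", False): 0.99, ("D", "R", True): 0.99,
}
P_ARG = {             # P(argument tag | head, direction); DMV ignores valence here
    ("V", "L"): {"N": 1.0}, ("V", "R"): {"N": 1.0},
    ("N", "L"): {"D": 1.0}, ("N", "R"): {"D": 1.0},
    ("D", "L"): {"N": 1.0}, ("D", "R"): {"N": 1.0},
}
P_ROOT = {"V": 1.0}

def sample(dist):
    """Draw one item from a dict mapping items to probabilities."""
    r, acc = random.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r <= acc:
            return item
    return item

def generate(head, depth=0, max_depth=4):
    """Recursively generate the left and right dependents of `head`,
    returning a (head, left_dependents, right_dependents) triple.
    The depth cap is only a guard to keep the toy example finite."""
    deps = {"L": [], "R": []}
    for d in ("L", "R"):
        has_arg = False
        while depth < max_depth and random.random() > P_STOP[(head, d, has_arg)]:
            arg = sample(P_ARG[(head, d)])
            deps[d].append(generate(arg, depth + 1, max_depth))
            has_arg = True
    return (head, deps["L"], deps["R"])

if __name__ == "__main__":
    random.seed(0)
    print(generate(sample(P_ROOT)))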
Cohen and Smith (2009), investigate (concur- 104 Rule S YH YH LH RH LH HL LH L H L L1 H H L L2 H H L2 YA L H H L1 YA HL H Description Select H as root Move to split-head representation STOP CONT STOP CONT . . . Ldog L1 dog . . . Ldog L dog L2 dog | dir = L, head = H, val = 0 | dir = L, head = H, val = 0 | dir = L, head = H, val = 1 | dir = L, head = H, val = 1 YT he TheL TheR Ybig bigL L dog L1 dog L dog dogL YT he TheL TheR Ybig bigL bigR L dog L1 dog dogL bigR Arg A | dir = L, head = H, val = 1 Arg A | dir = L, head = H, val = 0 Figure 4: Extended Valence Grammar schema. As before, we omit rules involving the right parts of words. In this case, val {0, 1} indicates whether we are generating the nearest argument (0) or not (1). Figure 5: An example of moving from DMV to EVG for a fragment of "The big dog." Boxed nodes indicate changes. The key difference is that EVG distinguishes between the distributions over the argument nearest the head (big) from arguments farther away (The). rently with our work) an extension of this, the Shared Logistic Normal prior, which allows different PCFG rule distributions to share components. They use this machinery to investigate smoothing the attachment distributions for (nouns/verbs), and for learning using multiple languages. ure shows that EVG allows these two distributions to be different (nonterminals L2 and L1 ) whereas dog dog DMV forces them to be equivalent (both use L1 as dog the nonterminal). 3.1 Lexicalization 3 Enriched Contexts DMV models the distribution over arguments identically without regard to their order. Instead, we propose to distinguish the distribution over the argument nearest the head from the distribution of subsequent arguments. 3 Consider the following changes to the DMV grammar (results shown in Figure 4). First, we will introduce the rule L2 YA L to denote the deciH H sion of what argument to generate for positions not nearest to the head. Next, instead of having L exH pand to HL or L1 , we will expand it to L1 (attach H H to nearest argument and stop) or L2 (attach to nonH nearest argument and continue). We call this the Extended Valence Grammar (EVG). As a concrete example, consider the phrase "the big hungry dog" (Figure 5). We would expect that distribution over the nearest left argument for "dog" to be different than farther left arguments. The figMcClosky (2008) explores this idea further in an unsmoothed grammar. 3 All of the probabilistic models discussed thus far have incorporated only part-of-speech information (see Footnote 2). In supervised parsing of both dependencies and constituency, lexical information is critical (Collins, 1999). We incorporate lexical information into EVG (henceforth L-EVG) by extending the distributions over argument parts-of-speech A to condition on the head word h in addition to the head part-of-speech H, direction d and argument position v. The argument word a distribution is merely conditioned on part-of-speech A; we leave refining this model to future work. In order to incorporate lexicalization, we extend the EVG CFG to allow the nonterminals to be annotated with both the word and part-of-speech of the head. We first remove the old rules YH LH RH for each H V . Then we mark each nonterminal which is annotated with a part-of-speech as also annotated with its head, with a single exception: YH . We add a new nonterminal YH,h for each H V , h Vw , and the rules YH YH,h and YH,h LH,h RH,h . The rule YH YH,h corresponds to selecting the word, given its part-ofspeech. 
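To make the difference in conditioning information between the three models concrete, the sketch below spells out the event on which each model's argument distribution is conditioned. The tuple layout is shorthand adopted for this example, not the paper's notation.

def dmv_arg_context(head_pos, direction, valence_slot, head_word=None):
    """DMV: the argument tag depends on head POS and direction only."""
    return (head_pos, direction)

def evg_arg_context(head_pos, direction, valence_slot, head_word=None):
    """EVG: additionally distinguishes the argument slot nearest the head
    (valence_slot == 0) from all farther slots (collapsed to 1)."""
    return (head_pos, direction, 0 if valence_slot == 0 else 1)

def levg_arg_context(head_pos, direction, valence_slot, head_word=None):
    """L-EVG: the EVG context plus the head word itself."""
    return evg_arg_context(head_pos, direction, valence_slot) + (head_word,)

# Example: the first left argument of the token dog/NN.
for ctx in (dmv_arg_context, evg_arg_context, levg_arg_context):
    print(ctx.__name__, ctx("NN", "L", 0, "dog"))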
4 Smoothing

In supervised estimation one common smoothing technique is linear interpolation (Jelinek, 1997). This section explains how linear interpolation can be represented using a PCFG with tied rule probabilities, and how one might estimate the smoothing parameters in an unsupervised framework.

In many probabilistic models it is common to estimate the distribution of some event x conditioned on a set of context information, P(x | N^(1) ... N^(k)), by smoothing it with less complicated conditional distributions. Using linear interpolation we model P(x | N^(1) ... N^(k)) as a weighted average of two distributions,

  λ1 P1(x | N^(1), ..., N^(k)) + λ2 P2(x | N^(1), ..., N^(k-1)),

where the distribution P2 makes an independence assumption by dropping the conditioning event N^(k).

In a PCFG a nonterminal N can encode a collection of conditioning events N^(1) ... N^(k), and θ_N determines a distribution, conditioned on N^(1) ... N^(k), over the events represented by the rules r ∈ R_N. For example, in EVG the nonterminal L1_NN encodes three separate pieces of conditioning information: the direction d = left, the head part-of-speech H = NN, and the argument position v = 0; L1_NN → Y_JJ NN_L represents the probability of generating JJ as the first left argument of NN. Suppose in EVG we are interested in smoothing P(A | d, H, v) with a component that excludes the head conditioning event. Using linear interpolation, this would be:

  P(A | d, H, v) = λ1 P1(A | d, H, v) + λ2 P2(A | d, v)
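A minimal sketch of this interpolation, assuming the two component distributions are stored as dictionaries keyed by their conditioning events and that the mixture weights are given (in the model itself the weights are estimated via the tied PCFG described next):

def smoothed_prob(arg, d, H, v, P1, P2, lam1, lam2):
    """P(A | d, H, v) = lam1 * P1(A | d, H, v) + lam2 * P2(A | d, v).
    P1 is keyed by (d, H, v) and P2 by (d, v); each maps arguments to probabilities."""
    assert abs(lam1 + lam2 - 1.0) < 1e-9
    return lam1 * P1[(d, H, v)].get(arg, 0.0) + lam2 * P2[(d, v)].get(arg, 0.0)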
We estimate PCFG rules with linearly interpolated probabilities by creating a tied PCFG, which extends the original grammar with rules that select between the main distribution P1 and the backoff distribution P2, and with rules that correspond to draws from those distributions. We make use of tied rule probabilities to enforce the independence assumption in the backoff distribution. We still use the original grammar to parse the sentence; however, we estimate the parameters in the extended grammar and then translate them back into the original grammar for parsing.

More formally, suppose B ⊆ N is a set of nonterminals (called the backoff set) with conditioning events N^(1), ..., N^(k-1) in common (differing in the conditioning event N^(k)), and with rule sets of the same cardinality. If G is our model's PCFG, we can define a new tied PCFG G' = (N', T, S, R', θ'), where N' = N ∪ {N^bλ | N ∈ B, λ ∈ {1, 2}}, meaning that for each nonterminal N in the backoff set we add two nonterminals N^b1, N^b2 representing the distributions P1 and P2. The new rule set is R' = ∪_N R'_N, where for all N ∈ B we set R'_N = {N → N^bλ | λ ∈ {1, 2}}, meaning that at N we decide which distribution P1, P2 to use; and for N ∈ B and λ ∈ {1, 2}, R'_{N^bλ} = {N^bλ → β | N → β ∈ R_N}, indicating a draw from distribution Pλ. For nonterminals N ∉ B, R'_N = R_N. Finally, for each N, M ∈ B we specify a tying relation between the rules in R'_{N^b2} and R'_{M^b2}, grouping together analogous rules. This has the effect of making an independence assumption about P2, namely that it ignores the conditioning event N^(k), drawing from a common distribution each time a nonterminal N^b2 is rewritten.

For example, in EVG, to smooth P(A = DT | d = left, H = NN, v = 0) with P2(A = DT | d = left, v = 0), we define the backoff set to be {L1_H | H ∈ V}. In the extended grammar we define the tying relation to form rule equivalence classes by the argument they generate, i.e. for each argument A ∈ V we have a rule equivalence class {L1b2_H → Y_A H_L | H ∈ V}.

We can see that in grammar G' each N ∈ B eventually ends up rewriting to one of N's expansions in G. There are two indirect paths, one through N^b1 and one through N^b2. This defines the probability of N → β in G, θ_{N→β}, as the probability of rewriting N as β in G' via N^b1 and N^b2. That is:

  θ_{N→β} = θ'_{N→N^b1} θ'_{N^b1→β} + θ'_{N→N^b2} θ'_{N^b2→β}

The example in Figure 6 shows the probability that L1_dog rewrites to Y_big dog_L in grammar G:

  P_G(L1_dog → Y_big dog_L) = P_G'(L1_dog → L1b1_dog) P_G'(L1b1_dog → Y_big dog_L) + P_G'(L1_dog → L1b2_dog) P_G'(L1b2_dog → Y_big dog_L)

Figure 6: Using linear interpolation to smooth L1_dog → Y_big dog_L. The first component represents the distribution fully conditioned on the head dog, while the second component represents the distribution ignoring the head conditioning event. The latter is accomplished by tying the rule L1b2_dog → Y_big dog_L to, for instance, L1b2_cat → Y_big cat_L, L1b2_fish → Y_big fish_L, etc.

Typically when smoothing we need to incorporate the prior knowledge that conditioning events that have been seen fewer times should be more strongly smoothed. We accomplish this by setting the Dirichlet hyperparameters for each N → N^b1, N → N^b2 decision to (K, 2K), where K = |R'_{N^b1}| is the number of rewrite rules of N^b1 (one for each argument A). This ensures that the model will only start to ignore the backoff distribution after having seen a sufficiently large number of training examples. (We set the other Dirichlet hyperparameters to 1.)

4.1 Smoothed Dependency Models

Our first experiments examine smoothing the distributions over an argument in the DMV and EVG models. In DMV we smooth the probability of argument A given head part-of-speech H and direction d with a distribution that ignores H. EVG, which conditions on H, d and the argument position v, permits two ways of backing off. The first is to ignore v and use the backoff conditioning events H, d; this yields a backoff distribution with the same conditioning information as the argument distribution of DMV. We call this EVG smoothed-skip-val. The second possibility is to have the backoff distribution ignore the head part-of-speech H and use the backoff conditioning events v, d; this assumes that arguments share a common distribution across heads. We call this EVG smoothed-skip-head. As we see below, backing off by ignoring the part-of-speech of the head H worked better than ignoring the argument position v.

For L-EVG we smooth the argument part-of-speech distribution (conditioned on the head word) with the unlexicalized EVG smoothed-skip-head model.
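To make the two configurations concrete, here is a small sketch (not the paper's code) that groups fully specified conditioning events into the backoff sets that will share a tied P2 distribution; the (direction, head POS, valence) tuple encoding is an assumption:

def backoff_key(event, variant):
    """event = (direction, head_pos, valence); the backoff drops one conditioning event."""
    d, H, v = event
    if variant == 'smoothed-skip-val':     # backoff ignores the argument position v
        return (d, H)
    if variant == 'smoothed-skip-head':    # backoff ignores the head POS H
        return (d, v)
    raise ValueError(variant)

def backoff_sets(events, variant):
    """Group conditioning events that must share one tied backoff distribution P2."""
    groups = {}
    for event in events:
        groups.setdefault(backoff_key(event, variant), []).append(event)
    return groups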
5 Initialization and Search issues

Klein and Manning (2004) strongly emphasize the importance of smart initialization in getting good performance from DMV. The likelihood function is full of local maxima, and different initial parameter values yield solutions of vastly different quality. They offer what they call a "harmonic initializer", which initializes the attachment probabilities to favor arguments that appear closer in the data; this starts EM in a state preferring shorter attachments. Since our goal is to expand the model to incorporate lexical information, we want an initialization scheme that does not depend on the details of DMV.

The method we use is to create M sets of B random initial settings and to run VB for a small number of iterations (40 in all our experiments) for each initial setting. For each of the M sets, the model with the best free energy of the B runs is then run out until convergence (as measured by the likelihood of a held-out data set); the other models are pruned away. In this paper we use B = 20 and M = 50. For the bth setting, we draw a random sample θ̄^(b) from the prior. We set the initial Q(t) = P(t | s, θ̄^(b)), which can be calculated using the Expectation-Maximization E-step. Q(θ) is then initialized using the standard VB M-step.

For the Lexicalized-EVG, we modify this procedure slightly: we first run M·B smoothed EVG models for 40 iterations each and select the best model in each cohort as before; each L-EVG distribution is then initialized from its corresponding EVG distribution. The new P(A | h, H, d, v) distributions are set initially to their corresponding P(A | H, d, v) values.
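The restart scheme can be sketched as follows. sample_prior, run_vb, free_energy and heldout_loglik are placeholders for the model-specific routines, not an existing API, and choosing the final model by held-out likelihood is an assumed detail rather than something stated above:

import random

def initialize_and_train(sample_prior, run_vb, free_energy, heldout_loglik,
                         M=50, B=20, short_iters=40, seed=0):
    rng = random.Random(seed)
    survivors = []
    for _ in range(M):
        # B random draws from the prior, each run for a few VB iterations
        candidates = [run_vb(sample_prior(rng), iterations=short_iters)
                      for _ in range(B)]
        survivors.append(min(candidates, key=free_energy))   # keep the best free energy
    # run each surviving model out until the held-out likelihood stops improving
    finished = [run_vb(model, until_converged=True) for model in survivors]
    return max(finished, key=heldout_loglik)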
6 Results

We trained on the standard Penn Treebank WSJ corpus (Marcus et al., 1993). Following Klein and Manning (2002), sentences longer than 10 words after removing punctuation are ignored; we refer to this variant as WSJ10. Following Cohen et al. (2008), we train on sections 2-21, use section 22 as a held-out development corpus, and present results evaluated on section 23. The models were all trained using Variational Bayes and initialized as described in Section 5. To evaluate, we follow Cohen et al. (2008) in using the mean θ̄ of the variational posterior Dirichlets as a point estimate. For the unsmoothed models we decode by selecting the Viterbi parse given θ̄, i.e. argmax_t P(t | s, θ̄). For the smoothed models we find the Viterbi parse of the unsmoothed CFG, but use the smoothed probabilities.

We evaluate against the gold-standard dependencies for section 23, which were extracted from the phrase-structure trees using the standard rules of Yamada and Matsumoto (2003). We measure the percent accuracy of the directed dependency edges. For the lexicalized model, we replaced all words that were seen fewer than 100 times with "UNK." We ran each of our systems 10 times and report the average directed accuracy achieved.

The results are shown in Table 1. We compare to work by Cohen et al. (2008) and Cohen and Smith (2009). Looking at Table 1, we can first of all see the benefit of randomized initialization over the harmonic initializer for DMV. We can also see a large gain from adding smoothing to DMV, topping even the logistic normal prior. The unsmoothed EVG actually performs worse than unsmoothed DMV, but both smoothed versions improve even on smoothed DMV. Adding lexical information (L-EVG) yields a moderate further improvement.

Table 1: Directed accuracy (DA) for WSJ10, section 23. * and † indicate results reported by Cohen et al. (2008) and Cohen and Smith (2009), respectively. Standard deviations over 10 runs are given in parentheses.

  Model   Variant                       Dir. Acc.
  DMV     harmonic init                 46.9*
  DMV     random init                   55.7 (8.0)
  DMV     log normal-families           59.4*
  DMV     shared log normal-families    62.4†
  DMV     smoothed                      61.2 (1.2)
  EVG     random init                   53.3 (7.1)
  EVG     smoothed-skip-val             62.1 (1.9)
  EVG     smoothed-skip-head            65.0 (5.7)
  L-EVG   smoothed                      68.8 (4.5)

As the greatest improvement comes from moving to EVG smoothed-skip-head, we show in Table 2 the most probable arguments for each (val, dir), using the mean of the appropriate variational Dirichlet. For d = right, v = 1, P(A | v, d) largely seems to act as a way of grouping together various verb types, while for d = left, v = 0 the model finds that nouns tend to act as the closest left argument.

Table 2: Most likely arguments given valence and direction, according to the smoothing distribution P(arg | dir, val) in the EVG smoothed-skip-head model with the lowest free energy.

  Dir, Val   Arg   Prob      Dir, Val    Arg   Prob
  left, 0    NN    0.65      right, 0    NN    0.26
             NNP   0.18                  RB    0.23
             DT    0.12                  NNS   0.12
                                         IN    0.11
  left, 1    CC    0.35      right, 1    IN    0.78
             RB    0.27
             IN    0.18

7 Conclusion

We present a smoothing technique for unsupervised PCFG estimation which allows us to explore more sophisticated dependency grammars. Our method combines linear interpolation with a Bayesian prior that ensures the backoff distribution receives probability mass. Estimating the smoothed model requires running the standard Variational Bayes on an extended PCFG. We used this technique to estimate a series of dependency grammars which extend DMV with additional valence and lexical information. We found that both were helpful in learning English dependency grammars. Our L-EVG model gives the best reported accuracy to date on the WSJ10 corpus.

Future work includes using lexical information more deeply in the model by conditioning argument words and valence on the lexical head. We suspect that successfully doing so will require much larger datasets. We would also like to explore using our smoothing technique in other models such as HMMs; for instance, we could perform unsupervised HMM part-of-speech induction by smoothing a tritag model with a bitag model. Finally, we would like to learn the parts-of-speech in our dependency model from text rather than relying on the gold-standard tags.

Acknowledgements

This research is based upon work supported by National Science Foundation grants 0544127 and 0631667 and DARPA GALE contract HR0011-06-2-0001. We thank members of BLLIP for their feedback.

References

Shay B. Cohen and Noah A. Smith. 2009. Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction. In Proceedings of NAACL-HLT 2009.
Shay B. Cohen, Kevin Gimpel, and Noah A. Smith. 2008. Logistic normal priors for unsupervised probabilistic grammar induction. In Advances in Neural Information Processing Systems 21.
Michael Collins. 1999. Head-driven Statistical Models for Natural Language Parsing. Ph.D. thesis, The University of Pennsylvania.
Jason Eisner and John Blatz. 2007. Program transformations for optimization of parsing algorithms and other weighted logic programs. In Proceedings of the 11th Conference on Formal Grammar.
Jason Eisner and Giorgio Satta. 1999. Efficient parsing for bilexical context-free grammars and head-automaton grammars. In Proceedings of ACL 1999.
Frederick Jelinek. 1997. Statistical Methods for Speech Recognition. The MIT Press, Cambridge, Massachusetts.
Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Proceedings of NAACL 2007.
Mark Johnson. 2007. Transforming projective bilexical dependency grammars into efficiently-parsable CFGs with unfold-fold. In Proceedings of ACL 2007.
Dan Klein and Christopher Manning. 2002. A generative constituent-context model for improved grammar induction. In Proceedings of ACL 2002.
Dan Klein and Christopher Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of ACL 2004, July.
Kenichi Kurihara and Taisuke Sato. 2004. An application of the variational Bayesian approach to probabilistic context-free grammars. In IJCNLP 2004 Workshop Beyond Shallow Analyses.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank.
Computational Linguistics, 19(2):313­330. David McClosky. 2008. Modeling valence effects in unsupervised grammar induction. Technical Report CS09-01, Brown University, Providence, RI, USA. Noah A. Smith and Jason Eisner. 2005. Guiding unsupervised grammar induction using contrastive estimation. In International Joint Conference on Artificial Intelligence Workshop on Grammatical Inference Applications. Noah A. Smith and Jason Eisner. 2006. Annealing structural bias in multilingual weighted grammar induction. In Proceedings of COLING-ACL 2006. Noah A. Smith. 2006. Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text. Ph.D. thesis, Department of Computer Science, Johns Hopkins University. Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In In Proceedings of the International Workshop on Parsing Technologies. 109 Context-Dependent Alignment Models for Statistical Machine Translation Jamie Brunning, Adri` de Gispert and William Byrne a Machine Intelligence Laboratory Department of Engineering, Cambridge University Trumpington Street, Cambridge, CB2 1PZ, U.K. {jjjb2,ad465,wjb31}@eng.cam.ac.uk Abstract We introduce alignment models for Machine Translation that take into account the context of a source word when determining its translation. Since the use of these contexts alone causes data sparsity problems, we develop a decision tree algorithm for clustering the contexts based on optimisation of the EM auxiliary function. We show that our contextdependent models lead to an improvement in alignment quality, and an increase in translation quality when the alignments are used in Arabic-English and Chinese-English translation. 1 Introduction Alignment modelling for Statistical Machine Translation (SMT) is the task of determining translational correspondences between the words in pairs of sentences in parallel J text. Given a source language word sequence f1 and a I target language word sequence e1 , we model the translaJ tion probability as P(eI |f1 ) and introduce a hidden vari1 I able a1 representing a mapping from the target word positions to source word positions such that ei is aligned to j j fai . Then P(eI |f1 ) = aI P(eI , aI |f1 ) (Brown et al., 1 1 1 1 1993). Previous work on statistical alignment modelling has not taken into account the source word context when determining translations of that word. It is intuitive that a word in one context, with a particular part-of-speech and particular words surrounding it, may translate differently when in a different context. We aim to take advantage of this information to provide a better estimate of the word's translation. The challenge of incorporating context information is maintaining computational tractability of estimation and alignment, and we develop algorithms to overcome this. The development of efficient estimation procedures for context-dependent acoustic models revolutionised the field of Automatic Speech Recognition (ASR) (Young et al., 1994). Clustering is used extensively for improving parameter estimation of triphone (and higher order) acoustic models, enabling robust estimation of parameters and reducing the computation required for recognition. Kannan et al. (1994) introduce a binary treegrowing procedure for clustering Gaussian models for triphone contexts based on the value of a likelihood ratio. We adopt a similar approach to estimate contextdependent translation probabilities. We focus on alignment with IBM Model 1 and HMMs. 
HMMs are commonly used to generate alignments from which state of the art SMT systems are built. Model 1 is used as an intermediate step in the creation of more powerful alignment models, such as HMMs and further IBM models. In addition, it is used in SMT as a feature in Minimum Error Training (Och et al., 2004) and for rescoring lattices of translation hypotheses (Blackwood et al., 2008). It is also used for lexically-weighted phrase extraction (Costa-juss` and Fonollosa, 2005) and sentence a segmentation of parallel text (Deng et al., 2007) prior to machine translation. 1.1 Overview We first develop an extension to Model 1 that allows the use of arbitrary context information about a source word to estimate context-dependent word-to-word translation probabilities. Since there is insufficient training data to accurately estimate translation probabilities for less frequently occurring contexts, we develop a decision tree clustering algorithm to form context classes. We go on to develop a context-dependent HMM model for alignment. In Section 3, we evaluate our context-dependent models on Arabic-English parallel text, comparing them to our baseline context-independent models. We perform morphological decomposition of the Arabic text using MADA, and use part-of-speech taggers on both languages. Alignment quality is measured using Alignment Error Rate (AER) measured against a manually-aligned parallel text. Section 4 uses alignments produced by Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 110­118, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics 110 our improved alignment models to initialise a statistical machine translation system and evaluate the quality of translation on several data sets. We also apply part-ofspeech tagging and decision tree clustering of contexts to Chinese-English parallel text; translation results for these languages are presented in Section 4.2. 1.2 Previous and related work Brown et al. (1993) introduce IBM Models 1-5 for alignment modelling; Vogel et al. (1996) propose a Hidden Markov Model (HMM) model for word-to-word alignment, where the words of the source sentence are viewed as states of an HMM and emit target sentence words; Deng and Byrne (2005a) extend this to an HMM word-tophrase model which allows many-to-one alignments and can capture dependencies within target phrases. Habash and Sadat (2006) perform morphological decomposition of Arabic words, such as splitting of prefixes and suffixes. This leads to gains in machine translation quality when systems are trained on parallel text containing the modified Arabic and processing of Arabic text is carried out prior to translation. Nießen and Ney (2001a) perform pre-processing of German and English text before translation; Nießen and Ney (2001b) use morphological information of the current word to estimate hierarchical translation probabilities. Berger et al. (1996) introduce maximum entropy models for machine translation, and use a window either side of the target word as context information. Varea et al. (2002) test for the presence of specific words within a window of the current source word to form features for use inside a maximum entropy model of alignment. Toutanova et al. (2002) use part-of-speech information in both the source and target languages to estimate alignment probabilities, but this information is not incorporated into translation probabilities. 
Popović and Ney (2004) use the base form of a word and its part-of-speech tag during the estimation of word-to-word translation probabilities for IBM models and HMMs, but do not define context-dependent estimates of translation probabilities. Stroppa et al. (2007) consider context-informed features of phrases as components of the log-linear model during phrase-based translation, but do not address alignment.

2 Use of source language context in alignment modelling

Consider the alignment of the target sentence e = e_1^I with the source sentence f = f_1^J. Let a = a_1^I be the alignments of the target words to the source words, and let c_j be the context information of f_j for j = 1, ..., J. This context information can be any information about the word, e.g. part-of-speech, previous and next words, part-of-speech of previous and next words, or longer-range context information. We follow Brown et al. (1993), but extend their modelling framework to include information about the source word from which a target word is emitted. We model the alignment process as:

  P(e_1^I, a_1^I, I | f_1^J, c_1^J) = P(I | f_1^J, c_1^J) ∏_{i=1}^{I} P(e_i | a_i, e_1^{i-1}, f_1^J, c_1^J, I) × P(a_i | e_1^{i-1}, a_1^{i-1}, f_1^J, c_1^J, I)   (1)

We introduce word-to-word translation tables that depend on the source language context of each word, i.e. the probability that f translates to e given that f has context c is t(e | f, c). We assume that the context sequence is given for a source word sequence. This assumption can be relaxed to allow for multiple tag sequences as hidden processes, but we assume here that a tagger generates a single context sequence c_1^J for a word sequence f_1^J. This corresponds to the assumption that, for a context sequence c_1^J, P(c̃_1^J | f_1^J) = δ_{c_1^J}(c̃_1^J); hence

  P(e_1^I, a_1^I | f_1^J) = Σ_{c̃_1^J} P(e_1^I, a_1^I, c̃_1^J | f_1^J) = P(e_1^I, a_1^I | c_1^J, f_1^J)

For Model 1, ignoring the sentence length distribution,

  P_M1(e_1^I, a_1^I | f_1^J, c_1^J) = 1/(J+1)^I ∏_{i=1}^{I} t(e_i | f_{a_i}, c_{a_i}).   (2)

Estimating translation probabilities separately for every possible context of a source word leads to problems with data sparsity and rapid growth of the translation table. We therefore wish to cluster source contexts which lead to similar probability distributions. Let C_f denote the set of all observed contexts of source word f. A particular clustering is denoted K_f = {K_{f,1}, ..., K_{f,N_f}}, where K_f is a partition of C_f. We define a class membership function µ_f such that for any context c, µ_f(c) is the cluster containing c. We assume that all contexts in a cluster give rise to the same translation probability distribution for that source word, i.e. for a cluster K, t(e | f, c) = t(e | f, c') for all contexts c, c' ∈ K and all target words e; we write this shared translation probability as t(e | f, K). The Model 1 sentence translation probability for a given alignment (Equation 2) becomes

  P_M1(e_1^I, a_1^I | f_1^J, c_1^J) = 1/(J+1)^I ∏_{i=1}^{I} t(e_i | f_{a_i}, µ_f(c_{a_i})).   (3)

For HMM alignment, we assume that the transition probabilities a(a_i | a_{i-1}) are independent of the word contexts, and the sentence translation probability is

  P_H(e_1^I, a_1^I | f_1^J, c_1^J) = ∏_{i=1}^{I} a(a_i | a_{i-1}, J) t(e_i | f_{a_i}, µ_f(c_{a_i})).   (4)

Section 2.1.1 describes how the context classes are determined by optimisation of the EM auxiliary function.
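A small sketch of the clustered Model 1 score, marginalizing the hidden alignment of each target word as in standard Model 1; the dictionary encodings of t and of the cluster membership function µ are assumptions, and source position 0 is taken to be the NULL word:

import math

def model1_log_prob(target, source, contexts, t, mu):
    """target: list of target words; source: list of source words with NULL at position 0;
    contexts: one context tag per source position (including a dummy for NULL);
    t: dict (e, f, cluster) -> probability; mu: dict f -> {context -> cluster id}."""
    logp = -len(target) * math.log(len(source))       # the 1/(J+1)^I factor
    for e in target:
        # sum over all source positions the target word could be emitted from
        total = sum(t.get((e, f, mu[f][c]), 1e-12)
                    for f, c in zip(source, contexts))
        logp += math.log(total)
    return logp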
Although the translation model is significantly more complex than that of the context-independent models, once class membership is fixed, alignment and parameter estimation use the standard algorithms.

2.1 EM parameter estimation

We train using Expectation Maximisation (EM), optimising the log probability of the training set {e^(s), f^(s)}, s = 1, ..., S (Brown et al., 1993). Given model parameters θ, we estimate new parameters θ̂ by maximisation of the EM auxiliary function

  Σ_s Σ_a P_θ(a | f^(s), c^(s), e^(s)) log P_θ̂(e^(s), a, I^(s) | f^(s), c^(s)).

We assume the sentence length distribution and the alignment probabilities do not depend on the contexts of the source words; hence the relevant part of the auxiliary function is

  Σ_f Σ_{c ∈ C_f} Σ_e γ(e | f, c) log t(e | f, c),   (5)

where

  γ(e | f, c) = Σ_s Σ_{i=1}^{I^(s)} Σ_{j=1}^{J^(s)} δ_c(c_j^(s)) δ_e(e_i^(s)) δ_f(f_j^(s)) × P_θ(a_i = j | e^(s), f^(s), c^(s)).

Here γ can be computed under Model 1 or the HMM, and is calculated using the forward-backward algorithm for the HMM.

2.1.1 Parameter estimation with clustered contexts

We can rewrite the EM auxiliary function (Equation 5) in terms of the cluster-specific translation probabilities:

  Σ_f Σ_{l=1}^{|K_f|} Σ_e Σ_{c ∈ K_{f,l}} γ(e | f, c) log t(e | f, K_{f,l}) = Σ_f Σ_{K ∈ K_f} Σ_e γ(e | f, K) log t(e | f, K)   (6)

where γ(e | f, K) = Σ_{c ∈ K} γ(e | f, c). Following the usual derivation, the EM update for the class-specific translation probabilities becomes

  t̂(e | f, K) = γ(e | f, K) / Σ_{e'} γ(e' | f, K).   (7)

Standard EM training can be viewed as a special case of this, with every context of a source word grouped into a single cluster. Another way to view these clustered context-dependent models is that contexts belonging to the same cluster are tied and share a common translation probability distribution, which is estimated from all training examples in which any of the contexts occur.

2.2 Decision trees for context clustering

The objective for each source word is to split the contexts into classes so as to maximise the likelihood of the training data. Since it is not feasible to maximise the likelihood of the observations directly, we maximise the expected log likelihood by considering the EM auxiliary function, in a similar manner to that used for modelling contextual variations of phones in ASR (Young et al., 1994; Singer and Ostendorf, 1996). We perform divisive clustering independently for each source word f, by building a binary decision tree which forms classes of contexts that maximise the EM auxiliary function. Questions for the tree are drawn from a set of questions Q = {q_1, ..., q_|Q|} concerning the context information of f.

Let K be any set of contexts of f, and define

  L(K) = Σ_e Σ_{c ∈ K} γ(e | f, c) log t(e | f, c),

where, for c ∈ K, t(e | f, c) is the tied estimate t̂(e | f, K) of Equation (7). This is the contribution to the EM auxiliary function of source word f occurring in the contexts of K. Let q be a binary question about the context of f, and consider the effect on the partial auxiliary function (Equation 6) of splitting K into two clusters using question q. Define K_q to be the set of contexts in K which answer 'yes' to q and K_q̄ the contexts which answer 'no'. Define the objective function

  Q_{f,q}(K) = Σ_e Σ_{c ∈ K_q} γ(e | f, c) log t(e | f, c) + Σ_e Σ_{c ∈ K_q̄} γ(e | f, c) log t(e | f, c) = L(K_q) + L(K_q̄)

When the node is split using question q, the increase in the objective function is given by

  Q_{f,q}(K) − L(K) = L(K_q) + L(K_q̄) − L(K).

We choose q to maximise this. In order to build the decision tree for f, we take the set of all contexts C_f as the initial cluster at the root node. We then find the question q̂ such that Q_{f,q}(C_f) is maximal, i.e.

  q̂ = argmax_{q ∈ Q} Q_{f,q}(C_f)

This splits C_f, so our decision tree now has two nodes.
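The split gain can be sketched as follows, with γ stored as a dictionary mapping (target word, context) to its expected count for the source word under consideration (an assumed encoding, not the paper's data structures):

from collections import defaultdict
import math

def L(contexts, gamma):
    """Auxiliary-function contribution of a context set under its tied estimate (Eq. 7)."""
    counts = defaultdict(float)
    for (e, c), g in gamma.items():
        if c in contexts:
            counts[e] += g
    total = sum(counts.values())
    if total == 0.0:
        return 0.0
    return sum(g * math.log(g / total) for g in counts.values() if g > 0.0)

def split_gain(contexts, question, gamma):
    yes = {c for c in contexts if question(c)}
    no = contexts - yes
    return L(yes, gamma) + L(no, gamma) - L(contexts, gamma)

def best_question(contexts, questions, gamma):
    return max(questions, key=lambda q: split_gain(contexts, q, gamma))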
We iterate this process, at each iteration splitting (into two further nodes) the leaf node that leads to the greatest increase in the objective function. This leads to a greedy search to optimise the log likelihood over possible state clusterings. In order to control the growth of the tree, we put in place two thresholds:

· T_imp is the minimum improvement in the objective function required for a node to be split; without it, we would continue splitting nodes until each contained only one context, even though doing so would cause data sparsity problems.

· T_occ is the minimum occupancy of a node, based on how often the contexts at that node occur in the training data; we want to ensure that there are enough examples of a context in the training data to estimate accurately the translation probability distribution for that cluster.

For each leaf node l and the set of contexts K_l at that node, we find the question q_l that, when used to split K_l, produces the largest gain in objective function:

  q_l = argmax_{q ∈ Q} [L(K_{l,q}) + L(K_{l,q̄}) − L(K_l)] = argmax_{q ∈ Q} [L(K_{l,q}) + L(K_{l,q̄})]

We then find the leaf node for which splitting gives the largest improvement:

  l̂ = argmax_l [L(K_{l,q_l}) + L(K_{l,q̄_l}) − L(K_l)]

If the following criteria are both satisfied at that node, we split it into two parts, creating two leaf nodes in its place:

· The objective function increases sufficiently: L(K_{l̂,q_l̂}) + L(K_{l̂,q̄_l̂}) − L(K_l̂) > T_imp.

· The occupancy threshold is exceeded for both child nodes: Σ_e Σ_{c ∈ K_{l̂,x}} γ(e | f, c) > T_occ for x = q_l̂, q̄_l̂.

We perform such clustering for every source word in the parallel text.
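A compressed sketch of this greedy loop, reusing L from the previous sketch; unlike the two-stage description above, it checks the occupancy threshold while searching for the best split, which is a simplification:

def grow_tree(all_contexts, questions, gamma, T_imp, T_occ):
    def occupancy(contexts):
        return sum(g for (e, c), g in gamma.items() if c in contexts)

    leaves = [set(all_contexts)]
    while True:
        best = None
        for i, leaf in enumerate(leaves):
            for q in questions:
                yes = {c for c in leaf if q(c)}
                no = leaf - yes
                gain = L(yes, gamma) + L(no, gamma) - L(leaf, gamma)
                if (best is None or gain > best[0]) and \
                   occupancy(yes) > T_occ and occupancy(no) > T_occ:
                    best = (gain, i, yes, no)
        if best is None or best[0] <= T_imp:
            return leaves                 # the surviving leaves are the context classes
        _, i, yes, no = best
        leaves[i:i + 1] = [yes, no]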
3 Evaluation of alignment quality

Our models were built using the MTTK toolkit (Deng and Byrne, 2005b). Decision tree clustering was implemented and the process parallelised to enable thousands of decision trees to be built. Our context-dependent (CD) Model 1 models, trained on context-annotated data, were compared to baseline context-independent (CI) models trained on untagged data. The models were trained using the data allowed for the NIST 08 Arabic-English evaluation (http://nist.gov/speech/tests/mt/2008), excluding the UN collections, comprising 300k parallel sentence pairs: a total of 8.4M words of Arabic and 9.5M words of English.

The Arabic language incorporates into its words several prefixes and suffixes which determine grammatical features such as gender, number, person and voice. The MADA toolkit (Habash and Sadat, 2006) was used to perform Arabic morphological word decomposition and part-of-speech tagging. It determines the best analysis for each word in a sentence and splits word prefixes and suffixes, based on the alternative analyses provided by BAMA (Buckwalter, 2002). We use tokenisation scheme 'D2', which splits certain prefixes and has been reported to improve machine translation performance (Habash and Sadat, 2006). The alignment models are trained on this processed data, and the prefixes and suffixes are treated as words in their own right; in particular, their contexts are examined and clustered. The TnT tagger (Brants, 2000), used as distributed with its model trained on the Wall Street Journal portion of the Penn Treebank, was used to obtain part-of-speech tags for the English side of the parallel text. Marcus et al. (1993) give a complete list of the part-of-speech tags produced. No morphological analysis is performed for English.

Automatic word alignments were compared to a manually-aligned corpus made up of the IBM Arabic-English Word Alignment Corpus (Ittycheriah et al., 2006) and the word alignment corpora LDC2006E86 and LDC2006E93. This contains 28k parallel text sentence pairs: 724k words of Arabic and 847k words of English. The alignment links were modified to reflect the MADA tokenisation; after modification, there are 946k word-to-word alignment links. Alignment quality was evaluated by computing Alignment Error Rate (AER) (Och and Ney, 2000) relative to the manual alignments. Since the links supplied contain only 'sure' links and no 'possible' links, we use the following formula for computing AER given reference alignment links S and hypothesised alignment links A:

  AER = 1 − 2|S ∩ A| / (|S| + |A|).

3.1 Questions about contexts

The algorithm presented in Section 2 allows for any information about the context of the source word to be considered. We could consider general questions of the form 'Is the previous word x?' and 'Does word y occur within n words of this one?'. To maintain computational tractability, we restrict the questions to those concerning the part-of-speech tag assigned to the current, previous and next words. We do not ask questions about the identities of the words themselves. For each part-of-speech tag T, we ask the question 'Does w have tag T?'. In addition, we group part-of-speech tags to ask more general questions: e.g. the set of contexts which satisfies 'Is w a noun?' contains those that satisfy 'Is w a proper noun?' and 'Is w a singular or mass noun?'. We also ask the same questions of the previous and next words in the source sentence. In English, this gives a total of 152 distinct questions, each of which is considered when splitting a leaf node. The MADA part-of-speech tagger uses a reduced tag set, which produces a total of 68 distinct questions.

Figure 1 shows the links of the English source word selling in two different contexts, where it links to different words in Arabic, both forms of the same verb.

Figure 1: Alignment of the English selling in different contexts. In the first, it is preceded by of and links to the infinitive of the Arabic verb byE; in the second, it is preceded by were and links to an inflected form of the same Arabic verb, ybyEwn.
The part-of-speech of the previous word is useful for dis- -4.04e+07 -4.06e+07 -4.08e+07 Log probability of training data -4.1e+07 -4.12e+07 -4.14e+07 -4.16e+07 -4.18e+07 -4.2e+07 -4.22e+07 CI Model 1 Threshold 10 Threshold 20 Threshold 60 11 12 13 14 15 16 Iteration 17 18 19 20 -2.75e+06 -2.8e+06 Log probability of training data -2.85e+06 CI Model 1 Threshold 10 Threshold 20 Threshold 60 -2.9e+06 -2.95e+06 -3e+06 -3.05e+06 -3.1e+06 11 12 13 14 15 Iteration 16 17 18 19 20 Figure 2: Increase in log probability of training data during training for varying Timp , with Model 1, for Arabic to English (top) and English to Arabic (bottom) criminating between the two cases, whereas a contextindependent model would assign the same probability to both Arabic words. 3.2 Training Model 1 Training is carried out in both translation directions. For Arabic to English, the Arabic side of the parallel text is tagged and the English side remains untagged; we view the English words as being generated from the Arabic words and questions are asked about the context of the Arabic words to determine clusters for the translation table. For English to Arabic, the situation is reversed: we used tagged English text as the source language and untagged Arabic text, with morphological decomposition, as the target language. Standard CI Model 1 training, initialised with a uniform translation table so that t(e|f ) is constant for all source/target word pairs (f, e), was run on untagged data for 10 iterations in each direction (Brown et al., 1993; Deng and Byrne, 2005b). A decision tree was built to cluster the contexts and a further 10 iterations of training were carried out using the tagged words-with-context to produce context-dependent models (CD Model 1). The 114 English question Is Next Preposition Is Prev Determiner Is Prev Preposition Is Prev Adjective Is Next Noun Singular Mass Is Prev Noun Singular Mass Is Next Noun Plural Is Next Noun Arabic question Is Prev Preposition Is Next Preposition Is Prev Noun Is Next Noun Is Prev Coordinating Conjunction Is Prev Noun SingularMass Is Next Punctuation Is Next Adjective Adverb Frequency 1523 1444 1209 864 772 690 597 549 Frequency 1110 993 981 912 627 607 603 559 Timp 10 20 40 100 Arabic-English (%) 30601 (25.33) 11193 (9.27) 1874 (1.55) 307 (0.25) English-Arabic (%) 26011 (39.87) 18365 (28.15) 9104 (13.96) 1128 (1.73) Table 2: Words [number (percentage)] with context-dependent translation for varying Timp 3.2.3 Variation of improvement threshold Timp There is a trade-off between modelling the data accurately, which requires more clusters, and eliminating data sparsity problems, which requires each cluster to contain contexts that occur frequently enough in the training data to estimate the translation probabilities accurately. Use of a smaller threshold Timp leads to more clusters per word and an improved ability to fit to the data, but this can lead to reduced alignment quality if there is insufficient data to estimate the translation probability distribution accurately for each cluster. For lower thresholds, we observe over-fitting and the AER rises after the second iteration of CD training, similar to the behaviour seen in Och (2002). Setting Timp = 0 results in each context of a word having its own cluster, which leads to data sparsity problems. Table 2 shows the percentage of words for which the contexts are split into multiple clusters for CD Model 1 with varying improvement thresholds. 
This occurs when there are enough training data examples and sufficient variability between the contexts of a word that splitting the contexts into more than one cluster increases the EM auxiliary function. For words where the contexts are not split, all the contexts remain in the same cluster and parameter estimation is exactly the same as for the unclustered context-independent models. 3.3 Training HMMs Adding source word context to translation has so far led to improvements in AER for Model 1, but the performance does not match that of HMMs trained on untagged data; we therefore train HMMs on tagged data. We proceed with Model 1 and Model 2 trained in the usual way, and context-independent (CI) HMMs were trained for 5 iterations on the untagged data. Statistics were then gathered for clustering at various thresholds, after which 5 further EM iterations were performed with tagged data to produce context-dependent (CD) HMMs. The HMMs were trained in both the Arabic to English and the English to Arabic directions. The log likelihood of the training set varies with Timp in much the same way as for Model 1, increasing at each iteration, with greater likelihood at lower thresholds. Figure 4 shows how the AER of the union alignment varies with Timp during training. As with Model 1, the clustered HMM Table 1: Most frequent root node context questions models were then evaluated using AER at each training iteration. A number of improvement thresholds Timp were tested, and performance compared to that of models found after further iterations of CI Model 1 training on the untagged data. In both alignment directions, the log probability of the training data increases during training (see Figure 2). As expected, the training set likelihood increases as the threshold Timp is reduced, allowing more clusters and closer fitting to the data. 3.2.1 Analysis of frequently used questions Table 1 shows the questions used most frequently at the root node of the decision tree when clustering contexts in English and Arabic. Because they are used first, these are the questions that individually give the greatest ability to discriminate between the different contexts of a word. The list shows the importance of the left and right contexts of the word in predicting its translation: of the most common 50 questions, 25 concern the previous word, 19 concern the next, and only 6 concern the partof-speech of the current word. For Arabic, of the most frequent 50 questions, 21 concern the previous word, 20 concern the next and 9 the current word. 3.2.2 Alignment Error Rate Since MT systems are usually built on the union of the two sets of alignments (Koehn et al., 2003), we consider the union of alignments in the two directions as well as those in each direction. Figure 3 shows the change in AER of the alignments in each direction, as well as the alignment formed by taking their union at corresponding thresholds and training iterations. 115 50.8 50.6 80 CI Model 1 Threshold 10 Threshold 20 Threshold 60 p0=0.95 50.4 50.2 AER 78 p0=0.95 50 Precision 76 49.8 49.6 74 49.4 English-Arabic CD HMM 72 English-Arabic CI HMM Arabic-English CD HMM Arabic-English CI HMM 40 45 50 Recall p0=0.00 49.2 10 11 12 13 14 15 Iteration 16 17 18 19 20 p0=0.00 55 60 51.2 51.0 CI Model 1 Threshold 10 Threshold 20 Threshold 60 Threshold 100 50.8 50.6 AER Figure 5: Precision/recall curves for the context-dependent HMM and the baseline context-independent HMM, for Arabic to English and English to Arabic. p0 varies from 0.00 to 0.95 in steps of 0.05. 
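A short sketch of this evaluation, with alignments represented as sets of (source position, target position) pairs; the representation is an assumption:

def aer(reference, hypothesis):
    """AER = 1 - 2|S ∩ A| / (|S| + |A|) for sure-only reference links."""
    s, a = set(reference), set(hypothesis)
    return 1.0 - 2.0 * len(s & a) / (len(s) + len(a))

def union_aer(reference, src_to_trg, trg_to_src):
    """Score the union of the two directional alignments; the target-to-source
    links are flipped so both directions share one coordinate order."""
    flipped = {(j, i) for (i, j) in trg_to_src}
    return aer(reference, set(src_to_trg) | flipped)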
50.4 50.2 50.0 49.8 models produce alignments with a lower AER than the baseline model, and there is evidence of over-fitting to the training data. 10 11 12 13 14 15 Iteration 16 17 18 19 20 49.6 3.3.1 Alignment precision and recall The HMM models include a null transition probability, p0 , which can be modified to adjust the number of alignments to the null token (Deng and Byrne, 2005a). Where a target word is emitted from null, it is not included in the alignment links, so this target word is viewed as not being aligned to any source word; this affects the precision and recall. The results reported above use p0 = 0.2 for English-Arabic and p0 = 0.4 for Arabic-English; we can tune these values to produce alignments with the lowest AER. Figure 5 shows precision-recall curves for the CD HMMs compared to the CI HMMs for both translation directions. For a given value of precision, the CD HMM has higher recall; for a given value of recall, the CD HMM has higher precision. We do not report F-score (Fraser and Marcu, 2006) since in our experiments we have not found strong correlation with translation performance, but we note that these results for precision and recall should lead to improved F-scores as well. 51.0 50.8 50.6 50.4 50.2 AER 50.0 49.8 49.6 49.4 49.2 49.0 CI Model 1 Threshold 10 Threshold 20 Threshold 60 10 11 12 13 14 15 Iteration 16 17 18 19 20 Figure 3: Variation of AER during Model 1 training for varying Timp , for Arabic to English (top), English to Arabic (middle) and their union (bottom) 35.3 35.2 35.1 35.0 34.9 AER 34.8 34.7 34.6 34.5 34.4 CI HMM Threshold 10 Threshold 20 Threshold 60 4 Evaluation of translation quality We have shown that the context-dependent models produce a decrease in AER measured on manually-aligned data; we wish to show this improved model performance leads to an increase in translation quality, measured by BLEU score (Papineni et al., 2001). In addition to the Arabic systems already evaluated by AER, we also report results for a Chinese-English translation system. Alignment models were evaluated by aligning the training data using the models in each translation direc- 5 6 7 Iteration 8 9 10 Figure 4: AER of the union alignment for varying Timp with the HMM model 116 tion. HiFST, a WFST-based hierarchical translation system described in (Iglesias et al., 2009), was trained on the union of these alignments. MET (Och, 2003) was carried out using a development set, and the BLEU score evaluated on two test sets. Decoding used a 4-gram language model estimated from the English side of the entire MT08 parallel text, and a 965M word subset of monolingual data from the English Gigaword Third Edition. For both Arabic and English, the CD HMM models were evaluated as follows. Iteration 5 of the CI HMM was used to produce alignments for the parallel text training data: these were used to train the baseline system. The same data is aligned using CD HMMs after two further iterations of training and a second WFST-based translation system built from these alignments. The models are evaluated by comparing BLEU scores with those of the baseline model. 4.1 Arabic to English translation Alignment models were trained on the NIST MT08 Arabic-English parallel text, excluding the UN portion. The null alignment probability was chosen based on the AER, resulting in values of p0 = 0.05 for Arabic to English and p0 = 0.10 for English to Arabic. We perform experiments on the NIST Arabic-English translation task. 
The mt02 05 tune and mt02 05 test data sets are formed from the odd and even numbered sentences of the NIST MT02 to MT05 evaluation sets respectively; each contains 2k sentences and 60k words. We use mt02 05 tune as a development set and evaluate the system on mt02 05 test and the newswire portion of the MT08 set, MT08-nw. Table 3 shows a comparison of the system trained using CD HMMs with the baseline system, which was trained using CI HMM models on untagged data. The context-dependent models result in a gain in BLEU score of 0.3 for mt02 05 test and 0.6 for MT08-nw. 4.2 Chinese to English translation The Chinese training set was 600k random parallel text sentences of the newswire LDC collection allowed for NIST MT08, a total of 15.2M words of Chinese and 16.6M words of English. The Chinese text was tagged using the MXPOST maximum-entropy part of speech tagging tool (Ratnaparkhi, 1996) trained on the Penn Chinese Treebank 5.1; the English text was tagged using the TnT part of speech tagger (Brants, 2000) trained on the Wall Street Journal portion of the English Penn treebank. The development set tune-nw and validation set test-nw contain a mix of the newswire portions of MT02 through MT05 and additional developments sets created by translation within the GALE program. We also report results on the newswire portion of the MT08 set. Again we see an increase in BLEU score for both test sets: 0.5 for test- Alignments CI HMM CD HMM Alignments CI HMM CD HMM Arabic-English tune mt02 05 test 50.0 49.4 50.0 49.7 Chinese-English tune test-nw 28.1 28.5 28.5 29.0 MT08-nw 46.3 46.9 MT08-nw 26.9 27.7 Table 3: Comparison, using BLEU score, of the CD HMM with the baseline CI HMM nw and 0.8 for MT08-nw. 5 Conclusions and future work We have introduced context-dependent Model 1 and HMM alignment models, which use context information in the source language to improve estimates of wordto-word translation probabilities. Estimation of parameters using these contexts without smoothing leads to data sparsity problems; therefore we have developed decision tree clustering algorithms to cluster source word contexts based on optimisation of the EM auxiliary function. Context information is incorporated by the use of part-ofspeech tags in both languages of the parallel text, and the EM algorithm is used for parameter estimation. We have shown that these improvements to the model lead to decreased AER compared to context-independent models. Finally, we compare machine translation systems built using our context-dependent alignments. For both Arabic- and Chinese-to-English translation, we report an increase in translation quality measured by BLEU score compared to a system built using contextindependent alignments. This paper describes an initial investigation into context-sensitive alignment models, and there are many possible directions for future research. Clustering the probability distributions of infrequently occurring may produce improvements in alignment quality, different model training schemes and extensions of the contextdependence to more sophisticated alignment models will be investigated. Further translation experiments will be carried out. Acknowledgements This work was supported in part by the GALE program of the Defense Advanced Research Projects Agency, Contract No. HR0011-06-C-0022. J. Brunning is supported by a Schiff Foundation graduate studentship. Thanks to Yanjun Ma, Dublin City University, for training the Chinese part of speech tagger. 117 References A. L. Berger, S. Della Pietra, and V. J. 
Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39­71. Graeme Blackwood, Adri` de Gispert, Jamie Brunning, and a William Byrne. 2008. European language translation with weighted finite state transducers: The CUED MT system for the 2008 ACL workshop on SMT. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 131­134, Columbus, Ohio, June. Association for Computational Linguistics. Thorsten Brants. 2000. TnT ­ a statistical part-of-speech tagger. In Proceedings of the 6th Applied Natural Language Processing Conference: ANLP-2000, Seattle, USA. Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263­311. T. Buckwalter. 2002. Buckwalter Arabic morphological analyzer. Marta Ruiz Costa-juss` and Jos´e A. R. Fonollosa. 2005. a Improving phrase-based statistical translation by modifying phrase extraction and including several features. In Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 149­154, June. Yonggang Deng and William Byrne. 2005a. HMM word and phrase alignment for statistical machine translation. In Proc. of HLT-EMNLP. Yonggang Deng and William Byrne. 2005b. JHU-Cambridge statistical machine translation toolkit (MTTK) user manual. Yonggang Deng, Shankhar Kumar, and William Byrne. 2007. Segmentation and alignment of parallel text for statistical machine translation. Journal of Natural Language Engineering, 13:3:235­260. Alexander Fraser and Daniel Marcu. 2006. Measuring word alignment quality for statistical machine translation. Technical Report ISI-TR-616, ISI/University of Southern California, May. Nizar Habash and Fatiha Sadat. 2006. Arabic preprocessing schemes for statistical machine translation. In HLT-NAACL. G. Iglesias, A. de Gispert, E. R. Banga, and W. Byrne. 2009. Hierarchical phrase-based translation with weighted finite state transducers. In Procedings of NAACL-HLT, 2009, Boulder, Colorado. Abraham Ittycheriah, Yaser Al-Onaizan, and Salim Roukos. 2006. The IBM Arabic-English word alignment corpus, August. A. Kannan, M. Ostendorf, and J. R. Rohlicek. 1994. Maximum likelihood clustering of Gaussians for speech recognition. Speech and Audio Processing, IEEE Transactions on, 2(3):453­455, July. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48­54. Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313­330. Sonja Nießen and Hermann Ney. 2001a. Morpho-syntactic analysis for reordering in statistical machine translation. In Proceedings of MT Summit VIII, pages 247­252, September. Sonja Nießen and Hermann Ney. 2001b. Toward hierarchical models for statistical machine translation of inflected languages. In Proceedings of the workshop on Data-driven methods in machine translation, pages 1­8, Morristown, NJ, USA. Association for Computational Linguistics. Franz Josef Och and Hermann Ney. 2000. A comparison of alignment models for statistical machine translation. In Proceedings of the 18th conference on Computational Linguistics, pages 1086­1090. F. Och, D. Gildea, S. Khudanpur, A. Sarkar, K. 
Yamada, A. Fraser, S. Kumar, L. Shen, D. Smith, K. Eng, V. Jain, Z. Jin, and D. Radev. 2004. A smorgasbord of features for statistical machine translation. In Proceedings of NAACL. Franz Josef Och. 2002. Statistical Machine Translation: From Single Word Models to Alignment Templates. Ph.D. thesis, Franz Josef Och. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL '03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 160­167. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311­318. Maja Popovi´ and Hermann Ney. 2004. Improving word alignc ment quality using morpho-syntactic information. In In Proceedings of COLING, page 310. Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 133­142. H. Singer and M. Ostendorf. 1996. Maximum likelihood successive state splitting. Proceedings of ICASSP, 2:601­604. Nicolas Stroppa, Antal van den Bosch, and Andy Way. 2007. Exploiting source similarity for SMT using context-informed features. In Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation (TMI 2007), pages 231 ­ 240. Kristina Toutanova, H. Tolga Ilhan, and Christopher D. Manning. 2002. Extensions to HMM-based statistical word alignment models. In Proceedings of EMNLP, pages 87­94. Ismael Garc´a Varea, Franz J. Och, Hermann Ney, and Frani cisco Casacuberta. 2002. Improving alignment quality in statistical machine translation using context-dependent maximum entropy models. In Proceedings of COLING, pages 1­7. Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of COLING, pages 836­841. S. J. Young, J. J. Odell, and P. C. Woodland. 1994. Tree-based state tying for high accuracy acoustic modelling. In HLT '94: Proceedings of the workshop on Human Language Technology, pages 307­312. 118 Graph-based Learning for Statistical Machine Translation Andrei Alexandrescu Dept. of Comp. Sci. Eng. University of Washington Seattle, WA 98195, USA andrei@cs.washington.edu Katrin Kirchhoff Dept. of Electrical Eng. University of Washington Seattle, WA 98195, USA katrin@ee.washington.edu Abstract Current phrase-based statistical machine translation systems process each test sentence in isolation and do not enforce global consistency constraints, even though the test data is often internally consistent with respect to topic or style. We propose a new consistency model for machine translation in the form of a graph-based semi-supervised learning algorithm that exploits similarities between training and test data and also similarities between different test sentences. The algorithm learns a regression function jointly over training and test data and uses the resulting scores to rerank translation hypotheses. Evaluation on two travel expression translation tasks demonstrates improvements of up to 2.6 BLEU points absolute and 2.8% in PER. 
Arabic-English translation task): Source 1: Asf lA ymknk *lk hnAk klfp HwAly vmAnyn dwlAr lAlsAEp AlwAHdp Ref: sorry you can't there is a cost the charge is eighty dollars per hour 1-best: i'm sorry you can't there in the cost about eighty dollars for a one o'clock Source 2: E*rA lA ymknk t$gyl AltlfAz HtY tqlE AlTA}rp Ref: sorry you cannot turn the tv on until the plane has taken off 1-best: excuse me i you turn tv until the plane departs 1 Introduction Current phrase-based statistical machine translation (SMT) systems commonly operate at the sentence level--each sentence is translated in isolation, even when the test data consists of internally coherent paragraphs or stories, such as news articles. For each sentence, SMT systems choose the translation hypothesis that maximizes a combined log-linear model score, which is computed independently of all other sentences, using globally optimized combination weights. Thus, similar input strings may be translated in very different ways, depending on which component model happens to dominate the combined score for that sentence. This is illustrated by the following example (from the IWSLT 2007 119 The phrase lA ymknk (you may not/you cannot) is translated differently (and wrongly in the second case) due to different segmentations and phrase translations chosen by the decoder. Though different choices may be sometimes appropriate, the lack of constraints enforcing translation consistency often leads to suboptimal translation performance. It would be desirable to counter this effect by encouraging similar outputs for similar inputs (under a suitably defined notion of similarity, which may include e.g. a context specification for the phrase/sentence). In machine learning, the idea of forcing the outputs of a statistical learner to vary smoothly with the underlying structure of the inputs has been formalized in the graph-based learning (GBL) framework. In GBL, both labeled (train) and unlabeled (test) data samples are jointly represented as vertices in a graph whose edges encode pairwise similarities between samples. Various learning algorithms can be applied to assign labels to the test samples while ensuring that the classification output varies smoothly Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 119­127, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics along the manifold defined by the graph. GBL has been successfully applied to a range of problems in computer vision, computational biology, and natural language processing. However, in most cases, the learning tasks consisted of unstructured classification, where the input was represented by fixedlength feature vectors and the output was one of a finite set of discrete labels. In machine translation, by contrast, both inputs and outputs consist of word strings of variable length, and the number of possible outputs is not fixed and practically unlimited. In this paper we propose a new graph-based learning algorithm with structured inputs and outputs to improve consistency in phrase-based statistical machine translation. We define a joint similarity graph over training and test data and use an iterative label propagation procedure to regress a scoring function over the graph. The resulting scores for unlabeled samples (translation hypotheses) are then combined with standard model scores in a log-linear translation model for the purpose of reranking. Our contributions are twofold. 
First, from a machine translation perspective, we design and evaluate a global consistency model enforcing that similar inputs receive similar translations. Second, from a machine learning perspective, we apply graph-based learning to a task with structured inputs and outputs, which is a novel contribution in itself since previous applications of GBL have focused on predicting categorical labels. We evaluate our approach on two machine translation tasks, the IWSLT 2007 Italian-to-English and Arabic-to-English tasks, and demonstrate significant improvements over the baseline.

2 Graph-Based Learning

GBL algorithms rely on a similarity graph consisting of a set of nodes representing data samples xi (where i ranges over 1, . . . , l labeled points and l + 1, . . . , n unlabeled points), and a set of weighted edges encoding pairwise similarities between samples. The graph is characterized by a weight matrix W whose elements Wij ≥ 0 are the similarity values for edges between vertices xi and xj, and by its label vector Y = (y1, . . . , yl), yi ∈ {1, . . . , C}, that defines labels for the first l points. If there is no edge linking nodes xi and xj, then Wij = 0. There is considerable freedom in choosing the weights. The similarity measure used to compute the edge weights determines the graph structure and is the most important factor in successfully applying GBL. In most applications of GBL, data samples are represented by fixed-length feature vectors, and cosine similarity or Euclidean distance-based measures are used for edge weights. Learning algorithms on similarity graphs include e.g. min-cut (Blum and Chawla, 2001), spectral graph transducer (Joachims, 2003), random walk-based approaches (Szummer and Jaakkola, 2001), and label propagation (Zhu and Ghahramani, 2002). The algorithm proposed herein is based on the latter.

2.1 Label Propagation

Given a graph defined by a weight matrix W and a label set Y, the basic label propagation algorithm proceeds as follows:

1. Initialize the matrix P as Pij = Wij / (Σj Wij - Wii)
2. Initialize an n × C matrix f with binary vectors encoding the known labels for the first l rows: fi = δC(yi) for all i ∈ {1, 2, . . . , l}, where δC(yi) is the Kronecker vector of length C with 1 in position yi and 0 elsewhere. The remaining rows of f can be zero.
3. f′ ← P × f
4. Clamp already-labeled data rows: f′i = δC(yi) for all i ∈ {1, 2, . . . , l}
5. If f′ ≈ f, stop.
6. f ← f′
7. Repeat from step 3.

After convergence, f contains the solution in rows l + 1 to n in the form of unnormalized label probability distributions. Hard labels can be obtained by

ŷi = arg max_{j ∈ {1,...,C}} fij,  i ∈ {l + 1, . . . , n}    (1)

The algorithm minimizes the following cost function (Zhu, 2005):

S = Σ_{k=1}^{C} Σ_{i>l} Σ_{j>l} Wij (fik - fjk)^2    (2)

S measures the smoothness of the learned function, i.e., the extent to which the labeling allows large-weight edges to link nodes of different labels. By minimizing S, label propagation finds a labeling that, to the extent possible, assigns similar soft labels (identical hard labels) to nodes linked by edges with large weights (i.e., highly similar samples). The labeling decision takes into account not only similarities between labeled and unlabeled nodes (as in nearest-neighbor approaches) but also similarities among unlabeled nodes. Label propagation has been used successfully for various classification tasks, e.g. image classification and handwriting recognition (Zhu, 2005).
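For concreteness, the basic propagation loop above can be written down in a few lines. The following is a minimal sketch in Python/NumPy under stated assumptions (a dense similarity matrix whose first l points are the labeled ones, integer class labels); the function name and argument conventions are illustrative, not taken from the paper.

```python
import numpy as np

def label_propagation(W, labels, num_classes, tol=1e-6, max_iter=1000):
    # W      : (n, n) array of pairwise similarities, W[i, j] >= 0.
    # labels : length-l integer array; the first l points of W are the labeled ones.
    # Returns an (n, num_classes) array of unnormalized label distributions.
    n = W.shape[0]
    l = len(labels)

    # Step 1: drop self-similarities and row-normalize (P_ij = W_ij / sum over j != i).
    W = W.astype(float).copy()
    np.fill_diagonal(W, 0.0)
    row_sums = W.sum(axis=1)
    row_sums[row_sums == 0.0] = 1.0          # isolated nodes keep an all-zero row
    P = W / row_sums[:, None]

    # Step 2: one-hot (Kronecker) rows for the labeled points, zeros for the rest.
    clamp = np.zeros((l, num_classes))
    clamp[np.arange(l), labels] = 1.0
    f = np.vstack([clamp, np.zeros((n - l, num_classes))])

    for _ in range(max_iter):
        f_next = P @ f                        # Step 3: propagate along the graph.
        f_next[:l] = clamp                    # Step 4: re-clamp the labeled rows.
        if np.abs(f_next - f).max() < tol:    # Step 5: stop when f has converged.
            f = f_next
            break
        f = f_next                            # Step 6, then repeat from step 3.
    return f                                  # Hard labels: f[l:].argmax(axis=1), cf. Eq. (1).
```

Rows l + 1 to n of the returned matrix play the role of the unnormalized label distributions described above.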
In natural language processing, label propagation has been used for document classification (Zhu, 2005), word sense disambiguation (Niu et al., 2005; Alexandrescu and Kirchhoff, 2007), and sentiment categorization (Goldberg and Zhu, 2006).

3 Graph-Based Learning for Machine Translation

Our goal is to exploit graph-based learning for improving consistency in statistical phrase-based machine translation. Intuitively, a set of similar source sentences should receive similar target-language translations. This means that similarities between training and test sentences should be taken into account, but also similarities between different test sentences, which is a source of information currently not exploited by standard SMT systems. To this end we define a graph over the training and test sets with edges between test and training sentences as well as between different test sentences. In cases where a test sentence does not have any connections to training sentences but is connected to other test sentences, helpful information about preferred translations can be propagated via these edges. As mentioned above, the problem of machine translation does not neatly fit into the standard GBL framework. Given that our samples consist of variable-length word strings instead of feature vectors, the standard cosine or Euclidean-distance based similarity measures cannot be used meaningfully, and the number of possible "labels"--correct translations--is unbounded and practically very large. We thus need to modify both the graph construction and the label propagation algorithms. First, we handle the problem of unlimited outputs by applying GBL to rescoring only. In most SMT systems, an N-best list (generated by a first decoding pass) approximates the search space of good hypotheses reasonably well, provided N is large enough. For all hypotheses of all sentences in the test set (a set we denote by H), the system learns a ranking function r : H → [0, 1]. Larger values of r indicate better hypotheses. The corresponding loss functional is

L(r) = Σ_{i,j} Wij [r(xi) - r(xj)]^2    (3)

L(r) measures the smoothness of r over the graph by penalizing highly similar clusters of nodes that have a high variance of r (in other words, similar input sentences that have very different translations). The smaller L(r), the "smoother" r is over the graph. Thus, instead of directly learning a classification function, we learn a regression function--similar to (Goldberg and Zhu, 2006)--that is then used for ranking the hypotheses.

3.1 Graph Construction

Each graph node represents a sentence pair (consisting of source and target strings), and edge weights represent the combined similarity scores computed from comparing both the source sides and target sides of a pair of nodes. Given a training set with l source and target language sentence pairs (s1, t1), . . . , (sl, tl) and a test set with source sentences sl+1, . . . , sn, the construction of the similarity graph proceeds as follows:

1. For each test sentence si, i = l + 1, . . . , n, find a set Strain_i of similar training source sentences and a set Stest_i of similar test sentences (excluding si and sentences identical to it) by applying a string similarity function to the source sides only and retaining sentences whose similarity exceeds a threshold θ. Different θ's can be used for training vs. test sentences; we use the same θ for both sets.
2. For each hypothesis hsi generated for si by a baseline system, compute its similarity to the target sides of all sentences in Strain_i. The overall similarity is then defined by the combined score

σij = φ(σ(si, sj), σ(hsi, tj))    (4)

where i = l + 1, . . . , n, j = 1, . . . , |Strain_i|, σ(·, ·) is the string similarity function, and φ : R+ × R+ → R+ is an averaging function. If σij > 0, establish graph nodes for hsi and tj and link them with an edge of weight σij.

3. For each hypothesis hsi and each hypothesis generated for each of the sentences sk ∈ Stest_i, compute similarity on the target side and use the combined similarity score as the edge weight between nodes for hsi and hsk.

4. Finally, for each node xt representing a training sentence, assign r(xt) = 1 and also define its synthetic counterpart: a vertex x̄t with r(x̄t) = 0. For each edge incident to xt of weight Wth, define a corresponding edge of weight 1 - Wth.

The synthetic nodes and edges need to be added to prevent the label propagation algorithm from converging to the trivial solution that assigns r = 1 to all points in the graph. This choice is theoretically motivated--a similarity graph for regression should have not only "sources" (good nodes with high value of r) but also "sinks" (counterparts for the sources). Figure 1 illustrates the connections of a test node.

Similarity Measure The similarity measure used for comparing source and target sides is of prime importance, as it determines the structure of the graph. This has consequences for both computational efficiency (denser graphs require more computation and memory) and the accuracy of the outcome. A low similarity threshold results in a rich graph with a large number of edges but possibly introduces noise. A higher threshold leads to a small graph emphasizing highly similar samples but with too many disconnected components. The similarity measure is also the means by which domain knowledge can be incorporated into the graph construction process. Similarity may be defined at the level of surface word strings, but may also include linguistic information such as morphological features, part-of-speech tags, or syntactic structures. Here, we compare two similarity measures: the familiar BLEU score (Papineni et al., 2002) and a score based on string kernels. In using BLEU we treat each sentence as a complete document. BLEU is not symmetric--when comparing two sentences, different results are obtained depending on which one is considered the reference and which one is the hypothesis. For computing similarities between train and test translations, we use the train translation as the reference. For computing similarity between two test hypotheses, we compute BLEU in both directions and take the average. We note that more appropriate distance measures are certainly possible. Many previous studies, such as (Callison-Burch et al., 2006), have pointed out drawbacks of BLEU, and any other similarity measure could be utilized instead. In particular, similarity measures that model aspects of sentences that are ill handled by standard phrase-based decoders (such as syntactic structure or semantic information) could be useful here. A more general way of computing similarity between strings is provided by string kernels (Lodhi et al., 2002; Rousu and Shawe-Taylor, 2005), which have been extensively used in bioinformatics and email spam detection. String kernels map strings into a feature space defined by all possible substrings of the string up to a fixed length k, and compute the dot product between the resulting feature vectors.
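Before the formal definition below, the following brute-force reference sketch may help make the idea concrete. It enumerates gap-weighted word subsequences explicitly and normalizes the kernel score; the paper itself uses a dynamic-programming implementation instead, and the default gap penalty and per-length weights here simply echo values reported later in the experiments section, so both should be treated as illustrative assumptions rather than prescribed settings.

```python
import itertools
import math
from collections import defaultdict

def gappy_features(tokens, k, lam):
    # Feature map of a gap-weighted subsequence kernel over words: for every
    # (possibly non-contiguous) subsequence u of length p <= k, accumulate
    # lam ** span, where span is the distance covered from the first to the
    # last matched position.  Brute force; fine for short sentences only.
    feats = [defaultdict(float) for _ in range(k + 1)]   # feats[p][u]
    for p in range(1, k + 1):
        for idx in itertools.combinations(range(len(tokens)), p):
            u = tuple(tokens[i] for i in idx)
            span = idx[-1] - idx[0] + 1
            feats[p][u] += lam ** span
    return feats

def string_kernel(s, t, k=4, lam=0.5, weights=(0.0, 0.1, 0.2, 0.7)):
    # K(s, t) = sum_p w_p * <phi_p(s), phi_p(t)>, one weight per subsequence length.
    assert len(weights) >= k, "need one weight per subsequence length"
    fs, ft = gappy_features(s, k, lam), gappy_features(t, k, lam)
    total = 0.0
    for p in range(1, k + 1):
        w = weights[p - 1]
        if w == 0.0:
            continue
        small, large = (fs[p], ft[p]) if len(fs[p]) < len(ft[p]) else (ft[p], fs[p])
        total += w * sum(v * large.get(u, 0.0) for u, v in small.items())
    return total

def normalized_string_kernel(s, t, **kw):
    # Normalize by sqrt(K(s, s) * K(t, t)) so long sentences are not favored.
    denom = math.sqrt(string_kernel(s, s, **kw) * string_kernel(t, t, **kw))
    return string_kernel(s, t, **kw) / denom if denom > 0 else 0.0

# Example (hypothetical sentences):
# sim = normalized_string_kernel("could you take a picture".split(),
#                                "could you please take a photo".split())
```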
Several variants of basic string kernels exist, notably those allowing gaps or mismatches, and efficient implementations have been devised even for large scale applications. Formally, we define a sentence s as a concatenation of symbols from a finite alphabet Σ (the vocabulary of the language) and an embedding function from strings to feature vectors, φ : Σ* → H. A kernel function K(s, t) computes the distance between the resulting vectors for two sentences s and t. In our case, the embedding function is defined as

φ^k_u(s) := Σ_{i : u = s(i)} λ^{|i|},   u ∈ Σ^k    (5)

where k is the maximum length of substrings, |i| is the length of i, and λ is a penalty parameter for each gap encountered in the substring. K is defined as

K(s, t) = Σ_u w_u ⟨φ_u(s), φ_u(t)⟩    (6)

where w_u is a weight dependent on the length of the substring u. Finally, the kernel score is normalized by √(K(s, s) · K(t, t)) to discourage long sentences from being favored. Thus, our similarity measure is a gapped, normalized string kernel, which is a more general measure than BLEU in that it considers noncontiguous substrings. We use a dynamic programming implementation of string kernels (Rousu and Shawe-Taylor, 2005). For the combination of source-side and target-side similarity scores (the function we denoted as φ) we test two simple schemes, using either the geometric or the arithmetic mean of the individual scores. In the first case, large edge weights only result when both source and target are close to each other; the latter may produce high edge weights when only one of them (typically the source score) is high. More sophisticated combination schemes, using e.g. weighted combination, could be used but were not investigated in this study.

Scalability Poor scalability is often mentioned as a drawback of graph-based learning. Straightforward implementations of GBL algorithms often represent the joint training and test data in working memory and therefore do not scale well to large data sets. However, we have developed several techniques to improve scalability without impeding accuracy. First, we construct separate graphs for each test sentence without losing global connectivity information. The graph for a test sentence is computed as the transitive closure of the edge set E over the nodes containing all hypotheses for that test sentence. This smaller graph does not affect the outcome of the learning process for the chosen sentence because in label propagation the learned value r(xi) can be influenced by that of another node xj if and only if xj is reachable from xi. In the worst theoretical case, the transitive closure could comprehend the entire graph, but in practice the edge set is never that dense and can be easily pruned based on the heuristic that faraway nodes connected through low-weight edges have less influence on the result. We use a simple embodiment of this heuristic in a work-list approach: starting from the nodes of interest (hypotheses for the focal sentence), we expand the closure starting with the direct neighbors, which have the largest influence; then add their neighbors, which have less influence, and so forth. A threshold on the number of added vertices limits undue expansion while capturing either the entire closure or a good approximation of it. Another practical computational advantage of portioning work is that graphs for different hypothesis sets can be trivially created and used in parallel, whereas distributing large matrix-vector multiplication is much more difficult (Choi, 1998). The disadvantage is that overall
redundant computations are being made: incomplete estimates of r are computed for the ancillary nodes in the transitive closure and then discarded. Second, we obtain a reduction in graph size of orders of magnitude by collapsing all training vertices of the same r that are connected to the same test vertex into one and summing the edge weights. This is equivalent to the full graph for learning purposes.

Figure 1: Connections for hypothesis node xh. Similarity edges with weights Wth link the node with train sentences xt, for which r(xt) = 1. For each of these edges we define a dissimilarity edge of weight 1 - Wth, linking the node with node x̄t, for which r(x̄t) = 0. The vertex is also connected to other test vertices (the dotted edges).

3.2 Propagation

Label propagation proceeds as follows:

1. Compute the transitive closure over the edges starting from all hypothesis nodes of a given sentence.
2. On the resulting graph, collapse all test-train similarities for each test node by summing edge weights. Obtain accumulated similarities in row and column 1 of the similarity matrix W.
3. Normalize test-to-train weights such that Σj W1j = Σj Wj1 = 1.
4. Initialize the matrix P as Pij = Wij / (1 - Wi1 + Σj Wij). (The quantity 1 - Wi1 in the denominator is the weight of the dissimilarity edge.)
5. Initialize a column vector f of height n with f1 = 1 (corresponding to node x1) and 0 in the remaining positions.
6. f′ ← P × f
7. Clamp the first entry: f′1 = 1.
8. If f′ ≈ f, continue with step 11.
9. f ← f′
10. Repeat from step 6.
11. The result r is in the slots of f that correspond to the hypotheses of interest. Normalize per sentence if needed, and rank in decreasing order of r.

Convergence Our algorithm's convergence proof is similar to that for standard label propagation (Zhu, 2005, p. 6). We split P as follows:

P = [ 0    PLU ]
    [ PUL  PUU ]    (7)

where PUL is a column vector holding global similarities of test hypotheses with train sentences, PLU is a horizontal vector holding the same similarities (though PLU ≠ PUL^T due to normalization), and PUU holds the normalized similarities between pairs of test hypotheses. We also separate f:

f = [ 1  ]
    [ fU ]    (8)

where we distinguish the first entry because it represents the training part of the data. With these notations, the iteration formula becomes:

fU = PUU fU + PUL    (9)

Unrolling the iteration yields:

fU = lim_{n→∞} [ (PUU)^n fU^(0) + Σ_{i=1}^{n} (PUU)^(i-1) PUL ]

It can be easily shown that the first term converges to zero because of normalization in step 4 (Zhu, 2005). The sum in the second term converges to (I - PUU)^(-1), so the unique fixed point is:

fU = (I - PUU)^(-1) PUL    (10)

Our system uses the iterative form. On the data sets used, convergence took 61.07 steps on average. At the end of the label propagation algorithm, normalized scores are obtained for each N-best list (sentences without any connections whatsoever are assigned zero scores). These are then used together with the other component models in log-linear combination. Combination weights are optimized on a held-out data set.

4 Data and System

We evaluate our approach on the IWSLT 2007 Italian-to-English (IE) and Arabic-to-English (AE) travel tasks. The first is a challenge task, where the training set consists of read sentences but the development and test data consist of spontaneous dialogues. The second is a standard travel expression translation task consisting entirely of read input. For our experiments we chose the text input (correct transcription) condition only. The data set sizes are shown in Table 1. We split the IE development set into two subsets of 500 and 496 sentences each. The first set (dev-1) is used to train the system parameters of the baseline system and as a training set for GBL. The second is used to tune the GBL parameters. For each language pair, the baseline system was trained with additional out-of-domain text data: the Italian-English Europarl corpus (Koehn, 2005) in the case of the IE system, and 5.5M words of newswire data (LDC Arabic Newswire, Multiple-Translation Corpus and ISI automatically extracted parallel data) in the case of the AE system.

Table 1: Data set sizes and reference translations count.
Set      | # sent pairs | # words | # refs
IE train | 26.5K        | 160K    | 1
IE dev-1 | 500          | 4308    | 1
IE dev-2 | 496          | 4204    | 1
IE eval  | 724          | 6481    | 4
AE train | 23K          | 160K    | 1
AE dev4  | 489          | 5392    | 7
AE dev5  | 500          | 5981    | 7
AE eval  | 489          | 2893    | 6

Our baseline is a standard phrase-based SMT system based on a log-linear model with the following feature functions: two phrase-based translation scores, two lexical translation scores, word count and phrase count penalty, distortion score, and language model score. We use the Moses decoder (Koehn et al., 2007) with a reordering limit of 4 for both languages, which generates N-best lists of up to 2000 hypotheses per sentence in a first pass. The second pass uses a part-of-speech (POS) based trigram model, trained on POS sequences generated by a MaxEnt tagger (Ratnaparkhi, 1996). The language models are trained on the English side using SRILM (Stolcke, 2002) and modified Kneser-Ney discounting for the first-pass models and Witten-Bell discounting for the POS models. The baseline system yields state-of-the-art performance.

5 Experiments and Results

We started with the IE system and initially investigated the effect of only including edges between labeled and unlabeled samples in the graph. This is equivalent to using a weighted k-nearest neighbor reranker that, for each hypothesis, computes average similarity with its neighborhood of labeled points, and uses the resulting score for reranking. Starting with the IE task and the BLEU-based similarity metric, we ran optimization experiments that varied the similarity threshold and compared sum vs. product combination of source and target similarity scores, settling for θ = 0.7 and product combination. We experimented with three different ways of weighting the contributions from labeled-unlabeled vs. unlabeled-unlabeled edges: (a) no weighting, (b) labeled-to-unlabeled edges were weighted 4 times stronger than unlabeled-unlabeled ones; and (c) labeled-to-unlabeled edges were weighted 2 times stronger. The weighting schemes do not lead to significantly different results.

Table 2: GBL results (%BLEU/PER) on IE task for different weightings of labeled-labeled vs. labeled-unlabeled graph edges (BLEU-based similarity measure).
Weighting       | dev-2     | eval
none (baseline) | 22.3/53.3 | 29.6/45.5
(a)             | 23.4/51.5 | 30.7/44.1
(b)             | 23.5/51.6 | 30.6/44.3
(c)             | 23.2/51.8 | 30.0/44.6

Table 3: GBL results (%BLEU/PER) on IE task with string-kernel based similarity measure.
System   | dev-2     | eval
Baseline | 22.3/53.3 | 29.6/45.5
GBL      | 24.3/51.0 | 32.2/42.7
The best result obtained shows a gain of 1.2 BLEU points on the dev set and 1 point on the eval set, reflecting PER gains of 2% and 1.2%, respectively. We next tested the string kernel based similarity measure. The parameter values were 0.5 for the gap penalty, a maximum substring length of k = 4, and weights of 0, 0.1, 0.2, 0.7. These values were chosen heuristically and were not tuned extensively due to time constraints. Results (Table 3) show significant improvements in PER and BLEU. In the context of the BTEC challenge task it is interesting to compare this approach to adding the development set directly to the training set. Part of the improvements may be due to utilizing kNN information from a data set that is matched to the test set in terms of style. If this data were also used for training the initial phrase table, the improvements might disappear. We first optimized the log-linear model combination weights on the entire dev07 set (dev-1 and dev-2 in Table 1) before retraining the phrase table using the combined train and dev07 data. The new baseline performance (shown in Table 4) is much better than before, due to the improved training data. We then added GBL to this system by keeping the model combination weights trained for the previous system, using the N-best lists generated by the new system, and using the combined train+dev07 set as a train set for selecting similar sentences. We used the GBL parameters that yielded the best performance in the experiments described above. As can be seen from Table 4, GBL again yields an improvement of up to 1.2% absolute in both BLEU and PER.

Table 4: Effect of GBL on IE system trained with matched data (eval set).
System   | BLEU (%) | PER
Baseline | 37.9     | 38.4
GBL      | 39.2     | 37.2

For the AE task we used θ = 0.5; however, this threshold was not tuned extensively. Results using BLEU similarity are shown in Table 5. The best result on the eval set yields an improvement of 1.2 BLEU points though only 0.2% reduction in PER. Overall, results seem to vary with parameter settings and nature of the test set (e.g. on dev5, used as a test set, not for optimization, a surprisingly larger improvement in BLEU of 2.7 points is obtained!). Overall, sentence similarities were observed to be lower for this task. One reason may be that the AE system includes statistical tokenization of the source side, which is itself error-prone in that it can split the same word in different ways depending on the context. Since our similarity measure is word-based, this may cause similar sentences to fall below the threshold.

Table 5: AE results (%BLEU/PER, θ = 0.5)
Method   | dev4      | dev5      | eval
Baseline | 30.2/43.5 | 21.9/48.4 | 37.8/41.8
GBL      | 30.3/42.5 | 24.6/48.1 | 39.0/41.6

The string kernel does not yield any improvement over the BLEU-based similarity measure on this task. One possible improvement would be to use an extended string kernel that can take morphological similarity into account.

Example Below we give an actual example of a translation improvement, showing the source sentence, the 1-best hypotheses of the baseline system and GBL system, respectively, the references, and the translations of similar sentences in the graph neighborhood of the current sentence.
Source: Al+ mE*rp Aymknk {ltqAT Swrp lnA
Baseline: i'm sorry could picture for us
GBL: excuse me could you take a picture of the us
Refs: excuse me can you take a picture of us
excuse me could you take a photo of us
pardon would you mind taking a photo of us
pardon me could you take our picture
pardon me would you take a picture of us
excuse me could you take a picture of u
Similar sentences: could you get two tickets for us
please take a picture for me
could you please take a picture of us

6 Related Work

GBL is an instance of semi-supervised learning, specifically transductive learning. A different form of semi-supervised learning (self-training) has been applied to MT by (Ueffing et al., 2007). Ours is the first study to explore a graph-based learning approach. In the machine learning community, work on applying GBL to structured outputs is beginning to emerge. Transductive graph-based regularization has been applied to large-margin learning on structured data (Altun et al., 2005). However, scalability quickly becomes a problem with these approaches; we solve that issue by working on transitive closures as opposed to entire graphs. String kernel representations have been used in MT (Szedmak, 2007) in a kernel regression based framework, which, however, was an entirely supervised framework. Finally, our approach can be likened to a probabilistic implementation of translation memories (Maruyana and Watanabe, 1992; Veale and Way, 1997). Translation memories are (usually commercial) databases of segment translations extracted from a large database of translation examples. They are typically used by human translators to retrieve translation candidates for subsequences of a new input text. Matches can be exact or fuzzy; the latter is similar to the identification of graph neighborhoods in our approach. However, our GBL scheme propagates similarity scores not just from known to unknown sentences but also indirectly, via connections through other unknown sentences. The combination of a translation memory and statistical translation was reported in (Marcu, 2001); however, this is a combination of word-based and phrase-based translation predating the current phrase-based approach to SMT.

7 Conclusion

We have presented a graph-based learning scheme to implement a consistency model for SMT that encourages similar inputs to receive similar outputs. Evaluation on two small-scale translation tasks showed significant improvements of up to 2.6 points in BLEU and 2.8% PER. Future work will include testing different graph construction schemes, in particular better parameter optimization approaches and better string similarity measures. More gains can be expected when using better domain knowledge in constructing the string kernels. This may include e.g. similarity measures that accommodate POS tags or morphological features, or comparisons of the syntax trees of parsed sentences. The latter could be quite easily incorporated into a string kernel or the related tree kernel similarity measure. Additionally, we will investigate the effectiveness of this approach on larger translation tasks.

Acknowledgments

This work was funded by NSF grant IIS-032676 and DARPA under Contract No. HR0011-06-C-0023. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of these agencies.

References

A. Alexandrescu and K. Kirchhoff. 2007. Data-Driven Graph Construction for Semi-Supervised Graph-Based Learning in NLP. In HLT. Y.
Altun, D. McAllester, and M. Belkin. 2005. Maximum margin semi-supervised learning for structured variables. In Proceedings of NIPS 18. A. Blum and S. Chawla. 2001. Learning from labeled and unlabeled data using graph mincuts. Proc. 18th International Conf. on Machine Learning, pages 19­ 26. C. Callison-Burch, M. Osborne, and P. Koehn. 2006. Reevaluating the role of BLEU in machine translation research. In Proceedings of EACL. Jaeyoung Choi. 1998. A new parallel matrix multiplication algorithm on distributed-memory concurrent computers. Concurrency: Practice and Experience, 10(8):655­670. A. Goldberg and J. Zhu. 2006. Seeing stars when there aren't many stars: Graph-based semi-supervised learning for sentiment categorization. In HLT-NAACL Workshop on Graph-based Algorithms for Natural Language Processing. T. Joachims. 2003. Transductive learning via spectral graph partitioning. In Proceedings of ICML. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL. P. Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit X, pages 79­86, Phuket, Thailand. H. Lodhi, J. Shawe-taylor, and N. Cristianini. 2002. Text classification using string kernels. In Proceedings of NIPS. D. Marcu. 2001. Towards a unified approach to memoryand statistical-based machine translation. In Proceedings of ACL. H. Maruyana and H. Watanabe. 1992. Tree cover search algorithm for example-based translation. In Proceedings of TMI, pages 173­184. Zheng-Yu Niu, Dong-Hong Ji, and Chew Lim Tan. 2005. Word sense disambiguation using label propagation based semi-supervised learning method. In Proceedings of ACL, pages 395­402. K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL. A. Ratnaparkhi. 1996. A maximum entropy part-ofspeech tagger. In Proc.of (EMNLP). J. Rousu and J. Shawe-Taylor. 2005. Efficient computation of gap-weighted string kernels on large alphabets. Journal of Machine Learning Research, 6:1323­1344. A. Stolcke. 2002. SRILM--an extensible language modeling toolkit. In ICSLP, pages 901­904. Zhuoran Wang;John Shawe-Taylor;Sandor Szedmak. 2007. Kernel regression based machine translation. In Proceedings of NAACL/HLT, pages 185­188. Association for Computational Linguistics. Martin Szummer and Tommi Jaakkola. 2001. Partially labeled classification with markov random walks. In Advances in Neural Information Processing Systems, volume 14. http://ai.mit.edu/people/ szummer/. N. Ueffing, G. Haffari, and A. Sarkar. 2007. Transductive learning for statistical machine translation. In Proceedings of the ACL Workshop on Statistical Machine Translation. T. Veale and A. Way. 1997. Gaijin: a templatebased bootstrapping approach to example-based machine translation. In Proceedings of News Methods in Natural Language Processing. X. Zhu and Z. Ghahramani. 2002. Learning from labeled and unlabeled data with label propagation. Technical report, CMU-CALD-02. Xiaojin Zhu. 2005. Semi-Supervised Learning with Graphs. Ph.D. thesis, Carnegie Mellon University. CMU-LTI-05-192. 
Intersecting multilingual data for faster and better statistical translations

Yu Chen (1,2), Martin Kay (1,3), Andreas Eisele (1,2)
1: Universität des Saarlandes, Saarbrücken, Germany
2: Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Saarbrücken, Germany
3: Stanford University, CA, USA
{yuchen,kay,eisele}@coli.uni-saarland.de

Abstract

In current phrase-based SMT systems, more training data is generally better than less. However, a larger data set eventually introduces a larger model that enlarges the search space for the translation problem, and consequently requires more time and more resources to translate. We argue that redundant information in an SMT system may not only delay the computations but also affect the quality of the outputs. This paper proposes an approach to reduce the model size by filtering out the less probable entries based on compatible data in an intermediate language, a novel use of triangulation, without sacrificing the translation quality. Comprehensive experiments were conducted on standard data sets. We achieved significant quality improvements (up to 2.3 BLEU points) while translating with reduced models. In addition, we demonstrate a straightforward combination method for more progressive filtering. The reduction of the model size can be up to 94% with the translation quality being preserved.

1 Introduction

Statistical machine translation (SMT) applies machine learning techniques to a bilingual corpus to produce a translation system entirely automatically. Such a scheme has many potential advantages over earlier systems which relied on carefully crafted rules. The most obvious is that it dramatically reduces cost in human labor, and it is able to reach many critical translation rules that are easily overlooked by human beings. SMT systems generally assemble translations by selecting phrases from a large candidate set. Unsupervised learning often introduces a considerable amount of noise into this set, as a result of which the selection process becomes longer and less effective. This paper provides one approach to these problems. Various filtering techniques, such as (Johnson et al., 2007) and (Chen et al., 2008), have been applied to eliminate a large portion of the translation rules that were judged unlikely to be of value for the current translation. However, these approaches were only able to improve the translation quality slightly. In this paper, we describe a triangulation approach (Kay, 1997) that incorporates multilingual data to improve system efficiency and translation quality at the same time. Most of the previous triangulation approaches (Kumar et al., 2007; Cohn and Lapata, 2007; Filali and Bilmes, 2005; Simard, 1999; Och and Ney, 2001) add information obtained from a third language. In other words, they work with the union of the data from the different languages. In contrast, we work with the intersection of information acquired through a third language. The hope is that the intersection will be more precise and more compact than the union, so that a better result will be obtained more efficiently.

2 Noise in a phrase-based SMT system

The phrases in a translation model are extracted heuristically from a word alignment between the parallel texts in two languages using machine learning techniques.
The translation model feature values are stored in the form of a so-called phrase-table, while the distortion model is in the reordering-table. As we have said, models built in this way tend to contain a considerable amount of noise. The phrase-table entries are far less reliable than the lexicons and grammar rules handcrafted for rule-based systems. The main source of noise in the phrase table is errors from the word alignment process. For example, many function words occur so frequently that they are incorrectly mapped to translations of many function words in the other language to which they are, in fact, unrelated. On the other hand, many words remain unaligned on account of their very low frequency. Another source of noise comes from the phrase extraction algorithm itself. The unaligned words are usually attached to aligned sequences in order to achieve longer phrase pairs. The final selection of entries from the phrase table is based not only on the values assigned to them there, but also on values coming from the language and reordering models, so that entries that receive an initially high value may end up not being preferred.

(1) Sie lieben ihre Kinder nicht.
    they love their children not
    They don't love their children.

The frequently occurring German negative "nicht" in (1) is sometimes difficult for SMT systems to translate into English because it may appear in many positions of a sentence. For instance, it occurs at the end of the sentence in (1). The phrase pairs "ihre kinder nicht → their children are not" and "ihre kinder nicht → their children" are both likely also to appear in the phrase table, and the former has greater estimated probability. However, the language model would prefer the latter in this example because the sentence "They love their children are not." is unlikely to be attested. Accordingly, the SMT system may produce the misleading translation in (2).

(2) They love their children.

The system would not produce translations with the opposite meanings if the noisy entries like "ihre kinder nicht → their children" were excluded from the translation candidates. Eliminating the noise should help to improve the system's performance, for both efficiency and translation quality.

3 Triangulated filtering

While direct translation and pivot translation through a bridge language presumably introduce noise in substantially similar amounts, there is no reason to expect the noise in the two systems to correlate strongly. In fact, the noise from such different sources tends to be quite distinct, whereas the more useful information is often retained. This encourages us to hope that information gathered from various sources will be more reliable overall. Our plan is to ameliorate the noise problem by constructing a smaller phrase-table by taking the intersection of a number of sources. We reason that a target phrase will appear as a candidate translation of a given source phrase only if it also appears as a candidate translation for some word or phrase in the bridge language mapping to the source phrase. We refer to this triangulation approach as triangulated phrase-table filtering.
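The intersection test at the heart of this approach, spelled out in more detail below, reduces to a simple set operation per phrase pair. The following sketch is only an illustration under stated assumptions: phrase tables are collapsed to dictionaries of phrase sets, probabilities are not shown, and the treatment of pairs where only one side has bridge evidence is a guess on our part; the data structures and toy entries are hypothetical, not taken from the paper.

```python
def filter_phrase_table(direct, src2bridge, tgt2bridge):
    # direct     : source phrase -> set of candidate target phrases (model to filter)
    # src2bridge : source phrase -> set of bridge-language phrases
    # tgt2bridge : target phrase -> set of bridge-language phrases
    # Returns a filtered copy of `direct`; feature values would carry over unchanged.
    filtered = {}
    for src, candidates in direct.items():
        src_links = src2bridge.get(src)
        kept = set()
        for tgt in candidates:
            tgt_links = tgt2bridge.get(tgt)
            if not src_links or not tgt_links:
                # No bridge evidence for one side: bridge models are treated as
                # insufficient for a decision, so the pair is kept (assumption
                # for the one-sided case).
                kept.add(tgt)
            elif src_links & tgt_links:
                # A common bridge phrase links both sides: keep the pair.
                kept.add(tgt)
            # Otherwise both sides map into the bridge language but never to
            # the same phrase, so the entry is dropped.
        if kept:
            filtered[src] = kept
    return filtered

# Illustrative toy run (hypothetical entries):
# direct = {"ihre kinder nicht": {"their children", "their children are not"}}
# src2bridge = {"ihre kinder nicht": {"seus filhos nao"}}
# tgt2bridge = {"their children are not": {"seus filhos nao"},
#               "their children": {"seus filhos"}}
# filter_phrase_table(direct, src2bridge, tgt2bridge)
# -> {"ihre kinder nicht": {"their children are not"}}
```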
Figure 1: Triangulated filtering in SMT systems

Figure 1 illustrates our triangulation approach. Two bridge models are first constructed: one from the source language to the bridge language, and another from the target language to the bridge language. Then, we use these two models to filter the original source-target model. For each phrase pair in the original table, we try to find a common link in these bridge models to connect both phrases. If such links do not exist, we remove the entry from the table. The probability values in the table remain unchanged. The reduced table can be used in place of the original one in the SMT system. There are various forms of links that can be used as our evidence for the filtering process. One obvious form is complete phrases in the bridge language, which means, for each phrase pair in the model to be filtered, we should look for a third phrase in the bridge language that can relate the two phrases in the pair. This approach to filtering examines each phrase pair presented in the phrase-table one by one. For each phrase pair, we collect the corresponding translations using the models for translation into a third language. If both phrases can be mapped to some phrases in the bridge language, but to different ones, we should remove it from the model. It is also possible that neither of the phrases appears in the corresponding bridge models. In this case, we consider the bridge models insufficient for making the filtering decision and prefer to keep the pair in the table. The way a decoder constructs translation hypotheses is directly related to the weights for different model features in an SMT system, which are usually optimized for a given set of models with minimum error rate training (MERT) (Och, 2003) to achieve better translation performance. In other words, the weights obtained for a model do not necessarily apply to another model. Since the triangulated filtering method removes a part of the model, it is important to readjust the feature weights for the reduced phrase-table.

4 Experimental design

All the text data used in our experiments are from Release v3 of the "European Parliament Proceedings Parallel Corpus 1996-2006" (Europarl) corpus (Koehn, 2005). We mainly investigated translations from Spanish to English. There are enough structural differences in these two languages to introduce some noise in the phrase table. French, Portuguese, Danish, German and Finnish were used as bridge languages. Portuguese is very similar to Spanish and French somewhat less so. Finnish is unrelated and fairly different typologically, with Danish and German occupying the middle ground. In addition, we also present briefly the results on German-English translations with Dutch, Spanish and Danish as bridges. For the Spanish-English pair, three translation models were constructed over the same parallel corpora. We acquired comparable data sets by drawing several subsets from the same corpus according to various maximal sentence lengths. The subsets we used in the experiments are represented by "EP-20", "EP-40" and "EP-50", in which the numbers indicate the maximal sentence length in the respective Europarl subsets. Table 1 lists the characteristics of the Spanish-English subsets. Although the maximal sentence length in these sets is far less than that of the whole corpus (880 tokens), EP-50 already includes nearly 85% of Spanish-English sentence pairs from Europarl.

Table 1: Europarl subsets for building the Spanish-English SMT system
Model    | Sentences | Tokens (Spanish) | Tokens (English)
EP-20    | 410,487   | 5,220,142        | 5,181,452
EP-40    | 964,687   | 20,820,067       | 20,229,833
EP-50    | 1,100,813 | 26,731,269       | 25,867,370
Europarl | 1,304,116 | 37,870,751       | 36,429,274

The translation models, both the models to be filtered and the bridge models, were generated from compatible Europarl subsets using the Moses toolkit (Koehn et al., 2007) with the most basic configurations. The feature weights for the Spanish-English translation models were optimized over a development set of 500 sentences using MERT to maximize BLEU (Papineni et al., 2001). The triangulated filtering algorithm was applied to each combination of a translation model and a third language. The reordering models were also filtered according to the phrase-table. Only those phrase pairs that appeared in the phrase-table remained in the reordering table. We reran the MERT process solely based on the remaining entries in the filtered tables. Each table is used to translate a set of 2,000 sentences of test data (from the shared task of the third Workshop on Statistical Machine Translation, 2008; for details, see http://www.statmt.org/wmt08/shared-task.html). Both the test set and the development data set have been excluded from the training data. We evaluated the proposed phrase-table filtering method mainly from two points of view: the efficiency of systems with filtered tables and the quality of output translations produced by the systems.

5 Results

5.1 System efficiency

Often the question of machine translation is not only how to produce a good translation, but also how to produce it quickly. To evaluate the system efficiency, we measured both storage space and time consumption. For recording the computation time, we run an identical installation of the decoder with different models and then measure the average execution time for the given translation task. In Table 2, we give the number of entries in each phrase table (N), the physical file size of the phrase table (SPT) and the reordering table (SRT) (without any compression or binarization), Tl, the time for the program to load phrase tables, and Tt, the time to translate the complete test set. We also highlighted the largest and the smallest reduction from each group. All filtered models showed significant reductions in size. The greatest reduction of model sizes, taking both phrase-table and reordering table into account, is nearly 11 gigabytes for filtering the largest model (EP-50) with a Finnish bridge, which leads to the maximal time saving of 939 seconds, or almost 16 minutes, for translating two thousand sentences. The reduction rates from the two larger models are very close to each other, whereas the filtered table scaled down the most significantly on the smallest model (EP-20), which was in fact constructed over a much smaller subset of the Europarl corpus, consisting of less than half of the sentence pairs in the other two Europarl subsets. Compared to the larger Europarl subsets, the small data set is expected to produce more errors through training as there is much less relevant data for the machine learning algorithm to correctly extract useful information from.
Consequently, there are more noisy entries in this small model, and therefore more entries to be removed. In addition, the filtering is done by exact matching of complete phrases, which presumably happens much less frequently even for correctly paired phrase pairs in the very limited data supplied by the smallest training set. For the same reason, the distinction be131 tween different bridge languages was less clear for this small model. Due to hardware limitation, we are not able to fit the unfiltered phrase tables completely into the memory. Every table was filtered based on the given input so only a small portion of each table was loaded into memory. This may diminish the difference between the original and the filtered table to a certain degree. The relative time consumptionnevertheless agrees with the reduction in size: phrase tables from the smallest model showed the most reduction for both loading the models and processing the translations. For loading time, we count the time it takes to start and to load the bilingual phrase-tables plus reordering tables and the monolingual language model into the memory. The majority of the loading time for the smallest model, even before filtering, has been used for loading language models and other start-up processes, could not be reduced as much as the reduction on table size. 5.2 Translation quality Bridge -- pt fr da de fi EP-20 26.62 28.40 28.28 28.48 28.05 28.02 EP-40 31.43 32.90 32.69 32.47 32.65 31.91 EP-50 31.68 33.93 33.47 33.88 33.13 33.04 Table 3: BLEU scores of translations using filtered phrase tables Efficiency aside, a translation system should be able to produce useful translation. It is important to verify that the filtering approach does not affect the translation quality of the system. Table 3 show the BLEU scores of each translation acquired in the experiments. Between translation models of different sizes, there are obvious performance gaps. Different bridge languages can cause different effects on performance. However, the translation qualities from a single model are fairly close to each other. We therefore take it that the effect of the triangulation approach is rather robust across translation models of different sizes. Model+Bridge EP-20+ -- EP-20+ pt EP-20+ fr EP-20+ da EP-20+ de EP-20+ fi EP-40+ -- EP-40+ pt EP-40+ fr EP-40+ da EP-40+ de EP-40+ fi EP-50+ -- EP-50+ pt EP-50+ fr EP-50+ da EP-50+ de EP-50+ fi Time Tl (s) Tt (s) 55 3529 53 2826 48 2702 52 2786 43 2732 47 2670 65 3673 50 3091 46 3129 42 3050 46 3069 40 2889 140 4130 78 3410 97 3616 81 3418 95 3488 71 3191 Table Size N SP T (byte) 7,599,271 953M 1,712,508 (22.54%) 198M 1,536,056 (20.21%) 172M 1,659,067 (21.83%) 186M 1,260,524 (16.59%) 132M 1,331,323 (17.52%) 147M 19,199,807 2.5G 8,378,517 (43.64%) 1.1G 8,599,708 (44.79%) 1.1G 6,716,304 (34.98%) 842M 6,113,769 (31.84%) 725M 4,473,483 (23.30%) 533M 54,382,715 7.1G 13,225,654 (24.32%) 1.6G 24,057,849 (44.24%) 3.0G 12,547,839 (23.07%) 1.5G 15,938,151 (29.31%) 1.9G 7,691,904 (17.75%) 895M SRT (byte) 717M 149M 131M 141M 101M 111M 1.9G 1.8G 741M 568M 492M 353M 5.4G 1.3G 2.3G 1.2G 1.5G 677M Table 2: System efficiency: time consumption and phrase-table size It is obvious that the best systems are usually NOT from the filtered tables that preserved the most entries from the original. All the filtered models showed some improvement in quality with updated model weights. Mostly around 1.5 BLEU points, the increases ranged from 0.36 to 2.25. Table 4 gives a set of translations from the experiments. 
The unfiltered baseline system inserted the negative by mistake while all the filtered systems are able to avoid this. It indicates that there are indeed noisy entries affecting translation quality in the original table. We were able to achieve better translations by eliminating noisy entries. The filtering methods indeed tend to remove entries composed of long phrases. Table 5 lists the average length of phrases in several models. Both source phrases and target phrases are taken into account. The best models have shortest phrases on average. Discarding such entries seems to be necessary. This is consistent with the findings in (Koehn, 2003) that phrases longer than three words improve performance little for training corpora of up to 20 million words. Quality gains appeared to converge in the results across different bridge languages while the original models became larger. Translations generated using large models filtered with different bridge lan132 Bridge -- pt fr da de fi EP-20 3.776 3.195 3.003 3.005 2.535 2.893 EP-40 4.242 3.943 3.809 3.74 3.501 3.521 EP-50 4.335 3.740 3.947 3.453 3.617 3.262 Table 5: Average phrase length guages are less diverse. Meanwhile, the degradation is less for a larger model. It is reasonable to expect improvements for extremely large models with arbitrary bridge languages. For relatively small models, the selection of bridge languages would be critical for the effect of our approach. 5.3 Language clustering To further understand how the triangulated filtering approach worked and why it could work as it did, we examined a randomly selected phrase table fragment through the experiments. The segment included 10 potential English translations of the same Spanish word "fabricantes", the plural form of the word "fabricante" (manufacturer). Table 6 shows the filtering results on a randomly selected segment from the original "EP-40" model, including 10 English translations of the same source source ref baseline pt fr da de fi As´, se van modificando poco a poco los principios habituales del Estado de derecho por influencia de una i concepcin extremista de la lucha con tra las discriminaciones.. thus , the usual principles of the rule of law are being gradually altered under the influence of an extremist approach to combating discrimination. we are not changing the usual principles of the rule of law from the influence of an extremist approach in the fight against discrimination. so , are gradually changing normal principles of the rule of law by influence of an extremist conception of the fight against discrimination. so , we are gradually changing the usual principles of the rule of law by influence of an extremist conception of the fight against discrimination. so , are gradually changing the usual principles of the rule of law by influence of an extremist conception of the fight against discrimination. thus , we are gradually altering the usual principles of the rule of law by influence of an extremist conception of the fight against discrimination. so , are gradually changing normal principles of the rule of law by influence of an extremist conception of the fight against discrimination. Table 4: Examples BLEU (%) fabricantes a manufacturer battalions car manufacturers have car manufacturers makers manufacturer manufacturers producers are producers need producers pt fr da de fi 4 3 0 5 3 5 5 3 0 5 changed. None of the languages led to the identical eliminations. None of the cases excludes all errors. 
Apparently, the selection of bridge languages had immediate effects on the filtering results. 33 32.8 32.6 32.4 32.2 32 31.8 31.6 31.4 Portugese French Danish German Finnish Baseline Table 6: Phrase-table entries before and after filtering a model with different bridges word "fabricantes". indicates that the corresponding English phrase remained in the table after triangulated filtering with the corresponding bridge language. We also counted the number of tables that included each phrase pair. Regardless of the bridge language, the triangulated filtering approach had removed those entries that are clearly noise. Meanwhile, entries which are surely correct were always preserved in the filtered tables. The results of using different bridge languages turned out to be consistent on these extreme cases. The 5 filtering processes agreed on six out of ten pairs. As for the other 4 pairs, the decisions were different using different bridge languages. The remaining entries were always different when the bridge was 133 31.2 31 4 6 8 10 12 14 Phrase-table Entries (Mil.) 16 18 20 Figure 2: Clustering of bridge languages We compared two factors of these filtered tables: their sizes and the corresponding BLEU scores. Figure 2 shows interesting signs of language similarity/dissimilarity. There are apparently two groups of languages having extremely close performance, which happen to fall in two language groups: Germanic (German and Danish) and Romance (French and Portuguese). The Romance group was associated with larger filtered tables that produced slightly better translations. The filtered tables created with Germanic bridge languages contained ap- proximately 2 million entries less than Romance groups. The translation quality difference between these two groups was within 1 point of BLEU. Observed from this figure, it seems that the translation quality was connected to the similarity between the bridge language and the source language. The closer the bridge is to the source language, the better translations it may produce. For instance, Portuguese led to a filtered table that produced the best translations. On the other hand, the more different the bridge languages compared to the source, the larger portion of the phrase-table the filtering algorithm will remove. The table filtered with German was the smallest in the four cases. Finnish, a language that is unrelated to others, was associated with distinctive results. The size of the table filtered with Finnish is only 23% of the original, almost half of the table generated with Portuguese. Finnish has extremely rich morphology, hence a great many word-forms, which would make exact matching in bridge models less likely to happen. Many more phrase pairs in the original table were removed for this reason even though some of these entries were beneficial for translations. Even though the improvement on translation quality due to the Finnish bridge was less significant than the others, it is clear that triangulated filtering retained the useful information from the original model. 5.4 Further filtering 32 30 BLEU (%) 28 26 Baseline Portugese French Danish German Finnish 0 10 20 30 Phrase-table Entries (Mil.) 40 50 24 Figure 3: Combining probability-based filtering The filtering decision with a bridge language on a particular phrase pair is fixed: either to keep the entry or to discard it. It is difficult to adjust the system to work differently. 
However, as the triangulated filtering procedure does not consider probability distributions in the models, it is possible to further filter the tables according to the probabilities. The phrase pairs are associated with values computed from the given set of feature weights and sorted, so that we can remove any portions of the remain entries based on the values. Each generated table is used to translate the test set again. Figure 3 shows BLEU scores of the translation outputs produced with tables derived from the "EP-50" model with respect to their sizes. We also included the curve of probability-based filtering alone as the baseline. The difference between filtered tables at the same 134 size can be over 6 BLEU points, which is a remarkable advantage for the triangulated filtering approach always producing better translations. The curves of the triangulated filtered models are clearly much steeper than that of the naive pruned ones. Data in these filtered models are more compact than the original model before any filtering. The triangulated filtered phrase-tables contain more useful information than a normal phrase-table of the same size. The curves representing the triangulated filtering performance are always on the left of the original curves. We are able to use less than 6% of the original phrase table (40% of the table filtered with Finnish) to obtain translations with the same quality as the original. The extreme case, using only 1.4% of the original table, leads to a reasonable BLEU score, indicating that most of the output sentences should still be understandable. In this case, the overall size of the phrase table and the reordering table was less than 100 megabytes, potentially feasible for mobile devices, whereas the original models took nearly 12.5 gigabytes of disk space. 5.5 Different source language Bridge -- Dutch Spanish Danish EP-40 5.1G 26.92 562M 27.11 3.0G 27.28 505M 28.04 EP-50 6.5G 27.23 1.3G 28.14 3.6G 28.09 780M 28.21 Table 7: Filtered German-English systems (Size and BLEU) In addition to Spanish-English translation, we also conducted experiments on German-English translation. The results, shown in Table 7, appear consistent with the results of Spanish-English translation. Translations in most cases have performance close to the original unfiltered models, whereas the reduction in phrase-table size ranged from 40% to 85%. Meanwhile, translation speed has been increased up to 17%. Due to German's rich morphology, the unfiltered German-English models contain many more entries than the Spanish-English ones constructed from similar data sets. Unlike the Spanish-English models, the difference between "EP-40" and "EP50" was not significant. Neither was the difference between the impacts of the filtering in terms of translation quality. In addition, German and English are so dissimilar that none of the three bridge languages we chose turned out to be significantly superior. Spanish to English, Finnish, the most distinctive bridge language, appeared to be a more effective intermediate language which could remove more phrase pair entries while still improving the translation quality. Portuguese, the most close to the source language, always leads to a filtered model that produces the best translations. The selection of bridge languages has more obvious impact on the performance of our approach when the size of the model to filter was larger. The work gave one instance of the general approach described in Section 3. 
6 Conclusions

We highlighted one problem of state-of-the-art SMT systems that is generally neglected: the noise in the translation models. Accordingly, we proposed triangulated filtering methods to deal with this problem. We used data in a third language as evidence to locate the less probable items in the translation models, so as to obtain the intersection of the information extracted from multilingual data. Only the occurrences of complete phrases were taken into account; the probability distributions of the phrases have not been considered so far. Although the approach was fairly naive, our experiments showed it to be effective. The approaches were applied to SMT systems built with the Moses toolkit. The translation quality was improved by at least 1 BLEU point in all 15 cases (filtering 3 different models with 5 bridge languages), and the improvement can be as much as 2.25 BLEU points. It is also clear that the best translations were not linked to the largest translation models. We also sketched a simple extension to the triangulated filtering approach that further reduces the model size, which allows us to generate reasonable results with only 1.4% of the entries from the original table. The results varied for different bridge languages as well as different models.

There are several potential directions for continuing this work. The most straightforward one is to apply our approaches to more languages, such as Chinese and Arabic, and to incompatible corpora, for example, different segments of Europarl. The main focus of such experiments would be to verify the conclusions drawn in this paper.

Acknowledgments

This work was supported by the European Community through the EuroMatrix project, funded under the Sixth Framework Programme, and the EuroMatrix Plus project, funded under the Seventh Framework Programme for Research and Technological Development.

References

Yu Chen, Andreas Eisele, and Martin Kay. 2008. Improving Statistical Machine Translation Efficiency by Triangulation. In the 6th International Conference on Language Resources and Evaluation (LREC '08), May.
Trevor Cohn and Mirella Lapata. 2007. Machine Translation by Triangulation: Making Effective Use of Multi-Parallel Corpora. In the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic, June.
Karim Filali and Jeff Bilmes. 2005. Leveraging Multiple Languages to Improve Statistical MT Word Alignments. In IEEE Automatic Speech Recognition and Understanding (ASRU), Cancun, Mexico, November.
J. Howard Johnson, Joel Martin, George Foster, and Roland Kuhn. 2007. Improving Translation Quality by Discarding Most of the Phrasetable. In the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, June.
Martin Kay. 1997. The proper place of men and machines in language translation. Machine Translation, 12(1-2):3-23.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In the 45th Annual Meeting of the Association for Computational Linguistics (ACL), Prague, Czech Republic, June.
Philipp Koehn. 2003. Noun Phrase Translation. Ph.D. thesis, University of Southern California.
Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation.
In MT Summit 2005. Shankar Kumar, Franz Josef Och, and Wolfgang Macherey. 2007. Improving word alignment with bridge languages. In the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 42­50, Prague, Czech. Franz Josef Och and Hermann Ney. 2001. Statistical multi-source translation. In MT Summit VIII, Santiago de Compostela, Spain. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL '03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 160­167, Morristown, NJ, USA. Association for Computational Linguistics. Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2001. BLEU: a method for automatic evaluation of machine translation. In the 40th Annual Meeting on Association for Computational Linguistics, pages 311­318, Morristown, NJ, USA. Association for Computational Linguistics. Michel Simard. 1999. Text-translation alignment: Three languages are better than two. In EMNLP/VLC-99, College Park, MD, June. 136 Without a `doubt'? Unsupervised discovery of downward-entailing operators Cristian Danescu-Niculescu-Mizil, Lillian Lee, and Richard Ducott Department of Computer Science Cornell University Ithaca, NY 14853-7501 cristian@cs.cornell.edu, llee@cs.cornell.edu, rad47@cornell.edu Abstract An important part of textual inference is making deductions involving monotonicity, that is, determining whether a given assertion entails restrictions or relaxations of that assertion. For instance, the statement `We know the epidemic spread quickly' does not entail `We know the epidemic spread quickly via fleas', but `We doubt the epidemic spread quickly' entails `We doubt the epidemic spread quickly via fleas'. Here, we present the first algorithm for the challenging lexical-semantics problem of learning linguistic constructions that, like `doubt', are downward entailing (DE). Our algorithm is unsupervised, resource-lean, and effective, accurately recovering many DE operators that are missing from the handconstructed lists that textual-inference systems currently use. The following two examples help illustrate the particular type of inference that is the focus of this paper. 1. `We know the epidemic spread quickly' 2. `We doubt the epidemic spread quickly' A relaxation of `spread quickly' is `spread'; a restriction of it is `spread quickly via fleas'. From statement 1, we can infer the relaxed version, `We know the epidemic spread', whereas the restricted version, `We know the epidemic spread quickly via fleas', does not follow. But the reverse holds for statement 2: it entails the restricted version `We doubt the epidemic spread quickly via fleas', but not the relaxed version. The reason is that `doubt' is a downward-entailing operator;1 in other words, it allows one to, in a sense, "reason from sets to subsets" (van der Wouden, 1997, pg. 90). Downward-entailing operators are not restricted to assertions about belief or to verbs. For example, the preposition `without' is also downward entailing: from `The applicants came without payment or waivers' we can infer that all the applicants came without payment. (Contrast this with `with', which, like `know', is upward entailing.) In fact, there are many downward-entailing operators, encompassing many syntactic types; these include explicit negations like `no' and `never', but also many other terms, such as `refuse (to)', `preventing', `nothing', `rarely', and `too [adjective] to'. 
Synonyms for "downward entailing" include downwardmonotonic and monotone decreasing. Related concepts include anti-additivity, veridicality, and one-way implicatives. 1 1 Introduction Making inferences based on natural-language statements is a crucial part of true natural-language understanding, and thus has many important applications. As the field of NLP has matured, there has been a resurgence of interest in creating systems capable of making such inferences, as evidenced by the activity surrounding the ongoing sequence of "Recognizing Textual Entailment" (RTE) competitions (Dagan, Glickman, and Magnini, 2006; BarHaim, Dagan, Dolan, Ferro, Giampiccolo, Magnini, and Szpektor, 2006; Giampiccolo, Magnini, Dagan, and Dolan, 2007) and the AQUAINT knowledgebased evaluation project (Crouch, Saur´, and Fowler, i 2005). Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 137­145, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics 137 As the prevalence of these operators indicates and as van der Wouden (1997, pg. 92) states, downward entailment "plays an extremely important role in natural language" (van Benthem, 1986; Hoeksema, 1986; S´ nchez Valencia, 1991; Dowty, 1994; Maca Cartney and Manning, 2007). Yet to date, only a few systems attempt to handle the phenomenon in a general way, i.e., to consider more than simple direct negation (Nairn, Condoravdi, and Karttunen, 2006; MacCartney and Manning, 2008; Christodoulopoulos, 2008; Bar-Haim, Berant, Dagan, Greental, Mirkin, Shnarch, and Szpektor, 2008). These systems rely on lists of items annotated with respect to their behavior in "polar" (positive or negative) environments. The lists contain a relatively small number of downward-entailing operators, at least in part because they were constructed mainly by manual inspection of verb lists (although a few non-verbs are sometimes also included). We therefore propose to automatically learn downward-entailing operators2 -- henceforth DE operators for short -- from data; deriving more comprehensive lists of DE operators in this manner promises to substantially enhance the ability of textual-inference systems to handle monotonicity-related phenomena. Summary of our approach There are a number of significant challenges to applying a learningbased approach. First, to our knowledge there do not exist DE-operator-annotated corpora, and moreover, relevant types of semantic information are "not available in or deducible from any public lexical database" (Nairn et al., 2006). Also, it seems there is no simple test one can apply to all possible candidates; van der Wouden (1997, pg. 110) remarks, "As a rule of thumb, assume that everything that feels negative, and everything that [satisfies a condition described below], is monotone decreasing. This rule of thumb will be shown to be wrong as it stands; but We include superlatives (`tallest'), comparatives (`taller'), and conditionals (`if') in this category because they have nondefault (i.e., non-upward entailing) properties -- for instance, `he is the tallest father' does not entail `he is the tallest man'. Thus, they also require special treatment when considering entailment relations. In fact, there have been some attempts to unify these various types of non-upward entailing operators (von Fintel, 1999). We use the term downward entailing (narrowly-defined) (DE(ND)) when we wish to specifically exclude superlatives, comparatives, and conditionals. 
it sort of works, like any rule of thumb."

Our first insight into how to overcome these challenges is to leverage a finding from the linguistics literature, Ladusaw's (1980) hypothesis, which can be treated as a cue regarding the distribution of DE operators: it asserts that a certain class of lexical constructions known as negative polarity items (NPIs) can only appear in the scope of DE operators. Note that this hypothesis suggests that one can develop an unsupervised algorithm based simply on checking for co-occurrence with known NPIs. But there are significant problems with applying this idea in practice, including: (a) there is no agreed-upon list of NPIs; (b) terms can be ambiguous with respect to NPI-hood; and (c) many non-DE operators tend to co-occur with NPIs as well. To cope with these issues, we develop a novel unsupervised distillation algorithm that helps filter out the noise introduced by these problems. This algorithm is very effective: it is accurate and derives many DE operators that do not appear on pre-existing lists.

Contributions Our project draws a connection between the creation of textual entailment systems and linguistic inquiry regarding DE operators and NPIs, and thus relates to both language-engineering and linguistic concerns. To our knowledge, this work represents the first attempt to aid in the process of discovering DE operators, a task whose importance we have highlighted above. At the very least, our method can be used to provide high-quality raw materials to help human annotators create more extensive DE operator lists. In fact, while previous manual-classification efforts have mainly focused on verbs, we retrieve DE operators across multiple parts of speech. Also, although we discover many items (including verbs) that are not on pre-existing manually-constructed lists, the items we find occur frequently -- they are not somehow peculiar or rare. Our algorithm is surprisingly accurate given that it is quite resource- and knowledge-lean. Specifically, it relies only on Ladusaw's hypothesis as initial inspiration, a relatively short and arguably noisy list of NPIs, and a large unannotated corpus. It does not use other linguistic information -- for example, we do not use parse information, even though c-command relations have been asserted to play a key role in the licensing of NPIs (van der Wouden, 1997).

2 Method

We mentioned in the introduction some significant challenges to developing a machine-learning approach to discovering DE operators. The key insight we apply to surmount these challenges is that in the linguistics literature, it has been hypothesized that there is a strong connection between DE operators and negative polarity items (NPIs), which are terms that tend to occur in "negative environments". An example NPI is `anymore': one can say `We don't have those anymore' but not `*We have those anymore'. Specifically, we propose to take advantage of the seminal hypothesis of Ladusaw (1980, influenced by Fauconnier (1975), inter alia):

(Ladusaw) NPIs only appear within the scope of downward-entailing operators.

This hypothesis has been actively discussed, updated, and contested by multiple parties (Linebarger, 1987; von Fintel, 1999; Giannakidou, 2002, inter alia). It is not our intent to comment (directly) on its overall validity. Rather, we simply view it as a very useful starting point for developing computational tools to find DE operators -- indeed, even detractors of the theory have called it "impressively algorithmic" (Linebarger, 1987, pg. 361).

First, a word about scope. For Ladusaw's hypothesis, scope should arguably be defined in terms of c-command, immediate scope, and so on (von Fintel, 1999, pg. 100). But for simplicity and to make our approach as resource-lean as possible, we simply assume that potential DE operators occur to the left of NPIs (there are a few exceptions, such as with the NPI "for the life of me" (Hoeksema, 1993)), except that we ignore text to the left of any preceding commas or semi-colons as a way to enforce a degree of locality. For example, in both `By the way, we don't have plants anymore because they died' and `we don't have plants anymore', we look for DE operators within the sequence of words `we don't have plants'. We refer to such sequences in which we seek DE operators as NPI contexts.

Now, Ladusaw's hypothesis suggests that we can find DE operators by looking for words that tend to occur more often in NPI contexts than they occur overall. We formulate this as follows:

Assumption: For any DE operator d, F_byNPI(d) > F(d).

Here, F_byNPI(d) is the number of occurrences of d in NPI contexts divided by the number of words in NPI contexts (even if d occurs multiple times in a single NPI context, we count it only once; this way we "dampen the signal" of function words that can potentially occur multiple times in a single sentence), and F(x) refers to the number of occurrences of x relative to the number of words in the corpus. An additional consideration is that we would like to focus on the discovery of novel or non-obvious DE operators. Therefore, for a given candidate DE operator c, we compute F̂_byNPI(c): the value of F_byNPI(c) that results if we discard all NPI contexts containing a DE operator on a list of 10 well-known instances, namely, `not', `n't', `no', `none', `neither', `nor', `few', `each', `every', and `without'. (This list is based on the list of DE operators used by the RTE system presented in MacCartney and Manning (2008).) This yields the following scoring function:

S(c) := F̂_byNPI(c) / F(c).    (1)
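To make this scoring function concrete, here is a minimal sketch of how S(c) could be computed once NPI contexts have been extracted as token lists; the data structures are illustrative assumptions, and this is not the authors' implementation.

from collections import Counter

WELL_KNOWN_DE = {"not", "n't", "no", "none", "neither", "nor",
                 "few", "each", "every", "without"}

def de_scores(corpus_tokens, npi_contexts):
    """corpus_tokens: list of all tokens in the corpus (lowercased).
    npi_contexts: list of token lists, one per extracted NPI context.
    Returns S(c) = F_hat_byNPI(c) / F(c) for each candidate c."""
    corpus_counts = Counter(corpus_tokens)
    corpus_size = len(corpus_tokens)

    # Discard contexts containing a well-known DE operator (this gives F_hat).
    filtered = [ctx for ctx in npi_contexts if not WELL_KNOWN_DE & set(ctx)]
    context_words = sum(len(ctx) for ctx in filtered)

    by_npi = Counter()
    for ctx in filtered:
        by_npi.update(set(ctx))   # count each candidate at most once per context

    scores = {}
    for cand, n in by_npi.items():
        f_hat_by_npi = n / context_words
        f = corpus_counts[cand] / corpus_size
        scores[cand] = f_hat_by_npi / f
    return scores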
Distillation There are certain terms that are not DE operators, but nonetheless co-occur with NPIs as a side-effect of co-occurring with true DE operators themselves. For instance, the proper noun `Milken' (referring to Michael Milken, the so-called "junk-bond king") occurs relatively frequently with the DE operator `denies', and `vigorously' occurs frequently with DE operators like `deny' and `oppose'. We refer to terms like `Milken' and `vigorously' as "piggybackers", and address the piggybacker problem by leveraging the following intuition: in general, we do not expect to have two DE operators in the same NPI context. (One reason is that if two DE operators are composed, they ordinarily create a positive context, which would not license NPIs, although this is not always the case (Dowty, 1994).) One way to implement this would be to re-score the candidates in a winner-takes-all fashion: for each NPI context, reward only the candidate with the highest score S. However, such a method is too aggressive because it would force us to pick a single candidate even when there are several with relatively close scores -- and we know our score S is imperfect. Instead, we propose the following "soft" mechanism.
Each sentence distributes a "budget" of total score 1 among the candidates it contains according to the relative scores of those candidates; this works out to yield the following new distilled scoring function what it modifies and perhaps on whether there are degree adverbs pre-modifying it (Hoeksema, 1997). There are some ambiguous NPIs that we do retain due to their frequency. For example, `any' occurs both in a non-NPI "free choice" variant, as in `any idiot can do that', and in an NPI version. Although it is ambiguous with respect to NPI-hood, `any' is also a very valuable cue due to its frequency.7 Here is our NPI list: any at all give a damn do a thing bat an eye in weeks/ages/years drink a drop last/be/take long arrive/leave until would care/mind budge red cent but what give a shit eat a bite yet ever bother to lift a finger to speak of ° Sd pcq NPIcontexts p S pcq nppq N pcq , (2) where nppq cP p S pcq is an NPI-context normalizing factor and N pcq is the number of NPI contexts containing the candidate c. This way, plausible candidates that have high S scores relative to the other candidates in the sentence receive enhanced Sd scores. To put it another way: apparently plausible candidates that often appear in sentences with multiple good candidates (i.e., piggybackers) receive a low distilled score, despite a high initial score. Our general claim is that the higher the distilled score of a candidate, the better its chances of being a DE operator. Choice of NPIs Our proposed method requires access to a set of NPIs. However, there does not appear to be universal agreement on such a set. Lichte and Soehn (2007) mention some doubts regarding approximately 200 (!) of the items on a roughly 350item list of German NPIs (K¨ rschner, 1983). For u English, the "moderately complete"6 Lawler (2005) list contains two to three dozen items; however, there is also a list of English NPIs that is several times longer (von Bergen and von Bergen, 1993, written in German), and Hoeksema (1997) asserts that English should have hundreds of NPIs, similarly to French and Dutch. We choose to focus on the items on these lists that seem most likely to be effective cues for our task. Specifically, we select a subset of the Lawler NPIs, focusing mostly on those that do not have a relatively frequent non-NPI sense. An example discard is `much', whose NPI-hood depends on www-personal.umich.edu/jlawler/aue/ npi.html 6 ° 3 Experiments Our main set of evaluations focuses on the precision of our method at discovering new DE operators. We then briefly discuss evaluation of other dimensions. 3.1 Setup We applied our method to the entirety of the BLLIP (Brown Laboratory for Linguistic Information Processing) 1987­89 WSJ Corpus Release 1, available from the LDC (LDC2000T43). The 1,796,379 sentences in the corpus comprise 53,064 NPI contexts; after discarding the ones containing the 10 wellknown DE operators, 30,889 NPI contexts were left. To avoid sparse data problems, we did not consider candidates with very low frequency in the corpus (¤150 occurrences) or in the NPI contexts (¤10 occurrences). Methodology for eliciting judgments The obvious way to evaluate the precision of our algorithm is to have human annotators judge each output item as to whether it is a DE operator or not. However, there are some methodological issues that arise. 
First, if the judges know that every term they are rating comes from our system and that we are hoping that the algorithm extracts DE operators, they may be biased towards calling every item "DE" regardless of whether it actually is. We deal with this problem by introducing distractors -- items that are not produced by our algorithm, but are similar enough to not be easily identifiable as "fakes". Specifically, for each possible part of speech of each of our system's outputs c that exists in WordNet, we choose a distractor that is either in a "sibling" synset (a hyponym of c's hypernym) or an antonym. Thus, the distractors are highly related to the candidates. Note that they may in fact also be DE operators. The judges were made aware of the presence of a substantial number of distractors (about 70 for the set of top 150 outputs). This design choice did seem to help ensure that the judges carefully evaluated each item.

The second issue is that, as mentioned in the introduction, there does not seem to be a uniform test that judges can apply to all items to ascertain their DE-ness; but we do not want the judges to improvise excessively, since that can introduce undesirable randomness into their decisions. We therefore encouraged the judges to try to construct sentences wherein the arguments for candidate DE operators were drawn from a set of phrases and restricted replacements we specified (example: `singing' vs. `singing loudly'). However, improvisation was still required in a number of cases; for example, the candidate `act', as either a noun or a verb, cannot take `singing' as an argument.

The labels that the judges could apply were "DE(ND)" (downward entailing (narrowly-defined)), "superlative", "comparative", "conditional", "hard to tell", and "not-DE" (= none of the above). We chose this fine-grained sub-division because the second through fourth categories are all known to co-occur with NPIs. There is some debate in the linguistics literature as to whether they can be considered to be downward entailing, narrowly construed, or not (von Fintel, 1999, inter alia), but nonetheless, such operators call for special reasoning quite distinct from that required when dealing with upward entailing operators -- hence, we consider it a success when our algorithm identifies them.

Since monotonicity phenomena can be rather subtle, the judges engaged in a collaborative process. Judge A (the second author) annotated all items, but worked in batches of around 10 items. At the end of each batch, Judge B (the first author) reviewed Judge A's decisions, and the two consulted to resolve disagreements as far as possible.

One final remark regarding the annotation: some decisions still seem uncertain, since various factors such as context, Gricean maxims, and what should be presupposed come into play (for example, `X doubts the epidemic spread quickly' might be said to entail `X would doubt the epidemic spreads via fleas, presupposing that X thinks about the flea issue'). However, we take comfort in a comment by Eugene Charniak (personal communication) to the effect that if a word causes a native speaker to pause, that word is interesting enough to be included. And indeed, it seems reasonable that if a native speaker thinks there might be a sense in which a word can be considered downward entailing, then our system should flag it as a word that an RTE system should at least perhaps pass to a different subsystem for further analysis.
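The WordNet-based distractor selection can be sketched roughly as follows, using NLTK's WordNet interface (assuming the WordNet corpus is installed); this illustrates the sibling-synset/antonym idea rather than the authors' exact procedure, and the function name and defaults are hypothetical.

import random
from nltk.corpus import wordnet as wn

def pick_distractor(word, pos=wn.VERB):
    """Return a word related to `word`: an antonym if available,
    otherwise a 'sibling' (a hyponym of one of the word's hypernyms)."""
    synsets = wn.synsets(word, pos=pos)
    if not synsets:
        return None
    synset = synsets[0]
    # Prefer an antonym of one of the word's lemmas.
    for lemma in synset.lemmas():
        antonyms = lemma.antonyms()
        if antonyms:
            return antonyms[0].name()
    # Otherwise use a sibling synset's lemma.
    for hypernym in synset.hypernyms():
        siblings = [l.name() for h in hypernym.hyponyms() for l in h.lemmas()
                    if l.name().lower() != word.lower()]
        if siblings:
            return random.choice(siblings)
    return None

print(pick_distractor("doubt"))   # e.g. a verb related to, but distinct from, 'doubt'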
3.2 Precision Results

We now examine the 150 items that were most highly ranked by our system, which were subsequently annotated as just described. (For full system output that includes the unannotated items, see http://www.cs.cornell.edu/~cristian. We would welcome external annotation help.)

As shown in Figure 1a, which depicts precision at k for various values of k, our system performs very well. In fact, 100% of the first 60 outputs are DE, broadly construed. It is also interesting to note the increasing presence of instances that the judges found hard to categorize as we move further down the ranking. Of our 73 distractors, 46% were judged to be members of one of our goal categories. The fact that this percentage is substantially lower than our algorithm's precision at both 73 and 150 (the largest k we considered) confirms that our judges were not making random decisions. (We expect the percentage of DE operators among the distractors to be much higher than 0 because they were chosen to be similar to our system's outputs, and so can be expected to also be DE operators some fraction of the time.)

Figure 1: (a) Precision at k for k divisible by 10 up to k = 150. The bar divisions are, from the x-axis up, DE(ND) (blue, the largest); Superlatives/Conditionals/Comparatives (green, 2nd largest); and Hard (red, sometimes non-existent). For example, all of the first 10 outputs were judged to be either downward entailing (narrowly-defined) (8 of 10, or 80%) or in one of the related categories (20%). (b) Precision at k when the distillation step is omitted.

Table 1 shows the lemmas of just the DE(ND) operators that our algorithm placed in its top 150 outputs. (By listing lemmas, we omit variants of the same word, such as `doubting' and `doubted', to enhance readability; we omit superlatives, comparatives, and conditionals for brevity.) Most of these lemmas are new discoveries, in the sense of not appearing in Ladusaw's (1980) (implicit) enumeration of DE operators. Moreover, the lists of DE(ND) operators that are used by textual-entailment systems are significantly smaller than that depicted in Table 1; for example, MacCartney and Manning (2008) use only about a dozen (personal communication).

Table 3: Examples of words judged to be either not in one of our monotonicity categories of interest (not-DE) or hard to evaluate (Hard): firmly, fined, liable, notify, approve, cautioned, dismissed, fend, almost, ambitious, considers, detect, one-day, signal, remove, vowed.

Table 3 shows examples of the words in our system's top 150 outputs that are either clear mistakes or hard to evaluate. Some of these are due to idiosyncrasies of newswire text. For instance, we often see phrases like `biggest one-day drop in ...', where `one-day' piggybacks on superlatives, and `vowed' piggybacks on the DE operator `veto', as in the phrase `vowed to veto'.

Effect of distillation In order to evaluate the importance of the distillation process, we study how the results change when distillation is omitted (thus using the score function S from Equation 1 rather than S_d).
When comparing the results (summarized in Figure 1b) with those of the complete system (Figure 1a), we observe that the distillation indeed has the desired effect: the number of highly ranked words that are annotated as not-DE decreases after distillation. This results in an increase in precision at k ranging from 5% to 10% (depending on k), as can be observed by comparing the heights of the composite bars in the two figures. (The words annotated "hard" do not affect this increase in precision.) Importantly, this improvement does indeed seem to stem at least in part from the distillation process handling the piggybacking problem. To give just a few examples: `vigorously' is pushed down from rank 48 (undistilled scoring) to rank 126 (distilled scoring), `one-day' from 25th to 65th, `vowed' from 45th to 75th, and `Milken' from 121st to 350th.

absence of, absent from, anxious about, to avoid (L), to bar, barely, to block, cannot (L), compensate for, to decline, to defer, to deny (L), to deter, to discourage, to dismiss, to doubt (L), to eliminate, essential for, to exclude, to fail (L), hardly (L), to lack, innocent of, to minimize, never (L), nobody, nothing, to oppose, to postpone, to preclude, premature to, to prevent, to prohibit, rarely (L), to refrain from, to refuse (L), regardless, to reject, reluctant to (L), to resist, to rule out, skeptical, to suspend, to thwart, unable to, unaware of, unclear on, unlike, unlikely (L), unwilling to, to veto, wary of, warned that (L), whenever, withstand.
Table 1: The 55 lemmas for the 90 downward entailing (narrowly-defined) operators among our algorithm's top 150 outputs. (L) marks instances from Ladusaw's list. We have added function words (e.g., `to', `for') to indicate parts of speech or subcategorization; our algorithm does not discover multi-word phrases.

Dan is unlikely to sing. => Dan is unlikely to sing loudly.
Olivia compensates for eating by exercising. => Olivia compensates for eating late by exercising.
Ursula refused to sing or dance. => Ursula refused to sing.
Bob would postpone singing. => Bob would postpone singing loudly.
Talent is essential for singing. => Talent is essential for singing a ballad.
She will finish regardless of threats. => She will finish regardless of threats to my career.
Table 2: Example demonstrations that the underlined expressions (selected from Table 1) are DE operators: the sentences on the left entail those on the right. The reader might also find it helpful to reason in the opposite direction and see that these expressions are not upward entailing.

3.3 Other Results

It is natural to ask whether the (expected) decrease in precision at k is due to the algorithm assigning relatively low scores to DE operators, so that they do not appear in the top 150, or due to there being no more true DE operators to rank. We cannot directly evaluate our method's recall because no comprehensive list of DE operators exists. However, to get a rough impression, we can check how our system ranks the items in the largest list we are aware of, namely, the Ladusaw (implicit) list mentioned above. Of the 31 DE operator lemmas on this list (not including the 10 well-known DE operators), only 7 of those frequent enough to be considered by our algorithm are not in its top 150 outputs, and only
Remember that we only annotated the top 150 outputs; so, there may be many other DE operators between positions 150 and 300. Another way of evaluating our method would be to assess the effect of our newly discovered DE operators on downstream RTE system performance. There are two factors to take into account. First, the DE operators we discovered are quite prevalent in naturally occurring text11 : the 90 DE(ND) operators appearing in our algorithm's top 150 outputs occur in 111,456 sentences in the BLLIP corpus (i.e., in 6% of its sentences). Second, as previously mentioned, systems do already account for monotonicity to some extent -- but they are limited by the fact that their DE operator lexicons are restricted mostly to well-known instances; to take a concrete example with a publicly available RTE system: Nutcracker (Bos and Markert, 2006) correctly infers that `We did not know the disease spread' entails `We did not know the disease spread quickly' but it fails to inHowever, RTE competitions do not happen to currently stress inferences involving monotonicity. The reasons why are beyond the scope of this paper. 11 fer that `We doubt the disease spread' entails `We doubt the disease spread quickly'. So, systems can use monotonicity information but currently do not have enough of it; our method can provide them with this information, enabling them to handle a greater fraction of the large number of naturally occurring instances of this phenomenon than ever before. 4 Related work not already discussed Magnini (2008), in describing modular approaches to textual entailment, hints that NPIs may be used within a negation-detection sub-component. There is a substantial body of work in the linguistics literature regarding the definition and nature of polarity items (Polarity Items Bibliography). However, very little of this work is computational. There has been passing speculation that one might want to learn polarity-inverting verbs (Christodoulopoulos, 2008, pg. 47). There have also been a few projects on the discovery of NPIs, which is the converse of the problem we consider. Hoeksema (1997) discusses some of the difficulties with corpus-based determination of NPIs, including "rampant" poly- 143 semy and the problem of "how to determine independently which predicates should count as negative" -- a problem which our work addresses. Lichte and Soehn (Lichte, 2005; Lichte and Soehn, 2007) consider finding German NPIs using a method conceptually similar in some respects to our own, although again, their objective is the reverse of ours. Their discovery statistic for single-word NPIs is the ratio of within-licenser-clause occurrences to total occurrences, where, to enhance precision, the list of licensers was filtered down to a set of fairly unambiguous, easily-identified items. They do not consider distillation, which we found to be an important component of our DE-operator-detection algorithm. Their evaluation scheme, unlike ours, did not employ a bias-compensation mechanism. They did employ a collocation-detection technique to extend their list to multi-word NPIs, but our independent experiments with a similar technique (not reported here) did not yield good results. In practice, subcategorization is an important feature to capture. In Table 1, we indicate which subcategorizations are DE. An interesting extension of our work would be to try to automatically distinguish particular DE subcategorizations that are lexically apparent, e.g., `innocent' (not DE) vs. 
`innocent of' (as in `innocent of burglary', DE). Our project provides a connection (among many) between the creation of textual entailment systems (the domain of language engineers) and the characterization of DE operators (the subject of study and debate among linguists). The prospect that our method might potentially eventually be refined in such a way so as to shed at least a little light on linguistic questions is a very appealing one, although we cannot be certain that any progress will be made on that front. Acknowledgments We thank Roy Bar-Haim, Cleo Condoravdi, and Bill MacCartney for sharing their systems' lists and information about their work with us; Mats Rooth for helpful conversations; Alex Niculescu-Mizil for technical assistance; and Eugene Charniak for reassuring remarks. We also thank Marisa Ferrara Boston, Claire Cardie, Zhong Chen, Yejin Choi, Effi Georgala, Myle Ott, Stephen Purpura, and Ainur Yessenalina at Cornell University, the UT-Austin NLP group, Roy Bar-Haim, Bill MacCartney, and the anonymous reviewers for for their comments on this paper. This paper is based upon work supported in part by DHS grant N0014-07-1-0152, National Science Foundation grant No. BCS-0537606, a Yahoo! Research Alliance gift, a CU Provost's Award for Distinguished Scholarship, and a CU Institute for the Social Sciences Faculty Fellowship. Any opinions, findings, and conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views or official policies, either expressed or implied, of any sponsoring institutions, the U.S. government, or any other entity. 5 Conclusions and future work To our knowledge, this work represents the first attempt to discover downward entailing operators. We introduced a unsupervised algorithm that is motivated by research in linguistics but employs simple distributional statistics in a novel fashion. Our algorithm is highly accurate and discovers many reasonable DE operators that are missing from pre-existing manually-built lists. Since the algorithm is resource-lean -- requiring no parser or tagger but only a list of NPIs -- it can be immediately applied to languages where such lists exist, such as German and Romanian (Trawi´ ski and n Soehn, 2008). On the other hand, although the results are already quite good for English, it would be interesting to see what improvements could be gained by using more sophisticated syntactic information. For languages where NPI lists are not extensive, one could envision applying an iterative co-learning approach: use the newly-derived DE operators to infer new NPIs, and then discover even more new DE operators given the new NPI list. (For English, our initial attempts at bootstrapping from our initial NPI list on the BLLIP corpus did not lead to substantially improved results.) References Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. The second PASCAL Recognising Textual Entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, 2006. Roy Bar-Haim, Jonathan Berant, Ido Dagan, Iddo Greental, Shachar Mirkin, Eyal Shnarch, and Idan Szpektor. Efficient semantic deduction and approximate matching over compact parse forests. In Proceedings of TAC, 2008. Johan Bos and Katja Markert. Recognising textual entailment with robust logical inference. In Qui~ onero n Candela, Dagan, Magnini, and d'Alch´ Buc (2006), e pages 404­426. Christos Christodoulopoulos. 
Creating a natural logic inference system with combinatory categorial grammar. Master's thesis, University of Edinburgh, 2008. 144 Dick Crouch, Roser Saur´, and Abraham Fowler. i AQUAINT pilot knowledge-based evaluation: Annotation guidelines. http://www2. parc.com/istl/groups/nltt/papers/ aquaint kb pilot evaluation guide.pdf, 2005. Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL Recognising Textual Entailment challenge. In Qui~ onero Candela et al. (2006), pages 177­190. n David Dowty. The role of negative polarity and concord marking in natural language reasoning. In Mandy Harvey and Lynn Santelmann, editors, Proceedings of SALT IV, pages 114­144, Ithaca, New York, 1994. Cornell University. Gilles Fauconnier. Polarity and the scale principle. In Proceedings of the Chicago Linguistic Society (CLS), pages 188­199, 1975. Reprinted in Javier GutierrezRexach (ed.), Semantics: Critical Concepts in Linguistics, 2003. Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third PASCAL Recognizing Textual Entailment challenge. In Proceedings of the ACLPASCAL Workshop on Textual Entailment and Paraphrasing, pages 1­9, 2007. URL http://www. aclweb.org/anthology/W/W07/W07-1401. Anastasia Giannakidou. Licensing and sensitivity in polarity items: from downward entailment to nonveridicality. In Proceedings of the Chicago Linguistic Society (CLS), 2002. Jack Hoeksema. Monotonicity phenomena in natural language. Linguistic Analysis, 16:25­40, 1986. Jack Hoeksema. As (of) yet. Appears in Language and Cognition 3, the 1992 yearbook of the research group for theoretical and experimental linguistics of the University of Groningen, 1993. http://www.let. rug.nl/hoeksema/asofyet.pdf. Jack Hoeksema. Corpus study of negative polarity items. IV-V Jornades de corpus linguistics 1996-1997, 1997. http://odur.let.rug.nl/ hoeksema/docs/barcelona.html. Wilfried K¨ rschner. Studien zur Negation im Deutschen. u Narr, 1983. William A. Ladusaw. Polarity Sensitivity as Inherent Scope Relations. Garland Press, New York, 1980. Ph.D. thesis date 1979. John Lawler. Negation and NPIs. http://www. umich.edu/jlawler/NPIs.pdf, 2005. Version of 10/29/2005. Timm Lichte. Corpus-based acquisition of complex negative polarity items. In ESSLLI Student Session, 2005. Timm Lichte and Jan-Philipp Soehn. The retrieval and classification of Negative Polarity Items using statistical profiles. In Sam Featherston and Wolfgang Sternefeld, editors, Roots: Linguistics in Search of its Evidential Base, pages 249­266. Mouton de Gruyter, 2007. Marcia Linebarger. Negative polarity and grammatical representation. Linguistics and philosophy, 10:325­ 387, 1987. Bill MacCartney and Christopher D. Manning. Natural logic for textual inference. In Proceedings of the ACLPASCAL Workshop on Textual Entailment and Paraphrasing, pages 193­200, 2007. Bill MacCartney and Christopher D. Manning. Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 521­528, Manchester, UK, August 2008. Coling 2008 Organizing Committee. URL http://www.aclweb.org/anthology/ C08-1066. Bernardo Magnini. Slides for a presentation entitled "Semantic Knowledge for Textual Entailment". Symposium on Semantic Knowledge Discovery, Organization and Use, New York University, November 14 and 15, 2008. Rowan Nairn, Cleo Condoravdi, and Lauri Karttunen. Computing relative polarity for textual inference. 
In Proceedings of Inference in Computational Semantics (ICoS), 2006. Polarity Items Bibliography. The polarity items bibliography. http://www.sfb441. uni-tuebingen.de/a5/pib/XML2HTML/ list.html, 2008. Maintenance guaranteed only through December 2008. Joaquin Qui~ onero Candela, Ido Dagan, Bernardo n Magnini, and Florence d'Alch´ Buc, editors. Mae chine Learning Challenges, Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, First PASCAL Machine Learning Challenges Workshop, MLCW 2005, Southampton, UK, April 11-13, 2005, Revised Selected Papers, volume 3944 of Lecture Notes in Computer Science (LNCS), 2006. Springer. V´ctor S´ nchez Valencia. Studies on natural logic and i a categorial grammar. PhD thesis, University of Amsterdam, 1991. Beata Trawi´ ski and Jan-Philipp Soehn. A Multilingual n Database of Polarity Items. In Proceedings of LREC 2008, May 28­30, Marrakech, Morocco, 2008. Johan van Benthem. Essays in Logical Semantics. Reidel, Dordrecht, 1986. Ton van der Wouden. Negative contexts: Collocation, polarity and multiple negation. Routledge, 1997. Anke von Bergen and Karl von Bergen. Negative Polarit¨ t im Englischen. Gunter Narr, 1993. List a extracted and compiled by Manfred Sailer, 2008, http://www.sfs.uni-tuebingen.de/fr/ esslli/08/byday/english-npi.pdf. Kai von Fintel. NPI licensing, Strawson entailment, and context dependency. Journal of Semantics, 16:97­148, 1999. 145 The Role of Implicit Argumentation in Nominal SRL Matt Gerber Dept. of Computer Science Michigan State University gerberm2@msu.edu Joyce Y. Chai Dept. of Computer Science Michigan State University jchai@cse.msu.edu Adam Meyers Dept. of Computer Science New York University meyers@cs.nyu.edu Abstract Nominals frequently surface without overtly expressed arguments. In order to measure the potential benefit of nominal SRL for downstream processes, such nominals must be accounted for. In this paper, we show that a state-of-the-art nominal SRL system with an overall argument F1 of 0.76 suffers a performance loss of more than 9% when nominals with implicit arguments are included in the evaluation. We then develop a system that takes implicit argumentation into account, improving overall performance by nearly 5%. Our results indicate that the degree of implicit argumentation varies widely across nominals, making automated detection of implicit argumentation an important step for nominal SRL. company] [P redicate distributed] [Arg2 to the partnership's unitholders]. The NomBank corpus contains a similar instance of the deverbal nominalization distribution: (2) Searle will give [Arg0 pharmacists] [Arg1 brochures] [Arg1 on the use of prescription drugs] for [P redicate distribution] [Location in their stores]. This instance demonstrates the annotation of split arguments (Arg1) and modifying adjuncts (Location), which are also annotated in PropBank. In cases where a nominal has a verbal counterpart, the interpretation of argument positions Arg0-Arg5 is consistent between the two corpora. In addition to deverbal (i.e., event-based) nominalizations, NomBank annotates a wide variety of nouns that are not derived from verbs and do not denote events. An example is given below of the partitive noun percent: (3) Hallwood owns about 11 [P redicate %] [Arg1 of Integra]. In this case, the noun phrase headed by the predicate % (i.e., "about 11% of Integra") denotes a fractional part of the argument in position Arg1. 
Since NomBank's release, a number of studies have applied verbal SRL techniques to the task of nominal SRL. For example, Liu and Ng (2007) reported an argument F1 of 0.7283. Although this result is encouraging, it does not take into account nominals that surface without overt arguments. Consider the following example: (4) The [P redicate distribution] represents [N P available cash flow] [P P from the partnership] [P P between Aug. 1 and Oct. 31]. 1 Introduction In the past few years, a number of studies have focused on verbal semantic role labeling (SRL). Driven by annotation resources such as FrameNet (Baker et al., 1998) and PropBank (Palmer et al., 2005), many systems developed in these studies have achieved argument F1 scores near 80% in large-scale evaluations such as the one reported by Carreras and M` rquez (2005). a More recently, the automatic identification of nominal argument structure has received increased attention due to the release of the NomBank corpus (Meyers, 2007a). NomBank annotates predicating nouns in the same way that PropBank annotates predicating verbs. Consider the following example of the verbal predicate distribute from the PropBank corpus: (1) Freeport-McMoRan Energy Partners will be liquidated and [Arg1 shares of the new 146 Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 146­154, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics As in (2), distribution in (4) has a noun phrase and multiple prepositional phrases in its environment, but not one of these constituents is an argument to distribution in (4); rather, any arguments are implicitly supplied by the surrounding discourse. As described by Meyers (2007a), instances such as (2) are called "markable" because they contain overt arguments, and instances such as (4) are called "unmarkable" because they do not. In the NomBank corpus, only markable instances have been annotated. Previous evaluations (e.g., those by Jiang and Ng (2006) and Liu and Ng (2007)) have been based on markable instances, which constitute 57% of all instances of nominals from the NomBank lexicon. In order to use nominal SRL systems for downstream processing, it is important to develop and evaluate techniques that can handle markable as well as unmarkable nominal instances. To address this issue, we investigate the role of implicit argumentation for nominal SRL. This is, in part, inspired by the recent CoNLL Shared Task (Surdeanu et al., 2008), which was the first evaluation of syntactic and semantic dependency parsing to include unmarkable nominals. In this paper, we extend this task to constituent parsing with techniques and evaluations that focus specifically on implicit argumentation in nominals. We first present our NomBank SRL system, which improves the best reported argument F1 score in the markable-only evaluation from 0.7283 to 0.7630 using a single-stage classification approach. We show that this system, when applied to all nominal instances, achieves an argument F1 score of only 0.6895, a loss of more than 9%. We then present a model of implicit argumentation that reduces this loss by 46%, resulting in an F1 score of 0.7235 on the more complete evaluation task. In our analyses, we find that SRL performance varies widely among specific classes of nominals, suggesting interesting directions for future work. 
2 Related work primarily on relations that hold between nominalizations and their arguments, whereas the SemEval task focuses on a range of semantic relations, many of which are not applicable to nominal argument structure. Early work in identifying the argument structure of deverbal nominalizations was primarily rulebased, using rule sets to associate syntactic constituents with semantic roles (Dahl et al., 1987; Hull and Gomez, 1996; Meyers et al., 1998). Lapata (2000) developed a statistical model to classify modifiers of deverbal nouns as underlying subjects or underlying objects, where subject and object denote the grammatical position of the modifier when linked to a verb. FrameNet and NomBank have facilitated machine learning approaches to nominal argument structure. Gildea and Jurafsky (2002) presented an early FrameNet-based SRL system that targeted both verbal and nominal predicates. Jiang and Ng (2006) and Liu and Ng (2007) have tested the hypothesis that methodologies and representations used in PropBank SRL (Pradhan et al., 2005) can be ported to the task of NomBank SRL. These studies report argument F1 scores of 0.6914 and 0.7283, respectively. Both studies also investigated the use of features specific to the task of NomBank SRL, but observed only marginal performance gains. NomBank argument structure has also been used in the recent CoNLL Shared Task on Joint Parsing of Syntactic and Semantic Dependencies (Surdeanu et al., 2008). In this task, systems were required to identify syntactic dependencies, verbal and nominal predicates, and semantic dependencies (i.e., arguments) for the predicates. For nominals, the best semantic F1 score was 0.7664 (Surdeanu et al., 2008); however this score is not directly comparable to the NomBank SRL results of Liu and Ng (2007) or the results in this paper due to a focus on different aspects of the problem (see the end of section 5.2 for details). Nominal SRL is related to nominal relation interpretation as evaluated in SemEval (Girju et al., 2007). Both tasks identify semantic relations between a head noun and other constituents; however, the tasks focus on different relations. Nominal SRL focuses 147 3 NomBank SRL Given a nominal predicate, an SRL system attempts to assign surrounding spans of text to one of 23 classes representing core arguments, adjunct arguments, and the null or non-argument. Similarly to verbal SRL, this task is traditionally formulated as a two-stage classification problem over nodes in the syntactic parse tree of the sentence containing the predicate.1 In the first stage, each parse tree node is assigned a binary label indicating whether or not it is an argument. In the second stage, argument nodes are assigned one of the 22 non-null argument types. Spans of text subsumed by labeled parse tree nodes constitute arguments of the predication. 3.1 An improved NomBank SRL baseline To investigate the effects of implicit argumentation, we first developed a system based on previous markable-only approaches. Our system follows many of the traditions above, but differs in the following ways. First, we replace the standard twostage pipeline with a single-stage logistic regression model2 that predicts arguments directly. Second, we model incorporated arguments (i.e., predicates that are also arguments) with a simple maximum likelihood model that predicts the most likely argument label for a predicate based on counts from the training data. 
Third, we use the following heuristics to resolve argument conflicts: (1) If two arguments overlap, the one with the higher probability is kept. (2) If two non-overlapping arguments are of the same type, the one with the higher probability is kept unless the two nodes are siblings, in which case both are kept. Heuristic (2) accounts for split argument constructions. Our NomBank SRL system uses features that are selected with a greedy forward search strategy similar to the one used by Jiang and Ng (2006). The top half of Table 2 (next page) lists the selected argument features.3 We extracted training nodes from sections 2-21 of NomBank, used section 24 for development and section 23 for testing. All parse trees were generated by Charniak's re-ranking syntactic parser (Charniak and Johnson, 2005). Following the evaluation methodology used by Jiang and Ng (2006) and Liu and Ng (2007), we obtained sigThe syntactic parse can be based on ground-truth annotation or derived automatically, depending on the evaluation. 2 We use LibLinear (Fan et al., 2008). 3 For features requiring the identification of support verbs, we use the annotations provided in NomBank. Preliminary experiments show a small loss when using automatic support verb identification. 1 Jiang and Ng (2006) Liu and Ng (2007) This paper Dev. F1 0.6677 0.7454 Testing F1 0.6914 0.7283 0.7630 Table 1: Markable-only NomBank SRL results for argument prediction using automatically generated parse trees. The f-measure statistics were calculated by aggregating predictions across all classes. "-" indicates that the result was not reported. Markable-only 0.7955 0.7330 0.7630 All-token 0.6577 0.7247 0.6895 % loss -17.32 -1.13 -9.63 P R F1 Table 3: Comparison of the markable-only and alltoken evaluations of the baseline argument model. nificantly better results, as shown in Table 1 above.4 3.2 The effect of implicit nominal arguments The presence of implicit nominal arguments presents challenges that are not taken into account by the evaluation described above. To assess the impact of implicit arguments, we evaluated our NomBank SRL system over each token in the testing section. The system attempts argument identification for all singular and plural nouns that have at least one annotated instance in the training portion of the NomBank corpus (morphological variations included). Table 3 gives a comparison of the results from the markable-only and all-token evaluations. As can be seen, assuming that all known nouns take overt arguments results in a significant performance loss. This loss is due primarily to a drop in precision caused by false positive argument predictions made for nominals with implicit arguments. 4 Accounting for implicit arguments in nominal SRL A natural solution to the problem described above is to first distinguish nominals that bear overt arguments from those that do not. We treat this As noted by Carreras and M` rquez (2005), the discrepancy a between the development and testing results is likely due to poorer syntactic parsing performance on the development section. 4 148 # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 ... 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 Description 12 & parse tree path from n to pred Position of n relative to pred & parse tree path from n to pred First word subsumed by n 12 & position of n relative to pred 12 & 14 Head word of n's parent Last word subsumed n n's syntactic category & length of parse tree path from n to pred First word of n's right sibling Production rule that expands the parent of pred Head word of the right-most NP in n if n is a PP Stem of pred Parse tree path from n to the lowest common ancestor of n and pred Head word of n 12 & n's syntactic category Production rule that expands n's parent Parse tree path from n to the nearest support verb Last part of speech (POS) subsumed by n Production rule that expands n's left sibling Head word of n, if the parent of n is a PP The POS of the head word of the right-most NP under n if n is a PP Features 22-31 are available upon request n's ancestor subcategorization frames (ASF) (see section 4) n's word Syntactic category of n's right sibling Parse tree paths from n to each support verb Last word of n's left sibling Parse tree path from n to previous nominal, with lexicalized source (see section 4) Last word of n's right sibling Production rule that expands n's left sibling Syntactic category of n PropBank markability score (see section 4) Parse tree path from n to previous nominal, with lexicalized source and destination Whether or not n is followed by PP Parse tree path from n to previous nominal, with lexicalized destination Head word of n's parent Whether or not n surfaces before a passive verb First word of n's left sibling Parse tree path from n to closest support verb, with lexicalized destination Whether or not n is a head Head word of n's right sibling Production rule that expands n's parent Parse tree paths from n to all support verbs, with lexicalized destinations First word of n's right sibling Head word of n's left sibling If n is followed by a PP, the head of that PP's object Parse tree path from n to previous nominal Token distance from n to previous nominal Production rule that expands n's grandparent N * S * Argument features * * * * * * * * 0 * 3 * * * * * * * * * * * * * * * * * * * * * * * * Nominal features * * * * Table 2: Features, sorted by gain in selection algorithm. & denotes concatenation. The last two columns indicate (N)ew features (not used in Liu and Ng (2007)) and features (S)hared by the argument and nominal models. 149 as a binary classification task over token nodes. Once a nominal has been identified as bearing overt arguments, it is processed with the argument identification model developed in the previous section. To classify nominals, we use the features shown in the bottom half of Table 2, which were selected with the same algorithm used for the argument classification model. As shown by Table 2, the sets of features selected for argument and nominal classification are quite different, and many of the features used for nominal classification have not been previously used. Below, we briefly explain a few of these features. Ancestor subcategorization frames (ASF) As shown in Table 2, the most informative feature is ASF. For a given token t, ASF is actually a set of sub-features, one for each parse tree node above t. Each sub-feature is indexed (i.e., named) by its distance from t. The value of an ASF sub-feature is the production rule that expands the corresponding node in the tree. 
An ASF feature with two sub-features is depicted below for the token "sale":

  VP: ASF2 = VP → V NP
    V (made)
    NP: ASF1 = NP → Det N
      Det (a)
      N (sale)

Parse tree path lexicalization: A lexicalized parse tree path is one in which surface tokens from the beginning or end of the path are included in the path. This is a finer-grained version of the traditional parse tree path that captures the joint behavior of the path and the tokens it connects. For example, in the tree above, the path from "sale" to "made" with a lexicalized source and destination would be sale:N ↑ NP ↑ VP ↓ V:made. Lexicalization increases sparsity; however, it is often preferred by the feature selection algorithm, as shown in the bottom half of Table 2.

PropBank markability score: This feature is the probability that the context (±5 words) of a deverbal nominal is generated by a unigram language model trained over the PropBank argument words for the corresponding verb. Entities are normalized to their entity type using BBN's IdentiFinder, and adverbs are normalized to their related adjective using the ADJADV dictionary provided by NomBank. The normalization of adverbs is motivated by the fact that adverbial modifiers of verbs typically have a corresponding adjectival modifier for deverbal nominals.

5 Evaluation results

Our evaluation methodology reflects a practical scenario in which the nominal SRL system must process each token in a sentence. The system cannot safely assume that each token bears overt arguments; rather, this decision must be made automatically. In section 5.1, we present results for the automatic identification of nominals with overt arguments. Then, in section 5.2, we present results for the combined task in which nominal classification is followed by argument identification.

5.1 Nominal classification

Following standard practice, we train the nominal classifier over NomBank sections 2-21 using LibLinear and automatically generated syntactic parse trees. The prediction threshold is set to the value that maximizes the nominal F1 score on the development section (24), and the resulting model is tested over section 23. For comparison, we implemented the following simple classifiers.

Table 4: Evaluation results for identifying nominals with explicit arguments.

             Precision   Recall    F1
Baseline     0.5555      0.9784    0.7086
MLE          0.6902      0.8903    0.7776
LibLinear    0.8989      0.8927    0.8958

Baseline nominal classifier: Classifies a token as overtly bearing arguments if it is a singular or plural noun that is markable in the training data. As shown in Table 4, this classifier achieves nearly perfect recall.[5]

[5] Recall is less than 100% due to (1) part-of-speech errors from the syntactic parser and (2) nominals that were not annotated in the training data but exist in the testing data.

MLE nominal classifier: Operates similarly to the baseline classifier, but also produces a score for the classification. The value of the score is equal to the probability that the nominal bears overt arguments, as observed in the training data. A prediction threshold is imposed on this score as determined by the development data (t = 0.23). As shown by Table 4, this exchanges recall for precision and leads to a significant increase in the overall F1 score.
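A minimal sketch of the MLE nominal classifier described above, assuming precomputed per-noun counts gathered from the NomBank training sections; the count dictionaries and example numbers are hypothetical, while the 0.23 threshold is the development-tuned value quoted in the text.

```python
class MLENominalClassifier:
    """Score = empirical probability that this noun type was markable in training."""
    def __init__(self, markable_counts, total_counts, threshold=0.23):
        self.markable_counts = markable_counts   # noun -> times seen with overt arguments
        self.total_counts = total_counts         # noun -> times seen at all
        self.threshold = threshold               # tuned on the development section

    def score(self, noun):
        total = self.total_counts.get(noun, 0)
        return self.markable_counts.get(noun, 0) / total if total else 0.0

    def bears_overt_arguments(self, noun):
        return self.score(noun) >= self.threshold

# Hypothetical counts: "sale" markable 70 of 100 times, "bird" only 2 of 50 times.
clf = MLENominalClassifier({"sale": 70, "bird": 2}, {"sale": 100, "bird": 50})
print(clf.bears_overt_arguments("sale"), clf.bears_overt_arguments("bird"))  # True False
```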
Figure 1a: Distribution of nominals. Each interval on the x-axis denotes a set of nominals that are markable between (x-5)% and x% of the time in the training data. The y-axis denotes the percentage of all nominal instances in TreeBank that is occupied by nominals in the interval. Quartiles are marked below the intervals. For example, quartile 0.25 indicates that one quarter of all nominal instances are markable 35% of the time or less.

Figure 1b: Nominal classification performance with respect to the distribution in Figure 1a (baseline and LibLinear classifiers). The y-axis denotes the combined F1 for nominals in the interval.

Figure 1c: All-token argument classification performance with respect to the distribution in Figure 1a (baseline, MLE, and LibLinear configurations). The y-axis denotes the combined F1 for nominals in the interval.

Figure 1: Evaluation results with respect to the distribution of nominals in TreeBank.

The last row in Table 4 shows the results for the LibLinear nominal classifier, which significantly outperforms the others, achieving balanced precision and recall scores near 0.9. In addition, it is able to recover from part-of-speech errors because it does not filter out non-noun instances; rather, it combines part-of-speech information with other lexical and syntactic features to classify nominals.

Interesting observations can be made by grouping nominals according to the probability with which they are markable in the corpus. Figure 1a gives the overall distribution of markable nominals in the training data. As shown, 50% of nominal instances are markable only 65% of the time or less, making nominal classification an important first step. Using this view of the data, Figure 1b presents the overall F1 scores for the baseline and LibLinear nominal classifiers.[6] As expected, gains in nominal classification diminish as nominals become more overtly associated with arguments. Furthermore, nominals that are rarely markable (i.e., those in interval 0.05) remain problematic due to a lack of positive training instances and the unbalanced nature of the classification task.

5.2 Combined nominal-argument classification

We now turn to the task of combined nominal-argument classification. In this task, systems must first identify nominals that bear overt arguments. We evaluated three configurations based on the nominal classifiers from the previous section. Each configuration uses the argument classification model from section 3. As shown in Table 3, overall argument classification F1 suffers a loss of more than 9% under the assumption that all known nouns bear overt arguments. This corresponds precisely to using the baseline nominal classifier in the combined nominal-argument task. The MLE nominal classifier is able to reduce this loss by 25%, to an F1 of 0.7080. The LibLinear nominal classifier reduces this loss by 46%, resulting in an overall argument classification F1 of 0.7235. This improvement is the direct result of filtering out nominal instances that do not bear overt arguments.
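The combined configuration just described reduces to a simple two-stage loop: a nominal classifier first filters tokens, and the argument model of section 3 is run only on tokens predicted to bear overt arguments. The sketch below assumes hypothetical `nominal_classifier.predict` and `argument_model.label_arguments` interfaces; it illustrates the control flow only, not the authors' implementation.

```python
def label_sentence(tokens, parse, nominal_classifier, argument_model, threshold):
    """Two-stage nominal SRL: detect argument-bearing nominals, then label arguments."""
    predictions = []
    for i, token in enumerate(tokens):
        # Stage 1: does this token bear overt arguments?
        score = nominal_classifier.predict(tokens, parse, i)
        if score < threshold:
            continue                       # implicit (or no) arguments: skip labeling
        # Stage 2: identify and classify arguments for the accepted nominal.
        arguments = argument_model.label_arguments(tokens, parse, i)
        predictions.append((i, arguments))  # e.g. (4, [("ARG0", node), ("ARG1", node)])
    return predictions
```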
Similarly to the nominal evaluation, we can view argument classification performance with respect to the probability that a nominal bears overt arguments. This is shown in Figure 1c for the three configurations. The configuration using the MLE nominal classifier obtains an argument F1 of zero for nominals below its prediction threshold. Compared to the baseline nominal classifier, the LibLinear classifier achieves argument classification gains as large as 150.94% (interval 0.05), with an average gain of 52.87% for intervals 0.05 to 0.4. As with nominal classification, argument classification gains diminish for nominals that express arguments more overtly: we observe an average gain of only 2.15% for intervals 0.45 to 1.00. One possible explanation for this is that the argument prediction model has substantially more training data for the nominals in intervals 0.45 to 1.00. Thus, even if the nominal classifier makes a false positive prediction in the 0.45 to 1.00 interval range, the argument model may correctly avoid labeling any arguments.

Table 5: Nominal and argument F1 scores for deverbal, deverbal-like, and other nominals in the all-token evaluation.

                        Nominals                          Arguments
                Baseline   MLE      LibLinear     Baseline   MLE      LibLinear
Deverbal        0.7975     0.8298   0.9261        0.7059     0.7206   0.7282
Deverbal-like   0.6789     0.7332   0.8826        0.6738     0.6641   0.7178
Other           0.6757     0.7486   0.8905        0.7454     0.7675   0.7847

As noted in section 2, these results are not directly comparable to the results of the recent CoNLL Shared Task (Surdeanu et al., 2008). This is due to the fact that the semantic labeled F1 in the Shared Task combines predicate and argument predictions into a single score. The same combined F1 score for our best two-stage nominal SRL system (logistic regression nominal and argument models) is 0.7806; however, this result is not precisely comparable because we do not identify the predicate role set as required by the CoNLL Shared Task.

5.3 NomLex-based analysis of results

As demonstrated in section 1, NomBank annotates many classes of deverbal and non-deverbal nominals, which have been categorized on syntactic and semantic bases in NomLex-PLUS (Meyers, 2007b). To help understand what types of nominals are particularly affected by implicit argumentation, we further analyzed performance with respect to these classes. Figure 2a shows the distribution of nominals across classes defined by the NomLex resource. As shown in Figure 2b, many of the most frequent classes exhibit significant gains. For example, the classification of partitive nominals (13% of all nominal instances) with the LibLinear classifier results in gains of 55.45% and 33.72% over the baseline and MLE classifiers, respectively. For the 5 most common classes, which constitute 82% of all nominal instances, we observe average gains of 27.47% and 19.30% over the baseline and MLE classifiers, respectively.

[6] Baseline and MLE are identical above the MLE threshold.

Figure 2a: Distribution of nominals across the NomLex classes. The y-axis denotes the percentage of all nominal instances that is occupied by nominals in the class.

Figure 2b: Nominal classification performance with respect to the NomLex classes in Figure 2a (baseline, MLE, and LibLinear classifiers). The y-axis denotes the combined F1 for nominals in the class.

Figure 2: Evaluation results with respect to NomLex classes.
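The interval-based breakdowns behind Figures 1b and 1c amount to bucketing each nominal type by its observed markable probability and aggregating predictions per bucket. The sketch below assumes per-instance counts of true positives, false positives, and false negatives have already been computed; the 5%-wide intervals follow the figures, and the input format is an assumption made for this illustration.

```python
import math
from collections import defaultdict

def interval_of(prob, width=0.05):
    """Upper edge of the (x - 5%, x] interval containing prob, as in Figure 1a."""
    return round(max(width, math.ceil(round(prob / width, 9)) * width), 2)

def f1_by_interval(instances):
    """instances: iterable of (markable_prob, true_pos, false_pos, false_neg) tuples."""
    buckets = defaultdict(lambda: [0, 0, 0])          # interval -> [tp, fp, fn]
    for prob, tp, fp, fn in instances:
        counts = buckets[interval_of(prob)]
        counts[0] += tp; counts[1] += fp; counts[2] += fn
    scores = {}
    for interval, (tp, fp, fn) in sorted(buckets.items()):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[interval] = (2 * precision * recall / (precision + recall)
                            if precision + recall else 0.0)
    return scores
```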
Table 5 separates nominal and argument classification results into sets of deverbal (NomLex class nom), deverbal-like (NomLex class nom-like), and all other nominalizations. A deverbal-like nominal is closely related to some verb, although not morphologically. For example, the noun "accolade" shares argument interpretation with "award", but the two are not morphologically related. As shown by Table 5, nominal classification tends to be easier - and argument classification harder - for deverbals when compared to other types of nominals. The difference in argument F1 between deverbal/deverbal-like nominals and the others is due primarily to relational nominals, which are relatively easy to classify (Figure 2b); additionally, relational nominals exhibit a high rate of argument incorporation, which is easily handled by the maximum-likelihood model described in section 3.1.

6 Conclusions and future work

The application of nominal SRL to practical NLP problems requires a system that is able to accurately process each token it encounters. Previously, it was unclear whether the models proposed by Jiang and Ng (2006) and Liu and Ng (2007) would operate effectively in such an environment. The systems described by Surdeanu et al. (2008) are designed with this environment in mind, but their evaluation did not focus on the issue of implicit argumentation. These two problems motivate the work presented in this paper.

Our contribution is three-fold. First, we improve upon previous nominal SRL results using a single-stage classifier with additional new features. Second, we show that this model suffers a substantial performance degradation when evaluated over nominals with implicit arguments. Finally, we identify a set of features - many of them new - that can be used to reliably detect nominals with explicit arguments, thus significantly increasing the performance of the nominal SRL system.

Our results also suggest interesting directions for future work. As described in section 5.2, many nominals do not have enough labeled training data to produce accurate argument models. The generalization procedures developed by Gordon and Swanson (2007) for PropBank SRL and Padó et al. (2008) for NomBank SRL might alleviate this problem. Additionally, instead of ignoring nominals with implicit arguments, we would prefer to identify the implicit arguments using information contained in the surrounding discourse. Such inferences would help connect entities and events across sentences, providing a fuller interpretation of the text.

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful suggestions. The first two authors were supported by NSF grants IIS-0535112 and IIS-0347548, and the third author was supported by NSF grant IIS-0534700.

References

Collin Baker, Charles Fillmore, and John Lowe. 1998. The Berkeley FrameNet project. In Christian Boitet and Pete Whitelock, editors, Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics, pages 86-90, San Francisco, California. Morgan Kaufmann Publishers.

Xavier Carreras and Lluís Màrquez. 2005. Introduction to the CoNLL-2005 shared task: Semantic role labeling.
Eugene Charniak and Mark Johnson. 2005. Coarse-tofine n-best parsing and maxent discriminative reranking. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, XiangRui Wang, and Chih-Jen Lin. 2008. Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9:1871­1874. Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28:245­288. Roxana Girju, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter Turney, and Deniz Yuret. 2007. Semeval-2007 task 04: Classification of semantic relations between nominals. In Proceedings of the 4th International Workshop on Semantic Evaluations. A. Gordon and R. Swanson. 2007. Generalizing semantic role annotations across syntactically similar verbs. In Proceedings of ACL, pages 192­199. Z. Jiang and H. Ng. 2006. Semantic role labeling of nombank: A maximum entropy approach. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Maria Lapata. 2000. The automatic interpretation of nominalizations. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 716­721. AAAI Press / The MIT Press. Chang Liu and Hwee Ng. 2007. Learning predictive structures for semantic role labeling of nombank. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 208­215, Prague, Czech Republic, June. Association for Computational Linguistics. Adam Meyers. 2007a. Annotation guidelines for nombank - noun argument structure for propbank. Technical report, New York University. Adam Meyers. 2007b. Those other nombank dictionaries. Technical report, New York University. Sebastian Pad´ , Marco Pennacchiotti, and Caroline o Sporleder. 2008. Semantic role assignment for event nominalisations by leveraging verbal data. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 665­ 672, Manchester, UK, August. Coling 2008 Organizing Committee. Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71­ 106. Sameer Pradhan, Wayne Ward, and James H. Martin. 2005. Towards robust semantic role labeling. In Association for Computational Linguistics. Mihai Surdeanu, Richard Johansson, Adam Meyers, Llu´s M` rquez, and Joakim Nivre. 2008. The CoNLL i a 2008 shared task on joint parsing of syntactic and semantic dependencies. In CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning, pages 159­177, Manchester, England, August. Coling 2008 Organizing Committee. 154 Jointly Identifying Predicates, Arguments and Senses using Markov Logic Ivan Meza-Ruiz Sebastian Riedel School of Informatics, University of Edinburgh, UK Department of Computer Science, University of Tokyo, Japan Database Center for Life Science, Research Organization of Information and System, Japan I.V.Meza-Ruiz@sms.ed.ac.uk sebastian.riedel@gmail.com Abstract In this paper we present a Markov Logic Network for Semantic Role Labelling that jointly performs predicate identification, frame disambiguation, argument identification and argument classification for all predicates in a sentence. 
Empirically we find that our approach is competitive: our best model would appear on par with the best entry in the CoNLL 2008 shared task open track, and at the 4th place of the closed track--right behind the systems that use significantly better parsers to generate their input features. Moreover, we observe that by fully capturing the complete SRL pipeline in a single probabilistic model we can achieve significant improvements over more isolated systems, in particular for out-of-domain data. Finally, we show that despite the joint approach, our system is still efficient. 1 Introduction Semantic Role Labelling (SRL, M´ rquez et al., a 2008) is generally understood as the task of identifying and classifying the semantic arguments and modifiers of the predicates mentioned in a sentence. For example, in the case of the following sentence: we are to find out that for the predicate token "plays" with sense "play a role" (play.02) the phrase headed by the token "Haag" is referring to the player (A0) of the play event, and the phrase headed by the token 155 "Elianti" is referring to the role (A1) being played. SRL is considered as a key task for applications that require to answer "Who", "What", "Where", etc. questions, such as Information Extraction, Question Answering and Summarization. Any real-world SRL system needs to make several decisions, either explicitly or implicitly: which are the predicate tokens of a sentence (predicate identification), which are the tokens that have semantic roles with respect to these predicates (argument identification), which are the roles these tokens play (argument classification), and which is the sense of the predicate (sense disambiguation). In this paper we use Markov Logic (ML), a Statistical Relational Learning framework that combines First Order Logic and Markov Networks, to develop a joint probabilistic model over all decisions mentioned above. The following paragraphs will motivate this choice. First, it allows us to readily capture global correlations between decisions, such as the constraint that a predicate can only have one agent. This type of correlations has been successfully exploited in several previous SRL approaches (Toutanova et al., 2005; Punyakanok et al., 2005). Second, we can use the joint model to evaluate the benefit of incorporating decisions into the joint model that either have not received much attention within the SRL community (predicate identification and sense disambiguation), or been largely made in isolation (argument identification and classification for all predicates of a sentence). Third, our ML model is essentially a template that describes a class of Markov Networks. Algorithms can perform inference in terms of this template with- Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 155­163, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics out ever having to fully instantiate the complete Markov Network (Riedel, 2008; Singla and Domingos, 2008). This can dramatically improve the efficiency of an SRL system when compared to a propositional approach such as Integer Linear Programming (ILP). Finally, when it comes to actually building an SRL system with ML there are "only" four things to do: preparing input data files, converting output data files, and triggering learning and inference. The remaining work can be done by an off-theshelf Markov Logic interpreter. 
This is to be contrasted with pipeline systems where several components need to be trained and connected, or Integer Linear Programming approaches for which we need to write additional wrapper code to generate ILPs.

Empirically we find that our system is competitive: our best model would appear on par with the best entry in the CoNLL 2008 shared task open track, and at the 4th place of the closed track, right behind systems that use significantly better parsers[1] to generate their input features. We also observe that by integrating frame disambiguation into the joint SRL model, and by extracting all arguments for all predicates in a sentence simultaneously, significant improvements compared to more isolated systems can be achieved. These improvements are particularly large in the case of out-of-domain data, suggesting that a joint approach helps to increase the robustness of SRL. Finally, we show that despite the joint approach, our system is still efficient.

[1] Our unlabelled accuracy for syntactic dependencies is at least 3% points under theirs.

Our paper is organised as follows: we first introduce ML (section 2), then we present our model in terms of ML (section 3) and illustrate how to perform learning and inference with it (section 4). How this model will be evaluated is explained in section 5, with the corresponding evaluation presented in section 6. We conclude in section 7.

2 Markov Logic

Markov Logic (ML, Richardson and Domingos, 2005) is a Statistical Relational Learning language based on First Order Logic and Markov Networks. It can be seen as a formalism that extends First Order Logic to allow formulae that can be violated with some penalty. From an alternative point of view, it is an expressive template language that uses First Order Logic formulae to instantiate Markov Networks of repetitive structure.

Let us describe ML by considering the predicate identification task. In ML we can model this task by first introducing a set of logical predicates[2] such as isPredicate(Token) or word(Token, Word). Then we specify a set of weighted first order formulae that define a distribution over sets of ground atoms of these predicates (or so-called possible worlds). Ideally, the distribution we define with these weighted formulae assigns high probability to possible worlds where SRL predicates are correctly identified and a low probability to worlds where this is not the case. For example, a suitable set of weighted formulae would assign a high probability to the world[3]

  {word(1, Haag), word(2, plays), word(3, Elianti), isPredicate(2)}

and a low one to

  {word(1, Haag), word(2, plays), word(3, Elianti), isPredicate(3)}

In Markov Logic a set of weighted formulae is called a Markov Logic Network (MLN). Formally speaking, an MLN M is a set of pairs (\phi, w), where \phi is a first order formula and w a real weight. M assigns the probability

  p(y) = \frac{1}{Z} \exp \sum_{(\phi, w) \in M} \sum_{c \in C^{\phi}} w \, f^{\phi}_{c}(y)    (1)

to the possible world y. Here C^{\phi} is the set of all possible bindings of the free variables in \phi with the constants of our domain. f^{\phi}_{c} is a feature function that returns 1 if, in the possible world y, the ground formula we get by replacing the free variables in \phi by the constants in c is true, and 0 otherwise. Z is a normalisation constant. Note that this distribution corresponds to a Markov Network (the so-called Ground Markov Network) where nodes represent ground atoms and factors represent ground formulae.

[2] In the cases where it is not obvious whether we refer to SRL or ML predicates, we add the prefix SRL or ML, respectively.
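As a concrete illustration of equation (1), the sketch below computes the unnormalized log-score of a possible world, representing a world as a set of ground atoms and each weighted formula as a (weight, arity, grounding-evaluation function) triple. This representation and the single example formula are assumptions made for the illustration, not the paper's implementation.

```python
import itertools

def log_score(world, mln, constants):
    """Sum of w * f_c(y) over all formulae and variable bindings (the exponent in eq. 1)."""
    total = 0.0
    for weight, arity, ground_formula in mln:
        for binding in itertools.product(constants, repeat=arity):
            if ground_formula(world, *binding):   # f_c(y) = 1 if the ground formula holds
                total += weight
    return total

# A possible world is a set of true ground atoms; everything else is false.
world = {("word", 1, "Haag"), ("word", 2, "plays"), ("word", 3, "Elianti"),
         ("isPredicate", 2)}

# One weighted formula: word(x, plays) => isPredicate(x). An implication grounding
# is true unless its body holds and its head does not.
mln = [(1.5, 1, lambda w, x: ("word", x, "plays") not in w or ("isPredicate", x) in w)]

print(log_score(world, mln, constants=[1, 2, 3]))   # 4.5: all three groundings are true
```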
3 "Haag plays Elianti" is a segment of a sentence in the training corpus. 156 For example, if M contains the formula word (x, take) isP redicate (x) then its corresponding log-linear model has, among others, a feature ft1 for which x in has been replaced by the constant t1 and that returns 1 if word (1, take) isP redicate (1) is true in y and 0 otherwise. We will refer predicates such as word as observed because they are known in advance. In contrast, isPredicate is hidden because we need to infer it at test time. Pipeline direction sense Sense Disambiguation role hasRole isArgument Argument Identification & clasification Bottom-up isPredicate Predicate Identification Figure 1: MLN hidden predicates divided in stages 3 Model Conceptually we divide our SRL system into three stages: one stage that identifies the predicates of a sentence, one stage that identifies and classifies the arguments of these predicates, and a final stage that predicts the sense of each predicate. We should stress that this architecture is intended to illustrate a typical SRL system, and to describe the pipelinebased approach we will compare our models to. However, it does not correspond to the way inference is performed in our proposed model--we jointly infer all decisions described above. Note that while the proposed division into conceptual stages seems somewhat intuitive, it is by no means uncontroversial. In fact, for the CoNLL 2008 shared task slightly more than one half of the participants performed sense disambiguation before argument identification and classification; most other participants framed the problem in the reverse order.4 We define five hidden predicates for the three stages of the task. Figure 1 illustrates these predicates and the stage they belong to. For predicate identification, we use the predicate isPredicate. isPredicate(p) indicates that the word in the position p is an SRL predicate. For argument identification and classification, we use the predicates isArgument, hasRole and role. The atom isArgument(a) signals that the word in the position a is a SRL argument of some (unspecified) SRL predicate while hasRole(p,a) indicates that the token at position a is However, for almost all pipeline based systems, predicate identification was the first stage of the role labelling process. 4 an argument of the predicate in position p. The predicate role(p,a,r) corresponds to the decision that the argument at position a has the role r with respect to the predicate in position p. Finally, for sense disambiguation we define the predicate sense(p,e) which signals that the predicate in position p has the sense e. Before we continue to describe the formulae of our Markov Logic Network we would like to highlight the introduction of the isArgument predicate mentioned above. This predicate corresponds to a decision that is usually made implicitly: a token is an argument if there exists a predicate for which it plays a semantic role. Here we model this decision explicitly, assuming that there exist cases where a token clearly has to be an argument of some predicate, regardless of which predicate in the sentence this might be. It is this assumption that requires us to infer the arguments for all predicates of a sentence at once--otherwise we cannot make sure that for a marked argument there exists at least one predicate for which the argument plays a semantic role. In addition to the hidden predicates, we define observable predicates to represent the information available in the corpus. 
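As an illustration of how the observable predicates listed in Table 1 might be populated, the sketch below builds the token-level ground atoms for the running "Haag plays Elianti" example. The lemma and part-of-speech values are illustrative, the coarse tag is a simplification, and the pairwise predicates (dep, depPath, and so on) are omitted because they require a dependency parse.

```python
def observed_atoms(tokens):
    """tokens: list of (word, lemma, pos) triples; output atoms are 1-indexed."""
    atoms = set()
    for i, (word, lemma, pos) in enumerate(tokens, start=1):
        atoms.add(("word", i, word))
        atoms.add(("lemma", i, lemma))
        atoms.add(("ppos", i, pos))
        atoms.add(("cpos", i, pos[0]))   # coarse tag: first letter, a simplification
    return atoms

sentence = [("Haag", "haag", "NNP"), ("plays", "play", "VBZ"), ("Elianti", "elianti", "NNP")]
for atom in sorted(observed_atoms(sentence)):
    print(atom)
```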
Table 1 presents these predicates. 3.1 Local formulae A formula is local if its groundings relate any number of observed ground atoms to exactly one hidden ground atom. For example, two groundings of the local formula lemma(p, +l1 )lemma(a, +l2 ) hasRole(p, a) can be seen in the Factor Graph of Figure 2. Both connect a single hidden hasRole ground atom with 157 Top-Down word(i,w) lemma(i,l) ppos(i,p) cpos(i,p) voice(i,v) subcat(i,f) dep(i,j,d) palmer(i,j) depPath(i,j,p) depFrame(i,j,f) Token i has word w Token i has lemma l Token i has POS tag p Token i has coarse POS tag p Token i is verb and has voice v (Active/Passive). Token i has subcategorization frame f Token h is head of token m and has dependency label d Token j can be semantic argument for token i according to high recall heuristic Dependency path between tokens i and j is p f is a syntactic (dependency) frame in which tokens i and j are designated as "pivots" Figure 2: Factor graph for the first local formula in section 3.1. Here round nodes represent variables (corresponding to the states of ground atoms) and the rectangular nodes represent the factor and their parameters attached to the ground formulae. Table 1: Observable predicates; predicates marked with are dependency parsing-based versions for features of Xue and Palmer (2004). two observed lemma ground atoms. The + notation indicates that the MLN contains one instance of the rule, with a separate weight, for each assignment of the variables with a plus sign (?). The local formulae for isPredicate, isArgument and sense aim to capture the relation of the tokens with their lexical and syntactic surroundings. This includes formulae such as subcat(p, +f ) isP redicate(p) which implies that a certain token is a predicate with a weight that depends on the subcategorization frame of the token. Further local formulae are constructed using those observed predicates in table 1 that relate single tokens and their properties. The local formulae for role and hasRole focus on properties of the predicate and argument token--the formula illustrated in figure 2 is an example of this-- and on the relation between the two tokens. An example of the latter type is the formula depP ath(p, a, +d) role(p, a, +r) which implies that token a plays the semantic role r with respect to token p, and for which the weight depends on the syntactic (dependency) path d between p and a and on the actual role to assign. Again, further formulae are constructed using the observed 158 predicates in table 1; however, this time we consider both predicates that relate tokens to their individual properties and predicates that describe the relation between tokens. Unfortunately, the complete set of local formulae is too large to be exhaustively described in this paper. Its size results from the fact that we also consider conjunctions of several atoms as conditions, and lexical windows around tokens. Hence, instead of describing all local formulae we refer the reader to our MLN model files.5 They can be used both as a reference and as input to our Markov Logic Engine,6 and thus allow the reader to easily reproduce our results. 3.2 Global formulae Global formulae relate several hidden ground atoms. We use this type of formula for two purposes: to ensure consistency between the predicates of all SRL stages, and to capture some of our background knowledge about SRL. We will refer to formulae that serve the first purpose as structural constraints. 
[5] http://code.google.com/p/thebeast/source/browse/#svn/mlns/naacl-hlt
[6] http://code.google.com/p/thebeast

For example, a structural constraint is given by the (deterministic) formula

  role(p, a, r) ⇒ hasRole(p, a)

which ensures that, whenever the argument a is given a label r with respect to the predicate p, this argument must be an argument of p as denoted by hasRole(p, a). Note that this formula by itself models the traditional "bottom-up" argument identification and classification pipeline (Xue and Palmer, 2004): it is possible to not assign a role r to a predicate-argument pair (p, a) proposed by the identification stage; however, it is impossible to assign a role r to token pairs (p, a) that have not been proposed as potential arguments.

An example of another class of structural constraints is

  hasRole(p, a) ⇒ ∃r. role(p, a, r)

which, by itself, models an inverted or "top-down" pipeline. In this architecture the argument classification stage can assign roles to tokens that have not been proposed by the argument identification stage. However, it must assign a label to any token pair the previous stage proposes.

For the SRL predicates that perform a labelling task (role and sense) we also need a structural constraint which ensures that not more than one label is assigned. For instance,

  role(p, a, r1) ∧ r1 ≠ r2 ⇒ ¬role(p, a, r2)

forbids two different semantic roles for a pair of words.

There are three global formulae that capture our linguistic background knowledge. The first one is a deterministic constraint that has been frequently applied in the SRL literature. It forbids cases where distinct arguments of a predicate have the same role unless the role describes a modifier:

  role(p, a1, r) ∧ ¬mod(r) ∧ a1 ≠ a2 ⇒ ¬role(p, a2, r)

The second "linguistic" global formula is

  role(p, a, +r) ∧ lemma(p, +l) ⇒ sense(p, +s)

which implies that when a predicate p with lemma l has an argument a with role r, it has to have the sense s. Here the weight depends on the combination of role r, lemma l and sense s. The third and final "linguistic" global formula is

  lemma(p, +l) ∧ ppos(a, +p) ∧ hasRole(p, a) ⇒ sense(p, +s)

It implies that if a predicate p has the lemma l and an argument a with POS tag p, it has to have the sense s. This time the weight depends on the combination of POS tag p, lemma l and sense s. Note that the final two formulae evaluate the semantic frame of a predicate and become local formulae in a pipeline system that performs sense disambiguation after argument identification and classification. Table 2 summarises the global formulae we use in this work.

4 Inference and Learning

Assuming that we have an MLN, a set of weights and a given sentence, we need to predict the choice of predicates, frame types, arguments and role labels with maximal a posteriori probability (MAP). To this end we apply a method that is both exact and efficient: Cutting Plane Inference (CPI, Riedel, 2008) with Integer Linear Programming (ILP) as base solver. Instead of fully instantiating the Markov Network that a Markov Logic Network describes, CPI begins with a subset of factors/edges (in our case, the factors that correspond to the local formulae of our model) and solves the MAP problem for this subset using the base solver. It then inspects the solution for ground formulae/features that are not yet included but could, if added, lead to a different solution; this process is usually referred to as separation. The ground formulae that we have found are added and the network is solved again.
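The CPI procedure just described can be summarized by the loop below. The `solve_map` and `separate` functions are stand-ins for the ILP base solver and for the search over variable bindings of the global formulae; only the control flow is taken from the description above, not any particular implementation.

```python
def cutting_plane_inference(local_factors, global_formulae, solve_map, separate,
                            max_iterations=100):
    """MAP inference that adds violated ground formulae lazily instead of grounding everything."""
    active_factors = list(local_factors)      # start with the local formulae only
    solution = solve_map(active_factors)
    for _ in range(max_iterations):
        # Separation: find groundings of global formulae that could change the solution
        # (e.g. violated positive-weight formulae, satisfied negative-weight formulae).
        new_factors = separate(global_formulae, solution)
        if not new_factors:                   # nothing left to add: solution is the MAP
            return solution
        active_factors.extend(new_factors)
        solution = solve_map(active_factors)
    return solution
```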
This process is repeated until the network does not change anymore. This type of algorithm could also be realised for an ILP formulation of SRL. However, it would require us to write a dedicated separation routine for each type of constraint we want to add. In Markov Logic, on the other hand, separation can be generically implemented as the search for variable bindings that render a weighted first order formula true (if its weight is negative) or false (if its weight is positive). In practice this means that we can try new global formulae/constraints without any additional implementation overhead.

We learn the weights associated with each MLN using the 1-best MIRA (Crammer and Singer, 2003) online learning method. As the MAP inference method applied in the inner loop of the online learner we apply CPI, again with ILP as base solver.

Table 2: Global formulae for the ML model.

Bottom-up:
  sense(p, s) ⇒ isPredicate(p)
  hasRole(p, a) ⇒ isPredicate(p)
  hasRole(p, a) ⇒ isArgument(a)
  role(p, a, r) ⇒ hasRole(p, a)
Top-down:
  isPredicate(p) ⇒ ∃s. sense(p, s)
  isPredicate(p) ⇒ ∃a. hasRole(p, a)
  isArgument(a) ⇒ ∃p. hasRole(p, a)
  hasRole(p, a) ⇒ ∃r. role(p, a, r)
Unique labels:
  role(p, a, r1) ∧ r1 ≠ r2 ⇒ ¬role(p, a, r2)
  sense(p, s1) ∧ s1 ≠ s2 ⇒ ¬sense(p, s2)
Linguistic:
  role(p, a1, r) ∧ ¬mod(r) ∧ a1 ≠ a2 ⇒ ¬role(p, a2, r)
  lemma(p, +l) ∧ ppos(a, +p) ∧ hasRole(p, a) ⇒ sense(p, +f)
  lemma(p, +l) ∧ role(p, a, +r) ⇒ sense(p, +f)

5 Experimental Setup

For training and testing our SRL systems we used a version of the CoNLL 2008 shared task (Surdeanu et al., 2008) dataset that only mentions verbal predicates, disregarding the nominal predicates available in the original corpus.[7] While the original (open track) corpus came with MALT (Nivre et al., 2007) dependencies, we observed slightly better results when using the dependency parses generated with a Charniak parser (Charniak, 2000). Hence we used the latter for all our experiments.

To assess the performance of our model, and to evaluate the possible gains to be made from considering a joint model of the complete SRL pipeline, we set up several systems. The full system uses a Markov Logic Network with all local and global formulae described in section 3. For the bottom-up system we removed the structural top-down constraints from the complete model; previous work (Riedel and Meza-Ruiz, 2008) has shown that this can lead to improved performance. The bottom-up (-arg) system is equivalent to the bottom-up system, but it does not include any formulae that mention the hidden isArgument predicate. For the systems presented so far we perform joint inference and learning. The pipeline system differs in this regard. For this system we train a separate model for each stage in the pipeline of figure 1. The predicate identification stage identifies the predicates of a sentence (using all local isPredicate formulae). The next stage predicts arguments and their roles for the identified predicates. Here we include all local and global formulae that involve only the predicates of this stage. In the last stage we predict the sense of each identified predicate using all formulae that involve the sense, without the structural constraints that connect the sense predicate to the previous stages of the pipeline (these constraints are enforced by the architecture).

6 Results

Table 3 shows the results of our systems for the CoNLL 2008 development set and the WSJ and Brown test sets. The scores are calculated using the semantic evaluation metric of the CoNLL-08 shared task (Surdeanu et al., 2008).
This metric measures the precision, recall and F1 score of the recovered semantic dependencies. A semantic dependency is created for each predicate and its arguments, the label of such dependency is the role of the argument. Additionally, there is a semantic dependency for each predicate and a ROOT argument which has the sense of the predicate as label. To put these results into context, let us compare them to those of the participants of the CoNLL 2008 shared task (see the last three rows of table 3).8 Our best model, Bottom-up, would reach the highest F1 WSJ score, and second highest Brown score, for the open track. Here the best-performing participant was Vickrey and Koller (2008). Table 3 also shows the results of the best (Johansson and Nugues, 2008) and fourth best sysResults of other systems were extracted from Table 16 of the shared task overview paper (Surdeanu et al., 2008). 8 The reason for this choice where license problems. 160 tem (Zhao and Kit, 2008) of the closed track. We note that we do significantly worse than Johansson and Nugues (2008), and roughly equivalent to Zhao and Kit (2008); this places us on the fourth rank of 19 participants. However, note that all three systems above us, as well as Zhao and Kit (2008), use parsers with at least about 90% (unlabelled) accuracy on the WSJ test set (Johansson's parser has about 92% unlabelled accuracy).9 By contrast, with about 87% unlabelled accuracy our parses are significantly worse. Finally, akin to Riedel and Meza-Ruiz (2008) we observe that the bottom-up joint model performs better than the full joint model. System Full Bottom-up Bottom-up (-arg) Pipeline Vickrey Johansson Zhao Devel 76.93 77.96 77.57 75.69 N/A N/A N/A WSJ 79.09 80.16 79.37 78.19 79.75 86.37 79.40 Brown 67.64 68.02 66.70 64.66 69.57 71.87 66.38 isPredicate isArgument hasRole role sense WSJ Pipe. Fu. 96.6 96.5 90.3 90.6 88.0 87.9 75.4 75.5 85.5 88.5 Brown Pipe. Fu. 92.2 92.5 85.9 86.9 83.6 83.8 64.2 64.6 67.3 77.1 Table 4: F1 scores for M predicates; Pipe. refers to the Pipeline system, Fu. to the full system. 6.2 Modelling if a Token is an Argument Table 3: Semantic F1 scores for our systems and three CoNLL 2008 shared task participants. The Bottom-up results are statistically significantly different to all others (i.e., 0.05 according to the sign test). 6.1 Joint Model vs. Pipeline Table 3 suggests that by including sense disambiguation into the joint model (as is the case for all systems but the pipeline) significant improvements can be gained. Where do these improvements come from? We tried to answer this question by taking a closer look at how accurately the pipeline predicts the isP redicate, isArgument, hasRole, role and sense relations, and how this compares to the result of the joint full model. Table 4 shows that the joint model mainly does better when it comes to predicting the right predicate senses. This is particularly true for the case of the Brown corpus--here we gain about 10% points. These results suggest that a more joint approach may be particularly useful in order to increase the robustness of an SRL system in out-of-domain scenarios.10 9 In table 3 we also observe that improvements can be made if we explicitly model the decision whether a token is a semantic argument of some predicate or not. As we mentioned in section 3, this aspect of our model requires us to jointly perform inference for all predicates of a sentence, and hence our results justify the per-sentence SRL approach proposed in this paper. 
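The semantic-dependency scoring described at the start of this section can be sketched as a set comparison over labelled dependencies. This is a simplified illustration of the metric, not the official CoNLL-2008 scorer; the dependency representation and example values are assumptions made for the sketch.

```python
def semantic_dependency_f1(gold, predicted):
    """gold, predicted: sets of labelled semantic dependencies.
    Each argument contributes (predicate_index, argument_index, role); each predicate
    additionally contributes (predicate_index, "ROOT", sense), so role and sense
    decisions are scored together, as in the description above."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

gold = {(2, 1, "A0"), (2, 3, "A1"), (2, "ROOT", "play.02")}
pred = {(2, 1, "A0"), (2, 3, "A2"), (2, "ROOT", "play.02")}
print(semantic_dependency_f1(gold, pred))   # roughly (0.667, 0.667, 0.667)
```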
In order to analyse where these improvements come from, we again list our results on a per-SRLpredicate basis. Table 5 shows that by including the isArgument predicate and the corresponding formulae we gain around 0.6% and 1.0% points across the board for WSJ and Brown, respectively.11 As shown in table 3, these improvements result in about 1.0% improvements for both WSJ and Brown in terms of the CoNLL 2008 metric. Hence, an explicit model of the "is an argument" decision helps the SRL at all levels. How the isArgument helps to improve the overall role labelling score can be illustrated with the example in figure 3. Here the model without a hidden isArgument predicate fails to attach the preposition "on" to the predicate "start.01" (here 01 refers to the sense of the predicate). Apparently the model has not enough confidence to assign the preposition to either "start.01" or "get.03", so it just drops the argument altogether. However, because the isArgument model knows that most prepositions have to be modifying some predicate, prespare labelled accuracy. 10 The differences between results of the full and joint model are statistically significant with the exception of the results for the isP redicate predicate for the WSJ test set. 11 The differences between results of the w/ and w/o model are statistically significant with the exception of the results for the sense predicate for the Brown test set. Since our parses use a different label set we could not com- 161 Figure 3: Segment of the CoNLL 2008 development set for which the bottom-up model w/o isArgument predicate fails to attach the preposition "on" as an "AM-LOC" for "started". The joint bottom-up model attaches the preposition correctly. likely assignment using an ILP solver. This system (Bottom-up (-CPI)) is four times slower than the equivalent system that uses Cutting Plane Inference (Bottom-up). This suggests that if we were to implement the same joint model using ILP instead of ML, our system would either be significantly slower, or we would need to implement a Cutting Plane algorithm for the corresponding ILP formulation--when we use ML this algorithm comes "for free". System Full Full (-CPI) Bottom-up Bottom-up (-CPI) Pipeline WSJ 9.2m 38.4m 9.5m 38.8m 18.9m Brown 1.5m 7.47m 1.6m 6.9m 2.9m sure is created that forces a decision between the two predicates. And because for the role model "start.01" looks like a better fit than "get.03", the correct attachment is found. WSJ w/o w/ 96.3 96.5 87.1 87.7 76.9 77.5 88.3 89.0 Brown w/o w/ 91.4 92.5 82.5 83.6 65.2 66.2 76.1 77.5 isPredicate hasRole role sense Table 6: Testing times for full model and bottom-up when CPI algorithm is not used. The WSJ test set contains 2414 sentences, the Brown test set 426. Our best systems thus takes on average 230ms per WSJ sentence (on a 2.4Ghz system). Table 5: F1 scores for ML predicates; w/o refers to a Bottom-up system without isArgument predicate, w/ refers to a Bottom-up system with isArgument predicate. 7 Conclusion 6.3 Efficiency In the previous sections we have shown that our joint model indeed does better than an equivalent pipeline system. However, usually most joint approaches come at a price: efficiency. Interestingly, in our case we observe the opposite: our joint model is actually faster than the pipeline. This can be seen in table 6, where we list the time it took for several different system to process the WSJ and Brown test corpus, respectively. 
When we compare the times for the bottom-up model to those of the pipeline, we note that the joint model is twice as fast. While the individual stages within the pipeline may be faster than the joint system (even when we sum up inference times), extracting results from one system and feeding them into another creates overhead which offsets this potential reduction. Table 6 also lists the run-time of a bottom-up system that solves the inference problem by fully grounding the Markov Network that the Markov Logic (ML) model describes, mapping this network to an Integer Linear Program, and finding the most 162 In this paper we have presented a Markov Logic Network that jointly models all predicate identification, argument identification and classification and sense disambiguation decisions for a sentence. We have shown that this approach is competitive, in particular if we consider that our input parses are significantly worse than those of the top CoNLL 2008 systems. We demonstrated the benefit of jointly predicting senses and semantic arguments when compared to a pipeline system that first picks arguments and then senses. We also showed that by modelling whether a token is an argument of some predicate and jointly picking arguments for all predicates of a sentence, further improvements can be achieved. Finally, we demonstrated that our system is efficient, despite following a global approach. This efficiency was also shown to stem from the first order inference method our Markov Logic engine applies. Acknowledgements The authors are grateful to Mihai Surdeanu for providing the version of the corpus used in this work. References Eugene Charniak. A maximum-entropy-inspired parser. In Proceedings of NAACL-2000, 2000. Koby Crammer and Yoram Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951­ 991, 2003. Richard Johansson and Pierre Nugues. Dependencybased semantic role labeling of propbank. In Proceedings of EMNLP-2008., 2008. Llu´s M´ rquez, Xavier Carreras, Ken Litkowski, and i a Suzanne Stevenson. Semantic role labeling. Computational Linguistics, 34(2), 2008. Introduction to the Special Issue on Semantic Role Labeling. J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. Kuebler, S. Marinov, and E. Marsi. MaltParser: A language-independent system for datadriven dependency parsing. Natural Language Engineering, 13(02):95­135, 2007. V. Punyakanok, D. Roth, and W. Yih. Generalized inference with multiple semantic role labeling systems. In Ido Dagan and Dan Gildea, editors, CoNLL '05: Proceedings of the Annual Conference on Computational Natural Language Learning, pages 181­184, 2005. Matthew Richardson and Pedro Domingos. Markov logic networks. Technical report, University of Washington, 2005. Sebastian Riedel. Improving the accuracy and efficiency of map inference for markov logic. In UAI '08: Proceedings of the Annual Conference on Uncertainty in AI, 2008. Sebastian Riedel and Ivan Meza-Ruiz. Collective semantic role labelling with markov logic. In Conference on Computational Natural Language Learning, 2008. P. Singla and P. Domingos. Lifted First-Order Belief Propagation. Association for the Advancement of Artificial Intelligence (AAAI), 2008. Mihai Surdeanu, Richard Johansson, Adam Meyers, i a Llu´s M` rquez, and Joakim Nivre. The CoNLL2008 shared task on joint parsing of syntactic and semantic dependencies. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL-2008), 2008. 
163 Kristina Toutanova, Aria Haghighi, and Christopher D. Manning. Joint learning improves semantic role labeling. In ACL '05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Morristown, NJ, USA, 2005. David Vickrey and Daphne Koller. Applying sentence simplification to the conll-2008 shared task. In Proceedings of CoNLL-2008., 2008. Nianwen Xue and Martha Palmer. Calibrating features for semantic role labeling. In EMNLP '04: Proceedings of the Annual Conference on Empirical Methods in Natural Language Processing, 2004. Hai Zhao and Chunyu Kit. Parsing syntactic and semantic dependencies with two single-stage maximum entropy models. In CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning, Manchester, England, 2008. Structured Generative Models for Unsupervised Named-Entity Clustering Micha Elsner, Eugene Charniak and Mark Johnson Brown Laboratory for Linguistic Information Processing (BLLIP) Brown University Providence, RI 02912 {melsner,ec,mj}@cs.brown.edu Abstract We describe a generative model for clustering named entities which also models named entity internal structure, clustering related words by role. The model is entirely unsupervised; it uses features from the named entity itself and its syntactic context, and coreference information from an unsupervised pronoun resolver. The model scores 86% on the MUC-7 named-entity dataset. To our knowledge, this is the best reported score for a fully unsupervised model, and the best score for a generative model. 1 Introduction Named entity clustering is a classic task in NLP, and one for which both supervised and semi-supervised systems have excellent performance (Mikheev et al., 1998; Chinchor, 1998). In this paper, we describe a fully unsupervised system (using no "seed rules" or initial heuristics); to our knowledge this is the best such system reported on the MUC-7 dataset. In addition, the model clusters the words which appear in named entities, discovering groups of words with similar roles such as first names and types of organization. Finally, the model defines a notion of consistency between different references to the same entity; this component of the model yields a significant increase in performance. The main motivation for our system is the recent success of unsupervised generative models for coreference resolution. The model of Haghighi and Klein (2007) incorporated a latent variable for named entity class. They report a named entity score 164 of 61.2 percent, well above the baseline of 46.4, but still far behind existing named-entity systems. We suspect that better models for named entities could aid in the coreference task. The easiest way to incorporate a better model is simply to run a supervised or semi-supervised system as a preprocess. To perform joint inference, however, requires an unsupervised generative model for named entities. As far as we know, this work is the best such model. Named entities also pose another problem with the Haghighi and Klein (2007) coreference model; since it models only the heads of NPs, it will fail to resolve some references to named entities: ("Ford Motor Co.", "Ford"), while erroneously merging others: ("Ford Motor Co.", "Lockheed Martin Co."). Ng (2008) showed that better features for matching named entities­ exact string match and an "alias detector" looking for acronyms, abbreviations and name variants­ improve the model's performance substantially. 
Yet building an alias detector is nontrivial (Uryupina, 2004). English speakers know that "President Clinton" is the same person as "Bill Clinton" , not "President Bush". But this cannot be implemented by simple substring matching. It requires some concept of the role of each word in the string. Our model attempts to learn this role information by clustering the words within named entities. 2 Related Work Supervised named entity recognition now performs almost as well as human annotation in English (Chinchor, 1998) and has excellent performance on other languages (Tjong Kim Sang and De Meulder, 2003). For a survey of the state of the art, Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 164­172, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics see Nadeau and Sekine (2007). Of the features we explore here, all but the pronoun information were introduced in supervised work. Supervised approaches such as Black et al. (1998) have used clustering to group together different nominals referring to the same entity in ways similar to the "consistency" approach outlined below in section 3.2. Semi-supervised approaches have also achieved notable success on the task. Co-training (Riloff and Jones, 1999; Collins and Singer, 1999) begins with a small set of labeling heuristics and gradually adds examples to the training data. Various co-training approaches presented in Collins and Singer (1999) all score about 91% on a dataset of named entities; the inital labels were assigned using 7 hand-written seed rules. However, Collins and Singer (1999) show that a mixture-of-naive-Bayes generative clustering model (which they call an EM model), initialized with the same seed rules, performs much more poorly at 83%. Much later work (Evans, 2003; Etzioni et al., 2005; Cucerzan, 2007; Pasca, 2004) relies on the use of extremely large corpora which allow very precise, but sparse features. For instance Etzioni et al. (2005) and Pasca (2004) use web queries to count occurrences of "cities such as X" and similar phrases. Although our research makes use of a fairly large amount of data, our method is designed to make better use of relatively common contextual features, rather than searching for high-quality semantic features elsewhere. Models of the internal structure of names have been used for cross-document coreference (Li et al., 2004; Bhattacharya and Getoor, 2006) and a goal in their own right (Charniak, 2001). Li et al. (2004) take named entity classes as a given, and develops both generative and discriminative models to detect coreference between members of each class. Their generative model designates a particular mention of a name as a "representative" and generates all other mentions from it according to an editing process. Bhattacharya and Getoor (2006) operates only on authors of scientific papers. Their model accounts for a wider variety of name variants than ours, including misspellings and initials. In addition, they confirm our intuition that Gibbs sampling for inference has insufficient mobility; rather than using a heuristic algorithm as we do (see section 3.5), they 165 use a data-driven block sampler. Charniak (2001) uses a Markov chain to generate 6 different components of people's names, again assuming that the class of personal names can be pre-distinguished using a name list. 
He infers coreference relationships between similar names appearing in the same document, using the same notion of consistency between names as our model. As with our model, the clusters found are relatively good, although with some mistakes even on frequent items (for example, "John" is sometimes treated as a descriptor like "Secretary").

3 System Description

Like Collins and Singer (1999), we assume that the named entities have already been correctly extracted from the text, and our task is merely to label them. We assume that all entities fit into one of the three MUC-7 categories: LOC (locations), ORG (organizations), and PER (people). This is an oversimplification; Collins and Singer (1999) show that about 12% of examples do not fit into these categories. However, while using the MUC-7 data, we have no way to evaluate on such examples.

As a framework for our models, we adopt adaptor grammars (Johnson et al., 2007), a framework for non-parametric Bayesian inference over context-free grammars. Although our system does not require the full expressive power of PCFGs, the adaptor grammar framework allows for easy development of structured priors and supplies a flexible generic inference algorithm. An adaptor grammar is a hierarchical Pitman-Yor process (Pitman and Yor, 1997). The grammar has two parts: a base PCFG and a set of adapted nonterminals. Each adapted nonterminal is a Pitman-Yor process which expands either to a previously used subtree or to a sample from the base PCFG. The end result is a posterior distribution over PCFGs and over parse trees for each example in our dataset. Each of our models is an adaptor grammar based on a particular base PCFG where the top nonterminal of each parse tree represents a named entity class.

3.1 Core NP Model

We begin our analysis by reducing each named-entity reference to the contiguous substring of proper nouns which surrounds its head, which we call the core (Figure 1). To analyze the core, we use a grammar with three main symbols (NE_x), one for each named entity class x. Each class has an associated set of lexical symbols, which occur in a strict order (NE_x^i is the i-th symbol for class x). We can think of the NE_x^i as the semantic parts of a proper name; for people, NE_PER^0 might generate titles and NE_PER^1 first names. Each NE_x^i is adapted, and can expand to any string of words; the ability to generate multiple words from a single symbol is useful both because it can learn to group collocations like "New York" and because it allows the system to handle entities longer than four words. However, we set the prior on multi-word expansions very low, to avoid degenerate solutions where most phrases are analyzed with a single symbol. The system learns a separate probability for each ordered subset of the NE_x^i (for instance the rule NE_0 -> NE_0^0 NE_0^2 NE_0^4), so that it can represent constraints on possible references; for instance, a last name can occur on its own, but not a title.

[Figure 1: Part of the grammar for core phrases, e.g. ROOT -> NE_0 | NE_1 | NE_2; NE_0 -> (NE_0^0)(NE_0^1)(NE_0^2)(NE_0^3)(NE_0^4); NE_0^0* -> Words; Words -> Word (Words); Word -> Bill | ... (Parentheses) mark optional nonterminals; *starred nonterminals are adapted.]

[Figure 2: Part of the consistency-enforcing grammar for core phrases, e.g. NE_0 -> E_00 | E_01 | ... | E_0k; E_00 -> (E_00^0)(E_00^1)(E_00^2)(E_00^3)(E_00^4); E_00^0 -> NE_0^0; NE_0^0* -> Words. There are an infinite number of entities E_xk, all with their own lexical symbols. Each lexical symbol E_xk^i expands to a single NE_x^i.]

3.2 Consistency Model

This system captures some of our intuitions about core phrases, but not all: our representation for "Bill Clinton" does not share any information with "President Bill Clinton" except the named-entity class. To remedy this, we introduce a set of "entity" nonterminals E_k, which enforce a weak notion of consistency. We follow Charniak (2001) in assuming that two names are consistent (can be references to the same entity) if they do not have different expansions for any lexical symbol. In other words, a particular entity E_{PER,Clinton} has a title E_{PER,Clinton}^0 = "President", a first name E_{PER,Clinton}^1 = "Bill", etc. These are generated from the class-specific distributions, for instance E_{PER,Clinton}^0 from E_PER^0, which we intend to be a distribution over titles in general. The resulting grammar is shown in Figure 2; the prior parameters for the entity-specific symbols E_xk^i are fixed so that, with overwhelming probability, only one expansion occurs. We can represent any fixed number of entities E_k with a standard adaptor grammar, but since we do not know the correct number, we must extend the adaptor model slightly to allow for an unbounded number. We generate the E_k from a Chinese Restaurant process prior. (General grammars with infinite numbers of nonterminals were studied by Liang et al. (2007b).)

3.3 Modifiers, Prepositions and Pronouns

Next, we introduce two types of context information derived from Collins and Singer (1999): nominal modifiers and prepositional information. A nominal modifier is either the head of an appositive phrase ("Maury Cooper, a vice president") or a non-proper prenominal ("spokesman John Smith"); we stem modifiers with the Porter stemmer. If the entity is the complement of a preposition, we extract the preposition and the head of the governing NP ("a federally funded sewage plant in Georgia"). These are added to the grammar at the named-entity class level (separated from the core by a special punctuation symbol). Finally, we add information about pronouns and wh-complementizers (Figure 3).

[Figure 3: A fragment of the full grammar, e.g. ROOT -> Modifiers_0 # NE_0 # Prepositions_0 # Pronouns_0 #; Pronouns_0 -> Pronoun_0 (Pronouns_0); Pronoun_0 -> pers | loc | org | any; pers -> i | he | she | who | me | ...; org -> which | it | they | we | ...; loc -> where | which | it | its. The symbol # represents punctuation between different feature types. The prior for class 0 is concentrated around personal pronouns, although other types are possible.]

[Figure 4: Some merged examples from an input file (# separates different feature types), e.g. a line with nominal modifiers "attack airlift airlift rescu", core "wing", and prepositional features "of-commander of-command with-run", and a line for the core "air-india" with no features outside the core.]

Our pronoun information is derived from an unsupervised coreference algorithm which does not use named entity information (Charniak and Elsner, 2009). This algorithm uses EM to learn a generative model with syntactic, number and gender parameters.
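As a concrete illustration of the feature types just described (core, nominal modifiers, governing prepositions, pronouns), the sketch below extracts a core and bundles it with its context features. It is only a simplified, tag-based stand-in: the actual system reads these features off full Charniak-Johnson parse trees, and the class and function names here are ours, not the authors'.

```python
from dataclasses import dataclass, field

@dataclass
class Mention:
    """One named-entity mention with the feature types described above.
    Field and class names are illustrative, not the authors' data structures."""
    core: tuple                                       # contiguous proper-noun substring around the head
    modifiers: list = field(default_factory=list)     # stemmed nominal modifiers
    prepositions: list = field(default_factory=list)  # "prep-governor" strings
    pronouns: list = field(default_factory=list)      # pronouns linked by the coreference system

def extract_core(tokens, tags, head_index):
    """Return the maximal run of proper nouns (NNP/NNPS) containing the head:
    a simplified stand-in for reading the core off a parse tree."""
    start = end = head_index
    while start > 0 and tags[start - 1].startswith("NNP"):
        start -= 1
    while end + 1 < len(tags) and tags[end + 1].startswith("NNP"):
        end += 1
    return tuple(tokens[start:end + 1])

# "spokesman John Smith" -> core ("John", "Smith") with nominal modifier "spokesman"
tokens = ["spokesman", "John", "Smith"]
tags = ["NN", "NNP", "NNP"]
mention = Mention(core=extract_core(tokens, tags, head_index=2), modifiers=["spokesman"])
print(mention)
```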
Like Haghighi and Klein (2007), we give our model information about the basic types of pronouns in English. By setting up the base grammar so that each named-entity class prefers to associate to a single type of pronoun, we can also determine the correspondence between our named-entity symbols and the actual named-entity labels­ for the models without pronoun information, this matching is arbitrary and must be inferred during the evaluation process. 3.4 Data Preparation To prepare data for clustering with our system, we first parse it with the parser of Charniak and Johnson (2005). We then annotate pronouns with Charniak and Elsner (2009). For the evaluation set, we use the named entity data from MUC-7. Here, we extract all strings in tags and determine their cores, plus any relevant modifiers, governing prepositions and pronouns, by examining the parse trees. In addition, we supply the system with additional data from the North American News Corpus (NANC). Here we extract all NPs headed by proper nouns. We then process our data by merging all examples with the same core; some merged examples from our dataset are shown in Figure 4. When two examples are merged, we concatenate their lists of 167 modifiers, prepositions and pronouns (capping the length of each list at 20 to keep inference tractable). For instance, "air-india" has no features outside the core, while "wing" has some nominals ("attack" &c.) and some prepositions ("commander-of" &c.). This merging is useful because it allows us to do inference based on types rather than tokens (Goldwater et al., 2006). It is well known that, to interpolate between types and tokens, Hierarchical Dirichlet Processes (including adaptor grammars) require a deeper hierarchy, which slows down inference and reduces the mobility of sampling schemes. By merging examples, we avoid using this more complicated model. Each merged example also represents many examples from the training data, so we can summarize features (such as modifiers) observed throughout a large input corpus while keeping the size of our input file small. To create an input file, we first add all the MUC7 examples. We then draw additional examples from NANC, ranking them by how many features they have, until we reach a specified number (larger datasets take longer, but without enough data, results tend to be poor). 3.5 Inference Our implementation of adaptor grammars is a modified version of the Pitman-Yor adaptor grammar sampler2 , altered to deal with the infinite number of entities. It carries out inference using a Metropoliswithin-Gibbs algorithm (Johnson et al., 2007), in which it repeatedly parses each input line using the CYK algorithm, samples a parse, and proposes this as the new tree. To do Gibbs sampling for our consistencyenforcing model, we would need to sample a parse for an example from the posterior over every possible entity. However, since there are thousands of entities (the number grows roughly linearly with the number of merged examples in the data file), this is not tractable. Instead, we perform a restricted Gibbs sampling search, where we enumerate the posterior only for entities which share a word in their core with the example in question. In fact, if the shared word is very common (occuring in more than .001 of examples), we compute the posterior for that entity only .05 of the time3 . These restrictions mean that we do not compute the exact posterior. 
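To make the restricted search just described concrete, the following sketch enumerates the candidate entities for one merged example, consulting only entities that share a core word and subsampling very common words. It assumes a simple inverted index from core words to entity ids; the data structures and names are ours, purely for illustration.

```python
import random

def candidate_entities(example_core, entity_index, word_example_frac,
                       common_cutoff=0.001, keep_prob=0.05, rng=random):
    """Candidate entities for the restricted Gibbs search described above.
    entity_index: maps a core word to the set of entity ids whose cores contain it.
    word_example_frac: maps a word to the fraction of merged examples containing it.
    Words occurring in more than common_cutoff of examples are only consulted
    keep_prob of the time, mirroring the .001 / .05 thresholds in the text."""
    candidates = set()
    for word in example_core:
        if word_example_frac.get(word, 0.0) > common_cutoff and rng.random() > keep_prob:
            continue  # skip the expensive posterior computation for this common word
        candidates.update(entity_index.get(word, ()))
    return candidates

# e.g. candidate_entities(("bill", "clinton"), {"clinton": {3, 17}}, {"clinton": 0.0004})
```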
In particular, the actual model allows entities to contain examples with no words in common, but our search procedure does not explore these solutions. For our model, inference with the Gibbs algorithm seems to lack mobility, sometimes falling into very poor local minima from which it does not seem to escape. This is because, if there are several references to the same named entity with slightly different core phrases, once they are all assigned to the wrong class, it requires a low-probability series of individual Gibbs moves to pull them out. Similarly, the consistency-enforcing model generally does not fully cluster references to common entities; there are usually several "Bill Clinton" clusters which it would be best to combine, but the sequence of moves that does so is too improbable. The data-merging process described above is one attempt to improve mobility by reducing the number of duplicate examples. In addition, we found that it was a better use of CPU time to run multiple samplers with different initialization than to perform many iterations. In the experiments below, we use 20 chains, initializing with 50 iterations without using consistency, then 50 more using the consistency model, and evaluate the last sample from each. We discard Available at http://www.cog.brown.edu/ mj/Software.htm We ignore the corresponding Hastings correction, as in practice it leads to too many rejections. 3 2 the 10 samples with worst log-likelihood and report the average score for the other 10. 3.6 Parameters In addition to the base PCFG itself, the system requires a few hyperparameter settings: Dirichlet priors for the rule weights of rules in the base PCFG. Pitman-Yor parameters for the adapted nonterminals are sampled from vague priors using a slice sampler (Neal, 2003). The prior over core words was set to the uniform distribution (Dirichlet 1.0) and the prior for all modifiers, prepositions and pronouns to a sparse value of .01. Beyond setting these parameters to a priori reasonable values, we did not optimize them. To encourage the system to learn that some lexical symbols were more common than others, we set a sparse prior over expansions to symbols4 . There are two really important hyperparameters: an extremely biased prior on class-to-pronountype probabilities (1000 for the desired class, .0001 for everything else), and a prior of .0001 for the Word Word Words rule to discourage symbols expanding to multiword strings. 4 Experiments We performed experiments on the named entity dataset from MUC-7 (Chinchor, 1998), using the training set as development data and the formal test set as test data. The development set has 4936 named entities, of which 1575 (31.9%) are locations, 2096 (42.5%) are organizations and 1265 (25.6%) people. The test set has 4069 named entities, 1321 (32.5%) locations, 1862 (45.8%) organizations and 876 (21.5%) people5 . We use a baseline which gives all named entities the same label; this label is mapped to "organization". In most of our experiments, we use an input file of 40000 lines. For dev experiments, the labeled data contributes 1585 merged examples; for test experiments, only 1320. The remaining lines are derived Expansions that used only the middle three symbols NE 1,2,3 got a prior of .005, expansions whose outermost symx bol was NE 0,4 got .0025, and so forth. This is not so imporx tant for our final system, which has only 5 symbols, but was designed during development to handle systems with up to 10 symbols. 
5 10 entities are labeled location|organization; since this fraction of the dataset is insignificant we score them as wrong. 4 168 Model Baseline (All Org) Core NPs (no consistency) Core NPs (consistency) Context Features Pronouns Accuracy 42.5 45.5 48.5 83.3 87.1 Table 1: Accuracy of various models on development data. Model Baseline (All Org) Pronouns Accuracy 45.8 86.0 Table 2: Accuracy of the final model on test data. using the process described in section 3.4 from 5 million words of NANC. To evaluate our results, we map our three induced labels to their corresponding gold label, then count the overlap; as stated, this mapping is predictably encoded in the prior when we use the pronoun features. Our experimental results are shown in Table 1. All models perform above baseline, and all features contribute significantly to the final result. Test results for our final model are shown in Table 2. A confusion matrix for our highest-likelihood test solution is shown as Figure 5. The highest confusion class is "organization", which is confused most often with "location" but also with "person". "location" is likewise confused with "organization". "person" is the easiest class to identify­ we believe this explains the slight decline in performance from dev to test, since dev has proportionally more people. Our mapping from grammar symbols to words appears in Table 3; the learned prepositional and modifier information is in Table 4. Overall the results are good, but not perfect; for instance, the P ers states are mostly interpretable as a sequence of title - first name - middle name or initial - last name loc 1187 223 36 org 97 1517 20 per 37 122 820 last name or post-title (similar to (Charniak, 2001)). The organization symbols tend to put nationalities and other modifiers first, and end with institutional types like "inc." or "center", although there is a similar (but smaller) cluster of types at Org 2 , suggesting the system has incorrectly found two analyses for these names. Location symbols seem to put entities with a single, non-analyzable name into Loc2 , and use symbols 0, 1 and 3 for compound names. Loc4 has been recruited for time expressions, since our NANC dataset includes many of these, but we failed to account for them in the model. Since they appear in a single class here, we are optimistic that they could be clustered separately if another class and some appropriate features were added to the prior. Some errors do appear ("supreme court" and "house" as locations, "minister" and "chairman" as middle names, "newt gingrich" as a multiword phrase). The table also reveals an unforeseen issue with the parser: it tends to analyze the dateline beginning a news story along with the following NP ("WASHINGTON Bill Clinton said..."). Thus common datelines ("washington", "new york" and "los angeles") appear in state 0 for each class. 5 Discussion LOC ORG PER Figure 5: Confusion matrix for highest-likelihood test run. Gold labels in CAPS, induced labels italicized. Organizations are most frequently confused. As stated above, we aim to build an unsupervised generative model for named entity clustering, since such a model could be integrated with unsupervised coreference models like Haghighi and Klein (2007) for joint inference. To our knowledge, the closest existing system to such a model is the EM mixture model used as a baseline in Collins and Singer (1999). Our system improves on this EM system in several ways. 
While they initialize with minimal supervision in the form of 7 seed heuristics, ours is fully unsupervised. Their results cover only examples which have a prepositional or modifier feature; we adopt these features from their work, but label all entities in the predefined test set, including those that appear without these features. Finally, as discussed, we find the "person" category to be the easiest to label. 33% of the test items in Collins and Singer (1999) were people, as opposed to 21% of ours. However, even without the pronoun features, that is, using the same feature set, our system scores equivalently to the EM model, at 83% (this score is 169 P ers0 rep. sen. (256) washington dr. los angeles senate house new york president republican Org 0 american (137) washington washington the national first los angeles new royal british california Loc0 washington (92) los angeles south north old grand black west (22) east (21) haiti P ers1 john (767) robert (495) david michael james president richard william (317) sen. (236) george Org 1 national american (182) new york international (136) public united house federal home world Loc1 the st. new national (69) east (65) mount fort west (56) lake great P ers2 minister j. john (242) l. chairman e. m. william (173) robert (155) r. Org 2 university inc. (166) corp. (156) college institute (87) group hospital museum press international (61) Loc2 texas new york washington (22) united states baltimore california capitol christmas bosnia san juan P ers3 brown smith (97) b johnson newt gingrich king miller kennedy martin davis Org 3 research medical news health services communications development policy affairs defense Loc3 county city beach valley island river (71) park bay house supreme court P ers4 jr. a smith (111) iii williams wilson brown clinton simpson b Org 4 association center inc. (257) corp. (252) co. committee institute council fund act Loc4 monday thursday river (57) tuesday wednesday hotel friday hall center building Table 3: 10 most common words for each grammar symbol. Words which appear in multiple places have observed counts indicated in parentheses. 170 Pers-gov according-to (1044) played-by directed-by led-by meeting-with from-to met-with letter-to secretary-of known-as Pers-mod director spokesman leader presid[ent] attorney candid[ate] lawyer chairman counsel actor Org-gov president-of chairman-of director-of according-to (786) professor-at head-of department-of member-of members-of spokesman-for Org-mod $ giant opposit[e] group pp compan[y] journal firm state agenc[y] Loc-gov university-of city-of from-to town-of state-of center-in out-of is-in house-of known-as Loc-mod calif. newspap[er] state downtown n.y. warrant va. fla. p.m. itself Table 4: 10 most common prepositional and modifier features for each named entity class. Modifiers were Porterstemmed; for clarity a reconstructed stem is shown in brackets. on dev, 25% people). When the pronoun features are added, our system's performance increases to 86%, significantly better than the EM system. One motivation for our use of a structured model which defined a notion of consistency between entities was that it might allow the construction of an unsupervised alias detector. According to the model, two entities are consistent if they are in the same class, and do not have conflicting assignments of words to lexical symbols. Results here are at best equivocal. The model is reasonable at passing basic tests­ "Dr. Seuss" is not consistent with "Dr. Strangelove", "Dr. 
Quinn" etc, despite their shared title, because the model identifies the second element of each as a last name. Also correctly, "Dr. William F. Gibson" is judged consistent with "Dr. Gibson" and "Gibson" despite the missing elements. But mistakes are commonplace. In the "Gibson" case, the string "William F." is misanalyzed as a multiword string, making the name inconsistent with "William Gibson"; this is probably the result of a search error, which, as we explained, Gibbs sampling is unlikely to correct. In other cases, the system clusters a family group together under a single "entity" nonterminal by forcing their first names into inappropriate states, for instance assigning P ers1 Bruce, P ers2 Ellen, P ers3 Jarvis, where P ers2 (usually a middle name) actually contains the first name of a different individual. To improve this aspect of our system, we might incorporate namespecific features into the prior, such as abbreviations and the concept of a family name. The most critical improvement, however, would be integration with a 171 generative coreference system, since the document context probably provides hints about which entities are and are not coreferent. The other key issue with our system is inference. Currently we are extremely vulnerable to falling into local minima, since the complex structure of the model can easily lock a small group of examples into a poor configuration. (The "William F. Gibson" case above seems to be one of these.) In addition to the block sampler used by Bhattacharya and Getoor (2006), we are investigating general-purpose splitmerge samplers (Jain and Neal, 2000) and the permutation sampler (Liang et al., 2007a). One interesting question is how well these samplers perform when faced with thousands of clusters (entities). Despite these issues, we clearly show that it is possible to build a good model of named entity class while retaining compatibility with generative systems and without supervision. In addition, we do a reasonable job learning the latent structure of names in each named entity class. Our system improves over the latent named-entity tagging in Haghighi and Klein (2007), from 61% to 87%. This suggests that it should indeed be possible to improve on their coreference results without using a supervised named-entity model. How much improvement is possible in practice, and whether joint inference can also improve named-entity performance, remain interesting questions for future work. Acknowledgements We thank three reviewers for their comments, and NSF for support via grants 0544127 and 0631667. References Indrajit Bhattacharya and Lise Getoor. 2006. A latent dirichlet model for unsupervised entity resolution. In The SIAM International Conference on Data Mining (SIAM-SDM), Bethesda, MD, USA. William J. Black, Fabio Rinaldi, and David Mowatt. 1998. Facile: Description of the ne system used for muc-7. In In Proceedings of the 7th Message Understanding Conference. Eugene Charniak and Micha Elsner. 2009. EM works for pronoun anaphora resolution. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL-09), Athens, Greece. Eugene Charniak and Mark Johnson. 2005. Coarse-tofine n-best parsing and MaxEnt discriminative reranking. In Proc. of the 2005 Meeting of the Assoc. for Computational Linguistics (ACL), pages 173­180. Eugene Charniak. 2001. Unsupervised learning of name structure from coreference data. In NAACL-01. Nancy A. Chinchor. 1998. 
Proceedings of the Seventh Message Understanding Conference (MUC-7) named entity task definition. In Proceedings of the Seventh Message Understanding Conference (MUC7), page 21 pages, Fairfax, VA, April. version 3.5, http://www.itl.nist.gov/iaui/894.02/related projects/muc/. Michael Collins and Yorav Singer. 1999. Unsupervised models for named entity classification. In Proceedings of EMNLP 99. Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of EMNLP-CoNLL, pages 708­716, Prague, Czech Republic, June. Association for Computational Linguistics. Oren Etzioni, Michael Cafarella, Doug Downey, Ana maria Popescu, Tal Shaked, Stephen Soderl, Daniel S. Weld, and Er Yates. 2005. Unsupervised namedentity extraction from the web: An experimental study. Artificial Intelligence, 165:91­134. Richard Evans. 2003. A framework for named entity recognition in the open domain. In Proceedings of Recent Advances in Natural Language Processing (RANLP-2003), pages 137 ­ 144, Borovetz, Bulgaria, September. Sharon Goldwater, Tom Griffiths, and Mark Johnson. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems (NIPS) 18. Aria Haghighi and Dan Klein. 2007. Unsupervised coreference resolution in a nonparametric Bayesian model. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 848­855. Association for Computational Linguistics. Sonia Jain and Radford M. Neal. 2000. A split-merge markov chain monte carlo procedure for the dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13:158­182. Mark Johnson, Tom L. Griffiths, and Sharon Goldwater. 2007. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Proceedings of NAACL 2007. Xin Li, Paul Morie, and Dan Roth. 2004. Identification and tracing of ambiguous names: Discriminative and generative approaches. In AAAI, pages 419­424. Percy Liang, Michael I. Jordan, and Ben Taskar. 2007a. A permutation-augmented sampler for DP mixture models. In Proceedings of ICML, pages 545­552, New York, NY, USA. ACM. Percy Liang, Slav Petrov, Michael Jordan, and Dan Klein. 2007b. The infinite PCFG using hierarchical Dirichlet processes. In Proceedings of EMNLP-CoNLL, pages 688­697, Prague, Czech Republic, June. Association for Computational Linguistics. A. Mikheev, C. Grover, and M. Moens. 1998. Description of the LTG System Used for MUC-7. In Proceedings of the 7th Message Understanding Conference (MUC-7), Fairfax, Virginia. David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Journal of Linguisticae Investigationes, 30(1). Radford M. Neal. 2003. Slice sampling. Annals of Statistics, 31:705­767. Vincent Ng. 2008. Unsupervised models for coreference resolution. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 640­649, Honolulu, Hawaii, October. Association for Computational Linguistics. Marius Pasca. 2004. Acquisition of categorized named entities for web search. In CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 137­145, New York, NY, USA. ACM. Jim Pitman and Marc Yor. 1997. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Ann. Probab., 25:855­900. Ellen Riloff and Rosie Jones. 1999. Learning dictionaries for information extraction by multi-level bootstrapping. 
In Proceedings of the Sixteenth National Conference on Artificial Intelligence, pages 472­479. AAAI. Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Languageindependent named entity recognition. In Walter Daelemans and Miles Osborne, editors, Proceedings of CoNLL-2003, pages 142­147. Edmonton, Canada. Olga Uryupina. 2004. Evaluating name-matching for coreference resolution. In Proceedings of LREC 04, Lisbon. 172 Hierarchical Dirichlet Trees for Information Retrieval Gholamreza Haffari School of Computing Sciences Simon Fraser University ghaffar1@cs.sfu.ca Yee Whye Teh Gatsby Computational Neuroscience University College London ywteh@gatsby.ucl.ac.uk Abstract We propose a principled probabilisitc framework which uses trees over the vocabulary to capture similarities among terms in an information retrieval setting. This allows the retrieval of documents based not just on occurrences of specific query terms, but also on similarities between terms (an effect similar to query expansion). Additionally our principled generative model exhibits an effect similar to inverse document frequency. We give encouraging experimental evidence of the superiority of the hierarchical Dirichlet tree compared to standard baselines. 1 Introduction Information retrieval (IR) is the task of retrieving, given a query, the documents relevant to the user from a large quantity of documents (Salton and McGill, 1983). IR has become very important in recent years, with the proliferation of large quantities of documents on the world wide web. Many IR systems are based on some relevance score function R(j, q) which returns the relevance of document j to query q. Examples of such relevance score functions include term frequency-inverse document frequency (tf-idf) and Okapi BM25 (Robertson et al., 1992). Besides the effect that documents containing more query terms should be more relevant (term frequency), the main effect that many relevance scores try to capture is that of inverse document frequency: the importance of a term is inversely related to the number of documents that it appears in, i.e. the popularity of the term. This is because popular 173 terms, e.g. common and stop words, are often uninformative, while rare terms are often very informative. Another important effect is that related or co-occurring terms are often useful in determining the relevance of documents. Because most relevance scores do not capture this effect, IR systems resort to techniques like query expansion which includes synonyms and other morphological forms of the original query terms in order to improve retrieval results; e.g. (Riezler et al., 2007; Metzler and Croft, 2007). In this paper we explore a probabilistic model for IR that simultaneously handles both effects in a principled manner. It builds upon the work of (Cowans, 2004) who proposed a hierarchical Dirichlet document model. In this model, each document is modeled using a multinomial distribution (making the bag-of-words assumption) whose parameters are given Dirichlet priors. The common mean of the Dirichlet priors is itself assumed random and given a Dirichlet hyperprior. (Cowans, 2004) showed that the shared mean parameter induces sharing of information across documents in the corpus, and leads to an inverse document frequency effect. We generalize the model of (Cowans, 2004) by replacing the Dirichlet distributions with Dirichlet tree distributions (Minka, 2003), thus we call our model the hierarchical Dirichlet tree. 
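Before the model details, a minimal sketch of the kind of vocabulary tree the hierarchical Dirichlet tree assumes may help: internal nodes carry a multinomial over their children, leaves are vocabulary terms, and a word is generated by a root-to-leaf walk (as described in Section 3.1 below). The class and function names are ours, for illustration only.

```python
from dataclasses import dataclass, field
import random

@dataclass
class TreeNode:
    """A node in the vocabulary tree: internal nodes hold a multinomial over
    their children, leaves correspond to vocabulary terms."""
    term: str = None                                   # set for leaves only
    children: list = field(default_factory=list)
    child_probs: list = field(default_factory=list)

def draw_word(node, rng=random):
    """Draw one word by walking from the root to a leaf, choosing a child at
    each internal node according to that node's multinomial."""
    while node.term is None:
        node = rng.choices(node.children, weights=node.child_probs, k=1)[0]
    return node.term

# A toy two-level tree in which related terms share a parent node.
tech = TreeNode(children=[TreeNode("computer"), TreeNode("software")], child_probs=[0.5, 0.5])
root = TreeNode(children=[tech, TreeNode("retrieval")], child_probs=[0.7, 0.3])
print(draw_word(root))
```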
Related terms are placed close by in the vocabulary tree, allowing the model to take this knowledge into account when determining document relevance. This makes it unnecessary to use ad-hoc query expansion methods, as related words such as synonyms will be taken into account by the retrieval rule. The structure of the tree is learned from data in an unsupervised fashion, using a variety of agglomerative clustering techniques.

We review the hierarchical Dirichlet document (HDD) model in section 2, and present our proposed hierarchical Dirichlet tree (HDT) document model in section 3. We describe three algorithms for constructing the vocabulary tree in section 4, and give encouraging experimental evidence of the superiority of the hierarchical Dirichlet tree compared to standard baselines in section 5. We conclude the paper in section 6.

[Figure 1: (a) The graphical model representation of the hierarchical Dirichlet document model. (b) The global tree and local trees in the hierarchical Dirichlet tree document model. Triangles stand for trees with the same structure, but different parameters at each node. The generation of words in each document is not shown.]

2 Hierarchical Dirichlet Document Model

The probabilistic approach to IR assumes that each document in a collection can be modeled probabilistically. Given a query q, it is further assumed that relevant documents j are those with highest generative probability p(q|j) for the query. Thus given q the relevance score is R(j, q) = p(q|j) and the documents with highest relevance are returned.

Assume that each document is a bag of words, with document j modeled as a multinomial distribution over the words in j. Let V be the terms in the vocabulary, n_{jw} be the number of occurrences of term w \in V in document j, and \theta^{flat}_{jw} be the probability of w occurring in document j (the superscript "flat" denotes a flat Dirichlet as opposed to our proposed Dirichlet tree). (Cowans, 2004) assumes the following hierarchical Bayesian model for the document collection:

  \theta^{flat}_0 = (\theta^{flat}_{0w})_{w \in V} \sim \mathrm{Dirichlet}(\alpha u)
  \theta^{flat}_j = (\theta^{flat}_{jw})_{w \in V} \sim \mathrm{Dirichlet}(\beta \theta^{flat}_0)    (1)
  n_j = (n_{jw})_{w \in V} \sim \mathrm{Multinomial}(\theta^{flat}_j)

In the above, bold face a = (a_w)_{w \in V} means that a is a vector with |V| entries indexed by w \in V, and u is a uniform distribution over V. The generative process is as follows (Figure 1(a)). First a vector \theta^{flat}_0 is drawn from a symmetric Dirichlet distribution with concentration parameter \alpha. Then we draw the parameters \theta^{flat}_j for each document j from a common Dirichlet distribution with mean \theta^{flat}_0 and concentration parameter \beta. Finally, the term frequencies of the document are drawn from a multinomial distribution with parameters \theta^{flat}_j.

The insight of (Cowans, 2004) is that because the common mean parameter \theta^{flat}_0 is random, it induces dependencies across the document models in the collection, and this in turn is the mechanism for information sharing among documents. (Cowans, 2004) proposed a good estimate of \theta^{flat}_0:

  \theta^{flat}_{0w} = \frac{\alpha/|V| + n_{0w}}{\alpha + \sum_{w' \in V} n_{0w'}}    (2)

where n_{0w} is simply the number of documents containing term w, i.e. the document frequency. Integrating out the document parameters \theta^{flat}_j, we see that the probability of query q being generated from document j is:

  p(q|j) = \prod_{x \in q} \frac{\beta \theta^{flat}_{0x} + n_{jx}}{\beta + \sum_{w \in V} n_{jw}}
         = \mathrm{Const} \cdot \prod_{x \in q} \frac{\mathrm{Const} + \frac{n_{jx}}{\alpha/|V| + n_{0x}}}{\beta + \sum_{w \in V} n_{jw}}    (3)

where Const denotes terms not depending on j. We see that n_{jx} is the term frequency, its denominator \alpha/|V| + n_{0x} is an inverse document frequency factor, and \beta + \sum_{w \in V} n_{jw} normalizes for document length. The inverse document frequency factor is directly related to the shared mean parameter, in that popular terms x will have a high \theta^{flat}_{0x} value, causing all documents to assign higher probability to x and down-weighting the term frequency. This effect will be inherited by our model in the next section.
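As a concrete reading of Equations (2) and (3), the sketch below scores one document for a query under the flat HDD model. It is a direct transcription of the formulas above under simple assumptions (term counts and document frequencies stored in dictionaries); it is not the authors' code, and the function and argument names are ours.

```python
def hdd_flat_score(query, doc_counts, doc_freq, doc_freq_total, alpha, beta, vocab_size):
    """Query likelihood p(q|j) under the flat HDD model (Eqs. 2-3).
    doc_counts: term counts n_jw for document j.
    doc_freq: document frequencies n_0w; doc_freq_total: their sum over V."""
    doc_len = sum(doc_counts.values())
    score = 1.0
    for x in query:
        theta0_x = (alpha / vocab_size + doc_freq.get(x, 0)) / (alpha + doc_freq_total)  # Eq. (2)
        score *= (beta * theta0_x + doc_counts.get(x, 0)) / (beta + doc_len)             # Eq. (3)
    return score

# Documents are then ranked by decreasing score, e.g.
# ranked = sorted(docs, key=lambda j: hdd_flat_score(q, docs[j], df, sum(df.values()),
#                                                    alpha=1.0, beta=10.0, vocab_size=len(V)),
#                 reverse=True)
```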
3 Hierarchical Dirichlet Trees

Apart from the constraint that the parameters should sum to one, the Dirichlet priors in the HDD model do not impose any dependency among the parameters of the resulting multinomial. In other words, the document models cannot capture the notion that related terms tend to co-occur together. For example, this model cannot incorporate the knowledge that if the word 'computer' is seen in a document, it is likely to observe the word 'software' in the same document. We relax the independence assumption of the Dirichlet distribution by using Dirichlet tree distributions (Minka, 2003), which can capture some dependencies among the resulting parameters. This allows relationships among terms to be modeled, and we will see that it improves retrieval performance.

3.1 Model

Let us assume that we have a tree over the vocabulary whose leaves correspond to vocabulary terms. Each internal node k of the tree has a multinomial distribution over its children C(k). Words are drawn by starting at the root of the tree, recursively picking a child l \in C(k) whenever we are in an internal node k, until we reach a leaf of the tree, which corresponds to a vocabulary term. The Dirichlet tree distribution is the product of Dirichlet distributions placed over the child probabilities of each internal node, and serves as a (dependent) prior over the parameters of multinomial distributions over the vocabulary (the leaves). Our model generalizes the HDD model by replacing the Dirichlet distributions in (1) by Dirichlet tree distributions. At each internal node k, define a hierarchical Dirichlet prior over the choice of the children:

  \theta_{0k} = (\theta_{0l})_{l \in C(k)} \sim \mathrm{Dirichlet}(\alpha_k u_k)    (4)
  \theta_{jk} = (\theta_{jl})_{l \in C(k)} \sim \mathrm{Dirichlet}(\beta_k \theta_{0k})

where u_k is a uniform distribution over the children of node k, and each internal node has its own hyperparameters \alpha_k and \beta_k. \theta_{jl} is the probability of choosing child l if we are at internal node k. If the tree is degenerate, with just one internal node (the root) and all leaves direct children of the root, we recover the "flat" HDD model in the previous section. We call our model the hierarchical Dirichlet tree (HDT).

3.2 Inference and Learning

Given a term, the path from the root to the corresponding leaf is unique. Thus given the term frequencies n_j of document j as defined in (1), the number of times n_{jl} that child l \in C(k) was picked at node k is known and fixed. The probability of all words in document j, given the parameters, is then a product of multinomial probabilities over internal nodes k:

  p(n_j | \{\theta_{jk}\}) = \prod_k \frac{n_{jk}!}{\prod_{l \in C(k)} n_{jl}!} \prod_{l \in C(k)} \theta_{jl}^{n_{jl}}    (5)

The probability of the documents, integrating out the \theta_{jk}'s, is:

  p(\{n_j\} | \{\theta_{0k}\}) = \prod_j \prod_k \frac{n_{jk}!}{\prod_{l \in C(k)} n_{jl}!} \cdot \frac{\Gamma(\beta_k)}{\Gamma(\beta_k + n_{jk})} \prod_{l \in C(k)} \frac{\Gamma(\beta_k \theta_{0l} + n_{jl})}{\Gamma(\beta_k \theta_{0l})}    (6)

The probability of a query q under document j, i.e. the relevance score, follows from (3):

  p(q|j) = \prod_{x \in q} \prod_{(kl)} \frac{\beta_k \theta_{0l} + n_{jl}}{\beta_k + n_{jk}}    (7)

where the second product is over pairs (kl) such that k is a parent of l on the path from the root to x.

The hierarchical Dirichlet tree model we proposed has a large number of parameters and hyperparameters (even after integrating out the \theta_{jk}'s), since the vocabulary trees we will consider later typically have large numbers of internal nodes. This over-flexibility might lead to overfitting or to parameter regimes that do not aid in the actual task of IR. To avoid both issues, we constrain the hierarchical Dirichlet tree to be centered over the flat hierarchical Dirichlet document model, and allow it to learn only the \beta_k hyperparameters, integrating out the \theta_{jk} parameters. We set \{\theta_{0k}\}, the hyperparameters of the global tree, so that it induces the same distribution over vocabulary terms as \theta^{flat}_0:

  \theta_{0l} = \theta^{flat}_{0l} for leaves l,    \theta_{0k} = \sum_{l \in C(k)} \theta_{0l} for internal nodes k    (8)

The hyperparameters \beta_k of the local trees are estimated using maximum a posteriori learning with likelihood given by (6), and a gamma prior with informative parameters. The density function of a Gamma(a, b) distribution is

  g(x; a, b) = \frac{x^{a-1} b^a e^{-bx}}{\Gamma(a)}

where the mode occurs at x = (a - 1)/b. We set the mode of the prior such that the hierarchical Dirichlet tree reduces to the hierarchical Dirichlet document model at these values:

  \beta^{flat}_l = \beta \theta^{flat}_{0l},    \beta^{flat}_k = \sum_{l \in C(k)} \beta^{flat}_l,    \beta_k \sim \mathrm{Gamma}(b \beta^{flat}_k + 1, b)    (9)

where b > 0 is an inverse scale hyperparameter to be tuned, with large values giving a sharp peak around \beta^{flat}_k. We tried a few values of b (of the form 10^i for i \in \{-2, -1, 0, 1\}) and have found that the results we report in the next section are not sensitive to b. This prior is constructed such that if there is insufficient information in (6) the MAP value will simply default back to the hierarchical Dirichlet document model. We used LBFGS, a gradient-based optimization method, to find the MAP values (we used a C++ re-implementation of Jorge Nocedal's LBFGS library (Nocedal, 1980) from the ALGLIB website: http://www.alglib.net). The gradient of the likelihood part of the objective function (6) is:

  \frac{\partial}{\partial \beta_k} \log p(\{n_j\} | \{\theta_{0k}\}) = \sum_j \Big[ \psi(\beta_k) - \psi(\beta_k + n_{jk}) + \sum_{l \in C(k)} \theta_{0l} \big( \psi(\beta_k \theta_{0l} + n_{jl}) - \psi(\beta_k \theta_{0l}) \big) \Big]

where \psi(x) := \partial \log \Gamma(x) / \partial x is the digamma function. Because each \beta_k can be optimized separately, the optimization is very fast (approximately 15-30 minutes in the experiments to follow, on a Linux machine with 1.8 GHz CPU speed).

4 Vocabulary Tree Structure Learning

The structure of the vocabulary tree plays an important role in the quality of the HDT document model, since it encapsulates the similarities among words captured by the model. In this paper we explored using trees learned in an unsupervised fashion from the training corpus. The three methods are all agglomerative clustering algorithms (Duda et al., 2000) with different similarity functions. Initially each vocabulary word is placed in its own cluster; each iteration of the algorithm finds the pair of clusters with highest similarity and merges them, continuing until only one cluster is left. The sequence of merges determines a binary tree with vocabulary words as its leaves. Using a heap data structure, this basic agglomerative clustering algorithm requires O(n^2 log(n) + s n^2) computations, where n is the size of the vocabulary and s is the amount of computation needed to compute the similarity between two clusters. Typically the vocabulary size n is large; to speed up the algorithm, we use a greedy version described in Algorithm 1, which restricts the number of cluster candidates to at most m << n. This greedy version is faster, with complexity O(nm(log m + s)). In the experiments we used m = 500.

Algorithm 1 Greedy Agglomerative Clustering
1: Place m words into m singleton clusters
2: repeat
3:   Merge the two clusters with highest similarity, resulting in one less cluster
4:   If there still are unincluded words, pick one and place it in a singleton cluster, resulting in one more cluster
5: until all words have been included and there is only one cluster left

Distributional clustering (Dcluster) (Pereira et al., 1993) measures similarity among words in terms of the similarity among their local contexts. Each word is represented by the frequencies of various words in a window around each occurrence of the word. The similarity between two words is computed to be a symmetrized KL divergence between the distributions over neighboring words associated with the two words. For a cluster of words, the neighboring words are the union of those associated with each word in the cluster. Dcluster has been used extensively in text classification (Baker and McCallum, 1998). Probabilistic hierarchical clustering (Pcluster)
The sequence of merges determines a binary tree with vocabulary words as its leaves. Using a heap data structure, this basic agglomerative clustering algorithm requires O(n2 log(n) + sn2 ) computations where n is the size of the vocabulary and s is the amount of computation needed to compute the similarity between two clusters. Typically the vocabulary size n is large; to speed up the algorithm, we use a greedy version described in Algorithm 1 which restricts the number of cluster candidates to at most m n. This greedy version is faster with complexity O(nm(log m + s)). In the experiments we used m = 500. Distributional clustering (Dcluster) (Pereira et al., 1993) measures similarity among words in terms of the similarity among their local contexts. Each word is represented by the frequencies of various words in a window around each occurrence of the word. The similarity between two words is computed to be a symmetrized KL divergence between the distributions over neighboring words associated with the two words. For a cluster of words the neighboring words are the union of those associated with each word in the cluster. Dcluster has been used extensively in text classification (Baker and McCallum, 1998). Probabilistic hierarchical clustering (Pcluster) 176 (Friedman, 2003). Dcluster associates each word with its local context, as a result it captures both semantic and syntactic relationships among words. Pcluster captures more relevant semantic relationships by instead associating each word with the documents in which it appears. Specifically, each word is associated with a binary vector indexed by documents in the corpus, where a 1 means the word appears in the corresponding document. Pcluster models a cluster of words probabilistically, with the binary vectors being iid draws from a product of Bernoulli distributions. The similarity of two clusters c1 and c2 of words is P (c1 c2 )/P (c1 )P (c2 ), i.e. two clusters of words are similar if their union can be effectively modeled using one cluster, relative to modeling each separately. Conjugate beta priors are placed over the parameters of the Bernoulli distributions and integrated out so that the similarity scores are comparable. Brown's algorithm (Bcluster) (Brown et al., 1990) was originally proposed to build class-based language models. In the 2-gram case, words are clustered such that the class of the previous word is most predictive of the class of the current word. Thus the similarity between two clusters of words is defined to be the resulting mutual information between adjacent classes corrresponding to a sequence of words. 4.1 Operations to Simplify Trees a b Figure 2: (root) = 2, while (v) = 1 for shaded vertices v. Contracting a and b results in both child of b being direct children of a while b is removed. Trees constructed using the agglomerative hierarchical clustering algorithms described in this section suffer from a few drawbacks. Firstly, because they are binary trees they have large numbers of internal nodes. Secondly, many internal nodes are simply not informative in that the two clusters of words below a node are indistinguishable. Thirdly, Pcluster and Dcluster tend to produce long chain-like branches which significantly slows down the computation of the relevance score. To address these issues, we considered operations to simplify trees by contracting internal edges of the tree while preserving as much of the word relationship information as possible. 
Let L be the set of tree leaves and (a) be the distance from node or edge a to the leaves: (a) := min #{edges between a and l} lL In the experiments we considered either contracting edges3 close to the leaves (a) = 1 (thus removing many of the long branches described above), or edges further up the tree (a) 2 (preserving the informative subtrees closer to the leaves while removing many internal nodes). See Figure 2. (Miller et al., 2004) cut the BCluster tree at a certain depth k to simplify the tree, meaning every leaf descending from a particular internal node at level k is made an immediate child of that node. They use the tree to get extra features for a discriminative model to tackle the problem of sparsity--the features obtained from the new tree do not suffer from sparsity since each node has several words as its leaves. This technique did not work well for our application so we will not report results using it in our experiments. 5 Experiments In this section we present experimental results on two IR datasets: Cranfield and Medline4 . The Cranfield dataset consists of 1,400 documents and 225 queries; its vocabulary size after stemming and removing stop words is 4,227. The Medline dataset contains 1,033 documents and 30 queries with the vocabulary size of 8,800 after stemming and removing stop words. We compare HDT with the flat HDD model and Okapi BM25 (Robertson et al., 1992). Since one of our motivations has been to Contracting an edge means removing the edge and the adjacent child node and connecting the grandchildren to the parent. 4 Both datasets can be downloaded from http://www.dcs.gla.ac.uk/idom/ir resources/test collections. 3 (10) 177 Tree BCluster BC contract 2 BC contract = 1 DCluster DC contract 2 DC contract = 1 PCluster PC contract 2 PC contract = 1 flat model BM25 BM25QueryExp Depth Statistics Cranfield Medline avg / max total avg / max total 16.7 / 24 4226 16.4 / 22 8799 6.2 / 16 3711 5.3 / 14 7473 16.1 / 23 3702 15.8 / 22 7672 41.2 / 194 4226 38.1 / 176 8799 2.3 / 8 2469 3.3 / 9 5091 40.9 / 194 3648 38.1 / 176 8799 50.2 / 345 4226 37.1 / 561 8799 35.2 / 318 3741 20.4 / 514 7280 33.6 / 345 2246 34.1 / 561 4209 1/1 1 1/1 1 ­ ­ ­ ­ ­ ­ ­ ­ Performance Cranfield Medline avg-pr top10-pr avg-pr top10-pr 0.2675 0.3218 0.2131 0.6433 0.2685 0.3147 0.2079 0.6533 0.2685 0.3204 0.1975 0.6400 0.2552 0.3120 0.1906 0.6300 0.2555 0.3156 0.1906 0.6167 0.2597 0.3129 0.1848 0.6300 0.2613 0.3231 0.1681 0.6633 0.2624 0.3213 0.1792 0.6767 0.2588 0.3240 0.1880 0.6633 0.2506 0.3089 0.1381 0.6133 0.2566 0.3124 0.1804 0.6567 0.2097 0.3191 0.2121 0.7366 Table 1: Average precision and Top-10 precision scores of HDT with different trees versus flat model and BM25. The statistics for each tree shows its average/maximum depth of its leaf nodes as well as the number of its total internal nodes. The bold numbers highlight the best results in the corresponding columns. get away from query expansion, we also compare against Okapi BM25 with query expansion. The new terms to expand each query are chosen based on Robertson-Sparck Jones weights (Robertson and Sparck Jones, 1976) from the pseudo relevant documents. The comparison criteria are (i) top-10 precision, and (ii) average precision. 5.1 HDT vs Baselines All the hierarchical clustering algorithms mentioned in section 4 are used to generate trees, each of which is further post-processed by tree simplification operators described in section 4.1. 
We consider (i) contracting nodes at higher levels of the hierarchy ( 2), and (ii) contracting nodes right above the leaves ( = 1). The statistics of the trees before and after postprocessing are shown in Table 1. Roughly, the Dcluster and BCluster trees do not have long chains with leaves hanging directly off them, which is why their average depths are reduced significantly by the 2 simplification, but not by the = 1 simplification. The converse is true for Pcluster: the trees have many chains with leaves hanging directly off them, which is why average depth is not reduced as much as the previous trees based on the 2 simplification. However the average depth is still reduced significantly compared to the original trees. Table 1 presents the performance of HDT with 178 different trees against the baselines in terms of the top-10 and average precision (we have bold faced the performance values which are the maximum of each column). HDT with every tree outperforms significantly the flat model in both datasets. More specifically, HDT with (original) BCluster and PCluster trees significantly outperforms the three baselines in terms of both performance measure for the Cranfield. Similar trends are observed on the Medline except here the baseline Okapi BM25 with query expansion is pretty strong5 , which is still outperformed by HDT with BCluster tree. To further highlight the differences among the methods, we have shown the precision at particular recall points on Medline dataset in Figure 4 for HDT with PCluster tree vs the baselines. As the recall increases, the precision of the PCluster tree significantly outperforms the flat model and BM25. We attribute this to the ability of PCluster tree to give high scores to documents which have words relevant to a query word (an effect similar to query expansion). 5.2 Analysis It is interesting to contrast the learned k 's for each of the clustering methods. These k 's impose corNote that we tuned the parameters of the baselines BM25 with/without query expansion with respect to their performance on the actual retrieval task, which in a sense makes them appear better than they should. 5 10 3 BCluster 10 3 DCluster 10 3 PCluster 10 2 10 2 10 2 k k 10 10 0 0 10 -1 k 10 0 10 1 10 1 10 1 10 -1 10 -1 10 -1 10 0 10 0k parent(k) 1 10 2 10 3 10 -2 10 -2 10 -1 10 10 0k parent(k) 0 1 10 2 10 3 10 -2 10 -2 10 -1 10 10 0k parent(k) 0 1 10 2 10 3 Figure 3: The plots showing the contribution of internal nodes in trees constructed by the three clustering algorithms for the Cranfield dataset. In each plot, a point represent an internal node showing a positive exponent in the node's contribution (i.e. positive correlation among its children) if the point is below x = y line. From left to the right plots, the fraction of nodes below the line is 0.9044, 0.7977, and 0.3344 for a total of 4,226 internal nodes. Precision at particular recall points 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 PCluster Flat model BM25 BM25 Query Expansion 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 Figure 4: The precision of all methods at particular recall points for the Medline dataset. relations on the probabilities of the children under k in an interesting fashion. In particular, if we compare k to 0k parent(k) , then a larger value of k implies that the probabilities of picking one of the children of k (from among all nodes) are positively correlated, while a smaller value of k implies negative correlation. 
Roughly speaking, this is because drawn values of jl for l C(k) are more likely to be closer to uniform (relative to the flat Dirichlet) thus if we had picked one child of k we will likely pick another child of k. Figure 3 shows scatter plots of k values versus 0k parent(k) for the internal nodes of the trees. Firstly, smaller values for both tend to be associated with lower levels of the trees, while large values are with higher levels of the trees. Thus we see that PCluster tend to have subtrees of vocabulary terms that are positively correlated with each other--i.e. they tend to co-occur in the same docu179 ments. The converse is true of DCluster and BCluster because they tend to put words with the same meaning together, thus to express a particular concept it is enough to select one of the words and not to choose the rest. Figure 5 show some fragments of the actual trees including the words they placed together and k parameters learned by HDT model for their internal nodes. Moreover, visual inspection of the trees shows that DCluster can easily misplace words in the tree, which explains its lower performance compared to the other tree construction methods. Secondly, we observed that for higher nodes of the tree (corresponding generally to larger values of k and 0k parent(k) ) PCluster k 's are smaller, thus higher levels of the tree exhibit negative correlation. This is reasonable, since if the subtrees capture positively correlated words, then higher up the tree the different subtrees correspond to clusters of words that do not co-occur together, i.e. negatively correlated. 6 Conclusion and Future Work We presented a hierarchical Dirichlet tree model for information retrieval which can inject (semantical or syntactical) word relationships as the domain knowledge into a probabilistic model for information retrieval. Using trees to capture word relationships, the model is highly efficient while making use of both prior information about words and their occurrence statistics in the corpus. Furthermore, we investigated the effect of different tree construction algorithms on the model performance. On the Cranfield dataset, HDT achieves 26.85% for average-precision and 32.40% for top-10 preci- Figure 5: Small parts of the trees learned by clustering algorithms for the Cranfield dataset where the learned k for each internal node is written close to it. sion, and outperforms all baselines including BM25 which gets 25.66% and 31.24% for these two measures. On the Medline dataset, HDT is competitive with BM25 with Query Expansion and outperforms all other baselines. These encouraging results show the benefits of HDT as a principled probabilistic model for information retrieval. An interesting avenue of research is to construct the vocabulary tree based on WordNet, as a way to inject independent prior knowledge into the model. However WordNet has a low coverage problem, i.e. there are some words in the data which do not exist in it. One solution to this low coverage problem is to combine trees generated by the clustering algorithms mentioned in this paper and WordNet, which we leave as a future work. Research and development in information retrieval, pages 96­103. P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer. 1990. Class-based n-gram models of natural language. Computational Linguistics. P. J. Cowans. 2004. Information retrieval using hierarchical dirichlet processes. 
In Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval (SIGIR). R. O. Duda, P. E. Hart, and D. G. Stork. 2000. Pattern Classification. Wiley-Interscience Publication. N. Friedman. 2003. Pcluster: Probabilistic agglomerative clustering of gene expression profiles. Available from http://citeseer.ist.psu.edu/668029.html. Donald Metzler and W. Bruce Croft. 2007. Latent concept expansion using markov random fields. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. S. Miller, J. Guinness, and A. Zamanian. 2004. Name tagging with word clusters and discriminative training. In Proceedings of North American Chapter of the Association for Computational Linguistics - Human Language Technologies conference (NAACL HLT). References L. Douglas Baker and Andrew Kachites McCallum. 1998. Distributional clustering of words for text classification. In SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on 180 T. Minka. 2003. The dirichlet-tree distribution. Available from http://research.microsoft.com/ minka/papers/dirichlet/minka-dirtree.pdf. J. Nocedal. 1980. Updating quasi-newton matrices with limited storage. Mathematics of Computation, 35. Fernando Pereira, Naftali Tishby, and Lillian Lee. 1993. Distributional clustering of english words. In 31st Annual Meeting of the Association for Computational Linguistics, pages 183­190. Stefan Riezler, Alexander Vasserman, Ioannis Tsochantaridis, Vibhu Mittal, and Yi Liu. 2007. Statistical machine translation for query expansion in answer retrieval. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. S. E. Robertson and K. Sparck Jones. 1976. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129­146. S. E. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau. 1992. Okapi at trec. In Text REtrieval Conference, pages 21­30. G. Salton and M.J. McGill. 1983. An Introduction to Modern Information Retrieval. McGraw-Hill, New York. 181 Phrase-Based Query Degradation Modeling for Vocabulary-Independent Ranked Utterance Retrieval J. Scott Olsson HLT Center of Excellence Johns Hopkins University Baltimore, MD 21211, USA solsson@jhu.edu Douglas W. Oard College of Information Studies University of Maryland College Park, MD 15213, USA oard@umd.edu Abstract This paper introduces a new approach to ranking speech utterances by a system's confidence that they contain a spoken word. Multiple alternate pronunciations, or degradations, of a query word's phoneme sequence are hypothesized and incorporated into the ranking function. We consider two methods for hypothesizing these degradations, the best of which is constructed using factored phrasebased statistical machine translation. We show that this approach is able to significantly improve upon a state-of-the-art baseline technique in an evaluation on held-out speech. We evaluate our systems using three different methods for indexing the speech utterances (using phoneme, phoneme multigram, and word recognition), and find that degradation modeling shows particular promise for locating out-of-vocabulary words when the underlying indexing system is constructed with standard word-based speech recognition. 1 Introduction Our goal is to find short speech utterances which contain a query word. 
We accomplish this goal by ranking the set of utterances by our confidence that they contain the query word, a task known as Ranked Utterance Retrieval (RUR). In particular, we are interested in the case when the user's query word can not be anticipated by a Large Vocabulary Continuous Speech Recognizer's (LVCSR) decoding dictionary, so that the word is said to be Out-OfVocabulary (OOV). Rare words tend to be the most informative, but are also most likely to be OOV. When words are 182 OOV, we must use vocabulary-independent techniques to locate them. One popular approach is to search for the words in output from a phoneme recognizer (Ng and Zue, 2000), although this suffers from the low accuracy typical of phoneme recognition. We consider two methods for handling this inaccuracy. First, we compare an RUR indexing system using phonemes with two systems using longer recognition units: words or phoneme multigrams. Second, we consider several methods for handling the recognition inaccuracy in the utterance ranking function itself. Our baseline generative model handles errorful recognition by estimating term frequencies from smoothed language models trained on phoneme lattices. Our new approach, which we call query degradation, hypothesizes many alternate "pronunciations" for the query word and incorporates them into the ranking function. These degradations are translations of the lexical phoneme sequence into the errorful recognition language, which we hypothesize using a factored phrase-based statistical machine translation system. Our speech collection is a set of oral history interviews from the MALACH collection (Byrne et al., 2004), which has previously been used for ad hoc speech retrieval evaluations using one-best word level transcripts (Pecina et al., 2007; Olsson, 2008a) and for vocabulary-independent RUR (Olsson, 2008b). The interviews were conducted with survivors and witnesses of the Holocaust, who discuss their experiences before, during, and after the Second World War. Their speech is predominately spontaneous and conversational. It is often also emotional and heavily accented. Because the speech contains many words unlikely to occur within a general purpose speech recognition lexicon, it repre- Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 182­190, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics sents an excellent collection for RUR evaluation. We were graciously permitted to use BBN Technology's speech recognition system Byblos (Prasad et al., 2005; Matsoukas et al., 2005) for our speech recognition experiments. We train on approximately 200 hours of transcribed audio excerpted from about 800 unique speakers in the MALACH collection. To provide a realistic set of OOV query words, we use an LVCSR dictionary previously constructed for a different topic domain (broadcast news and conversational telephone speech) and discard all utterances in our acoustic training data which are not covered by this dictionary. New acoustic and language models are trained for each of the phoneme, multigram and word recognition systems. The output of LVCSR is a lattice of recognition hypotheses for each test speech utterance. A lattice is a directed acyclic graph that is used to compactly represent the search space for a speech recognition system. Each node represents a point in time and arcs between nodes indicates a word occurs between the connected nodes' times. 
Arcs are weighted by the probability of the word occurring, so that the so-called "one-best" path through the lattice (what a system might return as a transcription) is the path through the lattice having highest probability under the acoustic and language models. Each RUR model we consider is constructed using the expected counts of a query word's phoneme sequences in these recognition lattices. We consider three approaches to producing these phoneme lattices, using standard word-based LVCSR, phoneme recognition, and LVCSR using phoneme multigrams. Our word system's dictionary contains about 50,000 entries, while the phoneme system contains 39 phonemes from the ARPABET set. Originally proposed by Deligne and Bimbot (1997) to model variable length regularities in streams of symbols (e.g., words, graphemes, or phonemes), phoneme multigrams are short sequences of one or more phonemes. We produce a set of "phoneme transcripts" by replacing transcript words with their lexical pronunciation. The set of multigrams is learned by then choosing a maximum-likelihood segmentation of these training phoneme transcripts, where the segmentation is viewed as hidden data in an Expectation-Maximization algorithm. The set of all continuous phonemes occurring between segment boundaries is then chosen as our multigram dictionary. This multigram recognition dictionary contains 16,409 entries.

After we have obtained each recognition lattice, our indexing approach follows that of Olsson (2008b). Namely, for the word and multigram systems, we first expand lattice arcs containing multiple phones to produce a lattice having only single phonemes on its arcs. Then, we compute the expected count of all phoneme n-grams (n ≤ 5) in the lattice. These n-grams and their counts are inserted in our inverted index for retrieval.

This paper is organized as follows. In Section 2 we introduce our baseline RUR methods. In Section 3 we introduce our query degradation approach. We introduce our experimental validation in Section 4 and our results in Section 5. We find that using phrase-based query degradations can significantly improve upon a strong RUR baseline. Finally, in Section 6 we conclude and outline several directions for future work.

2 Generative Baseline

Each method we present in this paper ranks the utterances by the term's estimated frequency within the corresponding phoneme lattice. This general approach has previously been considered (Yu and Seide, 2005; Saraclar and Sproat, 2004), on the basis that it provides a minimum Bayes-risk ranking criterion (Yu et al., Sept 2005; Robertson, 1977) for the utterances. What differs for each method is the particular estimator of term frequency which is used. We first outline our baseline approach, a generative model for term frequency estimation.

Recall that our vocabulary-independent indices contain the expected counts of phoneme sequences from our recognition lattices. Yu and Seide (2005) used these expected phoneme sequence counts to estimate term frequency in the following way. For a query term Q and lattice L, term frequency tf_G is estimated as tf_G(Q, L) = P(Q|L) · N_L, where N_L is an estimate for the number of words in the utterance. The conditional P(Q|L) is modeled as an order-M phoneme level language model,

P̂(Q|L) = ∏_{i=1}^{l} P̃(q_i | q_{i−M+1}, …, q_{i−1}, L),   (1)

so that tf_G(Q, L) ≈ P̂(Q|L) · N_L. The probability of a query phoneme q_j being generated, given that the phoneme sequence
q_{j−M+1}, …, q_{j−1} = q_{j−M+1}^{j−1} was observed, is estimated as

P̃(q_j | q_{j−M+1}^{j−1}, L) = E_{P_L}[C(q_{j−M+1}^{j})] / E_{P_L}[C(q_{j−M+1}^{j−1})].

Here, E_{P_L}[C(q_{j−M+1}^{j})] denotes the expected count in lattice L of the phoneme sequence q_{j−M+1}^{j}. We compute these counts using a variant of the forward-backward algorithm, which is implemented by the SRI language modeling toolkit (Stolcke, 2002). In practice, because of data sparsity, the language model in Equation 1 must be modified to include smoothing for unseen phoneme sequences. We use a backoff M-gram model with Witten-Bell discounting (Witten and Bell, 1991). We set the phoneme language model's order to M = 5, which gave good results in previous work (Yu and Seide, 2005).

3 Incorporating Query Degradations

One problem with the generative approach is that recognition error is not modeled (apart from the uncertainty captured in the phoneme lattice). The essential problem is that while the method hopes to model P(Q|L), it is in fact only able to model the probability of one degradation H in the lattice, that is P(H|L). We define a query degradation as any phoneme sequence (including the lexical sequence) which may, with some estimated probability, occur in an errorful phonemic representation of the audio (either a one-best or lattice hypothesis). Because of speaker variation and because recognition is errorful, we ought to also consider non-lexical degradations of the query phoneme sequence. That is, we should incorporate P(H|Q) in our ranking function.

It has previously been demonstrated that allowing for phoneme confusability can significantly increase spoken term detection performance on one-best phoneme transcripts (Chaudhari and Picheny, 2007; Schone et al., 2005) and in phonemic lattices (Foote et al., 1997). These methods work by allowing weighted substitution costs in minimum-edit-distance matching. Previously, these substitution costs have been maximum-likelihood estimates of P(H|Q) for each phoneme, where P(H|Q) is easily computed from a phoneme confusion matrix after aligning the reference and one-best hypothesis transcript under a minimum edit distance criterion. Similar methods have also been used in other language processing applications. For example, in (Kolak, 2005), one-for-one character substitutions, insertions and deletions were considered in a generative model of errors in OCR.

In this work, because we are focused on constructing inverted indices of audio files (for speed and to conserve space), we must generalize our method of incorporating query degradations in the ranking function. Given a degradation model P(H|Q), we take as our ranking function the expectation of the generative baseline estimate N_L · P̂(H|L) with respect to P(H|Q),

tf_G(Q, L) = Σ_{H ∈ ℋ} P̂(H|L) · N_L · P(H|Q),   (2)

where ℋ is the set of degradations. Note that, while we consider the expected value of our baseline term frequency estimator with respect to P(H|Q), this general approach could be used with any other term frequency estimator.
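As a concrete illustration of Equations (1) and (2) (a sketch under assumed inputs, not the authors' implementation), the following Python fragment estimates tf_G from expected phoneme n-gram counts, here faked from two weighted lattice paths rather than computed with the forward-backward algorithm, and then takes the expectation over a small set of hypothesized degradations. All phoneme strings, posteriors, and degradation probabilities are invented for the example, and no smoothing is applied.

```python
from collections import defaultdict

M = 3  # phoneme language-model order for this toy example (the paper uses M = 5)

def build_expected_counts(weighted_paths, max_n=5):
    """Toy stand-in for lattice expected counts: each path is (phoneme list, posterior)."""
    counts = defaultdict(float)
    for phones, post in weighted_paths:
        for n in range(1, max_n + 1):
            for i in range(len(phones) - n + 1):
                counts[tuple(phones[i:i + n])] += post
    return counts

def lm_prob(query, counts):
    """Unsmoothed order-M estimate of P^(Q|L) from expected n-gram counts
    (the paper instead uses a backoff M-gram model with Witten-Bell discounting)."""
    total = sum(v for k, v in counts.items() if len(k) == 1)
    prob = 1.0
    for j, q in enumerate(query):
        hist = tuple(query[max(0, j - M + 1):j])
        num = counts.get(hist + (q,), 0.0)
        den = counts.get(hist, 0.0) if hist else total
        if num == 0.0 or den == 0.0:
            return 0.0
        prob *= num / den
    return prob

def tf_generative(query, counts, n_words):
    """Equation (1): tf_G(Q, L) ~ P^(Q|L) * N_L."""
    return lm_prob(query, counts) * n_words

def tf_with_degradations(degradations, counts, n_words):
    """Equation (2): expectation of the baseline estimate over the degradation set."""
    return sum(p * tf_generative(h, counts, n_words) for h, p in degradations)

# Two hypothesized lattice paths with posterior weights, and two degradations of a query.
counts = build_expected_counts([(["M", "EH", "N", "T", "AX", "L"], 0.7),
                                (["M", "EH", "NX", "EY", "L", "EH"], 0.3)])
degradations = [(["M", "EH", "NX", "EY", "L", "EH"], 0.6),
                (["M", "EH", "N", "T", "AX", "L"], 0.4)]
print(tf_with_degradations(degradations, counts, n_words=5.0))
```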
Our formulation is similar to approaches taken in OCR document retrieval, using degradations of character sequences (Darwish and Magdy, 2007; Darwish, 2003). For vocabulary-independent spoken term detection, perhaps the most closely related formulation is provided by (Mamou and Ramabhadran, 2008). In that work, they ranked utterances by the weighted average of their matching score, where the weights were confidences from a grapheme-to-phoneme system's first several hypotheses for a word's pronunciation. The matching scores were edit distances, where substitution costs were weighted using phoneme confusability. Accordingly, their formulation was not aimed at accounting for errors in recognition per se, but rather for errors in hypothesizing pronunciations. We expect this accounts for their lack of significant improvement using the method.

Since we don't want to sum over all possible recognition hypotheses H, we might instead sum over the smallest set ℋ′ such that Σ_{H ∈ ℋ′} P(H|Q) exceeds some threshold. That is, we could take the most probable degradations until their cumulative probability exceeds that threshold. In practice, however, because degradation probabilities can be poorly scaled, we instead take a fixed number of degradations and normalize their scores. When a query is issued, we apply a degradation model to learn the top few phoneme sequences H that are most likely to have been recognized, under the model. In the machine translation literature, this process is commonly referred to as decoding. We now turn to the modeling of query degradations H given a phoneme sequence Q, P(H|Q). First, we consider a simple baseline approach in Section 3.1. Then, in Section 3.2, we propose a more powerful technique, using state-of-the-art machine translation methods to hypothesize our degradations.

Figure 1: Three levels of annotation used by the factored phrase-based query degradation model (for example, AY: Vowel, Diphthong; K: Consonant, Voiceless plosive; M: Semi-vowel, Nasal; AA: Vowel, Back vowel; N: Semi-vowel, Nasal).

3.1 Baseline Query Degradations

Schone et al. (2005) used phoneme confusion matrices created by aligning hypothesized and reference phoneme transcripts to weight edit costs for a minimum-edit distance based search in a one-best phoneme transcript. Foote et al. (1997) had previously used phoneme lattices, although with ad hoc edit costs and without efficient indexing. In this work, we do not want to linearly scan each phoneme lattice for our query's phoneme sequence, preferring instead to look up sequences in the inverted indices containing phoneme sequences. Our baseline degradation approach is related to the edit-cost approach taken by (Schone et al., 2005), although we generalize it so that it may be applied within Equation 2 and we consider speech recognition hypotheses beyond the one-best hypothesis.

First, we randomly generate N traversals of each phonemic recognition lattice. These traversals are random paths through the lattice (i.e., we start at the beginning of the lattice and move to the next node, where our choice is weighted by the outgoing arcs' probabilities). Then, we align each of these traversals with its reference transcript using a minimum-edit distance criterion. Phone confusion matrices are then tabulated from the aggregated insertion, substitution, and deletion counts across all traversals of all lattices. From these confusion matrices, we compute unsmoothed estimates of P(h|r), the probability of a phoneme h being hypothesized given a reference phoneme r. Making an independence assumption, our baseline degradation model for a query with m phonemes is then P(H|Q) = ∏_{i=1}^{m} P(h_i|r_i). We efficiently compute the most probable degradations for a query Q using a lattice of possible degradations and the forward-backward algorithm. We call this baseline degradation approach CMQD (Confusion Matrix based Query Degradation).
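The CMQD construction just described can be sketched as follows; the sketch assumes the traversal/reference phoneme pairs have already been aligned, handles substitutions only, and enumerates combinations by brute force rather than with the degradation lattice and forward-backward pass used in the paper. All counts are hypothetical.

```python
from collections import defaultdict
from heapq import nlargest
from itertools import product

def confusion_probs(aligned_pairs):
    """Estimate P(h | r) from already-aligned (hypothesis, reference) phoneme pairs.
    The full CMQD model also tabulates insertions and deletions from the traversals."""
    counts = defaultdict(lambda: defaultdict(float))
    for hyp, ref in aligned_pairs:
        counts[ref][hyp] += 1.0
    return {r: {h: c / sum(hyps.values()) for h, c in hyps.items()}
            for r, hyps in counts.items()}

def cmqd_degradations(query, probs, k=5):
    """Top-k degradations under the independence assumption P(H|Q) = prod_i P(h_i|r_i).
    Brute-force enumeration; fine for short queries, unlike the paper's lattice search."""
    options = [sorted(probs.get(r, {r: 1.0}).items(),
                      key=lambda kv: -kv[1])[:k] for r in query]
    hyps = []
    for combo in product(*options):
        p = 1.0
        for _, p_sub in combo:
            p *= p_sub
        hyps.append((tuple(h for h, _ in combo), p))
    return nlargest(k, hyps, key=lambda hp: hp[1])

# Hypothetical aligned (hypothesis, reference) pairs pooled over random lattice traversals.
pairs = [("T", "T"), ("D", "T"), ("T", "T"), ("AO", "OW"), ("OW", "OW"),
         ("K", "K"), ("G", "K"), ("K", "K"), ("K", "K")]
probs = confusion_probs(pairs)
for hyp, p in cmqd_degradations(("K", "OW", "T"), probs, k=3):
    print(" ".join(hyp), round(p, 3))
```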
3.2 Phrase-Based Query Degradation

One problem with CMQD is that we only allow insertions, deletions, and one-for-one substitutions. It may be, however, that certain pairs of phonemes are commonly hypothesized for a particular reference phoneme (in the language of statistical machine translation, we might say that we should allow some non-zero fertility). Second, there is nothing to discourage query degradations which are unlikely under an (errorful) language model, that is, degradations that are not observed in the speech hypotheses. Finally, CMQD doesn't account for similarities between phoneme classes. While some of these deficiencies could be addressed with an extension to CMQD (e.g., by expanding the degradation lattices to include language model scores), we can do better using a more powerful modeling framework. In particular, we adopt the approach of phrase-based statistical machine translation (Koehn et al., 2003; Koehn and Hoang, 2007). This approach allows for multiple-phoneme to multiple-phoneme substitutions, as well as the soft incorporation of additional linguistic knowledge (e.g., phoneme classes). This is related to previous work allowing higher order phoneme confusions in bigram or trigram contexts (Chaudhari and Picheny, 2007), although they used a fuzzy edit distance measure and did not incorporate other evidence in their model (e.g., the phoneme language model score).

The reader is referred to (Koehn and Hoang, 2007; Koehn et al., 2007) for detailed information about phrase-based statistical machine translation. We give a brief outline here, sufficient only to provide background for our query degradation application. Statistical machine translation systems work by converting a source-language sentence into the most probable target-language sentence, under a model whose parameters are estimated using example sentence pairs. Phrase-based machine translation is one variant of this statistical approach, wherein multiple-word phrases rather than isolated words are the basic translation unit. These phrases are generally not linguistically motivated, but rather learned from co-occurrences in the paired example translation sentences. We apply the same machinery to hypothesize our pronunciation degradations, where we now translate from the "source-language" reference phoneme sequence Q to the hypothesized "target-language" phoneme sequence H.

Phrase-based translation is based on the noisy channel model, where Bayes' rule is used to reformulate the translation probability for translating a reference query Q into a hypothesized phoneme sequence H as

argmax_H P(H|Q) = argmax_H P(Q|H)P(H).

Here, for example, P(H) is the language model probability of a degradation H and P(Q|H) is the conditional probability of the reference sequence Q given H. More generally however, we can incorporate other feature functions of H and Q, h_i(H, Q), and with varying weights. This is implemented using a log-linear model for P(H|Q), where the model covariates are the functions h_i(H, Q), so that

P(H|Q) = (1/Z) exp( Σ_{i=1}^{n} λ_i h_i(H, Q) ).

The parameters λ_i are estimated by MLE and the normalizing Z need not be computed (because we will take the argmax). Example feature functions include the language model probability of the hypothesis and a hypothesis length penalty. In addition to feature functions being defined on the surface level of the phonemes, they may also be defined on non-surface annotation levels, called factors. In a word translation setting, the intuition is that statistics from morphological variants of a lexical form ought to contribute to statistics for other variants.
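The following sketch shows how a log-linear model of this form ranks competing degradations without computing Z: each hypothesis is scored by a weighted sum of its feature values. The feature inventory (channel, lm, length) and the weights are placeholders, not the Moses feature set or trained values.

```python
import math

def loglinear_score(feats, weights):
    """Unnormalized log-linear score sum_i lambda_i * h_i(H, Q); Z is not needed
    because we only rank hypotheses (take the argmax / top-N)."""
    return sum(weights[name] * value for name, value in feats.items())

def rank_degradations(hypotheses, weights):
    """hypotheses: list of (phoneme tuple, feature dict). Returns best-first list."""
    scored = [(loglinear_score(f, weights), h) for h, f in hypotheses]
    return sorted(scored, reverse=True)

# Hypothetical feature values for two degradations of one query.
weights = {"channel": 1.0, "lm": 0.8, "length": -0.2}
hyps = [(("M", "EH", "N", "T", "AX", "L"),
         {"channel": math.log(0.02), "lm": math.log(0.10), "length": 6}),
        (("M", "EH", "NX", "EY", "L", "EH"),
         {"channel": math.log(0.06), "lm": math.log(0.01), "length": 6})]
for score, hyp in rank_degradations(hyps, weights):
    print(round(score, 3), " ".join(hyp))
```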
For example, if we have never seen the word houses in language model training, but have examples of house, we still can expect houses are to 186 be more probable than houses fly. In other words, factors allow us to collect improved statistics on sparse data. While sparsity might appear to be less of a problem for phoneme degradation modeling (because the token inventory is comparatively very small), we nevertheless may benefit from this approach, particularly because we expect to rely on higher order language models and because we have rather little training data: only 22,810 transcribed utterances (about 600k reference phonemes). In our case, we use two additional annotation layers, based on a simple grouping of phonemes into broad classes. We consider the phoneme itself, the broad distinction of vowel and consonant, and a finer grained set of classes (e.g., front vowels, central vowels, voiceless and voiced fricatives). Figure 1 shows the three annotation layers we consider for an example reference phoneme sequence. After mapping the reference and hypothesized phonemes to each of these additional factor levels, we train language models on each of the three factor levels of the hypothesized phonemes. The language models for each of these factor levels are then incorporated as features in the translation model. We use the open source toolkit Moses (Koehn et al., 2007) as our phrase-based machine translation system. We used the SRI language modeling toolkit to estimate interpolated 5-gram language models (for each factor level), and smoothed our estimates with Witten-Bell discounting (Witten and Bell, 1991). We used the default parameter settings for Moses's training, with the exception of modifying GIZA++'s default maximum fertility from 10 to 4 (since we don't expect one reference phoneme to align to 10 degraded phonemes). We used default decoding settings, apart from setting the distortion penalty to prevent any reorderings (since alignments are logically constrained to never cross). For the rest of this chapter, we refer to our phrase-based query degradation model as PBQD. We denote the phrasebased model using factors as PBQD-Fac. Figure 2 shows an example alignment learned for a reference and one-best phonemic transcript. The reference utterance "snow white and the seven dwarves" is recognized (approximately) as "no white a the second walks". Note that the phrasebased system is learning not only acoustically plausible confusions, but critically, also confusions aris- N OW W AY T AX DH AX S EH K AX N D W AO K S S N OW W AY T AE N D DH AX S EH V AX N D W OW R F S snow white and the seven dwarves Figure 2: An alignment of hypothesized and reference phoneme transcripts from the multigram phoneme recognizer, for the phrase-based query degradation model. ing from the phonemic recognition system's peculiar construction. For example, while V and K may not be acoustically similar, they are still confusable--within the context of S EH--because multigram language model data has many examples of the word second. Moreover, while the word dwarves (D-W-OW-R-F-S) is not present in the dictionary, the words dwarf (D-W-AO-R-F) and dwarfed (D-W-AO-R-F-T) are present (N.B., the change of vowel from AO to OW between the OOV and in vocabulary pronunciations). While CMQD would have to allow a deletion and two substitutions (without any context) to obtain the correct degradation, the phrase-based system can align the complete phrase pair from training and exploit context. 
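The three annotation layers can be attached to a phoneme sequence as pipe-separated factors, the input format consumed by Moses-style factored systems. A minimal sketch follows; only the five phonemes shown in Figure 1 are given their classes here, so the tables should be read as illustrative rather than as the paper's full inventory.

```python
# Factor tables for the five phonemes shown in Figure 1; the remaining ARPABET
# phonemes would need entries too, so treat this inventory as illustrative only.
COARSE = {"AY": "Vowel", "AA": "Vowel", "K": "Consonant",
          "M": "Semi-vowel", "N": "Semi-vowel"}
FINE = {"AY": "Diphthong", "AA": "Back_vowel", "K": "Voiceless_plosive",
        "M": "Nasal", "N": "Nasal"}

def to_factored(phones):
    """Render a phoneme sequence as pipe-separated factors (surface|coarse|fine)."""
    return " ".join("|".join((p, COARSE.get(p, "Other"), FINE.get(p, "Other")))
                    for p in phones)

print(to_factored(["M", "EH", "NX", "EY", "L", "EH"]))
# e.g. "M|Semi-vowel|Nasal EH|Other|Other NX|Other|Other ..."
```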
Here, for example, it is highly probable that the errorfully hypothesized phonemes W AO will be followed by K, because of the prevalence of walk in language model data. 4 Experiments An appropriate and commonly used measure for RUR is Mean Average Precision (MAP). Given a ranked list of utterances being searched through, we define the precision at position i in the list as the proportion of the top i utterances which actually contain the corresponding query word. Average Precision (AP) is the average of the precision values computed for each position containing a relevant utterance. To assess the effectiveness of a system across multiple queries, Mean Average Precision is defined as the arithmetic mean of per-query average precision, 1 MAP = n n APn . Throughout this paper, when we report statistically significant improvements in MAP, we are comparing AP for paired queries using a Wilcoxon signed rank test at = 0.05. Note, RUR is different than spoken term detection in two ways, and thus warrants an evaluation measure (e.g., MAP) different than standard spoken 187 term detection measures (such as NIST's actual term weighted value (Fiscus et al., 2006)). First, STD measures require locating a term with granularity finer than that of an utterance. Second, STD measures are computed using a fixed detection threshold. This latter requirement will be unnecessary in many applications (e.g., where a user might prefer to decide themselves when to stop reading down the ranked list of retrieved utterances) and unlikely to be helpful for downstream evidence combination (where we may prefer to keep all putative hits and weight them by some measure of confidence). For our evaluation, we consider retrieving short utterances from seventeen fully transcribed MALACH interviews. Our query set contains all single words occurring in these interviews that are OOV with respect to the word dictionary. This gives us a total of 261 query terms for evaluation. Note, query words are also not present in the multigram training transcripts, in any language model training data, or in any transcripts used for degradation modeling. Some example query words include BUCHENWALD, KINDERTRANSPORT, and SONDERKOMMANDO. To train our degradation models, we used a held out set of 22,810 manually transcribed utterances. We run each recognition system (phoneme, multigram, and word) on these utterances and, for each, train separate degradation models using the aligned reference and hypothesis transcripts. For CMQD, we computed 100 random traversals on each lattice, giving us a total of 2,281,000 hypothesis and reference pairs to align for our confusion matrices. 5 Results We first consider an intrinsic measure of the three speech recognition systems we consider, namely Phoneme Error Rate (PER). Phoneme Error Rate is calculated by first producing an alignment of the hypothesis and reference phoneme transcripts. The counts of each error type are used to compute P ER = 100 · S+D+I , where S, D, I are the numN ber of substitutions, insertions, and deletions respectively, while N is the phoneme length of the reference. Results are shown in Table 1. First, we see that the PER for the multigram system is roughly half that of the phoneme-only system. Second, we find that the word system achieves a considerably lower PER than the multigram system. We note, however, that since these are not true phonemes (but rather phonemes copied over from pronunciation dictionaries and word transcripts), we must cautiously interpret these results. 
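For reference, both evaluation quantities are straightforward to compute. The sketch below assumes binary relevance judgments for each ranked utterance and uniform edit costs for PER; it is an illustration, not the evaluation code used in the paper.

```python
def average_precision(ranked_relevance):
    """ranked_relevance: booleans for each ranked utterance (True = contains the term)."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def mean_average_precision(per_query):
    return sum(average_precision(r) for r in per_query) / len(per_query)

def phoneme_error_rate(hyp, ref):
    """PER = 100 * (S + D + I) / N via standard edit-distance DP (uniform costs)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(mean_average_precision([[True, False, True], [False, True]]))
print(phoneme_error_rate(["N", "OW", "W", "AY", "T"], ["S", "N", "OW", "W", "AY", "T"]))
```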
In particular, it seems reasonable that this framework will overestimate the strength of the word-based system. For comparison, on the same train/test partition, our word-level system had a word error rate of 31.63. Note, however, that automatic word transcripts cannot contain our OOV query words, so word error rate is reported only to give a sense of the difficulty of the recognition task.

Table 1 shows our baseline RUR evaluation results. First, we find that the generative model yields statistically significantly higher MAP using words or multigrams than phonemes. This is almost certainly due to the considerably improved phoneme recognition afforded by longer recognition units. Second, many more unique phoneme sequences typically occur in phoneme lattices than in their word or multigram counterparts. We expect this will increase the false alarm rate for the phoneme system, thus decreasing MAP. Surprisingly, while the word-based recognition system achieved considerably lower phoneme error rates than the multigram system (see Table 1), the word-based generative model was in fact indistinguishable from the same model using multigrams. We speculate that this is because the method, as it is essentially a language modeling approach, is sensitive to data sparsity and requires appropriate smoothing. Because multigram lattices incorporate smaller recognition units, which are not constrained to be English words, they naturally produce smoother phoneme language models than a word-based system. On the other hand, the multigram system is also not statistically significantly better than the word-based generative model, suggesting this may be a promising area for future work.

Table 1: PER and MAP results for baseline and degradation models. The best result for each indexing approach is shown in bold.

Phone Source   PER    QD Model   | Baseline      1       5      50     500  (number of query degradations)
Phonemes       64.4   PBQD-Fac   |  0.0387   0.0479  0.0581  0.0614  0.0612
Multigrams     32.1   CMQD       |  0.1258   0.1258  0.1272  0.1158  0.0991
Multigrams     32.1   PBQD       |  0.1258   0.1160  0.1283  0.1347  0.1317
Multigrams     32.1   PBQD-Fac   |  0.1258   0.1238  0.1399  0.1510  0.1527
Words          20.5   PBQD-Fac   |  0.1255   0.1162  0.1509  0.1787  0.1753

Table 1 shows results using our degradation models. Query degradation appears to help all systems with respect to the generative baseline. This agrees with our intuition that, for RUR, low MAP on OOV terms is predominately driven by low recall. (We note, however, that the preferred operating point in the tradeoff between precision and recall will be task specific. For example, it is known that precision errors become increasingly important as collection size grows (Shao et al., 2008).) Note that, at one degradation, CMQD has the same MAP as the generative model, since the most probable degradation under CMQD is almost always the reference phoneme sequence. Because the CMQD model can easily hypothesize implausible degradations, we see the MAP increases modestly with a few degradations, but then MAP decreases. In contrast, the MAP of the phrase-based system (PBQD-Fac) increases through to 500 query degradations using multigrams. The phonemic system appears to achieve its peak MAP with fewer degradations, but also has a considerably lower best value. The non-factored phrase-based system PBQD achieves a peak MAP considerably larger than the peak CMQD approach. And, likewise, using additional factor levels (PBQD-Fac) also considerably improves performance. Note especially that, using multiple factor levels, we not only achieve a higher MAP, but also a higher MAP when only a few degradations are possible.

To account for errors in phonemic recognition, we have taken two steps. First, we used longer recognition units which we found significantly improved MAP while using our baseline RUR technique. As a second method for handling recognition errors, we also considered variants of our ranking function. In particular, we incorporated query degradations hypothesized using factored phrase-based machine translation. Comparing the MAP for PBQD-Fac with MAP using the generative baseline for the most improved indexing system (the word system), we find that this degradation approach again statistically significantly improved MAP. That is, these two strategies for handling recognition errors in RUR appear to work well in combination.

Although we focused on vocabulary-independent RUR, downstream tasks such as ad hoc speech retrieval will also want to incorporate evidence from in-vocabulary query words. This makes our query degradation approach which indexed phonemes from word-based LVCSR particularly attractive. Not only did it achieve the best MAP in our evaluation, but this approach also allows us to construct recognition lattices for both in and out-of-vocabulary query words without running a second, costly, recognition step.

6 Conclusion

Our goal in this work was to rank utterances by our confidence that they contained a previously unseen query word. We proposed a new approach to this task using hypothesized degradations of the query word's phoneme sequence, which we produced using a factored phrase-based machine translation model. This approach was principally motivated by the mismatch between the query's phonemes and the recognition phoneme sequences due to errorful speech indexing. Our approach was constructed and evaluated using phoneme-, multigram-, and word-based indexing, and significant improvements in MAP using each indexing system were achieved. Critically, these significant improvements were in addition to the significant gains we achieved by constructing our index with longer recognition units.

While PBQD-Fac outperformed CMQD averaging over all queries in our evaluation, as expected, there may be particular query words for which this is not the case. Table 2 shows example degradations using both the CMQD and PBQD-Fac degradation models for multigrams. The query word is Mengele.

Table 2: The top five degradations and associated probabilities using the CMQD and PBQD-Fac models, for the term Mengele using multigram indexing.

CMQD               Phrase-based
M-EH-NX-EY-L-EH    M-EH-N-T-AX-L
M-EH-NX-EY-L       M-EH-N-T-AX-L-AA-T
M-NX-EY-L-EH       AH-AH-AH-AH-M-EH-N-T-AX-L
M-EH-NX-EY-EH      M-EH-N-DH-EY-L-EH
M-EH-NX-L-EH       M-EH-N-T-AX-L-IY

We see that CMQD degradations are near (in an edit distance sense) to the reference pronunciation (M-EH-NX-EY-L-EH), while the phrase-based degradations tend to sound like commonly occurring words (mental, meant a lot, men they..., mentally). In this case, the lexical phoneme sequence does not occur in the PBQD-Fac degradations until degradation nineteen. Because deleting EH has the same cost irrespective of context for CMQD, both CMQD degradations 2 and 3 are given the same pronunciation weight. Here, CMQD performs considerably better, achieving an average precision of 0.1707, while PBQD-Fac obtains only 0.0300.
This suggests that occasionally the phrasebased language model may exert too much influence on the degradations, which is likely to increase the incidence of false alarms. One solution, for future work, might be to incorporate a false alarm model (e.g., down-weighting putative occurrences which look suspiciously like non-query words). Second, we might consider training the degradation model in a discriminative framework (e.g., training to optimize a measure that will penalize degradations which cause false alarms, even if they are good candidates from the perspective of MLE). We hope that the ideas presented in this paper will provide a solid foundation for this future work. References W. Byrne et al. 2004. Automatic Recognition of Spontaneous Speech for Access to Multilingual Oral History Archives. IEEE Transactions on Speech and Audio Processing, Special Issue on Spontaneous Speech Processing, 12(4):420­435, July. U.V. Chaudhari and M. Picheny. 2007. Improvements in phone based audio search via constrained match with high order confusion estimates. Automatic Speech Recognition & Understanding, 2007. ASRU. IEEE Workshop on, pages 665­670, Dec. Kareem Darwish and Walid Magdy. 2007. Error correction vs. query garbling for Arabic OCR document retrieval. ACM Trans. Inf. Syst., 26(1):5. Kareem M. Darwish. 2003. Probabilistic Methods for Searching OCR-Degraded Arabic Text. Ph.D. thesis, University of Maryland, College Park, MD, USA. Directed by Bruce Jacob and Douglas W. Oard. S. Deligne and F. Bimbot. 1997. Inference of Variablelength Acoustic Units for Continuous Speech Recognition. In ICASSP '97: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 1731­1734, Munich, Germany. Jonathan Fiscus et al. 2006. English Spoken Term Detection 2006 Results. In Presentation at NIST's 2006 STD Eval Workshop. J.T. Foote et al. 1997. Unconstrained keyword spotting using phone lattices with application to spoken document retrieval. Computer Speech and Language, 11:207--224. Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. In EMNLP '07: Conference on Empirical Methods in Natural Language Processing, June. Philipp Koehn et al. 2003. Statistical phrase-based translation. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48­54, Morristown, NJ, USA. Association for Computational Linguistics. Philipp Koehn et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In ACL '07: Proceedings of the 2007 Conference of the Association for Computational Linguistics, demonstration session, June. Okan Kolak. 2005. Rapid Resource Transfer for Multilingual Natural Language Processing. Ph.D. thesis, University of Maryland, College Park, MD, USA. Directed by Philip Resnik. Jonathan Mamou and Bhuvana Ramabhadran. 2008. Phonetic Query Expansion for Spoken Document Retrieval. In Interspeech '08: Conference of the International Speech Communication Association. Spyros Matsoukas et al. 2005. The 2004 BBN 1xRT Recognition Systems for English Broadcast News and Conversational Telephone Speech. In Interspeech '05: Conference of the International Speech Communication Association, pages 1641­1644. K. Ng and V.W. Zue. 2000. Subword-based approaches for spoken document retrieval. Speech Commun., 32(3):157­186. J. Scott Olsson. 2008a. Combining Speech Retrieval Results with Generalized Additive Models. 
In ACL '08: Proceedings of the 2008 Conference of the Association for Computational Linguistics. J. Scott Olsson. 2008b. Vocabulary Independent Discriminative Term Frequency Estimation. In Interspeech '08: Conference of the International Speech Communication Association. Pavel Pecina, Petra Hoffmannova, Gareth J.F. Jones, Jianqiang Wang, and Douglas W. Oard. 2007. Overview of the CLEF-2007 Cross-Language Speech Retrieval Track. In Proceedings of the CLEF 2007 Workshop on Cross-Language Information Retrieval and Evaluation, September. R. Prasad et al. 2005. The 2004 BBN/LIMSI 20xRT English Conversational Telephone Speech Recognition System. In Interspeech '05: Conference of the International Speech Communication Association. S.E. Robertson. 1977. The Probability Ranking Principle in IR. Journal of Documentation, pages 281­286. M. Saraclar and R. Sproat. 2004. Lattice-Based Search for Spoken Utterance Retrieval. In NAACL '04: Proceedings of the 2004 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. P. Schone et al. 2005. Searching Conversational Telephone Speech in Any of the World's Languages. Jian Shao et al. 2008. Towards Vocabulary-Independent Speech Indexing for Large-Scale Repositories. In Interspeech '08: Conference of the International Speech Communication Association. A. Stolcke. 2002. SRILM ­ an extensible language modeling toolkit. In ICSLP '02: Proceedings of 2002 International Conference on Spoken Language Processing. I. H. Witten and T. C. Bell. 1991. The Zero-Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression. IEEE Trans. Information Theory, 37(4):1085­1094. Peng Yu and Frank Seide. 2005. Fast TwoStage Vocabulary-Independent Search In Spontaneous Speech. In ICASSP '05: Proceedings of the 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing. P. Yu et al. Sept. 2005. Vocabulary-Independent Indexing of Spontaneous Speech. IEEE Transactions on Speech and Audio Processing, 13(5):635­643. 190 Japanese Query Alteration Based on Semantic Similarity Masato Hagiwara Nagoya University Furo-cho, Chikusa-ku Nagoya 464-8603, Japan hagiwara@kl.i.is.nagoya-u.ac.jp Abstract We propose a unified approach to web search query alterations in Japanese that is not limited to particular character types or orthographic similarity between a query and its alteration candidate. Our model is based on previous work on English query correction, but makes some crucial improvements: (1) we augment the query-candidate list to include orthographically dissimilar but semantically similar pairs; and (2) we use kernel-based lexical semantic similarity to avoid the problem of data sparseness in computing querycandidate similarity. We also propose an efficient method for generating query-candidate pairs for model training and testing. We show that the proposed method achieves about 80% accuracy on the query alteration task, improving over previously proposed methods that use semantic similarity. :pS :pS Hisami Suzuki Microsoft Research One Microsoft Way Redmond, WA 98052, USA hisamis@microsoft.com anagariH Figure 1: Japanese character types and spelling variants 1 Introduction Web search query correction is an important problem to solve for robust information retrieval given how pervasive errors are in search queries: it is said that more than 10% of web search queries contain errors (Cucerzan and Brill, 2004). 
English query correction has been an area of active research in recent years, building on previous work on general-purpose spelling correction. However, there has been little investigation of query correction in languages other than English. In this paper, we address the issue of query correction, and more generally, query alteration in Japanese. Japanese poses particular challenges to the query correction task due to its complex writing system, summarized in Fig. 1. (The figure is somewhat over-simplified as it does not include any word consisting of multiple character types. It also does not include examples of spelling mistakes and variants in word segmentation.) There are four main character types, including two types of kana (phonetic alphabet - hiragana and katakana), kanji (ideographic - characters represent meaning) and Roman alphabet; a word can be legitimately spelled in multiple ways, combining any of these character sets. For example, the word for `protein' can be spelled as (all in hiragana), (katakana+kanji), (all in kanji) or (hiragana+kanji), all pronounced in the same way (tanpakushitsu). Some examples of these spelling variants are shown in Fig. 1 with the prefix Sp: as is observed from the figure, spelling variation occurs within and across different character types. Resolving these variants will be essential not only for information retrieval but practically for all NLP tasks.

A particularly prolific source of spelling variations in Japanese is katakana. Katakana characters are used to transliterate words from English and other foreign languages, and as such, the variations in the source language pronunciation as well as the ambiguity in sound adaptation are reflected in the katakana spelling. For example, Masuyama et al. (2004) report that at least six distinct transliterations of the word `spaghetti' ( , , etc.) are attested in the newspaper corpus they studied. Normalizing katakana spelling variations has been the subject of research by itself (Aramaki et al., 2008; Masuyama et al., 2004). Similarly, English-to-katakana transliteration (e.g., `fedex' as fedekkusu in Fig. 1) and katakana-to-English back-transliteration (e.g., back into `fedex') have also been studied extensively (Bilac and Tanaka, 2004; Brill et al., 2001; Knight and Graehl, 1998), as it is an essential component for machine translation. To our knowledge, however, there has been no work that addresses spelling variation in Japanese generally.

In this paper, we propose a general approach to query correction/alteration in Japanese. Our goal is to find precise re-write candidates for a query, be it a correction of a spelling error, normalization of a spelling variant, or finding a strict synonym including abbreviations (e.g., MS `Microsoft', prefixed by Abbr in Fig. 1) and true synonyms (e.g., (translation of `seat') (transliteration of `seat', indicated by Syn in Fig. 1)). Our method is based on previous work on English query correction in that we use both spelling and semantic similarity between a query and its alteration candidate, but is more general in that we include alteration candidates that are not similar to the original query in spelling.
In computing semantic similarity, we adopt a kernel-based method (Kandola et al., 2002), which improves the accuracy of the query alteration results over previously proposed methods. We also introduce a novel approach to creating a dataset of query and alteration candidate pairs efficiently and reliably from query session logs. 2 Related Work The key difference between traditional generalpurpose spelling correction and search query correction lies in the fact that the latter cannot rely on a lexicon: web queries are replete with valid outof-dictionary words which are not mis-spellings of in-vocabulary words. Cucerzan and Brill (2004) pioneered the research of query spelling correction, with an excellent description of how a traditional dictionary-based speller had to be adapted to solve the realistic query correction problem. The model they proposed is a source-channel model, where the source model is a word bigram model trained on query logs, and the channel model is based on a weighted Damerau-Levenshtein edit distance. Brill Our goal is to harvest alternation candidates; therefore, exactly how they are used in the search task (whether it is used to substitute the original query, to expand it, or simply to suggest an alternative) is not a concern to us here. 2 and Moore (2000) proposed a general, improved source model for general spelling correction, while Ahmad and Kondrak (2005) learned a spelling error model from search query logs using the Expectation Maximization algorithm, without relying on a training set of misspelled words and their corrections. Extending the work of Cucerzan and Brill (2004), Li et al. (2006) proposed to include semantic similarity between the query and its correction candidate. They point out that adventura is a common misspelling of aventura, not adventure, and this cannot be captured by a simple string edit distance, but requires some knowledge of distributional similarity. Distributional similarity is measured by the similarity of the context shared by two terms, and has been successfully applied to many natural language processing tasks, including semantic knowledge acquisition (Lin, 1998). Though the use of distributional similarity improved the query correction results in Li et al.'s work, one problem is that it is sparse and is not available for many rarer query strings. Chen et al. (2007) addressed this problem by using external information (i.e., web search results); we take a different approach to solve the sparseness problem, namely by using semantic kernels. Jones et al. (2006a) generated Japanese query alteration pairs from by mining query logs and built a regression model which predicts the quality of query rewriting pairs. Their model includes a wide variety of orthographical features, but not semantic similarity features. 3 Query Alteration Model 3.1 Problem Formulation We employ a formulation of query alteration model that is similar to conventional query correction models. Given a query string q as input, a query correction model finds a correct alteration c within the confusion set of q, so that it maximizes the posterior probability: c = arg cCF(q)C max P (c|q) (1) where C is the set of all white-space separated words and their bigrams in query logs in our case3 , and In regular text, Japanese uses no white spaces to separate words; however, white spaces are often (but not consistently) 3 192 CF(q) C is the confusion set of q, consisting of the candidates within a certain edit distance from q, i.e., CF(q) = {c C|ED(q, c) < }. 
We set = 24 using an unnormalized edit distance. The detail of the edit distance ED(q, c) is described in Section 3.2. The query string q itself is contained in CF(q), and if the model output is different from q, it means the model suggests a query alteration. Formulated in this way, both query error detection and alteration are performed in a unified way. After computing the posterior probability of each candidate in CF(q) by the source channel model (Section 3.2), an N-best list is obtained as the initial candidate set C0 , which is then augmented by the bootstrapping method Tchai (Section 3.4) to create the final candidate list C(q). The candidates in C(q) are re-ranked by a maximum entropy model (Section 3.5) and the candidate with the highest posterior probability is selected as the output. 3.2 Source Channel Model Source channel models are widely used for spelling and query correction (Brill and Moore, 2000; Cucerzan and Brill, 2004). Instead of directly computing Eq. (1), we can decompose the posterior probability using Bayes' rule as: c = arg cCF(q)C which we call the alpha-beta model in this paper. The model is a weighted extension of the normal Damerau-Levenshtein edit distance which equally penalizes single character insertion, substitution, or deletion operations (Damerau, 1964; Levenshtein, 1966), and considers generic edit operations of the form , where and are any (possibly null) strings. From misspelled/correct word pairs, alpha-beta trains the probability P ( |PSN), conditioned by the position PSN of in a word, where PSN {start of word, middle of word, end of word}. Under this model, the probability of rewriting a string w to a string s is calculated as: P (s|w) = RPart(w),T Part(s) max max P (c)P (q|c), (2) where the source model P (c) measures how probable the candidate c is, while the error model P (q|c) measures how similar q and c are. For the source model, an n-gram based statistical language model is the standard in previous work (Ahmad and Kondrak, 2005; Li et al., 2006). Word n-gram models are simple to create for English, which is easy to tokenize and to obtain word-based statistics, but this is not the case with Japanese. Therefore, we simply considered the whole input string as a candidate to be altered, and used the relative frequency of candidates in the query logs to build the language model: P (c) = Freq(c) . c C Freq(c ) (3) For the error model, we used an improved channel model described in (Brill and Moore, 2000), used to separate words in Japanese search queries, due to their keyword-based nature. which corresponds to finding best partitions R and T in all possible partitions Part(w) and Part(s). Brill and Moore (2000) reported that this model gave a significant improvement over conventional edit distance methods. Brill et al. (2001) applied this model for extracting katakana-English transliteration pairs from query logs. They trained the edit distance between character chunks of katakana and Roman alphabets, after converting katakana strings to Roman script. We also trained this model using 59,754 katakanaEnglish pairs extracted from aligned Japanese and English Wikipedia article titles. In this paper we allowed ||, || 3. The resulting edit distance is obtained as the negative logarithm of the alpha-beta probability, i.e., ED (q|c) = - log P (q|c). Since the edit operations are directional and c and q can be any string consisting of katakana and English, distance in both directions were considered. 
We also included a modified edit distance EDhd for simple kana-kana variations after converting them into Roman script. The distance EDhd is essentially the same as the normal Damerau-Levenshtein edit distance, with the modification that it does not penalize character halving (aa a) and doubling (a aa), because a large part of katakana variants only differ in halving/doubling (e.g. (supageti) vs (supagetii)4 . The final error probability is obtained from the minimum of these three distances: However, character length can be distinctive in katakana, as in biru `building' vs. biiru `beer'. 4 |R| i=1 P (Ri Ti |PSN(Ri )), 193 ED(q, c) P (q|c) = = min[ED (q|c), ED (c|q), EDhd (q, c)], (4) exp[-ED(q, c)] (5) where every edit distance is normalized to [0, 1] by multiplying by a factor of 2/(|q||c|) so that it does not depend on the length of the input strings5 . 3.3 Kernel-based Lexical Semantic Similarity 3.3.1 Distributional Similarity The source channel model described in Section 3.2 only considers language and error models and cannot capture semantic similarity between the query and its correction candidate. To address this issue, we use distributional similarity (Lin, 1998) estimated from query logs as additional evidence for query alteration, following Li et al. (2006). For English, it is relatively easy to define the context of a word based on the bag-of-words model. As this is not expected to work on Japanese, we define context as everything but the query string in a query log, as Pasca et al. (2006) and Komachi and ¸ Suzuki (2008) did for their information extraction tasks. This formulation does not involve any segmentation or boundary detection, which makes this method fast and robust. On the other hand, this may cause additional sparseness in the vector representation; we address this issue in the next two sections. Once the context of a candidate ci is defined as the patterns that the candidate co-occurs with, it can be represented as a vector ci = [pmi(ci , p1 ), . . . , pmi(ci , pM )] , where M denotes the number of context patterns and x is the transposition of a vector (or possibly a matrix) x. The elements of the vector are given by pointwise mutual information between the candidate ci and the pattern pj , computed as: pmi(ci , pj ) = log |ci , pj | , |ci , ||, pj | (6) wildcard, i.e., |ci , | = p |ci , p| and |, pj | = c |c, pj |. With these defined, the distributional similarity can be calculated as cosine similarity. Let ^i be the L2-normalized pattern vector of the candic date ci , and X = {^i } be the candidate-pattern coc occurrence matrix. The candidate similarity matrix K can then be obtained as K = X X. In the following, the (i, j)-element of the matrix K is denoted as Kij , which corresponds to the cosine similarity between candidates ci and cj . 3.3.2 Semantic Kernels Although distributional similarity serves as strong evidence for semantically relevant candidates, directly applying the technique to query logs faces the sparseness problem. Because context patterns are drawn from query logs and can also contain spelling errors, alterations, and word permutations as much as queries do, context differs so greatly in representations that even related candidates might not have sufficient contextual overlap between them. 
For example, a candidate "YouTube" matched against the patterns "YouTube+movie", "movie+YouTube" and "You-Tube+movii" (with a minor spelling error) will yield three distinct patterns "#+movie", "movie+#" and "#+movii"6 , which will be treated as three separate dimensions in the vector space model. This sparseness problem can be partially addressed by considering the correlation between patterns. Kandola et al. (2002) proposed new kernelbased similarity methods which incorporate indirect similarity between terms for a text retrieval task. Although their kernels are built on a document-term co-occurrence model, they can also be applied to our candidate-pattern co-occurrence model. The proposed kernel is recursively defined as: ^ ^ ^ ^ K = X GX + K, G = X KX + G, (7) where G = XX is the correlation matrix between patterns and is the factor to ensure that longer range effects decay exponentially. This can be interpreted as augmenting the similarity matrix K ^ through indirect similarities of patterns G and vice versa. Semantically related pairs of patterns are ex^ pected to be given high correlation in the matrix G and this will alleviate the sparseness problem. By `+' denotes a white space, and `#' indicates where the word of interest is found in a context pattern. 6 where |ci , pj | is the frequency of the pattern pj instantiated with the candidate ci , and `*' denotes a We did not include kanji variants here, because disambiguating kanji readings is a very difficult task, and the majority of the variations in queries are in katakana and Roman alphabet. The framework proposed in this paper, however, can incorporate kanji variants straightforwardly into ED(q, c) once we have reasonable edit distance functions for kanji variations. 5 194 Figure 3: Bootstrapping Additional Candidates Figure 2: Orthographically Augmented Graph solving the above recursive definition, one obtains the von Neumann kernel: ^ K() = K(I - K)-1 = t=1 t-1 K t . (8) This can also be interpreted in terms of a random walk in a graph where the nodes correspond to all the candidates and the weight of an edge (i, j) is given by Kij . A simple calculation shows that Kij equals the sum of the products of the edge weights over all possible paths between the nodes corresponding ci t and cj in the graph. Also, Kij corresponds to the probability that a random walk beginning at node ci ends up at node cj after t steps, assuming that the entries are all positive and the sum of the connections is 1 at each node. Following this notion, Kandola et al. (2002) proposed another kernel called exponential kernel, with alternative faster decay factors: ~ K() = K tK t t=1 In order to address this issue, we propose to augment the graph by weakly connecting the candidate and pattern nodes as shown in the graph (b) of Fig. 2 based on prior knowledge of orthographic similarity about candidates and patterns. This can be achieved using the following candidate similarity matrix K + instead of K: K + = SC + (1 - )X [SP + (1 - )I] X (10) where SC = {sc (i, j)} is the orthographical similarity matrix of candidates in which the (i, j)-element is given by the edit distance based similarity, i.e., sc (i, j) = exp [-ED(ci , cj )]. The orthographical similarity matrix of patterns SP = {sP (i, j)} is defined similarly, i.e., sP (i, j) = exp[-ED(pi , pj )]. Note that using this similarity matrix K + can be interpreted as a random walk process on a bipartite graph as follows. Let C and P as the sets of candidates and patterns. 
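To make the similarity machinery concrete, the numpy/scipy sketch below builds PMI vectors from a toy candidate-pattern count matrix, forms the cosine-similarity matrix K over L2-normalized candidate columns, and applies the von Neumann and exponential kernels of Eqs. (8) and (9). PMI is computed here in its standard total-scaled form, unseen pairs are clipped to zero, and the lambda value is arbitrary; none of these choices should be read as the paper's exact settings.

```python
import numpy as np
from scipy.linalg import expm

def pmi_matrix(cooc):
    """PMI of candidates (columns) against context patterns (rows); zeros for unseen pairs."""
    total = cooc.sum()
    pat = cooc.sum(axis=1, keepdims=True)   # pattern marginals
    cand = cooc.sum(axis=0, keepdims=True)  # candidate marginals
    with np.errstate(divide="ignore"):
        pmi = np.log(cooc * total) - np.log(pat) - np.log(cand)
    pmi[~np.isfinite(pmi)] = 0.0
    return pmi

def cosine_K(cooc):
    """Candidate similarity matrix from L2-normalized PMI columns (candidates as columns)."""
    X = pmi_matrix(cooc)
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    norms[norms == 0.0] = 1.0
    Xhat = X / norms
    return Xhat.T @ Xhat

def von_neumann_kernel(K, lam):
    """Eq. (8): K (I - lam*K)^{-1}; the underlying series converges for small enough lam."""
    return K @ np.linalg.inv(np.eye(K.shape[0]) - lam * K)

def exponential_kernel(K, lam):
    """Eq. (9): K exp(lam*K), the variant with faster-decaying weights."""
    return K @ expm(lam * K)

# 3 hypothetical context patterns x 4 candidates.
cooc = np.array([[5., 4., 0., 0.],
                 [2., 3., 0., 1.],
                 [0., 0., 6., 2.]])
K = cosine_K(cooc)
print(np.round(K, 2))
print(np.round(von_neumann_kernel(K, 0.2), 2))
print(np.round(exponential_kernel(K, 0.2), 2))
```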
K + corresponds to a single walking step from C to C, by either remaining within C with a probability of or moving to "the other side" P of the graph with a probability of 1-. When the walking remains in C, it is allowed to move to another candidate node following the candidate orthographic similarity SC . Otherwise it moves to P by the matrix X, chooses either to move within P with a probability SP or to stay with a probability 1 - , and finally comes back to C by the matrix X . Multiplication (K + )t corresponds to repeating this process t times. Using this similarity, we can define two orthographically augmented semantic kernels which differ in the decaying factors, augmented von Neumann kernel and exponential kernel: ^ K + () = K + (I - K + )-1 (11) ~ K + () = K + exp(K + ). (12) t! = K exp(K). (9) They showed that this alternative kernel achieved a better performance for their text retrieval task. We employed these two kernels to compute distributional similarity for our query correction task. 3.3.3 Orthographically Augmented Kernels Although semantic relatedness can be partially captured by the semantic kernels introduced in the previous section, they may still have difficulties computing correlations between candidates and patterns especially for only sparsely connected graphs. Take the graph (a) in Fig. 2 for example, which is a simplified yet representative graph topology for candidate-pattern co-occurrence we often encounter. In this case K = X X equals I, meaning that the connections between candidates and patterns are too sparse to obtain sufficient correlation even when semantic kernels are used. 3.4 Bootstrapping Additional Candidates Now that we have a semantic model, our query correction model can cover query-candidate pairs 195 1 C 0C = ) q ( C la n oitu birtsiD ytiralimis 1 P 1 1 P C n oitc ud ni nrettaP n oitcu d ni ec natsnI 0 0 P C le n nahc ecr u oS n oitc ud ni le dom nrettaP q yreuq t up nI "ebuT+uoY" "6egats" 3p 3c 2p 2c )b( "emina+#" "#+eivom" "eivom+#" "ebuTuoY" "ebuT+uoY" "6egats" "ebuTuoY" 1p 1c )a( "emina+#" "#+eivom" "eivom+#" 3p 3c 2p 2c 1p 1c which are only semantically related. However, previous work on query correction all used a string distance function and a threshold to restrict the space of potential candidates, allowing only the orthographically similar candidates. To collect additional candidates, the use of context-based semantic extraction methods would be effective because semantically related candidates are likely to share context with the initial query q, or at least with the initial candidate set C0 . Here we used the Tchai algorithm (Komachi and Suzuki, 2008), a modified version of Espresso (Pantel and Pennacchiotti, 2006) to collect such candidates. This algorithm starts with initial seed instances, then induces reliable context patterns cooccurring with the seeds, induces instances from the patterns, and iterates this process to obtain categories of semantically related words. Using the candidates in C0 as the seed instances, one bootstrapping iteration of the Tchai algorithm is executed to obtain the semantically related set of instances C1 . The seed instance reliabilities are given by the source channel probabilities P (c)P (q|c). Finally we take the union C0 C1 to obtain the candidate set C(q). This process is outlined in Fig. 3. 3.5 Maximum Entropy Model In order to build a unified probabilistic query alteration model, we used the maximum entropy approach of (Beger et al., 1996), which Li et al. 
(2006) also employed for their query correction task and showed its effectiveness. It defines a conditional probabilistic distribution P (c|q) based on a set of feature functions f1 , . . . , fK : exp K i fi (c, q) i=1 , (13) P (c|q) = exp K i fi (c, q) c i=1 where 1 , . . . , K are the feature weights. The optimal set of feature weights can be computed by maximizing the log-likelihood of the training set. We used the Generalized Iterative Scaling (GIS) algorithm (Darroch and Ratcliff, 1972) to optimize the feature weights. GIS trains conditional probability in Eq. (13), which requires the normalization over all possible candidates. However, the number of all possible candidates C obtained from a query log can be very large, so we only calculated the sum over the candidates in C(q). This is the same approach that Och and Ney (2002) took for statistical machine translation, and Li et al. (2006) for query spelling correction. We used the following four categories of functions as the features: 1. Language model feature, given by the logarithm of the source model probability: log P (c). 2. Error model features, which are composed of three edit distance functions: -ED (q|c), -ED (c|q), and -EDhd (q, c). 3. Similarity based feature, computed as the logarithm of distributional similarity between q and c: log sim(q, c), which is calcualted using one of the ^ ~ ^ following kernels (Section 3.3): K, K, K, K + , ~ + . The similarity values were normalized and K to [0, 1] after adding a small discounting factor = 1.0 × 10-5 . 4. Similarity based correction candidate features, which are binary features with a value of 1 if and only if the frequency of c is higher than that of q, and distributional similarity between them is higher than a certain threshold. Li et al. (2006) used this set of features, and suggested that these features give the evidence that q may be a common misspelling of c. The thresholds on the normalized distributional similarity are enumerated from 0.5 to 0.9 with the interval 0.1. 4 Experiment 4.1 Dataset Creation For all the experiments conducted in this paper, we used a subset of the Japanese search query logs submitted to Live Search (www.live.com) in November and December of 2007. Queries submitted less than eight times were deleted. The query log we used contained 83,080,257 tokens and 1,038,499 unique queries. Models of query correction in previous work were trained and evaluated using manually created querycandidate pairs. That is, human annotators were given a set of queries and were asked to provide a correction for each query when it needed to be rewritten. As Cucerzan and Brill (2004) point out, however, this method is seriously flawed in that the intention of the original query is completely lost to the annotator, without which the correction is often impossible: it is not clear if gogle should be corrected to google or goggle, or neither -- gogle may be a brand new product name. Cucerzan and Brill 196 therefore performed a second evaluation, where the test data was drawn by sampling the query logs for successive queries (q1 , q2 ) by the same user where the edit distance between q1 and q2 are within a certain threshold, which are then submitted to annotators for generating the correction. 
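Returning to the maximum entropy re-ranker of Section 3.5, the conditional probability of Eq. (13) reduces, with normalization restricted to the candidate list C(q), to a softmax over weighted feature sums. The sketch below assumes the weights have already been trained (the paper fits them with GIS) and uses invented feature names and values.

```python
import math

def p_c_given_q(features_per_cand, weights):
    """Eq. (13) restricted to the candidate list: softmax over feature dot products."""
    scores = {c: sum(weights[k] * v for k, v in feats.items())
              for c, feats in features_per_cand.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

# Placeholder feature values for three candidates of one query.
weights = {"log_lm": 1.0, "neg_edit": 0.9, "log_sim": 0.7}
cands = {"ipod": {"log_lm": -2.0, "neg_edit": -1.0, "log_sim": -0.5},
         "ipot": {"log_lm": -6.0, "neg_edit":  0.0, "log_sim": -6.0},
         "ipad": {"log_lm": -2.5, "neg_edit": -1.0, "log_sim": -3.0}}
post = p_c_given_q(cands, weights)
print(max(post, key=post.get), {c: round(p, 3) for c, p in post.items()})
```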
4 Experiment

4.1 Dataset Creation

For all the experiments conducted in this paper, we used a subset of the Japanese search query logs submitted to Live Search (www.live.com) in November and December of 2007. Queries submitted fewer than eight times were deleted. The query log we used contained 83,080,257 tokens and 1,038,499 unique queries.

Models of query correction in previous work were trained and evaluated using manually created query-candidate pairs. That is, human annotators were given a set of queries and were asked to provide a correction for each query when it needed to be rewritten. As Cucerzan and Brill (2004) point out, however, this method is seriously flawed in that the intention of the original query is completely lost to the annotator, without which the correction is often impossible: it is not clear whether gogle should be corrected to google or goggle, or neither; gogle may be a brand-new product name. Cucerzan and Brill therefore performed a second evaluation, where the test data was drawn by sampling the query logs for successive queries (q1, q2) by the same user for which the edit distance between q1 and q2 is within a certain threshold; these pairs were then submitted to annotators for generating the correction.

While this method makes the annotation more reliable by relying on user (rather than annotator) reformulation, the task is still overly difficult: going back to the example in Section 1, it is unclear which spelling of `protein' produces the best search results; it can only be determined empirically. Their method also eliminates all pairs of candidates that are not orthographically similar. We have therefore improved their method in the following manner, making the process more automated and thus more reliable.

We first collected a subset of the query log that contains only those pairs (q1, q2) such that q1 and q2 are issued successively by the same user, q2 is issued within 3 minutes of q1, and q2 resulted in a click on the result page while q1 did not. The last condition adds evidence that q2 was a better formulation than q1. We then ranked the collected query pairs using the log-likelihood ratio (LLR) (Dunning, 1993), which measures the dependence between q1 and q2 within the context of web queries (Jones et al., 2006b). We randomly sampled 10,000 query pairs with LLR ≥ 200 and submitted them to annotators, who only confirm or reject a query pair as being synonymous. For example, q1 = nikon and q2 = canon are related but not synonymous, while we are reasonably sure that q1 = ipot and q2 = ipod are synonymous, given that this pair has a high LLR value. This verification process is extremely fast and consistent across annotators: it takes less than 1 hour to go through 1,000 query pairs, and the inter-annotator agreement rate of two annotators on 2,000 query pairs was 95.7%. We annotated 10,000 query pairs consisting of alphanumerical and kana characters in this manner. After rejecting non-synonymous pairs and those which do not co-occur with any context patterns, 6,489 pairs remained; we used 1,243 pairs for testing, 628 as a development set, and 4,618 for training the maximum entropy model.

4.2 Experimental Settings

The performance of query alteration was evaluated based on the following measures (Li et al., 2006).

Table 1: Performance results (%)
Model      Accuracy  Recall  Precision
SC         71.12     39.29   45.09
ME-NoSim   74.58     44.58   52.52
ME-Cos     74.18     45.84   50.70
ME-vN      74.34     45.59   52.16
ME-Exp     73.61     44.84   50.57
ME-vN+     75.06     44.33   53.01
ME-Exp+    75.14     44.08   53.52

The input queries, correct suggestions, and outputs were matched in a case-insensitive manner.

· Accuracy: The number of correct outputs generated by the system divided by the total number of queries in the test set;
· Recall: The number of correct suggestions for altered queries divided by the total number of altered queries in the test set;
· Precision: The number of correct suggestions for altered queries divided by the total number of alterations made by the system.

The parameters for the kernels, namely α, β, and γ, are tuned using the development set. The values finally employed are: α = 0.3 for K̂, K̃, and K̂⁺ and α = 0.2 for K̃⁺; β = 0.2 and γ = 0.4 for K̂⁺; and β = 0.35 and γ = 0.7 for K̃⁺. In the source channel model, we manually scaled the language model probability by a factor of 0.1 to alleviate the bias toward highly frequent candidates. As the initial candidate set C0, the top 50 instances were selected by the source channel model, and 100 patterns were extracted as P0 by the Tchai iteration after removing generic patterns, which we detected simply by rejecting those which induced more than 200 unique instances. Finally, the top 30 instances were induced using P0 to create C1.
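The data collection of Section 4.1 can be sketched roughly as follows. The session format and the 2x2 contingency table passed to Dunning's log-likelihood ratio are our assumptions; the paper follows Jones et al. (2006b) and may count co-occurrences differently.

```python
import math
from collections import Counter

def llr_2x2(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table."""
    def h(*ks):  # sum of k * log(k / total), skipping zero counts
        total = sum(ks)
        return sum(k * math.log(k / total) for k in ks if k > 0)
    return 2 * (h(k11, k12, k21, k22) - h(k11 + k12, k21 + k22) - h(k11 + k21, k12 + k22))

def collect_pairs(sessions, max_gap_sec=180, llr_threshold=200.0):
    """sessions: per-user, time-ordered lists of (timestamp, query, clicked) tuples.
    Returns candidate (q1, q2) reformulation pairs with LLR >= threshold."""
    pair_counts, q1_counts, q2_counts, total = Counter(), Counter(), Counter(), 0
    for events in sessions:
        for (t1, q1, c1), (t2, q2, c2) in zip(events, events[1:]):
            # q2 follows q1 quickly, q2 led to a click while q1 did not
            if q1 != q2 and t2 - t1 <= max_gap_sec and c2 and not c1:
                pair_counts[(q1, q2)] += 1
                q1_counts[q1] += 1
                q2_counts[q2] += 1
                total += 1
    scored = []
    for (q1, q2), k11 in pair_counts.items():
        k12 = q1_counts[q1] - k11            # q1 followed by something other than q2
        k21 = q2_counts[q2] - k11            # q2 preceded by something other than q1
        k22 = max(total - k11 - k12 - k21, 0)
        score = llr_2x2(k11, k12, k21, k22)
        if score >= llr_threshold:
            scored.append((score, q1, q2))
    return sorted(scored, reverse=True)
```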
Generic instances were not removed in this process because they may still be alterations of the input query q. The maximum size of P1 was set to 2,000, after removing unreliable patterns with reliability smaller than 0.0001.

4.3 Results

Table 1 shows the evaluation results. SC is the source channel model, while the others are maximum entropy (ME) models with different features. ME-NoSim uses the same features as SC, but considerably outperforms SC in all three measures, confirming the superiority of the ME approach. Decomposing the three edit distance functions into three separate features in the ME model may also explain the better result. All the ME approaches outperformed SC in accuracy with a statistically significant difference (p < 0.0001 on McNemar's test).

The model with the cosine similarity (ME-Cos) in addition to the basic set of features yielded higher recall compared to ME-NoSim, but decreased accuracy and precision, which are more important than recall for our purposes because a false alteration does more damage than no alteration. This is also the case when the kernel-based methods, ME-vN (the von Neumann kernel) and ME-Exp (the exponential kernel), are used in place of the cosine similarity. This shows that using semantic similarity does not always help, which we believe is due to the sparseness of the contextual information used in computing semantic similarity. On the other hand, ME-vN+ (with the augmented von Neumann kernel) and ME-Exp+ (with the augmented exponential kernel) increased both accuracy and precision with a slight decrease in recall, compared to the distributional similarity baseline and the non-augmented kernel-based methods. ME-Exp+ was significantly better than ME-Exp (p < 0.01). Note that the accuracy values appear lower than some of the previous results on English (e.g., more than 80% in Li et al. (2006)), but this is because the dataset creation method we employed tends to over-represent the pairs that lead to alteration: the simplest baseline (always propose no alteration) achieves 67.3% accuracy on our data, in contrast to 83.4% on the data used in Li et al. (2006).

Manually examining the suggestions made by the system also confirms the effectiveness of our model. For example, the similarity-based models altered the query ipot to ipod, while the simple ME-NoSim model failed, because it depends too much on the edit distance-based features. We also observed that many of the suggestions made by the system were actually reasonable, even though they were different from the annotated gold standard. For example, ME-vN+ suggests a rewrite of the query 2tyann as 2ちゃんねる (`2-channel'), while the gold standard was the abbreviated form 2ちゃん (`2-chan'). To take such possibly correct candidates into account, we conducted a follow-up experiment in which we considered multiple reference alterations, created automatically from our data set in the following manner. Suppose that a query q1 is corrected as q2, and q2 is corrected as q3 in our annotated data. In this case, we considered q1 → q3 to be a valid alteration as well. By applying this chaining operation up to 5 degrees of separation, we re-created a set of valid alterations for each input query.
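The chaining operation just described amounts to a depth-limited, directional closure over the annotated correction pairs. A minimal sketch, assuming the annotations are stored as a dictionary from a query to its directly annotated corrections:

```python
def expand_references(corrections, max_hops=5):
    """corrections: dict mapping a query to the set of its directly annotated corrections.
    Returns, for each query, the set of alterations reachable by chaining
    q1 -> q2 -> ... up to max_hops steps; directionality is preserved
    (q1 -> q3 is added, q3 -> q1 is not)."""
    expanded = {}
    for q in corrections:
        reached, frontier = set(), {q}
        for _ in range(max_hops):
            frontier = {t for f in frontier for t in corrections.get(f, ())} - reached - {q}
            if not frontier:
                break
            reached |= frontier
        expanded[q] = reached
    return expanded
```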
Note that directionality is important: in the above example, q1 → q3 is valid, while q3 → q1 is not. Table 2 shows the results of the evaluation with multiple references.

Table 2: Performance with the multiple reference model (%)
Model      Accuracy  Recall  Precision
SC         75.30     48.61   55.78
ME-NoSim   79.49     56.17   66.17
ME-Cos     79.32     58.19   64.35
ME-vN      79.24     57.18   65.42
ME-Exp     78.52     56.42   63.64
ME-vN+     79.89     55.67   66.57
ME-Exp+    79.81     54.91   66.67

The numbers substantially improved over the single reference case, as expected, but did not affect the relative performance of each model. Again, the differences in accuracy between the SC and ME models, and between ME-Exp and ME-Exp+, were statistically significant (p < 0.01).

5 Conclusion and future work

In this paper we have presented a unified approach to Japanese query alteration. Our approach draws on previous research in English spelling and query correction, Japanese katakana variation and transliteration, and semantic similarity, and builds a model that improves over previously proposed query correction methods. In particular, the use of orthographically augmented semantic kernels proposed in this paper is general and applicable to query alteration in other languages, including English, especially when data sparseness is an issue. In the future, we also plan to investigate other methods, such as PLSI (Hofmann, 1999), to deal with data sparseness in computing semantic similarity.

Acknowledgments

This research was conducted during the first author's internship at Microsoft Research. We thank our colleagues, especially Dmitriy Belenko, Chris Brockett, Jianfeng Gao, Christian König, and Chris Quirk, for their help in conducting this research.

References

Farooq Ahmad and Grzegorz Kondrak. 2005. Learning a spelling error model from search query logs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2005), pages 955-962. Eiji Aramaki, Takeshi Imai, Kengo Miyo, and Kazuhiko Ohe. 2008. Orthographic disambiguation incorporating transliterated probability. In Proceedings of the Third International Joint Conference on Natural Language Processing (IJCNLP-2008), pages 48-55. Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-72. Slaven Bilac and Hozumi Tanaka. 2004. A hybrid back-transliteration system for Japanese. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-2004), pages 597-603. Eric Brill and Robert C. Moore. 2000. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000), pages 286-293. Eric Brill, Gary Kacmarcik, and Chris Brockett. 2001. Automatically harvesting katakana-English term pairs from search engine query logs. In Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium (NLPRS-2001), pages 393-399. Qing Chen, Mu Li, and Ming Zhou. 2007. Improving query spelling correction using web search results. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 181-189. Silviu Cucerzan and Eric Brill. 2004. Spelling correction as an iterative process that exploits the collective knowledge of web users. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2004), pages 293-300. Fred Damerau. 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3):659-664. J. N. Darroch and D. Ratcliff. 1972. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43:1470-1480. Ted Dunning. 1993.
Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61­74. Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Research and Development in Information Retrieval, pages 50­57. Rosie Jones, Kevin Bartz, Pero Subasic, and Benjamin Rey. 2006a. Automatically generating related queries in japanese. Language Resources and Evaluation (LRE), 40(3-4):219­232. Rosie Jones, Benjamin Rey, Omid Madani, and Wiley Greiner. 2006b. Generating query substitutions. In Proceedings of the 15th international World Wide Web conference (WWW-06), pages 387­396. Jaz Kandola, John Shawe-Taylor, and Nello Cristianini. 2002. Learning semantic similarity. In Neural Information Processing Systems (NIPS 15), pages 657­664. Kevin Knight and Jonathan Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599­ 612. Mamoru Komachi and Hisami Suzuki. 2008. Minimally supervised learning of semantic knowledge from query logs. In Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP-2008), pages 358­365. Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physice - Doklady, 10:707­710. Mu Li, Muhua Zhu, Yang Zhang, and Ming Zhou. 2006. Exploring distributional similarity based models for query spelling correction. In Proceedings of COLING/ACL-2006, pages 1025­1032. Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of COLING/ACL-1998, pages 786­774. Takeshi Masuyama, Satoshi Sekine, and Hiroshi Nakagawa. 2004. Automatic construction of japanese katakana variant list from large corpus. In Proceedings of Proceedings of the 20th international conference on Computational Linguistics (COLING-2004), pages 1214­1219. Franz Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th annual meeting of ACL, pages 295­302. Marius Pasca, Dekang Lin, Jeffrey Bigham, Andrei Lif¸ chits, and Alpa Jain. 2006. Organizing and searching the world wide web of facts - step one: the one-million fact extraction challenge. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI06), pages 1400­1405. Patrick Pantel and Marco Pennacchiotti. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of COLING/ACL-2006, pages 113­120. 199 Context-based Message Expansion for Disentanglement of Interleaved Text Conversations Lidan Wang Computer Science Dept./UMIACS University of Maryland, College Park College Park, MD 20742 lidan@cs.umd.edu Douglas W. Oard College of Information Studies/UMIACS and HLT Center of Excellence University of Maryland, College Park College Park, MD 20742 oard@umd.edu Abstract Computational processing of text exchanged in interactive venues in which participants engage in simultaneous conversations can benefit from techniques for automatically grouping overlapping sequences of messages into separate conversations, a problem known as "disentanglement." While previous methods exploit both lexical and non-lexical information that exists in conversations for this task, the inter-dependency between the meaning of a message and its temporal and social contexts is largely ignored. Our approach exploits contextual properties (both explicit and hidden) to probabilistically expand each message to provide a more accurate message representation. 
Extensive experimental evaluations show our approach outperforms the best previously known technique. "threading") (Yeh et al., 2006). The extensive metadata associated with email and the relatively rich content of some email messages makes email somewhat of a special case in the broad set of conversation recovery tasks, however. At the opposite extreme, conversation "threading" in multi-party spoken interactions (e.g., meetings) would be a compelling application, but the word error rate of current automated transcription techniques somewhat limits access to the lexical evidence that we know is useful for this task. The recent interest in identifying individual conversations from online-discussions, a task that some refer to as "disentanglement," therefore seems to be something of a middle ground in the research space: computationally tractable, representative to some degree of a broader class of problems, and directly useful as a pre-processing step for a range of important applications. One way to think of this task is as a clustering problem--we seek to partition the messages into a set of disjoint clusters, where each cluster represents a conversation among a set of participants on a topic. This formulation raises the natural question of how we should design a similarity measure. Since the messages are often too short to be meaningful by themselves, techniques based solely on lexical overlap (e.g., inner products of term vectors weighted by some function of term frequency, document frequency and message length) are unlikely to be successful. For instance, consider the multi-party exchange in Figure 1, in which a single message may not convey much about the topic without considering what has been said before, and who said it. Fortunately for us, additional sources of evidence 1 Introduction Conversational media such as the text messages found in Internet Relay Chat presents both new opportunities and new challenges. Among the challenges are that individual messages are often quite short, for the reason that conversational participants are able to assemble the required context over the course of a conversation. A natural consequence of this is that many tasks that we would like to perform on conversational media (e.g., search, summarization, or automated response) would benefit from reassembly of individual messages into complete conversations. This task has been studied extensively in the context of email (where it is often referred to as Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 200­208, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics 200 (18323 Ricardo) is there a way to emulate input for a program listening on a COM port? (18911 Azzie) Ricardo: Hello there, how is it going? (18939 Ricardo) pretty good, just at the office, about to leave. How are you? (18970 Azzie) well, end of semester work, what could be better? (18980 Josephina) if it's just reading from /dev/ttyS0 or something you could somehow get it to just read from a named pipe instead (19034 Ricardo) Josephina: I might just have to end up modifying the entire program... (19045 Ricardo) so it can read from a different input stream Figure 1: An example of the text message stream. The number before each author's name denotes the timestamp of the message. are available. As we describe below, messages are strongly correlated both temporally (i.e., across time) and socially (i.e,, across participants). 
For example, in our running example in Figure 1, Ricardo's message (19045 Ricardo) "so it can read from a different input stream" elaborates on his previous message (19034 Ricardo) to Josephina. Messages that are close in time and from the same speaker can share related meanings. Similarly, we see that Ricardo's messages to Josephina (19034 Ricardo and 19045 Ricardo) are responses to earlier comments made by Josephina (18980 Josephina), and that fact is signaled by Ricardo invoking Josephena's name. This is an example of social correlation: lexicalized references to identity can also provide useful evidence. If we take social and temporal context into account, we should be able to do better at recognizing conversations than we could using lexical overlap alone. In recent years, several approaches have been developed for detecting conversational threads in dynamic text streams (Elsner et al., 2008; Shen et al., 2006; Wang et al., 2008). Although they use both lexical and non-lexical information (e.g., time, name mentions in message) for this task, they have ignored the temporal and social contexts a message appears in, which provide valuable cues for interpreting the message. Correlation clustering used in a two-step approach (Elsner et al., 2008) exploits message contexts to some degree, but its performance is largely limited by the classifier used in the first-step which computes message similarity without considering the temporal and social contexts of each message. Our approach exploits contextual properties (both explicit and hidden) to probabilistically expand each message to provide a more accurate message representation. The new representation leads to a much improved performance for conversation disentanglement. We note that this is a general approach and can be applied to the representation of non-chat data that exhibits temporal and social correlations as well. The results that we obtain with this technique are close to the limit of what we can measure using present test collections and evaluation measures. To the best of our knowledge, our work is the first to apply document expansion to the conversation disentanglement problem. 2 Related Work Previous work in conversation disentanglement (i.e. thread detection) has shown the conventional lexical-based clustering is not suitable for text streams because messages are often too short and incomplete. They focus on using discourse/chatspecific features to bias the lexical-based message similarity (Elsner et al., 2008; Shen et al., 2006; Wang et al., 2008). These features provide the means to link messages that may not have sufficient lexical overlap but are nevertheless likely to be topically related. However, our work is different from them in several aspects: (1) They treat individual messages as the basic elements for clustering, and ignore the social and temporal contexts of the messages. In our work, each message is probabilistically expanded using reliable information from its contexts and the expanded messages are the basic elements for clustering. (2) Messages have different amount of explicit information. For example, messages that initiate conversations may have more name mentions than subsequent messages (i.e. for establishing conversations). Previous work only uses what are explicitly present in each message, and clusters may be erroneously assigned for messages that lack enough explicit in- 201 formation. Our work exploits both explicit and implicit context for each message due to how we define contexts (Section 3.2.1). 
(3) Most work imposes a fixed window size for clustering and it may break up long conversations or may not be fine-grained enough for short conversations. Given each message, we use an exponential decay model to naturally encode time effect and assign differential weights to messages in its contexts. Another thread of related work is document expansion. It was previously studied in (Singhal et al., 1999) in the context of the speech retrieval, helping to overcome limitations in the transcription accuracy by selecting additional terms from lexically similar (text) documents. Document expansion has also been applied to cross-language retrieval in (Levow et al., 2005), in that case to overcome limitations in translation resources. The technique has recently been re-visited (Tao et al., 2006; Kurland et al., 2004; Liu et al., 2004) in the language modeling framework, where lexically related documents are used to enlarge the sample space for a document to improve the accuracy of the estimated document language model. However, these lexical-based approaches are less well suited to conversational interaction, because conversational messages are often short, they therefore may not overlap sufficiently in words with other messages to provide a useful basis for expansion. Our technique can be viewed as an extension of these previous methods to text streams. Our work is also related to text segmentation (Ji et al., 2003) and meeting segmentation (Malioutov et al., 2006; Malioutov et al., 2007; Galley et al., 2003; Eisenstein et al., 2008). Text segmentation identifies boundaries of topic changes in long text documents, but we form threads of messages from streams consisting of short messages. Meeting conversations are not as highly interleaving as chat conversations, where participants can create a new conversation at any time. 3.1 Context-Free Message Model To represent the semantic information of messages and threads (clusters of messages), most of the prior approaches build a document representation on each message alone (using word features and time-stamp and/or discourse features found in the message). We call such a model a context-free message model. Most commonly, a message is represented as a vector (Salton, 1989). Each dimension corresponds to a separate term. If a term occurs in the message, its value in the vector is non-zero. Several different ways of computing these values, known as term weights, have been developed. One of the best known schemes is tf-idf weighting. However, in conversational text, a context-free model cannot fully capture the semantics of messages. The meaning of a message is highly dependent on other messages in its context. For example, in our running example in Figure 1, to fully interpret the message 19045 Ricardo, we need to first read his previous message (19034 Ricardo) to Josephina. Further, messages on the same topic may have little or no overlap in words (Figure 1), and the messages between participants are highly interactive and are often too short and incomplete to fully capture a topic on their own. 3.2 Context-Sensitive Message Model 3 Method Our main idea is to exploit the temporal and social aspects of the conversations to build a contextsensitive document model for each message. We do this by first identifying the temporal and social contexts for each message, then probabilistically expanding the content of each message with selected messages in each context. As we have seen, a message's contexts provide valuable cues for interpreting the message. 
Finally, we cluster the messages into distinct conversations based on their new representation models. We present the formal definitions of each context and discuss how to model them in Section 3.2.1. In Section 3.2.2, we show how to efficiently identify the related messages in each context, and how to use them to expand our representation of the message. This section describes our technique for clustering messages into threads based on the lexical similarity of documents that have been expanded based on social and temporal evidence.

3.2.1 Social and Temporal Contexts

Social contexts: we define two kinds of social contexts, author context and conversational context. We explain them in detail below.

[Figure 2 (plots omitted; each panel shows the probability of two messages being in the same thread as a function of their time difference): (i) relationship between messages from the same author, (ii) relationship between messages that mention each other's authors, and (iii) all pairs of messages. Estimation is based on the training data used in the experiments.]

Author context: the author context of a message m, denoted by C_A(m), is the set of other messages written by m's author a_m:

C_A(m) = { m_i | a_{m_i} = a_m, m_i ≠ m }

Further, because of the nature of human conversations, we would be less surprised to find messages from the same person belonging to the same conversation if they are close in time rather than far apart. This is illustrated in Figure 2(i), which shows the probability that a pair of messages written by the same person belong to the same conversation as a function of the time difference between them. Not surprisingly, messages in m's author context have probabilities which are influenced by their temporal proximity to m. We use a normal distribution (Figure 2(i)) to encode the notion of author context. (The Gaussian kernels shown for illustration purposes in Figure 2 are un-normalized.) Given two messages m_i and m_j written by the same author, with time-stamps t_i and t_j respectively, the probability that m_j is topically related to m_i given their time difference d = t_j − t_i is:

P_a(d) = N(\mu_a, \sigma_a^2) = \frac{1}{\sigma_a \sqrt{2\pi}} e^{-\frac{(d-\mu_a)^2}{2\sigma_a^2}}

The exponential decay helps to limit the influence from temporally remote messages. For message m_i, this distribution models the uncertainty that messages in its author context (i.e., other messages m_j from the same author) belong to the same conversation by assigning a high value to m_j if t_j − t_i is small. The mean µ_a is chosen to be zero so that the curve is centered at each message. The variance can be readily estimated from training data.

Conversational context: the second kind of social context is the conversational context, which is constructed from name mentions. As pointed out by previous linguistic studies of discourse, especially analyses of multi-party conversation (ONeill et al., 2003), one key difference between multi-party conversation and typical two-party conversation is the frequency with which participants mention each other's names. Name mentioning is hypothesized to be a strategy for participants to compensate for the lack of cues normally present in face-to-face dialogue (ONeill et al., 2003; Elsner et al., 2008). Although infrequent, name mentions (such as Azzie's comments to Ricardo in Figure 1) provide a means for linking two speakers and their messages. The conversational context of m, C_C(m), is defined to be the set of all messages written by people whose names are mentioned in any of a_m's messages (where a_m is the author of m), or who mention a_m in their messages. Let M_a denote all messages written by author a.
The conversational context of m is:

C_C(m) = \bigcup_{a:\, mention(a_m, a)} M_a \;\cup\; \bigcup_{a:\, mention(a, a_m)} M_a

where mention(a_m, a) = true if author a_m mentions a in any of a_m's messages; mention(a, a_m) is similarly defined.

Discussion: From the definition, m_j is included in m_i's conversational context if the author of m_i mentions the author of m_j in any of m_i's messages, or vice versa. For instance, the conversational context for Ricardo's message (19034 Ricardo) in Figure 1 includes the messages from Josephina (18980 Josephina) due to the mentioning of Josephina in his message. However, it may well be the case that m_i does not contain any name mentions, e.g. Ricardo's message to Azzie (18939 Ricardo). In this case, if Ricardo is being mentioned by another author (here Azzie asks Ricardo a question by starting with his name in 18911 Azzie), message (18939 Ricardo)'s conversational context will contain all of Azzie's messages (18911 and 18970 Azzie) according to the above definition. This intuitively captures the implicit question-answer patterns in conversational speech: Ricardo's subsequent answer is a response to Azzie's comments, hence they are in each other's conversational context. Our definition also accounts for another source of implicit context. In interactive conversations name mention is a tool for getting people's attention and starting a conversation. Once a participant a_i establishes a conversation with a_j (such that a_i may mention a_j's name in an initial message m_p to a_j), a_i may stop mentioning a_j's name in subsequent messages (m_q) to a_j. This is illustrated in Ricardo's last message to Josephina in Figure 1. Our definition accounts for the conversation continuity between a_j and a_i by including messages from a_j in the conversational context of subsequent messages m_q from a_i (note that m_q may or may not mention a_j). For instance, message 19045 Ricardo continues the conversation with Josephina from 19034 Ricardo, so message 19045 Ricardo has Josephina's messages as part of its conversational context. In general, a person can participate in multiple conversations over time, but as time goes on the topic of interest may shift and the person may start talking to other people. So the messages in the conversational context of m_i that are due to earlier discussions with other people should be assigned a lower confidence value for m_i. For example, five hours later Ricardo may still be active, but it is unlikely he still chats with Josephina on the same topic, so the earlier messages by Josephina should receive a small confidence value in the conversational context of Ricardo's later messages. We illustrate this idea in Figure 2(ii). It shows the probability that message m_j, where m_j ∈ C_C(m_i), belongs to the same thread as m_i, given their time difference t_j − t_i. This is encoded with a normal probability distribution N(µ_c, σ_c²), where µ_c = 0 and the variance is estimated from training data.
Let d = t_j − t_i; the probability that they are topically related given m_j ∈ C_C(m_i) is:

P_c(d) = \frac{1}{\sigma_c \sqrt{2\pi}} e^{-\frac{d^2}{2\sigma_c^2}}

Temporal context: the temporal context of message m, C_T(m), refers to all other messages:

C_T(m) = M \setminus \{m\}

where M denotes the entire set of messages. The intuition is that messages near to m can provide further evidence for the semantics of m. This is illustrated in Figure 2(iii). From the viewpoint of document smoothing, this can also be regarded as using temporally nearby messages to smooth the representation of m. So given m_i, we again model its temporal context by fitting a normal probability distribution N(µ_t, σ_t²), so that if m_j ∈ C_T(m_i) and d = t_j − t_i, the probability that m_j is topically related to m_i is:

P_t(d) = \frac{1}{\sigma_t \sqrt{2\pi}} e^{-\frac{d^2}{2\sigma_t^2}}

3.2.2 Constructing Expanded Messages

We have shown how to use the social and temporal aspects of conversational text to identify and model the contexts of each message, and how to assign confidence values to messages in its contexts. We now show how to use a message's contexts and their associated messages to probabilistically expand the given message. We hypothesize that the expanded message provides a more accurate message representation and that this improved representation can lead to improved accuracy for conversation disentanglement. We will test this hypothesis in the experiment section. Each message m is represented as a vector of estimated term counts. We expand m using the normalized messages in its contexts. For the expanded message m′ of m we estimate the term counts as a linear mixture of term counts from each message in each context:

c(w, m') = \lambda\, c(w, m) + (1 - \lambda)\Big\{ \lambda_C \sum_{m_j \in C_C(m)} P_c(d_{ji})\, c(w, m_j) + \lambda_A \sum_{m_j \in C_A(m)} P_a(d_{ji})\, c(w, m_j) + \lambda_T \sum_{m_j \in C_T(m)} P_t(d_{ji})\, c(w, m_j) \Big\}

These parameter values are tuned on training data: λ controls how much relative weight we give to the lexical content of m (0.45 in our experiments), and λ_C, λ_A and λ_T are the relative weights assigned to the conversational, author and temporal contexts (0.6, 0.3, and 0.1 in our experiments, respectively). A context with a large variance in its normal density should receive a small weight. This is because a large variance in context k implies more uncertainty about a message m_j being topically related to m while m_j is in context k of m. In Figure 2, the conversational context (Figure 2(ii)) has the minimum variance among all contexts; hence, it is more accurate for linking messages related in topic and it is assigned a higher value (0.6), while the temporal context has the lowest value (0.1). Finally, for a message m_j in context k of m_i, P_k(d_{ji}) indicates how strongly we believe m_j is topically related to m_i, given their time difference d_{ji}. Because of the exponential decay of the normal densities that model the contexts, messages in a context contribute differentially to m_i; temporally distant messages have a very low density.

3.3 Single-Pass Clustering

The expanded messages are the basic elements for clustering. The cosine is used to measure similarity:

sim(m_i, m_j) = \frac{\sum_w c(w, m_i)\, c(w, m_j)}{\|m_i\| \, \|m_j\|}

Single-pass clustering is then performed: treat the first message as a single-message cluster T; for each remaining message m, compute for every existing thread T

sim(m, T) = \max_{m_i \in T} sim(m_i, m).

For the thread T that maximizes sim(m, T): if sim(m, T) > t_sim, where t_sim is a threshold (0.7 in our experiments) empirically estimated from training data, add m to T; else, start a new cluster containing only m. The time complexity of this algorithm is O(n²), which is tractable for problems of moderate size.
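The expansion and clustering steps of Sections 3.2.2 and 3.3 can be sketched compactly as follows. The message representation (a term-count dictionary plus a timestamp) and the way contexts are passed in are our assumptions; the densities, mixing weights, and threshold follow the values quoted above.

```python
import math
from collections import Counter

def gaussian(d, sigma, mu=0.0):
    """Gaussian density N(mu, sigma^2) evaluated at time difference d."""
    return math.exp(-((d - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def expand(msg, contexts, sigmas, lam=0.45, weights=(0.6, 0.3, 0.1)):
    """Expanded term counts c(w, m') for one message.
    msg: dict with 'time' and 'counts' (a Counter of terms).
    contexts: dict with keys 'conv', 'author', 'temporal', each a list of messages.
    sigmas: per-context standard deviations estimated from training data."""
    expanded = Counter({w: lam * c for w, c in msg['counts'].items()})
    for name, lam_k in zip(('conv', 'author', 'temporal'), weights):
        for other in contexts[name]:
            p = gaussian(other['time'] - msg['time'], sigmas[name])
            for w, c in other['counts'].items():
                expanded[w] += (1 - lam) * lam_k * p * c
    return expanded

def cosine(c1, c2):
    dot = sum(c1[w] * c2.get(w, 0.0) for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def single_pass_cluster(expanded_msgs, t_sim=0.7):
    """Single-pass clustering over expanded messages, processed in order."""
    threads = []
    for m in expanded_msgs:
        best, best_sim = None, -1.0
        for t in threads:
            s = max(cosine(m, other) for other in t)   # sim(m, T)
            if s > best_sim:
                best, best_sim = t, s
        if best is not None and best_sim > t_sim:
            best.append(m)
        else:
            threads.append([m])
    return threads
```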
The cosine is used to measure similarity: sim(mi , mj ) = w c(w, mi )c(w, mj ) mi mj The collection used in the experiments consists of real text streams produced in Internet Relay Chat, created by (Elsner et al., 2008) and annotated independently by six annotators. As an upper (human) baseline for each of the three measures reported below, we report the average agreement between all pairs of annotators (i.e., treating one annotator as truth and another as a "system"). For our experiment results, we report the average across all annotators of the agreement between our system and each annotator. The test collection also contains both a development set and an evaluation set. We used the development set to approximate the normal densities used in our context models and the evaluation set to obtain the results reported below. Some statistics for the 800 annotated messages in the chat transcript of the evaluation collection are shown in Table 1. As that table shows, the average number of active conversation at a given time is 2.75, which makes thread detection a non-trivial task. 4.1 Evaluation Measures Single-pass clustering is then performed: treat the first message as a single-message cluster T ; for each remaining message m compute T : sim(m, T ) = maxmi T sim(mi , m) For the thread T that maximizes sim(m, T ), if sim(m, T ) > tsim , where tsim is a threshold (0.7 in We conduct comparisons using three commonly used evaluation measures for the thread detection task. As a measure of the systems ability to group related messages we report the F -measure (Shen et al., 2006): F = i ni maxj (F (i, j)) n 205 where i is a ground-truth conversation with length ni , and n is the length of entire transcript. F (i, j) is the harmonic mean of recall (fraction of the messages in the i also present in j) and precision (fraction of messages in j also present in i), and F is a weighted sum over all ground-truth conversations (i.e., F is microaveraged). Two other evaluation measures are "one-to-one accuracy" and "local agreement" (Elsner et al., 2008). "One-to-one accuracy" measures how well we extract whole conversations intact (e.g., as might be required for summarization). It is computed by finding the max-weight bipartite matching between the set of detected threads and the set of real threads, where weight is defined in terms of percentage overlaps for each ground truth and detected thread pair. Some applications (e.g., real-time monitoring) may not require that we look at entire conversations ar once; in this case a "local agreement" measure might make more sense. "loc3 " between system and human annotations as the average (over all possible sets of three consecutive messages) of whether those 3 consecutive messages are assigned consistently by the ground truth and the system. For example, if both the ground truth and the system cluster the first and third messages together and place the second message in a different cluster, then agreement would be recorded. 4.2 Methods Used in Comparison Figure 3: F measure. The dotted line represents interannotator agreement. We compare with the following methods: Elsner et al. 2008 (best previously known technique): Message similarity is computed with lexical and discourse features, but without document expansion. Blocks of k: Every consecutive group of k messages is a conversation. Pause of k: Every pause of k seconds or more separate two conversations. Speaker: Each speaker's messages are treated as a single conversation. 
All different: Each utterance is a separate thread. All same: The entire transcript is one conversation. 4.3 Results Figure 3 compares the effectiveness of different schemes in terms of the F measure. We show results from the best baseline, Elsner and our technique (which we call the Context model). The average F between human annotators is shown with the dotted line at 0.55; we would expect this to be an upper bound for any model. Our method substantially outperforms the other methods, with a 24% improvement over Elsner and 48% improvement over the best baseline (speaker). Viewed another way, our system achieves 98% of human performance, while Elsner and the best baseline achieve 79% and 66% of that bound, respectively. From this, we can conclude that our Context model is quite effective at clustering messages from same conversation together. To illustrate the impact of conversation length, we binned the lengths of ground-truth conversations from a single assessor into bins of size 5 (i.e., 3­7 messages, 8­12 messages, . . .; there were no ground truth bins of size 1 or 2). Figure 4 plots the approximated microaveraged F at the center value of each bin (i.e., the F for each ground truth cluster, scaled by the number of messages in the cluster). These fine-grained values provide insight into the contribution of conversations of different sizes to the overall microaveraged F . The Context model performs well for every conversation length, but particularly so for conversations containing 35 or more messages as shown by the widened gap in that region. Long conversations usually have richer social and temporal contexts for each message. The context model can benefit more from drawing evidences from these sources and using them to expand the message, thus makes it possible to group messages of the same 206 Figure 4: Dependence of F on ground-truth conversation size, in number of messages. Figure 6: Local-3 measure. The dotted line represents inter-annotator agreement. tions. The difference between the best baseline and maximum upper bound is small, implying limited room for potential improvement by any non-baseline techniques. Our result again compares favorably with the previously reported result and the best baseline, although with a smaller margin of 20% over the best baseline and 3% over Elsner as a result of the relatively high baseline for this measure. Figure 5: One-to-one measure. The dotted line represents inter-annotator agreement. 5 Conclusion and Future Work conversation together. The other two methods that ignore contextual properties do not do well in comparison. To measure how well we extract whole conversations intact, Figure 5 shows the results in terms of the one-to-one measure, where each real conversation is matched up with a distinct detected conversation thread. It is computed by max-weight bipartite matching such that the total message overlap is maximized between the sets of detected threads and real threads. The average by this measure between human annotators is 0.53. In this case, the proposed context model achieves an 14% increase over Elsner and 32% increase over the best baseline, and it is within 88% of human performance. This fairly clearly indicates that our Context model can disentangle interleaved conversations relatively well. 
Finally, Figure 6 presents the results for "local-3" to evaluate the system's ability to do local annota- We have presented an approach that exploits contextual properties to probabilistically expand each message to provide a more accurate message representation for dynamic conversations. It is a general approach and can be applied to the representation of non-chat data that exhibits temporal and social correlations as well. For conversation disentanglement, it outperforms the best previously known technique. Our work raises three important questions: (1) to what extent is the single test collection that we have used representative of the broad range of "text chat" applications?, (2) to what extent do the measures we have reported correlate to effective performance of downstream tasks such as summarization or automated response?, and (3) can we re-conceptualize the formalized problem in a way that would result in greater inter-annotator agreement, and hence provide scope for further refinements in our technique. These problems will be the focus of our future work. 207 References Micha Elsner and Eugene Charniak. 2008. You talking to me? A Corpus and Algorithm for Conversation Disentanglement. In ACL 2008: Proceedings of the 46th Annual Meeting on Association for Computational Linguistics, pages 834-842, Columbus, OH, USA. Association for Computational Linguistics. Dou Shen, Qiang Yang, Jian-Tao Sun, and Zheng Chen. 2006. Thread Detection in Dynamic Text Message Streams. In SIGIR 2006: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 35-42, Seattle, WA, USA. Association for Computing Machinery. Yi-Chia Wang, Mahesh Joshi, William Cohen, and Carolyn Rose. 2008. Recovering Implicit Thread Structure in Newsgroup Style Conversations. In ICWSM 2008: Proceedings of the 2nd International Conference on Weblogs and Social Media, pages 152-160, Seattle, WA, USA. Association for the Advancement of Artificial Intelligence. Tao Tao, Xuanhui Wang, Qiaozhu Mei, and ChengXiang Zhai. 2006. Language Model Information Retrieval with Document Expansion. In HLT-NAACL 2006: Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 407-414, New York, NY, USA. Association for Computational Linguistics. Oren Kurland and Lillian Lee. 2004. Corpus Structure, Language Models, and AdHoc Information Retrieval. In SIGIR 2004: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 194-201, Sheffield, UK. Association for Computing Machinery. Xiaoyong Liu and W Croft. 2004. Cluster-based Retrieval Using Language Models. In SIGIR 2004: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 186-193, Sheffield, UK. Association for Computing Machinery. Amit Singhal and Fernando Pereira. 1999. Document Expansion for Speech Retrieval. In SIGIR 1999: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 34-41, Berkeley, CA, USA. Association for Computing Machinery. Xiang Ji and Hongyuan Zha 2003. Domain-Independent Text Segmentation using Anisotropic Diffusion and Dynamic Programming. In SIGIR 2003: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 322-329, Toronto, Canada. Association for Computing Machinery. 
Michel Galley, Kathleen McKeown, Eric Lussier, and Hongyan Jing. 2003. Discourse Segmentation of Multi-Party Conversation. In ACL 2003: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 562-569, Sapporo, Japan. Association for Computational Linguistics. Jacob Eisenstein and Regina Barzilay. 2008. Bayesian Unsupervised Topic Segmentation. In EMNLP 2008: Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 334343, Honolulu, Hawaii, USA. Association for Computational Linguistics. Igor Malioutov and Regina Barzilay 2006. MinimumCut Model for Spoken Lecture Segmentation. In ACL 2006: Proceedings of the 44rd Annual Meeting of the Association for Computational Linguistics, pages 2532, Sydney, Australia. Association for Computational Linguistics. Igor Malioutov, Alex Park, Regina Barzilay, and James Glass. 2007. Making Sense of Sound: Unsupervised Topic Segmentation over Acoustic Input. In ACL 2007: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 504511, Prague, Czech Republic. Association for Computational Linguistics. Jen-Yuan Yeh and Aaron Harnly. 2006. Email Thread Reassembly Using Similarity Matching. In CEAS 2006: The 3rd Conference on Email and Anti-Spam, pages 64-71, Mountain View, CA, USA. Jacki ONeill and David Martin. 2003. Text Chat in Action. In ACM SIGGROUP 2003: Proceedings of the 2003 International ACM SIGGROUP Conference on Supporting Group Work, pages 40-49, New York, NY, USA. ACM Press. Gerard Salton. 1989. Automatic Text Processing: the Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989. Gina-Anne Levow, Douglas Oard, and Philip Resnik. 2005. Dictionary-based techniques for cross-language information retrieval. In Information Processing and Management Special Issue: Cross-Language Information Retrieval, 41(3): 523-547. 208 Unsupervised Morphological Segmentation with Log-Linear Models Hoifung Poon Dept. of Computer Sci. & Eng. University of Washington Seattle, WA 98195 hoifung@cs.washington.edu Colin Cherry Microsoft Research Redmond, WA 98052 colinc@microsoft.com Kristina Toutanova Microsoft Research Redmond, WA 98052 kristout@microsoft.com Abstract Morphological segmentation breaks words into morphemes (the basic semantic units). It is a key component for natural language processing systems. Unsupervised morphological segmentation is attractive, because in every language there are virtually unlimited supplies of text, but very few labeled resources. However, most existing model-based systems for unsupervised morphological segmentation use directed generative models, making it difficult to leverage arbitrary overlapping features that are potentially helpful to learning. In this paper, we present the first log-linear model for unsupervised morphological segmentation. Our model uses overlapping features such as morphemes and their contexts, and incorporates exponential priors inspired by the minimum description length (MDL) principle. We present efficient algorithms for learning and inference by combining contrastive estimation with sampling. Our system, based on monolingual features only, outperforms a state-of-the-art system by a large margin, even when the latter uses bilingual information such as phrasal alignment and phonetic correspondence. On the Arabic Penn Treebank, our system reduces F1 error by 11% compared to Morfessor. 
1 Introduction The goal of morphological segmentation is to segment words into morphemes, the basic syntactic/semantic units. This is a key subtask in many This research was conducted during the author's internship at Microsoft Research. NLP applications, including machine translation, speech recognition and question answering. Past approaches include rule-based morphological analyzers (Buckwalter, 2004) and supervised learning (Habash and Rambow, 2005). While successful, these require deep language expertise and a long and laborious process in system building or labeling. Unsupervised approaches are attractive due to the the availability of large quantities of unlabeled text, and unsupervised morphological segmentation has been extensively studied for a number of languages (Brent et al., 1995; Goldsmith, 2001; Dasgupta and Ng, 2007; Creutz and Lagus, 2007). The lack of supervised labels makes it even more important to leverage rich features and global dependencies. However, existing systems use directed generative models (Creutz and Lagus, 2007; Snyder and Barzilay, 2008b), making it difficult to extend them with arbitrary overlapping dependencies that are potentially helpful to segmentation. In this paper, we present the first log-linear model for unsupervised morphological segmentation. Our model incorporates simple priors inspired by the minimum description length (MDL) principle, as well as overlapping features such as morphemes and their contexts (e.g., in Arabic, the string Al is likely a morpheme, as is any string between Al and a word boundary). We develop efficient learning and inference algorithms using a novel combination of two ideas from previous work on unsupervised learning with log-linear models: contrastive estimation (Smith and Eisner, 2005) and sampling (Poon and Domingos, 2008). We focus on inflectional morphology and test our Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 209­217, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics 209 approach on datasets in Arabic and Hebrew. Our system, using monolingual features only, outperforms Snyder & Barzilay (2008b) by a large margin, even when their system uses bilingual information such as phrasal alignment and phonetic correspondence. On the Arabic Penn Treebank, our system reduces F1 error by 11% compared to Morfessor Categories-MAP (Creutz and Lagus, 2007). Our system can be readily applied to supervised and semi-supervised learning. Using a fraction of the labeled data, it already outperforms Snyder & Barzilay's supervised results (2008a), which further demonstrates the benefit of using a log-linear model. 2 Related Work There is a large body of work on the unsupervised learning of morphology. In addition to morphological segmentation, there has been work on unsupervised morpheme analysis, where one needs to determine features of word forms (Kurimo et al., 2007) or identify words with the same lemma by modeling stem changes (Schone and Jurafsky, 2001; Goldsmith, 2001). However, we focus our review specifically on morphological segmentation. In the absence of labels, unsupervised learning must incorporate a strong learning bias that reflects prior knowledge about the task. In morphological segmentation, an often-used bias is the minimum description length (MDL) principle, which favors compact representations of the lexicon and corpus (Brent et al., 1995; Goldsmith, 2001; Creutz and Lagus, 2007). 
Other approaches use statistics on morpheme context, such as conditional entropy between adjacent n-grams, to identify morpheme candidates (Harris, 1955; Keshava and Pitler, 2006). In this paper, we incorporate both intuitions into a simple yet powerful model, and show that each contributes significantly to performance. Unsupervised morphological segmentation systems also differ from the engineering perspective. Some adopt a pipeline approach (Schone and Jurafsky, 2001; Dasgupta and Ng, 2007; Demberg, 2007), which works by first extracting candidate affixes and stems, and then segmenting the words based on the candidates. Others model segmentation using a joint probabilistic distribution (Goldwater et al., 2006; Creutz and Lagus, 2007; Snyder and 210 Barzilay, 2008b); they learn the model parameters from unlabeled data and produce the most probable segmentation as the final output. The latter approach is arguably more appealing from the modeling standpoint and avoids error propagation along the pipeline. However, most existing systems use directed generative models; Creutz & Lagus (2007) used an HMM, while Goldwater et al. (2006) and Snyder & Barzilay (2008b) used Bayesian models based on Pitman-Yor or Dirichlet processes. These models are difficult to extend with arbitrary overlapping features that can help improve accuracy. In this work we incorporate novel overlapping contextual features and show that they greatly improve performance. Non-overlapping contextual features previously have been used in directed generative models (in the form of Markov models) for unsupervised morphological segmentation (Creutz and Lagus, 2007) or word segmentation (Goldwater et al., 2007). In terms of feature sets, our model is most closely related to the constituent-context model proposed by Klein and Manning (2001) for grammar induction. If we exclude the priors, our model can also be seen as a semi-Markov conditional random field (CRF) model (Sarawagi and Cohen, 2004). Semi-Markov CRFs previously have been used for supervised word segmentation (Andrew, 2006), but not for unsupervised morphological segmentation. Unsupervised learning with log-linear models has received little attention in the past. Two notable exceptions are Smith & Eisner (2005) for POS tagging, and Poon & Domingos (2008) for coreference resolution. Learning with log-linear models requires computing the normalization constant (a.k.a. the partition function) Z. This is already challenging in supervised learning. In unsupervised learning, the difficulty is further compounded by the absence of supervised labels. Smith & Eisner (2005) proposed contrastive estimation, which uses a small neighborhood to compute Z. The neighborhood is carefully designed so that it not only makes computation easier but also offers sufficient contrastive information to aid unsupervised learning. Poon & Domingos (2008), on the other hand, used sampling to approximate Z.1 In this work, we benefit from both techniques: contrastive estimation creates a manageable, 1 Rosenfeld (1997) also did this for language modeling. wvlAvwn (##__##) w (##__vl) vlAv (#w__wn) wn (Av__##) Figure 1: The morpheme and context (in parentheses) features for the segmented word w-vlAv-wn. informative Z, while sampling enables the use of powerful global features. 3 Log-Linear Model for Unsupervised Morphological Segmentation Central to our approach is a log-linear model that defines the joint probability distribution for a corpus (i.e., the words) and a segmentation on the corpus. 
The core of this model is a morpheme-context model, with one feature for each morpheme,2 and one feature for each morpheme context. We represent contexts using the n-grams before and after the morpheme, for some constant n. To illustrate this, a segmented Arabic corpus is shown below along with its features, assuming we are tracking bigram contexts. The segmentation is indicated with hyphens, while the hash symbol (#) represents the word boundary. Segmented Corpus hnAk w-vlAv-wn bn-w Al-ywm Al-jmAEp Morpheme Feature:Value hnAk:1 w:2 vlAv:1 wn:1 bn:1 Al:2 ywm:1 jmAEp:1 hnAk:1 wvlAvwn:1 bnw:1 Alywm:1 AljmAEp:1 Bigram Context Feature:Value ## vl:1 #w wn:1 Av ##:1 ## w#:1 bn ##:1 ## yw:1 Al ##:2 ## jm:1 ## ##:5 Furthermore, the corresponding features for the segmented word w-vlAv-wn are shown in Figure 1. Each feature is associated with a weight, which correlates with the likelihood that the corresponding morpheme or context marks a valid morphological segment. Such overlapping features allow us to capture rich segmentation regularities. For example, given the Arabic word Alywm, to derive its correct segmentation Al-ywm, it helps to know that Al and ywm are likely morphemes whereas Aly or lyw are 2 The word as a whole is also treated as a morpheme in itself. not; it also helps to know that Al ## or ## yw are likely morpheme contexts whereas ly ## or ## wm are not. Ablation tests verify the importance of these overlapping features (see Section 7.2). Our morpheme-context model is inspired by the constituent-context model (CCM) proposed by Klein and Manning (2001) for grammar induction. The morphological segmentation of a word can be viewed as a flat tree, where the root node corresponds to the word and the leaves correspond to morphemes (see Figure 1). The CCM uses unigrams for context features. For this task, however, we found that bigrams and trigrams lead to much better accuracy. We use trigrams in our full model. For learning, one can either view the corpus as a collection of word types (unique words) or tokens (word occurrences). Some systems (e.g., Morfessor) use token frequency for parameter estimation. Our system, however, performs much better using word types. This has also been observed for other morphological learners (Goldwater et al., 2006). Thus we use types in learning and inference, and effectively enforce the constraint that words can have only one segmentation per type. Evaluation is still based on tokens to reflect the performance in real applications. In addition to the features of the morphemecontext model, we incorporate two priors which capture additional intuitions about morphological segmentations. First, we observe that the number of distinct morphemes used to segment a corpus should be small. This is achieved when the same morphemes are re-used across many different words. Our model incorporates this intuition by imposing a lexicon prior: an exponential prior with negative weight on the length of the morpheme lexicon. We define the lexicon to be the set of unique morphemes identified by a complete segmentation of the corpus, and the lexicon length to be the total number of characters in the lexicon. In this way, we can simultaneously emphasize that a lexicon should contain few unique morphemes, and that those morphemes should be short. However, the lexicon prior alone incorrectly favors the trivial segmentation that shatters each word into characters, which results in the smallest lexicon possible (single characters). 
Therefore, we also impose a corpus prior: an exponential prior on the number of morphemes used to segment each word in the corpus, which penalizes over-segmentation. We notice that longer words tend to have more morphemes. Therefore, each word's contribution to this prior is normalized by the word's length in characters (e.g., the segmented word w-vlAv-wn contributes 3/7 to the total corpus size). Notice that it is straightforward to incorporate such a prior in a log-linear model, but much more challenging to do so in a directed generative model. These two priors are inspired by the minimum description length (MDL) principle; the lexicon prior favors fewer morpheme types, whereas the corpus prior favors fewer morpheme tokens. They are vital to the success of our model, providing it with the initial inductive bias.

We also notice that often a word is decomposed into a stem and some prefixes and suffixes. This is particularly true for languages with predominantly inflectional morphology, such as Arabic, Hebrew, and English. Thus our model uses separate lexicons for prefixes, stems, and suffixes. This results in a small but non-negligible accuracy gain in our experiments. We require that a stem contain at least two characters and no fewer characters than any affixes in the same word. (In a segmentation where several morphemes have the maximum length, any of them can be identified as the stem, each resulting in a distinct segmentation.) In a given word, when a morpheme is identified as the stem, any preceding morpheme is identified as a prefix, whereas any following morpheme is identified as a suffix. The sample segmented corpus mentioned earlier induces the following lexicons:

Prefix: w, Al
Stem: hnAk, vlAv, bn, ywm, jmAEp
Suffix: wn, w

Before presenting our formal model, we first introduce some notation. Let W be a corpus (i.e., a set of words), and S be a segmentation that breaks each word in W into prefixes, a stem, and suffixes. Let σ be a string (character sequence). Each occurrence of σ will be in the form of ψ1 σ ψ2, where ψ1, ψ2 are the adjacent character n-grams, and c = (ψ1, ψ2) is the context of σ in this occurrence. Thus a segmentation can be viewed as a set of morpheme strings and their contexts. For a string x, L(x) denotes the number of characters in x; for a word w, M_S(w) denotes the number of morphemes in w given the segmentation S; Pref(W, S), Stem(W, S), Suff(W, S) denote the lexicons of prefixes, stems, and suffixes induced by S for W. Then, our model defines a joint probability distribution over a restricted set of W and S:

P(W, S) = u_λ(W, S) / Z

where

u_λ(W, S) = exp( Σ_σ λ_σ f_σ(S) + Σ_c λ_c f_c(S)
                 + α · Σ_{σ ∈ Pref(W,S)} L(σ) + α · Σ_{σ ∈ Stem(W,S)} L(σ) + α · Σ_{σ ∈ Suff(W,S)} L(σ)
                 + β · Σ_{w ∈ W} M_S(w) / L(w) )

Here, f_σ(S) and f_c(S) are respectively the occurrence counts of morphemes and contexts under S, and λ = (λ_σ, λ_c : σ, c) are their feature weights. α, β are the weights for the priors. Z is the normalization constant, which sums over a set of corpora and segmentations. In the next section, we will define this set for our model and show how to efficiently perform learning and inference.
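Putting the pieces together, here is a hedged sketch of the unnormalized score u_λ(W, S). It reuses the hypothetical morpheme_context_features helper from the earlier example and, for brevity, collapses the three affix/stem lexicons into one, so it approximates rather than reproduces the full model; the default prior weights are the values used later in the experiments.

```python
import math

def unnormalized_score(segmented_corpus, lam_morph, lam_ctx,
                       alpha=-1.0, beta=-20.0, n=2):
    """u_lambda(W, S): morpheme and context feature weights plus the lexicon
    prior (weight alpha) and the corpus prior (weight beta)."""
    total, lexicon = 0.0, set()
    for segments in segmented_corpus:
        for morph, ctx in morpheme_context_features(segments, n):
            total += lam_morph.get(morph, 0.0) + lam_ctx.get(ctx, 0.0)
        lexicon.update(segments)                                # morphemes this word contributes
        total += beta * len(segments) / len("".join(segments))  # corpus prior term M_S(w)/L(w)
    total += alpha * sum(len(m) for m in lexicon)               # lexicon prior term
    return math.exp(total)
```

Normalizing this score requires the constant Z, whose definition over a contrastive neighborhood is the subject of the next section.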
4 Unsupervised Learning

As mentioned in Smith & Eisner (2005), learning with probabilistic models can be viewed as moving probability mass to the observed data. The question is from where to take this mass. For log-linear models, the answer amounts to defining the set that Z sums over. We use contrastive estimation and define the set to be a neighborhood of the observed data. The instances in the neighborhood can be viewed as pseudo-negative examples, and learning seeks to discriminate them from the observed instances.

Formally, let W* be the observed corpus, and let N(·) be a function that maps a string to a set of strings; let N(W*) denote the set of all corpora that can be derived from W* by replacing every word w ∈ W* with one in N(w). Then,

Z = Σ_{W ∈ N(W*)} Σ_S u(W, S)

Unsupervised learning maximizes the log-likelihood of observing W*:

L_λ(W*) = log Σ_S P(W*, S)

We use gradient descent for this optimization; the partial derivatives for feature weights are

∂L_λ(W*) / ∂λ_i = E_{S|W*}[f_i] − E_{S,W}[f_i]

where i is either a string σ or a context c. The first expected count ranges over all possible segmentations while the words are fixed to those observed in W*. For the second expected count, the words also range over the neighborhood.

Smith & Eisner (2005) considered various neighborhoods for unsupervised POS tagging, and showed that the best neighborhoods are TRANS1 (transposing any pair of adjacent words) and DELORTRANS1 (deleting any word or transposing any pair of adjacent words). We can obtain their counterparts for morphological segmentation by simply replacing "words" with "characters". As mentioned earlier, the instances in the neighborhood serve as pseudo-negative examples from which probability mass can be taken away. In this regard, DELORTRANS1 is suitable for POS tagging since deleting a word often results in an ungrammatical sentence. However, in morphology, a word less a character is often a legitimate word too. For example, deleting l from the Hebrew word lyhwh (to the lord) results in yhwh (the lord). Thus DELORTRANS1 forces legal words to compete against each other for probability mass, which seems like a misguided objective. Therefore, in our model we use TRANS1. It is suited for our task because transposing a pair of adjacent characters usually results in a non-word.

To combat overfitting in learning, we impose a Gaussian prior (L2 regularization) on all weights.

5 Supervised Learning

Our learning algorithm can be readily applied to supervised or semi-supervised learning. Suppose that gold segmentation is available for some words, denoted as S*. If S* contains gold segmentations for all words in W*, we are doing supervised learning; otherwise, learning is semi-supervised. Training now maximizes L_λ(W*, S*); the partial derivatives become

∂L_λ(W*, S*) / ∂λ_i = E_{S|W*,S*}[f_i] − E_{S,W}[f_i]

The only difference in comparison with unsupervised learning is that we fix the known segmentation when computing the first expected counts. In Section 7.3, we show that when labels are available, our model also learns much more effectively than a directed graphical model.

6 Inference

In Smith & Eisner (2005), the objects (sentences) are independent from each other, and exact inference is tractable. In our model, however, the lexicon prior renders all objects (words) interdependent in terms of segmentation decisions. Consider the simple corpus with just two words: Alrb, lAlrb. If lAlrb is segmented into l-Al-rb, Alrb can be segmented into Al-rb without paying an additional penalty imposed by the lexicon prior. If, however, lAlrb remains a single morpheme, and we still segment Alrb into Al-rb, then we introduce two new morphemes into the lexicons, and we will be penalized by the lexicon prior accordingly. As a result, we must segment the whole corpus jointly, making exact inference intractable. Therefore, we resort to approximate inference.

To compute E_{S|W*}[f_i], we use Gibbs sampling. To derive a sample, the procedure goes through each word and samples the next segmentation conditioned on the segmentation of all other words. With m samples S_1, ..., S_m, the expected count can be approximated as

E_{S|W*}[f_i] ≈ (1/m) Σ_j f_i(S_j)

There are 2^(n−1) ways to segment a word of n characters. To sample a new segmentation for a particular word, we need to compute the conditional probability for each of these segmentations. We currently do this by explicit enumeration. (These segmentations could be enumerated implicitly using the dynamic programming framework employed by semi-Markov CRFs (Sarawagi and Cohen, 2004); however, in such a setting, our lexicon prior would likely need to be approximated. We intend to investigate this in future work.) When n is large, this is very expensive. However, we observe that the maximum number of morphemes that a word contains is usually a small constant for many languages; in the Arabic Penn Treebank, the longest word contains 14 characters, but the maximum number of morphemes in a word is only 5. Therefore, we impose the constraint that a word can be segmented into no more than k morphemes, where k is a language-specific constant. We can determine k from prior knowledge or use a development set. This constraint substantially reduces the number of segmentation candidates to consider; with k = 5, it reduces the number of segmentations to consider by almost 90% for a word of 14 characters.

E_{S,W}[f_i] can be computed by Gibbs sampling in the same way, except that in each step we also sample the next word from the neighborhood, in addition to the next segmentation. To compute the most probable segmentation, we use deterministic annealing. It works just like a sampling algorithm except that the weights are divided by a temperature, which starts with a large value and gradually drops to a value close to zero. To make burn-in faster, when computing the expected counts, we initialize the sampler with the most probable segmentation output by annealing.
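To make the restricted enumeration concrete, here is a small sketch (our own helper, not code from the paper) that generates all segmentations of a word with at most k morphemes by choosing internal cut points; the counts for a 14-character word illustrate the reduction quoted above.

```python
from itertools import combinations

def segmentations(word, k=5):
    """All splits of `word` into at most k morphemes: choose up to k-1 of the
    len(word)-1 internal boundary positions."""
    cuts = range(1, len(word))
    for m in range(min(k, len(word))):
        for chosen in combinations(cuts, m):
            bounds = (0,) + chosen + (len(word),)
            yield [word[i:j] for i, j in zip(bounds, bounds[1:])]

print(len(list(segmentations("x" * 14, k=14))))  # 8192 = 2^13 candidates, unconstrained
print(len(list(segmentations("x" * 14, k=5))))   # 1093 candidates, roughly an 87% reduction
```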
7 Experiments

We evaluated our system on two datasets. Our main evaluation is on a multi-lingual dataset constructed by Snyder & Barzilay (2008a; 2008b). It consists of 6192 short parallel phrases in Hebrew, Arabic, Aramaic (a dialect of Arabic), and English. The parallel phrases were extracted from the Hebrew Bible and its translations via word alignment and postprocessing. For Arabic, the gold segmentation was obtained using a highly accurate Arabic morphological analyzer (Habash and Rambow, 2005); for Hebrew, from a Bible edition distributed by Westminster Hebrew Institute (Groves and Lowery, 2006). There is no gold segmentation for English and Aramaic. Like Snyder & Barzilay, we evaluate on the Arabic and Hebrew portions only; unlike their approach, our system does not use any bilingual information. We refer to this dataset as S&B. We also report our results on the Arabic Penn Treebank (ATB), which provides gold segmentations for an Arabic corpus with about 120,000 Arabic words. As in previous work, we report recall, precision, and F1 over segmentation points.

We used 500 phrases from the S&B dataset for feature development, and also tuned our model hyperparameters there. The weights for the lexicon and corpus priors were set to α = -1, β = -20. The feature weights were initialized to zero and were penalized by a Gaussian prior with σ² = 100. The learning rate was set to 0.02 for all experiments, except the full Arabic Penn Treebank, for which it was set to 0.005 (the ATB set is more than an order of magnitude larger and requires a smaller learning rate). We used 30 iterations for learning. In each iteration, 200 samples were collected to compute each of the two expected counts. The sampler was initialized by running annealing for 2000 samples, with the temperature dropping from 10 to 0.1 at 0.1 decrements. The most probable segmentation was obtained by running annealing for 10000 samples, using the same temperature schedule. We restricted the segmentation candidates to those with no greater than five segments in all experiments.

7.1 Unsupervised Segmentation on S&B

We followed the experimental set-up of Snyder & Barzilay (2008b) to enable a direct comparison. The dataset is split into a training set with 4/5 of the phrases, and a test set with the remaining 1/5. First, we carried out unsupervised learning on the training data, and computed the most probable segmentation for it. Then we fixed the learned weights and the segmentation for training, and computed the most probable segmentation for the test set, on which we evaluated. (With unsupervised learning, we can use the entire dataset for training since no labels are provided; however, this setup is necessary for S&B's system because they used bilingual information in training, which is not available at test time.)

Snyder & Barzilay (2008b) compared several versions of their systems, differing in how much bilingual information was used. Using monolingual information only, their system (S&B-MONO) trails the state-of-the-art system Morfessor; however, their best system (S&B-BEST), which uses bilingual information that includes phrasal alignment and phonetic correspondence between Arabic and Hebrew, outperforms Morfessor and achieves the state-of-the-art results on this dataset.

             ARABIC                   HEBREW
             Prec.  Rec.   F1         Prec.  Rec.   F1
S&B-MONO     53.0   78.5   63.2       55.8   64.4   59.8
S&B-BEST     67.8   77.3   72.2       64.9   62.9   63.9
FULL         76.0   80.2   78.1       67.6   66.1   66.9

Table 1: Comparison of segmentation results on the S&B dataset.

Table 1 compares our system with theirs. Our system outperforms both S&B-MONO and S&B-BEST by a large margin. For example, on Arabic, our system reduces F1 error by 21% compared to S&B-BEST, and by 40% compared to S&B-MONO. This suggests that the use of monolingual morpheme context, enabled by our log-linear model, is more helpful than their bilingual cues.

7.2 Ablation Tests

To evaluate the contributions of the major components in our model, we conducted seven ablation tests on the S&B dataset, each using a model that differed from our full model in one aspect. The first three tests evaluate the effect of priors, whereas the next three test the effect of context features. The last evaluates the impact of using separate lexicons for affixes and stems.
NO-PRIOR: The priors are not used.
NO-COR-PR: The corpus prior is not used.
NO-LEX-PR: The lexicon prior is not used.
NO-CONTEXT: Context features are not used.
UNIGRAM: Unigrams are used in context.
BIGRAM: Bigrams are used in context.
SG-LEXICON: A single lexicon is used, rather than three distinct ones for the affixes and stems.

             ARABIC                   HEBREW
             Prec.  Rec.   F1         Prec.  Rec.   F1
FULL         76.0   80.2   78.1       67.6   66.1   66.9
NO-PRIOR     24.6   89.3   38.6       34.0   89.9   49.4
NO-COR-PR    23.7   87.4   37.2       35.6   90.6   51.1
NO-LEX-PR    79.1   51.3   62.3       65.9   49.2   56.4
NO-CONTEXT   71.2   62.1   66.3       63.0   47.6   54.3
UNIGRAM      71.3   76.5   73.8       63.0   63.7   63.3
BIGRAM       73.1   78.4   75.7       69.5   66.1   67.8
SG-LEXICON   72.8   82.0   77.1       67.4   65.7   66.6

Table 2: Ablation test results on the S&B dataset.

Table 2 presents the ablation results in comparison with the results of the full model. When some or all priors are excluded, the F1 score drops substantially (over 10 points in all cases, and over 40 points in some). In particular, excluding the corpus prior, as in NO-PRIOR and NO-COR-PR, results in over-segmentation, as is evident from the high recalls and low precisions. When the corpus prior is enacted but not the lexicon priors (NO-LEX-PR), precision is much higher, but recall is low; the system now errs on under-segmentation because recurring strings are often not identified as morphemes.

A large accuracy drop (over 10 points in F1 score) also occurs when the context features are excluded (NO-CONTEXT), which underscores the importance of these overlapping features. We also notice that the NO-CONTEXT model is comparable to the S&B-MONO model; they use the same feature types, but different priors. The accuracies of the two systems are comparable, which suggests that we did not sacrifice accuracy by trading the more complex and restrictive Dirichlet process prior for exponential priors.

A priori, it is unclear whether using contexts larger than unigrams would help. While potentially beneficial, they also risk aggravating the data sparsity and making our model more prone to overfitting. For this problem, however, enlarging the context (using higher n-grams up to trigrams) helps substantially. For Arabic, the highest accuracy is attained by using trigrams, which reduces F1 error by 16% compared to unigrams; for Hebrew, by using bigrams, which reduces F1 error by 17%. Finally, it helps to use separate lexicons for affixes and stems, although the difference is small.

7.3 Supervised and Semi-Supervised Learning

To evaluate our system in the supervised and semi-supervised learning settings, we report the performance when various amounts of labeled data are made available during learning, and compare them to the results of Snyder & Barzilay (2008a). They reported results for supervised learning using monolingual features only (S&B-MONO-S), and for supervised bilingual learning with labels for both languages (S&B-BEST-S).

             %Lbl.   ARABIC                  HEBREW
                     Prec.  Rec.   F1        Prec.  Rec.   F1
S&B-MONO-S   100     73.2   92.4   81.7      71.4   79.1   75.1
S&B-BEST-S   200     77.8   92.3   84.4      76.8   79.2   78.0
FULL-S       25      84.9   85.5   85.2      78.7   73.3   75.9
FULL-S       50      88.2   86.8   87.5      82.8   74.6   78.4
FULL-S       75      89.6   86.4   87.9      83.1   77.3   80.1
FULL-S       100     91.7   88.5   90.0      83.0   78.9   80.9

Table 3: Comparison of segmentation results with supervised and semi-supervised learning on the S&B dataset.

On both languages, our system substantially outperforms both S&B-MONO-S and S&B-BEST-S. E.g., on Arabic, our system reduces F1 errors by 46% compared to S&B-MONO-S, and by 36% compared to S&B-BEST-S. Moreover, with only one-fourth of the labeled data, our system already outperforms S&B-MONO-S. This demonstrates that our log-linear model is better suited to take advantage of supervised labels.
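The relative error reductions quoted above can be reproduced from the Table 3 entries with a couple of lines of arithmetic (a sketch; the one-point discrepancy on the first figure presumably reflects rounding in the published F1 values):

```python
def f1_error_reduction(baseline_f1, new_f1):
    """Relative reduction in F1 error, where the error is 100 - F1."""
    return 100.0 * (new_f1 - baseline_f1) / (100.0 - baseline_f1)

# Arabic, fully supervised: FULL-S 90.0 vs. S&B-MONO-S 81.7 and S&B-BEST-S 84.4
print(round(f1_error_reduction(81.7, 90.0)))  # 45 -- the "46%" reduction vs. S&B-MONO-S
print(round(f1_error_reduction(84.4, 90.0)))  # 36 -- the "36%" reduction vs. S&B-BEST-S
```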
7.4 Arabic Penn Treebank

We also evaluated our system on the Arabic Penn Treebank (ATB). As is common in unsupervised learning, we trained and evaluated on the entire set. We compare our system with Morfessor (Creutz and Lagus, 2007). (We cannot compare with Snyder & Barzilay's system, as its strongest results require bilingual data, which is not available.) In addition, we compare with Morfessor Categories-MAP, which builds on Morfessor and conducts an additional greedy search specifically tailored to segmentation. We found that it performs much better than Morfessor on Arabic but worse on Hebrew. To test each system in a low-data setting, we also ran experiments on the set containing the first 7,000 words in ATB with at least two characters (ATB-7000). Table 4 shows the results.

                 ATB-7000                ATB
                 Prec.  Rec.   F1        Prec.  Rec.   F1
MORFESSOR-1.0    70.6   34.3   46.1      80.7   20.4   32.6
MORFESSOR-MAP    86.9   46.4   60.5      77.4   72.6   74.9
FULL             83.4   77.3   80.2      88.5   69.2   77.7

Table 4: Comparison of segmentation results on the Arabic Penn Treebank.

Morfessor performs rather poorly on ATB-7000. Morfessor Categories-MAP does much better, but its performance is dwarfed by our system, which further cuts F1 error by half. On the full ATB dataset, Morfessor performs even worse, whereas Morfessor Categories-MAP benefits from the larger dataset and achieves an F1 of 74.9. Still, our system substantially outperforms it, further reducing F1 error by 11%. (Note that the ATB and ATB-7000 experiments each measure accuracy on their entire training set. This difference in testing conditions explains why some full ATB results are lower than the ATB-7000 results.)

8 Conclusion

This paper introduces the first log-linear model for unsupervised morphological segmentation. It leverages overlapping features such as morphemes and their contexts, and enables easy extension to incorporate additional features and linguistic knowledge. For Arabic and Hebrew, it outperforms the state-of-the-art systems by a large margin. It can also be readily applied to supervised or semi-supervised learning when labeled data is available. Future directions include applying our model to other inflectional and agglutinative languages, modeling internal variations of morphemes, leveraging parallel data in multiple languages, and combining morphological segmentation with other NLP tasks, such as machine translation.

References

Galen Andrew. 2006. A hybrid Markov/semi-Markov conditional random field for sequence segmentation. In Conference on Empirical Methods in Natural Language Processing (EMNLP). Michael R. Brent, Sreerama K. Murthy, and Andrew Lundberg. 1995. Discovering morphemic suffixes: A case study in minimum description length induction. In Proceedings of the 15th Annual Conference of the Cognitive Science Society. Tim Buckwalter. 2004. Buckwalter Arabic morphological analyzer version 2.0. Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing, 4(1). Sajib Dasgupta and Vincent Ng. 2007. High-performance, language-independent morphological segmentation. In Proceedings of Human Language Technology (NAACL). Vera Demberg. 2007. A language-independent unsupervised model for morphological segmentation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic. John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153-198. Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems 18. Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2007.
Distributional cues to word segmentation: Context is important. In Proceedings of the 31st Boston University Conference on Language Development. Alan Groves and Kirk Lowery, editors. 2006. The Westminster Hebrew Bible Morphology Database. Westminster Hebrew Institute, Philadelphia, PA, USA. Nizar Habash and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Zellig S. Harris. 1955. From phoneme to morpheme. Language, 31(2):190­222. Samarth Keshava and Emily Pitler. 2006. A simple, intuitive approach to morpheme induction. In Proceedings of 2nd Pascal Challenges Workshop, Venice, Italy. Dan Klein and Christopher D. Manning. 2001. Natural language grammar induction using a constituentcontext model. In Advances in Neural Information Processing Systems 14. Mikko Kurimo, Mathias Creutz, and Ville Turunen. 2007. Overview of Morpho Challenge in CLEF 2007. In Working Notes of the CLEF 2007 Workshop. Hoifung Poon and Pedro Domingos. 2008. Joint unsupervised coreference resolution with markov logic. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 649­ 658, Honolulu, HI. ACL. Ronald Rosenfeld. 1997. A whole sentence maximum entropy language model. In IEEE workshop on Automatic Speech Recognition and Understanding. Sunita Sarawagi and William Cohen. 2004. Semimarkov conditional random fields for information extraction. In Proceedings of the Twenty First International Conference on Machine Learning. Patrick Schone and Daniel Jurafsky. 2001. Knowlegefree induction of inflectional morphologies. In Proceedings of Human Language Technology (NAACL). Noah A. Smith and Jason Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Benjamin Snyder and Regina Barzilay. 2008a. Crosslingual propagation for morphological analysis. In Proceedings of the Twenty Third National Conference on Artificial Intelligence. Benjamin Snyder and Regina Barzilay. 2008b. Unsupervised multilingual learning for morphological segmentation. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics. 217 11,001 New Features for Statistical Machine Translation David Chiang and Kevin Knight USC Information Sciences Institute 4676 Admiralty Way, Suite 1001 Marina del Rey, CA 90292 USA Wei Wang Language Weaver, Inc. 4640 Admiralty Way, Suite 1210 Marina del Rey, CA 90292 USA Abstract We use the Margin Infused Relaxed Algorithm of Crammer et al. to add a large number of new features to two machine translation systems: the Hiero hierarchical phrasebased translation system and our syntax-based translation system. On a large-scale ChineseEnglish translation task, we obtain statistically significant improvements of +1.5 BLEU and +1.1 BLEU, respectively. We analyze the impact of the new features and the performance of the learning algorithm. 1 Introduction Many of the new features use syntactic information, and in particular depend on information that is available only inside a syntax-based translation model. Thus they widen the advantage that syntaxbased models have over other types of models. The models are trained using the Margin Infused Relaxed Algorithm or MIRA (Crammer et al., 2006) instead of the standard minimum-error-rate training or MERT algorithm (Och, 2003). 
Our results add to a growing body of evidence (Watanabe et al., 2007; Chiang et al., 2008) that MIRA is preferable to MERT across languages and systems, even for very large-scale tasks. What linguistic features can improve statistical machine translation (MT)? This is a fundamental question for the discipline, particularly as it pertains to improving the best systems we have. Further: · Do syntax-based translation systems have unique and effective levers to pull when designing new features? · Can large numbers of feature weights be learned efficiently and stably on modest amounts of data? In this paper, we address these questions by experimenting with a large number of new features. We add more than 250 features to improve a syntaxbased MT system--already the highest-scoring single system in the NIST 2008 Chinese-English common-data track--by +1.1 BLEU. We also add more than 10,000 features to Hiero (Chiang, 2005) and obtain a +1.5 BLEU improvement. This research was supported in part by DARPA contract HR0011-06-C-0022 under subcontract to BBN Technologies. 2 Related Work The work of Och et al (2004) is perhaps the bestknown study of new features and their impact on translation quality. However, it had a few shortcomings. First, it used the features for reranking n-best lists of translations, rather than for decoding or forest reranking (Huang, 2008). Second, it attempted to incorporate syntax by applying off-the-shelf part-ofspeech taggers and parsers to MT output, a task these tools were never designed for. By contrast, we incorporate features directly into hierarchical and syntaxbased decoders. A third difficulty with Och et al.'s study was that it used MERT, which is not an ideal vehicle for feature exploration because it is observed not to perform well with large feature sets. Others have introduced alternative discriminative training methods (Tillmann and Zhang, 2006; Liang et al., 2006; Turian et al., 2007; Blunsom et al., 2008; Macherey et al., 2008), in which a recurring challenge is scalability: to train many features, we need many train- Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 218­226, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics 218 ing examples, and to train discriminatively, we need to search through all possible translations of each training example. Another line of research (Watanabe et al., 2007; Chiang et al., 2008) tries to squeeze as many features as possible from a relatively small dataset. We follow this approach here. minimal rules. These larger rules have been shown to substantially improve translation accuracy (Galley et al., 2006; DeNeefe et al., 2007). We apply Good-Turing discounting to the transducer rule counts and obtain probability estimates: P(rule) = count(rule) count(LHS-root(rule)) 3 3.1 Systems Used Hiero Hiero (Chiang, 2005) is a hierarchical, string-tostring translation system. Its rules, which are extracted from unparsed, word-aligned parallel text, are synchronous CFG productions, for example: X X1 de X2 , X2 of X1 As the number of nonterminals is limited to two, the grammar is equivalent to an inversion transduction grammar (Wu, 1997). The baseline model includes 12 features whose weights are optimized using MERT. Two of the features are n-gram language models, which require intersecting the synchronous CFG with finite-state automata representing the language models. This grammar can be parsed efficiently using cube pruning (Chiang, 2007). 
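As an illustration of how a synchronous rule such as X -> <X1 de X2, X2 of X1> rewrites both sides at once, here is a toy sketch (our own code and made-up fillers, not material from the paper): integer tokens stand for the co-indexed nonterminals, and each is replaced by the source and target strings of the sub-derivation that fills it.

```python
def apply_rule(src_pattern, tgt_pattern, fillers):
    """Substitute co-indexed nonterminal fillers into both sides of a
    synchronous CFG rule; a real decoder builds the fillers bottom-up with
    CKY and cube pruning rather than being handed them."""
    src = [fillers[t][0] if isinstance(t, int) else t for t in src_pattern]
    tgt = [fillers[t][1] if isinstance(t, int) else t for t in tgt_pattern]
    return " ".join(src), " ".join(tgt)

# X -> <X1 de X2, X2 of X1>, with illustrative fillers for X1 and X2
print(apply_rule([1, "de", 2], [2, "of", 1],
                 {1: ("zhongguo", "china"), 2: ("jingji", "the economy")}))
# ('zhongguo de jingji', 'the economy of china')
```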
3.2 Syntax-based system Our syntax-based system transforms source Chinese strings into target English syntax trees. Following previous work in statistical MT (Brown et al., 1993), we envision a noisy-channel model in which a language model generates English, and then a translation model transforms English trees into Chinese. We represent the translation model as a tree transducer (Knight and Graehl, 2005). It is obtained from bilingual text that has been word-aligned and whose English side has been syntactically parsed. From this data, we use the the GHKM minimal-rule extraction algorithm of (Galley et al., 2004) to yield rules like: NP-C(x0 :NPB PP(IN(of x1 :NPB)) x1 de x0 Though this rule can be used in either direction, here we use it right-to-left (Chinese to English). We follow Galley et al. (2006) in allowing unaligned Chinese words to participate in multiple translation rules, and in collecting larger rules composed of 219 When we apply these probabilities to derive an English sentence e and a corresponding Chinese sentence c, we wind up with the joint probability P(e, c). The baseline model includes log P(e, c), the two n-gram language models log P(e), and other features for a total of 25. For example, there is a pair of features to punish rules that drop Chinese content words or introduce spurious English content words. All features are linearly combined and their weights are optimized using MERT. For efficient decoding with integrated n-gram language models, all transducer rules must be binarized into rules that contain at most two variables and can be incrementally scored by the language model (Zhang et al., 2006). Then we use a CKY-style parser (Yamada and Knight, 2002; Galley et al., 2006) with cube pruning to decode new sentences. We include two other techniques in our baseline. To get more general translation rules, we restructure our English training trees using expectationmaximization (Wang et al., 2007), and to get more specific translation rules, we relabel the trees with up to 4 specialized versions of each nonterminal symbol, again using expectation-maximization and the split/merge technique of Petrov et al. (2006). 3.3 MIRA training We incorporate all our new features into a linear model (Och and Ney, 2002) and train them using MIRA (Crammer et al., 2006), following previous work (Watanabe et al., 2007; Chiang et al., 2008). Let e stand for output strings or their derivations, and let h(e) stand for the feature vector for e. Initialize the feature weights w. Then, repeatedly: · Select a batch of input sentences f1 , . . . , fm and decode each fi to obtain a forest of translations. · For each i, select from the forest a set of hypothesis translations ei1 , . . . , ein , which are the 10-best translations according to each of: h(e) · w BLEU(e) + h(e) · w -BLEU(e) + h(e) · w · For each i, select an oracle translation: e = arg max (BLEU(e) + h(e) · w) Let hi j = e ) h(ei 4.1 (1) Target-side features String-to-tree MT offers some unique levers to pull, in terms of target-side features. Because the system outputs English trees, we can analyze output trees on the tuning set and design new features to encourage the decoder to produce more grammatical trees. Rule overlap features While individual rules observed in decoder output are often quite reasonable, two adjacent rules can create problems. For example, a rule that has a variable of type IN (preposition) needs another rule rooted with IN to fill the position. 
If the second rule supplies the wrong preposition, a bad translation results. The IN node here is an overlap point between rules. Considering that certain nonterminal symbols may be more reliable overlap points than others, we create a binary feature for each nonterminal. A rule like: IN(at) zai will have feature rule-root-IN set to 1 and all other rule-root features set to 0. Our rule root features range over the original (non-split) nonterminal set; we have 105 in total. Even though the rule root features are locally attached to individual rules--and therefore cause no additional problems for the decoder search--they are aimed at problematic rule/rule interactions. Bad single-level rewrites Sometimes the decoder uses questionable rules, for example: PP(x0 :VBN x1 :NP-C) x0 x1 This rule is learned from 62 cases in our training data, where the VBN is almost always the word given. However, the decoder misuses this rule with other VBNs. So we can add a feature that penalizes any rule in which a PP dominates a VBN and NP-C. The feature class bad-rewrite comprises penalties for the following configurations based on our analysis of the tuning set: PP VBN NP-C PP-BAR NP-C IN VP NP-C PP CONJP RB IN (2) - h(ei j ). (3) · For each ei j , compute the loss ij = BLEU(e ) - BLEU(ei j ) i m · Update w to the value of w that minimizes: 1 w -w 2 2 +C i=1 1 jn max ( ij - hi j · w ) (4) where C = 0.01. This minimization is performed by a variant of sequential minimal optimization (Platt, 1998). Following Chiang et al. (2008), we calculate the sentence BLEU scores in (1), (2), and (3) in the context of some previous 1-best translations. We run 20 of these learners in parallel, and when training is finished, the weight vectors from all iterations of all learners are averaged together. Since the interface between the trainer and the decoder is fairly simple--for each sentence, the decoder sends the trainer a forest, and the trainer returns a weight update--it is easy to use this algorithm with a variety of CKY-based decoders: here, we are using it in conjunction with both the Hiero decoder and our syntax-based decoder. 4 Features In this section, we describe the new features introduced on top of our baseline systems. Discount features Both of our systems calculate several features based on observed counts of rules in the training data. Though the syntax-based system uses Good-Turing discounting when computing the P(e, c) feature, we find, as noted above, that it uses quite a few one-count rules, suggesting that their probabilities have been overestimated. We can directly attack this problem by adding features counti that reward or punish rules seen i times, or features count[i, j] for rules seen between i and j times. 220 Node count features It is possible that the decoder creates English trees with too many or too few nodes of a particular syntactic category. For example, there may be an tendency to generate too many determiners or past-tense verbs. We therefore add a count feature for each of the 109 (non-split) English nonterminal symbols. For a rule like NPB(NNP(us) NNP(president) x0 :NNP) meiguo zongtong x0 the feature node-count-NPB gets value 1, nodecount-NNP gets value 2, and all others get 0. Insertion features Among the rules we extract from bilingual corpora are target-language insertion rules, which have a word on the English side, but no words on the source Chinese side. 
Sample syntaxbased insertion rules are: NPB(DT(the) x0 :NN) x0 S(x0 :NP-C VP(VBZ(is) x1 :VP-C)) x0 x1 We notice that our decoder, however, frequently fails to insert words like is and are, which often have no equivalent in the Chinese source. We also notice that the-insertion rules sometimes have a good effect, as in the translation "in the bloom of youth," but other times have a bad effect, as in "people seek areas of the conspiracy." Each time the decoder uses (or fails to use) an insertion rule, it incurs some risk. There is no guarantee that the interaction of the rule probabilities and the language model provides the best way to manage this risk. We therefore provide MIRA with a feature for each of the most common English words appearing in insertion rules, e.g., insert-the and insert-is. There are 35 such features. 4.2 Source-side features Soft syntactic constraints Neither of our systems uses source-side syntactic information; hence, both could potentially benefit from soft syntactic constraints as described by Marton and Resnik (2008). In brief, these features use the output of an independent syntactic parser on the source sentence, rewarding decoder constituents that match syntactic constituents and punishing decoder constituents that cross syntactic constituents. We use separatelytunable features for each syntactic category. Structural distortion features Both of our systems have rules with variables that generalize over possible fillers, but neither system's basic model conditions a rule application on the size of a filler, making it difficult to distinguish long-distance reorderings from short-distance reorderings. To remedy this problem, Chiang et al. (2008) introduce a structural distortion model, which we include in our experiment. Our syntax-based baseline includes the generative version of this model already. Word context During rule extraction, we retain word alignments from the training data in the extracted rules. (If a rule is observed with more than one set of word alignments, we keep only the most frequent one.) We then define, for each triple ( f, e, f+1 ), a feature that counts the number of times that f is aligned to e and f+1 occurs to the right of f ; and similarly for triples ( f, e, f-1 ) with f-1 occurring to the left of f . In order to limit the size of the model, we restrict words to be among the 100 most frequently occurring words from the training data; all other words are replaced with a token . These features are somewhat similar to features used by Watanabe et al. (2007), but more in the spirit of features used in the word sense disambiguation model introduced by Lee and Ng (2002) and incorporated as a submodel of a translation system by Chan et al. (2007); here, we are incorporating some of its features directly into the translation model. We now turn to features that make use of source-side context. Although these features capture dependencies that cross boundaries between rules, they are still local in the sense that no new states need to be added to the decoder. This is because the entire source sentence, being fixed, is always available to every feature. 221 5 Experiments For our experiments, we used a 260 million word Chinese/English bitext. We ran GIZA++ on the entire bitext to produce IBM Model 4 word alignments, and then the link deletion algorithm (Fossum et al., 2008) to yield better-quality alignments. 
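Returning to the word-context features of Section 4.2, the sketch below (our own helper names; the sentence-boundary padding and the application of the frequent-word restriction to all positions are our assumptions) shows how the (f, e, f+1) and (f, e, f-1) counts could be collected from one aligned sentence.

```python
from collections import Counter

def word_context_features(src_words, aligned_pairs, frequent, other="<other>"):
    """Count (left-neighbor, f, e) and (f, e, right-neighbor) triples for each
    source word f aligned to target word e, mapping words outside the frequent
    vocabulary to a placeholder token."""
    def keep(w):
        return w if w in frequent else other

    padded = ["<s>"] + src_words + ["</s>"]
    feats = Counter()
    for i, e in aligned_pairs:                              # i indexes src_words
        f = padded[i + 1]
        feats[(keep(padded[i]), keep(f), keep(e))] += 1     # f with its left neighbor
        feats[(keep(f), keep(e), keep(padded[i + 2]))] += 1 # f with its right neighbor
    return feats
```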
For System Hiero Training MERT MIRA Syntax MERT MIRA Features baseline syntax, distortion syntax, distortion, discount all source-side, discount baseline baseline overlap node count all target-side, discount # 11 56 61 10990 25 25 132 136 283 Tune 35.4 35.9 36.6 38.4 38.6 38.5 38.7 38.7 39.6 Test 36.1 36.9 37.3 37.6 39.5 39.8 39.9 40.0 40.6 Table 1: Adding new features with MIRA significantly improves translation accuracy. Scores are case-insensitive IBM BLEU scores. or = significantly better than MERT baseline (p < 0.05 or 0.01, respectively). the syntax-based system, we ran a reimplementation of the Collins parser (Collins, 1997) on the English half of the bitext to produce parse trees, then restructured and relabeled them as described in Section 3.2. Syntax-based rule extraction was performed on a 65 million word subset of the training data. For Hiero, rules with up to two nonterminals were extracted from a 38 million word subset and phrasal rules were extracted from the remainder of the training data. We trained three 5-gram language models: one on the English half of the bitext, used by both systems, one on one billion words of English, used by the syntax-based system, and one on two billion words of English, used by Hiero. Modified Kneser-Ney smoothing (Chen and Goodman, 1998) was applied to all language models. The language models are represented using randomized data structures similar to those of Talbot et al. (2007). Our tuning set (2010 sentences) and test set (1994 sentences) were drawn from newswire data from the NIST 2004 and 2005 evaluations and the GALE program (with no overlap at either the segment or document level). For the source-side syntax features, we used the Berkeley parser (Petrov et al., 2006) to parse the Chinese side of both sets. We implemented the source-side context features for Hiero and the target-side syntax features for the syntax-based system, and the discount features for both. We then ran MIRA on the tuning set with 20 parallel learners for Hiero and 73 parallel learners for the syntax-based system. We chose a stopping iteration based on the BLEU score on the tuning set, and used the averaged feature weights from all iter222 Syntax-based count weight 1 +1.28 2 +0.35 3­5 -0.73 6­10 -0.64 Hiero count weight 1 +2.23 2 +0.77 3 +0.54 4 +0.29 5+ -0.02 Table 2: Weights learned for discount features. Negative weights indicate bonuses; positive weights indicate penalties. ations of all learners to decode the test set. The results (Table 1) show significant improvements in both systems (p < 0.01) over already very strong MERT baselines. Adding the source-side and discount features to Hiero yields a +1.5 BLEU improvement, and adding the target-side syntax and discount features to the syntax-based system yields a +1.1 BLEU improvement. The results also show that for Hiero, the various classes of features contributed roughly equally; for the syntax-based system, we see that two of the feature classes make small contributions but time constraints unfortunately did not permit isolated testing of all feature classes. 6 Analysis How did the various new features improve the translation quality of our two systems? We begin by examining the discount features. For these features, we used slightly different schemes for the two systems, shown in Table 2 with their learned feature weights. We see in both cases that one-count rules are strongly penalized, as expected. 
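To show what the discount features look like as an input to the learner, here is a minimal sketch (hypothetical names) of the count-binning scheme: a rule fires at most one indicator feature, keyed on the bin its training count falls into, and MIRA then learns one weight per bin, as in Table 2.

```python
def discount_features(rule_count, bins=((1, 1), (2, 2), (3, 5), (6, 10))):
    """At most one indicator feature per rule, keyed on the count bin its
    training-data count falls into (syntax-based bins shown; Hiero used
    1, 2, 3, 4 and 5+). Counts outside every bin fire nothing in this sketch."""
    return {f"count[{lo},{hi}]": 1.0 for lo, hi in bins if lo <= rule_count <= hi}

print(discount_features(1))   # {'count[1,1]': 1.0}  -- weight +1.28: a penalty
print(discount_features(7))   # {'count[6,10]': 1.0} -- weight -0.64: a small bonus
```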
Reward -0.42 a -0.13 are -0.09 at -0.09 on -0.05 was -0.05 from -0.04 's -0.04 by -0.04 is -0.03 it -0.03 its . . . Penalty +0.67 of +0.56 the +0.47 comma +0.13 period +0.11 in +0.08 for +0.06 to +0.05 will +0.04 and +0.02 as +0.02 have . . . Table 3: Weights learned for inserting target English words with rules that lack Chinese words. 6.1 Syntax features -0.50 -0.39 -0.36 -0.31 -0.30 -0.26 -0.25 -0.22 -0.21 -0.20 -0.16 -0.16 -0.15 -0.13 -0.12 -0.12 -0.11 Table 3 shows word-insertion feature weights. The system rewards insertion of forms of be; examples 1­3 in Figure 1 show typical improved translations that result. Among determiners, inserting a is rewarded, while inserting the is punished. This seems to be because the is often part of a fixed phrase, such as the White House, and therefore comes naturally as part of larger phrasal rules. Inserting the outside these fixed phrases is a risk that the generative model is too inclined to take. We also note that the system learns to punish unmotivated insertions of commas and periods, which get into our grammar via quirks in the MT training data. Table 4 shows weights for rule-overlap features. MIRA punishes the case where rules overlap with an IN (preposition) node. This makes sense: if a rule has a variable that can be filled by any English preposition, there is a risk that an incorrect preposition will fill it. On the other hand, splitting at a period is a safe bet, and frees the model to use rules that dig deeper into NP and VP trees when constructing a top-level S. Table 5 shows weights for generated English nonterminals: SBAR-C nodes are rewarded and commas are punished. The combined effect of all weights is subtle. To interpret them further, it helps to look at gross changes in the system's behavior. For example, a major error in the baseline system is to move "X said" or "X asked" from the beginning of the Chinese input to the middle or end of the English trans223 Bonus period VP-C VB SG-C MD VBG ADJP -LRBVP-BAR NPB-BAR FRAG PRN NPB RB SBAR-C VP-C-BAR -RRB. . . Penalty +0.93 IN +0.57 NNP +0.44 NN +0.41 DT +0.34 JJ +0.24 right double quote +0.20 VBZ +0.19 NP +0.16 TO +0.15 ADJP-BAR +0.14 PRN-BAR +0.14 NML +0.13 comma +0.12 VBD +0.12 NNPS +0.12 PRP +0.11 SG . . . Table 4: Weights learned for employing rules whose English sides are rooted at particular syntactic categories. -0.73 -0.54 -0.54 -0.52 -0.51 -0.47 -0.39 -0.34 -0.31 -0.30 -0.29 -0.27 -0.22 -0.21 -0.21 -0.20 -0.20 Bonus SBAR-C VBZ IN NN PP-C right double quote ADJP POS ADVP RP PRT SG-C S-C NNPS VP-BAR PRP NPB-BAR . . . Penalty +1.30 comma +0.80 DT +0.58 PP +0.44 TO +0.33 NNP +0.30 NNS +0.30 NML +0.22 CD +0.18 PRN +0.16 SYM +0.15 ADJP-BAR +0.15 NP +0.15 MD +0.15 HYPH +0.14 PRN-BAR +0.14 NP-C +0.11 ADJP-C . . . Table 5: Weights learned for generating syntactic nodes of various types anywhere in the English translation. lation. The error occurs with many speaking verbs, and each time, we trace it to a different rule. The problematic rules can even be non-lexical, e.g.: BLEU 38.5 38 37.5 37 36.5 36 35.5 35 0 5 10 15 Epoch 20 25 Tune Test S(x0 :NP-C x1 :VP x2 :, x3 :NP-C x4 :VP x5 :.) x3 x4 x2 x0 x1 x5 It is therefore difficult to come up with a straightforward feature to address the problem. However, when we apply MIRA with the features already listed, these translation errors all disappear, as demonstrated by examples 4­5 in Figure 1. Why does this happen? 
It turns out that in translation hypotheses that move "X said" or "X asked" away from the beginning of the sentence, more commas appear, and fewer S-C and SBAR-C nodes appear. Therefore, the new features work to discourage these hypotheses. Example 6 shows additionally that commas next to speaking verbs are now correctly deleted. Examples 7­8 in Figure 1 show other kinds of unanticipated improvements. We do not have space for a fuller analysis, but we note that the specific effects we describe above account for only part of the overall BLEU improvement. 6.2 Word context features In Table 6 are shown feature weights learned for the word-context features. A surprising number of the highest-weighted features have to do with translations of dates and bylines. Many of the penalties seem to discourage spurious insertion or deletion of frequent words (for, 's, said, parentheses, and quotes). Finally, we note that several of the features (the third- and eighth-ranked reward and twelfthranked penalty) shape the translation of shuo `said', preferring translations with an overt complementizer that and without a comma. Thus these features work together to attack a frequent problem that our targetsyntax features also addressed. Figure 2 shows the performance of Hiero with all of its features on the tuning and test sets over time. The scores on the tuning set rise rapidly, and the scores on the test set also rise, but much more slowly, and there appears to be slight degradation after the 18th pass through the tuning data. This seems in line with the finding of Watanabe et al. (2007) that with on the order of 10,000 features, overfitting is possible, but we can still improve accuracy on new data. 224 Figure 2: Using over 10,000 word-context features leads to overfitting, but its detrimental effects are modest. Scores on the tuning set were obtained from the 1-best output of the online learning algorithm, whereas scores on the test set were obtained using averaged weights. Early stopping would have given +0.2 BLEU over the results reported in Table 1.1 7 Conclusion We have described a variety of features for statistical machine translation and applied them to syntaxbased and hierarchical systems. We saw that these features, discriminatively trained using MIRA, led to significant improvements, and took a closer look at the results to see how the new features qualitatively improved translation quality. We draw three conclusions from this study. First, we have shown that these new features can improve the performance even of top-scoring MT systems. Second, these results add to a growing body of evidence that MIRA is preferable to MERT for discriminative training. When training over 10,000 features on a modest amount of data, we, like Watanabe et al. (2007), did observe overfitting, yet saw improvements on new data. Third, we have shown that syntax-based machine translation offers possibilities for features not available in other models, making syntax-based MT and MIRA an especially strong combination for future work. It was this iteration, in fact, which was used to derive the combined feature count used in the title of this paper. 1 1 MERT: the united states pending israeli clarification on golan settlement plan MIRA: the united states is waiting for israeli clarification on golan settlement plan 2 MERT: . . . the average life expectancy of only 18 months , canada 's minority goverment will . . . MIRA: . . . the average life expectancy of canada's previous minority government is only 18 months . . . 
3 MERT: . . . since un inspectors expelled by north korea . . . MIRA: . . . since un inspectors were expelled by north korea . . . 4 MERT: another thing is . . . , " he said , " obviously , the first thing we need to do . . . . MIRA: he said : " obviously , the first thing we need to do . . . , and another thing is . . . . " 5 MERT: the actual timing . . . reopened in january , yoon said . MIRA: yoon said the issue of the timing . . . 6 MERT: . . . us - led coalition forces , said today that the crash . . . MIRA: . . . us - led coalition forces said today that a us military . . . 7 MERT: . . . and others will feel the danger . MIRA: . . . and others will not feel the danger . 8 MERT: in residential or public activities within 200 meters of the region , . . . MIRA: within 200 m of residential or public activities area , . . . Figure 1: Improved syntax-based translations due to MIRA-trained weights. -1.19 -1.01 -0.84 -0.82 -0.78 -0.76 -0.66 -0.65 f , yue `month' " " , Bonus e that " " that . . . context f-1 = ri `day' f-1 = ( f-1 = shuo `say' f+1 = f-1 = f+1 = f+1 = nian `year' f+1 = +1.12 +0.83 +0.83 +0.73 +0.73 +0.72 +0.70 +0.69 +0.66 +0.66 +0.65 +0.60 Penalty f jiang `shall' zhengfu `government' , . . . e ) be the ) ( ) ( ( for 's said , context f+1 = f+1 = f-1 = f-1 = f+1 = f-1 = ri `day' f-1 = ri `day' f-1 = f-1 = f-1 = , f-1 = f-1 = shuo `say' Table 6: Weights learned for word-context features, which fire when English word e is generated aligned to Chinese word f , with Chinese word f-1 to the left or f+1 to the right. Glosses for Chinese words are not part of features. 225 References Phil Blunsom, Trevor Cohn, and Miles Osborne. 2008. A discriminative latent variable model for statistical machine translation. In Proc. ACL-08: HLT. Peter F. Brown, Stephen A. Della Pietra, Vincent Della J. Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263­312. Yee Seng Chan, Hwee Tou Ng, and David Chiang. 2007. Word sense disambiguation improves statistical machine translation. In Proc. ACL 2007. Stanley F. Chen and Joshua T. Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Computer Science Group, Harvard University. David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proc. EMNLP 2008. David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proc. ACL 2005. David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2). Michael Collins. 1997. Three generative, lexicalized models for statistical parsing. In Proc. ACL 1997. Koby Crammer, Ofer Dekel, Joseph Keshet, Shai ShalevShwartz, and Yoram Singer. 2006. Online passiveaggressive algorithms. Journal of Machine Learning Research, 7:551­585. Steve DeNeefe, Kevin Knight, Wei Wang, and Daniel Marcu. 2007. What can syntax-based MT learn from phrase-based MT? In Proc. EMNLP-CoNLL-2007. Victoria Fossum, Kevin Knight, and Steven Abney. 2008. Using syntax to improve word alignment for syntaxbased statistical machine translation. In Proc. Third Workshop on Statistical Machine Translation. Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Proc. HLT-NAACL 2004, Boston, Massachusetts. Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. 
Scalable inference and training of context-rich syntactic models. In Proc. ACL 2006. Liang Huang. 2008. Forest reranking: Discriminative parsing with non-local features. In Proc. ACL 2008. Kevin Knight and Jonathan Graehl. 2005. An overview of probabilistic tree transducers for natural language processing. In Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing). Yoong Keok Lee and Hwee Tou Ng. 2002. An empirical evaluation of knowledge sources and learning algorithms for word sense disambiguation. In Proc. EMNLP 2002, pages 41­48. Percy Liang, Alexandre Bouchard-C^ t´ , Dan Klein, and oe Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proc. COLING-ACL 2006. Wolfgang Macherey, Franz Josef Och, Ignacio Thayer, and Jakob Uskoreit. 2008. Lattice-based minimum error rate training for statistical machine translation. In Proc. EMNLP 2008. Yuval Marton and Philip Resnik. 2008. Soft syntactic constraints for hierarchical phrased-based translation. In Proc. ACL-08: HLT. Franz Josef Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proc. ACL 2002. Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur, Anoop Sarkar, Kenji Yamada, Alex Fraser, Shankar Kumar, Libin Shen, David Smith, Katherine Eng, Viren Jain, Zhen Jin, and Dragomir Radev. 2004. A smorgasbord of features for statistical machine translation. In Proc. HLT-NAACL 2004, pages 161­168. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proc. ACL 2003. Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proc. ACL 2006. John C. Platt. 1998. Fast training of support vector machines using sequential minimal optimization. In B. Sch¨ lkopf, C. J. C. Burges, and A. J. Smola, editors, o Advances in Kernel Methods: Support Vector Learning, pages 195­208. MIT Press. David Talbot and Miles Osborne. 2007. Randomised language modelling for statistical machine translation. In Proc. ACL 2007, pages 512­519. Christoph Tillmann and Tong Zhang. 2006. A discriminative global training algorithm for statistical MT. In Proc. COLING-ACL 2006. Joseph Turian, Benjamin Wellington, and I. Dan Melamed. 2007. Scalable discriminative learning for natural language parsing and translation. In Proc. NIPS 2006. Wei Wang, Kevin Knight, and Daniel Marcu. 2007. Binarizing syntax trees to improve syntax-based machine translation accuracy. In Proc. EMNLP-CoNLL 2007. Taro Watanabe, Jun Suzuki, Hajime Tsukuda, and Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In Proc. EMNLP 2007. Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23:377­404. Kenji Yamada and Kevin Knight. 2002. A decoder for syntax-based statistical MT. In Proc. ACL 2002. Hao Zhang, Liang Huang, Daniel Gildea, and Kevin Knight. 2006. Synchronous binarization for machine translation. In Proc. HLT-NAACL 2006. 226 Efficient Parsing for Transducer Grammars John DeNero, Mohit Bansal, Adam Pauls, and Dan Klein Computer Science Division University of California, Berkeley {denero, mbansal, adpauls, klein}@cs.berkeley.edu Abstract The tree-transducer grammars that arise in current syntactic machine translation systems are large, flat, and highly lexicalized. 
We address the problem of parsing efficiently with such grammars in three ways. First, we present a pair of grammar transformations that admit an efficient cubic-time CKY-style parsing algorithm despite leaving most of the grammar in n-ary form. Second, we show how the number of intermediate symbols generated by this transformation can be substantially reduced through binarization choices. Finally, we describe a two-pass coarse-to-fine parsing approach that prunes the search space using predictions from a subset of the original grammar. In all, parsing time reduces by 81%. We also describe a coarse-to-fine pruning scheme for forest-based language model reranking that allows a 100-fold increase in beam size while reducing decoding time. The resulting translations improve by 1.3 BLEU. 2004) and have increased in size by including more synchronous tree fragments (Galley et al., 2006; Marcu et al., 2006; DeNeefe et al., 2007). As a result of these trends, the syntactic component of machine translation decoding can now account for a substantial portion of total decoding time. In this paper, we focus on efficient methods for parsing with very large tree-to-string grammars, which have flat n-ary rules with many adjacent non-terminals, as in Figure 1. These grammars are sufficiently complex that the purely syntactic pass of our multi-pass decoder is the compute-time bottleneck under some conditions. Given that parsing is well-studied in the monolingual case, it is worth asking why MT grammars are not simply like those used for syntactic analysis. There are several good reasons. The most important is that MT grammars must do both analysis and generation. To generate, it is natural to memorize larger lexical chunks, and so rules are highly lexicalized. Second, syntax diverges between languages, and each divergence expands the minimal domain of translation rules, so rules are large and flat. Finally, we see most rules very few times, so it is challenging to subcategorize non-terminals to the degree done in analytic parsing. This paper develops encodings, algorithms, and pruning strategies for such grammars. We first investigate the qualitative properties of MT grammars, then present a sequence of parsing methods adapted to their broad characteristics. We give normal forms which are more appropriate than Chomsky normal form, leaving the rules mostly flat. We then describe a CKY-like algorithm which applies such rules efficiently, working directly over the n-ary forms in cubic time. We show how thoughtful 1 Introduction Current approaches to syntactic machine translation typically include two statistical models: a syntactic transfer model and an n-gram language model. Recent innovations have greatly improved the efficiency of language model integration through multipass techniques, such as forest reranking (Huang and Chiang, 2007), local search (Venugopal et al., 2007), and coarse-to-fine pruning (Petrov et al., 2008; Zhang and Gildea, 2008). Meanwhile, translation grammars have grown in complexity from simple inversion transduction grammars (Wu, 1997) to general tree-to-string transducers (Galley et al., 227 Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 227­235, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics Maria P daba (a) S! NNP1 did not slap DT2 green NN3 NNP1 no daba una bofetada a DT2 NN3 verde 90,000 Original grammar rules S ! NNP no daba una bofetada a DT NN verde Right-branch o daba (b) S ! 
binarization can further increase parsing speed, and we present a new coarse-to-fine scheme that uses rule subsets rather than symbol clustering to build a coarse grammar projection. These techniques reduce parsing time by 81% in aggregate. Finally, we demonstrate that we can accelerate forest-based reranking with a language model by pruning with information from the parsing pass. This approach enables a 100-fold increase in maximum beam size, improving translation quality by 1.3 BLEU while decreasing total decoding time.

Figure 1: (a) A synchronous transducer rule has coindexed non-terminals on the source and target side; the example rule pairs the source side "NNP1 no daba una bofetada a DT2 NN3 verde" with the target side "NNP1 did not slap DT2 green NN3". Internal grammatical structure of the target side has been omitted. (b) The source-side projection of the rule is a monolingual source-language rule with target-side grammar symbols. (c) A training sentence pair ("Maria no daba una bofetada a la bruja verde" / "Mary did not slap the green witch") is annotated with a target-side parse tree and a word alignment, which license this rule to be extracted.

Figure 2: Transducer grammars are composed of very flat rules. Above, the histogram shows rule counts for each rule size among the 332,000 rules that apply to an individual 30-word sentence. The size of a rule is the total number of non-terminals and lexical items in its source-side yield.

2 Tree Transducer Grammars

Tree-to-string transducer grammars consist of weighted rules like the one depicted in Figure 1. Each n-ary rule consists of a root symbol, a sequence of lexical items and non-terminals on the source side, and a fragment of a syntax tree on the target side. Each non-terminal on the source side corresponds to a unique one on the target side. Aligned non-terminals share a grammar symbol derived from a target-side monolingual grammar.

These grammars are learned from word-aligned sentence pairs annotated with target-side phrase structure trees. Extraction proceeds by using word alignments to find correspondences between target-side constituents and source-side word spans, then discovering transducer rules that match these constituent alignments (Galley et al., 2004). Given this correspondence, an array of extraction procedures yields rules that are well-suited to machine translation (Galley et al., 2006; DeNeefe et al., 2007; Marcu et al., 2006). Rule weights are estimated by discriminatively combining relative frequency counts and other rule features.

A transducer grammar G can be projected onto its source language, inducing a monolingual grammar. If we weight each rule by the maximum weight of its projecting synchronous rules, then parsing with this projected grammar maximizes the translation model score for a source sentence. We need not even consider the target side of transducer rules until integrating an n-gram language model or other non-local features of the target language.
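As a concrete illustration of this projection, the following minimal Python sketch keeps, for each source-side rule form, the maximum weight of any synchronous rule that projects onto it. The SyncRule record and the project_source helper are illustrative assumptions, not the system's actual data structures.

```python
from collections import namedtuple, defaultdict

# A synchronous rule: root symbol, source-side yield (lexical items and
# non-terminal symbols), an opaque target-side tree fragment, and a weight.
SyncRule = namedtuple("SyncRule", "root source target weight")

def project_source(sync_rules):
    """Project a transducer grammar onto its source language.

    Each monolingual rule (root, source yield) receives the maximum weight
    of the synchronous rules that project onto it, so parsing with the
    projection maximizes the translation-model score.
    """
    projected = defaultdict(float)
    for r in sync_rules:
        key = (r.root, tuple(r.source))
        projected[key] = max(projected[key], r.weight)
    return projected

if __name__ == "__main__":
    rules = [
        SyncRule("S", ["NNP", "no", "daba", "una", "bofetada", "a", "DT", "NN", "verde"],
                 "<target tree>", 0.3),
        SyncRule("S", ["NNP", "no", "daba", "una", "bofetada", "a", "DT", "NN", "verde"],
                 "<another target tree>", 0.5),
    ]
    for (root, src), w in project_source(rules).items():
        print(root, "->", " ".join(src), w)
```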
We conduct experiments with a grammar extracted from 220 million words of Arabic-English bitext, extracting rules with up to 6 non-terminals. A histogram of the size of rules applicable to a typical 30-word sentence appears in Figure 2. The grammar includes 149 grammatical symbols, an augmentation of the Penn Treebank symbol set. To evaluate, we decoded 300 sentences of up to 40 words in length from the NIST05 Arabic-English test set.

3 Efficient Grammar Encodings

Monolingual parsing with a source-projected transducer grammar is a natural first pass in multi-pass decoding. These grammars are qualitatively different from syntactic analysis grammars, such as the lexicalized grammars of Charniak (1997) or the heavily state-split grammars of Petrov et al. (2006). In this section, we develop an appropriate grammar encoding that enables efficient parsing.

It is problematic to convert these grammars into Chomsky normal form, which CKY requires. Because transducer rules are very flat and contain specific lexical items, binarization introduces a large number of intermediate grammar symbols. Rule size and lexicalization affect parsing complexity whether the grammar is binarized explicitly (Zhang et al., 2006) or implicitly binarized using Earley-style intermediate symbols (Zollmann et al., 2006). Moreover, the resulting binary rules cannot be Markovized to merge symbols, as in Klein and Manning (2003), because each rule is associated with a target-side tree that cannot be abstracted.

We also do not restrict the form of rules in the grammar, a common technique in syntactic machine translation. For instance, Zollmann et al. (2006) follow Chiang (2005) in disallowing adjacent non-terminals. Watanabe et al. (2006) limit grammars to Greibach normal form. However, general tree transducer grammars provide excellent translation performance (Galley et al., 2006), and so we focus on parsing with all available rules.

3.1 Lexical Normal Form

Sequences of consecutive non-terminals complicate parsing because they require a search over non-terminal boundaries when applied to a sentence span. We transform the grammar to ensure that all rules containing lexical items (lexical rules) do not contain sequences of non-terminals. We allow both unary and binary non-lexical rules.

Let L be the set of lexical items and V the set of non-terminal symbols in the original grammar. Then, lexical normal form (LNF) limits productions to two forms:

Non-lexical: X → X_1 (X_2)
Lexical: X → (X_1) w^+ (X_i w^+)* (X_2)

Above, all X_i ∈ V and w^+ ∈ L^+. Symbols in parentheses are optional. The nucleus of lexical rules, the mixed sequence w^+ (X_i w^+)*, has lexical items on each end and no adjacent non-terminals.

Figure 3: We transform the original grammar by first eliminating non-terminal sequences in lexical rules. Next, we binarize, adding a minimal number of intermediate grammar symbols and binary non-lexical rules. Finally, anchored LNF further transforms lexical rules to begin and end with lexical items by introducing additional symbols. (Original grammar rules are flat and lexical: S → NNP no daba una bofetada a DT NN verde; NP → DT NN NNS. LNF replaces non-terminal sequences in lexical rules: S → NNP no daba una bofetada a DT+NN verde; DT+NN → DT NN. Non-lexical rules are binarized using few symbols: NP → DT+NN NNS; DT+NN → DT NN. Anchored LNF rules are bounded by lexical items: S\NNP → no daba una bofetada a DT+NN verde; S → NNP S\NNP; NP → DT+NN NNS; DT+NN → DT NN.)

Converting a grammar into LNF requires two steps. In the sequence elimination step, for every lexical rule we replace each sequence of consecutive non-terminals X_1 ... X_n with the intermediate symbol X_1+...+X_n (abbreviated X_{1:n}) and introduce a non-lexical rule X_1+...+X_n → X_1 ... X_n. In the binarization step, we introduce further intermediate symbols and rules to binarize all non-lexical rules in the grammar, including those added by sequence elimination.
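The sequence-elimination step can be sketched as follows, assuming rule yields are represented as lists of tokens and non-terminals are identified by membership in a symbol set; the helper names are hypothetical.

```python
def is_nonterminal(sym, nonterminals):
    return sym in nonterminals

def eliminate_sequences(lexical_rule_yield, nonterminals, new_rules):
    """Replace each run of consecutive non-terminals X1 ... Xn in a lexical
    rule's source-side yield with the intermediate symbol X1+...+Xn, and
    record the non-lexical rule X1+...+Xn -> X1 ... Xn in new_rules."""
    out, run = [], []

    def flush():
        if len(run) == 1:
            out.append(run[0])
        elif run:
            merged = "+".join(run)
            new_rules.add((merged, tuple(run)))
            out.append(merged)
        run.clear()

    for sym in lexical_rule_yield:
        if is_nonterminal(sym, nonterminals):
            run.append(sym)
        else:
            flush()
            out.append(sym)
    flush()
    return out

if __name__ == "__main__":
    V = {"NNP", "DT", "NN"}
    extra = set()
    y = ["NNP", "no", "daba", "una", "bofetada", "a", "DT", "NN", "verde"]
    print(eliminate_sequences(y, V, extra))  # the DT NN run collapses to DT+NN
    print(extra)                             # {('DT+NN', ('DT', 'NN'))}
```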
3.2 Non-terminal Binarization

Exactly how we binarize non-lexical rules affects the total number of intermediate symbols introduced by the LNF transformation. Binarization involves selecting a set of symbols that will allow us to assemble the right-hand side X_1 ... X_n of every non-lexical rule using binary productions. This symbol set must at least include the left-hand side of every rule in the grammar (lexical and non-lexical), including the intermediate symbols X_{1:n} introduced by sequence elimination.

To ensure that a symbol sequence X_1 ... X_n can be constructed, we select a split point k and add intermediate types X_{1:k} and X_{k+1:n} to the grammar. We must also ensure that the sequences X_1 ... X_k and X_{k+1} ... X_n can be constructed. As baselines, we used left-branching (where k = 1 always) and right-branching (where k = n − 1) binarizations.

We also tested a greedy binarization approach, choosing k to minimize the number of grammar symbols introduced. We first try to select k such that both X_{1:k} and X_{k+1:n} are already in the grammar. If no such k exists, we select k such that one of the intermediate types generated is already used. If no such k exists again, we choose k = ⌊n/2⌋. This policy only creates new intermediate types when necessary. Song et al. (2008) propose a similar greedy approach to binarization that uses corpus statistics to select common types rather than explicitly reusing types that have already been introduced.
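The greedy policy just described might be sketched as follows; representing intermediate symbols as X1+...+Xn strings and the exact tie-breaking among equally good split points are assumptions of this illustration.

```python
def intermediate(symbols):
    """Name for the intermediate symbol covering a sequence, e.g. DT+NN."""
    return symbols[0] if len(symbols) == 1 else "+".join(symbols)

def greedy_binarize(lhs, rhs, grammar_symbols):
    """Binarize lhs -> X1 ... Xn into binary rules, preferring split points
    whose intermediate symbols X1:k and Xk+1:n already exist in the grammar."""
    rules = []

    def choose_split(symbols):
        n = len(symbols)
        def score(k):
            return ((intermediate(symbols[:k]) in grammar_symbols) +
                    (intermediate(symbols[k:]) in grammar_symbols))
        best = max(range(1, n), key=score)
        return best if score(best) > 0 else n // 2

    def cover(parent, symbols):
        if len(symbols) <= 1:
            return
        if len(symbols) == 2:
            rules.append((parent, symbols[0], symbols[1]))
            return
        k = choose_split(symbols)
        left, right = intermediate(symbols[:k]), intermediate(symbols[k:])
        grammar_symbols.update([left, right])
        rules.append((parent, left, right))
        cover(left, symbols[:k])
        cover(right, symbols[k:])

    cover(lhs, list(rhs))
    return rules

if __name__ == "__main__":
    symbols = {"NP", "DT", "NN", "NNS", "DT+NN"}  # DT+NN already added by LNF
    print(greedy_binarize("NP", ["DT", "NN", "NNS"], symbols))
    # [('NP', 'DT+NN', 'NNS'), ('DT+NN', 'DT', 'NN')]: the split reuses DT+NN
```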
Finally, we computed an optimal binarization that explicitly minimizes the number of symbols in the resulting grammar. We cast the minimization as an integer linear program (ILP). Let V be the set of all base non-terminal symbols in the grammar. We introduce an indicator variable T_Y for each symbol Y ∈ V^+ to indicate that Y is used in the grammar. Y can be either a base non-terminal symbol X_i or an intermediate symbol X_{1:n}. We also introduce indicators A_{Y,Z} for each pair of symbols, indicating that both Y and Z are used in the grammar. Let L ⊆ V^+ be the set of left-hand-side symbols for all lexical and non-lexical rules already in the grammar. Let R be the set of symbol sequences on the right-hand side of all non-lexical rules. Then, the ILP takes the form:

min Σ_{Y ∈ V^+} T_Y    (1)
s.t. T_Y = 1    ∀ Y ∈ L    (2)
     Σ_k A_{X_{1:k}, X_{k+1:n}} ≥ 1    ∀ X_1 ... X_n ∈ R    (3)
     Σ_k A_{X_{1:k}, X_{k+1:n}} ≥ T_{X_{1:n}}    ∀ X_{1:n}    (4)
     A_{Y,Z} ≤ T_Y ,  A_{Y,Z} ≤ T_Z    ∀ Y, Z    (5)

The solution to this ILP indicates which symbols appear in a minimal binarization. Equation 1 explicitly minimizes the number of symbols. Equation 2 ensures that all symbols already in the grammar remain in the grammar. Equation 3 does not require that a symbol represent the entire right-hand side of each non-lexical rule, but does ensure that each right-hand side sequence can be built from two subsequence symbols. Equation 4 ensures that any included intermediate type can also be built from two subsequence types. Finally, Equation 5 ensures that if a pair is used, each member of the pair is included. This program can be optimized with an off-the-shelf ILP solver.¹

¹ We used lp_solve: http://sourceforge.net/projects/lpsolve.

Figure 4: The number of non-terminal symbols introduced to the grammar through LNF binarization depends upon the policy for binarizing type sequences. This experiment shows results from transforming a grammar that has already been filtered for a particular short sentence. Both the greedy and optimal binarizations use far fewer symbols than naive binarizations. (Symbols introduced: right-branching 8,095; left-branching 5,871; greedy 1,101; optimal ILP 443.)

Figure 4 shows the number of intermediate grammar symbols needed for the four binarization policies described above for a short sentence. Our ILP solver could only find optimal solutions for very short sentences (which have small grammars after relativization). Because greedy requires very little time to compute and generates symbol counts that are close to optimal when both can be computed, we use it for our remaining experiments.

3.3 Anchored Lexical Normal Form

We also consider a further grammar transformation, anchored lexical normal form (ALNF), in which the yield of lexical rules must begin and end with a lexical item. As shown in the following section, ALNF improves parsing performance over LNF by shifting work from lexical rule applications to non-lexical rule applications. ALNF consists of rules with the following two forms:

Non-lexical: X → X_1 (X_2)
Lexical: X → w^+ (X_i w^+)*

To convert a grammar into ALNF, we first transform it into LNF, then introduce additional binary rules that split off non-terminal symbols from the ends of lexical rules, as shown in Figure 3.

4 Efficient CKY Parsing

We now describe a CKY-style parsing algorithm for grammars in LNF.
The dynamic program is organized into spans S_ij and computes the Viterbi score w(i, j, X) for each edge S_ij[X], the weight of the maximum parse over words i+1 to j, rooted at symbol X. For each S_ij, computation proceeds in three phases: binary, lexical, and unary.

4.1 Applying Non-lexical Binary Rules

For a span S_ij, we first apply the binary non-lexical rules just as in standard CKY, computing an intermediate Viterbi score w_b(i, j, X). Let φ_r denote the weight of rule r. Then,

w_b(i, j, X) = max_{r = X → X_1 X_2} φ_r · max_{i < k < j} w(i, k, X_1) · w(k, j, X_2).

The quantities w(i, k, X_1) and w(k, j, X_2) will have already been computed by the dynamic program. The work in this phase is cubic in sentence length.

4.2 Applying Lexical Rules

On the other hand, lexical rules in LNF can be applied without binarization, because they only apply to particular spans that contain the appropriate lexical items. For a given S_ij, we first compute all the legal mappings of each rule onto the span. A mapping consists of a correspondence between non-terminals in the rule and subspans of S_ij. In practice, there is typically only one way that a lexical rule in LNF can map onto a span, because most lexical items will appear only once in the span. Let m be a legal mapping and r its corresponding rule. Let S^{(i)}_{kℓ}[X] be the edge mapped to the ith non-terminal of r under m, and φ_r the weight of r. Then,

w_l(i, j, X) = max_m φ_r · ∏_{S^{(i)}_{kℓ}[X]} w(k, ℓ, X).

Again, w(k, ℓ, X) will have been computed by the dynamic program. Assuming only a constant number of mappings per rule per span, the work in this phase is quadratic. We can then merge w_l and w_b: w(i, j, X) = max(w_l(i, j, X), w_b(i, j, X)).

To efficiently compute mappings, we store lexical rules in a trie (or suffix array), a searchable graph that indexes rules according to their sequence of lexical items and non-terminals. This data structure has been used similarly to index whole training sentences for efficient retrieval (Lopez, 2007). To find all rules that map onto a span, we traverse the trie using depth-first search.

4.3 Applying Unary Rules

Unary non-lexical rules are applied after lexical rules and non-lexical binary rules.

w(i, j, X) = max_{r: r = X → X_1} φ_r · w(i, j, X_1).

While this definition is recursive, we allow only one unary rule application per symbol X at each span to prevent infinite derivations. This choice does not limit the generality of our algorithm: chains of unaries can always be collapsed via a unary closure.
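The three phases can be sketched as in the following illustration, which is not the authors' decoder: the chart is a dictionary keyed by spans, binary and unary rules are indexed by their left child, and the trie search over lexical rules is abstracted into precomputed per-span mappings.

```python
from collections import defaultdict

def process_span(i, j, chart, binary_rules, unary_rules, lexical_matches):
    """Compute Viterbi scores w(i, j, X) for one span in three phases:
    binary, lexical, then unary (one unary application per symbol)."""
    w = defaultdict(float)

    # Binary phase: combine two adjacent sub-edges, as in standard CKY.
    for k in range(i + 1, j):
        left = chart.get((i, k), {})
        right = chart.get((k, j), {})
        for X1, s1 in left.items():
            for X, X2, weight in binary_rules.get(X1, []):
                s2 = right.get(X2, 0.0)
                if s2 > 0.0:
                    w[X] = max(w[X], weight * s1 * s2)

    # Lexical phase: flat lexical rules apply directly, without binarization.
    # Each match lists the rule's root, weight, and the (k, l, symbol) edges
    # that its non-terminals map onto for this span.
    for X, weight, child_edges in lexical_matches.get((i, j), []):
        score = weight
        for k, l, sym in child_edges:
            score *= chart.get((k, l), {}).get(sym, 0.0)
        w[X] = max(w[X], score)

    # Unary phase: applied last; iterating over a snapshot of w prevents
    # chains of unary applications within the span.
    for X1, s1 in list(w.items()):
        for X, weight in unary_rules.get(X1, []):
            w[X] = max(w[X], weight * s1)

    chart[(i, j)] = dict(w)
```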
4.4 Bounding Split Points for Binary Rules

Non-lexical binary rules can in principle apply to any span S_ij where j − i ≥ 2, using any split point k such that i < k < j. In practice, however, many rules cannot apply to many (i, k, j) triples because the symbols for their children have not been constructed successfully over the subspans S_ik and S_kj. Therefore, the precise looping order over rules and split points can influence computation time. We found the following nested looping order for the binary phase of processing an edge S_ij[X] gave the fastest parsing times for these grammars:

1. Loop over symbols X_1 for the left child
2. Loop over all rules X → X_1 X_2 containing X_1
3. Loop over split points k : i < k < j
4. Update w_b(i, j, X) as necessary

This looping order allows for early stopping via additional bookkeeping in the algorithm. We track the following statistics as we parse:

minEND(i, X), maxEND(i, X): the minimum and maximum position k for which symbol X was successfully built over S_ik.
minSTART(j, X), maxSTART(j, X): the minimum and maximum position k for which symbol X was successfully built over S_kj.

We then bound k by min_k and max_k in the inner loop using these statistics. If ever min_k > max_k, then the loop is terminated early.

1. Set min_k = i + 1, max_k = j − 1
2. Loop over symbols X_1 for the left child:
   min_k = max(min_k, minEND(i, X_1)); max_k = min(max_k, maxEND(i, X_1))
3. Loop over rules X → X_1 X_2:
   min_k = max(min_k, minSTART(j, X_2)); max_k = min(max_k, maxSTART(j, X_2))
4. Loop over split points k : min_k ≤ k ≤ max_k
5. Update w_b(i, j, X) as necessary

In this way, we eliminate unnecessary work by avoiding split points that we know beforehand cannot contribute to w_b(i, j, X).

4.5 Parsing Time Results

Table 1 shows the decrease in parsing time from including these bound checks, as well as switching from lexical normal form to anchored LNF.

Grammar   Bound checks   Parsing time
LNF       no             264
LNF       yes            181
ALNF      yes            104

Table 1: Adding bound checks to CKY and transforming the grammar from LNF to anchored LNF reduce parsing time by 61% for 300 sentences of length 40 or less. No approximations have been applied, so all three scenarios produce no search errors. Parsing time is in minutes.

Using ALNF rather than LNF increases the number of grammar symbols and non-lexical binary rules, but makes parsing more efficient in three ways. First, it decreases the number of spans for which a lexical rule has a legal mapping. In this way, ALNF effectively shifts work from the lexical phase to the binary phase. Second, ALNF reduces the time spent searching the trie for mappings, because the first transition into the trie must use an edge with a lexical item. Finally, ALNF improves the frequency that, when a lexical rule matches a span, we have successfully built every edge S_kℓ[X] in the mapping for that rule. This frequency increases from 45% to 96% with ALNF.

5 Coarse-to-Fine Search

We now consider two coarse-to-fine approximate search procedures for parsing with these grammars. Our first approach clusters grammar symbols together during the coarse parsing pass, following work in analytic parsing (Charniak and Caraballo, 1998; Petrov and Klein, 2007). We collapse all intermediate non-terminal grammar symbols (e.g., NP) to a single coarse symbol X, while pre-terminal symbols (e.g., NN) are hand-clustered into 7 classes (nouns, verbals, adjectives, punctuation, etc.). We then project the rules of the original grammar into this simplified symbol set, weighting each rule of the coarse grammar by the maximum weight of any rule that mapped onto it.

In our second and more successful approach, we select a subset of grammar symbols. We then include only and all rules that can be built using those symbols. Because the grammar includes many rules that are compositions of smaller rules, parsing with a subset of the grammar still provides meaningful scores that can be used to prune base grammar symbols while parsing under the full grammar.
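A few lines suffice to illustrate the rule-subset idea: keep only rules whose root and non-terminal children all come from a chosen symbol subset (how the subset is chosen by voting is described next). The rule representation here is an assumption.

```python
def build_coarse_grammar(rules, kept_symbols):
    """Keep only rules whose left-hand side and non-terminal children are all
    drawn from kept_symbols.  Rules are (lhs, rhs_nonterminals, weight)
    triples, where rhs_nonterminals lists only the rule's non-terminals."""
    coarse = []
    for lhs, rhs_nonterminals, weight in rules:
        if lhs in kept_symbols and all(s in kept_symbols for s in rhs_nonterminals):
            coarse.append((lhs, rhs_nonterminals, weight))
    return coarse

if __name__ == "__main__":
    rules = [
        ("S", ["NNP", "DT+NN"], 1.0),   # buildable from the kept subset
        ("NP", ["DT+NN", "NNS"], 0.5),
        ("FRAG", ["AUX"], 0.1),         # dropped: FRAG is not kept
    ]
    kept = {"S", "NP", "NNP", "DT+NN", "NNS"}
    print(build_coarse_grammar(rules, kept))
```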
5.1 Symbol Selection

To compress the grammar, we select a small subset of symbols that allow us to retain as much of the original grammar as possible. We use a voting scheme to select the symbol subset. After conversion to LNF (or ALNF), each lexical rule in the original grammar votes for the symbols that are required to build it. A rule votes as many times as it was observed in the training data to promote frequent rules. We then select the top n_l symbols by vote count and include them in the coarse grammar C.

We would also like to retain as many non-lexical rules from the original grammar as possible, but the right-hand side of each rule can be binarized in many ways. We again use voting, but this time each non-lexical rule votes for its yield, a sequence of symbols. We select the top n_u symbol sequences as the set R of right-hand sides. Finally, we augment the symbol set of C with intermediate symbols that can construct all sequences in R, using only binary rules. This step again requires choosing a binarization for each sequence, such that a minimal number of additional symbols is introduced. We use the greedy approach from Section 3.2.

We then include in C all rules from the original grammar that can be built from the symbols we have chosen. Surprisingly, we are able to retain 76% of the grammar rules while excluding 92% of the grammar symbols, which speeds up parsing substantially. We used an n_l of 500 and an n_u of 4000 for experiments. These parameters were tuned on a development set.

5.2 Max Marginal Thresholding

We parse first with the coarse grammar to find the Viterbi derivation score for each edge S_ij[X]. We then perform a Viterbi outside pass over the chart, like a standard outside pass but replacing sums with max (Goodman, 1999). The product of an edge's Viterbi score and its Viterbi outside score gives a max marginal, the score of the maximal parse that uses the edge. We then prune away regions of the chart that deviate in their coarse max marginal from the global Viterbi score by a fixed margin tuned on a development set.

Pruning       Minutes   Model score   BLEU
No pruning    104       60,179        44.84
Clustering    79        60,179        44.84
Subsets       50        60,163        44.82

Table 2: Coarse-to-fine pruning speeds up parsing time with minimal effect on either model score or translation quality. The coarse grammar built using symbol subsets outperforms clustering grammar symbols, reducing parsing time by 52%. These experiments do not include a language model.

Table 2 shows that both methods of constructing a coarse grammar are effective in pruning, but selecting symbol subsets outperformed the more typical clustering approach, reducing parsing time by an additional factor of 2.
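A minimal sketch of this pruning criterion follows, assuming the coarse Viterbi inside and Viterbi outside scores have already been computed; the function name and the log-space margin convention are illustrative choices rather than the paper's exact formulation.

```python
import math

def prune_chart(inside, outside, global_viterbi, margin):
    """Keep an edge (i, j, X) only if its max marginal, the score of the best
    parse that uses the edge, is within `margin` of the global Viterbi score
    in log space.  `inside` and `outside` map (i, j, X) to Viterbi inside and
    Viterbi outside scores from the coarse pass."""
    kept = set()
    threshold = math.log(global_viterbi) - margin
    for edge, inside_score in inside.items():
        outside_score = outside.get(edge, 0.0)
        if inside_score > 0.0 and outside_score > 0.0:
            max_marginal = math.log(inside_score) + math.log(outside_score)
            if max_marginal >= threshold:
                kept.add(edge)
    return kept

if __name__ == "__main__":
    inside = {(0, 2, "NP"): 0.04, (0, 2, "FRAG"): 1e-7}
    outside = {(0, 2, "NP"): 0.5, (0, 2, "FRAG"): 0.5}
    print(prune_chart(inside, outside, global_viterbi=0.02, margin=3.0))
    # {(0, 2, 'NP')}: the low max-marginal FRAG edge is pruned away.
```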
6 Language Model Integration

Large n-gram language models (LMs) are critical to the performance of machine translation systems. Recent innovations have managed the complexity of LM integration using multi-pass architectures. Zhang and Gildea (2008) describes a coarse-to-fine approach that iteratively increases the order of the LM. Petrov et al. (2008) describes an additional coarse-to-fine hierarchy over language projections. Both of these approaches integrate LMs via bottom-up dynamic programs that employ beam search. As an alternative, Huang and Chiang (2007) describes a forest-based reranking algorithm called cube growing, which also employs beam search, but focuses computation only where necessary in a top-down pass through a parse forest. In this section, we show that the coarse-to-fine idea of constraining each pass using marginal predictions of the previous pass also applies effectively to cube growing. Max marginal predictions from the parse can substantially reduce LM integration time.

6.1 Language Model Forest Reranking

Parsing produces a forest of derivations, where each edge in the forest holds its Viterbi (or one-best) derivation under the transducer grammar. In forest reranking via cube growing, edges in the forest produce k-best lists of derivations that are scored by both the grammar and an n-gram language model. Using ALNF, each edge must first generate a k-best list of derivations that are not scored by the language model. These derivations are then flattened to remove the binarization introduced by ALNF, so that the resulting derivations are each rooted by an n-ary rule r from the original grammar. The leaves of r correspond to sub-edges in the chart, which are recursively queried for their best language-model-scored derivations. These sub-derivations are combined by r, and new n-grams at the edges of these derivations are scored by the language model. The language-model-scored derivations for the edge are placed on a priority queue. The top of the priority queue is repeatedly removed, and its successors added back on to the queue, until k language-model-scored derivations have been discovered. These k derivations are then sorted and supplied to parent edges upon request. Huang and Chiang (2007) describes the cube growing algorithm in further detail, including the precise form of the successor function for derivations.

Pruning strategy     Max beam   BLEU    TM score   LM score    Total score   Inside time   Outside time   LM time   Total time
No pruning           20         57.67   58,570     -17,202     41,368        99            0              247       346
CTF parsing          200        58.43   58,495     -16,929     41,556        53            0              186       239
CTF reranking        200        58.63   58,582     -16,998     41,584        98            64             79        241
CTF parse + rerank   2000       58.90   58,602     -16,980     41,622        53            52             148       253

Table 3: Time in minutes and performance for 300 sentences. We used a trigram language model trained on 220 million words of English text. The no pruning baseline used a fixed beam size for forest-based language model reranking. Coarse-to-fine parsing included a coarse pruning pass using a symbol subset grammar. Coarse-to-fine reranking used max marginals to constrain the reranking pass. Coarse-to-fine parse + rerank employed both of these approximations.

6.2 Coarse-to-Fine Parsing

Even with this efficient reranking algorithm, integrating a language model substantially increased decoding time and memory use. As a baseline, we reranked using a small fixed-size beam of 20 derivations at each edge. Larger beams exceeded the memory of our hardware. Results appear in Table 3. Coarse-to-fine parsing before LM integration substantially improved language model reranking time. By pruning the chart with max marginals from the coarse symbol subset grammar from Section 5, we were able to rerank with beams of length 200, leading to a 0.8 BLEU increase and a 31% reduction in total decoding time.

6.3 Coarse-to-Fine Forest Reranking

We realized similar performance and speed benefits by instead pruning with max marginals from the full grammar. We found that LM reranking explored many edges with low max marginals, but used few of them in the final decoder output. Following the coarse-to-fine paradigm, we restricted the reranker to edges with a max marginal above a fixed threshold. Furthermore, we varied the beam size of each edge based on the parse. Let m̂ be the ratio of an edge's max marginal to the global Viterbi derivation score for the sentence. We used a beam of size k · 2^{ln m̂} for each edge. Computing max marginals under the full grammar required an additional outside pass over the full parse forest, adding substantially to parsing time. However, soft coarse-to-fine pruning based on these max marginals also allowed for beams up to length 200, yielding a 1.0 BLEU increase over the baseline and a 30% reduction in total decoding time. We also combined the coarse-to-fine parsing approach with this soft coarse-to-fine reranker. Tiling these approximate search methods allowed another 10-fold increase in beam size, further improving BLEU while only slightly increasing decoding time.

7 Conclusion

As translation grammars increase in complexity while innovations drive down the computational cost of language model integration, the efficiency of the parsing phase of machine translation decoding is becoming increasingly important. Our grammar normal form, CKY improvements, and symbol subset coarse-to-fine procedure reduced parsing time for large transducer grammars by 81%. These techniques also improved forest-based reranking.
A full decoding pass without any of our innovations required 511 minutes using only small beams. Coarse-to-fine pruning in both the parsing and language model passes allowed a 100-fold increase in beam size, giving a performance improvement of 1.3 BLEU while decreasing total decoding time by 50%. Acknowledgements This work was enabled by the Information Sciences Institute Natural Language Group, primarily through the invaluable assistance of Jens Voeckler, and was supported by the National Science Foundation (NSF) under grant IIS-0643742. 234 References Eugene Charniak and Sharon Caraballo. 1998. New figures of merit for best-first probabilistic chart parsing. In Computational Linguistics. Eugene Charniak. 1997. Statistical techniques for natural language parsing. In National Conference on Artificial Intelligence. David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In The Annual Conference of the Association for Computational Linguistics. Steve DeNeefe, Kevin Knight, Wei Wang, and Daniel Marcu. 2007. What can syntax-based MT learn from phrase-based MT? In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics. Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In The Annual Conference of the Association for Computational Linguistics. Joshua Goodman. 1999. Semiring parsing. Computational Linguistics. Liang Huang and David Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In The Annual Conference of the Association for Computational Linguistics. Dan Klein and Chris Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the Association for Computational Linguistics. Adam Lopez. 2007. Hierarchical phrase-based translation with suffix arrays. In The Conference on Empirical Methods in Natural Language Processing. Daniel Marcu, Wei Wang, Abdessamad Echihabi, and Kevin Knight. 2006. SPMT: Statistical machine translation with syntactified target language phrases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In The Annual Conference of the North American Chapter of the Association for Computational Linguistics. Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In The Annual Conference of the Association for Computational Linguistics. Slav Petrov, Aria Haghighi, and Dan Klein. 2008. Coarse-to-fine syntactic machine translation using language projections. In The Conference on Empirical Methods in Natural Language Processing. Xinying Song, Shilin Ding, and Chin-Yew Lin. 2008. Better binarization for the CKY parsing. In The Conference on Empirical Methods in Natural Language Processing. Ashish Venugopal, Andreas Zollmann, and Stephan Vogel. 2007. An efficient two-pass approach to synchronous-CFG driven statistical MT. In In Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference. Taro Watanabe, Hajime Tsukada, and Hideki Isozaki. 
2006. Left-to-right target generation for hierarchical phrase-based translation. In The Annual Conference of the Association for Computational Linguistics. Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23:377­404. Hao Zhang and Daniel Gildea. 2008. Efficient multipass decoding for synchronous context free grammars. In The Annual Conference of the Association for Computational Linguistics. Hao Zhang, Liang Huang, Daniel Gildea, and Kevin Knight. 2006. Synchronous binarization for machine translation. In North American Chapter of the Association for Computational Linguistics. Andreas Zollmann, Ashish Venugopal, and Stephan Vogel. 2006. Syntax augmented machine translation via chart parsing. In The Statistical Machine Translation Workshop at the North American Association for Computational Linguistics Conference. 235 Preference Grammars: Softening Syntactic Constraints to Improve Statistical Machine Translation Ashish Venugopal Andreas Zollmann Noah A. Smith Stephan Vogel Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213, USA {ashishv,zollmann,nasmith,vogel}@cs.cmu.edu Abstract We propose a novel probabilistic synchoronous context-free grammar formalism for statistical machine translation, in which syntactic nonterminal labels are represented as "soft" preferences rather than as "hard" matching constraints. This formalism allows us to efficiently score unlabeled synchronous derivations without forgoing traditional syntactic constraints. Using this score as a feature in a log-linear model, we are able to approximate the selection of the most likely unlabeled derivation. This helps reduce fragmentation of probability across differently labeled derivations of the same translation. It also allows the importance of syntactic preferences to be learned alongside other features (e.g., the language model) and for particular labeling procedures. We show improvements in translation quality on small and medium sized Chinese-to-English translation tasks. 1 Introduction of finding the maximum-weighted derivation consistent with the source sentence, where the scores are defined (at least in part) by R-valued weights associated with the rules. A PSCFG derivation is a synchronous parse tree. Defining the translation function as finding the best derivation has the unfortunate side effect of forcing differently-derived versions of the same target sentence to compete with each other. In other words, the true score of each translation is "fragmented" across many derivations, so that each translation's most probable derivation is the only one that matters. The more Bayesian approach of finding the most probable translation (integrating out the derivations) instantiates an NP-hard inference problem even for simple word-based models (Knight, 1999); for grammar-based translation it is known as the consensus problem (Casacuberta and de la Higuera, 2000; Sima'an, 2002). With weights interpreted as probabilities, the maximum-weighted derivation is the maximum a posteriori (MAP) derivation: e argmax max p(e, d | f ) ^ e d Probabilistic synchronous context-free grammars (PSCFGs) define weighted production rules that are automatically learned from parallel training data. As in classical CFGs, these rules make use of nonterminal symbols to generalize beyond lexical modeling of sentences. 
In MT, this permits translation and reordering to be conditioned on more abstract notions of context. For example, VP ne VB1 pas # do not VB1 represents the discontiguous translation of the French words "ne" and "pas" to "do not", in the context of the labeled nonterminal symbol "VB" (representing syntactic category "verb"). Translation with PSCFGs is typically expressed as the problem 236 where f is the source sentence, e ranges over target sentences, and d ranges over PSCFG derivations (synchronous trees). This is often described as an approximation to the most probable translation, argmaxe d p(e, d | f ). In this paper, we will describe a technique that aims to find the most probable equivalence class of unlabeled derivations, rather than a single labeled derivation, reducing the fragmentation problem. Solving this problem exactly is still an NP-hard consensus problem, but we provide approximations that build on well-known PSCFG decoding methods. Our model falls somewhere between PSCFGs that extract nonterminal symbols from parse trees and treat them as part of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 236­244, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics the derivation (Zollmann and Venugopal, 2006) and unlabeled hierarchical structures (Chiang, 2005); we treat nonterminal labels as random variables chosen at each node, with each (unlabeled) rule expressing "preferences" for particular nonterminal labels, learned from data. The paper is organized as follows. In Section 2, we summarize the use of PSCFG grammars for translation. We describe our model (Section 3). Section 4 explains the preference-related calculations, and Section 5 addresses decoding. Experimental results using preference grammars in a loglinear translation model are presented for two standard Chinese-to-English tasks in Section 6. We review related work (Section 7) and conclude. MAP approximation can be defined as: e = tgt ^ argmax dD(G):src(d)=f p(d) (1) where tgt(d) is the target-side yield of a derivation d, and D(G) is the set of G's derivations. Using an n-gram language model to score derivations and rule labels to constraint the rules that form derivations, we define p(d) as log-linear model in terms of the rules r R used in d as: m p(d) = pLM (tgt(d)) ×psyn (d) rR 0 × pi (d)i i=1 m+1 /Z() (2) 2 PSCFGs for Machine Translation pi (d) = psyn (d) = hi (r) freq(r;d) Probabilistic synchronous context-free grammars (PSCFGs) are defined by a source terminal set (source vocabulary) TS , a target terminal set (target vocabulary) TT , a shared nonterminal set N and a set R of rules of the form: X , , w where · X N is a labeled nonterminal referred to as the left-hand-side of the rule. · (N TS ) is the source side of the rule. · (N TT ) is the target side of the rule. · w [0, ) is a nonnegative real-valued weight assigned to the rule. For visual clarity, we will use the # character to separate the source side of the rule from the target side . PSCFG rules also have an implied one-toone mapping between nonterminal symbols in and nonterminals symbols in . Chiang (2005), Zollmann and Venugopal (2006) and Galley et al. (2006) all use parameterizations of this PSCFG formalism1 . Given a source sentence f and a PSCFG G, the translation task can be expressed similarly to monolingual parsing with a PCFG. 
We aim to find the most likely derivation d of the input source sentence and read off the English translation, identified by composing from each rule used in the derivation. This search for the most likely translation under the Galley et al. (2006) rules are formally defined as tree transducers but have equivalent PSCFG forms. 1 1 if d respects label constraints (3) 0 otherwise where = 0 · · · m+1 are weights that reflect the relative importance of features in the model. The features include the n-gram language model (LM) score of the target yield sequence, a collection of m rule feature functions hi : R R0 , and a "syntax" feature that (redundantly) requires every nonterminal token to be expanded by a rule with that nonterminal on its left-hand side. freq(r; d) denotes the frequency of the rule r in the derivation d. Note that m+1 can be effectively ignored when psyn is defined as in Equation 3. Z() is a normalization constant that does not need to be computed during search under the argmax search criterion in Equation 1. Feature weights are trained discriminatively in concert with the language model weight to maximize the BLEU (Papineni et al., 2002) automatic evaluation metric via Minimum Error Rate Training (MERT) (Och, 2003). We use the open-source PSCFG rule extraction framework and decoder from Zollmann et al. (2008) as the framework for our experiments. The asymptotic runtime of this decoder is: O |f |3 |N ||TT |2(n-1) K (4) where K is the maximum number of nonterminal symbols per rule, |f | the source sentence length, and 237 n is the order of the n-gram LM that is used to compute pLM . This constant factor in Equation 4 arises from the dynamic programming item structure used to perform search under this model. Using notation from Chiang (2007), the corresponding item structure is: [X, i, j, q()] : w (5) (4) (3) (2) (1) where X is the nonterminal label of a derivation, i, j define a span in the source sentence, and q() maintains state required to compute pLM (). Under the MAP criterion we can discard derivations of lower weight that share this item structure, but in practice we often require additional lossy pruning to limit the number of items produced. The Syntax-Augmented MT model of Zollmann and Venugopal (2006), for instance, produces a very large nonterminal set using "slash" (NP/NN the great) and "plus" labels (NP+VB she went) to assign syntactically motivated labels for rules whose target words do not correspond to constituents in phrase structure parse trees. These labels lead to fragmentation of probability across many derivations for the same target sentence, worsening the impact of the MAP approximation. In this work we address the increased fragmentation resulting from rules with labeled nonterminals compared to unlabeled rules (Chiang, 2005). ( ę ý VB1 # a place where I can VB1 S ( ę ý VP1 # a place where I can VP1 SBAR ( ę ý VP1 # a place where I can VP1 FRAG ( ę ý AUX1 # a place where I can AUX1 S VB VP NP NN m # eat m # eat m # eat m # dish (8) (1) (1) (10) where the numbers are frequencies of the rule from the training corpus. In classical PSCFG we can think of the nonterminals mentioned in the rules as hard constraints on which rules can be used to expand a particular node; e.g., a VP can only be expanded by a VP rule. In Equation 2, psyn (d) explicitly enforces this hard constraint. Instead, we propose softening these constraints. In the rules below, labels are represented as soft preferences. 
(10) X ( ę ý X1 # a place where I can X1 = = = = 0.4 0.3 0.2 0.1 3 Preference Grammars We extend the PSCFG formalism to include soft "label preferences" for unlabeled rules that correspond to alternative labelings that have been encountered in training data for the unlabeled rule form. These preferences, estimated via relative frequency counts from rule occurrence data, are used to estimate the feature psyn (d), the probability that an unlabeled derivation can be generated under traditional syntactic constraints. In classic PSCFG, psyn (d) enforces a hard syntactic constraint (Equation 3). In our approach, label preferences influence the value of psyn (d). 3.1 Motivating example p(H0 = S, H1 = VB | r) p(H0 = S, H1 = VP | r) p(H0 = SBAR, H1 = VP | r) p(H0 = FRAG, H1 = AUX | r) (10) X m # eat p(H0 = VB | r) p(H0 = VP | r) p(H0 = NP | r) = = = 0.8 0.1 0.1 (10) X m # dish { p(H0 = NN | r) = 1.0 } Consider the following labeled Chinese-to-English PSCFG rules: 238 Each unlabeled form of the rule has an associated distribution over labels for the nonterminals referenced in the rule; the labels are random variables Hi , with H0 the left-hand-side label. These unlabeled rule forms are simply packed representations of the original labeled PSCFG rules. In addition to the usual features hi (r) for each rule, estimated based on unlabeled rule frequencies, we now have label preference distributions. These are estimated as relative frequencies from the labelings of the base, unlabeled rule. Our primary contribution is how we compute psyn (d)--the probability that an unlabeled derivation adheres to traditional syntactic constraints--for derivations built from preference grammar rules. By using psyn (d) as a feature in the log-linear model, we allow the MERT framework to evaluate the importance of syntactic structure relative to other features. The example rules above highlight the potential for psyn (d) to affect the choice of translation. The translation of the Chinese word sequence (ę ý m can be performed by expanding the nonterminal in the rule "a place where I can X1 " with either "eat" or "dish." A hierarchical system (Chiang, 2005) would allow either expansion, relying on features like pLM to select the best translation since both expansions occurred the same number of times in the data. A richly-labeled PSCFG as in Zollmann and Venugopal (2006) would immediately reject the rule generating "dish" due to hard label matching constraints, but would produce three identical, competing derivations. Two of these derivations would produce S as a root symbol, while one derivation would produce SBAR. The two S-labeled derivations compete, rather than reinforce the choice of the word "eat," which they both make. They will also compete for consideration by any decoder that prunes derivations to keep runtime down. The rule preferences indicate that VB and VP are both valid labels for the rule translating to "eat", and both of these labels are compatible with the arguments expected by "a place where I can X1 ". Alternatively, "dish" produces a NN label which is not compatible with the arguments of this higherup rule. We design psyn (d) to reflect compatibility between two rules (one expanding a right-hand side nonterminal in the other), based on label preference distributions. 3.2 Formal definition with the explicit label set N . · : H N , a function that associates each implicit label with a single explicit label. 
We can therefore think of H symbols as refinements of the nonterminals in N (Matsusaki et al., 2005). · For each rule r, we define a probability distribution over vectors h of implicit label bindings for its nonterminals, denoted ppref (h | r). h includes bindings for the left-hand side nonterminal (h0 ) as well as each right-hand side nonterminal (h1 , ..., h|h| ). Each hi H. When N , H are defined to include just a single generic symbol as in (Chiang, 2005), we produce the unlabeled grammar discussed above. In this work, we define · N = {S, X} · H = {NP, DT, NN · · · } = NSAMT where N corresponds to the generic labels of Chiang (2005) and H corresponds to the syntactically motivated SAMT labels from (Zollmann and Venugopal, 2006), and maps all elements of H to X. We will use hargs(r) to denote the set of all h = h0 , h1 , ..., hk Hk+1 that are valid bindings for the rule with nonzero preference probability. The preference distributions ppref from each rule used in d are used to compute psyn (d) as described next. 4 Computing feature psyn (d) Let us view a derivation d as a collection of nonterminal tokens nj , j {1, ..., |d|}. Each nj takes an explicit label in N . The score psyn (d) is a product, with one factor per nj in the derivation d: psyn (d) = |d| j=1 j (6) Probabilistic synchronous context-free preference grammars are defined as PSCFGs with the following additional elements: · H: a set of implicit labels, not to be confused 239 Each j factor considers the two rules that nj participates in. We will refer to the rule above nonterminal token nj as rj (the nonterminal is a child in this rule) and the rule that expands nonterminal token j as rj . The intuition is that derivations in which these two rules agree (at each j) about the implicit label for nj , in H are preferable to derivations in which they do not. Rather than making a decision about the implicit label, we want to reward psyn when rj and rj are consistent. Our way of measuring this consistency is an inner product of preference distributions: j ppref (h | rj )ppref (h | rj ) (7) signature [Y, i - ||, j + | |, v, ...]. The left-handside preferences v for the new item are calculated as follows: v(h) = v (h) = ~ v (h) ~ ~ h v (h ) where (8) hH h H: h,h hargs(r) ppref ( h, h | r) × u(h ) This is not quite the whole story, because ppref (· | r) is defined as a joint distribution of all the implicit labels within a rule; the implicit labels are not independent of each other. Indeed, we want the implicit labels within each rule to be mutually consistent, i.e., to correspond to one of the rule's preferred labelings, for both hargs(r) and hargs(r). Our approach to calculating psyn within the dynamic programming algorithm is to recursively calculate preferences for each chart item based on (a) the smaller items used to construct the item and (b) the rule that permits combination of the smaller items into the larger one. We describe how the preferences for chart items are calculated. Let a chart item be denoted [X, i, j, u, ...] where X N and i and j are positions in the source sentence, and u : {h H | (h) = X} [0, 1] (where h u(h) = 1) denotes a distribution over possible X-refinement labels. We will refer to it below as the left-hand-side preference distribution. Additional information (such as language model state) may also be included; it is not relevant here. The simplest case is for a nonterminal token nj that has no nonterminal children. 
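The inner product of Equation 7 can be illustrated with a small sketch; this simplification treats each preference distribution as a flat dictionary over single labels and ignores the joint-label bookkeeping developed in the equations that follow, so it conveys the intuition rather than the exact quantity. The numbers loosely echo the running example.

```python
def label_consistency(pref_above, pref_below):
    """Inner product of two label-preference distributions: the reward is
    high when the rule above a nonterminal token and the rule expanding it
    place probability mass on the same implicit labels."""
    return sum(p * pref_below.get(label, 0.0) for label, p in pref_above.items())

if __name__ == "__main__":
    # Child-label preferences of "a place where I can X1", and the preferences
    # of two candidate expansions.
    above = {"VB": 0.4, "VP": 0.5, "AUX": 0.1}
    eat = {"VB": 0.8, "VP": 0.1, "NP": 0.1}
    dish = {"NN": 1.0}
    print(label_consistency(above, eat))   # 0.37: compatible labels are rewarded
    print(label_consistency(above, dish))  # 0.0: incompatible labels get no reward
```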
Here the left-handside preference distribution is simply given by u(h) = ppref (h | rj ) . and we define the psyn factor to be j = 1. Now consider the dynamic programming step of combining an already-built item [X, i, j, u, ...] rooted by explicit nonterminal X, spanning source sentence positions i to j, with left-hand-side preference distribution u, to build a larger item rooted by Y through a rule r = Y X1 , X1 , w with preferences ppref (· | r).2 The new item will have 2 We assume for the discussion that , TS and , Renormalizing keeps the preference vectors on the same scale as those in the rules. The psyn factor , which is factored into the value of the new item, is calculated as: = h H: h,h hargs(r) u(h ) (9) so that the value considered for the new item is w × × ..., where factors relating to pLM , for example, may also be included. Coming back to our example, if we let r be the leaf rule producing "eat" at shared nonterminal n1 , we generate an item with: u = u(VB) = 0.8, u(VP) = 0.1, u(NP) = 0.1 1 = 1 Combining this item with X ( ę ý X1 # a place where I can X1 as r2 at nonterminal n2 generates a new target item with translation "a place where I can eat", 2 = 0.9 and v as calculated in Fig. 1. In contrast, 2 = 0 for the derivation where r is the leaf rule that produces "dish". This calculation can be seen as a kind of singlepass, bottom-up message passing inference method embedded within the usual dynamic programming search. 5 Decoding Approximations As defined above, accurately computing psyn (d) requires extending the chart item structure with u. For models that use the n-gram LM feature, the item structure would be: [X, i, j, q(), u] : w (10) Since u effectively summarizes the choice of rules in a derivation, this extension would partition the TT . If there are multiple nonterminals on the right-hand side of the rule, we sum over the longer sequences in hargs(r) and include appropriate values from the additional "child" items' preference vectors in the product. 240 v (S) = ppref ( h = S, h = VB | r)u(VB) + ppref ( h = S, h = VP | r)u(VP) = (0.4 × 0.8) + (0.3 × 0.1) = 0.35 ~ v (SBAR) = p( h = SBAR, h = VP | r)u(VP) = (0.2 × 0.1) = 0.02 ~ v = v(S) = 0.35/(~(S) + v (SBAR)), v(SBAR) = 0.02/~(S) + v (SBAR) = v(S) = 0.35/0.37, v(SBAR) = 0.02/0.37 v ~ v ~ 2 = u(VB) + u(VP) = 0.8 + 0.1 = 0.9 Figure 1: Calculating v and 2 for the running example. search space further. To prevent this partitioning, we follow the approach of Venugopal et al. (2007). We keep track of u for the best performing derivation from the set of derivations that share [X, i, j, q()] in a first-pass decoding. In a second top-down pass similar to Huang and Chiang (2007), we can recalculate psyn (d) for alternative derivations in the hypergraph; potentially correcting search errors made in the first pass. We face another significant practical challenge during decoding. In real data conditions, the size of the preference vector for a single rule can be very high, especially for rules that include multiple nonterminal symbols that are located on the left and right boundaries of . For example, the Chineseto-English rule X X1 ,, X2 # X1 's X2 has over 24K elements in hargs(r) when learned for the medium-sized NIST task used below. In order to limit the explosive growth of nonterminals during decoding for both memory and runtime reasons, we define the following label pruning parameters: · R : This parameter limits the size of hargs(r) to the R top-scoring preferences, defaulting other values to zero. 
· L : This parameter is the same as R but applied only to rules with no nonterminals. The stricter of L and R is applied if both thresholds apply. · P : This parameter limits the number labels in item preference vectors (Equation 8) to the P most likely labels during decoding, defaulting other preferences to zero. subset of the full training data (67M words of English text) from the annual NIST MT Evaluation. Development corpora are used to train model parameters via MERT. We use a variant of MERT that prefers sparse solutions where i = 0 for as many features as possible. At each MERT iteration, a subset of features are assigned 0 weight and optimization is repeated. If the resulting BLEU score is not lower, these features are left at zero. All systems are built on the SAMT framework described in Zollmann et al. (2008), using a trigram LM during search and the full-order LM during a second hypergraph rescoring pass. Reordering limits are set to 10 words for all systems. Pruning parameters during decoding limit the number of derivations at each source span to 300. The system "Hier." uses a grammar with a single nonterminal label as in Chiang (2005). The system "Syntax" applies the grammar from Zollmann and Venugopal (2006) that generates a large number of syntactically motivated nonterminal labels. For the NIST task, rare rules are discarded based on their frequency in the training data. Purely lexical rules (that include no terminal symbols) that occur less than 2 times, or non-lexical rules that occur less than 4 times are discarded. IWSLT task: We evaluate the preference grammar system "Pref." with parameters R = 100, L = 5, P = 2. Results comparing systems Pref. to Hier. and Syntax are shown in Table 2. Automatic evaluation results using the preference grammar translation model are positive. The preference grammar system shows improvements over both the Hier. and Syntax based systems on both unseen evaluation sets IWSLT 2007 and 2008. The improvements are clearest on the BLEU metric (matching the MERT training criteria). On 2007 test data, Pref. shows a 1.2-point improvement over Hier., while on the 2008 data, there is a 0.6-point improvement. For the IWSLT task, we report additional au- 6 Empirical Results We evaluate our preference grammar model on small (IWSLT) and medium (NIST) data Chineseto-English translation tasks (described in Table 1). IWSLT is a limited domain, limited resource task (Paul, 2006), while NIST is a broadcast news task with wide genre and domain coverage. We use a 241 System Name IWSLT NIST Words in Target Text 632K 67M LM singleton 1-n-grams (n) 431K (5) 102M (4) Dev. IWSLT06 MT05 Test IWSLT07,08 MT06 Table 1: Training data configurations used to evaluate preference grammars. The number of words in the target text and the number of singleton 1-n-grams represented in the complete model are the defining statistics that characterize the scale of each task. For each LM we also indicate the order of the n-gram model. System Dev BLEU (lpen) 28.0 (0.89) 30.9 (0.96) 28.3 (0.88) 2007 BLEU (lpen) 37.0 (0.89) 35.5 (0.94) 38.2 (0.90) 2008 BLEU (lpen) 45.9 (0.91) 45.3 (0.95) 46.3 (0.91) 2008 WER 44.5 45.7 43.8 2008 PER 39.9 40.4 40.0 2008 MET. 61.8 62.1 61.7 2008 GTM 70.7 71.5 71.2 Hier. Syntax Pref. Table 2: Translation quality metrics on the IWSLT translation task, with IWSLT 2006 as the development corpora, and IWSLT 2007 and 2008 as test corpora. 
Each metric is annotated with an if increases in the metric value correspond to increase in translation quality and a if the opposite is true. We also list length penalties for the BLEU metric to show that improvements are not due to length optimizations alone. tomatic evaluation metrics that generally rank the Pref. system higher than Hier. and Syntax. As a further confirmation, our feature selection based MERT chooses to retain m+1 in the model. While the IWSLT results are promising, we perform a more complete evaluation on the NIST translation task. NIST task: This task generates much larger rule preference vectors than the IWSLT task simply due to the size of the training corpora. We build systems with both R = 100, 10 varying P . Varying P isolates the relative impact of propagating alternative nonterminal labels within the preference grammar model. L = 5 for all NIST systems. Parameters are trained via MERT on the R = 100, L = 5, P = 2 system. BLEU scores for each preference grammar and baseline system are shown in Table 3, along with translation times on the test corpus. We also report length penalties to show that improvements are not simply due to better tuning of output length. The preference grammar systems outperform the Hier. baseline by 0.5 points on development data, and upto 0.8 points on unseen test data. While systems with R = 100 take significantly longer to translate the test data than Hier., setting R = 10 takes approximately as long as the Syntax based system but produces better slightly better results (0.3 242 points). The improvements in translation quality with the preference grammar are encouraging, but how much of this improvement can simply be attributed to MERT finding a better local optimum for parameters ? To answer this question, we use parameters optimized by MERT for the preference grammar system to run a purely hierarchical system, denoted Hier.( ), which ignores the value of m+1 during decoding. While almost half of the improvement comes from better parameters learned via MERT for the preference grammar systems, 0.5 points can be still be attributed purely to the feature psyn . In addition, MERT does not set parameter m+1 to 0, corroborating the value of the psyn feature again. Note that Hier.( ) achieves better scores than the Hier. system which was trained via MERT without psyn . This highlights the local nature of MERT parameter search, but also points to the possibility that training with the feature psyn produced a more diverse derivation space, resulting in better parameters . We see a very small improvement (0.1 point) by allowing the runtime propagation of more than 1 nonterminal label in the left-hand side posterior distribution, but the improvement doesn't extend to P = 5. Improved integration of the feature psyn (d) into decoding might help to widen this gap. Dev. Test System BLEU (lpen) BLEU (lpen) Baseline Systems Hier. 34.1 (0.99) 31.8 (0.95) Syntax 34.7 (0.99) 32.3 (0.95) Hier.( ) 32.1 (0.95) Preference Grammar: R = 100 P = 1 32.5 (0.96) P = 2 34.6 (0.99) 32.6 (0.95) P = 5 32.5 (0.95) Preference Grammar: R = 10 P = 1 32.5 (0.95) P = 2 32.6 (0.95) P = 5 32.5 (0.95) Test time (h:mm) 0:12 0:45 0:12 3:00 3:00 3:20 1:03 1:10 1:10 notations add an additional level of structure that must be marginalized during search. They demonstrate improvements in parse quality only when a variational approximation is used to select the most likely unannotated tree rather than simply stripping annotations from the MAP annotated tree. 
In our work, we focused on approximating the selection of the most likely unlabeled derivation during search, rather than as a post-processing operation; the methods described above might improve this approximation, at some computational expense. 8 Conclusions and Future Work Table 3: Translation quality and test set translation time (using 50 machines with 2 tasks per machine) measured by the BLEU metric for the NIST task. NIST 2006 is used as the development (Dev.) corpus and NIST 2007 is used as the unseen evaluation corpus (Test). Dev. scores are reported for systems that have been separately MERT trained, Pref. systems share parameters from a single MERT training. Systems are described in the text. 7 Related Work There have been significant efforts in the both the monolingual parsing and machine translation literature to address the impact of the MAP approximation and the choice of labels in their respective models; we survey the work most closely related to our approach. May and Knight (2006) extract nbest lists containing unique translations rather than unique derivations, while Kumar and Byrne (2004) use the Minimum Bayes Risk decision rule to select the lowest risk (highest BLEU score) translation rather than derivation from an n-best list. Tromble et al. (2008) extend this work to lattice structures. All of these approaches only marginalize over alternative candidate derivations generated by a MAPdriven decoding process. More recently, work by Blunsom et al. (2007) propose a purely discriminative model whose decoding step approximates the selection of the most likely translation via beam search. Matsusaki et al. (2005) and Petrov et al. (2006) propose automatically learning annotations that add information to categories to improve monolingual parsing quality. Since the parsing task requires selecting the most non-annotated tree, the an243 We have proposed a novel grammar formalism that replaces hard syntactic constraints with "soft" preferences. These preferences are used to compute a machine translation feature (psyn (d)) that scores unlabeled derivations, taking into account traditional syntactic constraints. Representing syntactic constraints as a feature allows MERT to train the corresponding weight for this feature relative to others in the model, allowing systems to learn the relative importance of labels for particular resource and language scenarios as well as for alternative approaches to labeling PSCFG rules. This approach takes a step toward addressing the fragmentation problems of decoding based on maximum-weighted derivations, by summing the contributions of compatible label configurations rather than forcing them to compete. We have suggested an efficient technique to approximate psyn (d) that takes advantage of a natural factoring of derivation scores. Our approach results in improvements in translation quality on small and medium resource translation tasks. In future work we plan to focus on methods to improve on the integration of the psyn (d) feature during decoding and techniques that allow us consider more of the search space through less pruning. Acknowledgements We appreciate helpful comments from three anonymous reviewers. Venugopal and Zollmann were supported by a Google Research Award. Smith was supported by NSF grant IIS-0836431. References Phil Blunsom, Trevor Cohn, and Miles Osborne. 2007. A discriminative latent variable model for statistical machine translation. 
In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). Francisco Casacuberta and Colin de la Higuera. 2000. Computational complexity of problems on probabilistic grammars and transducers. In Proc. of the 5th International Colloquium on Grammatical Inference: Algorithms and Applications. David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics. Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Sydney, Australia. Liang Huang and David Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). Kevin Knight. 1999. Decoding complexity in word-replacement translation models. Computational Linguistics, Squibs and Discussion. Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference (HLT/NAACL), Boston, MA, May 27-June 1. Takuya Matsusaki, Yusuke Miyao, and Junichi Tsujii. 2005. Probabilistic CFG with latent annotations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). Jonathan May and Kevin Knight. 2006. A better N-best list: Practical determinization of weighted finite tree automata. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL). Franz J. Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). Michael Paul. 2006. Overview of the IWSLT 2006 evaluation campaign. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT). Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). Khalil Sima'an. 2002. Computational complexity of probabilistic disambiguation. Grammars, 5(2):125–151. Roy Tromble, Shankar Kumar, Franz Och, and Wolfgang Macherey. 2008. Lattice minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Ashish Venugopal, Andreas Zollmann, and Stephan Vogel. 2007. An efficient two-pass approach to Synchronous-CFG driven statistical MT. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL). Andreas Zollmann and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing.
In Proceedings of the Workshop on Statistical Machine Translation, HLT/NAACL, New York, June. Andreas Zollmann, Ashish Venugopal, Franz J. Och, and Jay Ponte. 2008. A systematic comparison of phrasebased, hierarchical and syntax-augmented statistical MT. In Proceedings of the Conference on Computational Linguistics (COLING). 244 Using a Dependency Parser to Improve SMT for Subject-Object-Verb Languages Peng Xu, Jaeho Kang, Michael Ringgaard and Franz Och Google Inc. 1600 Amphitheatre Parkway Mountain View, CA 94043, USA {xp,jhkang,ringgaard,och}@google.com Abstract We introduce a novel precedence reordering approach based on a dependency parser to statistical machine translation systems. Similar to other preprocessing reordering approaches, our method can efficiently incorporate linguistic knowledge into SMT systems without increasing the complexity of decoding. For a set of five subject-object-verb (SOV) order languages, we show significant improvements in BLEU scores when translating from English, compared to other reordering approaches, in state-of-the-art phrase-based SMT systems. 1 Introduction Over the past ten years, statistical machine translation has seen many exciting developments. Phrasebased systems (Och, 2002; Koehn et.al., 2003; Och and Ney, 2004) advanced the machine translation field by allowing translations of word sequences (a.k.a., phrases) instead of single words. This approach has since been the state-of-the-art because of its robustness in modeling local word reordering and the existence of an efficient dynamic programming decoding algorithm. However, when phrase-based systems are used between languages with very different word orders, such as between subject-verb-object (SVO) and subject-object-verb (SOV) languages, long distance reordering becomes one of the key weaknesses. Many reordering methods have been proposed in recent years to address this problem in different aspects. 245 The first class of approaches tries to explicitly model phrase reordering distances. Distance based distortion model (Och, 2002; Koehn et.al., 2003) is a simple way of modeling phrase level reordering. It penalizes non-monotonicity by applying a weight to the number of words between two source phrases corresponding to two consecutive target phrases. Later on, this model was extended to lexicalized phrase reordering (Tillmann, 2004; Koehn, et.al., 2005; Al-Onaizan and Papineni, 2006) by applying different weights to different phrases. Most recently, a hierarchical phrase reordering model (Galley and Manning, 2008) was proposed to dynamically determine phrase boundaries using efficient shift-reduce parsing. Along this line of research, discriminative reordering models based on a maximum entropy classifier (Zens and Ney, 2006; Xiong, et.al., 2006) also showed improvements over the distance based distortion model. None of these reordering models changes the word alignment step in SMT systems, therefore, they can not recover from the word alignment errors. These models are also limited by a maximum allowed reordering distance often used in decoding. The second class of approaches puts syntactic analysis of the target language into both modeling and decoding. 
It has been shown that direct modeling of target language constituents movement in either constituency trees (Yamada and Knight, 2001; Galley et.al., 2006; Zollmann et.al., 2008) or dependency trees (Quirk, et.al., 2005) can result in significant improvements in translation quality for translating languages like Chinese and Arabic into English. A simpler alternative, the hierarchical phrase-based Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 245­253, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics approach (Chiang, 2005; Wu, 1997) also showed promising results for translating Chinese to English. Similar to the distance based reordering models, the syntactical or hierarchical approaches also rely on other models to get word alignments. These models typically combine machine translation decoding with chart parsing, therefore significantly increase the decoding complexity. Even though some recent work has shown great improvements in decoding efficiency for syntactical and hierarchical approaches (Huang and Chiang, 2007), they are still not as efficient as phrase-based systems, especially when higher order language models are used. Finally, researchers have also tried to put source language syntax into reordering in machine translation. Syntactical analysis of source language can be used to deterministically reorder input sentences (Xia and McCord, 2004; Collins et.al., 2005; Wang et.al., 2007; Habash, 2007), or to provide multiple orderings as weighted options (Zhang et.al., 2007; Li et.al., 2007; Elming, 2008). In these approaches, input source sentences are reordered based on syntactic analysis and some reordering rules at preprocessing step. The reordering rules can be either manually written or automatically extracted from data. Deterministic reordering based on syntactic analysis for the input sentences provides a good way of resolving long distance reordering, without introducing complexity to the decoding process. Therefore, it can be efficiently incorporated into phrase-based systems. Furthermore, when the same preprocessing reordering is performed for the training data, we can still apply other reordering approaches, such as distance based reordering and hierarchical phrase reordering, to capture additional local reordering phenomena that are not captured by the preprocessing reordering. The work presented in this paper is largely motivated by the preprocessing reordering approaches. In the rest of the paper, we first introduce our dependency parser based reordering approach based on the analysis of the key issues when translating SVO languages to SOV languages. Then, we show experimental results of applying this approach to phrasebased SMT systems for translating from English to five SOV languages (Korean, Japanese, Hindi, Urdu and Turkish). After showing that this approach can also be beneficial for hierarchical phrase-based sys246 John can hit the ball . . Figure 1: Example Alignment Between an English and a Korean Sentence tems, we will conclude the paper with future research directions. 2 Translation between SVO and SOV Languages In linguistics, it is possible to define a basic word order in terms of the verb (V) and its arguments, subject (S) and object (O). Among all six possible permutations, SVO and SOV are the most common. Therefore, translating between SVO and SOV languages is a very important area to study. 
We use English as a representative of SVO languages and Korean as a representative for SOV languages in our discussion about the word orders. Figure 1 gives an example sentence in English and its corresponding translation in Korean, along with the alignments between the words. Assume that we split the sentences into four phrases: (John , t@), (can hit , ` ^ µ Č ä), (the ball , ř ő D) and (. , .). Since a phrase-based decoder generates the translation from left to right, the following steps need to happen when we translate from English to Korean: · Starts from the beginning of the sentence, translates "John" to "t@"; · Jumps to the right by two words, translates "the ball" to "ř őD"; · Jumps to the left by four words, translates "can hit" to "` ^µČä"; · Finally, jumps to the right by two words, translates "." to ".". It is clear that in order for the phrase-based decoder to successfully carry out all of the reordering steps, a very strong reordering model is required. When the sentence gets longer with more complex structure, the number of words to move over during decoding can be quite high. Imagine when we translate Figure 2: Dependency Parse Tree of an Example English Sentence the sentence "English is used as the first or second language in many countries around the world .". The decoder needs to make a jump of 13 words in order to put the translation of "is used" at the end of the translation. Normally in a phrase-based decoder, very long distance reordering is not allowed because of efficiency considerations. Therefore, it is very difficult in general to translate English into Korean with proper word order. However, knowing the dependency parse trees of the English sentences may simplify the reordering problem significantly. In the simple example in Figure 1, if we analyze the English sentence and know that "John" is the subject, "can hit" is the verb and "the ball" is the object, we can reorder the English into SOV order. The resulting sentence "John the ball can hit ." will only need monotonic translation. This motivates us to use a dependency parser for English to perform the reordering. 3 Precedence Reordering Based on a Dependency Parser Figure 2 shows the dependency tree for the example sentence in the previous section. In this parse, the verb "hit" has four children: a subject noun "John", an auxiliary verb "can", an object noun "ball" and a punctuation ".". When transforming the sentence to SOV order, we need to move the object noun and the subtree rooted at it to the front of the head verb, but after the subject noun. We can have a simple rule to achieve this. However, in reality, there are many possible children for a verb. These children have some relative ordering that is typically fixed for SOV languages. In order to describe this kind of ordering, we propose precedence reordering rules based on a dependency parse tree. All rules here are based English 247 and Korean examples, but they also apply to other SOV languages, as we will show later empirically. A precedence reordering rule is a mapping from T to a set of tuples {(L, W, O)}, where T is the part-of-speech (POS) tag of the head in a dependency parse tree node, L is a dependency label for a child node, W is a weight indicating the order of that child node and O is the type of order (either NORMAL or REVERSE). The type of order is only used when we have multiple children with the same weight, while the weight is used to determine the relative order of the children, going from largest to smallest. 
The weight can be any real valued number. The order type NORMAL means we preserve the original order of the children, while REVERSE means we flip the order. We reserve a special label self to refer to the head node itself so that we can apply a weight to the head, too. We will call this tuple a precedence tuple in later discussions. In this study, we use manually created rules only. Suppose we have a precedence rule: VB (nsubj, 2, NORMAL), (dobj, 1, NORMAL), (self, 0, NORMAL). For the example shown in Figure 2, we would apply it to the ROOT node and result in "John the ball can hit .". Given a set of rules, we apply them in a dependency tree recursively starting from the root node. If the POS tag of a node matches the left-hand-side of a rule, the rule is applied and the order of the sentence is changed. We go through all children of the node and get the precedence weights for them from the set of precedence tuples. If we encounter a child node that has a dependency label not listed in the set of tuples, we give it a default weight of 0 and default order type of NORMAL. The children nodes are sorted according to their weights from highest to lowest, and nodes with the same weights are ordered according to the type of order defined in the rule. 3.1 Verb Precedence Rules Verb movement is the most important movement when translating from English (SVO) to Korean (SOV). In a dependency parse tree, a verb node can potentially have many children. For example, auxiliary and passive auxiliary verbs are often grouped together with the main verb and moved together with it. The order, however, is reversed after the movement. In the example of Figure 2, the correct Korean T VB* . Figure 3: Dependency Parse Tree with Alignment for a Sentence with Preposition Modifier JJ or JJS or JJR word order is "` (hit) ` ^ µ Č ä(can) . Other categories that are in the same group are phrasal verb particle and negation. If the verb in an English sentence has a prepositional phrase as a child, the prepositional phrase is often placed before the direct object in the Korean counterpart. As shown in Figure 3, ")Ýt \" ("with a bat") is actually between "t@" ("John") and "ř őD" ("the ball"). Another common reordering phenomenon is when a verb has an adverbial clause modifier. In that case, the whole adverbial clause is moved together to be in front of the subject of the main sentence. Inside the adverbial clause, the ordering follows the same verb reordering rules, so we recursively reorder the clause. Our verb precedence rule, as in Table 1, can cover all of the above reordering phenomena. One way to interpret this rule set is as follows: for any node whose POS tag is matches VB* (VB, VBZ, VBD, VBP, VBN, VBG), we group the children node that are phrasal verb particle (prt), auxiliary verb (aux), passive auxiliary verb (auxpass), negation (neg) and the verb itself (self) together and reverse them. This verb group is moved to the end of the sentence. We move adverbial clause modifier to the beginning of the sentence, followed by a group of noun subject (nsubj), preposition modifier and anything else not listed in the table, in their original order. Right before the verb group, we put the direct object (dobj). Note that all of the children are optional. 
3.2 Adjective Precedence Rules NN or NNS IN or TO (L, W, O) (advcl, 1, NORMAL) (nsubj, 0, NORMAL) (prep, 0, NORMAL) (dobj, -1, NORMAL) (prt, -2, REVERSE) (aux, -2, REVERSE) (auxpass, -2, REVERSE) (neg, -2, REVERSE) (self, -2, REVERSE) (advcl, 1, NORMAL) (self, -1, NORMAL) (aux, -2, REVERSE) (auxpass, -2, REVERSE) (neg, -2, REVERSE) (cop, -2, REVERSE) (prep, 2, NORMAL) (rcmod, 1, NORMAL) (self, 0, NORMAL) (pobj, 1, NORMAL) (self, -1, NORMAL) Table 1: Precedence Rules to Reorder English to SOV Language Order (These rules were extracted manually by a bilingual speaker after looking at some text book examples in English and Korean, and the dependency parse trees of the English examples.) as modifiers. In such cases, the change in order from English to Korean is similar to the verb rule, except that the head adjective itself should be in front of the verbs. Therefore, in our adjective precedence rule in the second panel of Table 1, we group the auxiliary verb, the passive auxiliary verb and the negation and move them together after reversing their order. They are moved to right after the head adjective, which is put after any other modifiers. For both verb and adjective precedence rules, we also apply some heuristics to prevent excessive movements. In order to do this, we disallow any movement across punctuation and conjunctions. Therefore, for sentences like "John hit the ball but Sam threw the ball", the reordering result would be "John the ball hit but Sam the ball threw", instead of "John the ball but Sam the ball threw hit". 3.3 Noun and Preposition Precedence Rules Similar to the verbs, adjectives can also take an auxiliary verb, a passive auxiliary verb and a negation 248 In Korean, when a noun is modified by a prepositional phrase, such as in "the way to happiness", the prepositional phrase is usually moved in front of the noun, resulting in "ő (happiness) <\ " 8 (to the way)" . Similarly for relative clause modifier, it is also reordered to the front of the head noun. For preposition head node with an object modifier, the order is the object first and the preposition last. One example is "with a bat" in Figure 3. It corresponds to ") Ý t (a bat) \(with)". We handle ) these types of reordering by the noun and preposition precedence rules in the third and fourth panel of Table 1. With the rules defined in Table 1, we now show a more complex example in Figure 4. First, the ROOT node matches an adjective rule, with four children nodes labeled as (csubj, cop, advcl, p), and with precedence weights of (0, -2, 1, 0). The ROOT node itself has a weight of -1. After reordering, the sentence becomes: "because we do n't know what the future has Living exciting is .". Note that the whole adverbial phrase rooted at "know" is moved to the beginning of the sentence. After that, we see that the child node rooted at "know" matches a verb rule, with five children nodes labeled as (mark, nsubj, aux, neg, ccomp), with weights (0, 0, -2, -2, 0). In this case, the verb itself also has weight -2. Now we have two groups of nodes, with weight 0 and -2, respectively. The first group has a NORMAL order and the second group has a REVERSE order. After reordering, the sentence becomes: "because we what the future has know n't do Living exciting is .". Finally, we have another node rooted at "has" that matches the verb rule again. After the final reordering, we end up with the sentence: "because we the future what has know n't do Living exciting is .". 
We can see in Figure 4 that this sentence has an almost monotonic alignment with a reasonable Korean translation shown in the figure1 . 4 Related Work As we mentioned in our introduction, there have been several studies in applying source sentence reordering using syntactical analysis for statistical machine translation. Our precedence reordering approach based on a dependency parser is motivated by those previous works, but we also distinguish from their studies in various ways. Several approaches use syntactical analysis to provide multiple source sentence reordering options through word lattices (Zhang et.al., 2007; Li et.al., 2007; Elming, 2008). A key difference between We could have improved the rules by using a weight of -3 for the label "mark", but it was not in our original set of rules. 1 their approaches and ours is that they do not perform reordering during training. Therefore, they would need to rely on reorder units that are likely not violating "phrase" boundaries. However, since we reorder both training and test data, our system operates in a matched condition. They also focus on either Chinese to English (Zhang et.al., 2007; Li et.al., 2007) or English to Danish (Elming, 2008), which arguably have less long distance reordering than between English and SOV languages. Studies most similar to ours are those preprocessing reordering approaches (Xia and McCord, 2004; Collins et.al., 2005; Wang et.al., 2007; Habash, 2007). They all perform reordering during preprocessing based on either automatically extracted syntactic rules (Xia and McCord, 2004; Habash, 2007) or manually written rules (Collins et.al., 2005; Wang et.al., 2007). Compared to these approaches, our work has a few differences. First of all, we study a wide range of SOV languages using manually extracted precedence rules, not just for one language like in these studies. Second, as we will show in the next section, we compare our approach to a very strong baseline with more advanced distance based reordering model, not just the simplest distortion model. Third, our precedence reordering rules, like those in Habash, 2007, are more flexible than those other rules. Using just one verb rule, we can perform the reordering of subject, object, preposition modifier, auxiliary verb, negation and the head verb. Although we use manually written rules in this study, it is possible to learn our rules automatically from alignments, similarly to Habash, 2007. However, unlike Habash, 2007, our manually written rules handle unseen children and their order naturally because we have a default precedence weight and order type, and we do not need to match an often too specific condition, but rather just treat all children independently. Therefore, we do not need to use any backoff scheme in order to have a broad coverage. Fourth, we use dependency parse trees rather than constituency trees. There has been some work on syntactic word order model for English to Japanese machine translation (Chang and Toutanova, 2007). In this work, a global word order model is proposed based on features including word bigram of the target sentence, displacements and POS tags on both source and tar- 249 Label Token POS csubj Living VBG cop is VBZ ROOT JJ mark IN nsubj PRP aux neg advcl dobj do VBP det the DT nsubj future NN ccomp has VBZ exciting because we n't RB know what VB WP p . . . because we the future what has know n't do Living exciting is . Figure 4: A Complex Reordering Example (Reordered English sentence and alignments are at the bottom.) 
get sides. They build a log-linear model using these features and apply the model to re-rank N -best lists from a baseline decoder. Although we also study the reordering problem in English to Japanese translation, our approach is to incorporate the linguistically motivated reordering directly into modeling and decoding. System EnglishKorean EnglishJapanese EnglishHindi EnglishUrdu EnglishTurkish Source 303M 316M 16M 17M 83M Target 267M 350M 17M 19M 76M Table 2: Training Corpus Statistics (#words) of Systems for 5 SOV Languages 5 Experiments We carried out all our experiments based on a stateof-the-art phrase-based statistical machine translation system. When training a system for English to any of the 5 SOV languages, the word alignment step includes 3 iterations of IBM Model-1 training and 2 iterations of HMM training. We do not use Model-4 because it is slow and it does not add much value to our systems in a pilot study. We use the standard phrase extraction algorithm (Koehn et.al., 2003) to get all phrases up to length 5. In addition to the regular distance distortion model, we incorporate a maximum entropy based lexicalized phrase reordering model (Zens and Ney, 2006) as a feature used in decoding. In this model, we use 4 reordering classes (+1, > 1, -1, < -1) and words from both source and target as features. For source words, we use the current aligned word, the word before the current aligned word and the next aligned word; for target words, we use the previous two words in the immediate history. Using this type of features makes it possible to directly use the maximum entropy model in the decoding process (Zens and Ney, 2006). The maximum entropy models are trained on all events extracted from training data word alignments using the LBFGS algorithm (Malouf, 2002). Overall for decoding, we use between 20 250 to 30 features, whose weights are optimized using MERT (Och, 2003), with an implementation based on the lattice MERT (Macherey et.al., 2008). For parallel training data, we use an in-house collection of parallel documents. They come from various sources with a substantial portion coming from the web after using simple heuristics to identify potential document pairs. Therefore, for some documents in the training data, we do not necessarily have the exact clean translations. Table 2 shows the actual statistics about the training data for all five languages we study. For all 5 SOV languages, we use the target side of the parallel data and some more monolingual text from crawling the web to build 4gram language models. We also collected about 10K English sentences from the web randomly. Among them, 9.5K are used as evaluation data. Those sentences were translated by humans to all 5 SOV languages studied in this paper. Each sentence has only one reference translation. We split them into 3 subsets: dev contains 3,500 sentences, test contains 1,000 sentences and the rest of 5,000 sentences are used in a blindtest set. The dev set is used to perform MERT training, while the test set is used to select trained weights due to some nondeterminism of MERT training. We use IBM BLEU (Papineni et al., 2002) to evaluate our translations and use character level BLEU for Korean and Japanese. 5.1 Preprocessing Reordering and Reordering Models Language Korean We first compare our precedence rules based preprocessing reordering with the maximum entropy based lexicalized reordering model. 
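As a rough illustration of the maximum entropy lexicalized reordering model described in this setup, the sketch below assembles one training event: the four orientation classes follow the definition above, while the feature names and the exact choice of context words are assumptions made for illustration rather than the precise conditioning of Zens and Ney (2006).

```python
def orientation_class(jump):
    """Map the source-side jump between the phrases aligned to two consecutive
    target phrases onto the four orientation classes (+1, >1, -1, <-1)."""
    if jump == 1:
        return "+1"      # monotone step to the next source phrase
    if jump > 1:
        return ">1"      # forward skip
    if jump == -1:
        return "-1"      # swap with the immediately preceding phrase
    return "<-1"         # longer backward jump

def reordering_event(source, cur, tgt_history, jump):
    """One training event for the ME reordering model: an orientation class plus
    word features.  `source` is the source sentence as a token list, `cur` the
    index of the currently aligned source word, `tgt_history` the target words
    produced so far.  Feature names here are illustrative, not the paper's."""
    feats = {
        "S_cur":   source[cur],
        "S_prev":  source[cur - 1] if cur > 0 else "<s>",
        "S_next":  source[cur + 1] if cur + 1 < len(source) else "</s>",
        "T_prev1": tgt_history[-1] if tgt_history else "<s>",
        "T_prev2": tgt_history[-2] if len(tgt_history) > 1 else "<s>",
    }
    return orientation_class(jump), feats
```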
In Table 3, Baseline is our system with both a distance distortion model and the maximum entropy based lexicalized reordering model. For all results reported in this section, we used a maximum allowed reordering distance of 10. In order to see how the lexicalized reordering model performs, we also included systems with and without it (-LR means without it). PR is our proposed approach in this paper. Note that since we apply precedence reordering rules during preprocessing, we can combine this approach with any other reordering models used during decoding. The only difference is that with the precedence reordering, we would have a different phrase table and in the case of LR, different maximum entropy models. In order to implement the precedence rules, we need a dependency parser. We choose to use a deterministic inductive dependency parser (Nivre and Scholz, 2004) for its efficiency and good accuracy. Our implementation of the deterministic dependency parser using maximum entropy models as the underlying classifiers achieves 87.8% labeled attachment score and 88.8% unlabeled attachment score on standard Penn Treebank evaluation. As our results in Table 3 show, for all 5 languages, by using the precedence reordering rules as described in Table 1, we achieve significantly better BLEU scores compared to the baseline system. In the table, We use two stars () to mean that the statistical significance test using the bootstrap method (Koehn, 2004) gives an above 95% significance level when compared to the baselie. We measured the statistical significance level only for the blindtest data. Note that for Korean and Japanese, our precedence reordering rules achieve better absolute BLEU score improvements than for Hindi, Urdu and Turkish. Since we only analyzed English and Korean sentences, it is possible that our rules are more geared toward Korean. Japanese has almost exactly the same word order as Korean, so we could assume 251 Japanese Hindi Urdu Turkish System BL -LR -LR+PR +PR BL -LR -LR+PR +PR BL -LR -LR+PR +PR BL -LR -LR+PR +PR BL -LR -LR+PR +PR dev 25.8 24.7 27.3 27.8 29.5 29.2 30.3 30.7 19.1 17.4 19.6 19.9 9.7 9.1 10.0 10.0 10.0 9.1 10.5 10.5 test 27.0 25.6 28.3 28.7 29.3 29.0 31.0 31.2 18.9 17.1 18.8 18.9 9.5 8.6 9.6 9.8 10.5 10.0 11.0 10.9 blind 26.2 25.1 27.5** 27.9** 29.3 29.0 30.6** 31.1** 18.3 16.4 18.7** 18.8** 8.9 8.2 9.6** 9.6** 9.8 9.0 10.3** 10.4** Table 3: BLEU Scores on Dev, Test and Blindtest for English to 5 SOV Languages with Various Reordering Options (BL means baseline, LR means maximum entropy based lexialized phrase reordering model, PR means precedence rules based preprocessing reordering.) the benefits can carry over to Japanese. 5.2 Reordering Constraints One of our motivations of using the precedence reordering rules is that English will look like SOV languages in word order after reordering. Therefore, even monotone decoding should be able to produce better translations. To see this, we carried out a controlled experiment, using Korean as an example. Clearly, after applying the precedence reordering rules, our English to Korean system is not sensitive to the maximum allowed reordering distance anymore. As shown in Figure 5, without the rules, the blindtest BLEU scores improve monotonically as the allowed reordering distance increases. This indicates that the order difference between English and Korean is very significant. 
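The maximum allowed reordering distance at issue here is the usual phrase-based distortion limit; the tiny check below is only an assumed illustration of how a decoder enforces it, using the 13-word jump from the earlier "English is used ..." example to show why such jumps are cut off without preprocessing reordering.

```python
def within_distortion_limit(prev_end, next_start, limit):
    """Return True if a hypothesis may be extended with a source phrase starting
    at next_start, given the end position of the previously covered source
    phrase and the maximum allowed reordering distance (an illustrative check,
    not the exact implementation used in the experiments)."""
    return abs(next_start - (prev_end + 1)) <= limit

# With a limit of 10, a monotone step is allowed but the 13-word jump needed
# to place "is used" at the end of the translation is not.
assert within_distortion_limit(prev_end=2, next_start=3, limit=10)
assert not within_distortion_limit(prev_end=2, next_start=16, limit=10)
```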
Since smaller allowed reordering distance directly corresponds to decoding time, we can see that with the same decoding speed, our proposed approach can achieve almost 5% BLEU score improvements on blindtest set. 5.3 Preprocessing Reordering and Hierarchical Model The hierarchical phrase-based approach has been successfully applied to several systems (Chiang, 0.28 Language Korean No LexReorder Baseline No LexReorder, with ParserReorder With ParserReorder 0.27 0.26 0.25 Japanese Hindi 0.24 0.23 1 2 4 6 8 10 Maximum Allowed Reordering Distance Urdu Turkish Figure 5: Blindtest BLEU Score for Different Maximum Allowed Reordering Distance for English to Korean Systems with Different Reordering Options System PR Hier PR+Hier PR Hier PR+Hier PR Hier PR+Hier PR Hier PR+Hier PR Hier PR+Hier dev 27.8 27.4 28.5 30.7 30.5 31.0 19.9 20.3 20.0 10.0 10.4 11.2 10.5 11.0 11.1 test 28.7 27.7 29.1 31.2 30.6 31.3 18.9 20.3 19.7 9.8 10.3 10.7 10.9 11.8 11.6 blind 27.9 27.9 28.8** 31.1** 30.5 31.1** 18.8 19.3 19.3 9.6 10.0 10.7** 10.4 10.5 10.9** Blindtest BLEU Score 2005; Zollmann et.al., 2008). Since hierarchical phrase-based systems can capture long distance reordering by using a PSCFG model, we expect it to perform well in English to SOV language systems. We use the same training data as described in the previous sections for building hierarchical systems. The same 4-gram language models are also used for the 5 SOV languages. We adopt the SAMT package (Zollmann and Venugopal, 2006) and follow similar settings as Zollmann et.al., 2008. We allow each rule to have at most 6 items on the source side, including nonterminals and extract rules from initial phrases of maximum length 12. During decoding, we allow application of all rules of the grammar for chart items spanning up to 12 source words. Since our precedence reordering applies at preprocessing step, we can train a hierarchical system after applying the reordering rules. When doing so, we use exactly the same settings as a regular hierarchical system. The results for both hierarchical systems and those combined with the precedence reordering are shown in Table 4, together with the best normal phrase-based systems we copy from Table 3. Here again, we mark any blindtest BLEU score that is better than the corresponding hierarchical system with confidence level above 95%. Note that the hierarchical systems can not use the maximum entropy based lexicalized phrase reordering models. Except for Hindi, applying the precedence reordering rules in a hierarchical system can achieve statistically significant improvements over a normal hierarchical system. We conjecture that this may be because of the simplicity of our reordering rules. 252 Table 4: BLEU Scores on Dev, Test and Blindtest for English to 5 SOV Languages in Hierarchical Phrase-based Systems (PR is precedence rules based preprocessing reordering, same as in Table 3, while Hier is the hierarchical system.) Other than the reordering phenomena covered by our rules in Table 1, there could be still some local or long distance reordering. Therefore, using a hierarchical phrase-based system can improve those cases. Another possible reason is that after the reordering rules apply in preprocessing, English sentences in the training data are very close to the SOV order. As a result, EM training becomes much easier and word alignment quality becomes better. Therefore, a hierarchical phrase-based system can extract better rules and hence achievesbetter translation quality. 
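The 95% confidence marks reported for the blindtest scores come from the bootstrap method (Koehn, 2004). A minimal sketch of paired bootstrap resampling is given below; score_fn stands for whatever corpus-level metric is used (IBM BLEU, or character-level BLEU for Korean and Japanese), and the sampling details are assumptions of this sketch rather than the exact procedure used for the tables.

```python
import random

def paired_bootstrap(score_fn, sys_a, sys_b, refs, samples=1000, seed=0):
    """Paired bootstrap resampling in the spirit of Koehn (2004).

    score_fn(hyps, refs) -- corpus-level metric (e.g. corpus BLEU)
    sys_a, sys_b         -- system outputs, one hypothesis per test sentence
    refs                 -- reference translations aligned with the outputs
    Returns the fraction of resampled test sets on which sys_a outscores sys_b;
    a value of at least 0.95 corresponds to the 95% confidence level."""
    rng = random.Random(seed)
    n, wins = len(refs), 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        a = score_fn([sys_a[i] for i in idx], [refs[i] for i in idx])
        b = score_fn([sys_b[i] for i in idx], [refs[i] for i in idx])
        wins += a > b
    return wins / samples
```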
We also point out that hierarchical phrase-based systems require a chart parsing algorithm during decoding. Compared to the efficient dynamic programming in phrase-based systems, it is much slower. This makes our approach more appealing in a realtime statistical machine translation system. 6 Conclusion In this paper, we present a novel precedence reordering approach based on a dependency parser. We successfully applied this approach to systems translating English to 5 SOV languages: Korean, Japanese, Hindi, Urdu and Turkish. For all 5 languages, we achieve statistically significant improvements in BLEU scores over a state-of-the-art phrasebased baseline system. The amount of training data for the 5 languages varies from around 17M to more than 350M words, including some noisy data from the web. Our proposed approach has shown to be robust and versatile. For 4 out of the 5 languages, our approach can even significantly improve over a hierarchical phrase-based baseline system. As far as we know, we are the first to show that such reordering rules benefit several SOV languages. We believe our rules are flexible and can cover many linguistic reordering phenomena. The format of our rules also makes it possible to automatically extract rules from word aligned corpora. In the future, we plan to investigate along this direction and extend the rules to languages other than SOV. The preprocessing reordering like ours is known to be sensitive to parser errors. Some preliminary error analysis already show that indeed some sentences suffer from parser errors. In the recent years, several studies have tried to address this issue by using a word lattice instead of one reordering as input (Zhang et.al., 2007; Li et.al., 2007; Elming, 2008). Although there is clearly room for improvements, we also feel that using one reordering during training may not be good enough either. It would be very interesting to investigate ways to have efficient procedure for training EM models and getting word alignments using word lattices on the source side of the parallel data. Along this line of research, we think some kind of tree-to-string model (Liu et.al., 2006) could be interesting directions to pursue. References Yaser Al-Onaizan and Kishore Papineni 2006. Distortion Models for Statistical Machine Translation In Proceedings of ACL Pi-Chuan Chang and Kristina Toutanova 2007. A Discriminative Syntactic Word Order Model for Machine Translation In Proceedings of ACL David Chiang 2005. A Hierarchical Phrase-based Model for Statistical Machine Translation In Proceedings of ACL Michael Collins, Philipp Koehn and Ivona Kucerova 2005. Clause Restructuring for Statistical Machine Translation In Proceedings of ACL Jakob Elming 2008. Syntactic Reordering Integrated with Phrasebased SMT In Proceedings of COLING Michel Galley and Christopher D. Manning 2008. A Simple and Effective Hierarchical Phrase Reordering Model In Proceedings of EMNLP Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang and Ignacio Thayer 2006. Scalable Inference and Training of Context-Rich Syntactic Translation Models In Proceedings of COLING-ACL Nizar Habash 2007. Syntactic Preprocessing for Statistical Machine Translation In Proceedings of 11th MT Summit Liang Huang and David Chiang 2007. Forest Rescoring: Faster Decoding with Integrated Language Models, In Proceedings of ACL Philipp Koehn 2004. 
Statistical Significance Tests for Machine Translation Evaluation In Proceedings of EMNLP Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Chris Callison-Burch, Miles Osborne and David Talbot 2005. Edinborgh System Description for the 2005 IWSLT Speech Translation Evaluation In International Workshop on Spoken Language Translation Philipp Koehn, Franz J. Och and Daniel Marcu 2003. Statistical Phrase-based Translation, In Proceedings of HLT-NAACL Chi-Ho Li, Dongdong Zhang, Mu Li, Ming Zhou, Minghui Li and Yi Guan 2007. A Probabilistic Approach to Syntax-based Reordering for Statistical Machine Translation, In Proceedings of ACL Yang Liu, Qun Liu and Shouxun Lin 2006. Tree-to-string Alignment Template for Statistical Machine Translation, In Proceedings of COLING-ACL Wolfgang Macherey, Franz J. Och, Ignacio Thayer and Jakob Uszkoreit 2008. Lattice-based Minimum Error Rate Training for Statistical Machine Translation In Proceedings of EMNLP Robert Malouf 2002. A comparison of algorithms for maximum entropy parameter estimation In Proceedings of the Sixth Workshop on Computational Language Learning (CoNLL-2002) Joakim Nivre and Mario Scholz 2004. Deterministic Dependency Parsing for English Text. In Proceedings of COLING Franz J. Och 2002. Statistical Machine Translation: From Single Word Models to Alignment Template Ph.D. Thesis, RWTH Aachen, Germany Franz J. Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of ACL Franz J. Och and Hermann Ney 2004. The Alignment Template Approach to Statistical Machine Translation. Computational Linguistics, 30:417-449 Kishore Papineni, Roukos, Salim et al. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL Chris Quirk, Arul Menezes and Colin Cherry 2005. Dependency Tree Translation: Syntactically Informed Phrasal SMT In Proceedings of ACL Christoph Tillmann 2004. A Block Orientation Model for Statistical Machine Translation In Proceedings of HLT-NAACL Chao Wang, Michael Collins and Philipp Koehn 2007. Chinese Syntactic Reordering for Statistical Machine Translation In Proceedings of EMNLP-CoNLL Dekai Wu 1997. Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpus In Computational Linguistics 23(3):377-403 Fei Xia and Michael McCord 2004. Improving a Statistical MT System with Automatically Learned Rewrite Patterns In Proceedings of COLING Deyi Xiong, Qun Liu and Shouxun Lin 2006. Maximum Entropy Based Phrase Reordering Model for Statistical Machine Translation In Proceedings of COLING-ACL Kenji Yamada and Kevin Knight 2001. A Syntax-based Statistical Translation Model In Proceedings of ACL Yuqi Zhang, Richard Zens and Hermann Ney 2007. Improve Chunklevel Reordering for Statistical Machine Translation In Proceedings of IWSLT Richard Zens and Hermann Ney 2006. Discriminative Reordering Models for Statistical Machine Translation In Proceedings of the Workshop on Statistical Machine Translation, HLT-NAACL pages 55-63 Andreas Zollmann and Ashish Venugopal 2006. Syntax Augmented Machine Translation via Chart Parsing In Proceedings of NAACL 2006 - Workshop on Statistical Machine Translation Andreas Zollmann, Ashish Venugopal, Franz Och and Jay Ponte 2008. A Systematic Comparison of Phrase-Based, Hierarchical and Syntax-Augmented Statistical MT In Proceedings of COLING 253 Learning Bilingual Linguistic Reordering Model for Statistical Machine Translation Han-Bin Chen, Jian-Cheng Wu and Jason S. 
Chang Department of Computer Science National Tsing Hua University 101, Guangfu Road, Hsinchu, Taiwan {hanbin,d928322,jschang}@cs.nthu.edu.tw Abstract In this paper, we propose a method for learning reordering model for BTG-based statistical machine translation (SMT). The model focuses on linguistic features from bilingual phrases. Our method involves extracting reordering examples as well as features such as part-of-speech and word class from aligned parallel sentences. The features are classified with special considerations of phrase lengths. We then use these features to train the maximum entropy (ME) reordering model. With the model, we performed Chinese-to-English translation tasks. Experimental results show that our bilingual linguistic model outperforms the state-of-the-art phrase-based and BTG-based SMT systems by improvements of 2.41 and 1.31 BLEU points respectively. sequences of consecutive phrases, mapping to cells in a CKY matrix, are then translated through a bilingual phrase table and scored as implemented in (Koehn et al., 2005; Chiang, 2005). In other words, their system shares the same phrase table with standard phrase-based SMT systems. 3 3 three years ago A1 after three A2 A2 (a) years A1 (b) Figure 1: Two reordering examples, with straight rule applied in (a), and inverted rule in (b). On the other hand, there are various proposed BTG reordering models to predict correct orientations between neighboring blocks (bilingual phrases). In Figure 1, for example, the role of reordering model is to predict correct orientations of neighboring blocks A1 and A2. In flat model (Wu, 1996; Zens et al., 2004; Kumar and Byrne, 2005), reordering probabilities are assigned uniformly during decoding, and can be tuned depending on different language pairs. It is clear, however, that this kind of model would suffer when the dominant rule is wrongly applied. Predicting orientations in BTG depending on context information can be achieved with lexical features. For example, Xiong et al. (2006) proposed MEBTG, based on maximum entropy (ME) classification with words as features. In MEBTG, first words of blocks are considered as the features, which are then used to train a ME model 1 Introduction Bracketing Transduction Grammar (BTG) is a special case of Synchronous Context Free Grammar (SCFG), with binary branching rules that are either straight or inverted. BTG is widely adopted in SMT systems, because of its good trade-off between efficiency and expressiveness (Wu, 1996). In BTG, the ratio of legal alignments and all possible alignment in a translation pair drops drastically especially for long sentences, yet it still covers most of the syntactic diversities between two languages. It is common to utilize phrase translation in BTG systems. For example in (Xiong et al., 2006), source sentences are segmented into phrases. Each 254 Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 254­262, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics for predicting orientations of neighboring blocks. Xiong et al. (2008b) proposed a linguistically annotated BTG (LABTG), in which linguistic features such as POS and syntactic labels from source-side parse trees are used. Both MEBTG and LABTG achieved significant improvements over phrase-based Pharaoh (Koehn, 2004) and Moses (Koehn et al., 2007) respectively, on Chinese-to-English translation tasks. 
Nes Nf Nv the details of 14 49 50 the plan 14 18 DE Na tion on target one, as shown in Figure 2. Additionally, features are extracted and classified depending on lengths of blocks in order to obtain a more informed model. The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the model used in our BTG-based SMT systems. Section 4 formally describes our bilingual linguistic reordering model. Section 5 and Section 6 explain the implementation of our systems. We show the experimental results in Section 7 and make the conclusion in Section 8. A2 A1 2 Related Work Figure 2: An inversion reordering example, with POS below source words, and class numbers below target words. However, current BTG-based reordering methods have been limited by the features used. Information might not be sufficient or representative, if only the first (or tail) words are used as features. For example, in Figure 2, consider target first-word features extracted from an inverted reordering example (Xiong et al., 2006) in MEBTG, in which first words on two blocks are both "the". This kind of feature set is too common and not representative enough to predict the correct orientation. Intuitively, one solution is to extend the feature set by considering both boundary words, forming a more complete boundary description. However, this method is still based on lexicalized features, which causes data sparseness problem and fails to generalize. In Figure 2, for example, the orientation should basically be the same, when the source/target words "/plan" from block A1 is replaced by other similar nouns and translations (e.g. "plans", "events" or "meetings"). However, such features would be treated as unseen by the current ME model, since the training data can not possibly cover all such similar cases. In this paper we present an improved reordering model based on BTG, with bilingual linguistic features from neighboring blocks. To avoid data sparseness problem, both source and target words are classified; we perform part-of-speech (POS) tagging on source language, and word classifica255 In statistical machine translation, reordering model is concerned with predicting correct orders of target language sentence given a source language one and translation pairs. For example, in phrase-based SMT systems (Koehn et al., 2003; Koehn, 2004), distortion model is used, in which reordering probabilities depend on relative positions of target side phrases between adjacent blocks. However, distortion model can not model long-distance reordering, due to the lack of context information, thus is difficult to predict correct orders under different circumstances. Therefore, while phrase-based SMT moves from words to phrases as the basic unit of translation, implying effective local reordering within phrases, it suffers when determining phrase reordering, especially when phrases are longer than three words (Koehn et al., 2003). There have been much effort made to improve reordering model in SMT. For example, researchers have been studying CKY parsing over the last decade, which considers translations and orientations of two neighboring block according to grammar rules or context information. In hierarchical phrase-based systems (Chiang, 2005), for example, SCFG rules are automatically learned from aligned bilingual corpus, and are applied in CKY style decoding. As an another application of CKY parsing technique is BTG-based SMT. Xiong et al. (2006) and Xiong et al. 
(2008a) developed MEBTG systems, in which first or tail words from reordering examples are used as features to train ME-based reordering models. Similarly, Zhang et al. (2007) proposed a model similar to BTG, which uses first/tail words of phrases, and syntactic labels (e.g. NP and VP) from source parse trees as features. In their work, however, inverted rules are allowed to apply only when source phrases are syntactic; for nonsyntactic ones, blocks are combined straight with a constant score. More recently, Xiong et al. (2008b) proposed LABTG, which incorporates linguistic knowledge by adding features such as syntactic labels and POS from source trees to improve their MEBTG. Different from Zhang's work, their model do not restrict non-syntactic phrases, and applies inverted rules on any pair of neighboring blocks. Although POS information is used in LABTG and Zhang's work, their models are syntax-oriented, since they focus on syntactic labels. Boundary POS is considered in LABTG only when source phrases are not syntactic phrases. In contrast to the previous works, we present a reordering model for BTG that uses bilingual information including class-level features of POS and word classes. Moreover, our model is dedicated to boundary features and considers different combinations of phrase lengths, rather than only first/tail words. In addition, current state-of-the-art Chinese parsers, including the one used in LABTG (Xiong et al., 2005), lag beyond in inaccuracy, compared with English parsers (Klein and Manning, 2003; Petrov and Klein 2007). In our work, we only use more reliable information such as Chinese word segmentation and POS tagging (Ma and Chen, 2003). Preo ( A1 , A2 , order) reo where order {straight, inverted}. In MEBTG, a ME reordering model is trained using features extracted from reordering examples of aligned parallel corpus. First words on neighboring blocks are used as features. In reordering example (a), for example, the feature set is {"S1L=three", "S2L=ago", "T1L=3", "T2L="} where "S1" and "T1" denote source and target phrases from the block A1. Rule (3) is lexical translation rule, which translates source phrase x into target phrase y. We use the same feature functions as typical phrase-based SMT systems (Koehn et al., 2005): Ptrans ( x | y ) p( x | y ) 1 p ( y | x) 2 plw ( x | y ) 3 plw ( y | x) 4 e 5 e y 6 where plw ( x | y ) 3 plw ( y | x) 4 , e 5 and e are lexical translation probabilities in both directions, phrase penalty and word penalty. During decoding, the blocks are produced by applying either one of two reordering rules on two smaller blocks, or applying lexical rule (3) on some source phrase. Therefore, the score of a block A is defined as y 6 3 The Model P( A) P( A1 ) P( A2 ) Plm ( A1 , A2 )lm Preo ( A1 , A2 , order)reo or Following Wu (1996) and Xiong et al. (2006), we implement BTG-based SMT as our system, in which three rules are applied during decoding: A A1 A A1 A2 A2 (1) (2) (3) P( A) Plm ( A) lm Ptrans ( x | y ) where Plm ( A) lm and Plm ( A1 , A2 ) lm are respectively the usual and incremental score of language model. To tune all lambda weights above, we perform minimum error rate training (Och, 2003) on the development set described in Section 7. Let B be the set of all blocks with source side sentence C. Then the best translation of C is the target side of the block A , where A x/ y where A1 and A2 are blocks in source order. Straight rule (1) and inverted rule (2) are reordering rules. 
They are applied for predicting target-side order when combining two blocks, and form the reordering model with the distributions 256 A argmax P( A) AB 4 Bilingual Linguistic Model In this section, we formally describe the problem we want to address and the proposed method. 4.1 Problem Statement We focus on extracting features representative of the two neighboring blocks being considered for reordering by the decoder, as described in Section 3. We define S(A) and T(A) as the information on source and target side of a block A. For two neighboring blocks A1 and A2, the set of features extracted from information of them is denoted as feature set function F(S(A1), S(A2), T(A1), S(A2)). In Figure 1 (b), for example, S(A1) and T(A1) are simply the both sides sentences "3 " and "three years", and F(S(A1), S(A2), T(A1), S(A2)) is {"S1L=three", "S2L=after", "T1L=3", "T2L="} where "S1L" denotes the first source word on the block A1, and "T2L" denotes the first target word on the block A2. Given the adjacent blocks A1 and A2, our goal includes (1) adding more linguistic and representative information to A1 and A2 and (2) finding a feature set function F' based on added linguistic information in order to train a more linguistically motivated and effective model. 4.2 Word Classification vides more fine-grained tags, including many tags with semantic information (e.g., Nc for place nouns, Nd for time nouns), and verb transitivity and subcategorization (e.g., VA for intransitive verbs, VC for transitive verbs, VK for verbs that take a clause as object). On the other hand, using the POS features in combination with the lexical features in target language will cause another sparseness problem in the phrase table, since one source phrase would map to multiple target ones with different POS sequences. As an alternative, we use mkcls toolkit (Och, 1999), which uses maximum-likelihood principle to perform classification on target side. After classification, the toolkit produces a many-to-one mapping between English tokens and class numbers. Therefore, there is no ambiguity of word class in target phrases and word class features can be used independently to avoid data sparseness problem and the phrase table remains unchanged. As mentioned in Section 1, features based on words are not representative enough in some cases, and tend to cause sparseness problem. By classifying words we are able to linguistically generalize the features, and hence predict the rules more robustly. In Figure 2, for example, the target words are converted to corresponding classes, and form the more complete boundary feature set {"T1L=14", "T1R=18", "T2L=14", "T2R=50"} (4) In the feature set (4), #14 is the class containing "the", #18 is the class containing "plans", and #50 is the class containing "of." Note that we add lastword features "T1R=18" and "T2R=50". As mentioned in Section 1, the word "plan" from block A1 is replaceable with similar nouns. This extends to other nominal word classes to realize the general rule of inverting "the ... NOUN" and "the ... of". It is hard to achieve this kind of generality using only lexicalized feature. With word classification, we gather feature sets with similar concepts from the training data. Table 1 shows the word classes can be used effectively to cope with data sparseness. 
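Feature sets like (4) are obtained by looking up the boundary words of two neighboring blocks in the source POS tagging and the mkcls class map. The sketch below shows that lookup for the Figure 2 blocks; the source-side POS sequences and the S1L/S1R naming are illustrative assumptions, while the target class numbers are the ones quoted in the running example (the=14, plan=18, details=49, of=50).

```python
def boundary_features(block1, block2):
    """Class-level boundary features for two neighboring blocks.

    Each block is a dict with 'src_pos' (CKIP POS tags of its source phrase)
    and 'tgt_cls' (mkcls class ids of its target phrase).  The T1L/T1R/T2L/T2R
    names follow the running example; the source-side names are analogous and
    purely illustrative."""
    feats = {
        "S1L": block1["src_pos"][0],  "S1R": block1["src_pos"][-1],
        "S2L": block2["src_pos"][0],  "S2R": block2["src_pos"][-1],
        "T1L": block1["tgt_cls"][0],  "T1R": block1["tgt_cls"][-1],
        "T2L": block2["tgt_cls"][0],  "T2R": block2["tgt_cls"][-1],
    }
    return {f"{k}={v}" for k, v in feats.items()}

# Blocks from the Figure 2 example ("the plan" / "the details of"); the source
# POS sequences here are placeholders, the target class ids are those quoted.
a1 = {"src_pos": ["Nv"],       "tgt_cls": [14, 18]}
a2 = {"src_pos": ["DE", "Na"], "tgt_cls": [14, 49, 50]}
assert {"T1L=14", "T1R=18", "T2L=14", "T2R=50"} <= boundary_features(a1, a2)
```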
For example, the feature set (4) occurs 309 times in our training data, and only 2 of them are straight, with the remaining 307 inverted examples, implying that similar features based on word classes lead to similar orientation. Additional examples of similar feature sets with different word classes are shown in Table 1. As described in Section 1, designing a more complete feature set causes data sparseness problem, if we use lexical features. One natural solution is using POS and word class features. In our model, we perform Chinese POS tagging on source language. In Xiong et al. (2008b) and Zhang et al. (2007), Chinese parsers with Penn Chinese Treebank (Xue et al., 2005) style are used to derive source parse trees, from which sourceside features such as POS are extracted. However, due to the relatively low accuracy of current Chinese parsers compared with English ones, we instead use CKIP Chinese word segmentation system (Ma and Chen, 2003) in order to derive Chinese tags with high accuracy. Moreover, compared with the Treebank Chinese tagset, the CKIP tagset pro257 class X 9 18 20 48 T1R = X graph, government plans, events bikes, motors day, month, year straight/inverted 2/488 2/307 0/694 4/510 A1 Nh I A2 VE think A1 P for A2 Neqa Na these reasons Table 1: List of feature sets in the form of {"T1L=14", "T1R=X", "T2L=14", "T2R=50"}. 4.3 (a) M class A1 A2 Na Caa Na technology and equipment (b) L class A1 P Nc in Jordan A2 VC Na hold meeting Feature with Length Consideration Boundary features using both the first and last words provide more detailed descriptions of neighboring blocks. However, we should take the special case blocks with length 1 into consideration. For example, consider two features sets from straight and inverted reordering examples (a) and (b) in Figure 3. There are two identical source features in both feature set, since first words on block A1 and last words on block A2 are the same: {"S1L=P","S2R=Na"} F(S(A1),S(A2),T(A1), S(A2)) Therefore, without distinguishing the special case, the features would represent quite different cases with the same feature, possibly leading to failure to predict orientations of two blocks. We propose a method to alleviate the problem of features with considerations of lengths of two adjacent phrases by classifying both the both source and target phrase pairs into one of four classes: M, L, R and B, corresponding to different combinations of phrase lengths. Suppose we are given two neighboring blocks A1 and A2, with source phrases P1 and P2 respectively. Then the feature set from source side is classified into one of the classes as follows. We give examples of feature set for each class according to Figure 4. P Neqa Na for these reasons (c) R class (d) B class Figure 4: Examples of different length combinations, mapping to four classes. 1. M class. The lengths of P1 and P2 are both 1. In Figure 4 (a), for example, the feature set is {"M1=Nh", "M2=VE"} 2. L class. The length of P1 is 1, and the length of P2 is greater than 1. In Figure 4 (b), for example, the feature set is {"L1=P", "L2=Neqa", "L3=Na"} 3. R class. The length of P1 is greater than 1, and the length of P2 is 1. In Figure 4 (c), for example, the feature set is {"R1=Na", "R2=Caa", "R3=Na"} 4. B class. The lengths of P1 and P2 are both greater than 1. In Figure 4 (d), for example, the feature set is {"B1=P", "B2=Nc", "B3=VC", "B4=Na"} We use the same scheme to classify the two target phrases. 
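To make the length-class scheme concrete, here is a minimal Python sketch (not the authors' code) of how the M/L/R/B boundary features could be derived from the POS or word-class sequences of two neighboring phrases. The position-to-feature mapping follows the examples above; the tag sequences and the "S."/"T." prefixes are taken from the paper's examples, while the function itself is an assumption for illustration.

```python
def length_class_features(tags1, tags2, prefix=""):
    """Map the tag sequences of two neighboring phrases to one of the
    four length classes M, L, R, B and return boundary features."""
    len1, len2 = len(tags1), len(tags2)
    if len1 == 1 and len2 == 1:            # M: both phrases are single words
        feats = {"M1": tags1[0], "M2": tags2[0]}
    elif len1 == 1:                        # L: left phrase is a single word
        feats = {"L1": tags1[0], "L2": tags2[0], "L3": tags2[-1]}
    elif len2 == 1:                        # R: right phrase is a single word
        feats = {"R1": tags1[0], "R2": tags1[-1], "R3": tags2[0]}
    else:                                  # B: both phrases are multi-word
        feats = {"B1": tags1[0], "B2": tags1[-1],
                 "B3": tags2[0], "B4": tags2[-1]}
    return {prefix + k + "=" + v for k, v in feats.items()}

# Figure 4 (d): "in Jordan" (P Nc) next to "hold meeting" (VC Na), source side.
src = length_class_features(["P", "Nc"], ["VC", "Na"], prefix="S.")
# Target side uses mkcls class IDs instead of POS tags, e.g. classes 14/18 and 14/50.
tgt = length_class_features(["14", "18"], ["14", "50"], prefix="T.")
print(sorted(src | tgt))
```

Running this on the B-class example reproduces the bilingual feature set shown at the start of Section 5 (e.g. "S.B1=P", ..., "T.B4=50"), which is the representation fed to the ME trainer.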
Since both source and target words are classified as described in Section 4.2, the feature sets are more representative and tend to lead to consistent prediction of orientation. Additionally, the length-based features are easy to fit into memory, in contrast to lexical features in MEBTG. To summarize, we extract features based on word lengths, target-language word classes, and fine-grained, semantic oriented parts of speech. To illustrate, we use the neighboring blocks from Fig- P Nc VC Na hold meeting A1 A2 A1 (b) A2 (a) in jordan Figure 3: Two reordering examples with ambiguous features on source side. 258 ure 2 to show an example of complete bilingual linguistic feature set: {"S.B1=Nes", "S.B2=Nv", "S.B3=DE", "S.B4=Na", "T.B1=14", "T.B2=18", "T.B3=14", "T.B4=50"} where "S." and "T." denote source and target sides. In the next section, we describe the process of preparing the feature data and training an ME model. In Section 7, we perform evaluations of this ME-based reordering model against standard phrase-based SMT and previous work based on ME and BTG. 6 Decoding 5 Training In order to train the translation and reordering model, we first set up Moses SMT system (Koehn et al., 2007). We obtain aligned parallel sentences and the phrase table after the training of Moses, which includes running GIZA++ (Och and Ney, 2003), grow-diagonal-final symmetrization and phrase extraction (Koehn et al., 2005). Our system shares the same translation model with Moses, since we directly use the phrase table to apply translation rules (3). On the other side, we use the aligned parallel sentences to train our reordering model, which includes classifying words, extracting bilingual phrase samples with orientation information, and training an ME model for predicting orientation. To perform word classification, the source sentences are tagged and segmented before the Moses training. As for target side, we ran the Moses scripts to classify target language words using the mkcls toolkit before running GIZA++. Therefore, we directly use its classification result, which generate 50 classes with 2 optimization runs on the target sentences. To extract the reordering examples, we choose sentence pairs with top 50% alignment scores provided by GIZA++, in order to fit into memory. Then the extraction is performed on these aligned sentence pairs, together with POS tags and word classes, using basically the algorithm presented in Xiong et al. (2006). However, we enumerate all reordering examples, rather than only extract the smallest straight and largest inverted examples. Finally, we use the toolkit by Zhang (2004) to train the ME model with extracted reordering examples. 259 We develop a bottom-up CKY style decoder in our system, similar to Chiang (2005). For a Chinese sentence C, the decoder finds its best translation on the block with entire C on source side. The decoder first applies translation rules (3) on cells in a CKY matrix. Each cell denotes a sequence of source phrases, and contains all of the blocks with possible translations. The longest length of source phrase to be applied translations rules is restricted to 7 words, in accordance with the default settings of Moses training scripts. To reduce the search space, we apply threshold pruning and histogram pruning, in which the block scoring worse than 10-2 times the best block in the same cell or scoring worse than top 40 highest scores would be pruned. These pruning techniques are common in SMT systems. 
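As a rough illustration of the pruning step just described (and not the actual decoder), the following sketch applies threshold pruning at 10^-2 of the best score and histogram pruning with a beam of 40 to the candidate blocks of a single CKY cell; the block representation is hypothetical.

```python
def prune_cell(blocks, threshold=1e-2, beam_size=40):
    """Prune the hypothesis blocks of one CKY cell.

    blocks: list of (score, block) pairs, where score is the block
    probability P(A). A block survives only if its score is at least
    `threshold` times the best score in the cell (threshold pruning)
    and it ranks among the `beam_size` best blocks (histogram pruning).
    """
    if not blocks:
        return blocks
    ranked = sorted(blocks, key=lambda sb: sb[0], reverse=True)
    best = ranked[0][0]
    return [(s, b) for s, b in ranked[:beam_size] if s >= threshold * best]

# Hypothetical cell with three candidate blocks.
cell = [(0.30, "block-a"), (0.0025, "block-b"), (0.28, "block-c")]
print(prune_cell(cell))   # block-b is dropped: 0.0025 < 0.01 * 0.30
```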
We also apply recombination, which distinguish blocks in a cell only by 3 leftmost and rightmost target words, as suggested in (Xiong et al., 2006). 7 Experiments and Results We perform Chinese-to-English translation task on NIST MT-06 test set, and use Moses and MEBTG as our competitors. The bilingual training data containing 2.2M sentences pairs from Hong Kong Parallel Text (LDC2004T08) and Xinhua News Agency (LDC2007T09), with length shorter than 60, is used to train the translation and reordering model. The source sentences are tagged and segmented with CKIP Chinese word segmentation system (Ma and Chen, 2003). About 35M reordering examples are extracted from top 1.1M sentence pairs with higher alignment scores. We generate 171K features for lexicalized model used in MEBTG system, and 1.41K features for our proposed reordering model. For our language model, we use Xinhua news from English Gigaword Third Edition (LDC2007T07) to build a trigram model with SRILM toolkit (Stolcke, 2002). Our development set for running minimum error rate training is NIST MT-08 test set, with sentence lengths no more than 20. We report the experimental results on NIST MT-06 test set. Our evaluation metric is BLEU (Papineni et al., 2002) with caseinsensitive matching from unigram to four-gram. System Moses(distortion) Moses(lexicalized) MEBTG WC+LC BLEU-4 22.55 23.42 23.65 24.96 System MEBTG WC+MEBTG Feature size 171K 0.24K BLEU-4 23.65 23.79 Table 3: Performances of lexicalized and word classified MEBTG. System MEBTG Boundary LC Feature size 171K 349K 780K BLEU-4 23.65 23.42 23.86 Table 2: Performances of various systems. The overall result of our experiment is shown in Table 2. The lexicalized MEBTG system proposed by Xiong et al. (2006) uses first words on adjacent blocks as lexical features, and outperforms phrasebased Moses with default distortion model and enhanced lexicalized model, by 1.1 and 0.23 BLEU points respectively. This suggests lexicalized Moses and MEBTG with context information outperforms distance-based distortion model. Besides, MEBTG with structure constraints has better global reordering estimation than unstructured Moses, while incorporating their local reordering ability by using phrase tables. The proposed reordering model trained with word classification (WC) and length consideration (LC) described in Section 4 outperforms MEBTG by 1.31 point. This suggests our proposed model not only reduces the model size by using 1% fewer features than MEBTG, but also improves the translation quality. We also evaluate the impacts of WC and LC separately and show the results in Table 3-5. Table 3 shows the result of MEBTG with word classified features. While classified MEBTG only improves 0.14 points over original lexicalized one, it drastically reduces the feature size. This implies WC alleviates data sparseness by generalizing the observed features. Table 4 compares different length considerations, including boundary model demonstrated in Section 4.2, and the proposed LC in Section 4.3. Although boundary model describes features better than using only first words, which we will show later, it suffers from data sparseness with twice feature size of MEBTG. The LC model has the largest feature size but performs best among three systems, suggesting the effectiveness of our LC. In Table 5 we show the impacts of WC and LC together. Note that all the systems with WC significantly reduce the size of features compared to lexicalized ones. 
Table 4: Performances of BTG systems with different representativeness. System MEBTG WC+MEBTG WC+Bounary WC+LC Feature size 171K 0.24K 0.48K 1.41K BLEU-4 23.65 23.79 24.29 24.96 Table 5: Different representativeness with word classification. While boundary model is worse than first-word MEBTG in Table 4, it outperforms the latter when both are performed WC. We obtain the best result that outperforms the baseline MEBTG by more than 1 point when we apply WC and LC together. Our experimental results show that we are able to ameliorate the sparseness problem by classifying words, and produce more representative features by considering phrase length. Moreover, they are both important, in that we are unable to outperform our competitors by a large margin unless we combine both WC and LC. In conclusion, while designing more representative features of reordering model in SMT, we have to find solutions to generalize them. 8 Conclusion and Future Works We have proposed a bilingual linguistic reordering model to improve current BTG-based SMT systems, based on two drawbacks of previously proposed reordering model, which are sparseness and representative problem. First, to solve the sparseness problem in previously proposed lexicalized model, we perform word classification on both sides. 260 Secondly, we present a more representative feature extraction method. This involves considering length combinations of adjacent phrases. The experimental results of Chinese-to-English task show that our model outperforms baseline phrase-based and BTG systems. We will investigate more linguistic ways to classify words in future work, especially on target language. For example, using word hierarchical structures in WordNet (Fellbaum, 1998) system provides more linguistic and semantic information than statistically-motivated classification tools. Wei-Yun Ma and Keh-Jiann Chen. 2003. Introduction to CKIP Chinese Word Segmentation System for the First International Chinese Word Segmentation Bakeoff. In Proceedings of ACL, Second SIGHAN Workshop on Chinese Language Processing, pp168171. Franz Josef Och. 1999. An efficient method for determining bilingual word classes. In EACL '99: Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 71­76, Bergen, Norway, June. Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29:19-51. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of ACL 2003, pages 160-167. Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the ACL, pages 311­318. Slav Petrov and Dan Klein. 2007. Improved Inferencefor Unlexicalized Parsing. In Proceedings of HLTNAACL 2007. Andreas Stolcke. 2002. SRILM ­ an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, volume 2, pages 901­904. Dekai Wu. 1996. A Polynomial-Time Algorithm for Statistical Machine Translation. In Proceedings of ACL 1996. Deyi Xiong, Shuanglong Li, Qun Liu, Shouxun Lin, and Yueliang Qian. 2005. Parsing the Penn Chinese treebank with semantic knowledge. In Proceedings of IJCNLP 2005, pages 70-81. Deyi Xiong, Qun Liu and Shouxun Lin. 2006. Maximum Entropy Based Phrase Reordering Model for Statistical Machine Translation. In Proceedings of ACL-COLING 2006. 
Deyi Xiong, Min Zhang, Aiti Aw, Haitao Mi, Qun Liu, and Shouxun Liu. 2008a. Refinements in BTG-based statistical machine translation. In Proceedings of IJCNLP 2008, pp. 505-512. Deyi Xiong, Min Zhang, Ai Ti Aw, and Haizhou Li. 2008b. Linguistically Annotated BTG for Statistical Machine Translation. In Proceedings of COLING 2008. Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005. The Penn Chinese Treebank: Phrase Acknowledgements This work was supported by National Science Council of Taiwan grant NSC 95-2221-E-007-182MY3. References David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of ACL 2005, pp. 263-270. Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts. Philipp Koehn, Franz Joseph Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proceedings of HLT/NAACL 2003. Philipp Koehn. 2004. Pharaoh: a Beam Search Decoder for Phrased-Based Statistical Machine Translation Models. In Proceedings of AMTA 2004. Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Chris Callison-Burch, Miles Osborne and David Talbot. 2005. Edinburgh System Description for the 2005 IWSLT Speech Translation Evaluation. In International Workshop on Spoken Language Translation. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan,Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constrantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL 2007, Demonstration Session. Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. In Proceedings of ACL 2003. Shankar Kumar and William Byrne. 2005. Local phrase reordering models for statistical machine translation. In Proceedings of HLT-EMNLP 2005. 261 structure annotation of a large corpus. Natural Language Engineering, 11(2):207­238. R. Zens, H. Ney, T. Watanabe, and E. Sumita. 2004. Reordering Constraints for Phrase-Based Statistical Machine Translation. In Proceedings of CoLing 2004, Geneva, Switzerland, pp. 205-211. Le Zhang. 2004. Maximum Entropy Modeling Toolkit for Python and C++. Available at http://homepa ges.inf.ed.ac.uk/s0450736/maxent_toolkit.html. Dongdong Zhang, Mu Li, Chi-Ho Li and Ming Zhou. 2007. Phrase Reordering Model Integrating Syntactic Knowledge for SMT. In Proceedings of EMNLPCoNLL 2007. 262 May All Your Wishes Come True: A Study of Wishes and How to Recognize Them Andrew B. Goldberg, Nathanael Fillmore, David Andrzejewski Zhiting Xu, Bryan Gibson, Xiaojin Zhu Computer Sciences Department, University of Wisconsin-Madison, Madison, WI 53706, USA {goldberg, nathanae, andrzeje, zhiting, bgibson, jerryzhu}@cs.wisc.edu Abstract A wish is "a desire or hope for something to happen." In December 2007, people from around the world offered up their wishes to be printed on confetti and dropped from the sky during the famous New Year's Eve "ball drop" in New York City's Times Square. We present an in-depth analysis of this collection of wishes. We then leverage this unique resource to conduct the first study on building general "wish detectors" for natural language text. Wish detection complements traditional sentiment analysis and is valuable for collecting business intelligence and insights into the world's wants and desires. 
We demonstrate the wish detectors' effectiveness on domains as diverse as consumer product reviews and online political discussions. corpus. Some are far-reaching fantasies and aspirations, while others deal with everyday concerns like economic and medical distress. We analyze this first-of-its-kind corpus in Section 2. The New Oxford American Dictionary defines "wish" as "a desire or hope for something to happen." How wishes are expressed, and how such wishful expressions can be automatically recognized, are open questions in natural language processing. Leveraging the WISH corpus, we conduct the first study on building general "wish detectors" for natural language text, and demonstrate their effectiveness on domains as diverse as consumer product reviews and online political discussions. Such wish detectors have tremendous value in collecting business intelligence and public opinions. We discuss the wish detectors in Section 3, and experimental results in Section 4. 1.1 Relation to Prior Work 1 Introduction Each year, New York City rings in the New Year with the famous "ball drop" in Times Square. In December 2007, the Times Square Alliance, coproducer of the Times Square New Year's Eve Celebration, launched a Web site called the Virtual Wishing Wall1 that allowed people around the world to submit their New Year's wishes. These wishes were then printed on confetti and dropped from the sky at midnight on December 31, 2007 in sync with the ball drop. We obtained access to this set of nearly 100,000 New Year's wishes, which we call the "WISH corpus." Table 1 shows a selected sample of the WISH 1 http://www.timessquarenyc.org/nye/nye interactive.html Studying wishes is valuable in at least two aspects: 1. Being a special genre of subjective expression, wishes add a novel dimension to sentiment analysis. Sentiment analysis is often used as an automatic market research tool to collect valuable business intelligence from online text (Pang and Lee, 2008; Shanahan et al., 2005; Koppel and Shtrimberg, 2004; Mullen and Malouf, 2008). Wishes differ from the recent focus of sentiment analysis, namely opinion mining, by revealing what people explicitly want to happen, not just what they like or dislike (Ding et al., 2008; Hu and Liu, 2004). For example, wishes in product reviews could contain new feature requests. Consider the following (real) prod- Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 263­271, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics 263 514 351 331 244 112 76 75 51 21 21 16 16 8 7 6 5 5 5 1 1 1 1 1 1 peace on earth peace world peace happy new year love health and happiness to be happy i wish for world peace i wish for health and happiness for my family let there be peace on earth i wish u to call me if you read this 555-1234 to find my true love i wish for a puppy for the war in iraq to end peace on earth please a free democratic venezuela may the best of 2007 be the worst of 2008 to be financially stable a little goodness for everyone would be nice i hope i get accepted into a college that i like i wish to get more sex in 2008 please let name be healthy and live all year to be emotionally stable and happy to take over the world with only dozens or hundreds of participants. The WISH corpus provides the first large-scale collection of wishes as a window into the world's desires. Beyond sentiment analysis, classifying sentences as wishes is an instance of non-topical classification. 
Tasks under this heading include computational humor (Mihalcea and Strapparava, 2005), genre classification (Boese and Howe, 2005), authorship attribution (Argamon and Shimoni, 2003), and metaphor detection (Krishnakumaran and Zhu, 2007), among others (Mishne et al., 2007; Mihalcea and Liu, 2006). We share the common goal of classifying text into a unique set of target categories (in our case, wishful and non-wishful), but use different techniques catered to our specific task. Our feature-generation technique for wish detection resembles template-based methods for information extraction (Brin, 1999; Agichtein and Gravano, 2000). 2 Analyzing the WISH Corpus Table 1: Example wishes and their frequencies in the WISH corpus. uct review excerpt: "Great camera. Indoor shots with a flash are not quite as good as 35mm. I wish the camera had a higher optical zoom so that I could take even better wildlife photos." The first sentence contains positive opinion, the second negative opinion. However, wishful statements like the third sentence are often annotated as non-opinion-bearing in sentiment analysis corpora (Hu and Liu, 2004; Ding et al., 2008), even though they clearly contain important information. An automatic "wish detector" text-processing tool can be useful for product manufacturers, advertisers, politicians, and others looking to discover what people want. 2. Wishes can tell us a lot about people: their innermost feelings, perceptions of what they're lacking, and what they desire (Speer, 1939). Many psychology researchers have attempted to quantify the contents of wishes and how they vary with factors such as location, gender, age, and personality type (Speer, 1939; Milgram and Riedel, 1969; Ehrlichman and Eichenstein, 1992; King and Broyles, 1997). These studies have been small scale 264 We analyze the WISH corpus with a variety of statistical methods. Our analyses not only reveal what people wished for on New Year's Eve, but also provide insight for the development of wish detectors in Section 3. The complete WISH corpus contains nearly 100,000 wishes collected over a period of 10 days in December 2007, most written in English, with the remainder in Portuguese, Spanish, Chinese, French, and other languages. For this paper, we consider only the 89,574 English wishes. Most of these English wishes contain optional geographic meta data provided by the wisher, indicating a variety of countries (not limited to English-speaking) around the world. We perform minimal preprocessing, including TreeBank-style tokenization, downcasing, and punctuation removal. Each wish is treated as a single entity, regardless of whether it contains multiple sentences. After preprocessing, the average length of a wish is 8 tokens. 2.1 The Topic and Scope of Wishes As a first step in understanding the content of the wishes, we asked five annotators to manually annotate a random subsample of 5,000 wishes. Sections 2.1 and 2.2 report results on this subsample. The wishes were annotated in terms of two at- tiple scope labels applied, the broadest scope was selected. Figure 1(b) shows the scope distribution. It is bimodal: over one third of the wishes are narrowly directed at one's self, while broad wishes at the world level are also frequent. The in-between scopes are less frequent. 2.2 Wishes Differ by Geographic Location (a) Topic of Wishes (b) Scope of Wishes Figure 1: Topic and scope distributions based on manual annotations of a random sample of 5,000 wishes in the WISH corpus. tributes: topic and scope. 
We used 11 pre-defined topic categories, and their distribution in this subsample of the WISH corpus is shown in Figure 1(a). The most frequent topic is love, while health, happiness, and peace are also common themes. Many wishes also fell into an other category, including specific individual requests ("i wish for a new puppy"), solicitations or advertisements ("call me 555-1234", "visit website.com"), or sinister thoughts ("to take over the world"). The 5,000 wishes were also manually assigned a scope. The scope of a wish refers to the range of people that are targeted by the wish. We used 6 pre-defined scope categories: self ("I want to be happy"), family ("For a cure for my husband"), specific person by name ("Prayers for name"), country ("Bring our troops home!"), world ("Peace to everyone in the world"), and other. In cases where mul265 As mentioned earlier, wishers had the option to enter a city/country when submitting wishes. Of the manually annotated wishes, about 4,000 included valid location information, covering all 50 states in the U.S., and all continents except Antarctica. We noticed a statistically significant difference between wishes submitted from the United States (about 3600) versus non-U.S. (about 400), both in terms of their topic and scope distributions. For each comparison, we performed a Pearson 2 -test using location as the explanatory variable and either topic or scope as the response variable.2 The null hypothesis is that the variables are independent. For both tests we reject the null hypothesis, with p < 0.001 for topic, and p = 0.006 for scope. This indicates a dependence between location and topic/scope. Asterisks in Figure 2 denote the labels that differ significantly between U.S. and non-U.S. wishes.3 In particular, we observed that there are significantly more wishes about love, peace, and travel from non-U.S. locales, and more about religion from the U.S. There are significantly more world-scoped wishes from non-U.S. locales, and more countryand family-scoped wishes from the U.S. We also compared wishes from "red states" versus "blue states" (U.S. states that voted a majority for the Republican and Democratic presidential candidates in 2008, respectively), but found no significant differences. 2 The topic test examined a 2 × 11 contingency table, while the scope test used a 2 × 6 contingency table. In both tests, all of the cells in the tables had an expected frequency of at least 5, so the 2 approximation is valid. 3 To identify the labels that differ significantly by location, we computed the standardized residuals for the cells in the two contingency tables. Standardized residuals are approximately N (0, 1)-distributed and can be used to locate the major contributors to a significant 2 -test statistic (Agresti, 2002). The asterisks in Figure 2 indicate the surprisingly large residuals, i.e., the difference between observed and expected frequencies is outside a 95% confidence interval. 10 3 peace log(frequency) 10 2 to find my true love 10 1 to take over the world (a) Wish topics differ by Location 10 0 10 0 10 1 10 10 log(rank) 2 3 10 4 10 5 Figure 3: The rank vs. frequency plot of wishes, approximately obeying Zipf's law. Note the log-log scale. (b) Wish scopes differ by Location Figure 2: Geographical breakdown of topic and scope distributions based on approximately 4,000 locationtagged wishes. Asterisks indicate statistically significant differences. 
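The location-by-topic comparison in Section 2.2 rests on a Pearson chi-square test of independence plus standardized residuals for locating the cells that drive a significant result. A minimal SciPy sketch of that procedure is below; the contingency table is made up for illustration (the paper's tables are 2 x 11 for topic and 2 x 6 for scope), and the residual formula is the standard adjusted-residual one rather than anything specific to this study.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2 x 3 table: rows = {US, non-US}, columns = wish topics.
table = np.array([[420, 310, 150],
                  [ 30,  55,  35]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.4f}")

# Adjusted (standardized) residuals, approximately N(0, 1) under independence.
n = table.sum()
row = table.sum(axis=1, keepdims=True) / n
col = table.sum(axis=0, keepdims=True) / n
resid = (table - expected) / np.sqrt(expected * (1 - row) * (1 - col))

# Cells outside +/-1.96 are the major contributors to a significant test,
# i.e. the label/location pairs that would be starred in Figure 2.
print(np.abs(resid) > 1.96)
```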
2.3 Wishes Follow Zipf's Law We now move beyond the annotated subsample and examine the full set of 89,574 English wishes. We noticed that a small fraction (4%) of unique wishes account for a relatively large portion (16%) of wish occurrences, while there are also many wishes that only occur once. The question naturally arises: do wishes obey Zipf's Law (Zipf, 1932; Manning and Sch¨ tze, 1999)? If so, we should expect the freu quency of a unique wish to be inversely proportional to its rank, when sorted by frequency. Figure 3 plots rank versus frequency on a log-log scale and reveals an approximately linear negative slope, thus suggesting that wishes do follow Zipf's law. It also shows that low-occurrence wishes dominate, hence learning might be hindered by data sparseness. 2.4 Latent Topic Modeling for Wishes unsupervised fashion. The goal is to validate and complement the study in Section 2.1. To apply LDA to the wishes, we treated each individual wish as a short document. We used 12 topics, Collapsed Gibbs Sampling (Griffiths and Steyvers, 2004) for inference, hyperparameters = 0.5 and = 0.1, and ran Markov Chain Monte Carlo for 2000 iterations. The resulting 12 LDA topics are shown in Table 2, in the form of the highest probability words p(word|topic) in each topic. We manually added summary descriptors for readability. With LDA, it is also possible to observe which words were assigned to which topics in each wish. For example, LDA assigned most words in the wish "world(8) peace(8) and my friends(4) in iraq(1) to come(1) home(1)" to two topics: peace and troops (topic numbers in parentheses). Interestingly, these LDA topics largely agree with the pre-defined topics in Section 2.1. 3 Building Wish Detectors The 11 topics in Section 2.1 were manually predefined based on domain knowledge. In contrast, in this section we applied Latent Dirichlet Allocation (LDA) (Blei et al., 2003) to identify the latent topics in the full set of 89,574 English wishes in an 266 We now study the novel NLP task of wish detection, i.e., classifying individual sentences as being wishes or not. Importantly, we want our approach to transfer to domains other than New Year's wishes, including consumer product reviews and online political discussions. It should be pointed out that wishes are highly domain dependent. 
For example, "I wish for world peace" is a common wish on New Year's Eve, but is exceedingly rare in product reviews; and vice versa: "I want to have instant access to the volume" may occur in product reviews, but is an un- Topic 0 1 2 3 4 5 6 7 8 9 10 11 Summary New Year Troops Election Life Prosperity Love Career Lottery Peace Religion Family Health Top words in the topic, sorted by p(word|topic) year, new, happy, 2008, best, everyone, great, years, wishing, prosperous, may, hope all, god, home, come, may, safe, s, us, bless, troops, bring, iraq, return, 2008, true, dreams wish, end, no, more, 2008, war, stop, president, paul, not, ron, up, free, less, bush, vote more, better, life, one, live, time, make, people, than, everyone, day, wish, every, each health, happiness, good, family, friends, all, love, prosperity, wealth, success, wish, peace love, me, find, wish, true, life, meet, want, man, marry, call, someone, boyfriend, fall, him get, wish, job, out, t, hope, school, better, house, well, want, back, don, college, married wish, win, 2008, money, want, make, become, lottery, more, great, lots, see, big, times peace, world, all, love, earth, happiness, everyone, joy, may, 2008, prosperity, around love, forever, jesus, know, loves, together, u, always, 2, 3, 4, much, best, mom, christ healthy, happy, wish, 2008, family, baby, life, children, long, safe, husband, stay, marriage com, wish, s, me, lose, please, let, cancer, weight, cure, mom, www, mother, visit, dad Table 2: Wish topics learned from Latent Dirichlet Allocation. Words are sorted by p(word|topic). likely New Year's wish. For this initial study, we do assume that there are some labeled training data in the target domains of interest. To transfer the knowledge learned from the outof-domain WISH corpus to other domains, our key insight is the following: while the content of wishes (e.g., "world peace") may not transfer across domains, the ways wishes are expressed (e.g., "I wish for ") may. We call these expressions wish templates. Our novel contribution is an unsupervised method for discovering candidate templates from the WISH corpus which, when applied to other target domains, improve wish detection in those domains. 3.1 Two Simple Wish Detectors i wish i hope i want hopefully if only would be better if would like if should would that can't believe didn't don't believe didn't do want i can has Table 3: Manual templates for identifying wishes. Before describing our template discovery method, we first describe two simple wish detectors, which serve as baselines. 1. [Manual]: It may seem easy to locate wishes. Perhaps looking for sentences containing the phrases "i wish," "i hope," or some other simple patterns is sufficient for identifying the vast majority of wishes in a domain. To test this hypothesis, we asked two native English speakers (not the annotators, nor affiliated with the project; no exposure to any of the wish datasets) to come up with text patterns that might be used to express wishes. They were shown three dictionary definitions of "to wish (v)" and "wish (n)". They produced a ranked list of 13 templates; see Table 3. The underscore matches any string. These templates can be turned into a simple rule-based classifier: If part of a sentence matches one of the templates, the sentence is 267 classified as a wish. By varying the depth of the list, one can produce different precision/recall behaviors. Overall, we expect [Manual] to have relatively high precision but low recall. 2. 
[Words]: Another simple method for detecting wishes is to train a standard word-based text classifier using the labeled training set in the target domain. Specifically, we represent each sentence as a binary word-indicator vector, normalized to sum to 1. We then train a linear Support Vector Machine (SVM). This method may have higher recall, but precision may suffer. For instance, the sentence "Her wish was carried out by her husband" is not a wish, but could be misclassified as one because of the word "wish." Note that neither of the two baseline methods uses the WISH corpus. 3.2 Automatically Discovering Wish Templates world peace health and happiness health c1 c2 count(c1+t1) t1 i wish for ___ count(c2) We now present our method to automatically discover high quality wish templates using the WISH corpus. The key idea is to exploit redundancy in how the same wish content is expressed. For example, as we see in Table 1, both "world peace" and "i wish for world peace" are common wishes. Similarly, both "health and happiness" and "i wish for health and happiness" appear in the WISH corpus. It is thus reasonable to speculate that "i wish for " is a good wish template. Less obvious templates can be discovered in this way, too, such as "let there be " from "peace on earth" and "let there be peace on earth." We formalize this intuition as a bipartite graph, illustrated in Figure 4. Let W = {w1 , . . . , wn } be the set of unique wishes in the WISH corpus. The bipartite graph has two types of nodes: content nodes C and template nodes T , and they are generated as follows. If a wish wj (e.g., "i wish for world peace") contains another wish wi (e.g., "world peace"), we create a content node c1 = wi and a template node t1 ="i wish for ". We denote this relationship by wj = c1 + t1 . Note the order of c1 and t1 is insignificant, as how the two combine is determined by the underscore in t1 , and wj = t1 + c1 is just fine. In addition, we place a directed edge from c1 to t1 with edge weight count(wj ), the frequency of wish wj in the WISH corpus. Then, a template node appears to be a good one if many heavy edges point to it. On the other hand, a template is less desirable if it is part of a content node. For example, when wj ="health and happiness" and wi ="health", we create the template t2 =" and happiness" and the content node c3 = wi . If there is another wish wk ="i wish for health and happiness", then there will be a content node c2 = wj . The template t2 thus contains some content words (since it matches c2 ), and may not generalize well in a new domain. We capture this by backward edges: if c C, and string s (s not necessarily in C or W ) such that c = s + t, we add a backward edge from t to c with edge weight count(c ). Based on such considerations, we devised the following scheme for scoring templates: score(t) = in(t) - out(t), (1) 268 t2 ___ and happiness c3 Figure 4: The bipartite graph to create templates. where in(t) is the in-degree of node t, defined as the sum of edge weights coming into t; out(t) is the outdegree of node t, defined similarly. In other words, a template receives a high score if it is "used" by many frequent wishes but does not match many frequent content-only wishes. To create the final set of template features, we apply the threshold score(t) 5. This produces a final list of 811 templates. Table 4 lists some of the top templates ranked by score(t). 
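The scoring scheme score(t) = in(t) - out(t) described above can be sketched compactly as follows. This is a simplified illustration, not the released system: it only builds templates when one wish is a prefix or suffix of another (ignoring content in the middle of a wish), and `wish_counts` is assumed to be a dictionary from unique wish strings to their corpus frequencies. The example counts are taken from Table 1.

```python
from collections import defaultdict

def score_templates(wish_counts, min_score=5):
    """Return {template: in(t) - out(t)} for templates scoring >= min_score."""
    in_deg = defaultdict(int)    # weighted edges from content nodes into templates
    out_deg = defaultdict(int)   # weighted backward edges from templates to content
    wishes = list(wish_counts)
    content_nodes = set()

    # Forward edges: a wish wj that contains another wish wi yields
    # content node wi and template t, with edge weight count(wj).
    for wj in wishes:
        for wi in wishes:
            if wi == wj or wi not in wj:
                continue
            if wj.startswith(wi):
                t = "___" + wj[len(wi):]
            elif wj.endswith(wi):
                t = wj[:len(wj) - len(wi)] + "___"
            else:
                continue                      # skip content in the middle
            content_nodes.add(wi)
            in_deg[t] += wish_counts[wj]

    # Backward edges: a template is penalized whenever a content node c can be
    # written as s + t, i.e. the template's fixed part matches inside c.
    for t in in_deg:
        if t.startswith("___"):
            fixed = t[len("___"):]
            hits = [c for c in content_nodes if c != fixed and c.endswith(fixed)]
        else:
            fixed = t[:-len("___")]
            hits = [c for c in content_nodes if c != fixed and c.startswith(fixed)]
        for c in hits:
            out_deg[t] += wish_counts[c]

    scores = {t: in_deg[t] - out_deg[t] for t in in_deg}
    return {t: s for t, s in scores.items() if s >= min_score}

counts = {"world peace": 331, "health": 244, "health and happiness": 112,
          "i wish for world peace": 76, "i wish for health and happiness": 51}
print(score_templates(counts))   # "i wish for ___" survives; "___ and happiness" is penalized away
```

The pairwise containment check is quadratic in the number of unique wishes, so an implementation at the scale of the WISH corpus would index wishes by prefix and suffix rather than scan all pairs.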
While some of these templates still contain time- or scope-related words ("for my family"), they are devoid of specific topical content. Notice that we have automatically identified several of the manually derived templates in Table 3, and introduce many new variations that a learning algorithm can leverage. Top 10 in 2008 i wish for i wish i want this year i wish in 2008 i wish to for my family i wish this year in the new year Others in Top 200 i want to for everyone i hope my wish is please wishing for may you i wish i had to finally for my family to have Table 4: Top templates according to Equation 1. 3.3 Learning with Wish Template Features After discovering wish templates as described above, we use them as features for learning in a new domain (e.g., product reviews). For each sentence in the new domain, we assign binary features indicating which templates match the sentence. Two types of matching are possible. Strict matching requires that the template must match an entire sentence from beginning to end, with at least one word filling in for the underscore. (All matching during the template generation process was strict.) Non-strict matching 1 0.9 0.8 0.7 Precision Precision Manual Words Templates Words + Templates 0.2 0.4 Recall 0.6 0.8 1 0.6 0.5 0.4 0.3 0.2 0.1 0 0 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 Manual Words Templates Words + Templates 0.2 0.4 Recall 0.6 0.8 1 Figure 5: Politics domain precision-recall curves. Figure 6: Products domain precision-recall curves. requires only that template match somewhere within a sentence. Rather than choose one type of matching, we create both strict and non-strict template features (1622 binary features total) and let the machine learning algorithm decide what is most useful. Our third wish detector, [Templates], is a linear SVM with the 1622 binary wish template features. Our fourth wish detector, [Words + Templates], is a linear SVM with both template and word features. 4 4.1 Experimental Results Target Domains and Experimental Setup We experimented with two domains, manually labeled at the sentence-level as wishes or non-wishes.4 Example wishes are listed in Table 6. Products. Consumer product reviews: 1,235 sentences selected from a collection of amazon.com and cnet.com reviews (Hu and Liu, 2004; Ding et al., 2008). 12% of the sentences are labeled as wishes. Politics. Political discussion board postings: 6,379 sentences selected from politics.com (Mullen and Malouf, 2008). 34% are labeled as wishes. We automatically split the corpora into sentences using MxTerminator (Reynar and Ratnaparkhi, 1997). As preprocessing before learning, we tokenized the text in the Penn TreeBank style, downThese wish-annotated corpora are available for download at http://pages.cs.wisc.edu/goldberg/wish data. 4 cased, and removed all punctuation. For all four wish detectors, we performed 10-fold cross validation. We used the default parameter in SVMlight for all trials (Joachims, 1999). As the data sets are skewed, we compare the detectors using precision-recall curves and the area under the curve (AUC). For the manual baseline, we produce the curve by varying the number of templates applied (in rank order), which gradually predicts more sentences as wishes (increasing recall at the expense of precision). A final point is added at recall 1.0, corresponding to applying an empty template that matches all sentences. For the SVM-based methods, we vary the threshold applied to the real-valued margin prediction to produce the curves. 
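The strict and non-strict matching used to build template features could be realized with regular expressions along the following lines. This is an assumed implementation for illustration (the paper does not give one): each template contributes one strict and one non-strict binary indicator, so the 811 discovered templates yield the 1,622 features mentioned above; the sentences are assumed to be downcased and punctuation-free, as in the preprocessing described in Section 4.1.

```python
import re

def template_patterns(template, slot=r"\S+(?:\s+\S+)*"):
    """Compile (strict, non_strict) regexes for a wish template.

    The '___' slot must be filled by at least one whitespace-delimited token;
    a template with no slot is treated as a literal phrase."""
    body = slot.join(re.escape(part) for part in template.split("___"))
    strict = re.compile(r"^\s*" + body + r"\s*$")   # must cover the whole sentence
    non_strict = re.compile(body)                   # may match anywhere inside it
    return strict, non_strict

def template_features(sentence, templates):
    """Binary vector [strict_1, loose_1, strict_2, loose_2, ...] for one sentence."""
    feats = []
    for t in templates:
        strict, loose = template_patterns(t)
        feats.append(1 if strict.search(sentence) else 0)
        feats.append(1 if loose.search(sentence) else 0)
    return feats

templates = ["i wish for ___", "i hope ___", "___ and happiness"]   # illustrative subset
print(template_features("i wish for a puppy", templates))           # [1, 1, 0, 0, 0, 0]
```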
All curves are interpolated, and AUC measures are computed, using the techniques of (Davis and Goadrich, 2006). 4.2 Results Figure 5 shows the precision-recall curves for the Politics corpus. All curves are averages over 10 folds (i.e., for each of 100 evenly spaced, interpolated recall points, the 10 precision values are averaged). As expected, [Manual] can be very precise with low recall--only the very top few templates achieve high precision and pick out a small number of wishes with "i wish" and "i hope." As we introduce more templates to cover more true wishes, precision drops off quickly. [Templates] is similar, 269 Corpus Politics Products [Manual] 0.67 ± 0.03 0.49 ± 0.13 [Words] 0.77 ± 0.03 0.52 ± 0.16 [Templates] 0.73 ± 0.03 0.47 ± 0.16 [Words + Templates] 0.80 ± 0.03 0.56 ± 0.16 Table 5: AUC results (10-fold averages ± one standard deviation). Products: the only area i wish apple had improved upon would be the screen i just want music to eminate from it when i want how i want the dial on the original zen was perfect and i wish it was on this model i would like album order for my live albums and was just wondering Politics: all children should be allowed healthcare please call on your representatives in dc and ask them to please stop the waste in iraq i hope that this is a new beginning for the middle east may god bless and protect the brave men and that we will face these dangers in the future Table 6: Example target-domain wishes correctly identified by [Words + Templates]. with slightly better precision in low recall regions. [Words] is the opposite: bad in high recall but good in low recall regions. [Words + Templates] is the best, taking the best from both kinds of features to dominate other curves. Table 5 shows the average AUC across 10 folds. [Words + Templates] is significantly better than all other detectors under paired t-tests (p = 1 × 10-7 vs. [Manual], p = 0.01 vs. [Words], and p = 4 × 10-7 vs. [Templates]). All other differences are statistically significant, too. Figure 6 shows the precision-recall curves for the Products corpus. Again, [Words + Templates] mostly dominates other detectors. In terms of average AUC across folds (Table 5), [Words + Templates] is also the best. However, due to the small size of this corpus, the AUC values have high variance, and the difference between [Words + Templates] and [Words] is not statistically significant under a paired t-test (p = 0.16). Finally, to understand what is being learned in more detail, we take a closer look at the SVM models' weights for one fold of the Products corpus (Table 7). The most positive and negative features make intuitive sense. Note that [Words + Templates] seems to rely on templates for selecting wishes and words for excluding non-wishes. This partially explains the synergy of combining the feature types. 270 Sign + + + + + - [Words] wish hope hopefully hoping want money find digital again you [Templates] i hope i wish hoping i just want i would like family forever let me d for my dad [Words + Templates] hoping i hope i just want i wish i would like micro about fix digital you Table 7: Features with the largest magnitude weights in the SVM models for one fold of the Products corpus. 5 Conclusions and Future Work We have presented a novel study of wishes from an NLP perspective. Using the first-of-its-kind WISH corpus, we generated domain-independent wish templates that improve wish detection performance across product reviews and political discussion posts. 
Much work remains in this new research area, including the creation of more types of features. Also, due to the difficulty in obtaining wishannotated training data, we plan to explore semisupervised learning for wish detection. Acknowledgements We thank the Times Square Alliance for providing the WISH corpus, and the Wisconsin Alumni Research Foundation. AG is supported in part by a Yahoo! Key Technical Challenges Grant. References Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In In Proceedings of the 5th ACM International Conference on Digital Libraries, pages 85­94. Alan Agresti. 2002. Categorical Data Analysis. WileyInterscience, second edition. Shlomo Argamon and Anat Rachel Shimoni. 2003. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17:401­412. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993­1022. Elizabeth Sugar Boese and Adele Howe. 2005. Genre classification of web documents. In Proceedings of the 20th National Conference on Artificial Intelligence (AAAI-05), Poster paper. Sergey Brin. 1999. Extracting patterns and relations from the world wide web. In WebDB '98: Selected papers from the International Workshop on The World Wide Web and Databases, pages 172­183. SpringerVerlag. Jesse Davis and Mark Goadrich. 2006. The relationship between precision-recall and roc curves. In ICML '06: Proceedings of the 23rd international conference on Machine learning, New York, NY, USA. ACM. Xiaowen Ding, Bing Liu, and Philip S. Yu. 2008. A holistic lexicon-based approach to opinion mining. In WSDM '08: Proceedings of the international conference on Web search and web data mining, pages 231­ 240. ACM. Howard Ehrlichman and Rosalind Eichenstein. 1992. Private wishes: Gender similarities and difference. Sex Roles, 26(9):399­422. Thomas Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl. 1):5228­5235. Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of KDD '04, the ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168­177. ACM Press. Thorsten Joachims. 1999. Making large-scale svm learning practical. In B. Sch¨ lkopf, C. Burges, and o A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press. Laura A. King and Sheri J. Broyles. 1997. Wishes, gender, personality, and well-being. Journal of Personality, 65(1):49­76. Moshe Koppel and Itai Shtrimberg. 2004. Good news or bad news? let the market decide. In AAAI Spring Symposium on Exploring Attitude and Affect in Text, pages 86­88. Saisuresh Krishnakumaran and Xiaojin Zhu. 2007. Hunting elusive metaphors using lexical resources. In Proceedings of the Workshop on Computational Approaches to Figurative Language, pages 13­20, Rochester, New York, April. Association for Computational Linguistics. Christopher D. Manning and Hinrich Sch¨ tze. 1999. u Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts. Rada Mihalcea and Hugo Liu. 2006. A corpus-based approach to finding happiness. In Proceedings of AAAICAAW-06, the Spring Symposia on Computational Approaches to Analyzing Weblogs. Rada Mihalcea and Carlo Strapparava. 2005. Making computers laugh: Investigations in automatic humor recognition. In Empirical Methods in Natural Language Processing. Norman A. 
Milgram and Wolfgang W. Riedel. 1969. Developmental and experiential factors in making wishes. Child Development, 40(3):763­771. Gilad Mishne, Krisztian Balog, Maarten de Rijke, and Breyten Ernsting. 2007. Moodviews: Tracking and searching mood-annotated blog posts. In Proceedings International Conf. on Weblogs and Social Media (ICWSM-2007), pages 323­324. Tony Mullen and Robert Malouf. 2008. Taking sides: User classification for informal online political discourse. Internet Research, 18:177­190. Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1­135. Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Fifth Conference on Applied Natural Language Processing. James Shanahan, Yan Qu, and Janyce Wiebe, editors. 2005. Computing attitude and affect in text. Springer, Dordrecht, The Netherlands. George S. Speer. 1939. Oral and written wishes of rural and city school children. Child Development, 10(3):151­155. G. K. Zipf. 1932. Selected Studies of the Principle of Relative Frequency in Language. Harvard University Press. 271 Predicting Risk from Financial Reports with Regression Shimon Kogan McCombs School of Business University of Texas at Austin Austin, TX 78712, USA shimon.kogan@mccombs.utexas.edu Dimitry Levin Mellon College of Science Carnegie Mellon University Pittsburgh, PA 15213, USA dimitrylevin@gmail.com Bryan R. Routledge Tepper School of Business Carnegie Mellon University Pittsburgh, PA 15213, USA routledge@cmu.edu Jacob S. Sagi Owen Graduate School of Management Vanderbilt University Nashville, TN 37203, USA Jacob.Sagi@Owen.Vanderbilt.edu Noah A. Smith School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213, USA nasmith@cs.cmu.edu Abstract We address a text regression problem: given a piece of text, predict a real-world continuous quantity associated with the text's meaning. In this work, the text is an SEC-mandated financial report published annually by a publiclytraded company, and the quantity to be predicted is volatility of stock returns, an empirical measure of financial risk. We apply wellknown regression techniques to a large corpus of freely available financial reports, constructing regression models of volatility for the period following a report. Our models rival past volatility (a strong baseline) in predicting the target variable, and a single model that uses both can significantly outperform past volatility. Interestingly, our approach is more accurate for reports after the passage of the Sarbanes-Oxley Act of 2002, giving some evidence for the success of that legislation in making financial reports more informative. 1 Introduction We consider a text regression problem: given a piece of text, predict a R-valued quantity associated with that text. Specifically, we use a company's annual financial report to predict the financial risk of investment in that company, as measured empirically by a quantity known as stock return volatility. 272 Predicting financial risk is of clear interest to anyone who invests money in stocks and central to modern portfolio choice. Financial reports are a government-mandated artifact of the financial world that--one might hypothesize--contain a large amount of information about companies and their value. Indeed, it is an important question whether mandated disclosures are informative, since they are meant to protect investors but are costly to produce. 
The intrinsic properties of the problem are attractive as a test-bed for NLP research. First, there is no controversy about the usefulness or existential reality of the output variable (volatility). Statistical NLP often deals in the prediction of variables ranging from text categories to linguistic structures to novel utterances. While many of these targets are uncontroversially useful, they often suffer from evaluation difficulties and disagreement among annotators. The output variable in this work is a statistic summarizing facts about the real world; it is not subject to any kind of human expertise, knowledge, or intuition. Hence this prediction task provides a new, objective test-bed for any kind of linguistic analysis. Second, many NLP problems rely on costly annotated resources (e.g., treebanks or aligned bilingual corpora). Because the text and historical financial data used in this work are freely available (by law) and are generated as a by-product of the American Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 272­280, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics economy, old and new data can be obtained by anyone with relatively little effort. In this paper, we demonstrate that predicting financial volatility automatically from a financial report is a novel, challenging, and easily evaluated natural language understanding task. We show that a very simple representation of the text (essentially, bags of unigrams and bigrams) can rival and, in combination, improve over a strong baseline that does not use the text. Analysis of the learned models provides insights about what can make this problem more or less difficult, and suggests that disclosurerelated legislation led to more transparent reporting. els. A large body of research in finance suggests that the two types of quantities are very different: while predictability of returns could be easily traded away by the virtue of buying/selling stocks that are under- or over-valued (Fama, 1970), similar trades are much more costly to implement with respect to predictability of volatility (Dumas et al., 2007). By focusing on volatility prediction, we avoid taking a stance on whether or not the United States stock market is informationally efficient. 3 Problem Formulation 2 Stock Return Volatility Given a text document d, we seek to predict the value of a continuous variable v. We do this via a parameterized function f : v = f (d; w) ^ (2) Volatility is often used in finance as a measure of risk. It is measured as the standard deviation of a stock's returns over a finite period of time. A stock will have high volatility when its price fluctuates widely and low volatility when its price remains more or less constant. Let rt = PPt - 1 be the return on a given stock t-1 between the close of trading day t - 1 and day t, where Pt is the (dividend-adjusted) closing stock price at date t. The measured volatility over the time period from day t - to day t is equal to the sample s.d.: where w Rd are the parameters or weights. Our approach is to learn a human-interpretable w from a collection of N training examples { di , vi }N , i=1 where each di is a document and each vi R. Support vector regression (Drucker et al., 1997) is a well-known method for training a regression model. 
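Before the learning setup is spelled out, the volatility target itself is easy to state in code. The sketch below follows the definition given here (simple daily returns from dividend-adjusted closes, then their sample standard deviation over the window); the price series is made up, and whether the divisor is tau or tau - 1 is immaterial for illustration.

```python
import numpy as np

def volatility(prices):
    """Sample standard deviation of simple daily returns.

    prices: dividend-adjusted closing prices P_0 ... P_tau, so the returns
    r_t = P_t / P_{t-1} - 1 cover the window [t - tau, t]."""
    prices = np.asarray(prices, dtype=float)
    returns = prices[1:] / prices[:-1] - 1.0
    return returns.std(ddof=1)          # sample s.d. (mean-centered returns)

prices = [10.0, 10.2, 9.9, 10.5, 10.4, 10.8]   # hypothetical closing prices
v = volatility(prices)
log_v = np.log(v)        # the models below predict log-volatility
print(round(v, 4), round(log_v, 4))
```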
SVR is trained by solving the following optimization problem: min 1 C w 2+ 2 N N i=1 v[t-,t] = i=0 (rt-i - r)2 Ż (1) wRd max 0, vi - f (di ; w) - -insensitive loss function where r is the sample mean of rt over the period. In Ż this work, the above estimate will be treated as the true output variable on training and testing data. It is important to note that predicting volatility is not the same as predicting returns or value. Rather than trying to predict how well a stock will perform, we are trying to predict how stable its price will be over a future time period. It is, by now, received wisdom in the field of economics that predicting a stock's performance, based on easily accessible public information, is difficult. This is an attribute of well-functioning (or "efficient") markets and a cornerstone of the so-called "efficient market hypothesis" (Fama, 1970). By contrast, the idea that one can predict a stock's level of risk using public information is uncontroversial and a basic assumption made by many economically sound pricing mod273 (3) where C is a regularization constant and controls the training error.1 The training algorithm finds weights w that define a function f minimizing the (regularized) empirical risk. Let h be a function from documents into some vector-space representation Rd . In SVR, the function f takes the form: N f (d; w) = h(d) w = i=1 i K(d, di ) (4) where Equation 4 re-parameterizes f in terms of a kernel function K with "dual" weights i . K can Given the embedding h of documents in Rd , defines a "slab" (region between two parallel hyperplanes, sometimes called the " -tube") in Rd+1 through which each h(di ), f (di ; w) must pass in order to have zero loss. 1 year words documents words/doc. 1996 5.5M 1,408 3,893 1997 9.3M 2,260 4,132 1998 11.8M 2,462 4,808 1999 14.5M 2,524 5,743 2000 13.4M 2,425 5,541 2001 15.4M 2,596 5,928 2002 22.7M 2,846 7,983 2003 35.3M 3,612 9,780 2004 38.9M 3,559 10,936 2005 41.9M 3,474 12,065 2006 38.8M 3,308 11,736 total 247.7M 26,806 9,240 Table 1: Dimensions of the dataset used in this paper, after filtering and tokenization. The near doubling in average document size during 2002­3 is possibly due to the passage of the Sarbanes-Oxley Act of 2002 in the wake of Enron's accounting scandal (and numerous others). be seen as a similarity function between two documents. At test time, a new example is compared to a subset of the training examples (those with i = 0); typically with SVR this set is sparse. With the linear kernel, the primal and dual weights relate linearly: N 1996­2006 from 10,492 different companies. Each report comes with a date of publication, which is important for tying the text back to the financial variables we seek to predict. From the perspective of predicting future events, one section of the 10-K is of special interest: Section 7, known as "management's discussion and analysis of financial conditions and results of operations" (MD&A), and in particular Subsection 7A, "quantitative and qualitative disclosures about market risk." Because Section 7 is where the most important forward-looking content is most likely to be found, we filter other sections from the reports. The filtering is done automatically using a short, hand-written Perl script that seeks strings loosely matching the Section 7, 7A, and 8 headers, finds the longest reasonable "Section 7" match (in words) of more than 1,000 whitespace-delineated tokens. 
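The paper trains epsilon-insensitive support vector regression with SVMlight and a linear kernel. The sketch below reproduces the same objective with scikit-learn's SVR instead, which is an assumption for illustration rather than the authors' setup; the document matrix is random stand-in data, epsilon is 0.1, and C is set to the inverse of the average squared norm of h(d), mirroring the SVMlight default mentioned in Section 6.

```python
import numpy as np
from sklearn.svm import SVR

# Toy stand-in for h(d): rows are documents, columns are term features.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
true_w = rng.normal(size=50)
y = X @ true_w + rng.normal(scale=0.1, size=200)   # pretend log-volatilities

C = 1.0 / np.mean(np.sum(X * X, axis=1))           # inverse mean of h(d).h(d)
model = SVR(kernel="linear", C=C, epsilon=0.1).fit(X, y)

# With a linear kernel the dual solution collapses to primal weights w
# (Equation (5)), which is what makes the learned model interpretable.
w = model.coef_.ravel()
print(w[:5], model.intercept_)
```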
4 Dataset

In the United States, the Securities and Exchange Commission mandates that all publicly-traded corporations produce annual reports known as "Form 10-K." The report typically includes information about the history and organization of the company, equity and subsidiaries, as well as financial information. These reports are available to the public and published on the SEC's web site (http://www.sec.gov/edgar.shtml). The structure of the 10-K is specified in detail in the legislation. We have collected 54,379 reports published over the period 1996-2006 from 10,492 different companies. Each report comes with a date of publication, which is important for tying the text back to the financial variables we seek to predict.

From the perspective of predicting future events, one section of the 10-K is of special interest: Section 7, known as "management's discussion and analysis of financial conditions and results of operations" (MD&A), and in particular Subsection 7A, "quantitative and qualitative disclosures about market risk." Because Section 7 is where the most important forward-looking content is most likely to be found, we filter other sections from the reports. The filtering is done automatically using a short, hand-written Perl script that seeks strings loosely matching the Section 7, 7A, and 8 headers, and finds the longest reasonable "Section 7" match (in words) of more than 1,000 whitespace-delineated tokens. Not all of the downloaded documents pass the filter, and for the present work we have only used documents that do. (One reason for the failure of the filter is that many 10-K reports include Section 7 "by reference," so the text is not directly included in the document.)

Section 7 typically begins with an introduction like this (from ABC's 1998 Form 10-K, before tokenization, for readability; boldface added):

    The following discussion and analysis of ABC's consolidated financial condition and consolidated results of operation should be read in conjunction with ABC's Consolidated Financial Statements and Notes thereto included elsewhere herein. This discussion contains certain forward-looking statements which involve risks and uncertainties. ABC's actual results could differ materially from the results expressed in, or implied by, such statements. See "Regarding Forward-Looking Statements."

In addition to the reports, we used the Center for Research in Security Prices (CRSP) US Stocks Database to obtain the price return series along with other firm characteristics. We proceeded to calculate two volatilities for each firm/report observation: the twelve months prior to the report (v^(-12)) and the twelve months after the report (v^(+12)). The text and volatility data are publicly available at http://www.ark.cs.cmu.edu/10K.

Tokenization was applied to the text, including punctuation removal, downcasing, collapsing all digit sequences, and heuristic removal of remnant markup. (While numerical information is surely informative about risk, recall that our goal is to find indicators of risk expressed in the text; automatic predictors of risk from numerical data would use financial data streams directly, not text reports.) Table 1 gives statistics on the corpora used in this research; this is a subset of the corpus for which there is no missing volatility information. The drastic increase in length during the 2002-2003 period might be explained by the passage by the US Congress of the Sarbanes-Oxley Act of 2002 (and related SEC and exchange rules), which imposed revised standards on the reporting practices of publicly-traded companies in the US.

    year     words     documents   words/doc.
    1996     5.5M      1,408       3,893
    1997     9.3M      2,260       4,132
    1998     11.8M     2,462       4,808
    1999     14.5M     2,524       5,743
    2000     13.4M     2,425       5,541
    2001     15.4M     2,596       5,928
    2002     22.7M     2,846       7,983
    2003     35.3M     3,612       9,780
    2004     38.9M     3,559       10,936
    2005     41.9M     3,474       12,065
    2006     38.8M     3,308       11,736
    total    247.7M    26,806      9,240

Table 1: Dimensions of the dataset used in this paper, after filtering and tokenization. The near doubling in average document size during 2002-3 is possibly due to the passage of the Sarbanes-Oxley Act of 2002 in the wake of Enron's accounting scandal (and numerous others).
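The preprocessing described above is easy to approximate. The snippet below is a rough Python rendering of that tokenization (downcasing, punctuation removal, collapsing digit sequences to a single "#" symbol as in Table 3, and stripping leftover markup); the exact regular expressions and the sample sentence are assumptions for illustration, not the authors' Perl scripts.

import re

def tokenize(text):
    text = re.sub(r"<[^>]+>", " ", text)     # heuristically drop remnant markup
    text = text.lower()                      # downcase
    text = re.sub(r"\d+", "#", text)         # collapse every digit sequence to "#"
    text = re.sub(r"[^a-z#\s]", " ", text)   # remove punctuation
    return text.split()

print(tokenize("Revenues grew 12% to $1,250 million in <b>fiscal 1998</b>."))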
5 Baselines and Evaluation Method

Volatility displays an effect known as autoregressive conditional heteroscedasticity (Engle, 1982). This means that the variance in a stock's return tends to change gradually. Large changes in price are presaged by other changes, and periods of stability tend to continue. Volatility is, generally speaking, not constant, yet prior volatility (e.g., v^(-12)) is a very good predictor of future volatility (e.g., v^(+12)). At the granularity of a year, which we consider here because the 10-K reports are annual, there are no existing models of volatility that are widely agreed to be significantly more accurate than our historical volatility baseline. We tested a state-of-the-art model known as GARCH(1,1) (Engle, 1982; Bollerslev, 1986) and found that it was no stronger than our historical volatility baseline on this sample.

Throughout this paper, we will report performance using the mean squared error between the predicted and true log-volatilities:

    \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \big( \log(v_i) - \log(\hat{v}_i) \big)^2    (6)

where N is the size of the test set, given in Table 1. (We work in the log domain because it is standard in finance, due to the dynamic range of actual volatilities; the distribution over log v across companies tends to have a bell shape.)

6 Experiments

In our experiments, we vary h (the function that maps inputs to a vector space) and the subset of the data used for training. We will always report performance over test sets consisting of one year's worth of data (the subcorpora described in Table 1). In this work, we focus on predicting the volatility over the year following the report (v^(+12)). In all experiments, ε = 0.1 and C is set using the default choice of SVMlight, which is the inverse of the average of h(d) · h(d) over the training data. (These values were selected after preliminary and cursory exploration with 1996-2000 as training data and 2001 as the test set. While the effects of ε and C were not large, further improvements may be possible with more careful tuning.)

6.1 Feature Representation

We first consider how to represent the 10-K reports. We adopt various document representations, all using word features. Let M be the vocabulary size derived from the training data. (Preliminary experiments that filtered common or rare words showed a negligible or deleterious effect on performance.) Let freq(x_j; d) denote the number of occurrences of the jth word in the vocabulary in document d.

· TF: h_j(d) = \frac{1}{|d|} freq(x_j; d), for j ∈ {1, ..., M}.

· TFIDF: h_j(d) = \frac{1}{|d|} freq(x_j; d) × \log(N / |\{d : freq(x_j; d) > 0\}|), where N is the number of documents in the training set. This is the classic "TFIDF" score.

· LOG1P: h_j(d) = \log(1 + freq(x_j; d)). Rather than normalizing word frequencies as for TF, this score dampens them with a logarithm.

We also include a variant of LOG1P where terms are the union of unigrams and bigrams. Note that each of these preserves sparsity; when freq(x_j; d) = 0, h_j(d) = 0 in all cases.

For interpretability of results, we use a linear kernel. The usual bias weight b is included. We found it convenient to work in the logarithmic domain for the predicted variable, predicting log v instead of v, since volatility is always nonnegative. In this setting, the predicted volatility takes the form:

    \log \hat{v} = b + \sum_{j=1}^{M} w_j h_j(d)    (7)
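As a concrete illustration of the three weighting schemes and of the log-domain error in Eq. 6, here is a short Python sketch over a toy corpus of pre-tokenized documents; the toy documents and the use of plain dictionaries instead of sparse vectors are illustrative choices, not the paper's implementation.

import math
from collections import Counter

def feature_vectors(docs):
    # docs: list of token lists; returns TF, TFIDF, and LOG1P weights per document.
    N = len(docs)
    counts = [Counter(d) for d in docs]
    df = Counter(w for c in counts for w in c)   # document frequency per term
    feats = []
    for d, c in zip(docs, counts):
        tf = {w: f / len(d) for w, f in c.items()}
        tfidf = {w: (f / len(d)) * math.log(N / df[w]) for w, f in c.items()}
        log1p = {w: math.log(1.0 + f) for w, f in c.items()}
        feats.append({"TF": tf, "TFIDF": tfidf, "LOG1P": log1p})
    return feats

def log_mse(v_true, v_pred):
    # Mean squared error between log-volatilities (Eq. 6).
    return sum((math.log(a) - math.log(b)) ** 2
               for a, b in zip(v_true, v_pred)) / len(v_true)

docs = [["net", "loss", "net", "loss", "expenses"], ["dividends", "net", "income"]]
print(feature_vectors(docs)[0]["TFIDF"])
print(log_mse([0.30, 0.55, 0.20], [0.28, 0.60, 0.25]))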
Because the goal of this work is to explore how text might be used to predict volatility, we also wish to see whether text adds information beyond what can be predicted using historical volatility alone (the baseline, v^(-12)). We therefore consider models augmented with an additional feature, defined as h_{M+1} = log v^(-12). Since this is historical information, it is always available when the 10-K report is published. These models are denoted TF+, TFIDF+, and LOG1P+.

The performance of these models, compared to the baseline from Section 5, is shown in Table 2. We used as training examples all reports from the five-year period preceding the test year (so six experiments on six different training and test sets are shown in the table). We also trained SVR models on the single feature v^(-12), with and without bias weights (b in Eq. 7); these are usually worse and never significantly better than the baseline.

    features                       2001     2002     2003     2004     2005     2006     micro-ave.
    history
      v^(-12) (baseline)           0.1747   0.1600   0.1873   0.1442   0.1365   0.1463   0.1576
      v^(-12) (SVR with bias)      0.2433   0.4323   0.1869   0.2717   0.3184   5.6778   1.2061
      v^(-12) (SVR without bias)   0.2053   0.1653   0.2051   0.1337   0.1405   0.1517   0.1655
    words
      TF                           0.2219   0.2571   0.2588   0.2134   0.1850   0.1862   0.2197
      TFIDF                        0.2033   0.2118   0.2178   0.1660   0.1544   0.1599   0.1842
      LOG1P                        0.2107   0.2214   0.2040   0.1693   0.1581   0.1715   0.1873
      LOG1P, bigrams               0.1968   0.2015   0.1729   0.1500   0.1394   0.1532   0.1667
    both
      TF+                          0.1885   0.1616   0.1925   0.1230   0.1272   0.1402   0.1541
      TFIDF+                       0.1919   0.1618   0.1965   0.1246   0.1276   0.1403   0.1557
      LOG1P+                       0.1846   0.1764   0.1671   0.1309   0.1319   0.1458   0.1542
      LOG1P+, bigrams              0.1852   0.1792   0.1599   0.1352   0.1307   0.1448   0.1538

Table 2: MSE (Eq. 6) of different models on test data predictions. Lower values are better. Boldface denotes improvements over the baseline, and † denotes significance compared to the baseline under a permutation test (p < 0.05).

Strikingly, the models that use only the text to predict volatility come very close to the historical baseline in some years. That a text-only method (LOG1P with bigrams) for predicting future risk comes within 5% of the error of a strong baseline (2003-6) shows promise for the overall approach. A combined model improves substantially over the baseline in four out of six years (2003-6), and this difference is usually robust to the representation used. Table 3 shows the most strongly weighted terms in each of the text-only LOG1P models (including bigrams). These weights are recovered using the relationship expressed in Eq. 5.

6.2 Training Data Effects

It is well known that more training data tend to improve the performance of a statistical method; however, the standard assumption is that the training data are drawn from the same distribution as the test data. In this work, where we seek to predict the future based on data from the past, that assumption is obviously violated. It is therefore an open question whether more data (i.e., looking farther into the past) is helpful for predicting volatility, or whether it is better to use only the most recent data. Table 4 shows how performance varies when one, two, or five years of historical training data are used, averaged across test years. In most cases, using more training data (from a longer historical period) is helpful, but not always. One interesting trend, not shown in the aggregate statistics of Table 4, is that recency of the training set affected performance much more strongly in earlier train/test splits (2001-3) than later ones (2004-6). This experiment leads us to conclude that temporal changes in financial reporting make training data selection nontrivial.
Changes in the macro economy and specific businesses make older reports less relevant for prediction. For example, regulatory changes like Sarbanes-Oxley, variations in the business cycle, and technological innovation like the Internet influence both the volatility and the 10-K text. 6.3 Effects of Sarbanes-Oxley We noted earlier that the passage of the SarbanesOxley Act of 2002, which sought to reform financial reporting, had a clear effect on the lengths of the 10-K reports in our collection. But are the reports more informative? This question is important, because producing reports is costly; we present an empirical argument based on our models that the legis- high v ^ Table 3: Most strongly-weighted terms in models learned from various time periods (LOG 1 P model with unigrams and bigrams). "#" denotes any digit sequence. 0.025 0.017 0.016 0.015 0.014 0.013 0.013 0.013 -0.011 -0.011 -0.012 -0.012 -0.013 -0.014 -0.017 -0.021 2001­2005 loss net loss going concern expenses a going personnel financing administrative policies by the earnings dividends unsecured properties rate net income features 1 2 5 TF + 0.1509 0.1450 0.1541 TFIDF + 0.1512 0.1455 0.1557 LOG 1 P + 0.1621 0.1611 0.1542 LOG 1 P +, bigrams 0.1617 0.1588 0.1538 Table 4: MSE of volatility predictions using reports from varying historical windows (1, 2, and 5 years), microaveraged across six train/test scenarios. Boldface marks best in a row. The historical baseline achieves 0.1576 MSE (see Table 2). 0.026 0.018 0.014 0.014 0.014 0.013 0.013 0.012 -0.011 -0.011 -0.011 -0.012 -0.012 -0.013 -0.014 -0.018 low v ^ 1998­2002 loss 0.023 net loss 0.020 expenses 0.017 year # 0.015 obligations 0.015 financing 0.014 convertible 0.014 additional 0.014 unsecured -0.012 earnings -0.012 distributions -0.012 dividends -0.015 income -0.016 properties -0.016 net income -0.019 rate -0.022 lation has actually been beneficial. Our experimental results in Section 6.1, in which volatility in the years 2004­2006 was more accurately predicted from the text than in 2001­2002, suggest that the Sarbanes-Oxley Act led to more informative reports. We compared the learned weights (LOG 1 P +, unigrams) between the six overlapping five-year windows ending in 2000­2005; measured in L1 distance, these were, in consecutive order, 52.2, 59.9, 60.7, 55.3, 52.3 ; the biggest differences came between 2001 and 2002 and between 2002 and 2003. (Firms are most likely to have begun compliance with the new law in 2003 or 2004.) The same pattern held when only words appearing in all five models were considered. Variation in the recency/training set size tradeoff (§6.2), particularly during 2002­3, also suggests that there were substantial changes in the reports during that time. 6.4 Qualitative Evaluation One of the advantages of a linear model is that we can explore what each model discovers about different unigram and bigram terms. Some manually selected examples of terms whose learned weights (w) show interesting variation patterns over time are shown in Figure 1, alongside term frequency patterns, for the text-only LOG 1 P model (with bigrams). These examples were suggested by experts in finance from terms with weights that were both large and variable (across training sets). A particularly interesting case, in light of Sarbanes-Oxley, is the term accounting policies. Sarbanes-Oxley mandated greater discussion of accounting policy in the 10-K MD&A section. 
Before 2002 this term indicates high volatility, perhaps due to complicated off-balance sheet transactions or unusual accounting policies. Starting in 2002, explicit mention of accounting policies indi- -0.012 -0.012 -0.012 -0.012 -0.013 -0.015 -0.019 -0.021 -0.021 -0.015 -0.015 -0.017 -0.017 -0.018 -0.019 0.026 0.024 0.020 0.019 0.017 0.014 0.014 0.014 -0.014 -0.015 -0.015 -0.015 -0.017 -0.018 1996­2000 net loss year # loss expenses covenants diluted convertible date longterm rates dividend unsecured merger agreement properties income rate -0.022 1997­2001 year # net loss expenses loss experienced of $# covenants additional merger agreement dividends unsecured dividend properties net income income rate -0.025 0.028 0.023 0.020 0.020 0.015 0.015 0.015 0.014 1999­2003 loss net loss expenses going concern year # financing a going additional distributions annual dividend dividends rates properties rate net income -0.023 0.026 0.020 0.017 0.015 0.015 0.014 0.014 0.013 2000­2004 loss net loss year # expenses going concern a going administrative personnel distributions insurance critical accounting lower interest dividends properties rate net income 277 0.005 0 -0.005 -0.010 -0.015 0.005 0 w -0.005 -0.010 0.010 0.005 0 -0.005 -0.010 96-00 97-01 98-02 99-03 00-04 01-05 w ave. term frequency w ave. term frequency 8 estimates accounting policies 6 4 2 0 ave. term frequency mortgages 0.8 0.6 0.4 0.2 0 0.20 reit higher margin 0.15 0.10 0.05 0 lower margin Figure 1: Left: learned weights for selected terms across models trained on data from different time periods (x-axis). These weights are from the LOG 1 P (unigrams and bigrams) models trained on five-year periods, the same models whose extreme weights are summarized in Tab. 3. Note that all weights are within 0 ± 0.026. Right: the terms' average frequencies (by document) over the same periods. cates lower volatility. The frequency of the term also increases drastically over the same period, suggesting that the earlier weights may have been inflated. A more striking example is estimates, which averages one occurrence per document even in the 1996­2000 period, experiences the same term frequency explosion, and goes through a similar weight change, from strongly indicating high volatility to strongly indicating low volatility. As a second example, consider the terms mortgages and reit (Real Estate Investment Trust, a tax designation for businesses that invest in real estate). Given the importance of the housing and mortgage market over the past few years, it is interesting to note that the weight on both of these terms increases over the period from a strong low volatility term to a weak indicator of high volatility. It will be interesting to see how the dramatic decline in housing prices in late 2007, and the fallout created in credit markets in 2008, is reflected in future models. Finally, notice that high margin and low margin, whose frequency patterns are fairly flat "switch places," over the sample: first indicating high and low volatility, respectively, then low and high. There is no a priori reason to expect high or low margins 278 would be associated with high or low stock volatility. However, this is an interesting example where bigrams are helpful (the word margin by itself is uninformative) and indicates that predicting risk is highly time-dependent. 
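The model-comparison machinery behind Sections 6.3 and 6.4 (L1 distances between the weight vectors of models trained on successive five-year windows, and per-term weight trajectories such as the one for "accounting policies") is straightforward to sketch. In the Python fragment below, the weight dictionaries are toy values standing in for the learned LOG1P models, not numbers from the paper.

def l1_distance(w1, w2):
    # L1 distance between two per-term weight maps (missing terms count as 0).
    terms = set(w1) | set(w2)
    return sum(abs(w1.get(t, 0.0) - w2.get(t, 0.0)) for t in terms)

def trajectory(models, term):
    # Weight of one term across chronologically ordered models (cf. Figure 1).
    return [w.get(term, 0.0) for w in models]

models = [
    {"accounting policies": 0.004, "net loss": 0.020},    # e.g., a 1996-2000 model
    {"accounting policies": -0.003, "net loss": 0.023},   # e.g., a 1997-2001 model
]
print(l1_distance(models[0], models[1]))
print(trajectory(models, "accounting policies"))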
6.5 Delisting An interesting but relatively infrequent phenomenon is the delisting of a company, i.e., when it ceases to be traded on a particular exchange due to dissolution after bankruptcy, a merger, or violation of exchange rules. The relationship between volatility and delisting has been studied by Merton (1974), among others. Our dataset includes a small number of cases where the volatility figures for the period following the publication of a 10-K report are unavailable because the company was delisted. Learning to predict delisting is extremely difficult because fewer than 4% of the 2001­6 10-K reports precede delisting. Using the LOG 1 P representation, we built a linear SVM classifier for each year in 2001­6 (trained on the five preceding years' data) to predict whether a company will be delisted following its 10-K report. Performance for various precision measures is shown in Table 5. Notably, for 2001­4 we achieve precision (%) at ... recall = 10% n=5 n = 10 n = 100 oracle F1 (%) '01 80 100 80 38 35 '02 93 100 90 48 42 '03 79 40 70 53 44 '04 100 100 90 29 36 '05 47 60 60 24 31 '06 21 80 70 20 16 6 5 4 3 2 1 bulletin, creditors, dip, otc court chapter, debtors, filing, prepetition bankruptcy concern, confirmation, going, liquidation debtorinpossession, delisted, nasdaq, petition Table 5: Left: precision of delisting predictions. The "oracle F1 " row shows the maximal F1 score obtained for any n. Right: Words most strongly predicting delisting of a company. The number is how many of the six years (2001­6) the word is among the ten most strongly weighted. There were no clear patterns across years for words predicting that a company would not be delisted. The word otc refers to "over-the-counter" trading, a high-risk market. above 75% precision at 10% recall. Our best (oracle) F1 scores occur in 2002 and 2003, suggesting again a difference in reports around Sarbanes-Oxley. Table 5 shows words associated with delisting. 7 Related Work In NLP, regression is not widely used, since most natural language-related data are discrete. Regression methods were pioneered by Yang and Chute (1992) and Yang and Chute (1993) for information retrieval purposes, but the predicted continuous variable was not an end in itself in that work. Blei and McAuliffe (2007) used latent "topic" variables to predict movie reviews and popularity from text. Lavrenko et al. (2000b) and Lavrenko et al. (2000a) modeled influences between text and time series financial data (stock prices) using language models. Farther afield, Albrecht and Hwa (2007) used SVR to train machine translation evaluation metrics to match human evaluation scores and compared techniques using correlation. Regression has also been used to order sentences in extractive summarization (Biadsy et al., 2008). While much of the information relevant for investors is communicated through text (rather than numbers), only recently is this link explored. Some papers relate news articles to earning forecasts, stock returns, volatility, and volume (Koppel and Shtrimberg, 2004; Tetlock, 2007; Tetlock et al., 2008; Gaa, 2007; Engelberg, 2007). Das and Chen (2001) and Antweiler and Frank (2004) ask whether messages posted on message boards can help explain stock performance, while Li (2005) measures the association between frequency of words associated with risk and subsequent stock returns. Weiss-Hanley and Hoberg (2008) study initial public offering disclosures using word statistics. 
Many researchers have focused the related problem of predicting sentiment 279 and opinion in text (Pang et al., 2002; Wiebe and Riloff, 2005), sometimes connected to extrinsic values like prediction markets (Lerman et al., 2008). In contrast to text regression, text classification comprises a widely studied set of problems involving the prediction of categorial variables related to text. Applications have included the categorization of documents by topic (Joachims, 1998), language (Cavnar and Trenkle, 1994), genre (Karlgren and Cutting, 1994), author (Bosch and Smith, 1998), sentiment (Pang et al., 2002), and desirability (Sahami et al., 1998). Text categorization has served as a test application for nearly every machine learning technique for discrete classification. 8 Conclusion We have introduced and motivated a new kind of task for NLP: text regression, in which text is used to make predictions about measurable phenomena in the real world. We applied the technique to predicting financial volatility from companies' 10-K reports, and found text regression model predictions to correlate with true volatility nearly as well as historical volatility, and a combined model to perform even better. Further, improvements in accuracy and changes in models after the passage of the SarbanesOxley Act suggest that financial reporting reform has had interesting and measurable effects. Acknowledgments The authors are grateful to Jamie Callan, Chester Spatt, Anthony Tomasic, Yiming Yang, and Stanley Zin for helpful discussions, and to the anonymous reviewers for useful feedback. This research was supported by grants from the Institute for Quantitative Research in Finanace and from the Center for Analytical Research in Technology at the Tepper School of Business, Carnegie Mellon University. References J. S. Albrecht and R. Hwa. 2007. Regression for sentence-level MT evaluation with pseudo references. In Proc. of ACL. W. Antweiler and M. Z. Frank. 2004. Is all that talk just noise? the information content of internet stock message boards. Journal of Finance, 59:1259­1294. F. Biadsy, J. Hirschberg, and E. Filatova. 2008. An unsupervised approach to biography production using Wikipedia. In Proc. of ACL. D. M. Blei and J. D. McAuliffe. 2007. Supervised topic models. In Advances in NIPS 21. T. Bollerslev. 1986. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31:307­327. R. Bosch and J. Smith. 1998. Separating hyperplanes and the authorship of the disputed Federalist papers. American Mathematical Monthly, 105(7):601­608. W. B. Cavnar and J. M. Trenkle. 1994. n-gram-based text categorization. In Proc. of SDAIR. S. Das and M. Chen. 2001. Yahoo for Amazon: Extracting market sentiment from stock mesage boards. In Proc. of Asia Pacific Finance Association Annual Conference. H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola, and V. Vapnik. 1997. Support vector regression machines. In Advances in NIPS 9. B. Dumas, A. Kurshev, and R. Uppal. 2007. Equilibrium portfolio strategies in the presence of sentiment risk and excess volatility. Swiss Finance Institute Research Paper No. 07-37. J. Engelberg. 2007. Costly information processing: Evidence from earnings announcements. Working paper, Northwestern University. R. F. Engle. 1982. Autoregressive conditional heteroscedasticity with estimates of variance of united kingdom inflation. Econometrica, 50:987­1008. E. F. Fama. 1970. Efficient capital markets: A review of theory and empirical work. Journal of Finance, 25(2):383­417. C. Gaa. 
2007. Media coverage, investor inattention, and the market's reaction to news. Working paper, University of British Columbia. T. Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Proc. of ECML. T. Joachims. 1999. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning. MIT Press. J. Karlgren and D. Cutting. 1994. Recognizing text genres with simple metrics using discriminant analysis. In Proc. of COLING. M. Koppel and I. Shtrimberg. 2004. Good news or bad news? let the market decide. In AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications. V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, and J. Allan. 2000a. Language models for financial news recommendation. In Proc. of CIKM. V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, and J. Allan. 2000b. Mining of concurrent text and time series. In Proc. of KDD. K. Lerman, A. Gilder, M. Dredze, and F. Pereira. 2008. Reading the markets: Forecasting public opinion of political candidates by news analysis. In COLING. F. Li. 2005. Do stock market investors understand the risk sentiment of corporate annual reports? Working Paper, University of Michigan. R. Merton. 1974. On the pricing of corporate debt: The risk structure of interest rates. Journal of Finance, 29:449­470. B. Pang, L. Lee, and S. Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proc. of EMNLP. M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. 1998. A Bayesian approach to filtering junk email. In Proc. of AAAI Workshop on Learning for Text Categorization. B. Sch¨ lkopf and A. J. Smola. 2002. Learning with Kero nels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press. P. C. Tetlock, M. Saar-Tsechansky, and S. Macskassy. 2008. More than words: Quantifying language to measure firms' fundamentals. Journal of Finance, 63(3):1437­1467. P. C. Tetlock. 2007. Giving content to investor sentiment: The role of media in the stock market. Journal of Finance, 62(3):1139­1168. K. Weiss-Hanley and G. Hoberg. 2008. Strategic disclosure and the pricing of initial public offerings. Working paper. J. Wiebe and E. Riloff. 2005. Creating subjective and objective sentence classifiers from unannotated texts. In CICLing. Y. Yang and C. G. Chute. 1992. A linear least squares fit mapping method for information retrieval from natural language texts. In Proc. of COLING. Y. Yang and C. G. Chute. 1993. An application of least squares fit mapping to text information retrieval. In Proc. of SIGIR. 280 Domain Adaptation with Latent Semantic Association for Named Entity Recognition Honglei Guo Huijia Zhu Zhili Guo Xiaoxun Zhang Xian Wu and Zhong Su IBM China Research Laboratory Beijing, P. R. China {guohl, zhuhuiji, guozhili, zhangxx, wuxian, suzhong}@cn.ibm.com Abstract Domain adaptation is an important problem in named entity recognition (NER). NER classifiers usually lose accuracy in the domain transfer due to the different data distribution between the source and the target domains. The major reason for performance degrading is that each entity type often has lots of domainspecific term representations in the different domains. The existing approaches usually need an amount of labeled target domain data for tuning the original model. However, it is a labor-intensive and time-consuming task to build annotated training data set for every target domain. 
We present a domain adaptation method with latent semantic association (LaSA). This method effectively overcomes the data distribution difference without leveraging any labeled target domain data. LaSA model is constructed to capture latent semantic association among words from the unlabeled corpus. It groups words into a set of concepts according to the related context snippets. In the domain transfer, the original term spaces of both domains are projected to a concept space using LaSA model at first, then the original NER model is tuned based on the semantic association features. Experimental results on English and Chinese corpus show that LaSA-based domain adaptation significantly enhances the performance of NER. important task in information extraction and natural language processing (NLP) applications. Supervised learning methods can effectively solve NER problem by learning a model from manually labeled data (Borthwick, 1999; Sang and Meulder, 2003; Gao et al., 2005; Florian et al., 2003). However, empirical study shows that NE types have different distribution across domains (Guo et al., 2006). Trained NER classifiers in the source domain usually lose accuracy in a new target domain when the data distribution is different between both domains. Domain adaptation is a challenge for NER and other NLP applications. In the domain transfer, the reason for accuracy loss is that each NE type often has various specific term representations and context clues in the different domains. For example, {"economist", "singer", "dancer", "athlete", "player", "philosopher", ...} are used as context clues for NER. However, the distribution of these representations are varied with domains. We expect to do better domain adaptation for NER by exploiting latent semantic association among words from different domains. Some approaches have been proposed to group words into "topics" to capture important relationships between words, such as Latent Semantic Indexing (LSI) (Deerwester et al., 1990), probabilistic Latent Semantic Indexing (pLSI) (Hofmann, 1999), Latent Dirichlet Allocation (LDA) (Blei et al., 2003). These models have been successfully employed in topic modeling, dimensionality reduction for text categorization (Blei et al., 2003), ad hoc IR (Wei and Croft., 2006), and so on. In this paper, we present a domain adaptation method with latent semantic association. We focus 1 Introduction Named entities (NE) are phrases that contain names of persons, organizations, locations, etc. NER is an Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 281­289, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics 281 on capturing the hidden semantic association among words in the domain adaptation. We introduce the LaSA model to overcome the distribution difference between the source domain and the target domain. LaSA model is constructed from the unlabeled corpus at first. It learns latent semantic association among words from their related context snippets. In the domain transfer, words in the corpus are associated with a low-dimension concept space using LaSA model, then the original NER model is tuned using these generated semantic association features. The intuition behind our method is that words in one concept set will have similar semantic features or latent semantic association, and share syntactic and semantic context in the corpus. 
They can be considered as behaving in the same way for discriminative learning in the source and target domains. The proposed method associates words from different domains on a semantic level rather than by lexical occurrence. It can better bridge the domain distribution gap without any labeled target domain samples. Experimental results on English and Chinese corpus show that LaSA-based adaptation significantly enhances NER performance across domains. The rest of this paper is organized as follows. Section 2 briefly describes the related works. Section 3 presents a domain adaptation method based on latent semantic association. Section 4 illustrates how to learn LaSA model from the unlabeled corpus. Section 5 shows experimental results on large-scale English and Chinese corpus across domains, respectively. The conclusion is given in Section 6. Daume III (2007) further augments the feature space on the instances of both domains. Jiang and Zhai (2006) exploit the domain structure contained in the training examples to avoid over-fitting the training domains. Arnold et al. (2008) exploit feature hierarchy for transfer learning in NER. Instance weighting (Jiang and Zhai, 2007) and active learning (Chan and Ng, 2007) are also employed in domain adaptation. Most of these approaches need the labeled target domain samples for the model estimation in the domain transfer. Obviously, they require much efforts for labeling the target domain samples. Some approaches exploit the common structure of related problems. Ando et al. (2005) learn predicative structures from multiple tasks and unlabeled data. Blitzer et al. (2006, 2007) employ structural corresponding learning (SCL) to infer a good feature representation from unlabeled source and target data sets in the domain transfer. We present LaSA model to overcome the data gap across domains by capturing latent semantic association among words from unlabeled source and target data. In addition, Miller et al. (2004) and Freitag (2004) employ distributional and hierarchical clustering methods to improve the performance of NER within a single domain. Li and McCallum (2005) present a semi-supervised sequence modeling with syntactic topic models. In this paper, we focus on capturing hidden semantic association among words in the domain adaptation. 3 Domain Adaptation Based on Latent Semantic Association 2 Related Works Some domain adaptation techniques have been employed in NLP in recent years. Some of them focus on quantifying the generalizability of certain features across domains. Roark and Bacchiani (2003) use maximum a posteriori (MAP) estimation to combine training data from the source and target domains. Chelba and Acero (2004) use the parameters of the source domain maximum entropy classifier as the means of a Gaussian prior when training a new model on the target data. Daume III and Marcu (2006) use an empirical Bayes model to estimate a latent variable model grouping instances into domain-specific or common across both domains. The challenge in domain adaptation is how to capture latent semantic association from the source and target domain data. We present a LaSA-based domain adaptation method in this section. NER can be considered as a classification problem. Let X be a feature space to represent the observed word instances, and let Y be the set of class labels. Let ps (x, y) and pt (x, y) be the true underlying distributions for the source and the target domains, respectively. 
In order to minimize the efforts required in the domain transfer, we often expect to use ps (x, y) to approximate pt (x, y). However, data distribution are often varied with the domains. For example, in the economics-to- 282 entertainment domain transfer, although many NE triggers (e.g. "company" and "Mr.") are used in both domains, some are totally new, like "dancer", "singer". Moreover, many useful words (e.g. "economist") in the economics NER are useless in the entertainment domain. The above examples show that features could change behavior across domains. Some useful predictive features from one domain are not predictive or do not appear in another domain. Although some triggers (e.g. "singer", "economist") are completely distinct for each domain, they often appear in the similar syntactic and semantic context. For example, triggers of person entity often appear as the subject of "visited", "said", etc, or are modified by "excellent", "popular", "famous" etc. Such latent semantic association among words provides useful hints for overcoming the data distribution gap of both domains. Hence, we present a LaSA model s,t to capture latent semantic association among words in the domain adaptation. s,t is learned from the unlabeled source and target domain data. Each instance is characterized by its co-occurred context distribution in the learning. Semantic association feature in s,t is a hidden random variable that is inferred from data. In the domain adaptation, we transfer the problem of semantic association mapping to a posterior inference task using LaSA model. Latent semantic concept association set of a word instance x (denoted by SA(x)) is generated by s,t . Instances in the same concept set are considered as behaving in the same way for discriminative learning in both domains. Even though word instances do not appear in a training corpus (or appear rarely) but are in similar context, they still might have relatively high probability in the same semantic concept set. Obviously, SA(x) can better bridge the gap between the two distributions ps (y|x) and pt (y|x). Hence, LaSA model can enhance the estimate of the source domain distribution ps (y|x; s,t ) to better approximate the target domain distribution pt (y|x; s,t ). to build LaSA model from words and their context snippets in this section. LaSA model actually can be considered as a general probabilistic topic model. It can be learned on the unlabeled corpus using the popular hidden topic models such as LDA or pLSI. 4.1 Virtual Context Document The distribution of content words (e.g. nouns, adjectives) is usually varied with domains. Hence, in the domain adaptation, we focus on capturing the latent semantic association among content words. In order to learn latent relationships among words from the unlabeled corpus, each content word is characterized by a virtual context document as follows. Given a content word xi , the virtual context document of xi (denoted by vdxi ) consists of all the context units around xi in the corpus. Let n be the total number of the sentences which contain xi in the corpus. vdxi is constructed as follows. where, F (xsk ) denotes the context feature set of i xi in the sentence sk , 1 k n. Given the context window size {-t, t} (i.e. previous t words and next t words around xi in sk ). F (xsk ) usually consists of the following features. i 1. Anchor unit Axi : the current focused word unit xi . C 2. Left adjacent unit Axi : The nearest left adjacent L unit xi-1 around xi , denoted by AL (xi-1 ). 3. 
Right adjacent unit Axi : The nearest right adjacent R unit xi+1 around xi , denoted by AR (xi+1 ). x 4. Left context set CLi : the other left adjacent units {xi-t , ..., xi-j , ..., xi-2 } (2 j t) around xi , denoted by {CL (xi-t ), ..., CL (xi-j ), ..., CL (xi-2 )}. x 5. Right context set CRi : the other right adjacent units {xi+2 , ..., xi+j , ..., xi+t } (2 j t ) around xi , denoted by {CR (xi+2 ), ..., CR (xi+j ), ..., CR (xi+t )}. vdxi = {F (xs1 ), ..., F (xsk ), ..., F (xsn )} i i i 4 Learning LaSA Model from Virtual Context Documents In the domain adaptation, LaSA model is employed to find the latent semantic association structures of "words" in a text corpus. We will illustrate how For example, given xi ="singer", sk ="This popular new singer attended the new year party". Let the context window size be {-3,3}. F (singer) = {singer, AL (new), AR (attend(ed)), CL (this), CL (popular), CR (the), CR (new) }. vdxi actually describes the semantic and syntactic feature distribution of xi in the domains. We construct the feature vector of xi with all the observed context features in vdxi . Given vdxi = 283 {f1 , ..., fj , ..., fm }, fj denotes jth context feature around xi , 1 j m, m denotes the total number of features in vdxi . The value of fj is calculated by Mutual Information (Church and Hanks, 1990) between xi and fj . W eight(fj , xi ) = log2 P (fj , xi ) P (fj )P (xi ) (1) Algorithm 1: LaSA Model Training 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Inputs: · Unlabeled data set: Du ; Outputs: · LaSA model: s,t ; Initialization: · Virtual context document set: V Ds,t = ; · Candidate content word set: Xs,t = ; Steps: begin foreach content word xi Du do if Frequency(xi ) the predefined threshold then AddT o(xi , Xs,t ); foreach xk Xs,t do foreach sentence Si Du do if xk Si then {xk , ALk , ARk , CLk , CRk }; S AddT o(F (x i ), vdxk ); k where, P (fj , xi ) is the joint probability of xi and fj co-occurred in the corpus, P (fj ) is the probability of fj occurred in the corpus. P (xi ) is the probability of xi occurred in the corpus. 4.2 Learning LaSA Model Topic models are statistical models of text that posit a hidden space of topics in which the corpus is embedded (Blei et al., 2003). LDA (Blei et al., 2003) is a probabilistic model that can be used to model and discover underlying topic structures of documents. LDA assumes that there are K "topics", multinomial distributions over words, which describes a collection. Each document exhibits multiple topics, and each word in each document is associated with one of them. LDA imposes a Dirichlet distribution on the topic mixture weights corresponding to the documents in the corpus. The topics derived by LDA seem to possess semantic coherence. Those words with similar semantics are likely to occur in the same topic. Since the number of LDA model parameters depends only on the number of topic mixtures and vocabulary size, LDA is less prone to over-fitting and is capable of estimating the probability of unobserved test documents. LDA is already successfully applied to enhance document representations in text classification (Blei et al., 2003), information retrieval (Wei and Croft., 2006). In the following, we illustrate how to construct LDA-style LaSA model s,t on the virtual context documents. Algorithm 1 describes LaSA model training method in detail, where, Function AddT o(data, Set) denotes that data is added to Set. 
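To make the virtual-context-document construction and the Mutual Information weighting of Equation 1 concrete, here is a small Python sketch; the context window, the feature-name prefixes, and the toy sentences are assumptions for illustration, and the probability estimates are simple relative frequencies rather than whatever estimator the authors used.

import math
from collections import Counter, defaultdict

def context_features(tokens, i, t=3):
    # Anchor unit, nearest left/right adjacent units, and the remaining left/right
    # context units within a {-t, t} window, as described in Section 4.1.
    feats = [tokens[i]]
    if i > 0:
        feats.append("AL=" + tokens[i - 1])
    if i + 1 < len(tokens):
        feats.append("AR=" + tokens[i + 1])
    feats += ["CL=" + w for w in tokens[max(0, i - t):max(0, i - 1)]]
    feats += ["CR=" + w for w in tokens[i + 2:i + 1 + t]]
    return feats

def virtual_documents(sentences, targets, t=3):
    # vd_x: all context features of x, pooled over the sentences containing x.
    vdocs = defaultdict(list)
    for sent in sentences:
        for i, w in enumerate(sent):
            if w in targets:
                vdocs[w].extend(context_features(sent, i, t))
    return vdocs

def pmi_weights(vdocs):
    # Weight(f, x) = log2 P(f, x) / (P(f) P(x)), estimated by relative frequency.
    word_count = Counter({w: len(fs) for w, fs in vdocs.items()})
    feat_count = Counter(f for fs in vdocs.values() for f in fs)
    total = sum(word_count.values())
    weights = {}
    for w, fs in vdocs.items():
        pair = Counter(fs)
        weights[w] = {
            f: math.log2((pair[f] / total) /
                         ((feat_count[f] / total) * (word_count[w] / total)))
            for f in pair
        }
    return weights

sents = [["this", "popular", "new", "singer", "attended", "the", "party"],
         ["the", "famous", "dancer", "attended", "the", "ceremony"]]
vd = virtual_documents(sents, targets={"singer", "dancer"})
print(pmi_weights(vd)["singer"])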
Given a large-scale unlabeled data set Du which consists of the source and target domain data, virtual context document for each candidate content word is extracted from Du at first, then the value of each feature in a virtual context document is calculated using its Mutual Information ( see Equation 1 in Section 4.1) instead of the counts when running F (xk i ) - x S x x x 17 18 19 AddT o(vdxk , V Ds,t ); end · Generate LaSA model s,t with Dirichlet distribution on V Ds,t . LDA. LaSA model s,t with Dirichlet distribution is generated on the virtual context document set V Ds,t using the algorithm presented by Blei et al (2003). 1 customer president singer manager economist policeman reporter director consumer dancer 2 theater showplace courtyard center city gymnasium airport square park building 3 company government university community team enterprise bank market organization agency 4 Beijing Hongkong China Japan Singapore New York Vienna America Korea international 5 music film arts concert party Ballet dance song band opera Table 1: Top 10 nouns from 5 randomly selected topics computed on the economics and entertainment domains LaSA model learns the posterior distribution to decompose words and their corresponding virtual context documents into topics. Table 1 lists top 10 nouns from a random selection of 5 topics computed on the unlabeled economics and entertainment domain data. As shown, words in the same topic are representative nouns. They actually are grouped into broad concept sets. For example, set 1, 3 and 4 correspond to nominal person, nominal organization and location, respectively. With a large-scale unlabeled corpus, we will have enough words assigned to each topic concept to better approximate the underlying semantic association distribution. In LDA-style LaSA model, the topic mixture is drawn from a conjugate Dirichlet prior that remains the same for all the virtual context docu- 284 ments. Hence, given a word xi in the corpus, we may perform posterior inference to determine the conditional distribution of the hidden topic feature variables associated with xi . Latent semantic association set of xi (denoted by SA(xi )) is generated using Algorithm 2. Here, Multinomial(s,t (vdxi )) refers to sample from the posterior distribution over topics given a virtual document vdxi . In the domain adaptation, we do semantic association inference on the source domain training data using LaSA model at first, then the original source domain NER model is tuned on the source domain training data set by incorporating these generated semantic association features. Algorithm 2: Generate Latent Semantic Association Set of Word xi Using K-topic LaSA Model 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Inputs: · s,t : LaSA model with multinomial distribution; · Dirichlet(): Dirichlet distribution with parameter ; · xi : Content word; Outputs: · SA(xi ): Latent semantic association set of xi ; Steps: begin · Extract vdxi from the corpus. · Draw topic weights s,t (vdxi ) from Dirichlet(); · foreach fj in vdxi do draw a topic zj { 1,...,K} from Multinomial(s,t (vdxi )); AddT o(zj , T opics(vdxi )); · Rank all the topics in T opics(vdxi ); · SA(xi ) top n topics in T opics(vdxi ); - 5.1 Experimental setting In the NER domain adaptation, nouns and adjectives make a significant impact on the performance. Thus, we focus on capturing latent semantic association for high-frequency nouns and adjectives (i.e. occurrence count 50 ) in the unlabeled corpus. 
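A rough way to reproduce the LDA-style step described above is sketched below, with scikit-learn's LatentDirichletAllocation standing in for the authors' LDA training: fit K topics over the virtual context documents, then read off the top-n topics of a word's virtual document as its semantic-association set, in the spirit of Algorithm 2. Note that this sketch runs LDA on raw feature counts, whereas Algorithm 1 plugs in Mutual Information values; the vectorizer, K = 50, and n = 3 are placeholder choices.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def train_lasa(virtual_docs, num_topics=50):
    # virtual_docs: {word: list of context-feature strings} from the corpus.
    words = list(virtual_docs)
    texts = [" ".join(virtual_docs[w]) for w in words]
    vec = CountVectorizer(token_pattern=r"\S+", lowercase=False)
    X = vec.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=num_topics, random_state=0)
    theta = lda.fit_transform(X)   # topic mixture of each word's virtual document
    return words, vec, lda, theta

def semantic_association(word, words, theta, n=3):
    # SA(x): the top-n topics of x's virtual context document.
    row = theta[words.index(word)]
    return list(np.argsort(row)[::-1][:n])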
LaSA models for nouns and adjectives are learned from the unlabeled corpus using Algorithm 1 (see section 4.2), respectively. Our empirical study shows that better adaptation is obtained with a 50-topic LaSA model. Therefore, we set the number of topics N as 50, and define the context view window size as {3,3} (i.e. previous 3 words and next 3 words) in the LaSA model learning. LaSA features for other irrespective words (e.g. token unit "the") are assigned with a default topic value N +1. All the basic NER models are trained on the domain-specific training data using RRM classifier (Guo et al., 2005). RRM is a generalization Winnow learning algorithm (Zhang et al., 2002). We set the context view window size as {-2,2} in NER. Given a word instance x, we employ local linguistic features (e.g. word unit, part of speech) of x and its context units ( i.e. previous 2 words and next 2 words ) in NER. All Chinese texts in the experiments are automatically segmented into words using HMM. In LaSA-based domain adaptation, the semantic association features of each unit in the observation window {-2,2} are generated by LaSA model at first, then the basic source domain NER model is tuned on the original source domain training data set by incorporating the semantic association features. For example, given the sentence "This popular new singer attended the new year party", Figure 1 illustrates various features and views at the current word wi = "singer" in LaSA-based adaptation. Position Word POS SA ..... Tag wi-2 popular adj SA(popular) ti-2 Tagging wi-1 wi new singer adj noun SA(new) SA(singer) ti-1 ti wi+1 attend verb SA(attend) wi+2 the article SA(the) end LaSA model better models latent semantic association distribution in the source and the target domains. By grouping words into concepts, we effectively overcome the data distribution difference of both domains. Thus, we may reduce the number of parameters required to model the target domain data, and improve the quality of the estimated parameters in the domain transfer. LaSA model extends the traditional bag-of-words topic models to context-dependence concept association model. It has potential use for concept grouping. 5 Experiments We evaluate LaSA-based domain adaptation method on both English and Chinese corpus in this section. In the experiments, we focus on recognizing person (PER), location (LOC) and organization (ORG) in the given four domains, including economics (Eco), entertainment (Ent), politics (Pol) and sports (Spo). Figure 1: Feature window in LaSA-based adaptation In the viewing window at the word "singer" (see Figure 1), each word unit around "singer" is codified with a set of primitive features (e.g. P OS, SA, T ag), together with its relative position to "singer". 285 Here, "SA" denotes semantic association feature set which is generated by LaSA model. "T ag" denotes NE tags labeled in the data set. Given the input vector constructed with the above features, RRM method is then applied to train linear weight vectors, one for each possible class-label. In the decoding stage, the class with the maximum confidence is then selected for each token unit. In our evaluation, only NEs with correct boundaries and correct class labels are considered as the correct recognition. We use the standard Precision 2P R (P), Recall (R), and F-measure (F = P +R ) to measure the performance of NER models. 5.2 Data We built large-scale English and Chinese annotated corpus. 
English corpus are generated from wikipedia while Chinese corpus are selected from Chinese newspapers. Moreover, test data do not overlap with training data and unlabeled data. 5.2.1 Generate English Annotated Corpus from Wikipedia Wikipedia provides a variety of data resources for NER and other NLP research (Richman and Schone, 2008). We generate all the annotated English corpus from wikipedia. With the limitation of efforts, only PER NEs in the corpus are automatically tagged using an English person gazetteer. We automatically extract an English Person gazetteer from wikipedia at first. Then we select the articles from wikipedia and tag them using this gazetteer. In order to build the English Person gazetteer from wikipdedia, we manually selected several key phrases, including "births", "deaths", "surname", "given names" and "human names" at first. For each article title of interest, we extracted the categories to which that entry was assigned. The entry is considered as a person name if its related explicit category links contain any one of the key phrases, such as "Category: human names". We totally extracted 25,219 person name candidates from 204,882 wikipedia articles. And we expanded this gazetteer by adding the other available common person names. Finally, we obtained a large-scale gazetteer of 51,253 person names. All the articles selected from wikipedia are further tagged using the above large-scale gazetteer. Since human annotated set were not available, we held out more than 100,000 words of text from the automatically tagged corpus to as a test set in each domain. Table 2 shows the data distribution of the training and test data sets. Domains Pol Eco Spo Ent Training Data Set Size PERs 0.45M 9,383 1.06M 21,023 0.47M 17,727 0.36M 12,821 Test Data Set Size PERs 0.23M 6,067 0.34M 6,951 0.20M 6,075 0.15M 5,395 Table 2: English training and test data sets We also randomly select 17M unlabeled English data (see Table 3) from Wikipedia. These unlabeled data are used to build the English LaSA model. All Data Size(M) 17.06 Pol 7.36 Domain Eco Spo 2.59 3.65 Ent 3.46 Table 3: Domain distribution in the unlabeled English data set 5.2.2 Chinese Data We built a large-scale high-quality Chinese NE annotated corpus. All the data are news articles from several Chinese newspapers in 2001 and 2002. All the NEs (i.e. PER, LOC and ORG ) in the corpus are manually tagged. Cross-validation checking is employed to ensure the quality of the annotated corpus. Domain Pol Eco Spo Ent Domain Pol Eco Spo Ent Size (M) 0.90 1.40 0.60 0.60 Size (M) 0.20 0.26 0.10 0.10 PER 11,388 6,821 11,647 12,954 PER 2,470 1,098 1,802 2,458 NEs in the training data set ORG LOC 6,618 14,350 18,827 14,332 8,105 7,468 2,823 4,665 NEs in the test data set ORG LOC 1,528 2,540 2,971 2,362 1,323 1,246 526 738 Total 32,356 39,980 27,220 20,442 Total 6,538 6,431 4,371 3,722 Table 4: Chinese training and test data sets All the domain-specific training and test data are selected from this annotated corpus according to the domain categories (see Table 4). 8.46M unlabeled Chinese data (see Table 5) are randomly selected from this corpus to build the Chinese LaSA model. 5.3 Experimental Results All the experiments are conducted on the above large-scale English and Chinese corpus. 
The overall performance enhancement of NER by LaSA-based 286 All Data Size(M) 8.46 Pol 2.34 Domain Eco Spo 1.99 2.08 Ent 2.05 Source Target EcoEnt PolEnt SpoEnt EntEco PolEco SpoEco EcoPol EntPol SpoPol EcoSpo EntSpo PolSpo FBase 60.45% 69.89% 68.66% 58.50% 62.89% 60.44% 67.03% 66.64 % 65.40% 67.20% 70.05% 70.99% Performance in the domain transfer FLaSA 66.42% 73.07% 70.89% 61.35% 64.93% 63.20% 70.90 % 68.94 % 67.20% 70.77% 72.20% 73.86% (F ) +9.88% +4.55% +3.25% + 4.87% +3.24% + 4.57 % +5.77% +3.45% +2.75% +5.31% +3.07% +4.04% (loss) 26.29% 23.96% 15.38% 11.98% 10.52% 12.64% 27.78% 16.06% 11.57% 15.47% 10.64% 14.91% FT op in FEnt =83.16% in FEnt =83.16% in FEnt =83.16% in FEco =82.28% in FEco =82.28% in FEco =82.28% in FP ol =80.96% in FP ol =80.96% in FP ol =80.96% in FSpo =90.24% in FSpo =90.24% in FSpo =90.24% Table 5: Domain distribution in the unlabeled Chinese data set domain adaptation is evaluated at first. Since the distribution of each NE type is different across domains, we also analyze the performance enhancement on each entity type by LaSA-based adaptation. 5.3.1 Performance Enhancement of NER by LaSA-based Domain Adaptation Table 6 and 7 show the experimental results for all pairs of domain adaptation on both English and Chinese corpus, respectively. In the experiment, the basic source domain NER model Ms is learned from the specific domain training data set Ddom (see Table 2 and 4 in Section 5.2). Here, dom in {Eco, Ent, P ol, Spo}. Fdom denotes the top-line F-measure of Ms in the source trained domain dom. When Ms is directly applied in a new target domain, its F-measure in this basic transfer is considered as baseline (denoted by FBase ). FLaSA denotes F-measure of Ms achieved in the target domain with LaSA-based domain adaptation. (F ) = FLaSA -FBase , which denotes the relative F-measure FBase enhancement by LaSA-based domain adaptation. Source Target EcoEnt PolEnt SpoEnt EntEco PolEco SpoEco EcoPol EntPol SpoPol EcoSpo EntSpo PolSpo FBase 57.61% 57.5 % 58.66% 70.56 % 63.62% 70.35% 50.59% 56.12% 60.22% 60.28% 60.28% 56.94% Performance in the domain transfer FLaSA 59.22% 59.83% 62.46% 72.46% 68.1% 72.85% 52.7% 59.82% 62.6% 61.21% 62.68% 60.48% (F ) +2.79% +4.05% +6.48% +2.69% +7.04% +3.55% +4.17% +6.59% +3.95% +1.54% +3.98% +6.22% (loss) 17.87% 25.55% 47.74% 19.33% 26.71% 24.90% 15.81% 47.31% 63.98% 9.93% 25.61% 27.85% FT op in FEnt =66.62% in FEnt =66.62% in FEnt =66.62% in FEco =80.39% in FEco =80.39% in FEco =80.39% in FP ol =63.94% in FP ol =63.94% in FP ol =63.94% in FSpo =69.65% in FSpo =69.65% in FSpo =69.65% Table 7: Experimental results on Chinese corpus cent points in this basic transfer. Significant performance degrading of Ms is observed in all the basic transfer. It shows that the data distribution of both domains is very different in each possible transfer. Experimental results on English corpus show that LaSA-based adaptation effectively enhances the performance in each domain transfer (see Table 6). For example, in the "PolEco" transfer, FBase is 63.62% while FLaSA achieves 68.10%. Compared with FBase , LaSA-based method significantly enhances F-measure by 7.04%. We perform t-tests on F-measure of all the comparison experiments on English corpus. The p-value is 2.44E-06, which shows that the improvement is statistically significant. Table 6 also gives the accuracy loss due to transfer in each domain adaptation on English corpus. The F accuracy loss is defined as loss = 1 - F in . 
And dom Table 6: Experimental results on English corpus Experimental results on English and Chinese corpus indicate that the performance of Ms significantly degrades in each basic domain transfer without using LaSA model (see Table 6 and 7). For example, in the "EcoEnt" transfer on Chinese corin pus (see Table 7), Feco of Ms is 82.28% while FBase of Ms is 60.45% in the entertainment domain. Fmeasure of Ms significantly degrades by 21.83 per- the relative reduction in error is defined as (loss)= |1 - lossLaSA |. Experimental results indicate that lossBase the relative reduction in error is above 9.93% with LaSA-based transfer in each test on English corpus. LaSA model significantly decreases the accuracy loss by 29.38% in average. Especially for "SpoPol" transfer, (loss) achieves 63.98% with LaSA-based adaptation. All the above results show that LaSA-based adaptation significantly reduces the accuracy loss in the domain transfer for English NER without any labeled target domain samples. Experimental results on Chinese corpus also show that LaSA-based adaptation effectively increases the accuracy in all the tests (see Table 7). For example, in the "EcoEnt" transfer, compared with FBase , LaSA-based adaptation significantly increases Fmeasure by 9.88%. We also perform t-tests on F- 287 measure of 12 comparison experiments on Chinese corpus. The p-value is 1.99E-06, which shows that the enhancement is statistically significant. Moreover, the relative reduction in error is above 10% with LaSA-based method in each test. LaSA model decreases the accuracy loss by 16.43% in average. Especially for the "EcoEnt" transfer (see Table 7), (loss) achieves 26.29% with LaSA-based method. All the above experimental results on English and Chinese corpus show that LaSA-based domain adaptation significantly decreases the accuracy loss in the transfer without any labeled target domain data. Although automatically tagging introduced some errors in English source training data, the relative reduction in errors in English NER adaptation seems comparable to that one in Chinese NER adaptation. 5.3.2 Accuracy Enhancement for Each NE Type Recognition Our statistic data (Guo et al., 2006) show that the distribution of NE types varies with domains. Each NE type has different domain features. Thus, the performance stability of each NE type recognition is very important in the domain transfer. Figure 2 gives F-measure of each NE type recognition achieved by LaSA-based adaptation on English and Chinese corpus. Experimental results show that LaSA-based adaptation effectively increases the accuracy of each NE type recognition in the most of the domain transfer tests. We perform t-tests on F-measure of the comparison experiments on each NE type, respectively. All the p-value is less than 0.01, which shows that the improvement on each NE type recognition is statistically significant. Especially, the p-value of English and Chinese PER is 2.44E-06 and 9.43E-05, respectively, which shows that the improvement on PER recognition is very significant. For example, in the "EcoPol" transfer on Chinese corpus, compared with FBase , LaSA-based adaptation enhances F-measure of PER recognition by 9.53 percent points. Performance enhancement for ORG recognition is less than that one for PER and LOC recognition using LaSA model since ORG NEs usually contain much more domainspecific information than PER and LOC. 
Figure 2: PER, LOC and ORG recognition in the transfer

The major reason for the error reduction is that external contexts and internal units are better semantically associated by the LaSA model. For example, the LaSA model better groups various titles from different domains (see Table 1 in Section 4.2), and various industry terms in ORG NEs are also grouped into the semantic sets. These semantic associations provide useful hints for detecting the boundaries of NEs in the new target domain. All of the above results show that the LaSA model better compensates for the feature distribution differences of each NE type across domains.

6 Conclusion
We present a domain adaptation method based on the LaSA model in this paper. The LaSA model captures latent semantic association among words from the unlabeled corpus and groups words into a set of concepts according to the related context snippets. The LaSA-based domain adaptation method projects words into a low-dimensional concept feature space in the transfer, which effectively overcomes the data distribution gap across domains without using any labeled target-domain data. Experimental results on English and Chinese corpora show that LaSA-based domain adaptation significantly enhances the performance of NER across domains; in particular, the LaSA model effectively increases the accuracy of each NE type recognition in the domain transfer. Moreover, the LaSA-based domain adaptation method works well across languages. To further reduce the accuracy loss, we will explore informative sampling to capture fine-grained data differences in the domain transfer.

References
Rie Ando and Tong Zhang. 2005. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research, 6:1817–1853.
Andrew Arnold, Ramesh Nallapati, and William W. Cohen. 2008. Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL'08), pages 245–253.
David Blei, Andrew Ng, and Michael Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.
John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain Adaptation with Structural Correspondence Learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 120–128.
John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL'07), pages 440–447.
Andrew Borthwick. 1999. A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis, New York University.
Yee Seng Chan and Hwee Tou Ng. 2007. Domain Adaptation with Active Learning for Word Sense Disambiguation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL'07).
Ciprian Chelba and Alex Acero. 2004. Adaptation of maximum entropy capitalizer: Little data can help a lot. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.
Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information and lexicography. Computational Linguistics, 16(1):22–29.
Hal Daume III. 2007. Frustratingly Easy Domain Adaptation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.
Hal Daume III and Daniel Marcu. 2006. Domain adaptation for statistical classifiers.
Journal of Artificial Intelligence Research, 26:101­126. Scott Deerwester, Susan T. Dumais, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391­407. Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named entity recogintion through classifier combination. In Proceedings of the 2003 Conference on Computational Natural Language Learning. Freitag. 2004. Trained Named Entity Recognition Using Distributional Clusters. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004). Scott Miller, Jethran Guinness, and Alex Zamanian. 2004. Name Tagging with Word Clusters and Discriminative Training. In Proceedings of HLT-NAACL 04. Jianfeng Gao, Mu Li, Anndy Wu, and Changning Huang. 2005. Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach. Computational Linguisitc, 31(4):531­574. Honglei Guo, Jianmin Jiang, Gang Hu, and Tong Zhang. 2005. Chinese Named Entity Recognition Based on Multilevel Linguistic Features. In Lecture Notes in Artificial Intelligence, 3248:90­99. Honglei Guo, Li Zhang, and Zhong Su. 2006. Empirical Study on the Performance Stability of Named Entity Recognition Model across Domains. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 509516. Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22th Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99). Jing Jiang and ChengXiang Zhai. 2006. Exploiting Domain Structure for Named Entity Recognition. In Proceedings of HLT-NAACL 2006, pages 74­81. Jing Jiang and ChengXiang Zhai. 2007. Instance Weighting for Domain Adaptation in NLP. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL'07), pages 264­271. Wei Li and Andrew McCallum. 2005. Semi-supervised sequence modeling with syntactic topic models. In Proceedings of Twenty AAAI Conference on Artificial Intelligence (AAAI-05). Alexander E. Richman and Patrick Schone. 2008. Mining Wiki Resources for Multilingual Named Entity Recognition. In Proceedings of the 46th Annual Meeting of the Association of Computational Linguistics. Brian Roark and Michiel Bacchiani. 2003. Supervised and unsupervised PCFG adaptation to novel domains. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language independent named entity recognition. In Proceedings of the 2003 Conference on Computational Natural Language Learning (CoNLL-2003), pages 142­147. Xing Wei and Bruce Croft. 2006. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International SIGIR Conference on Research and Development in Information Retrieval. Tong Zhang, Fred Damerau, and David Johnson. 2002 Text chunking based on a generalization of Winnow. Journal of Machine Learning Research, 2:615­637. 289 Semi-Automatic Entity Set Refinement Vishnu Vyas and Patrick Pantel Yahoo! Labs Santa Clara, CA 95054 {vishnu,ppantel}@yahoo-inc.com Abstract State of the art set expansion algorithms produce varying quality expansions for different entity types. Even for the highest quality expansions, errors still occur and manual refinements are necessary for most practical uses. 
In this paper, we propose algorithms to aide this refinement process, greatly reducing the amount of manual labor required. The methods rely on the fact that most expansion errors are systematic, often stemming from the fact that some seed elements are ambiguous. Using our methods, empirical evidence shows that average R-precision over random entity sets improves by 26% to 51% when given from 5 to 10 manually tagged errors. Both proposed refinement models have linear time complexity in set size allowing for practical online use in set expansion systems. et al. 2008). Semi-supervised approaches are often used in practice since they allow for targeting specific entity classes such as European Cities and French Impressionist Painters. Methods differ in complexity from simple ones using lexicosyntactic patterns (Hearst 1992) to more complicated techniques based on distributional similarity (Paca 2007a). Even for state of the art methods, expansion errors inevitably occur and manual refinements are necessary for most practical uses requiring high precision (such as for query interpretation at commercial search engines). Looking at expansions from state of the art systems such as GoogleSets1 , we found systematic errors such as those resulting from ambiguous seed instances. For example, consider the following seed instances for the target set Roman Gods: Minerva, Neptune, Baccus, Juno, Apollo 1 Introduction Sets of named entities are extremely useful in a variety of natural language and information retrieval tasks. For example, companies such as Yahoo! and Google maintain sets of named entities such as cities, products and celebrities to improve search engine relevance. Manually creating and maintaining large sets of named entities is expensive and laborious. In response, many automatic and semi-automatic methods of creating sets of named entities have been proposed, some are supervised (Zhou and Su, 2001), unsupervised (Pantel and Lin 2002, Nadeau et al. 2006), and others semi-supervised (Kozareva 290 GoogleSet's expansion as well others employing distributional expansion techniques consists of a mishmash of Roman Gods and celestial bodies, originating most likely from the fact that Neptune is both a Roman God and a Planet. Below is an excerpt of the GoogleSet expansion: Mars, Venus, *Moon, Mercury, *asteroid, Jupiter, *Earth, *comet, *Sonne, *Sun, ... The inherent semantic similarity between the errors can be leveraged to quickly clean up the expansion. For example, given a manually tagged error "asteroid", a distributional similarity thesaurus 1 http://labs.google.com/sets Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 290­298, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics such as (Lin 1998)2 can identify comet as similar to asteroid and therefore potentially also as an error. This method has its limitations since a manually tagged error such as Earth would correctly remove Moon and Sun, but it would also incorrectly remove Mars, Venus and Jupiter since they are also similar to Earth3. In this paper, we propose two algorithms to improve the precision of automatically expanded entity sets by using minimal human negative judgments. The algorithms leverage the fact that set expansion errors are systematically caused by ambiguous seed instances which attract incorrect instances of an unintended entity type. 
We use distributional similarity and sense feature modeling to identify such unintended entity types in order to quickly clean up errors with minimal manual labor. We show empirical evidence that average Rprecision over random entity sets improves by 26% to 51% when given from 5 to 10 manually tagged errors. Both proposed refinement models have linear time complexity in set size allowing for practical online use in set expansion systems. The remainder of this paper is organized as follows. In the next section we review related work and position our contribution within its landscape. Section 3 presents our task of dynamically modeling the similarity of a set of words and describes algorithms for refining sets of named entities. The datasets and our evaluation methodology used to perform our experiments are presented in Section 4 and in Section 5 we describe experimental results. Finally, we conclude with some discussion and future work. 2 Related Work There is a large body of work for automatically building sets of named entities using various techniques including supervised, unsupervised and semi-supervised methods. Supervised techniques use large amounts of training data to detect and classify entities into coarse grained classes such as People, Organizations, and Places (Bunescu and Mooney 2004; Etzioni et al. 2005). On the other hand, unsupervised methods require no training See http://demo.patrickpantel.com/ for a demonstration of the distributional thesaurus. 3 In practice, this problem is rare since most terms that are similar in one of their senses tend not to be similar in their other senses. 2 data and rely on approaches such as clustering, targeted patterns and co-occurrences to extract sets of entities (Pantel and Lin 2002; Downey et al. 2007). Semi-supervised approaches are often used in practice since they allow for targeting specific entity classes. These methods rely on a small set of seed examples to extract sets of entities. They either are based on distributional approaches or employ lexico-syntactic patterns to expand a small set of seeds to a larger set of candidate expansions. Some methods such as (Riloff and Shepherd 1997; Riloff and Jones 1999; Banko et al. 2007;Paca 2007a) use lexico-syntactic patterns to expand a set of seeds from web text and query logs. Others such as (Paca et al. 2006; Paca 2007b; Paca and Durme 2008) use distributional approaches. Wang and Cohen (2007) use structural cues in semistructured text to expand sets of seed elements. In all methods however, expansion errors inevitably occur. This paper focuses on the task of post processing any such system's expansion output using minimal human judgments in order to remove expansion errors. Using user feedback to improve a system's performance is a common theme within many information retrieval and machine learning tasks. One form of user feedback is active learning (Cohn et al. 1994), where one or more classifiers are used to focus human annotation efforts on the most beneficial test cases. Active learning has been successfully applied to various natural language tasks such as parsing (Tang et al. 2001), POS tagging (Dagan and Engelson 1995) and providing large amounts of annotations for common natural language processing tasks such as word sense disambiguation (Banko and Brill 2001). Relevance feedback is another popular feedback paradigm commonly used in information retrieval (Harman 1992), where user feedback (either explicit or implicit) is used to refine the search results of an IR system. 
Relevance feedback has been successfully applied to many IR applications including content-based image retrieval (Zhouand Huang 2003) and web search (Vishwa et al. 2005). Within NLP applications relevance feedback has also been used to generate sense tagged examples for WSD tasks (Stevenson et al. 2008), and Question Answering (Negri 2004). Our methods use relevance feedback in the form of negative examples to refine the results of a set expansion system. 291 3 Dynamic Similarity Modeling The set expansion algorithms discussed in Section 2 often produce high quality entity sets, however inevitably errors are introduced. Applications requiring high precision sets must invest significantly in editorial efforts to clean up the sets. Although companies like Yahoo! and Google can afford to routinely support such manual labor, there is a large opportunity to reduce the refinement cost (i.e., number of required human judgments). Recall the set expansion example of Roman Gods from Section 1. Key to our approach is the hypothesis that most expansion errors result from some systematic cause. Manual inspection of expansions from GoogleSets and distributional set expansion techniques revealed that most errors are due to the inherent ambiguity of seed terms (such as Neptune in our example) and data sparseness (such as Sonne in our example, a very rare term). The former kind of error is systematic and can be leveraged by an automatic method by assuming that any entity semantically similar to an identified error will also be erroneous. In this section, we propose two methods for leveraging this hypothesis. In the first method, described in Section 3.1, we use a simple distributional thesaurus and remove all entities which are distributionally similar to manually identified errors. In the second method, described in Section 3.2, we model the semantics of the seeds using distributional features and then dynamically change the feature space according to the manually identified errors and rerank the entities in the set. Both methods rely on the following two observations: a) Many expansion errors are systematically caused by ambiguous seed examples which draw in several incorrect entities of its unintended senses (such as seed Neptune in our Roman Gods example which drew in celestial bodies such as Earth and Sun); b) Entities which are similar in one sense are usually not similar in their other senses. For example, Apple and Sun are similar in their Company sense but their other senses (Fruit and Celestial Body) are not similar. Our example in Section 1 illustrates a rare counterexample where Neptune and Mercury are similar in both their Planets and Roman Gods senses. 292 Task Outline: Our task is to remove errors from entity sets by using a minimal amount of manual judgments. Incorporating feedback into this process can be done in multiple ways. The most flexible system would allow a judge to iteratively remove as many errors as desired and then have the system automatically remove other errors in each iteration. Because it is intractable to test arbitrary numbers of manually identified errors in each iteration, we constrain the judge to identify at most one error in each iteration. Although this paper focuses solely on removing errors in an entity set, it is also possible to improve expanded sets by using feedback to add new elements to the sets. We consider this task out of scope for this paper. 3.1 Similarity Method (SIM) Our first method directly models observation a) in the previous section. 
Following Lin (1998), we model the similarity between entities using the distributional hypothesis, which states that similar terms tend to occur in similar contexts (Harris 1985). A semantic model can be obtained by recording the surrounding contexts for each term in a large collection of unstructured text. Methods differ in their definition of a context (e.g., text window or syntactic relations), or a means to weigh contexts (e.g., frequency, tf-idf, pointwise mutual information), or ultimately in measuring the similarity between two context vectors (e.g., using Euclidean distance, Cosine, Dice). In this paper, we use a text window of size 1, we weigh our contexts using pointwise mutual information, and we use the cosine score to compute the similarity between context vectors (i.e., terms). Section 5.1 describes our source corpus and extraction details. Computing the full similarity matrix for many terms over a very large corpus is computationally intensive. Our specific implementation follows the one presented in (Bayardo et al. 2007). The similarity matrix computed above is then directly used to refine entity sets. Given a manually identified error at each iteration, we automatically remove each entity in the set that is found to be semantically similar to the error. The similarity threshold was determined by manual inspection and is reported in Section 5.1. Due to observation b) in the previous section, we expect that this method will perform poorly on entity sets such as the one presented in our example of Section 1 where the manual removal of Earth would likely remove correct entities such as Mars, Venus and Jupiter. The method presented in the next section attempts to alleviate this problem. 3.2 Feature Modification Method (FMM) b) in Section 3. We showed that SIM would incorrectly remove expansions such as Mars, Venus and Jupiter given the erroneous expansion Earth. The FMM method would instead remove the Planet features from the seed feature vectors and the remaining features would still overlap with Mars, Venus and Jupiter's Roman God sense. Efficiency: FMM requires online similarity computations between centroid vectors and all elements of the expanded set. For large corpora such as Wikipedia articles or the Web, feature vectors are large and storing them in memory and performing similarity computations repeatedly for each editorial judgment is computationally intensive. For example, the size of the feature vector for a single word extracted from Wikipedia can be in the order of a few gigabytes. Storing the feature vectors for all candidate expansions and the seed set is inefficient and too slow for an interactive system. The next section proposes a solution that makes this computation very fast, requires little memory, and produces near perfect approximations of the similarity scores. Under the distributional hypothesis, the semantics of a term are captured by the contexts in which it occurs. The Feature Modification Method (FMM), in short, attempts to automatically discover the incorrect contexts of the unintended senses of seed elements and then filters out expanded terms whose contexts do not overlap with the other contexts of the seed elements. Consider the set of seed terms S and an erroneous expanded instance e. In the SIM method of Section 3.1 all set elements that have a feature vector (i.e., context vector) similar to e are removed. 
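A minimal sketch of the SIM refinement step just described, assuming context vectors (window of size 1, PMI-weighted) have already been extracted for every term; the data structures and the helper name refine_sim are illustrative assumptions, not the authors' implementation. The 0.15 threshold is the value reported in Section 5.1.

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse context vectors (dict: context -> PMI weight).
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def refine_sim(expansion, vectors, tagged_error, threshold=0.15):
    # Remove the tagged error and every candidate whose similarity to it
    # reaches the threshold; everything else is kept in its original order.
    error_vector = vectors[tagged_error]
    return [e for e in expansion
            if e != tagged_error and cosine(vectors[e], error_vector) < threshold]
```

Because removal is driven purely by similarity to the tagged error, this reproduces the bold behavior discussed above: tagging Earth would also discard candidates such as Mars, Venus and Jupiter.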
The Feature Modification Method (FMM) instead tries to identify the subset of features of the error e which represent the unintended sense of the seed terms S. For example, let S = {Minerva, Neptune, Baccus, Juno, Apollo}. Looking at the contexts of these words in a large corpus, we construct a centroid context vector for S by taking a weighted average of the contexts of the seeds in S. In Wikipedia articles we see contexts (i.e., features) such as (the full feature vector for these and all other terms in Wikipedia can be found at http://demo.patrickpantel.com/):

attack, kill, *planet, destroy, Goddess, *observe, statue, *launch, Rome, *orbit, ...

Given an erroneous expansion such as e = Earth, we postulate that removing the intersecting features from Earth's feature vector and the above feature vector will remove the unintended Planet sense of the seed set caused by the seed element Neptune. The intersecting features that are removed are marked with an asterisk in the feature vector above. The similarity between this modified feature vector for S and all entities in the expansion set can be recomputed as described in Section 3.1. Entities with a low similarity score are removed from the expanded set since they are assumed to be part of the unintended semantic class (Planet in this example). Unlike the SIM method from Section 3.1, this method is more stable with respect to observation b) in Section 3.

3.3 Approximating Cosine Similarity

There are engineering optimizations available that allow us to perform a near perfect approximation of the similarity computation from the previous section. The proposed method requires us to only store the shared features between the centroid and the words rather than the complete feature vectors, thus reducing our space requirements dramatically. Also, FMM requires us to repeatedly calculate the cosine similarity between a modified centroid feature vector and each candidate expansion at each iteration. Without the full context vectors of all candidate expansions, computing the exact cosine similarity is impossible. Given, however, the original cosine scores between the seed elements and the candidate expansions before the first refinement iteration, as well as the shared features, we can approximate with very high accuracy the updated cosine score between the modified centroid and each candidate expansion. Our method relies on the fact that features (i.e., contexts) are only ever removed from the original centroid; no new features are ever added.

Let μ be the original centroid representing the seed instances. Given an expansion error e, FMM creates a modified centroid by removing all features intersecting between e and μ. Let μ' be this modified centroid. FMM requires us to compute the similarity between μ' and all candidate expansions x as:

cos(x, μ') = Σ_i x_i μ'_i / (‖x‖ ‖μ'‖)

where i iterates over the feature space. In our efficient setting, the only element that we do not have for calculating the exact cosine similarity is the norm of x, ‖x‖. Given that we have the original cosine similarity score cos(x, μ), and that we have the shared features between the original centroid and the candidate expansion x, we can calculate ‖x‖ as:

‖x‖ = Σ_i x_i μ_i / (cos(x, μ) ‖μ‖)

Combining the two equations, we have:

cos(x, μ') = cos(x, μ) · (‖μ‖ / ‖μ'‖) · (Σ_i x_i μ'_i) / (Σ_i x_i μ_i)

In the above equation, the modified cosine score can be considered as an update to the original cosine score, where the update depends only on the shared features and the original centroid. The update equation can be used to recalculate the similarity scores without resorting to an expensive computation involving complete feature vectors. Storing the original centroid is expensive; it can instead be approximated from only the shared features between the centroid and all instances in the expanded set. We empirically tested this approximation by comparing the cosine scores between the candidate expansions and both the true centroid and the approximated centroid. The average error in cosine score was 9.5E-04 ± 7.83E-05 (95% confidence interval).

4 Datasets and Baseline Algorithm

We evaluate our algorithms against manually scraped gold standard sets, which were extracted from Wikipedia to represent a random collection of concepts. Section 4.1 discusses the gold standard sets and the criteria behind their selection. To present a statistically significant view of our results, we generated a set of trials from the gold standard sets to use as seeds for our seed set expansion algorithm. Also, in Section 4.2 we discuss how we can simulate editorial feedback using our gold standard sets.

4.1 Gold Standard Entity Sets

The gold standard sets form an essential part of our evaluation. These sets were chosen to represent a single concept, such as Countries and Archbishops of Canterbury, and were selected from the List pages of Wikipedia (in this paper, extractions from Wikipedia are taken from a snapshot of the resource in December 2007). We randomly sorted the list of every noun occurring in Wikipedia. Then, for each noun we verified whether or not it existed in a Wikipedia list, and if so we extracted this list, up to a maximum of 50 lists. If a noun belonged to multiple lists, the authors chose the list that seemed most appropriate. Although this does not generate a perfect random sample, diversity is ensured by the random selection of nouns and relevancy is ensured by the author adjudication. Lists were then scraped from the Wikipedia website and went through a manual cleanup process which included merging variants. The 50 sets contain on average 208 elements (with a minimum of 11 and a maximum of 1116 elements) for a total of 10,377 elements. The final gold standard lists contain 50 sets including classical pianists, Spanish provinces, Texas counties, male tennis players, first ladies, cocktails, bottled water brands, and Archbishops of Canterbury (the gold standard is available for download at http://www.patrickpantel.com/cgi-bin/Web/Tools/getfile.pl?type=data&id=sse-gold/wikipedia.20071218.goldsets.tgz).

4.2 Generation of Experimental Trials

To provide a statistically significant view of the performance of our algorithm, we created more than 1000 trials as follows. For each of the gold standard seed sets, we created 30 random sortings. These 30 random sortings were then used to generate trial seed sets with a maximum size of 20 seeds.

4.3 Simulating User Feedback and Baseline Algorithm

User feedback forms an integral part of our algorithm. We used the gold standard sets to judge the candidate expansions.
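The update derived in Section 3.3 can be implemented without ever touching a candidate's full context vector. The sketch below assumes we have kept, for each candidate x, its original cosine score against the centroid and the weights of the features it shares with the centroid; names and data structures are illustrative, not the authors' code.

```python
import math

def updated_cosine(original_cos, shared_weights, centroid, removed_features):
    # Approximate cos(x, mu') from cos(x, mu) after features were removed from the centroid.
    #   original_cos     -- cos(x, mu) computed before the first refinement iteration
    #   shared_weights   -- dict: feature -> x's weight, for features x shares with mu
    #   centroid         -- dict: feature -> mu's weight (approximated from shared features)
    #   removed_features -- features deleted from mu after an error was tagged
    modified = {f: w for f, w in centroid.items() if f not in removed_features}
    norm_mu = math.sqrt(sum(w * w for w in centroid.values()))
    norm_mod = math.sqrt(sum(w * w for w in modified.values()))
    dot_orig = sum(xw * centroid[f] for f, xw in shared_weights.items() if f in centroid)
    dot_mod = sum(xw * modified[f] for f, xw in shared_weights.items() if f in modified)
    if not dot_orig or not norm_mod:
        return 0.0
    # cos(x, mu') = cos(x, mu) * (||mu|| / ||mu'||) * (sum_i x_i mu'_i / sum_i x_i mu_i)
    return original_cos * (norm_mu / norm_mod) * (dot_mod / dot_orig)
```

Only shared features enter the two dot products, since features absent from the centroid contribute nothing to either sum; this is what makes discarding the candidates' full vectors safe.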
The first candidate expansion that was marked incorrect in each editorial iteration was used as the editor's negative example and was given to the system as an error. In the next section, we report R-precision gains at each iteration in the editorial process for our two methods described in Section 3. Our baseline method simply measures the gains obtained by removing the first incorrect entry in a candidate expansion set at each iteration. This simulates the process of manually cleaning a set by removing one error at a time.

Table 1. R-precision of the three methods with 95% confidence bounds.
Iteration   BASELINE        SIM             FMM
1           0.219±0.012     0.234±0.013     0.220±0.015
2           0.223±0.013     0.242±0.014     0.227±0.017
3           0.227±0.013     0.251±0.015     0.235±0.019
4           0.232±0.013     0.260±0.016     0.252±0.021
5           0.235±0.014     0.266±0.017     0.267±0.022
6           0.236±0.014     0.269±0.017     0.282±0.023
7           0.238±0.014     0.273±0.018     0.294±0.023
8           0.240±0.014     0.280±0.018     0.303±0.024
9           0.242±0.014     0.285±0.018     0.315±0.025
10          0.243±0.014     0.286±0.018     0.322±0.025

5 Experimental Results

5.1 Experimental Setup

Wikipedia served as the source corpus for our algorithms described in Sections 3.1 and 3.2. All articles were POS-tagged using (Brill 1995) and later chunked using a variant of (Abney 1991). Corpus statistics from this processed text were collected to build the similarity matrix for the SIM method (Section 3.1) and to extract the features required for the FMM method (Section 3.2). In both cases corpus statistics were extracted over the semi-syntactic contexts (chunks) to approximate term meanings. The minimum similarity thresholds were experimentally set to 0.15 and 0.11 for the SIM and FMM algorithms, respectively. Each experimental trial described in Section 4.2, which consists of a set of seed instances of one of our 50 random semantic classes, was expanded using a variant of the distributional set expansion algorithm of Sarmento et al. (2007). The expansions were judged against the gold standard and each candidate expansion was marked as either correct or incorrect. This set of expanded and judged candidate files was used as input to the algorithms described in Sections 3.1 and 3.2. Choosing the first candidate expansion that was judged as incorrect simulated our user feedback. This process was repeated for each iteration of the algorithm and results are reported for 10 iterations. The outputs of our algorithms were again judged against the gold standard lists and the performance was measured in terms of precision gains over the baseline at various ranks. Precision gain for an algorithm over a baseline is the percentage increase in precision for the same parameter values of the algorithm over the baseline. Also, as the sizes of our gold standard lists vary, we report another commonly used statistic, R-precision. R-precision for any set is the precision at the size of the gold standard set. For example, if a gold standard set contains 20 elements, then R-precision for any set expansion is measured as the precision at rank 20. The average R-precision over each set is then reported.

5.2 Quantitative Analysis

Table 1 lists the performance of our baseline algorithm (Section 4.3) and our proposed methods SIM and FMM (Sections 3.1 and 3.2) in terms of their R-precision with 95% confidence bounds over 10 iterations of each algorithm. The FMM of Section 3.2 is the best performing method in terms of R-precision, reaching a maximum value of 0.322 after the 10th iteration.
For small numbers of iterations, however, the SIM method outperforms FMM since it is bolder in its refinements by removing all elements similar to the tagged error. Inspection of FMM results showed that bad instances get ranked lower in early iterations but it is only after 4 or 5 iterations that they get pushed passed the similarity threshold (accounting for the low marginal increase in precision gain for FMM in the first 4 to 5 iterations). FMM outperforms the SIM method by an average of 4% increase in performance (13% improvement after 10 iterations). However both the FMM and the SIM method are able to outperform Figure 1. Precision gain over baseline algorithm for SIM method. Figure 2. Precision gain over baseline algorithm for FMM method. the baseline method. Using the FMM method one would achieve an average of 17% improvement in R-precision over manually cleaning up the set (32.5% improvement after 10 iterations). Using the SIM method one would achieve an average of 13% improvement in R-precision over manually cleaning up the set (17.7% improvement after 10 iterations). sions that are random errors introduced due to data sparsity. Such unsystematic errors are not detectable by the SIM method. 5.4 Intrinsic Analysis of the FMM Algorithm 5.3 Intrinsic Analysis of the SIM Algorithm Figure 1 shows the precision gain of the similarity matrix based algorithm over the baseline algorithm. The results are shown for precision at ranks 1, 2, 5, 10, 25, 50 and 100, as well as for Rprecision. The results are also shown for the first 10 iterations of the algorithm. SIM outperforms the baseline algorithm for all ranks and increases in gain throughout the 10 iterations. As the number of iterations increases the change in precision gain levels off. This behavior can be attributed to the fact that we start removing errors from top to bottom and in each iteration the rank of the error candidate provided to the algorithm is lower than in the previous iteration. This results in errors which are not similar to any other candidate expansions. These are random errors and the discriminative capacity of this method reduces severely. Figure 1 also shows that the precision gain of the similarity matrix algorithm over the baseline algorithm is higher at ranks 1, 2 and 5. Also, the performance increase drops at ranks 50 and 100. This is because low ranks contain candidate expan296 The feature modification method of Section 3.2 shows similar behavior to the SIM method, however as Figure 2 shows, it outperforms SIM method in terms of precision gain for all values of ranks tested. This is because the FMM method is able to achieve fine-grained control over what it removes and what it doesn't, as described in Section 5.2. Another interesting aspect of FMM is illustrated in the R-precision curve. There is a sudden jump in precision gain after the fifth iteration of the algorithm. In the first iterations only few errors are pushed beneath the similarity threshold as centroid features intersecting with tagged errors are slowly removed. As the feature vector for the centroid gets smaller and smaller, remaining features look more and more unambiguous to the target entity type and erroneous example have less chance of overlapping with the centroid causing them to be pushed pass the conservative similarity threshold. Different conservative thresholds yielded similar curves. High thresholds yield bad performance since all but the only very prototypical set instances are removed as errors. 
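For reference, the statistics used throughout this evaluation — precision at rank k, R-precision, and precision gain over the baseline — can be computed as in the short sketch below (function names are illustrative, not from the paper):

```python
def precision_at_k(ranked_expansion, gold_set, k):
    # Fraction of the top-k candidates that belong to the gold standard set.
    if k <= 0:
        return 0.0
    return sum(1 for e in ranked_expansion[:k] if e in gold_set) / float(k)

def r_precision(ranked_expansion, gold_set):
    # Precision at the size of the gold standard set, e.g. precision at rank 20
    # for a 20-element gold set.
    return precision_at_k(ranked_expansion, gold_set, len(gold_set))

def precision_gain(p_algorithm, p_baseline):
    # Percentage increase in precision over the baseline at the same settings.
    return 100.0 * (p_algorithm - p_baseline) / p_baseline
```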
The R-precision measure indirectly models recall as a function of the target coverage of each set. We also directly measured recall at various ranks and FMM outperformed SIM at all ranks and iterations. 5.5 Discussion In this paper we proposed two techniques which use user feedback to remove systematic errors in set expansion systems caused by ambiguous seed instances. Inspection of expansion errors yielded other types of errors. First, model errors are introduced in candidate expansion sets by noise from various preprocessing steps involved in generating the expansions. Such errors cause incorrect contexts (or features) to be extracted for seed instances and ultimately can cause erroneous expansions to be produced. These errors do not seem to be systematic and are hence not discoverable by our proposed method. Other errors are due to data sparsity. As the feature space can be very large, the difference in similarity between a correct candidate expansion and an incorrect expansion can be very small for sparse entities. Previous approaches have suggested removing candidate expansions for which too few statistics can be extracted, however at the great cost of recall (and lower R-precision). judgments. Both proposed refinement models have linear time complexity in set size allowing for practical online use in set expansion systems. This paper only addresses techniques for removing erroneous entities from expanded entity sets. A complimentary way to improve performance would be to investigate the addition of relevant candidate expansions that are not already in the initial expansion. We are currently investigating extensions to FMM that can efficiently add new candidate expansions to the set by computing the similarity between modified centroids and all terms occurring in a large body of text. We are also investigating ways to use the findings of this work to a priori remove ambiguous seed instances (or their ambiguous contexts) before running the initial expansion algorithm. It is our hope that most of the errors identified in this work could be automatically discovered without any manual judgments. References Abney, S. Parsing by Chunks. 1991. In: Robert Berwick, Steven Abney and Carol Tenny (eds.), Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht. Banko, M. and Brill, E. 2001. Scaling to very large corpora for natural language disambiguation. In Proceedings of ACL-2001.pp 26-33. Morristown, NJ. Banko, M.; Cafarella, M.; Soderland, S.; Broadhead, M.; Etzioni, O. 2007. Open Information Extraction from the Web. In Proceedings of IJCAI-07. Bayardo, R. J; Yiming Ma,; Ramakrishnan Srikant.; Scaling Up All-Pairs Similarity Search. In Proc. of the 16th Int'l Conf. on World Wide Web. pp 131-140 2007. Brill, E. 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging. Computational Linguistics. Bunescu, R. and Mooney, R. 2004 Collective Information Extraction with Relational Markov Networks. In Proceedings of ACL-04.pp. 438-445. Cohn, D. A., Atlas, L., and Ladner, R. E. 1994. Improving Generalization with Active Learning. Machine Learning, 15(2):201-221. Springer, Netherlands. Dagan, I. and Engelson, S. P. 1995. Selective Sampling in Natural Language Learning. In Proceedings of IJCAI-95 Workshop on New Approaches to Learning for Natural Language Processing. Montreal, Canada. Downey, D.; Broadhead, M; Etzioni, O. 2007. Locating Complex Named Entities in Web Text. In Proceedings of IJCAI-07. 
6 Conclusion In this paper we presented two algorithms for improving the precision of automatically expanded entity sets by using minimal human negative judgments. We showed that systematic errors which arise from the semantic ambiguity inherent in seed instances can be leveraged to automatically refine entity sets. We proposed two techniques: SIM which boldly removes instances that are distributionally similar to errors, and FMM which more conservatively removes features from the seed set representing its unintended (ambiguous) concept in order to rank lower potential errors. We showed empirical evidence that average Rprecision over random entity sets improves by 26% to 51% when given from 5 to 10 manually tagged errors. These results were reported by testing the refinement algorithms on a set of 50 randomly chosen entity sets expanded using a state of the art expansion algorithm. Given very small amounts of manual judgments, the SIM method outperformed FMM (up to 4 manual judgments). FMM outperformed the SIM method given more than 6 manual 297 Etzioni, O.; Cafarella, M.; Downey. D.; Popescu, A.; Shaked, T; Soderland, S.; Weld, D.; Yates, A. 2005. Unsupervised named-entity extraction from the Web: An Experimental Study. In Artificial Intelligence, 165(1):91-134. Harris, Z. 1985. Distributional Structure. In: Katz, J. J. (ed.), The Philosophy of Linguistics. New York: Oxford University Press. pp. 26-47. Harman, D. 1992. Relevance feedback revisited. In Proceeedings of SIGIR-92. Copenhagen, Denmark. Hearst, M. A. 1992.Automatic acquisition of hyponyms from large text corpora.In Proceedings of COLING92. Nantes, France. Kozareva, Z., Riloff, E. and Hovy, E. 2008. Semantic Class Learning from the Web with Hyponym Pattern Linkage Graphs.In Proceedings of ACL-08.pp 10481056. Columbus, OH Lin, D. 1998.Automatic retrieval and clustering of similar words.In Proceedings of COLING/ACL-98.pp. 768­774. Montreal, Canada. Nadeau, D., Turney, P. D. and Matwin., S. 2006. Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity. In Advances in Artifical Intelligence.pp 266-277. Springer Berlin, Heidelberg. Negri, M. 2004. Sense-based blind relevance feedback for question answering. In Proceedings of SIGIR-04 Workshop on Information Retrieval For Question Answering (IR4QA). Sheffield, UK, Pantel, P. and Lin, D. 2002. Discovering Word Senses from Text. In Proceedings of KDD-02.pp. 613-619. Edmonton, Canada. Paca, M. 2007a.Weakly-supervised discovery of named entities using web search queries. In Proceedings of CIKM-07.pp. 683-690. Pasca, M. 2007b. Organizing and Searching the World Wide Web of Facts - Step Two: Harnessing the Wisdom of the Crowds. In Proceedings of WWW-07. pp. 101-110. Paca, M.; Lin, D.; Bigham, J.; Lifchits, A.; Jain, A. 2006. Names and Similarities on the Web: Fact Extraction in the Fast Lane. In Proceedings of ACL2006.pp. 113-120. Paca, M. and Durme, B.J. 2008. Weakly-supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs. In Proceedings of ACL-08. Riloff, E. and Jones, R. 1999 Learning Dictionaries for Information Extraction by Multi-Level Boostrapping.In Proceedings of AAAI/IAAAI-99. Riloff, E. and Shepherd, J. 1997. A corpus-based approach for building semantic lexicons.In Proceedings of EMNLP-97. Sarmento, L.; Jijkuon, V.; de Rijke, M.; and Oliveira, E. 2007. "More like these": growing entity classes from seeds. In Proceedings of CIKM-07. pp. 959-962. Lisbon, Portugal. Stevenson, M., Guo, Y. and Gaizauskas, R. 
2008. Acquiring Sense Tagged Examples using Relevance Feedback. In Proceedings ofCOLING-08. Manchester UK. Tang, M., Luo, X., and Roukos, S. 2001. Active learning for statistical natural language parsing.In Proceedings of ACL-2001.pp 120 -127. Philadelphia, PA. Vishwa. V, Wood, K., Milic-Frayling, N. and Cox, I. J. 2005. Comparing Relevance Feedback Algorithms for Web Search. In Proceedings of WWW 2005. Chiba, Japan. Wang. R.C. and Cohen, W.W. 2007.LanguageIndependent Set Expansion of Named Entities Using the Web.In Proceedings of ICDM-07. Zhou, X. S. and Huang, S. T. 2003. Relevance Feedback in Image Retrieval: A Comprehensive Review Xiang Sean Zhou, Thomas S. Huang Multimedia Systems. pp 8:536-544. Zhou, G. and Su, J. 2001. Named entity recognition using an HMM-based chunk tagger. In Proceedings of ACL-2001.pp. 473-480. Morristown, NJ. 298 Unsupervised Constraint Driven Learning For Transliteration Discovery Ming-Wei Chang Dan Goldwasser Dan Roth Yuancheng Tu University of Illinois at Urbana Champaign Urbana, IL 61801 {mchang21,goldwas1,danr,ytu}@uiuc.edu Abstract This paper introduces a novel unsupervised constraint-driven learning algorithm for identifying named-entity (NE) transliterations in bilingual corpora. The proposed method does not require any annotated data or aligned corpora. Instead, it is bootstrapped using a simple resource ­ a romanization table. We show that this resource, when used in conjunction with constraints, can efficiently identify transliteration pairs. We evaluate the proposed method on transliterating English NEs to three different languages - Chinese, Russian and Hebrew. Our experiments show that constraint driven learning can significantly outperform existing unsupervised models and achieve competitive results to existing supervised models. 1 Introduction Named entity (NE) transliteration is the process of transcribing a NE from a source language to some target language while preserving its pronunciation in the original language. Automatic NE transliteration is an important component in many cross-language applications, such as Cross-Lingual Information Retrieval (CLIR) and Machine Translation(MT) (Hermjakob et al., 2008; Klementiev and Roth, 2006a; Meng et al., 2001; Knight and Graehl, 1998). It might initially seem that transliteration is an easy task, requiring only finding a phonetic mapping between character sets. However simply matching every source language character to its target language counterpart is not likely to work well as in practice this mapping depends on the context the 299 characters appear in and on transliteration conventions which may change across domains. As a result, current approaches employ machine learning methods which, given enough labeled training data learn how to determine whether a pair of words constitute a transliteration pair. These methods typically require training data and language-specific expertise which may not exist for many languages. In this paper we try to overcome these difficulties and show that when the problem is modeled correctly, a simple character level mapping is a sufficient resource. In our experiments, English was used as the source language, allowing us to use romanization tables, a resource commonly-available for many languages1 . These tables contain an incomplete mapping between character sets, mapping every character to its most common counterpart. Our transliteration model takes a discriminative approach. Given a word pair, the model determines if one word is a transliteration of the other. 
The features used by this model are character n-gram matches across the two strings. For example, Figure 1 describes the decomposition of a word pair into unigram features as a bipartite graph in which each edge represents an active feature. We enhance the initial model with constraints, by framing the feature extraction process as a structured prediction problem - given a word pair, the set of possible active features is defined as a set of latent binary variables. The contextual dependency beThe romanization tables available at the Library of Congress website (http://www.loc.gov/catdir/cpso/roman.html) cover more than 150 languages written in various non-Roman scripts 1 Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 299­307, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics Figure 1: Top: The space of all possible features that can be generated given the word pair. Bottom: A pruned features representation generated by the inference process. tween features is encoded as a set of constraints over these variables. Features are extracted by finding an assignment that maximizes the similarity score between the two strings and conforms to the constraints. The model is bootstrapped using a romanization table and uses a discriminatively self-trained classifier as a way to improve over several training iterations. Furthermore, when specific knowledge about the source and target languages exists, it can be directly injected into the model as constraints. We tested our approach on three very different languages - Russian, a Slavic language, Hebrew a Semitic language, and Chinese, a SinoTibetan language. In all languages, using this simple resource in conjunction with constraints provided us with a robust transliteration system which significantly outperforms existing unsupervised approaches and achieves comparable performance to supervised methods. The rest of the paper is organized as follows. Sec. 2 briefly examines more related work. Sec. 3 explains our model and Sec. 4 provide a linguistic intuition for it. Sec. 5 describes our experiments and evaluates our results followed by sec. 6 which concludes our paper. 2 Related Works Transliteration methods typically fall into two categories: generative approaches (Li et al., 2004; Jung et al., 2000; Knight and Graehl, 1998) that try to produce the target transliteration given a source language NE, and discriminative approaches (Goldwasser and Roth, 2008b; Bergsma and Kondrak, 2007; Sproat et al., 2006; Klementiev and Roth, 2006a), that try to identify the correct translitera300 tion for a word in the source language given several candidates in the target language. Generative methods encounter the Out-Of-Vocabulary (OOV) problem and require substantial amounts of training data and knowledge of the source and target languages. Discriminative approaches, when used to for discovering NE in a bilingual corpora avoid the OOV problem by choosing the transliteration candidates from the corpora. These methods typically make very little assumptions about the source and target languages and require considerably less data to converge. Training the transliteration model is typically done under supervised settings (Bergsma and Kondrak, 2007; Goldwasser and Roth, 2008b), or weakly supervised settings with additional temporal information (Sproat et al., 2006; Klementiev and Roth, 2006a). 
Our work differs from these works in that it is completely unsupervised and makes no assumptions about the training data. Incorporating knowledge encoded as constraints into learning problems has attracted a lot of attention in the NLP community recently. This has been shown both in supervised settings (Roth and Yih, 2004; Riedel and Clarke, 2006) and unsupervised settings (Haghighi and Klein, 2006; Chang et al., 2007) in which constraints are used to bootstrap the model. (Chang et al., 2007) describes an unsupervised training of a Constrained Conditional Model (CCM), a general framework for combining statistical models with declarative constraints. We extend this work to include constraints over possible assignments to latent variables which, in turn, define the underlying representation for the learning problem. In the transliteration community there are several works (Ristad and Yianilos, 1998; Bergsma and Kondrak, 2007; Goldwasser and Roth, 2008b) that show how the feature representation of a word pair can be restricted to facilitate learning a string similarity model. We follow the approach discussed in (Goldwasser and Roth, 2008b), which considers the feature representation as a structured prediction problem and finds the set of optimal assignments (or feature activations), under a set of legitimacy constraints. This approach stresses the importance of interaction between learning and inference, as the model iteratively uses inference to improve the sample representation for the learning problem and uses the learned model to improve the accuracy of the in- ference process. We adapt this approach to unsupervised settings, where iterating over the data improves the model in both of these dimensions. 3 Unsupervised Constraint Driven Learning In this section we present our Unsupervised Constraint Driven Learning (UCDL) model for discovering transliteration pairs. Our task is in essence a ranking task. Given a NE in the source language and a list of candidate transliterations in the target language, the model is supposed to rank the candidates and output the one with the highest score. The model is bootstrapped using two linguistic resources: a romanization table and a set of general and linguistic constraints. We use several iterations of self training to learn the model. The details of the procedure are explained in Algorithm 1. In our model features are character pairs (cs , ct ), where cs Cs is a source word character and ct Ct is a target word character. The feature representation of a word pair vs , vt is denoted by F (vs , vt ). Each feature (cs , ct ) is assigned a weight W (cs , ct ) R. In step 1 of the algorithm we initialize the weights vector using the romanization table. Given a pair (vs , vt ), a feature extraction process is used to determine the feature based representation of the pair. Once features are extracted to represent a pair, the sum of the weights of the extracted features is the score assigned to the target transliteration candidate. Unlike traditional feature extraction approaches, our feature representation function does not produce a fixed feature representation. In step 2.1, we formalize the feature extraction process as a constrained optimization problem that captures the interdependencies between the features used to represent the sample. That is, obtaining F (vs , vt ) requires solving an optimization problem. The technical details are described in Sec. 3.1. The constraints we use are described in Sec. 3.2. 
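As a concrete illustration of step 1 and step 2.1 just described, the sketch below initializes the weight vector from a romanization table and scores a candidate's feature representation by summing its feature weights, normalized by the length of the target word (the normalization is made explicit with Equation 1 in Sec. 3.1; constraint handling is omitted here). The data structures and the toy example are assumptions for illustration, not the authors' code.

```python
def init_weights(romanization_table, source_chars, target_chars, blank="_"):
    # Step 1: weight 0 for character pairs found in the romanization table,
    # -1 for every other pair and for any mapping to or from the blank character.
    weights = {}
    for cs in list(source_chars) + [blank]:
        for ct in list(target_chars) + [blank]:
            weights[(cs, ct)] = 0.0 if (cs, ct) in romanization_table else -1.0
    return weights

def score(active_features, weights, target_word):
    # Steps 2.1-2.2: the score of a candidate pair is the sum of the weights
    # of its extracted (active) features, normalized by the target word's size.
    return sum(weights[f] for f in active_features) / max(len(target_word), 1)

# Toy example with a made-up two-character "romanization table":
table = {("a", "a"), ("b", "b")}
w = init_weights(table, source_chars="ab", target_chars="ab")
print(score([("a", "a"), ("b", "b")], w, "ab"))   # 0.0  (both pairs are in the table)
print(score([("a", "b"), ("b", "_")], w, "ab"))   # -1.0 (no pair is in the table)
```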
In step 2.2, the different candidates for every source NE are ranked according to the similarity score associated with their chosen representation. This ranking is used to "label" examples for a discriminative learning process that learns increasingly better weights, and thus improves the representation of the pairs: each source NE paired with its top-ranked transliteration is labeled as a positive example (step 2.3), and the rest of the samples are considered negative samples. In order to focus the learning process, we remove from the training set all negative examples ruled out by the constraints (step 2.4). As the learning process progresses, the initial weights are replaced by weights which are discriminatively learned (step 2.5). This process is repeated several times until the model converges and repeats the same ranking over several iterations.

Input: Romanization table T: Cs → Ct, constraints C, source NEs Vs, target words Vt
1. Initialize model. Let W: Cs × Ct → R be a weight vector. Initialize W using T as follows:
   for all (cs, ct) ∈ T: W(cs, ct) = 0;
   for all (cs, ct) ∉ T: W(cs, ct) = -1;
   for all cs: W(cs, _) = -1; for all ct: W(_, ct) = -1.
2. Constraint-driven unsupervised training. While not converged:
   2.1 For all vs ∈ Vs, vt ∈ Vt, use C and W to generate a representation F(vs, vt).
   2.2 For all vs ∈ Vs, find the top-ranking transliteration pair (vs, vt*) by solving vt* = argmax_vt score(F(vs, vt)).
   2.3 D = {(+, F(vs, vt*)) | vs ∈ Vs}.
   2.4 For all vs ∈ Vs, vt ∈ Vt, if vt ≠ vt* and score(F(vs, vt)) ≠ -∞, then D = D ∪ {(-, F(vs, vt))}.
   2.5 W ← train(D).
Algorithm 1: UCDL Transliteration Framework.

In the rest of this section we explain this process in detail. We define the feature extraction inference process in Sec. 3.1, the constraints used in Sec. 3.2, and the inference algorithm in Sec. 3.3. The linguistic intuition for our model is described in Sec. 4.

3.1 Finding Feature Representation as Constrained Optimization

We use the formulation of Constrained Conditional Models (CCMs) (Roth and Yih, 2004; Roth and Yih, 2007; Chang et al., 2008). Previous work on CCMs models dependencies between different decisions in structured prediction problems. Transliteration discovery is a binary classification problem; however, the underlying representation of each sample can be modeled as a CCM, defined as a set of latent variables corresponding to the set of all possible features for a given sample. The dependencies between the features are captured using constraints.

Given a word pair, the set of all possible features consists of all character mappings from the source word to the target word. Since in many cases the sizes of the words differ, we augment each of the words with a blank character (denoted as '_'). We model character omission by mapping the omitted character to the blank character. This process is formally defined as an operator mapping a transliteration candidate pair to a set of binary variables, denoted as All-Features (AF):

AF = {(cs, ct) | cs ∈ vs ∪ {_}, ct ∈ vt ∪ {_}}

This representation is depicted at the top of Figure 1. The initial sample representation (AF) generates features by coupling substrings from the two terms without considering the dependencies between the possible combinations. This representation is clearly noisy, and in order to improve it we select a subset F ⊆ AF of the possible features. The selection process is formulated as a linear optimization problem over binary variables encoding feature activations in AF. Variables assigned 1 are selected to be in F, and those assigned 0 are not. The objective function maximized is a linear function over the variables in AF, each with its weight as a coefficient, as in the left part of Equation 1 below. We seek to maximize this linear sum subject to a set of constraints, which represent the dependencies between selections and prior knowledge about possible legitimate character mappings, and which correspond to the right side of Equation 1. In our settings only hard constraints are used, and therefore the penalty ρ for violating any of the constraints is set to ∞. The specific constraints used are discussed in Sec. 3.2. The score of the mapping F(vs, vt) can be written as follows:

score(F(vs, vt)) = (1 / |vt|) (W · F(vs, vt) - Σ_{ci ∈ C} ρ ci(F(vs, vt)))    (1)

We normalize this score by dividing it by the size of the target word; since the sizes of the candidates vary, normalization improved the ranking of candidates. The result of the optimization process is a set F of active features, defined in Equation 2; the result of this process is depicted at the bottom of Figure 1.

F(vs, vt) = argmax_{F ⊆ AF(vs, vt)} score(F)    (2)

The ranking process done by our model can now be naturally defined: given a source word vs and a set of candidate target words vt^0, ..., vt^n, find the candidate whose optimal representation maximizes Equation 1. This process is defined in Equation 3.

vt* = argmax_{vt^i} score(F(vs, vt^i))    (3)

3.2 Incorporating Mapping Constraints

We consider two types of constraints: language-specific constraints and general constraints that apply to all languages. Language-specific constraints typically impose a local restriction, such as individually forcing some of the possible character mapping decisions. The linguistic intuition behind these constraints is discussed in Section 4. General constraints encode global restrictions, capturing the dependencies between different mapping decisions.

General constraints: To facilitate readability, we denote the feature mapping the i-th source word character to the j-th target word character as a Boolean variable aij that is 1 if that feature is active and 0 otherwise.
· Coverage - Every character must be mapped only to a single character or to the blank character. We can formulate this as: for all i, Σ_j aij = 1, and for all j, Σ_i aij = 1.
· No Crossing - Every character mapping, except mapping to the blank character, should preserve the order of appearance in the source and target words; formally, for all i, j such that aij = 1: for all l < i and k > j, alk = 0. Another constraint is: for all i, j such that aij = 1: for all l > i and k < j, alk = 0.

Language-specific constraints
· Restricted Mapping: These constraints restrict the possible local mappings between source and target language characters. We maintain, for each source character cs, the set of target characters it may map to, and similarly, for each target character ct, the set of source characters it may map back to. Any feature (cs, ct) that violates these restricted mappings is penalized in our model.
· Length restriction: An additional constraint restricts the size difference between the two words. We formulate this as follows: for all vs ∈ Vs, vt ∈ Vt, if |vt| > |vs| + δ or |vs| > |vt| + δ, then score(F(vs, vt)) = -∞. Although δ can take different values for different languages, we simply set δ to 2 in this paper.

In addition to biasing the model to choose the right candidate, the constraints also provide a computational advantage: a given word pair is eliminated from consideration when the length restriction is not satisfied or when there is no way to satisfy the restricted mapping constraints.

3.3 Inference

The optimization problem defined in Equation 2 is an integer linear program (ILP). However, given the structure of the problem, it is possible to develop an efficient dynamic programming algorithm for it, based on the algorithm for finding the minimal edit distance of two strings. The complexity of finding the optimal set of features is only quadratic in the size of the input pair, a clear improvement over the exponential-time ILP algorithm. The algorithm minimizes the weighted edit distance between the strings and produces a character alignment that satisfies the general constraints (Sec. 3.2). Our modifications are only concerned with incorporating the language-specific constraints into the algorithm. This can be done simply by assigning a negative-infinity score to any alignment decision that does not satisfy these constraints.

4 Bootstrapping with Linguistic Information

Our model is bootstrapped using two resources - a romanization table and mapping constraints. Both resources capture the same information - character mappings between languages. The distinction between the two represents the difference in the confidence we have in these resources: the romanization table is a noisy mapping covering the character set and is therefore better suited as a feature. Constraints, represented by pervasive, correct character mappings, indicate the sound mapping tendency between the source and target languages. For example, certain n-gram phonemic mappings, such as r → l from English to Chinese, are language specific and can be captured by language-specific sound change patterns. These patterns have been used by other systems as features or pseudo-features (Yoon et al., 2007). However, in our system these language-specific rules of thumb are systematically used as constraints to exclude impossible alignments and therefore generate better features for learning.

Table 1: All language specific constraints used in our English to Chinese transliteration (see Sec. 3.2 for more details); constraints in boldface apply to all positions, the rest apply only to characters appearing in initial position. Rows, by phoneme class (Vowel, Nasal, Approximant, Fricative, Plosive): i y; u w; a a m m; m, n m r l; l, r l l l; w h, w, f h, o, u, v w; y y v w, b, f s s, x, z; s, c s p b, p; p p b b; t t t, d d; q k

We listed in Table 1 all 20 language-specific constraints we used for Chinese. There is a total of 24 constraints for Hebrew and 17 for Russian. The constraints in Table 1 indicate a systematic sound mapping between English and Chinese unigram character mappings. Arranged by manner of articulation, each row of the table indicates the sound change tendency among vowels, nasals, approximants (retroflex and glides), fricatives and plosives. For example, voiceless plosive sounds such as p and t in English tend to map to both voiced (such as b, d) and voiceless sounds in Chinese. However, if the sound is voiceless in Chinese, its backtracked English sound must be voiceless. This voiced-voiceless sound change tendency is captured by constraints such as p b, p and p p; t t.

5 Experiments and Analysis

In this section, we demonstrate the effectiveness of constraint driven learning empirically. We start by describing the datasets and experimental settings and then proceed to describe the results. We evaluated our method on three very different target lan-
w/o cons General cons + unsupervised learning All cons. + unsupervised learning 0.4 0.5 0.3 0.4 0 2 4 6 8 10 12 Number of Rounds 14 16 18 20 0 2 4 6 8 10 12 Number of Rounds 14 16 18 20 Figure 2: Comparison between our models and weakly supervised learning methods (Klementiev and Roth, 2006b). Note that one of the models proposed in (Klementiev and Roth, 2006b) takes advantage of the temporal information. Our best model, the unsupervised learning with all constraints, outperforms both models in (Klementiev and Roth, 2006b), even though we do not use any temporal information. Figure 3: Comparison between our works and supervised models in (Goldwasser and Roth, 2008b). We show the learning curves for Hebrew under two different settings: unsupervised learning with general and all constraints. The results of two supervised models (Goldwasser and Roth, 2008b) are also included here. Note that our unsupervised model with all constraints is competitive to the supervised model with 250 labeled examples. See the text for more comparisons and details. guages: Russian, Chinese, and Hebrew, and compared our results to previously published results. 5.1 Experimental Settings In our experiments the system is evaluated on its ability to correctly identify the gold transliteration for each source word. We evaluated the system's performance using two measures adopted in many transliteration works. The first one is Mean Reciprocal Rank (MRR), used in (Tao et al., 2006; Sproat et al., 2006), which is the average of the multiplicative inverse of the rank of the correct answer. Formally, Let n be the number of source NEs. Let GoldRank(i) be the rank the algorithm assigns to the correct transliteration. Then, MRR is defined by: MRR = 1 n n i=1 hinge loss as our loss function and fixed the regularization parameter C to be 0.5. 5.2 Datasets We experimented using three different target languages Russian, Chinese, and Hebrew. We used English as the source language in all these experiments. The Russian data set2 , originally introduced in (Klementiev and Roth, 2006b), is comprised of temporally aligned news articles. The dataset contains 727 single word English NEs with a corresponding set of 50,648 potential Russian candidate words which include not only name entities, but also other words appearing in the news articles. The Chinese dataset is taken directly from an English-Chinese transliteration dictionary, derived from LDC Gigaword corpus3 . The entire dictionary consists of 74,396 pairs of English-Chinese NEs, where Chinese NEs are written in Pinyin, a romanized spelling system of Chinese. In (Tao et al., 2006) a dataset which contains about 600 English NEs and 700 Chinese candidates is used. Since the dataset is not publicly available, we created a dataset in a similar way. We randomly selected approximately 600 NE pairs and then added about 100 candidates which do not correspond to any of the English NE 2 3 1 . goldRank(i) Another measure is Accuracy (ACC) used in (Klementiev and Roth, 2006a; Goldwasser and Roth, 2008a), which is the percentage of the top rank candidates being the gold transliteration. In our implementation we used the support vector machine (SVM) learning algorithm with linear kernel as our underlying learning algorithm (mentioned in part 2.5 of Algorithm 1) . We used the package LIBLINEAR (Hsieh et al., 2008) in our experiments. Through all of our experiments, we used the 2-norm 304 MRR ACC The corpus is available http://L2R.cs.uiuc.edu/cogcomp. http://www.ldc.upenn.edu Language Rus. 
(ACC) Heb. (MRR) UCDL 73 0.899 Prev. works 63 (41) (KR'06) 0.894 (GR'08) Table 2: Comparison to previously published results. UCDL is our method, KR'06 is described in (Klementiev and Roth, 2006b) and GR'08 in (Goldwasser and Roth, 2008b). Note that our results for Hebrew are comparable with a supervised system. previously selected. The Hebrew dataset, originally introduced in (Goldwasser and Roth, 2008a), consists of 300 English-Hebrew pairs extracted from Wikipedia. 5.3 Results We begin by comparing our model to previously published models tested over the same data, in two different languages, Russian and Hebrew. For Russian, we compare to the model presented in (Klementiev and Roth, 2006b), a weakly supervised algorithm that uses both phonetic information and temporal information. The model is bootstrapped using a set of 20 labeled examples. In their setting the candidates are ranked by combining two scores, one obtained using the transliteration model and a second by comparing the relative occurrence frequency of terms over time in both languages. Due to computational tractability reasons we slightly changed Algorithm 1 to use only a small subset of the possible negative examples. For Hebrew, we compare to the model presented in (Goldwasser and Roth, 2008b), a supervised model trained using 250 labeled examples. This model uses a bigram model to represent the transliteration samples (i.e., features are generated by pairing character unigrams and bigrams). The model also uses constraints to restrict the feature extraction process, which are equivalent to the coverage constraint we described in Sec. 3.2. The results of these experiments are reported using the evaluation measures used in the original papers and are summarized in Table 2. The results show a significant improvement over the Russian data set and comparable performance to the supervised method used for Hebrew. Figure 2 describes the learning curve of our method over the Russian dataset. We compared our algorithm to two models described in (Klementiev 305 and Roth, 2006b) - one uses only phonetic similarity and the second also considers temporal cooccurrence similarity when ranking the transliteration candidates. Both models converge after 50 iterations. When comparing our model to their models, we found that even though our model ignores the temporal information it achieves better results and converges after fewer iterations. Their results report a significant improvement when using temporal information - improving an ACC score of 41% without temporal information to 63% when using it. Since the temporal information is orthogonal to the transliteration model, our model should similarly benefit from incorporating the temporal information. Figure 3 compares the learning curve of our method to an existing supervised method over the Hebrew data and shows we get comparable results. Unfortunately, we could not find a published Chinese dataset. However, our system achieved similar results to other systems, over a different dataset with similar number of training examples. For example, (Sproat et al., 2006) presents a supervised system that achieves a MRR score of 0.89, when evaluated over a dataset consisting of 400 English NE and 627 Chinese words. Our results for a different dataset of similar size are reported in Table 3. 5.4 Analysis The resources used in our framework consist of - a romanization table, general and language specific transliteration constraints. 
To reveal the impact of each component we experimented with different combinations of the components, resulting in three different testing configurations.

Romanization Table: We initialized the weight vector using a romanization table and did not use any constraints. To generate features we use a modified version of our AF operator (see Sec. 3), which generates features by coupling characters in close positions in the source and target words. This configuration is equivalent to the model used in (Klementiev and Roth, 2006b).

+General Constraints: This configuration uses the romanization table for initializing the weight vector and the general transliteration constraints (see Sec. 3.2) for feature extraction.

+All Constraints: This configuration uses the language-specific constraints in addition to the general transliteration constraints to generate the feature representation (see Sec. 4).

+Learning: Indicates that after initializing the weight vector, we update the weights using Algorithm 1. In all of the experiments, we report the results after 20 training iterations.

The results are summarized in Table 3.

Settings                         Chinese        Russian        Hebrew
Romanization table               0.019 (0.5)    0.034 (1.0)    0.046 (1.7)
Romanization table + learning    0.020 (0.3)    0.048 (1.3)    0.028 (0.7)
+Gen Constraints                 0.746 (67.1)   0.809 (74.3)   0.533 (45.0)
+Gen Constraints + learning      0.867 (82.2)   0.906 (86.7)   0.834 (76.0)
+All Constraints                 0.801 (73.4)   0.849 (79.3)   0.743 (66.0)
+All Constraints + learning      0.889 (84.7)   0.931 (90.0)   0.899 (85.0)

Table 3: Results of an ablation study of our unsupervised method for three target languages. Results for ACC are inside parentheses, and for MRR outside. When the learning algorithm is used, the results after 20 rounds of constraint-driven learning are reported. Note that using linguistic constraints has a significant impact in the English-Hebrew experiments. Our results show that a small number of constraints can go a long way, and better constraints lead to better learning performance.

Due to the size of the Russian dataset, we used a subset consisting of 300 English NEs and their matching Russian transliterations for the analysis presented here. After observing the results, we discovered the following regularities for all three languages. Using the romanization table directly without constraints results in very poor performance, even after learning. This is an indication of the difficulty of the transliteration problem and of the difficulties earlier works have had when using only romanization tables. However, when the table is used in conjunction with constraints, results improve dramatically. For example, on the English-Chinese data set we improve MRR from 0.02 to 0.746, and on the English-Russian data set from 0.03 to 0.8. Interestingly, the results for the English-Hebrew data set are lower than for the other languages - we achieve 0.53 MRR in this setting. We attribute the difference to the quality of the mapping in the romanization table for that language. Indeed, the weights learned after 20 training iterations improve the results to 0.83. This improvement is consistent across all languages: after learning we are able to achieve an MRR score of 0.87 for the English-Chinese data set and 0.91 for the English-Russian data set. These results show that the romanization table contains enough information to bootstrap the model when used in conjunction with constraints. We are able to achieve results comparable to supervised methods that use a similar set of constraints and labeled examples.
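To make the "+learning" configurations concrete, here is a compact sketch of the outer loop of Algorithm 1 (our own simplified rendering: infer_representation and train_svm are toy stand-ins for the constrained inference of Sec. 3.3 and for LIBLINEAR, respectively):

import math

def infer_representation(W, vs, vt):
    # Stand-in for the constrained inference of Sec. 3.3: returns a feature
    # dictionary and its length-normalized score, or -inf when the pair
    # violates the length restriction.  A real implementation would solve
    # the optimization of Eq. 2; here we just align characters position-wise.
    if abs(len(vs) - len(vt)) > 2:
        return {}, -math.inf
    feats = {(cs, ct): 1.0 for cs, ct in zip(vs, vt)}
    score = sum(W.get(f, -1.0) * v for f, v in feats.items()) / max(len(vt), 1)
    return feats, score

def train_svm(positives, negatives):
    # Stand-in for the linear SVM used in step 2.5; a simple additive update
    # keeps the sketch self-contained.
    W = {}
    for label, examples in ((1.0, positives), (-1.0, negatives)):
        for feats in examples:
            for f, v in feats.items():
                W[f] = W.get(f, 0.0) + label * v
    return W

def constraint_driven_training(W, sources, targets, rounds=20):
    for _ in range(rounds):
        positives, negatives = [], []
        for vs in sources:
            scored = [(vt,) + infer_representation(W, vs, vt) for vt in targets]
            best_vt, best_f, _ = max(scored, key=lambda x: x[2])
            positives.append(best_f)                              # step 2.3
            negatives += [f for vt, f, s in scored                # step 2.4
                          if vt != best_vt and s != -math.inf]
        W = train_svm(positives, negatives)                       # step 2.5
    return W

print(constraint_driven_training({}, ["anna"], ["ana", "bob"], rounds=2))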
Bootstrapping the weight vector using language specific constraints can further improve the results. They provide several advantages: a better starting point, an improved learning rate and a better final model. This is clear in all three languages, for example results for the Russian and Chinese bootstrapped models improve by 5%, and by over 20% for Hebrew. After training the difference is smaller- only 3% for the first two and 6% for Hebrew. Figure 3 describes the learning curve for models with and without language specific constraints for the EnglishHebrew data set, it can be observed that using these constraints the model converges faster and achieves better results. 6 Conclusion In this paper we develop a constraints driven approach to named entity transliteration. In doing it we show that romanization tables are a very useful resource for transliteration discovery if proper constraints are included. Our framework does not need labeled data and does not assume that bilingual corpus are temporally aligned. Even without using any labeled data, our model is competitive to existing supervised models and outperforms existing weakly supervised models. 7 Acknowledgments We wish to thank the reviewers for their insightful comments. This work is partly supported by NSF grant SoD-HCER-0613885 and DARPA funding under the Bootstrap Learning Program. References S. Bergsma and G. Kondrak. 2007. Alignment-based discriminative string similarity. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), pages 656­663, Prague, Czech Republic, June. Association for Computational Linguistics. M. Chang, L. Ratinov, and D. Roth. 2007. Guiding semisupervision with constraint-driven learning. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), pages 280­287, Prague, Czech Republic, Jun. Association for Computational Linguistics. M. Chang, L. Ratinov, N. Rizzolo, and D. Roth. 2008. Learning and inference with constraints. In Proc. of the National Conference on Artificial Intelligence (AAAI), July. D. Goldwasser and D. Roth. 2008a. Active sample selection for named entity transliteration. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), June. D. Goldwasser and D. Roth. 2008b. Transliteration as constrained optimization. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pages 353­362, Oct. A. Haghighi and D. Klein. 2006. Prototype-driven learning for sequence models. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL). U. Hermjakob, K. Knight, and H. Daum´ III. 2008. e Name translation in statistical machine translation learning when to transliterate. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), pages 389­397, Columbus, Ohio, June. Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. 2008. A dual coordinate descent method for large-scale linear svm. In ICML '08: Proceedings of the 25th international conference on Machine learning, pages 408­415, New York, NY, USA. ACM. S. Jung, S. Hong, and E. Paek. 2000. An english to korean transliteration model of extended markov window. In Proc. the International Conference on Computational Linguistics (COLING), pages 383­389. A. Klementiev and D. Roth. 2006a. Named entity transliteration and discovery from multilingual comparable corpora. In Proc. 
of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), pages 82­88, June. A. Klementiev and D. Roth. 2006b. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), pages USS,TL,ADAPT, July. K. Knight and J. Graehl. 1998. Machine transliteration. Computational Linguistics, pages 599­612. H. Li, M. Zhang, and J. Su. 2004. A joint source-channel model for machine transliteration. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), pages 159­166, Barcelona, Spain, July. H. Meng, W. Lo, B. Chen, and K. Tang. 2001. Generating phonetic cognates to handle named entities in english-chinese cross-langauge spoken document retreival. In Proceedings of the Automatic Speech Recognition and Understanding Workshop, pages 389­397. S. Riedel and J. Clarke. 2006. Incremental integer linear programming for non-projective dependency parsing. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pages 129­ 137, Sydney, Australia. E. S. Ristad and P. N. Yianilos. 1998. Learning string edit distance. IEEE Transactions on Pattern Recognition and Machine Intelligence, 20(5):522­532, May. D. Roth and W. Yih. 2004. A linear programming formulation for global inference in natural language tasks. pages 1­8. Association for Computational Linguistics. D. Roth and W. Yih. 2007. Global inference for entity and relation identification via a linear programming formulation. In Lise Getoor and Ben Taskar, editors, Introduction to Statistical Relational Learning. MIT Press. R. Sproat, T. Tao, and C. Zhai. 2006. Named entity transliteration with comparable corpora. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), pages 73­80, Sydney, Australia, July. T. Tao, S. Yoon, A. Fister, R. Sproat, and C. Zhai. 2006. Unsupervised named entitly transliteration using temporal and phonetic correlation. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pages 250­257. S. Yoon, K. Kim, and R. Sproat. 2007. Multilingual transliteration using feature based phonetic method. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), pages 112­119, Prague, Czech Republic, June. 307 On the Syllabification of Phonemes Susan Bartlett and Grzegorz Kondrak and Colin Cherry Department of Computing Science Microsoft Research University of Alberta One Microsoft Way Edmonton, AB, T6G 2E8, Canada Redmond, WA, 98052 {susan,kondrak}@cs.ualberta.ca colinc@microsoft.com Abstract Syllables play an important role in speech synthesis and recognition. We present several different approaches to the syllabification of phonemes. We investigate approaches based on linguistic theories of syllabification, as well as a discriminative learning technique that combines Support Vector Machine and Hidden Markov Model technologies. Our experiments on English, Dutch and German demonstrate that our transparent implementation of the sonority sequencing principle is more accurate than previous implementations, and that our language-independent SVM-based approach advances the current state-of-the-art, achieving word accuracy of over 98% in English and 99% in German and Dutch. 1 Introduction Syllabification is the process of dividing a word into its constituent syllables. 
Although some work has been done on syllabifying orthographic forms (M¨ ller et al., 2000; Bouma, 2002; Marchand and u Damper, 2007; Bartlett et al., 2008), syllables are, technically speaking, phonological entities that can only be composed of strings of phonemes. Most linguists view syllables as an important unit of prosody because many phonological rules and constraints apply within syllables or at syllable boundaries (Blevins, 1995). Apart from their purely linguistic significance, syllables play an important role in speech synthesis and recognition (Kiraz and M¨ bius, 1998; Pearson o et al., 2000). The pronunciation of a given phoneme tends to vary depending on its location within a syl308 lable. While actual implementations vary, text-tospeech (TTS) systems must have, at minimum, three components (Damper, 2001): a letter-to-phoneme (L2P) module, a prosody module, and a synthesis module. Syllabification can play a role in all three modules. Because of the productive nature of language, a dictionary look-up process for syllabification is inadequate. No dictionary can ever contain all possible words in a language. For this reason, it is necessary to develop systems that can automatically syllabify out-of-dictionary words. In this paper, we advance the state-of-the-art in both categorical (non-statistical) and supervised syllabification. We outline three categorical approaches based on common linguistic theories of syllabification. We demonstrate that when implemented carefully, such approaches can be very effective, approaching supervised performance. We also present a data-driven, discriminative solution: a Support Vector Machine Hidden Markov Model (SVM-HMM), which tags each phoneme with its syllabic role. Given enough data, the SVM-HMM achieves impressive accuracy thanks to its ability to capture context-dependent generalizations, while also memorizing inevitable exceptions. Our experiments on English, Dutch and German demonstrate that our SVM-HMM approach substantially outperforms the existing state-of-the-art learning approaches. Although direct comparisons are difficult, our system achieves over 99% word accuracy on German and Dutch, and the highest reported accuracy on English. The paper is organized as follows. We outline common linguistic theories of syllabification in Section 2, and we survey previous computational sys- Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 308­316, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics tems in Section 3. Our linguistically-motivated approaches are described in Section 4. In Section 5, we describe our system based on the SVM-HMM. The experimental results are presented in Section 6. 2 Theories of Syllabification There is some debate as to the exact structure of a syllable. However, phonologists are in general agreement that a syllable consists of a nucleus (vowel sound), preceded by an optional onset and followed by an optional coda. In many languages, both the onset and the coda can be complex, i.e., composed of more than one consonant. For example, the word breakfast [br k-fst] contains two syllables, of which the first has a complex onset [br], and the second a complex coda [st]. Languages differ with respect to various typological parameters, such as optionality of onsets, admissibility of codas, and the allowed complexity of the syllable constituents. For example, onsets are required in German, while Spanish prohibits complex codas. 
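As a concrete illustration of this onset-nucleus-coda decomposition, the following minimal sketch (ours, with an assumed toy phoneme inventory rather than a real one) splits a single syllable into its three constituents:

VOWELS = {"a", "e", "i", "o", "u", "@"}   # toy phoneme inventory (assumed)

def split_syllable(phonemes):
    # Split one syllable into (onset, nucleus, coda): everything before the
    # first vowel is the onset, the vowel material is the nucleus, and any
    # remaining consonants form the coda.
    first = next(i for i, p in enumerate(phonemes) if p in VOWELS)
    last = max(i for i, p in enumerate(phonemes) if p in VOWELS)
    return phonemes[:first], phonemes[first:last + 1], phonemes[last + 1:]

# e.g. a syllable with a complex onset, and one with a complex coda
print(split_syllable(["b", "r", "e", "k"]))      # (['b', 'r'], ['e'], ['k'])
print(split_syllable(["f", "@", "s", "t"]))      # (['f'], ['@'], ['s', 't'])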
There are a number of theories of syllabification; we present three of the most prevalent. The Legality Principle constrains the segments that can begin and end syllables to those that appear at the beginning and end of words. In other words, a syllable is not allowed to begin with a consonant cluster that is not found at the beginning of some word, or end with a cluster that is not found at the end of some word (Goslin and Frauenfelder, 2001). Thus, a word like admit [dmÁt] must be syllabified [dmÁt] because [dm] never appears word-initially or word-finally in English. A shortcoming of the legality principle is that it does not always imply a unique syllabification. For example, in a word like askew [skju], the principle does not rule out any of [-skju], [s-kju], or [sk-ju], as all three employ legal onsets and codas. The Sonority Sequencing Principle (SSP) provides a stricter definition of legality. The sonority of a sound is its inherent loudness, holding factors like pitch and duration constant (Crystal, 2003). Low vowels like [a], the most sonorous sounds, are high on the sonority scale, while plosive consonants like [t] are at the bottom. When syllabifying a word, SSP states that sonority should increase from the first phoneme of the onset to the syllable's nu309 cleus, and then fall off to the coda (Selkirk, 1984). Consequently, in a word like vintage [vÁntÁ ], we can rule out a syllabification like [vÁ-ntÁ ] because [n] is more sonorant than [t]. However, SSP does not tell us whether to prefer [vÁn-tÁ ] or [vÁnt-Á ]. Moreover, when syllabifying a word like vintner [vÁntnr], the theory allows both [vÁn-tnr] and [vÁntnr], even though [tn] is an illegal onset in English. Both the Legality Principle and SSP tell us which onsets and codas are permitted in legal syllables, and which are not. However, neither theory gives us any guidance when deciding between legal onsets. The Maximal Onset Principle addresses this by stating we should extend a syllable's onset at the expense of the preceding syllable's coda whenever it is legal to do so (Kahn, 1976). For example, the principle gives preference to [-skju] and [vÁn-tÁ ] over their alternatives. 3 Previous Computational Approaches Unlike tasks such as part of speech tagging or syntactic parsing, syllabification does not involve structural ambiguity. It is generally believed that syllable structure is usually predictable in a language provided that the rules have access to all conditioning factors: stress, morphological boundaries, part of speech, etymology, etc. (Blevins, 1995). However, in speech applications, the phonemic transcription of a word is often the only linguistic information available to the system. This is the common assumption underlying a number of computational approaches that have been proposed for the syllabification of phonemes. Daelemans and van den Bosch (1992) present one of the earliest systems on automatic syllabification: a neural network-based implementation for Dutch. Daelemans et al. (1997) also explore the application of exemplar-based generalization (EBG), sometimes called instance-based learning. EBG generally performs a simple database look-up to syllabify a test pattern, choosing the most common syllabification. In cases where the test pattern is not found in the database, the most similar pattern is used to syllabify the test pattern. Hidden Markov Models (HMMs) are another popular approach to syllabification. Krenn (1997) introduces the idea of treating syllabification as a tagging task. 
Working from a list of syllabified phoneme strings, she automatically generates tags for each phone. She uses a second-order HMM to predict sequences of tags; syllable boundaries can be trivially recovered from the tags. Demberg (2006) applies a fourth-order HMM to the syllabification task, as a component of a larger German text-tospeech system. Schmid et al. (2007) improve on Demberg's results by applying a fifth-order HMM that conditions on both the previous tags and their corresponding phonemes. Kiraz and M¨ bius (1998) present a weighted o finite-state-based approach to syllabification. Their language-independent method builds an automaton for each of onsets, nuclei, and codas, by counting occurrences in training data. These automatons are then composed into a transducer accepting sequences of one or more syllables. They do not report quantitative results for their method. Pearson et al. (2000) compare two rule-based systems (they do not elaborate on the rules employed) with a CART decision tree-based approach and a "global statistics" algorithm. The global statistics method is based on counts of consonant clusters in contexts such as word boundaries, short vowels, or long vowels. Each test word has syllable boundaries placed according to the most likely location given a cluster and its context. In experiments performed with their in-house dataset, their statistics-based method outperforms the decisiontree approach and the two rule-based methods. M¨ ller (2001) presents a hybrid of a categoriu cal and data-driven approach. First, she manually constructs a context-free grammar of possible syllables. This grammar is then made probabilistic using counts obtained from training data. M¨ ller (2006) u attempts to make her method language-independent. Rather than hand-crafting her context-free grammar, she automatically generates all possible onsets, nuclei, and codas, based on the phonemes existing in the language. The results are somewhat lower than in (M¨ ller, 2001), but the approach can be more easu ily ported across languages. Goldwater and Johnson (2005) also explore using EM to learn the structure of English and German phonemes in an unsupervised setting, following M¨ ller in modeling syllable structure with PCFGs. u They initialize their parameters using a deterministic 310 parser implementing the sonority principle and estimate the parameters for their maximum likelihood approach using EM. Marchand et al. (2007) apply their Syllabification by Analogy (SbA) technique, originally developed for orthographic forms, to the pronunciation domain. For each input word, SbA finds the most similar substrings in a lexicon of syllabified phoneme strings and then applies the dictionary syllabifications to the input word. Their survey paper also includes comparisons with a method broadly based on the legality principle. The authors find their legalitybased implementation fares significantly worse than SbA. 4 Categorical Approaches Categorical approaches to syllabification are appealing because they are efficient and linguistically intuitive. In addition, they require little or no syllableannotated data. We present three categorical algorithms that implement the linguistic insights outlined in Section 2. All three can be viewed as variations on the basic pseudo-code shown in Figure 1. Every vowel is labeled as a nucleus, and every consonant is labeled as either an onset or a coda. The algorithm labels all consonants as onsets unless it is illegal to do so. 
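Since the pseudo-code of Figure 1 does not reproduce well here, the following is a runnable sketch of the same generic procedure (our reconstruction, with an assumed toy vowel inventory); the is_legal_onset predicate is the pluggable piece that distinguishes the three methods described next:

VOWELS = {"a", "e", "i", "o", "u", "@", "I", "E"}   # toy inventory (assumed)

def syllabify(phonemes, is_legal_onset):
    # Label every vowel as a nucleus and attach as many preceding consonants
    # to the onset as the supplied legality test allows (onset maximization);
    # rejected consonants fall back into the previous syllable's coda.
    syllables, i = [], 0
    while i < len(phonemes):
        j = i
        while j < len(phonemes) and phonemes[j] not in VOWELS:
            j += 1
        if j == len(phonemes):                 # trailing consonants -> coda
            if syllables:
                syllables[-1].extend(phonemes[i:])
            else:
                syllables.append(list(phonemes[i:]))
            break
        onset, nucleus = list(phonemes[i:j]), [phonemes[j]]
        if syllables:
            while onset and not is_legal_onset(tuple(onset)):
                syllables[-1].append(onset.pop(0))
        syllables.append(onset + nucleus)
        i = j + 1
    return syllables

# MAXONSET baseline: any consonant cluster counts as a legal onset
print(syllabify(list("askju"), lambda onset: True))   # [['a'], ['s', 'k', 'j', 'u']]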
Given the labels, it is straightforward to syllabify a word. The three methods differ in how they determine a "legal" onset. As a rough baseline, the M AX O NSET implementation considers all combinations of consonants to be legal onsets. Only word-final consonants are labeled as codas. L EGALITY combines the Legality Principle with onset maximization. In our implementation, we collect all word-initial consonant clusters from the corpus and deem them to be legal onsets. With this method, no syllable can have an onset that does not appear word-initially in the training data. We do not test for the legality of codas. The performance of L EGALITY depends on the number of phonetic transcriptions that are available, but the transcriptions need not be annotated with syllable breaks. S ONORITY combines the Sonority Sequencing Principle with onset maximization. In this approach, an onset is considered legal if every member of the onset ranks lower on the sonority scale than ensuing until current phoneme is a vowel label current phoneme as an onset end loop until all phonemes have been labeled label current phoneme as a nucleus if there are no more vowels in the word label all remaining consonants as codas else onset := all consonants before next vowel coda := empty until onset is legal coda := coda plus first phoneme of onset onset := onset less first phoneme end loop end if end loop Insert syllable boundaries before onsets Sound Vowels Glides Liquids Nasals Obstruents Examples u, , . . . w, j, . . . l, r, . . . m, Ć, . . . g, Ě, . . . Level 4 3 2 1 0 Figure 2: The sonority scale employed by S ONORITY. A special provision allows for prepending the phoneme [s] to onsets beginning with a voiceless plosive. This reflects the special status of [s] in English, where onsets like [sk] and [sp] are legal even though the sonority is not strictly increasing. Figure 1: Pseudo-code for syllabifying a string of phonemes. 5 Supervised Approach: SVM-HMM If annotated data is available, a classifier can be trained to predict the syllable breaks. A Support Vector Machine (SVM) is a discriminative supervised learning technique that allows for a rich feature representation of the input space. In principle, we could use a multi-class SVM to classify each phoneme according to its position in a syllable on the basis of a set of features. However, a traditional SVM would treat each phoneme in a word as an independent instance, preventing us from considering interactions between labels. In order to overcome this shortcoming, we employ an SVM-HMM1 (Altun et al., 2003), an instance of the Structured SVM formalism (Tsochantaridis et al., 2004) that has been specialized for sequence tagging. When training a structured SVM, each training instance xi is paired with its label yi , drawn from the set of possible labels, Yi . In our case, the training instances xi are words, represented as sequences of phonemes, and their labels yi are syllabifications, represented as sequences of onset/nucleus/coda tags. For each training example, a feature vector (x, y) represents the relationship between the example and a candidate tag sequence. The SVM finds a weight vector w, such that w · (x, y) separates correct taggings from incorrect taggings by as large a margin as possible. Hamming distance DH is used to capture how close a wrong sequence y is to yi , which 1 consonants. S ONORITY requires no training data because it implements a sound linguistic theory. 
However, an existing development set for a given language can help with defining and validating additional language-specific constraints. Several sonority scales of varying complexity have been proposed. For example, Selkirk (1984) specifies a hierarchy of eleven distinct levels. We adopt a minimalistic scale shown in Figure 2. which avoids most of the disputed sonority contrasts (Jany et al., 2007). We set the sonority distance parameter to 2, which ensures that adjacent consonants in the onset differ by at least two levels of the scale. For example, [pr] is an acceptable onset because it is composed of an obstruent and a liquid, but [pn] is not, because nasals directly follow obstruents on our sonority scale. In addition, we incorporate several Englishspecific constraints listed by Kenstowicz (1994, pages 257­258). The constraints, or filters, prohibit complex onsets containing: (i) two labials (e.g., [pw], [bw], [fw], [vw]), (ii) a non-strident coronal followed by a lateral (e.g., [tl], [dl], [Ěl]) (iii) a voiced fricative (e.g., [vr], [zw], except [vj]), (iv) a palatal consonant (e.g., [Ël], [ r], except [Ër]). 311 http://svmlight.joachims.org/svm struct.html in turn impacts the required margin. Tag sequences that share fewer tags in common with the correct sequence are separated by a larger margin. Mathematically, a (simplified) statement of the SVM learning objective is: i yYi ,y=yi : [(xi , yi ) · w > (xi , y) · w + DH (y, yi )] (1) Method M AX O NSET L EGALITY S ONORITY SVM-HMM tsylb English 61.38 93.16 95.00 98.86 93.72 Table 1: Word accuracy on the CELEX dataset. This objective is only satisfied when w tags all training examples correctly. In practice, slack variables are introduced, which allow us to trade off training accuracy and the complexity of w via a cost parameter. We tune this parameter on our development set. The SVM-HMM training procedure repeatedly uses the Viterbi algorithm to find, for the current w and each (xi , yi ) training pair, the sequence y that most drastically violates the inequality shown in Equation 1. These incorrect tag sequences are added to a growing set, which constrains the quadratic optimization procedure used to find the next w. The process iterates until no new violating sequences are found, producing an approximation to the inequality over all y Yi . A complete explanation is given by Tsochantaridis et al. (2004). Given a weight vector w, a structured SVM tags new instances x according to: argmaxyY [(x, y) · w] (2) The SVM-HMM gets the HMM portion of its name from its use of the HMM Viterbi algorithm to solve this argmax. 5.1 Features sequence y. Unlike a generative HMM, these emission features do not require any conditional independence assumptions. Transition features link tags to tags. Our only transition features are counts of adjacent tag pairs occurring in y. For the emission features, we use the current phoneme and a fixed-size context window of surrounding phonemes. Thus, the features for the phoneme [k] in hockey [h ki] might include the [ ] preceding it, and the [i] following it. In experiments on our development set, we found that the optimal window size is nine: four phonemes on either side of the focus phoneme. Because the SVM-HMM is a linear classifier, we need to explicitly state any important conjunctions of features. This allows us to capture more complex patterns in the language that unigrams alone cannot describe. For example, the bigram [ps] is illegal as an onset in English, but perfectly reasonable as a coda. 
Experiments on the development set showed that performance peaked using all unigrams, bigrams, trigrams, and four-grams found within our context window. 6 Syllabification Experiments We developed our approach using the English portion of the CELEX lexical database (Baayen et al., 1995). CELEX provides the phonemes of a word and its correct syllabification. It does not designate the phonemes as onsets, nuclei, or codas, which is the labeling we want to predict. Fortunately, extracting the labels from a syllabified word is straightforward. All vowel phones are assigned to be nuclei; consonants preceding the nucleus in a syllable are assigned to be onsets, while consonants following the nucleus in a syllable are assigned to be codas. The results in Table 1 were obtained on a test set of 5K randomly selected words. For training the SVM-HMM, we randomly selected 30K words not We investigated several tagging schemes, described in detail by Bartlett (2007). During development, we found that tagging each phoneme with its syllabic role (Krenn, 1997) works better than the simple binary distinction between syllable-final and other phonemes (van den Bosch, 1997). We also discovered that accuracy can be improved by numbering the tags. Therefore, in our tagging scheme, the single-syllable word strengths [str ĆĚs] would be labeled with the sequence {O1 O2 O3 N1 C1 C2 C3}. Through the use of the Viterbi algorithm, our feature vector (x, y) is naturally divided into emission and transition features. Emission features link an aspect of the input word x with a single tag in the 312 appearing in the test set, while 6K training examples were held out for development testing. We report the performance in terms of word accuracy (entire words syllabified correctly). Among the categorical approaches, S ONORITY clearly outperforms not only L EGALITY, but also tsylb (Fisher, 1996), an implementation of the complex algorithm of Kahn (1976), which makes use of lists of legal English onsets. Overall, our SVM-based approach is a clear winner. The results of our discriminative method compares favorably with the results of competing approaches on English CELEX. Since there are no standard train-test splits for syllabification, the comparison is necessarily indirect, but note that our training set is substantially smaller. For her language-independent PCFG-based approach, M¨ ller (2006) reports 92.64% word accuracy on the u set of 64K examples from CELEX using 10-fold cross-validation. The Learned EBG approach of van den Bosch (1997) achieves 97.78% word accuracy when training on approximately 60K examples. Therefore, our results represent a nearly 50% reduction of the error rate. syllabifications of tooth-ache and pass-ports follow the morphological boundaries of the compound words. Morphological factors are a source of errors for both approaches, but significantly more so for S ONORITY. The performance difference comes mainly from the SVM's ability to handle many of these morphological exceptions. The SVM generates the correct syllabification of northeast [norĚist], even though an onset of [Ě] is perfectly legal. On the other hand, the SVM sometimes overgeneralizes, as in the last example in Table 4. SVM-HMM tu-Ěek p -sports norĚ-ist dÁs-plizd dÁs-koz S ONORITY tu-Ěek p -sports nor-Ěist dÁ-splizd dÁ-skoz toothache passports northeast displeased discos Figure 4: Examples of syllabification errors. (Correct syllabifications are shown in bold.) 
6.2 The NETtalk Dataset Figure 3: Word accuracy on English CELEX as a function of the number of thousands of training examples. Though the SVM-HMM's training data requirements are lower than previous supervised syllabification approaches, they are still substantial. Figure 3 shows a learning curve over varying amounts of training data. Performance does not reach acceptable levels until 5K training examples are provided. 6.1 Error Analysis There is a fair amount of overlap in the errors made by the SVM-HMM and the SONORITY. Table 4 shows a few characteristic examples. The CELEX 313 Marchand et al. (2007) report a disappointing word accuracy of 54.14% for their legality-based implementation, which does not accord with the results of our categorical approaches on English CELEX. Consequently, we also apply our methods to the dataset they used for their experiments: the NETtalk dictionary. NETtalk contains 20K English words; in the experiments reported here, we use 13K training examples and 7K test words. As is apparent from Table 2, our performance degrades significantly when switching to NETtalk. The steep decline found in the categorical methods is particularly notable, and indicates significant divergence between the syllabifications employed in the two datasets. Phonologists do not always agree on the correct syllable breaks for a word, but the NETtalk syllabifications are often at odds with linguistic intuitions. We randomly selected 50 words and compared their syllabifications against those found in Merriam-Webster Online. We found that CELEX syllabifications agree with MerriamWebster in 84% of cases, while NETtalk only agrees 52% of the time. Figure 5 shows several words from the NETtalk Method M AX O NSET S ONORITY L EGALITY SVM-HMM English 33.64 52.80 53.08 92.99 Table 2: Word accuracy on the NETtalk dataset. Method M AX O NSET S ONORITY L EGALITY SVM-HMM (50K words) SVM-HMM (250K words) German 19.51 76.32 79.55 99.26 99.87 Dutch 23.44 77.51 64.31 97.79 99.16 Table 3: Word accuracy on the CELEX dataset. and CELEX datasets. We see that CELEX follows the maximal onset principle consistently, while NETtalk does in some instances but not others. We also note that there are a number of NETtalk syllabifications that are clearly wrong, such as the last two examples in Figure 5. The variability of NETtalk is much more difficult to capture with any kind of principled approach. Thus, we argue that low performance on NETtalk indicate inconsistent syllabifications within that dataset, rather than any actual deficiency of the methods. NETtalk s-taÁz r z-Ád-ns dÁ-strÇÁ fo-t n r-p -io er--baÍ-t CELEX -staÁz r -zÁ-dns dÁ-strÇÁ fo-t n r-p - i-o -r-baÍt chastise residence destroy photon arpeggio thereabout Figure 5: Examples of CELEX and NETtalk syllabifications. NETtalk's variable syllabification practices notwithstanding, the SVM-HMM approach still outperforms the previous benchmark on the dataset. Marchand et al. (2007) report 88.53% word accuracy for their SbA technique using leave-one-out testing on the entire NETtalk set (20K words). With fewer training examples, we reduce the error rate by almost 40%. 6.3 Other Languages We performed experiments on German and Dutch, the two other languages available in the CELEX lexical database. The German and Dutch lexicons of CELEX are larger than the English lexicon. For both languages, we selected a 25K test set, and two different training sets, one containing 50K words and the other containing 250K words. The results are 314 presented in Table 3. 
While our SVM-HMM approach is entirely language independent, the same cannot be said about other methods. The maximal onset principle appears to hold much more strongly for English than for German and Dutch (e.g., patron: [pe-trn] vs. [pat-ron]). L EGALITY and S ONORITY also appear to be less effective, possibly because of greater tendency for syllabifications to match morphological boundaries (e.g., English exclusive: [Ák-sklu-sÁv] vs. Dutch exclusief [ ks-kly-zif]). S ONORITY is further affected by our decision to employ the constraints of Kenstowicz (1994), although they clearly pertain to English. We expect that adapting them to specific languages would bring the results closer to the level of the English experiments. Although our SVM system is tuned using an English development set, the results on both German and Dutch are excellent. We could not find any quantitative data for comparisons on Dutch, but the comparison with the previously reported results on German CELEX demonstrates the quality of our approach. The numbers that follow refer to 10-fold cross-validation on the entire lexicon (over 320K entries) unless noted otherwise. Krenn (1997) obtains tag accuracy of 98.34%, compared to our system's tag accuracy of 99.97% when trained on 250K words. With a hand-crafted grammar, M¨ ller (2002) achieves 96.88% word accuracy u on CELEX-derived syllabifications, with a training corpus of two million tokens. Without a handcrafted grammar, she reports 90.45% word accuracy (M¨ ller, 2006). Applying a standard smoothing u algorithm and fourth-order HMM, Demberg (2006) scores 98.47% word accuracy. A fifth-order joint N -gram model of Schmid et al. (2007) achieves 99.85% word accuracy with about 278K training points. However, unlike generative approaches, our Method S ONORITY SVM-HMM Categorical Parser Maximum Likelihood English 97.0 99.9 94.9 98.1 German 94.2 99.4 92.7 97.4 7 Conclusion We have presented several different approaches to the syllabification of phonemes. The results of our linguistically-motivated algorithms, show that it is possible to achieve adequate syllabification word accuracy in English with no little or no syllableannotated training data. We have demonstrated that the poor performance of categorical methods on English NETtalk actually points to problems with the NETtalk annotations, rather than with the methods themselves. We have also shown that SVM-HMMs can be used to great effect when syllabifying phonemes. In addition to being both efficient and languageindependent, they establish a new state-of-the-art for English and Dutch syllabification. However, they do require thousands of labeled training examples to achieve this level of accuracy. In the future, we plan to explore a hybrid approach, which would benefit from both the generality of linguistic principles and the smooth exception-handling of supervised techniques, in order to make best use of whatever data is available. Table 4: Word accuracy on the datasets of Goldwater and Johnson (2005). SVM-HMM can condition each emission on large portions of the input using only a first-order Markov model, which implies much faster syllabification performance. 6.4 Direct Comparison with an MLE approach The results of the competitive approaches that have been quoted so far (with the exception of tsylb) are not directly comparable, because neither the respective implementations, nor the actual train-test splits are publicly available. 
However, we managed to obtain the English and German data sets used by Goldwater and Johnson (2005) in their study, which focused primarily on unsupervised syllabification. Their experimental framework is similar to (M¨ ller, 2001). They collect words from running u text and create a training set of 20K tokens and a test set of 10K tokens. The running text was taken from the Penn WSJ and ECI corpora, and the syllabified phonemic transcriptions were obtained from CELEX. Table 4 compares our experimental results with their reported results obtained with: (a) supervised Maximum Likelihood training procedures, and (b) a Categorical Syllable Parser implementing the principles of sonority sequencing and onset maximization without Kenstowicz's (1994) onset constraints. The accuracy figures in Table 4 are noticeably higher than in Table 1. This stems from fundamental differences in the experimental set-up; Goldwater and Johnson (2005) test on tokens as found in text, therefore many frequent short words are duplicated. Furthermore, some words occur during both training and testing, to the benefit of the supervised systems (SVM-HMM and Maximum Likelihood). Nevertheless, the results confirm the level of improvement obtained by both our categorical and supervised approaches. 315 Acknowledgements We are grateful to Sharon Goldwater for providing the experimental data sets for comparison. This research was supported by the Natural Sciences and Engineering Research Council of Canada and the Alberta Informatics Circle of Research Excellence. References Yasemin Altun, Ioannis Tsochantaridis, and Thomas Hofmann. 2003. Hidden markov support vector machines. Proceedings of the 20Th International Conference on Machine Learning (ICML). R. Baayen, R. Piepenbrock, and L. Gulikers. 1995. The CELEX lexical database (CD-ROM). Susan Bartlett, Grzegorz Kondrak, and Colin Cherry. 2008. Automatic syllabification with structured SVMs for letter-to-phoneme conversion. In Proceedings of ACL-08: HLT, pages 568­576, Columbus, Ohio. Susan Bartlett. 2007. Discriminative approach to automatic syllabification. Master's thesis, Department of Computing Science, University of Alberta. Juliette Blevins. 1995. The syllable in phonological theory. In John Goldsmith, editor, The handbook of phonological theory, pages 206­244. Blackwell. Gosse Bouma. 2002. Finite state methods for hyphenation. Natural Language Engineering, 1:1­16. David Crystal. 2003. A dictionary of linguistics and phonetics. Blackwell. Walter Daelemans and Antal van den Bosch. 1992. Generalization performance of backpropagaion learning on a syllabification task. In Proceedings of the 3rd Twente Workshop on Language Technology, pages 27­ 38. Walter Daelemans, Antal van den Bosch, and Ton Weijters. 1997. IGTree: Using trees for compression and classification in lazy learning algorithms. Artificial Intelligence Review, pages 407­423. Robert Damper. 2001. Learning about speech from data: Beyond NETtalk. In Data-Driven Techniques in Speech Synthesis, pages 1­25. Kluwer Academic Publishers. Vera Demberg. 2006. Letter-to-phoneme conversion for a German text-to-speech system. Master's thesis, University of Stuttgart. William Fisher. 1996. Tsylb syllabification package. ftp://jaguar.ncsl.nist.gov/pub/tsylb2-1.1.tar.Z. Last accessed 31 March 2008. Sharon Goldwater and Mark Johnson. 2005. Representational bias in usupervised learning of syllable structure. In Prcoeedings of the 9th Conference on Computational Natural Language Learning (CoNLL), pages 112­119. 
Jeremy Goslin and Ulrich Frauenfelder. 2001. A comparison of theoretical and human syllabification. Language and Speech, 44:409­436. Carmen Jany, Matthew Gordon, Carlos M Nash, and Nobutaka Takara. 2007. How universal is the sonority hierarchy? A cross-linguistic study. In 16th International Congress of Phonetic Sciences, pages 1401­ 1404. Daniel Kahn. 1976. Syllable-based generalizations in English Phonology. Ph.D. thesis, Indiana University. Michael Kenstowicz. 1994. Phonology in Generative Grammar. Blackwell. George Kiraz and Bernd M¨ bius. 1998. Multilingual o syllabification using weighted finite-state transducers. In Proceedings of the 3rd Workshop on Speech Synthesis. Brigitte Krenn. 1997. Tagging syllables. In Proceedings of Eurospeech, pages 991­994. Yannick Marchand and Robert Damper. 2007. Can syllabification improve pronunciation by analogy of English? Natural Language Engineering, 13(1):1­24. Yannick Marchand, Connie Adsett, and Robert Damper. 2007. Automatic syllabification in English: A comparison of different algorithms. Language and Speech. To appear. Karin M¨ ller, Bernd M¨ bius, and Detlef Prescher. 2000. u o Inducing probabilistic syllable classes using multivariate clustering. In Prcoeedings of the 38th meeting of the ACL. Karin M¨ ller. 2001. Automatic detection of syllable u boundaries combining the advantages of treebank and bracketed corpora training. Proceedings on the 39Th Meeting of the ACL. Karin M¨ ller. 2002. Probabilistic context-free grammars u for phonology. Proceedings of the 6th Workshop of the ACL Special Interest Group in Computational Phonology (SIGPHON), pages 80­90. Karin M¨ ller. 2006. Improving syllabification modu els with phonotactic knowledge. Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology At HLT-NAACL. Steve Pearson, Roland Kuhn, Steven Fincke, and Nick Kibre. 2000. Automatic methods for lexical stress assignment and syllabification. In Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP). Helmut Schmid, Bernd M¨ bius, and Julia Weidenkaff. o 2007. Tagging syllable boundaries with joint N-gram models. In Proceedings of Interspeech. Elisabeth Selkirk. 1984. On the major class features and syllable theory. In Language Sound Structure. MIT Press. Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. 2004. Support vector machine learning for interdependent and structured output spaces. Proceedings of the 21st International Conference on Machine Learning (ICML). Antal van den Bosch. 1997. Learning to pronounce written words: a study in inductive language learning. Ph.D. thesis, Universiteit Maastricht. 316 Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars Mark Johnson Brown University Providence, RI Mark Johnson@Brown.edu Sharon Goldwater University of Edinburgh Edinburgh EH8 9AB sgwater@inf.ed.ac.uk Abstract One of the reasons nonparametric Bayesian inference is attracting attention in computational linguistics is because it provides a principled way of learning the units of generalization together with their probabilities. Adaptor grammars are a framework for defining a variety of hierarchical nonparametric Bayesian models. This paper investigates some of the choices that arise in formulating adaptor grammars and associated inference procedures, and shows that they can have a dramatic impact on performance in an unsupervised word segmentation task. 
With appropriate adaptor grammars and inference procedures we achieve an 87% word token f-score on the standard Brent version of the BernsteinRatner corpus, which is an error reduction of over 35% over the best previously reported results for this corpus. 1 Introduction Most machine learning algorithms used in computational linguistics are parametric, i.e., they learn a numerical weight (e.g., a probability) associated with each feature, where the set of features is fixed before learning begins. Such procedures can be used to learn features or structural units by embedding them in a "propose-and-prune" algorithm: a feature proposal component proposes potentially useful features (e.g., combinations of the currently most useful features), which are then fed to a parametric learner that estimates their weights. After estimating feature weights and pruning "useless" low-weight features, the cycle repeats. While such algorithms can achieve impressive results (Stolcke and Omohundro, 317 1994), their effectiveness depends on how well the feature proposal step relates to the overall learning objective, and it can take considerable insight and experimentation to devise good feature proposals. One of the main reasons for the recent interest in nonparametric Bayesian inference is that it offers a systematic framework for structural inference, i.e., inferring the features relevant to a particular problem as well as their weights. (Here "nonparametric" means that the models do not have a fixed set of parameters; our nonparametric models do have parameters, but the particular parameters in a model are learned along with their values). Dirichlet Processes and their associated predictive distributions, Chinese Restaurant Processes, are one kind of nonparametric Bayesian model that has received considerable attention recently, in part because they can be composed in hierarchical fashion to form Hierarchical Dirichlet Processes (HDP) (Teh et al., 2006). Lexical acquisition is an ideal test-bed for exploring methods for inferring structure, where the features learned are the words of the language. (Even the most hard-core nativists agree that the words of a language must be learned). We use the unsupervised word segmentation problem as a test case for evaluating structural inference in this paper. Nonparametric Bayesian methods produce state-of-the-art performance on this task (Goldwater et al., 2006a; Goldwater et al., 2007; Johnson, 2008). In a computational linguistics setting it is natural to try to align the HDP hierarchy with the hierarchy defined by a grammar. Adaptor grammars, which are one way of doing this, make it easy to explore a wide variety of HDP grammar-based models. Given an appropriate adaptor grammar, the fea- Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 317­325, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics tures learned by adaptor grammars can correspond to linguistic units such as words, syllables and collocations. Different adaptor grammars encode different assumptions about the structure of these units and how they relate to each other. A generic adaptor grammar inference program infers these units from training data, making it easy to investigate how these assumptions affect learning (Johnson, 2008).1 However, there are a number of choices in the design of adaptor grammars and the associated inference procedure. 
While this paper studies the impact of these on the word segmentation task, these choices arise in other nonparametric Bayesian inference problems as well, so our results should be useful more generally. The rest of this paper is organized as follows. The next section reviews adaptor grammars and presents three different adaptor grammars for word segmentation that serve as running examples in this paper. Adaptor grammars contain a large number of adjustable parameters, and Section 3 discusses how these can be estimated using Bayesian techniques. Section 4 examines several implementation options within the adaptor grammar inference algorithm and shows that they can make a significant impact on performance. Cumulatively these changes make a significant difference in word segmentation accuracy: our final adaptor grammar performs unsupervised word segmentation with an 87% token f-score on the standard Brent version of the Bernstein-Ratner corpus (Bernstein-Ratner, 1987; Brent and Cartwright, 1996), which is an error reduction of over 35% compared to the best previously reported results on this corpus.

2 Adaptor grammars

This section informally introduces adaptor grammars using unsupervised word segmentation as a motivating application; see Johnson et al. (2007b) for a formal definition of adaptor grammars. Consider the problem of learning language from continuous speech: segmenting each utterance into words is a nontrivial problem that language learners must solve. Elman (1990) introduced an idealized version of this task, and Brent and Cartwright (1996) presented a version of it where the data consists of unsegmented phonemic representations of the sentences in the Bernstein-Ratner corpus of child-directed speech (Bernstein-Ratner, 1987). Because these phonemic representations are obtained by looking up orthographic forms in a pronouncing dictionary and appending the results, identifying the word tokens is equivalent to finding the locations of the word boundaries. For example, the phoneme string corresponding to "you want to see the book" (with its correct segmentation indicated by the vertical bars) is as follows:

y u | w a n t | t u | s i | D 6 | b U k

We can represent any possible segmentation of any possible sentence as a tree generated by the following unigram grammar.

Sentence → Word+
Word → Phoneme+

The nonterminal Phoneme expands to each possible phoneme; the underlining, which identifies "adapted nonterminals", will be explained below. In this paper "+" abbreviates right-recursion through a dummy nonterminal, i.e., the unigram grammar actually is:

Sentence → Word
Sentence → Word Sentence
Word → Phonemes
Phonemes → Phoneme
Phonemes → Phoneme Phonemes

A PCFG with these productions can represent all possible segmentations of any Sentence into a sequence of Words. But because it assumes that the probability of a word is determined purely by multiplying together the probability of its individual phonemes, it has no way to encode the fact that certain strings of phonemes (the words of the language) have much higher probabilities than other strings containing the same phonemes. In order to do this, a PCFG would need productions like the following one, which encodes the fact that "want" is a Word.

Word → w a n t

Adaptor grammars can be viewed as a way of formalizing this idea. Adaptor grammars learn the probabilities of entire subtrees, much as in tree substitution grammar (Joshi, 2003) and DOP (Bod, 1998).
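To make the unigram grammar above concrete, the following short Python sketch (ours, not part of the paper) enumerates every segmentation of a phoneme string that the grammar licenses, i.e. every way of analysing the string as a sequence of Words.

    def segmentations(phonemes):
        # All ways of splitting the phoneme sequence into one or more Words,
        # i.e. all trees licensed by Sentence -> Word+ and Word -> Phoneme+.
        if not phonemes:
            yield []
            return
        for i in range(1, len(phonemes) + 1):
            for rest in segmentations(phonemes[i:]):
                yield [tuple(phonemes[:i])] + rest

    # The example utterance from the text, as a list of phoneme symbols.
    utterance = "y u w a n t t u s i D 6 b U k".split()
    print(sum(1 for _ in segmentations(utterance)))  # 2**14 = 16384 possible segmentations

The point of the sketch is simply that the space of analyses is exponentially large, which is why the models below place a learned distribution over it rather than enumerating it.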
(For computational efficiency reasons adaptor grammars require these subtrees to expand to terminals.) The set of possible adapted tree fragments is the set of all subtrees generated by the CFG whose root label is a member of the set of adapted nonterminals A (adapted nonterminals are indicated by underlining in this paper). For example, in the unigram adaptor grammar A = {Word}, which means that the adaptor grammar inference procedure learns the probability of each possible Word subtree. Thus adaptor grammars are simple models of structure learning in which adapted subtrees are the units of generalization. One might try to reduce adaptor grammar inference to PCFG parameter estimation by introducing a context-free rule for each possible adapted subtree, but such an attempt would fail because the number of such adapted subtrees, and hence the number of corresponding rules, is unbounded. However nonparametric Bayesian inference techniques permit us to sample from this infinite set of adapted subtrees, and only require us to instantiate the finite number of them needed to analyse the finite training data.

An adaptor grammar is a 7-tuple (N, W, R, S, θ, A, C) where (N, W, R, S, θ) is a PCFG with nonterminals N, terminals W, rules R, start symbol S ∈ N and rule probabilities θ, where θ_r is the probability of rule r ∈ R; A ⊆ N is the set of adapted nonterminals and C is a vector of adaptors indexed by elements of A, so C_X is the adaptor for adapted nonterminal X ∈ A. Informally, an adaptor C_X nondeterministically maps a stream of trees from a base distribution H_X whose support is T_X (the set of subtrees whose root node is X ∈ N generated by the grammar's rules) into another stream of trees whose support is also T_X. In adaptor grammars the base distributions H_X are determined by the PCFG rules expanding X and the other adapted distributions, as explained in Johnson et al. (2007b). When called upon to generate another sample tree, the adaptor either generates and returns a fresh tree from H_X or regenerates a tree it has previously emitted, so in general the adapted distribution differs from the base distribution.

This paper uses adaptors based on Chinese Restaurant Processes (CRPs) or Pitman-Yor Processes (PYPs) (Pitman, 1995; Pitman and Yor, 1997; Ishwaran and James, 2003). CRPs and PYPs nondeterministically generate infinite sequences of natural numbers z_1, z_2, ..., where z_1 = 1 and each z_{n+1} ≤ m + 1 where m = max(z_1, ..., z_n). In the "Chinese Restaurant" metaphor samples produced by the adaptor are viewed as "customers" and z_n is the index of the "table" that the nth customer is seated at. In adaptor grammars each table in the adaptor C_X is labeled with a tree sampled from the base distribution H_X that is shared by all customers at that table; thus the nth sample tree from the adaptor C_X is the z_n-th sample from H_X. CRPs and PYPs differ in exactly how the sequence {z_k} is generated. Suppose z = (z_1, ..., z_n) have already been generated and m = max(z). Then a CRP generates the next table index z_{n+1} according to the following distribution:

P(Z_{n+1} = k | z) ∝ n_k(z)   if k ≤ m
P(Z_{n+1} = k | z) ∝ α        if k = m + 1

where n_k(z) is the number of times table k appears in z and α > 0 is an adjustable parameter that determines how often a new table is chosen.
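To make the CRP predictive distribution above concrete, here is a minimal Python sketch (ours, not the paper's implementation) that draws the next table index from the seating counts n_k(z) and the parameter α.

    import random

    def sample_crp_table(counts, alpha, rng=random):
        # counts[k] = n_k(z), the number of customers already seated at table k.
        # Returns the table for the next customer: an existing table k with
        # probability proportional to counts[k], or a new table (index
        # len(counts)) with probability proportional to alpha.
        r = rng.uniform(0, sum(counts) + alpha)
        for k, n_k in enumerate(counts):
            r -= n_k
            if r < 0:
                return k
        return len(counts)

    # Example: two occupied tables with 3 and 1 customers, alpha = 1.0.
    print(sample_crp_table([3, 1], alpha=1.0, rng=random.Random(0)))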
This means that if C_X is a CRP adaptor then the next tree t_{n+1} it generates is the same as a previously generated tree t with probability proportional to the number of times C_X has generated t before, and is a "fresh" tree t sampled from H_X with probability proportional to α_X H_X(t). This leads to a powerful "rich-get-richer" effect in which popular trees are generated with increasingly high probabilities. Pitman-Yor Processes can control the strength of this effect somewhat by moving mass from existing tables to the base distribution. The PYP predictive distribution is:

P(Z_{n+1} = k | z) ∝ n_k(z) − a   if k ≤ m
P(Z_{n+1} = k | z) ∝ m a + b      if k = m + 1

where a ∈ [0, 1] and b > 0 are adjustable parameters. It's easy to see that the CRP is a special case of the PYP where a = 0 and b = α. Each adaptor in an adaptor grammar can be viewed as estimating the probability of each adapted subtree t; this probability can differ substantially from t's probability H_X(t) under the base distribution. Because Words are adapted in the unigram adaptor grammar it effectively estimates the probability of each Word tree separately; the sampling estimators described in section 4 only instantiate those Words actually used in the analysis of Sentences in the corpus. While the Word adaptor will generally prefer to reuse Words that have been used elsewhere in the corpus, it is always possible to generate a fresh Word using the CFG rules expanding Word into a string of Phonemes. We assume for now that all CFG rules R_X expanding the nonterminal X ∈ N have the same probability (although we will explore estimating θ below), so the base distribution H_Word is a "monkeys banging on typewriters" model. That means the unigram adaptor grammar implements the Goldwater et al. (2006a) unigram word segmentation model, and in fact it produces segmentations of similar accuracies, and exhibits the same characteristic undersegmentation errors. As Goldwater et al. point out, because Words are the only units of generalization available to a unigram model it tends to misanalyse collocations as words, resulting in a marked tendency to undersegment. Goldwater et al. demonstrate that modelling bigram dependencies mitigates this undersegmentation. While adaptor grammars cannot express the Goldwater et al. bigram model, they can get much the same effect by directly modelling collocations (Johnson, 2008). A collocation adaptor grammar generates a Sentence as a sequence of Collocations, each of which expands to a sequence of Words.

Sentence → Colloc+
Colloc → Word+
Word → Phoneme+

Because Colloc is adapted, the collocation adaptor grammar learns Collocations as well as Words. (Presumably these approximate syntactic, semantic and pragmatic interword dependencies.) Johnson reported that the collocation adaptor grammar segments as well as the Goldwater et al. bigram model, which we confirm here. Recently other researchers have emphasised the utility of phonotactic constraints (i.e., modeling the allowable phoneme sequences at word onsets and endings) for word segmentation (Blanchard and Heinz, 2008; Fleck, 2008). Johnson (2008) points out that adaptor grammars that model words as sequences of syllables can learn and exploit these constraints, significantly improving segmentation accuracy. Here we present an adaptor grammar that models collocations together with these phonotactic constraints. This grammar is quite complex, permitting us to study the effects of the various model and implementation choices described below on a complex hierarchical nonparametric Bayesian model.
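Before turning to that grammar, a counterpart sketch (again ours, not the paper's code) for the Pitman-Yor predictive distribution given above; setting a = 0 recovers the CRP sampler, with b playing the role of α.

    import random

    def sample_pyp_table(counts, a, b, rng=random):
        # counts[k] = n_k(z); a in [0, 1] is the discount and b > 0 the strength.
        # Existing table k has mass counts[k] - a; a new table has mass m*a + b,
        # where m is the current number of tables, as in the PYP distribution above.
        m = len(counts)
        r = rng.uniform(0, sum(counts) + b)  # total mass: sum_k (n_k - a) + (m*a + b)
        for k, n_k in enumerate(counts):
            r -= n_k - a
            if r < 0:
                return k
        return m  # new table, to be labeled with a fresh tree from the base distribution

    print(sample_pyp_table([3, 1], a=0.5, b=1.0, rng=random.Random(0)))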
The collocation-syllable adaptor grammar generates a Sentence in terms of three levels of Collocations (enabling it to capture a wider range of interword dependencies), and generates Words as sequences of 1 to 4 Syllables. Syllables are subcategorized as to whether they are initial (I), final (F) or both (IF).

Sentence → Colloc3+
Colloc3 → Colloc2+
Colloc2 → Colloc1+
Colloc1 → Word+
Word → SyllableIF
Word → SyllableI (Syllable) (Syllable) SyllableF
Syllable → Onset Rhyme
Onset → Consonant+
Rhyme → Nucleus Coda
Nucleus → Vowel+
Coda → Consonant+
SyllableIF → OnsetI RhymeF
OnsetI → Consonant+
RhymeF → Nucleus CodaF
CodaF → Consonant+
SyllableI → OnsetI Rhyme
SyllableF → Onset RhymeF

Here Consonant and Vowel expand to all possible consonants and vowels respectively, and the parentheses in the expansion of Word indicate optionality. Because Onsets and Codas are adapted, the collocation-syllable adaptor grammar learns the possible consonant sequences that begin and end syllables. Moreover, because Onsets and Codas are subcategorized based on whether they are word-peripheral, the adaptor grammar learns which consonant clusters typically appear at word boundaries, even though the input contains no explicit word boundary information (apart from what it can glean from the sentence boundaries).

3 Bayesian estimation of adaptor grammar parameters

Adaptor grammars as defined in section 2 have a large number of free parameters that have to be chosen by the grammar designer: a rule probability θ_r for each PCFG rule r ∈ R and either one or two hyperparameters for each adapted nonterminal X ∈ A, depending on whether Chinese Restaurant or Pitman-Yor Processes are used as adaptors. It's difficult to have intuitions about the appropriate settings for the latter parameters, and finding the optimal values for these parameters by some kind of exhaustive search is usually computationally impractical. Previous work has adopted an expedient such as parameter tying. For example, Johnson (2008) set θ by requiring all productions expanding the same nonterminal to have the same probability, and used Chinese Restaurant Process adaptors with tied parameters α_X, which was set using a grid search. We now describe two methods of dealing with the large number of parameters in these models that are both more principled and more practical than the approaches described above. First, we can integrate out θ, and second, we can infer values for the adaptor hyperparameters using sampling. These methods (the latter in particular) make it practical to use Pitman-Yor Process adaptors in complex grammars such as the collocation-syllable adaptor grammar, where it is impractical to try to find optimal parameter values by grid search. As we will show, they also improve segmentation accuracy, sometimes dramatically.

3.1 Integrating out θ

Johnson et al. (2007a) describe Gibbs samplers for Bayesian inference of PCFG rule probabilities θ, and these techniques can be used directly with adaptor grammars as well. Just as in that paper, we place Dirichlet priors on θ: here θ_X is the subvector of θ corresponding to rules expanding nonterminal X ∈ N, and α_X is a corresponding vector of positive real numbers specifying the hyperparameters of the corresponding Dirichlet distributions:

P(θ | α) = ∏_{X ∈ N} Dir(θ_X | α_X)

(Because the Onset, Nucleus and Coda adaptors in this grammar learn the probabilities of these building blocks of words, the phoneme probabilities, which are most of what θ encodes, play a less important role.)
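To illustrate what integrating out θ amounts to in practice, here is a small sketch (ours; the Dirichlet-multinomial predictive form is standard, and the uniform α = 1 matches the prior used in the experiments below) of the probability of expanding a nonterminal X with rule r, given the current rule counts, once θ_X has been integrated out.

    def collapsed_rule_prob(rule, rule_counts, rules_for_X, alpha=1.0):
        # P(rule | counts) = (n_rule + alpha) / sum_{r in R_X} (n_r + alpha),
        # the Dirichlet-multinomial predictive probability with theta_X
        # integrated out.  rules_for_X is the set R_X of rules expanding the
        # same nonterminal X, and rule_counts maps rules to their counts in
        # the current sample parses.
        numerator = rule_counts.get(rule, 0) + alpha
        denominator = sum(rule_counts.get(r, 0) + alpha for r in rules_for_X)
        return numerator / denominator

    # Example with the unigram grammar's two Sentence rules (hypothetical counts):
    counts = {("Sentence", ("Word",)): 4, ("Sentence", ("Word", "Sentence")): 16}
    print(collapsed_rule_prob(("Sentence", ("Word",)), counts, list(counts)))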
3.2 Slice sampling adaptor hyperparameters As far as we know, there are no conjugate priors for the adaptor hyperparameters aX or bX (which corresponds to X in a Chinese Restaurant Process), so it is not possible to integrate them out as we did with the rule probabilities . However, it is possible to perform Bayesian inference by putting a prior on them and sampling their values. Because we have no strong intuitions about the values of these parameters we chose uninformative priors. We chose a uniform Beta(1, 1) prior on aX , and a "vague" Gamma(10, 0.1) prior on bX = X (MacKay, 2003). (We experimented with other parameters in the Gamma prior, but found no significant difference in performance). After each Gibbs sweep through the parse trees t we resampled each of the adaptor parameters from the posterior distribution of the parameter using a slice sampler 10 times. For example, we resample each bX from: P(bX | t) P(t | bX ) Gamma(bX | 10, 0.1) Here P(t | bX ) is the likelihood of the current sequence of sample parse trees (we only need the factors that depend on bX ) and Gamma(bX | 10, 0.1) is the prior. The same formula is used for sampling aX , except that the prior is now a flat Beta(1, 1) distribution. In general we cannot even compute the normalizing constants for these posterior distributions, so we chose a sampler that does not require this. We use a slice sampler here because it does not require a proposal distribution (Neal, 2003). (We initially tried a Metropolis-Hastings sampler but were unable to find a proposal distribution that had reasonable acceptance ratios for all of our adaptor grammars). As Table 1 makes clear, sampling the adaptor parameters makes a significant difference, especially on the collocation-syllable adaptor grammar. This is not surprising, as the adaptors in that grammar play many different roles and there is no reason to to expect the optimal values of their parameters to be similar. XN Because the Dirichlet distribution is conjugate to the multinomial distribution, it is possible to integrate out the rule probabilities , producing the "collapsed sampler" described in Johnson et al. (2007a). In our experiments we chose an uniform prior r = 1 for all rules r R. As Table 1 shows, integrating out only has a major effect on results when the adaptor hyperparameters themselves are not sampled, and even then it did not have a large effect on the collocation-syllable adaptor grammar. This is not too surprising: because the 321 Condition Table label resampling Batch initialization Sample X = bX Word token f-scores Sample average Max. Marginal Integrate out Sample aX colloc-syll · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · 0.55 0.55 0.55 0.54 0.54 0.55 0.74 0.75 0.71 0.71 0.74 0.72 0.72 0.66 0.70 0.42 0.83 0.43 0.41 0.73 0.85 0.84 0.78 0.75 0.87 0.54 0.88 0.74 0.76 0.87 0.56 0.56 0.57 0.56 0.56 0.57 0.81 0.80 0.77 0.77 0.76 0.74 0.75 0.69 0.74 0.51 0.86 0.56 0.49 0.75 0.87 0.84 0.78 0.76 0.88 0.55 0.89 0.82 0.82 0.88 Table 1: Word segmentation accuracy measured by word token f-scores on Brent's version of the Bernstein-Ratner corpus as a function of adaptor grammar, adaptor and estimation procedure. Pitman-Yor Process adaptors were used when aX was sampled, otherwise Chinese Restaurant Process adaptors were used. In runs where was not integrated out it was set uniformly, and all X = bX were set to 100 they were not sampled. 4 Inference for adaptor grammars Johnson et al. 
(2007b) describe the basic adaptor grammar inference procedure that we use here. That paper leaves unspecified a number of implementation details, which we show can make a crucial difference to segmentation accuracy. The adaptor grammar algorithm is basically a Gibbs sampler of the kind widely used for nonparametric Bayesian inference (Blei et al., 2004; Goldwater et al., 2006b; Goldwater et al., 2006a), so it seems reasonable to expect that at least some of the details discussed below will be relevant to other applications as well. The inference algorithm maintains a vector t = (t_1, ..., t_n) of sample parses, where t_i ∈ T_S is a parse for the ith sentence w_i. It repeatedly chooses a sentence w_i at random and resamples the parse tree t_i for w_i from P(t_i | t_{-i}, w_i), i.e., conditioned on w_i and the parses t_{-i} of all sentences except w_i.

4.1 Maximum marginal decoding

Sampling algorithms like ours produce a stream of samples from the posterior distribution over parses of the training data. It is standard to take the output of the algorithm to be the last sample produced, and evaluate those parses. In some other applications of nonparametric Bayesian inference involving latent structure (e.g., clustering) it is difficult to usefully exploit multiple samples, but that is not the case here. In maximum marginal decoding we map each sample parse tree t onto its corresponding word segmentation s, marginalizing out irrelevant detail in t. (For example, the collocation-syllable adaptor grammar contains a syllabification and collocational structure that is irrelevant for word segmentation.) Given a set of sample parse trees for a sentence we compute the set of corresponding word segmentations, and return the one that occurs most frequently (this is a sampling approximation to the maximum probability marginal structure). For each setting in the experiments described in Table 1 we ran 8 samplers for 2,000 iterations (i.e., passes through the training data), and kept the sample parse trees from every 10th iteration after iteration 1000, resulting in 800 sample parses for every sentence. (An examination of the posterior probabilities suggests that all of the samplers using batch initialization and table label resampling had "burnt in" by iteration 1000.)

[Figure 1: Negative log posterior probability (lower is better) as a function of iteration for 24 runs of the collocation adaptor grammar samplers with Pitman-Yor adaptors. The upper 8 runs use batch initialization but no table label resampling, the middle 8 runs use incremental initialization and table label resampling, while the lower 8 runs use batch initialization and table label resampling.]

We evaluated the word token f-score of the most frequent marginal word segmentation, and compared that to the average of the word token f-scores for the 800 samples, which is also reported in Table 1. For each grammar and setting we tried, the maximum marginal segmentation was better than the sample average, sometimes by a large margin. Given its simplicity, this suggests that maximum marginal decoding is probably worth trying when applicable; a short illustrative sketch is given below.

4.2 Batch initialization

The Gibbs sampling algorithm is initialized with a set of sample parses t for each sentence in the training data.
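(The sketch referred to in Section 4.1: a minimal illustration, ours rather than the paper's code, of maximum marginal decoding, which keeps only the most frequent word segmentation among the retained sample parses of a sentence.)

    from collections import Counter

    def maximum_marginal_segmentation(sample_parses, segmentation_of):
        # sample_parses: the retained sample parse trees for one sentence
        # (800 per sentence in the experiments above).
        # segmentation_of: hypothetical helper mapping a parse tree to its
        # word segmentation (e.g. a tuple of words), marginalizing out the
        # collocation and syllable structure.
        votes = Counter(segmentation_of(t) for t in sample_parses)
        best_segmentation, _count = votes.most_common(1)[0]
        return best_segmentation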
While the fundamental theorem of Markov Chain Monte Carlo guarantees that eventually samples will converge to the posterior distribution, it says nothing about how long the "burn in" phase might last (Robert and Casella, 2004). In practice initialization can make a huge difference to the performance of Gibbs samplers (just as it can with other unsupervised estimation procedures such as Expectation Maximization). There are many different ways in which we could generate the initial trees t; we only study two of the obvious methods here. Batch initialization assigns every sentence a random parse tree in parallel. In more detail, the initial parse tree ti for sentence wi 323 is sampled from P(t | wi , G ), where G is the PCFG obtained from the adaptor grammar by ignoring its last two components A and C (i.e., the adapted nonterminals and their adaptors), and seated at a new table. This means that in batch initialization each initial parse tree is randomly generated without any adaptation at all. Incremental initialization assigns the initial parse trees ti to sentences wi in order, updating the adaptor grammar as it goes. That is, ti is sampled from P(t | wi , t1 , . . . , ti-1 ). This is easy to do in the context of Gibbs sampling, since this distribution is a minor variant of the distribution P(ti | t-i , wi ) used during Gibbs sampling itself. Incremental initialization is greedier than batch initialization, and produces initial sample trees with much higher probability. As Table 1 shows, across all grammars and conditions after 2,000 iterations incremental initialization produces samples with much better word segmentation token f-score than does batch initialization, with the largest improvement on the unigram adaptor grammar. However, incremental initialization results in sample parses with lower posterior probability for the unigram and collocation adaptor grammars (but not for the collocation-syllable adaptor grammar). Figure 1 plots the posterior probabilities of the sample trees t at each iteration for the collocation adaptor grammar, showing that even after 2,000 iterations incremental initialization results in trees that are much less likely than those produced by batch initialization. It seems that with incremental initialization the Gibbs sampler gets stuck in a local optimum which it is extremely unlikely to move away from. It is interesting that incremental initialization results in more accurate word segmentation, even though the trees it produces have lower posterior probability. This seems to be because the most probable analyses produced by the unigram and, to a lesser extent, the collocation adaptor grammars tend to undersegment. Incremental initialization greedily searches for common substrings, and because such substrings are more likely to be short rather than long, it tends to produce analyses with shorter words than batch initialization does. Goldwater et al. (2006a) show that Brent's incremental segmentation algorithm (Brent, 1999) has a similar property. We favor batch initialization because we are in- terested in understanding the properties of our models (expressed here as adaptor grammars), and batch initialization does a better job of finding the most probable analyses under these models. However, it might be possible to justify incremental initialization as (say) cognitively more plausible. 
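A schematic sketch (ours; the sampler callbacks are hypothetical stand-ins for the inference code) of the two initialization strategies just described.

    def batch_initialize(sentences, sample_from_base_pcfg):
        # Every sentence independently gets a random parse from the PCFG
        # obtained by ignoring the adaptors, and is seated at a new table.
        return [sample_from_base_pcfg(w) for w in sentences]

    def incremental_initialize(sentences, sample_given_previous):
        # Sentences are initialized in order; each initial parse is sampled
        # conditioned on the parses (and hence adaptor state) of all earlier
        # sentences, so common substrings found early tend to be reused.
        trees = []
        for w in sentences:
            trees.append(sample_given_previous(w, trees))  # P(t | w_i, t_1..t_{i-1})
        return trees

The greedier incremental scheme corresponds to the higher-f-score, lower-posterior-probability runs discussed above.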
4.3 Table label resampling Unlike the previous two implementation choices which apply to a broad range of algorithms, table label resampling is a specialized kind of Gibbs step for adaptor grammars and similar hierarchical models that is designed to improve mobility. The adaptor grammar algorithm described in Johnson et al. (2007b) repeatedly resamples parses for the sentences of the training data. However, the adaptor grammar sampler itself maintains of a hierarchy of Chinese Restaurant Processes or Pitman-Yor Processes, one per adapted nonterminal X A, that cache subtrees from TX . In general each of these subtrees will occur many times in the parses for the training data sentences. Table label resampling resamples the trees in these adaptors (i.e., the table labels, to use the restaurant metaphor), potentially changing the analysis of many sentences at once. For example, each Collocation in the collocation adaptor grammar can occur in many Sentences, and each Word can occur in many Collocations. Resampling a single Collocation can change the way it is analysed into Words, thus changing the analysis of all of the Sentences containing that Collocation. Table label resampling is an additional resampling step performed after each Gibbs sweep through the training data in which we resample the parse trees labeling the tables in the adaptor for each X A. Specifically, if the adaptor CX for X A currently contains m tables labeled with the trees t = (t1 , . . . , tm ) then table label resampling replaces each tj , j 1, . . . , m in turn with a tree sampled from P(t | t-j , wj ), where wj is the terminal yield of tj . (Within each adaptor we actually resample all of the trees t in a randomly chosen order). Table label resampling is a kind of Gibbs sweep, but at a higher level in the Bayesian hierarchy than the standard Gibbs sweep. It's easy to show that table label resampling preserves detailed balance for the adaptor grammars presented in this paper, so interposing table label resampling steps with the standard Gibbs steps also preserves detailed balance. 324 We expect table label resampling to have the greatest impact on models with a rich hierarchical structure, and the experimental results in Table 1 confirm this. The unigram adaptor grammar does not involve nested adapted nonterminals, so we would not expect table label resampling to have any effect on its analyses. On the other hand, the collocation-syllable adaptor grammar involves a rich hierarchical structure, and in fact without table label resampling our sampler did not burn in or mix within 2,000 iterations. As Figure 1 shows, table label resampling produces parses with higher posterior probability, and Table 1 shows that table label resampling makes a significant difference in the word segmentation f-score of the collocation and collocation-syllable adaptor grammars. 5 Conclusion This paper has examined adaptor grammar inference procedures and their effect on the word segmentation problem. Some of the techniques investigated here, such as batch versus incremental initialization, are quite general and may be applicable to a wide range of other algorithms, but some of the other techniques, such as table label resampling, are specialized to nonparametric hierarchical Bayesian inference. 
We've shown that sampling adaptor hyperparameters is feasible, and demonstrated that this improves word segmentation accuracy of the collocation-syllable adaptor grammar by almost 10%, corresponding to an error reduction of over 35% compared to the best results presented in Johnson (2008). We also described and investigated table label resampling, which dramatically improves the effectiveness of Gibbs sampling estimators for complex adaptor grammars, and makes it possible to work with adaptor grammars with complex hierarchical structure. Acknowledgments We thank Erik Sudderth for suggesting sampling the Pitman-Yor hyperparameters and the ACL reviewers for their insightful comments. This research was funded by NSF awards 0544127 and 0631667 to Mark Johnson. References N. Bernstein-Ratner. 1987. The phonology of parentchild speech. In K. Nelson and A. van Kleeck, editors, Children's Language, volume 6. Erlbaum, Hillsdale, NJ. Daniel Blanchard and Jeffrey Heinz. 2008. Improving word segmentation by simultaneously learning phonotactics. In CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning, pages 65­72, Manchester, England, August. David Blei, Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. 2004. Hierarchical topic models and the nested chinese restaurant process. In Sebastian Thrun, Lawrence Saul, and Bernhard Sch¨ lkopf, editors, Advances in Neural Information o Processing Systems 16. MIT Press, Cambridge, MA. Rens Bod. 1998. Beyond grammar: an experience-based theory of language. CSLI Publications, Stanford, California. M. Brent and T. Cartwright. 1996. Distributional regularity and phonotactic constraints are useful for segmentation. Cognition, 61:93­125. M. Brent. 1999. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34:71­105. Jeffrey Elman. 1990. Finding structure in time. Cognitive Science, 14:197­211. Margaret M. Fleck. 2008. Lexicalized phonotactic word segmentation. In Proceedings of ACL-08: HLT, pages 130­138, Columbus, Ohio, June. Association for Computational Linguistics. Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2006a. Contextual dependencies in unsupervised word segmentation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 673­680, Sydney, Australia. Association for Computational Linguistics. Sharon Goldwater, Tom Griffiths, and Mark Johnson. 2006b. Interpolating between types and tokens by estimating power-law generators. In Y. Weiss, B. Sch¨ lkopf, and J. Platt, editors, Advances in Neural o Information Processing Systems 18, pages 459­466, Cambridge, MA. MIT Press. Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2007. Distributional cues to word boundaries: Context is important. In David Bamman, Tatiana Magnitskaia, and Colleen Zaller, editors, Proceedings of the 31st Annual Boston University Conference on Language Development, pages 239­250, Somerville, MA. Cascadilla Press. H. Ishwaran and L. F. James. 2003. Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, 13:1211­ 1235. Mark Johnson, Thomas Griffiths, and Sharon Goldwater. 2007a. Bayesian inference for PCFGs via Markov chain Monte Carlo. 
In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 139­146, Rochester, New York, April. Association for Computational Linguistics. Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007b. Adaptor Grammars: A framework for specifying compositional nonparametric Bayesian models. In B. Sch¨ lkopf, J. Platt, and T. Hoffman, edo itors, Advances in Neural Information Processing Systems 19, pages 641­648. MIT Press, Cambridge, MA. Mark Johnson. 2008. Using adaptor grammars to identifying synergies in the unsupervised acquisition of linguistic structure. In Proceedings of the 46th Annual Meeting of the Association of Computational Linguistics, Columbus, Ohio. Association for Computational Linguistics. Aravind Joshi. 2003. Tree adjoining grammars. In Ruslan Mikkov, editor, The Oxford Handbook of Computational Linguistics, pages 483­501. Oxford University Press, Oxford, England. David J.C. MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press. Radford M. Neal. 2003. Slice sampling. Annals of Statistics, 31:705­767. J. Pitman and M. Yor. 1997. The two-parameter PoissonDirichlet distribution derived from a stable subordinator. Annals of Probability, 25:855­900. J. Pitman. 1995. Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields, 102:145­158. Christian P. Robert and George Casella. 2004. Monte Carlo Statistical Methods. Springer. Andreas Stolcke and Stephen Omohundro. 1994. Inducing probabilistic grammars by Bayesian model merging. In Rafael C. Carrasco and Jose Oncina, editors, Grammatical Inference and Applications, pages 106­ 118. Springer, New York. Y. W. Teh, M. Jordan, M. Beal, and D. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566­1581. 325 Joint Parsing and Named Entity Recognition Jenny Rose Finkel and Christopher D. Manning Computer Science Department Stanford University Stanford, CA 94305 {jrfinkel|manning}@cs.stanford.edu Abstract For many language technology applications, such as question answering, the overall system runs several independent processors over the data (such as a named entity recognizer, a coreference system, and a parser). This easily results in inconsistent annotations, which are harmful to the performance of the aggregate system. We begin to address this problem with a joint model of parsing and named entity recognition, based on a discriminative feature-based constituency parser. Our model produces a consistent output, where the named entity spans do not conflict with the phrasal spans of the parse tree. The joint representation also allows the information from each type of annotation to improve performance on the other, and, in experiments with the OntoNotes corpus, we found improvements of up to 1.36% absolute F1 for parsing, and up to 9.0% F1 for named entity recognition. fortunately, it is still common practice to cobble together independent systems for the various types of annotation, and there is no guarantee that their outputs will be consistent. This paper begins to address this problem by building a joint model of both parsing and named entity recognition. 
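As a rough illustration of the tree augmentation described in the Figure 1 caption above, the following sketch (ours; the tree class and label encoding are simplifying assumptions) groups the contiguous children covered by an entity under a new NamedEntity node and augments the labels beneath it with the entity type.

    class Node:
        def __init__(self, label, children=None):
            self.label = label
            self.children = children if children is not None else []  # leaves are words

    def add_named_entity(parent, start, end, entity_type):
        # Insert a NamedEntity-<type> node (marked * as the entity root) over
        # parent.children[start:end], then append the entity type to every
        # nonterminal label underneath it, as in the NP-GPE / NNP-GPE labels
        # of Figure 1.
        span = parent.children[start:end]
        entity_root = Node("NamedEntity-%s*" % entity_type, span)
        parent.children[start:end] = [entity_root]
        stack = list(span)
        while stack:
            node = stack.pop()
            if node.children:              # leave the word leaves themselves unchanged
                node.label += "-" + entity_type
                stack.extend(node.children)
        return entity_root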
Vapnik has observed (Vapnik, 1998; Ng and Jordan, 2002) that "one should solve the problem directly and never solve a more general problem as an intermediate step," implying that building a joint model of two phenomena is more likely to harm performance on the individual tasks than to help it. Indeed, it has proven very difficult to build a joint model of parsing and semantic role labeling, either with PCFG trees (Sutton and McCallum, 2005) or with dependency trees. The CoNLL 2008 shared task (Surdeanu et al., 2008) was intended to be about joint dependency parsing and semantic role labeling, but the top performing systems decoupled the tasks and outperformed the systems which attempted to learn them jointly. Despite these earlier results, we found that combining parsing and named entity recognition modestly improved performance on both tasks. Our joint model produces an output which has consistent parse structure and named entity spans, and does a better job at both tasks than separate models with the same features. We first present the joint, discriminative model that we use, which is a feature-based CRF-CFG parser operating over tree structures augmented with NER information. We then discuss in detail how we make use of the recently developed OntoNotes corpus both for training and testing the model, and then finally present the performance of the model and some discussion of what causes its superior performance, and how the model relates to prior work. 1 Introduction In order to build high quality systems for complex NLP tasks, such as question answering and textual entailment, it is essential to first have high quality systems for lower level tasks. A good (deep analysis) question answering system requires the data to first be annotated with several types of information: parse trees, named entities, word sense disambiguation, etc. However, having high performing, lowlevel systems is not enough; the assertions of the various levels of annotation must be consistent with one another. When a named entity span has crossing brackets with the spans in the parse tree it is usually impossible to effectively combine these pieces of information, and system performance suffers. But, un326 Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 326­334, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics NP NP DT NP NNP IN PP NP NNP the [District of Columbia] GPE the District of = DT NamedEntity-GPE* NP-GPE NNP-GPE PP-GPE IN-GPE NP-GPE NNP-GPE Columbia Figure 1: An example of a (sub)tree which is modified for input to our learning algorithm. Starting from the normalized tree discussed in section 4.1, a new NamedEntity node is added, so that the named entity corresponds to a single phrasal node. That node, and its descendents, have their labels augmented with the type of named entity. The * on the NamedEntity node indicates that it is the root of the named entity. 2 The Joint Model When constructing a joint model of parsing and named entity recognition, it makes sense to think about how the two distinct levels of annotation may help one another. Ideally, a named entity should correspond to a phrase in the constituency tree. However, parse trees will occasionally lack some explicit structure, such as with right branching NPs. In these cases, a named entity may correspond to a contiguous set of children within a subtree of the entire parse. 
The one thing that should never happen is for a named entity span to have crossing brackets with any spans in the parse tree. For named entities, the joint model should help with boundaries. The internal structure of the named entity, and the structural context in which it appears, can also help with determining the type of entity. Finding the best parse for a sentence can be helped by the named entity information in similar ways. Because named entities should correspond to phrases, information about them should lead to better bracketing. Also, knowing that a phrase is a named entity, and the type of entity, may help in getting the structural context, and internal structure, of that entity correct. 2.1 Joint Representation phrasal node in the entire tree. We then augment the labels of the phrasal node and its descendents with the type of named entity. We also distinguish between the root node of an entity, and the descendent nodes. See Figure 1 for an illustration. This representation has several benefits, outlined below. 2.1.1 Nested Entities After modifying the OntoNotes dataset to ensure consistency, which we will discuss in Section 4, we augment the parse tree with named entity information, for input to our learning algorithm. In the cases where a named entity corresponds to multiple contiguous children of a subtree, we add a new NamedEntity node, which is the new parent to those children. Now, all named entities correspond to a single 327 The OntoNotes data does not contain any nested entities. Consider the named entity portions of the rules seen in the training data. These will look, for instance, like none none person, and organization organization organization. Because we only allow named entity derivations which we have seen in the data, nested entities are impossible. However, there is clear benefit in a representation allowing nested entities. For example, it would be beneficial to recognize that the United States Supreme Court is a an organization, but that it also contains a nested GPE.1 Fortunately, if we encounter data which has been annotated with nested entities, this representation will be able to handle them in a natural way. In the given example, we would have a derivation which includes organization GPE organization. This information will be helpful for correctly labeling nested entities such as New Jersey Supreme Court, because the model will learn how nested entities tend to decompose. 2.1.2 Feature Representation for Named Entities Currently, named entity recognizers are usually constructed using sequence models, with linear chain far as we know, GENIA (Kim et al., 2003) is the only corpus currently annotated with nested entities. 1 As conditional random fields (CRFs) being the most common. While it is possible for CRFs to have links that are longer distance than just between adjacent words, most of the benefit is from local features, over the words and labels themselves, and from features over adjacent pairs of words and labels. Our joint representation allows us to port both types of features from such a named entity recognizer. The local features can be computed at the same time the features over parts of speech are computed. These are the leaves of the tree, when only the named entity for the current word is known. 2 The pairwise features, over adjacent labels, are computed at the same time as features over binary rules. Binarization of the tree is necessary for efficient computation, so the trees consist solely of unary and binary productions. 
Because of this, for all pairs of adjacent words within an entity, there will be a binary rule applied where one word will be under the left child and the other word will be under the right child. Therefore, we compute features over adjacent words/labels when computing the features for the binary rule which joins them. 2.2 Learning the Joint Model rules which only occur in the training data augmented with named entity information, and because of rules which only occur without the named entity information. To combat this problem, we added extra rules, unseen in the training data. 2.3.1 Augmenting the Grammar For every rule encountered in the training data which has been augmented with named entity information, we add extra copies of that rule to the grammar. We add one copy with all of the named entity information stripped away, and another copy for each other entity type, where the named entity augmentation has been changed to the other entity type. These additions help, but they are not sufficient. Most entities correspond to noun phrases, so we took all rules which had an NP as a child, and made copies of that rule where the NP was augmented with each possible entity type. These grammar additions sufficed to improve overall performance. 2.3.2 Augmenting the Lexicon The lexicon is augmented in a similar manner to the rules. For every part of speech tag seen with a named entity annotation, we also add that tag with no named entity information, and a version which has been augmented with each type of named entity. It would be computationally infeasible to allow any word to have any part of speech tag. We therefore limit the allowed part of speech tags for common words based on the tags they have been observed with in the training data. We also augment each word with a distributional similarity tag, which we discuss in greater depth in Section 3, and allow tags seen with other words which belong to the same distributional similarity cluster. When deciding what tags are allowed for each word, we initially ignore named entity information. Once we determine what base tags are allowed for a word, we also allow that tag, augmented with any type of named entity, if the augmented tag is present in the lexicon. We construct our joint model as an extension to the discriminatively trained, feature-rich, conditional random field-based, CRF-CFG parser of (Finkel and Manning, 2008). Their parser is similar to a chartbased PCFG parser, except that instead of putting probabilities over rules, it puts clique potentials over local subtrees. These unnormalized potentials know what span (and split) the rule is over, and arbitrary features can be defined over the local subtree, the span/split and the words of the sentence. The insideoutside algorithm is run over the clique potentials to produce the partial derivatives and normalizing constant which are necessary for optimizing the log likelihood. 2.3 Grammar Smoothing Because of the addition of named entity annotations to grammar rules, if we use the grammar as read off the treebank, we will encounter problems with sparseness which severely degrade performance. This degradation occurs because of CFG that features can include information about other words, because the entire sentence is observed. The features cannot include information about the labels of those words. 2 Note 3 Features We defined features over both the parse rules and the named entities. Most of our features are over one or the other aspects of the structure, but not both. 
Both the named entity and parsing features utilize the words of the sentence, as well as orthographic and distributional similarity information. For each word we computed a word shape which encoded 328 information about capitalization, length, and inclusion of numbers and other non-alphabetic characters. For the distributional similarity information, we had to first train a distributional similarity model. We trained the model described in (Clark, 2000), with code downloaded from his website, on several hundred million words from the British national corpus, and the English Gigaword corpus. The model we trained had 200 clusters, and we used it to assign each word in the training and test data to one of the clusters. For the named entity features, we used a fairly standard feature set, similar to those described in (Finkel et al., 2005). For parse features, we used the exact same features as described in (Finkel and Manning, 2008). When computing those features, we removed all of the named entity information from the rules, so that these features were just over the parse information and not at all over the named entity information. Lastly, we have the joint features. We included as features each augmented rule and each augmented label. This allowed the model to learn that certain types of phrasal nodes, such as NPs are more likely to be named entities, and that certain entities were more likely to occur in certain contexts and have particular types of internal structure. not surprising that we found places where the data was inconsistently annotated, namely with crossing brackets between named entity and tree annotations. In the places where we found inconsistent annotation it was rarely the case that the different levels of annotation were inherently inconsistent, but rather inconsistency results from somewhat arbitrary choices made by the annotators. For example, when the last word in a sentence ends with a period, such as Corp., one period functions both to mark the abbreviation and the end of the sentence. The convention of the Penn Treebank is to separate the final period and treat it as the end of sentence marker, but when the final word is also part of an entity, that final period was frequently included in the named entity annotation, resulting in the sentence terminating period being part of the entity, and the entity not corresponding to a single phrase. See Figure 2 for an illustration from the data. In this case, we removed the terminating period from the entity, to produce a consistent annotation. Overall, we found that 656 entities, out of 55,665 total, could not be aligned to a phrase, or multiple contiguous children of a node. We identified and corrected the following sources of inconsistencies: Periods and abbreviations. This is the problem described above with the Corp. example. We corrected it by removing the sentence terminating final period from the entity annotation. Determiners and PPs. Noun phrases composed of a nested noun phrase and a prepositional phrase were problematic when they also consisted of a determiner followed by an entity. We dealt with this by flattening the nested NP, as illustrated in Figure 3. As we discussed in Section 2.1, this tree will then be augmented with an additional node for the entity (see Figure 1). Adjectives and PPs. This problem is similar to the previous problem, with the difference being that there are also adjectives preceding the entity. The solution is also similar to the solution to the previous problem. 
We moved the adjectives from the nested NP into the main NP. These three modifications to the data solved most, but not all, of the inconsistencies. Another source of problems was conjunctions, such as North and South Korea, where North and South are a phrase, but South Korea is an entity. The rest of the errors seemed to be due to annotation errors and other random weirdnesses. We ended up unable to make 0.4% of the entities consistent with the parses, so we omitted those entities from the training and test data.

[Figure 2: An example from the data of inconsistently labeled named entity and parse structure. The inclusion of the final period in the named entity results in the named entity structure having crossing brackets with the parse structure.]

[Figure 3: (a) Another example from the data of inconsistently labeled named entity and parse structure. In this instance, we flatten the nested NP, resulting in (b), so that the named entity corresponds to a contiguous set of children of the top-level NP.]

One more change we made to the data was with respect to possessive NPs. When we encountered noun phrases which ended with (POS 's) or (POS '), we modified the internal structure of the NP. Originally, these NPs were flat, but we introduced a new nested NP which contained the entire contents of the original NP except for the POS. The original NP label was then changed to PossNP. This change is motivated by the status of 's as a phrasal affix or clitic: it is the NP preceding 's that is structurally equivalent to other NPs, not the larger unit that includes 's. This change has the additional benefit in this context that more named entities will correspond to a single phrase in the parse tree, rather than a contiguous set of phrases.

4.2 Named Entity Types

The data has been annotated with eighteen types of entities. Many of these entity types do not occur very often, and coupled with the relatively small amount of data, make it difficult to learn accurate entity models. Examples are work of art, product, and law. Early experiments showed that it was difficult for even our baseline named entity recognizer, based on a state-of-the-art CRF, to learn these types of entities. As a result, we decided to merge all but the three most dominant entity types into one general entity type called misc.
The result was four distinct entity types: person, organization, GPE (geo-political entity, such as a city or a country), and misc. (The difficulties were compounded by somewhat inconsistent and occasionally questionable annotations. For example, the word today was usually labeled as a date, but about 10% of the time it was not labeled as anything. We also found several strange work of arts, including Stanley Cup and the U.S.S. Cole.)

         Training            Testing
         Range     # Sent.   Range     # Sent.
ABC      0–55      1195      56–69     199
CNN      0–375     5092      376–437   1521
MNB      0–17      509       18–25     245
NBC      0–29      552       30–39     149
PRI      0–89      1707      90–112    394
VOA      0–198     1512      199–264   383

Table 1: Training and test set sizes for the six datasets in sentences. The file ranges refer to the numbers within the names of the original OntoNotes files.

5 Experiments

We ran our model on six of the OntoNotes datasets described in Section 4, using sentences of length 40 and under (approximately 200,000 annotated English words, considerably smaller than the Penn Treebank (Marcus et al., 1993)). (These datasets all consistently use the new conventions for treebank annotation, while the seventh WSJ portion is currently still annotated in the original 1990s style, and so we left the WSJ portion aside.) For each dataset, we aimed for roughly a 75% train / 25% test split. See Table 1 for the files used to train and test, along with the number of sentences in each. For comparison, we also trained the parser without the named entity information (and omitted the NamedEntity nodes), and a linear chain CRF using just the named entity information. Both the baseline parser and CRF were trained using the exact same features as the joint model, and all were optimized using stochastic gradient descent. The full results can be found in Table 2. Parse trees were scored using evalB (the extra NamedEntity nodes were ignored when computing evalB for the joint model), and named entities were scored using entity F-measure (as in the CoNLL 2003 conlleval). (Sometimes the parser would be unable to parse a sentence, less than 2% of sentences, due to restrictions in part of speech tags. Because the underlying grammar, ignoring the additional named entity information, was the same for both the joint and baseline parsers, whenever a sentence is unparsable by either the baseline or joint parser it is in fact unparsable by both of them, and would affect the parse scores of both models equally. However, the CRF is able to named entity tag any sentence, so these unparsable sentences had an effect on the named entity score. To combat this, we fell back on the baseline CRF model to get named entity tags for unparsable sentences.) While the main benefit of our joint model is the ability to get a consistent output over both types of annotations, we also found that modeling the parse and named entities jointly resulted in improved performance on both. When looking at these numbers, it is important to keep in mind that the sizes of the training and test sets are significantly smaller than the Penn Treebank. The largest of the six datasets, CNN, has about one seventh the amount of training data as the Penn Treebank, and the smallest, MNB, has around 500 sentences from which to train. Parse performance was improved by the joint model for five of the six datasets, by up to 1.36%.
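A simplified sketch (ours, not the conlleval script itself) of the entity F-measure used above: a predicted entity counts as correct only when both its boundaries and its type match a gold entity exactly. (The macro-averaging mentioned in the Table 2 caption is omitted here.)

    def entity_f1(gold_entities, predicted_entities):
        # Each entity is a (start, end, entity_type) triple over token positions.
        gold, pred = set(gold_entities), set(predicted_entities)
        true_positives = len(gold & pred)
        precision = true_positives / len(pred) if pred else 0.0
        recall = true_positives / len(gold) if gold else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    print(entity_f1({(0, 2, "PER"), (5, 7, "ORG")}, {(0, 2, "PER"), (5, 6, "ORG")}))  # 0.5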
Looking at the parsing improvements on a per-label basis, the largest gains came from improved identification of NML constituents, from an F-score of 45.9% to 57.0% (on all the data combined, for a total of 420 NML constituents). This label was added in the new treebank annotation conventions, so as to identify internal left-branching structure inside previously flat NPs. To our surprise, performance on NPs only increased by 1%, though over 12,949 constituents, for the largest improvement in absolute terms. The second largest gain was on PPs, where we improved by 1.7% over 3,775 constituents. We tested the significance of our results (on all the data combined) using Dan Bikel's randomized parsing evaluation comparator (available at http://www.cis.upenn.edu/~dbikel/software.html) and found that both the precision and recall gains were significant at p ≤ 0.01. Much greater improvements in performance were seen on named entity recognition, where most of the domains saw improvements in the range of 3–4%, with performance on the VOA data improving by nearly 9%, which is a 45% reduction in error. There was no clear trend in terms of precision versus recall, or the different entity types. The first place to look for improvements is with the boundaries for named entities. Once again looking at all of the data combined, in the baseline model there were 203 entities where part of the entity was found, but one or both boundaries were incorrectly identified. The joint model corrected 72 of those entities, while incorrectly identifying the boundaries of 37 entities which had previously been correctly identified. In the baseline NER model, there were 243 entities for which the boundaries were correctly identified, but the type of entity was incorrect. The joint model corrected 80 of them, while changing the labels of 39 entities which had previously been correctly identified. Additionally, 190 entities were found which the baseline model had missed entirely, and 68 enti-
The output from the individual models is shown in part (a), with the output from the named entity recognizer shown in brackets on the words at the leaves of the parse. The output from the joint model is shown in part (b), with the named entity information encoded within the parse. In this example, the named entity Egyptian Islamic Jihad helped the parser to get its surrounding context correct, because it is improbable to attach a PP headed by with to an organization. At the same time, the surrounding context helped the joint model correctly identify Egyptian Islamic Jihad as an organization and not a person. The baseline parser also incorrectly added an extra level of structure to the person name Osama Bin Laden, while the joint model found the correct structure.

[Figure 4: An example for which the joint model helped with both parse structure and named entity recognition. The individual models (a) incorrectly attach the PP, label Egyptian Islamic Jihad as a person, and incorrectly add extra internal structure to Osama Bin Laden. The joint model (b) gets both the structure and the named entity correct.]

6 Related Work

A pioneering antecedent for our work is (Miller et al., 2000), who trained a Collins-style generative parser (Collins, 1997) over a syntactic structure augmented with the template entity and template relations annotations for the MUC-7 shared task. Their sentence augmentations were similar to ours, but they did not make use of features, due to the generative nature of their model. This approach was not followed up on in other work, presumably because around this time nearly all the activity in named entity and relation extraction moved to the use of discriminative sequence models, which allowed the flexible specification of feature templates that are very useful for these tasks. The present model is able to bring together both these lines of work, by integrating the strengths of both approaches.

There have been other attempts in NLP to jointly model multiple levels of structure, with varying degrees of success. Most work on joint parsing and semantic role labeling (SRL) has been disappointing, despite obvious connections between the two tasks. Sutton and McCallum (2005) attempted to jointly model PCFG parsing and SRL for the CoNLL 2005 shared task, but were unable to improve performance on either task. The CoNLL 2008 shared task (Surdeanu et al., 2008) was joint dependency parsing and SRL, but the top performing systems decoupled the tasks, rather than building joint models. Zhang and Clark (2008) successfully built a joint model of Chinese word segmentation and parts of speech using a single perceptron.

An alternative approach to joint modeling is to take a pipelined approach. Previous work on linguistic annotation pipelines (Finkel et al., 2006; Hollingshead and Roark, 2007) has enforced consistency from one stage to the next. However, these models are only used at test time; training of the components is still independent. These models also have the potential to suffer from search errors and are not guaranteed to find the optimal output.
7 Conclusion

We presented a discriminatively trained joint model of parsing and named entity recognition, which improved performance on both tasks. Our model is based on a discriminative constituency parser, with the data, grammar, and features carefully constructed for the joint task. In the future, we would like to add other levels of annotation available in the OntoNotes corpus to our model, including word sense disambiguation and semantic role labeling.

Acknowledgements

The first author is supported by a Stanford Graduate Fellowship. This paper is based on work funded in part by the Defense Advanced Research Projects Agency through IBM. The content does not necessarily reflect the views of the U.S. Government, and no official endorsement should be inferred. We also wish to thank the creators of OntoNotes, without which this project would not have been possible.

References

Alexander Clark. 2000. Inducing syntactic categories by context distribution clustering. In Proc. of Conference on Computational Natural Language Learning, pages 91-94, Lisbon, Portugal.
Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In ACL 1997.
Jenny Rose Finkel and Christopher D. Manning. 2008. Efficient, feature-based conditional random field parsing. In ACL/HLT-2008.
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL 2005.
Jenny Rose Finkel, Christopher D. Manning, and Andrew Y. Ng. 2006. Solving the problem of cascading errors: Approximate Bayesian inference for linguistic annotation pipelines. In EMNLP 2006.
Kristy Hollingshead and Brian Roark. 2007. Pipeline iteration. In ACL 2007.
Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: The 90% solution. In HLT-NAACL 2006.
Jin-Dong Kim, Tomoko Ohta, Yuka Teteisi, and Jun'ichi Tsujii. 2003. GENIA corpus: a semantically annotated corpus for bio-textmining. Bioinformatics, 19(suppl. 1):i180-i182.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.
Scott Miller, Heidi Fox, Lance Ramshaw, and Ralph Weischedel. 2000. A novel use of statistical parsing to extract information from text. In 6th Applied Natural Language Processing Conference, pages 226-233.
Andrew Ng and Michael Jordan. 2002. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems (NIPS).
Sameer S. Pradhan, Lance Ramshaw, Ralph Weischedel, Jessica MacBride, and Linnea Micciulla. 2007. Unrestricted coreference: Identifying entities and events in OntoNotes. International Conference on Semantic Computing, 0:446-453.
Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluís Màrquez, and Joakim Nivre. 2008. The CoNLL-2008 shared task on joint parsing of syntactic and semantic dependencies. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL), Manchester, UK.
Charles Sutton and Andrew McCallum. 2005. Joint parsing and semantic role labeling. In Conference on Natural Language Learning (CoNLL).
V. N. Vapnik. 1998. Statistical Learning Theory. John Wiley & Sons.
Liang-Chih Yu, Chung-Hsien Wu, and Eduard Hovy. 2008. OntoNotes: Corpus cleanup of mistaken agreement using word sense disambiguation. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 1057-1064.
Yue Zhang and Stephen Clark. 2008. Joint word segmentation and POS tagging using a single perceptron. In ACL 2008.

Minimal-length linearizations for mildly context-sensitive dependency trees

Y. Albert Park (Department of Computer Science and Engineering, 9500 Gilman Drive, La Jolla, CA 92037-404, USA; yapark@ucsd.edu) and Roger Levy (Department of Linguistics, 9500 Gilman Drive, La Jolla, CA 92037-108, USA; rlevy@ling.ucsd.edu)

Abstract

The extent to which the organization of natural language grammars reflects a drive to minimize dependency length remains little explored. We present the first algorithm polynomial-time in sentence length for obtaining the minimal-length linearization of a dependency tree subject to constraints of mild context sensitivity. For the minimally context-sensitive case of gap-degree 1 dependency trees, we prove several properties of minimal-length linearizations which allow us to improve the efficiency of our algorithm to the point that it can be used on most naturally-occurring sentences. We use the algorithm to compare optimal, observed, and random sentence dependency length for both surface and deep dependencies in English and German. We find in both languages that analyses of surface and deep dependencies yield highly similar results, and that mild context-sensitivity affords very little reduction in minimal dependency length over fully projective linearizations; but that observed linearizations in German are much closer to random and farther from minimal-length linearizations than in English.

1 Introduction

This paper takes up the relationship between two hallmarks of natural language dependency structure. First, there seem to be qualitative constraints on the relationship between the dependency structure of the words in a sentence and their linear ordering.
In particular, this relationship seems to be such that any natural language sentence, together with its dependency structure, should be generable by a mildly context-sensitive formalism (Joshi, 1985), in particular a linear context-free rewrite system in which the right-hand side of each rule has a distinguished head (Pollard, 1984; Vijay-Shanker et al., 1987; Kuhlmann, 2007). This condition places strong constraints on the linear contiguity of word-word dependency relations, such that only limited classes of crossing context-free dependency structures may be admitted. The second constraint is a softer preference for words in a dependency relation to occur in close proximity to one another. This constraint is perhaps best documented in psycholinguistic work suggesting that large distances between governors and dependents induce processing difficulty in both comprehension and production (Hawkins, 1994, 2004; Gibson, 1998; Jaeger, 2006).

Intuitively there is a relationship between these two constraints: consistently large dependency distances in a sentence would require many crossing dependencies. However, it is not the case that crossing dependencies always mean longer dependency distances. For example, (1) below has no crossing dependencies, but the distance between arrived and its dependent Yesterday is large. The overall dependency length of the sentence can be reduced by extraposing the relative clause who was wearing a hat, resulting in (2), in which the dependency Yesterday-arrived crosses the dependency woman-who.

(1) Yesterday a woman who was wearing a hat arrived.
(2) Yesterday a woman arrived who was wearing a hat.

There has been some recent work on dependency length minimization in natural language sentences (Gildea and Temperley, 2007), but the relationship between the precise constraints on available linearizations and dependency length minimization remains little explored. In this paper, we introduce the first efficient algorithm for obtaining linearizations of dependency trees that minimize overall dependency lengths subject to the constraint of mild context-sensitivity, and use it to investigate the relationship between this constraint and the distribution of dependency length actually observed in natural languages.

[Figure 1: Sample dependency subtree for Figure 2.]

2 Projective and mildly non-projective dependency-tree linearizations

In the last few years there has been a resurgence of interest in computation on dependency-tree structures for natural language sentences, spurred by work such as McDonald et al. (2005a,b) showing that working with dependency-tree syntactic representations, in which each word in the sentence corresponds to a node in the dependency tree (and vice versa), can lead to algorithmic benefits over constituency-structure representations. The linearization of a dependency tree is simply the linear order in which the nodes of the tree occur in a surface string. There is a broad division between two classes of linearizations: projective linearizations that do not lead to any crossing dependencies in the tree, and non-projective linearizations that involve at least one crossing dependency pair. Example (1), for example, is projective, whereas Example (2) is non-projective due to the crossing between the Yesterday-arrived and woman-who dependencies.

Beyond this dichotomy, however, the homomorphism from headed tree structures to dependency structures (Miller, 2000) can be used together with work on the mildly context-sensitive formalism of linear context-free rewrite systems (LCFRSs) (Vijay-Shanker et al., 1987) to characterize various classes of mildly non-projective dependency-tree linearizations (Kuhlmann and Nivre, 2006). The LCFRSs are an infinite sequence of classes of formalisms for generating surface strings through derivation trees in a rule-based context-free rewriting system. The i-th LCFRS class (for i = 0, 1, 2, ...) imposes the constraint that every node in the derivation tree maps to a collection of at most i+1 contiguous substrings. The 0-th class of LCFRS, for example, corresponds to the context-free grammars, since each node in the derivation tree must map to a single contiguous substring; the 1st class of LCFRS corresponds to Tree-Adjoining Grammars (Joshi et al., 1975), in which each node in the derivation tree must map to at most a pair of contiguous substrings; and so forth. The dependency trees induced when each rewrite rule in an i-th order LCFRS distinguishes a unique head can similarly be characterized as being of gap-degree i, so that i is the maximum number of gaps that may appear between contiguous substrings of any subtree in the dependency tree (Kuhlmann and Möhl, 2007). The dependency tree for Example (2), for example, is of gap-degree 1.
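Gap degree is straightforward to compute from a dependency tree: take each subtree's set of word positions and count the breaks in it. The sketch below is our own illustration (the head-index encoding and the simplified analysis of example (2) are assumptions made for the example), not code from the paper.

```python
from collections import defaultdict

def gap_degrees(heads):
    """Gap degree of each word's subtree, given 1-based head indices
    (heads[i-1] = 0 marks the root).  A tree is projective iff every
    subtree has gap degree 0; gap-degree 1 allows one gap per subtree."""
    children = defaultdict(list)
    for i, h in enumerate(heads, start=1):
        if h:
            children[h].append(i)

    def projection(i):
        words = {i}
        for c in children[i]:
            words |= projection(c)
        return words

    degrees = {}
    for i in range(1, len(heads) + 1):
        positions = sorted(projection(i))
        degrees[i] = sum(1 for a, b in zip(positions, positions[1:]) if b > a + 1)
    return degrees

# "Yesterday a woman arrived who was wearing a hat" (example (2)), with
# arrived as root, woman heading the extraposed relative clause via wearing.
heads = [4, 3, 4, 0, 7, 7, 3, 9, 7]
print(max(gap_degrees(heads).values()))  # 1: the subtree of 'woman' has one gap
```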
Although there are numerous documented cases in which projectivity is violated in natural language, there are exceedingly few documented cases in which the documented gap degree exceeds 1 (though see, for example, Kobele, 2006).

3 Finding minimal dependency-length linearizations

Even under the strongest constraint of projectivity, the number of possible linearizations of a dependency tree is exponential in both sentence length and arity (the maximum number of dependencies for any word). As pointed out by Gildea and Temperley (2007), however, finding the unconstrained minimal-length linearization is a well-studied problem with an O(n^1.6) solution (Chung, 1984). However, this approach does not take into account constraints of projectivity or mild context-sensitivity. Gildea and Temperley themselves introduced a novel efficient algorithm for finding the minimized dependency length of a sentence subject to the constraint that the linearization is projective. Their algorithm can perhaps be most simply understood by making three observations. First, the total dependency length of a projective linearization can be written as

  Σ_{w_i} [ D(w_i, E_i) + Σ_{w_j ∈ dep(w_i)} D(w_i, E_j) ]    (1)

where E_i is the boundary of the contiguous substring corresponding to the dependency subtree rooted at w_i which stands between w_i and its governor, and D(w_i, E_j) is the distance from w_i to E_j, with the special case of D(w_root, E_root) = 0 (Figures 1 and 2). Writing the total dependency length this way makes it clear that each term in the outer sum can be optimized independently, and thus one can use dynamic programming to recursively find optimal subtree orderings from the bottom up. Second, for each subtree, the optimal ordering can be obtained by placing dependent subtrees on alternating sides of w from inside out in order of increasing length. Third, the total dependency lengths between any words within an ordering stay the same when the ordering is reversed, letting us assume that D(w_i, E_i) will be the length to the closest edge. These three observations lead to an algorithm with worst-case complexity of O(n log m) time, where n is sentence length and m is sentence arity. (The log m term arises from the need to sort the daughters of each node into descending order of length.)

[Figure 2: Dependency length factorization for efficient projective linearization, using the dependency subtree of Figure 1.]

[Figure 3: Factorizing dependency length at node w_i of a mildly context-sensitive dependency tree. This partial linearization of head with dependent components makes c1 the head component and leads to l = 2 links crossing between c1 and c2.]

When limited subclasses of non-projectivity are admitted, however, the problem becomes more difficult, because total dependency length can no longer be written in such a simple form as in Equation (1). Intuitively, the size of the effect on dependency length of a decision to order a given subtree discontiguously, as in a woman ... who was wearing a hat in Example (2), cannot be calculated without consulting the length of the string that the discontiguous subtree would be wrapped around. Nevertheless, for any limited gap degree, it is possible to use a different factorization of dependency length that keeps computation polynomial in sentence length. We introduce this factorization in the next section.

4 Minimization with limited gap degree

We begin by defining some terms.
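Before those definitions, the objective and the projective search space can be made concrete. The sketch below is our own brute-force illustration over a toy tree: it enumerates projective linearizations, in which every dependent subtree stays contiguous, and scores each by total dependency length. It is exponential and only meant to clarify the quantity being minimized; it is not Gildea and Temperley's O(n log m) algorithm, nor the algorithm developed in this section.

```python
from itertools import permutations, product

def total_dependency_length(heads, order):
    """Sum of |position(head) - position(dependent)| over all dependencies.
    heads[i-1] is the head of word i (0 marks the root); order is a
    permutation of the word indices 1..n."""
    pos = {w: k for k, w in enumerate(order)}
    return sum(abs(pos[i] - pos[h]) for i, h in enumerate(heads, 1) if h)

def projective_linearizations(children, node):
    """All projective orders of the subtree rooted at `node`: each dependent
    subtree stays a single contiguous block arranged around its head."""
    options = [projective_linearizations(children, c)
               for c in children.get(node, [])]
    orders = []
    for chosen in product(*options):                 # one order per dependent subtree
        blocks = [[node]] + [list(c) for c in chosen]
        for arrangement in permutations(blocks):     # head and blocks in any order
            orders.append([w for block in arrangement for w in block])
    return orders

# Dependency tree of examples (1)/(2) under one simplified analysis; the word
# indices are just labels here, since orders are searched from scratch.
heads = [4, 3, 4, 0, 7, 7, 3, 9, 7]
children = {}
for i, h in enumerate(heads, 1):
    children.setdefault(h, []).append(i)
root = heads.index(0) + 1
best = min(projective_linearizations(children, root),
           key=lambda o: total_dependency_length(heads, o))
print(total_dependency_length(heads, best), best)
```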
We use the word component to refer to a full linearization of a subtree in the case where it is realized as a single contiguous string, or to refer to any of of the contiguous substrings produced when a subtree is realized discontiguously. We illustrate the factorization for gap-degree 1, so that any subtree has at most two components. We refer to the component containing the head of the subtree as the head component, the remaining component as the dependent component, and for any given (head component, dependent component) pair, we use pair component to refer to the other component in the pair. We refer to the two components of dependent dj as dj1 and dj2 respectively, and assume that dj1 is the head component. When dependencies can cross, total dependency length cannot be factorized as simply as in Equation (1) for the projective case. However, we can still make use of a more complex factorization of the total dependency length as follows: wi D(wi , Ei ) + wj wi dep D(wi , Ej ) + lj kj (2) where lj is the number of links crossing between the two components of dj , and kj is the distance added between these two components by the partial linearization at wi . Figure 3 illustrates an example of such a partial linearization, where k2 is |d31 | + |d32 | due to the fact that the links between d21 and d22 have to cross both components of d3 . The factorization in Equation (2) allows us to use dynamic programming to find minimal-length linearizations, so that worst-case complexity is polynomial rather than exponential in sentence length. However, the additional term in the factorization means that we need to track the number of links l crossing between the two components of the subtree Si headed by wi and the component lengths |c1 | and |c2 |. Additionally, the presence of crossing dependencies means that Gildea and Temperley's proof that ordering dependent components from the inside out in order of increasing length no longer goes through. This means that at each node wi we need to hold on to the minimal-length partial linearization for each combination of the following quantities: · |c2 | (which also determines |c1 |); · the number of links l between c1 and c2 ; · and the direction of the link between wi and its governor. We shall refer to a combination of these factors as a status set. The remainder of this section describes a dynamic-programming algorithm for finding optimal linearizations based on the factorization in Equation (2), and continues with several further findings leading to optimizations that make the algorithm tractable for naturally occurring sentences. 4.1 Algorithm 1 so that the maximum number of status sets at each node is bounded above by n2 . Since the sum of the status sets of all child subtrees is also bounded by n2 , the maximum number of status set combinations 2 is bounded by ( n )m (obtainable from the inequalm ity of arithmetic and geometric means). There are (2m+1)!m possible arrangements of head word and dependent components into two components. Since there are n nodes in the tree and each possible combination of status sets from each dependent sub tree must be tried, this algorithm has worst-case com2 plexity of O((2m + 1)!mn( n )m ). This algorithm m could be generalized for mildly context-sensitive linearizations polynomial in sentence length for any gap degree desired, by introducing additional l terms denoting the number of links between pairs of components. 
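One simple way to organize this bookkeeping is a per-node table keyed by status set, storing only the best partial linearization seen so far. The sketch below (our own illustration with hypothetical values; only the update step is shown) is meant to make the role of the status set concrete, not to reproduce the authors' data structures.

```python
def update_status_table(table, status, total_dl, linearization):
    """Keep only the best partial linearization per status set.

    A status set is modelled here as (dependent_component_length,
    crossing_links, governor_direction); `table` maps it to the pair
    (best_total_dependency_length, linearization)."""
    current = table.get(status)
    if current is None or total_dl < current[0]:
        table[status] = (total_dl, linearization)

table = {}
update_status_table(table, (3, 1, "left"), 17, "ordering A")
update_status_table(table, (3, 1, "left"), 15, "ordering B")   # replaces A
update_status_table(table, (2, 2, "left"), 16, "ordering C")   # different status set
print(table[(3, 1, "left")])   # (15, 'ordering B')
```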
However, even for gap degree 1 this bound is incredibly large, and as we show in Figure 7, algorithm 1 is not computationally feasible for batch processing sentences of arity greater than 5. 4.2 Algorithm 2 Our first algorithm takes a tree and recursively finds the optimal orderings for each possible status set of each of its child subtrees, which it then uses to calculate the optimal ordering of the tree. To calculate the optimal orderings for each possible status set of a subtree S, we use the brute-force method of choosing all combinations of one status set from each child subtree, and for each combination, we try all possible orderings of the components of the child subtrees, calculate all possible status sets for S, and store the minimal dependency value for each appearing status set of S. The number of possible length pairings |c1 |, |c2 | and number of crossing links l are each bounded above by the sentence length n, 338 We now show how to speed up our algorithm by proving by contradiction that for any optimal ordering which minimizes the total dependency length with the two-cluster constraint, for any given subtree S and its child subtree C, the pair components c1 and c2 of a child subtree C must be placed on opposite sides of the head h of subtree S. Let us assume that for some dependency tree structure, there exists an optimal ordering where c1 and c2 are on the same side of h. Let us refer to the ordered set of words between c1 and c2 as v. None of the words in v will have dependency links to any of the words in c1 and c2 , since the dependencies of the words in c1 and c2 are either between themselves or the one link to h, which is not between the two components by our assumption. There will be j1 0 links from v going over c1 , j2 0 dependency links from v going over c2 , and l 1 links between c1 and c2 . Without loss of generality, let us assume that h is on the right side of c2 . Let us consider the effect on total dependency length of swapping c1 with v, so that the linear ordering is v c1 c2 h. The total dependency length of the new word ordering changes by -j1 |c1 |-l|v|+j2 |c1 | if c2 is the head component, and decreases by another |v| if c1 is the head component. Thus the total change in dependency length is less than or equal to (j2 - j1 )|c1 | - l × |v| < (j2 - j1 )|c1 | (3) If instead we swap places of v with c2 instead of c1 so that we have c1 c2 v h, we find that the total change in dependency length is less than or equal to (j1 - j2 )|c2 | - (l - 1)|v| (j1 - j2 )|c2 | (4) It is impossible for the right-hand sides of (3) and (4) to be positive at the same time, so swapping v with either c1 or c2 must lead to a linearization with lower overall dependency length. But this is a contradiction to our original assumption, so we see that for any optimal ordering, all split child subtree components c1 and c2 of the child subtree of S must be placed on opposite sides of the head h. This constraint allows us to simplify our algorithm for finding the minimal-length linearization. Instead of going through all logically possible orderings of components of the child subtrees, we can now decide on which side the head component will be on, and go through all possible orderings for each side. This changes the factorial part of our algorithm run time from (2m + 1)!m to 2m (m!)2 m, giving us 2 O(2m (m!)2 mn( n )m ), greatly reducing actual prom cessing time. 4.3 Algorithm 3 Figure 4: Initial setup for latter part of optimization proof in section 4.4. 
To the far left is the head h of subtree S. The component pair C1 and C2 makes up S, and g is the governor of h. The length of the substring v between C1 and C2 is k. c_i and c_{i+1} are child subtree components.]

We now present two more findings for further increasing the efficiency of the algorithm. First, we look at the status sets which need to be stored for the dynamic programming algorithm. In the straightforward approach we first presented, we stored the optimal dependency lengths for all possible status sets. We now know that we only need to consider cases where the pair components are on opposite sides. This means the direction of the link from the head to the parent will always be toward the inside direction of the pair components, so we can re-define the status set as (p, l), where p is again the length of the dependent component and l is the number of links between the two pair components. If the p values for sets s1 and s2 are equal, s1 has no more links than s2 (l_{s1} ≤ l_{s2}), and s1 has a smaller or equal total dependency length to s2, then replacing the components of s2 with s1 will always give us the same or a more optimal total dependency length. Thus, we do not have to store such cases for our algorithm.

Next, we prove by contradiction that for any two status sets s1 and s2, if p_{s1} > p_{s2} > 0, l_{s1} = l_{s2}, and the TOTAL INTERNAL DEPENDENCY LENGTH t1 of s1 (defined as the sum in Equation (2) over only those words inside the subtree headed by h) is less than or equal to t2 of s2, then using s1 will be at least as good as s2, so we can ignore s2. Let us suppose that the optimal linearization can use s2 but not s1. Then in the optimal linearization, the two pair components c_{s2,1} and c_{s2,2} of s2 are on opposite sides of the parent head h. WLOG, let us assume that components c_{s1,1} and c_{s2,1} are the dependent components. Let us denote the total number of links going over c_{s2,1} as j1 and the words between c_{s2,1} and c_{s2,2} as v (note that v must contain h). If we swap c_{s2,1} with v, so that c_{s2,1} lies adjacent to c_{s2,2}, then there would be j2 + 1 links going over c_{s2,1}. By moving c_{s2,1} from the opposite side of the head to be right next to c_{s2,2}, the total dependency length of the sentence changes by -j1 |c_{s2,1}| - l_{s2} |v| + (j2 + 1)|c_{s2,1}|. Since the ordering was optimal, we know that

  (j2 - j1 + 1)|c_{s2,1}| - l_{s2} |v| ≥ 0

Since l_{s2} > 0, we can see that j1 - j2 ≤ 0. Now, instead of swapping v with c_{s2,1}, let us try substituting the components from s1 instead of s2. The change in the total dependency length of the sentence will be

  j1 (|c_{s1,1}| - |c_{s2,1}|) + j2 (|c_{s1,2}| - |c_{s2,2}|) + t1 - t2 = (j1 - j2)(p_{s1} - p_{s2}) + (t1 - t2)

Since j1 - j2 ≤ 0 and p_{s1} > p_{s2}, the first term is less than or equal to 0, and since t1 - t2 ≤ 0, the total dependency length will have stayed equal or decreased. But this contradicts our assumption that only s2 can be part of an optimal ordering. This finding greatly reduces the number of status sets we need to store and check higher up in the algorithm. The worst-case complexity remains O(2^m (m!)^2 m n (n^2/m)^m), but the actual runtime is reduced by several orders of magnitude.
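Both pruning criteria in this subsection amount to a dominance check over a node's stored status sets. The following sketch is our own rendering of those two conditions over hypothetical (p, l, t) entries, not the authors' code.

```python
def prune_status_sets(entries):
    """Drop status sets that cannot be needed for an optimal linearization.

    Each entry is (p, l, t): dependent-component length, number of links
    crossing between the pair components, and total internal dependency
    length.  Entry a can replace entry b whenever
      * p_a == p_b, l_a <= l_b and t_a <= t_b, or
      * p_a >  p_b > 0, l_a == l_b and t_a <= t_b,
    so any b with such an a (other than itself) can be discarded."""
    entries = sorted(set(entries))          # deduplicate identical entries
    def replaces(a, b):
        (pa, la, ta), (pb, lb, tb) = a, b
        if a == b:
            return False
        return ((pa == pb and la <= lb and ta <= tb) or
                (pa > pb > 0 and la == lb and ta <= tb))
    return [b for b in entries if not any(replaces(a, b) for a in entries)]

print(prune_status_sets([(3, 2, 10), (3, 3, 10), (2, 3, 12), (0, 3, 11)]))
# [(0, 3, 11), (3, 2, 10)]
```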
DLA                                    English             German
                                       Surface    Deep     Surface    Deep
Optimal with one crossing dependency   32.7       33.0     24.5       23.3
Optimal with projectivity constraint   34.1       34.4     25.5       24.2
Observed                               46.6       48.0     43.6       43.1
Random with projectivity constraint    82.4       82.8     50.6       49.2
Random with two-cluster constraint     84.0       84.3     50.7       49.5
Random ordering with no constraint     183.2      184.2    106.9      101.1

Table 1: Average sentence dependency lengths (with max arity of 10).

[Figure 5: Moving c_{i+1} to C1.]
[Figure 6: Moving c_i to C2.]
[Figure 7: Timing comparison of first and fully optimized algorithms (execution time in ms, log scale, as a function of the maximum number of dependencies per head, for Algorithms 1 and 4).]

4.4 Algorithm 4

Our last optimization is on the ordering among the child subtree components on each side of the subtree head h. The initially proposed algorithm went through all combinations of possible orderings to find the optimal dependency length for each status set. By the first optimization in Section 4.2 we have shown that we only need to consider the orderings in which the components are on opposite sides of the head. We now look into the ordering of the components on each side of the head. We first define the rank value r for each component c as follows:

  r(c) = |c| / (number of links between c and its pair component + I(c))

where I(c) is the indicator function having value 1 if c is a head component and 0 otherwise. Using this definition, we prove by contradiction that the ordering of the components from the head outward must be in order of increasing rank value. Let us suppose that at some subtree S headed by h and with head component C1 and dependent component C2, there is an optimal linearization in which there exist two components c_i and c_{i+1} of immediate subtrees of S such that c_i is closer to h, the components have rank values r_i and r_{i+1} respectively, r_i > r_{i+1}, and no other component of the immediate subtrees of S intervenes between c_i and c_{i+1}. We shall denote the number of links between each component and its pair component as l_i, l_{i+1}. Let l'_i = l_i + I(c_i) and l'_{i+1} = l_{i+1} + I(c_{i+1}). There are two cases to consider: either (1) c_i and c_{i+1} are within the same component of S, or (2) c_i is at the edge of C1 nearest C2 and c_{i+1} is at the edge of C2 nearest C1.

Consider case 1, and let us swap c_i with c_{i+1}; this affects only the lengths of links involving connections to c_i or c_{i+1}. The total dependency length of the new linearization will change by

  -l'_{i+1} |c_i| + l'_i |c_{i+1}| = -l'_i l'_{i+1} (r_i - r_{i+1}) < 0

This is a contradiction to the assumption that we had an optimal ordering.

Now consider case 2, which is illustrated in Figure 4. We denote the number of links going over c_i and c_{i+1}, excluding links to c_i and c_{i+1}, as j_1 and j_2 respectively, and the length of the words between the edges of C1 and C2 as k. Let us move c_{i+1} to the outermost position of C1, as shown in Figure 5. Since the original linearization was optimal, we have:

  -j_2 |c_{i+1}| + j_1 |c_{i+1}| - l'_{i+1} k ≥ 0
  (j_1 - j_2) |c_{i+1}| ≥ l'_{i+1} k
  (j_1 - j_2) r_{i+1} ≥ k

Let us also consider the opposite case of moving c_i to the inner edge of C2, as shown in Figure 6. Once again, due to the optimality of the original linearization, we have

  -j_1 |c_i| + j_2 |c_i| + l'_i k ≥ 0
  (j_2 - j_1) |c_i| ≥ -l'_i k
  (j_1 - j_2) r_i ≤ k

But this is a contradiction, since r_i > r_{i+1}. Combining the two cases, we can see that regardless of where the components may be split, in an optimal ordering the components going outward from the head must have increasing rank values. This result allows us to simplify our algorithm greatly, because we no longer need to go through all combinations of orderings. Once it has been decided which components will come on each side of the head, we can sort the components by rank value and place them from the head out.
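The resulting placement rule is compact enough to state directly in code. The sketch below is our own illustration of the rank-value ordering (the component triples are hypothetical), not the authors' implementation.

```python
def order_components_outward(components):
    """Order the child-subtree components on one side of the head.

    Each component is (length, links_to_pair_component, is_head_component).
    Its rank value is length / (links_to_pair_component + I), with I = 1 for
    a head component and 0 otherwise; components are placed from the head
    outward in increasing rank order."""
    def rank(component):
        length, links, is_head_component = component
        return length / (links + (1 if is_head_component else 0))
    return sorted(components, key=rank)

# Three components assigned to the same side of the head: the short,
# link-heavy component sits next to the head, the long link-poor one outermost.
side = [(6, 1, False), (2, 3, True), (4, 2, False)]
print(order_components_outward(side))
# [(2, 3, True), (4, 2, False), (6, 1, False)]   (rank values 0.5, 2.0, 6.0)
```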
This reduces the factorial component of the algorithm's complexity to m log m, and the overall worst-case complexity 2 to O(nm2 log m( 2n )m ). Although this is still exm ponential in the arity of the tree, nearly all sentences encountered in treebanks have an arity low enough to make the algorithm tractable and even very efficient, as we show in the following section. 5 Empirical results Using the above algorithm, we calculated minimal dependency lengths for English sentences from the WSJ portion of the Penn Treebank, and for German sentences from the NEGRA corpus. The EnglishGerman comparison is of interest because word order is freer, and crossing dependencies more common, in German than in English (Kruijff and Vasishth, 2003). We extracted dependency trees from these corpora using the head rules of Collins (1999) for English, and the head rules of Levy and Manning (2004) for German. Two dependency trees were extracted from each sentence, the surface tree extracted by using the head rules on the context341 free tree representation (i.e. no crossing dependencies), and the deep tree extracted by first returning discontinuous dependents (marked by *T* and *ICH* in WSJ, and by *T* in the Penn-format version of NEGRA) before applying head rules. Figure 7 shows the average time it takes to calculate the minimal dependency length with crossing dependencies for WSJ sentences using the unoptimized algorithm of Section 4.1 and the fully optimized algorithm of Section 4.4. Timing tests were implemented and performed using Java 1.6.0 10 on a system running Linux 2.6.18-6-amd64 with a 2.0 GHz Intel Xeon processor and 16 gigs of memory, run on a single core. We can see from Figure 7 that the straight-forward dynamic programming algorithm takes many more magnitudes of time than our optimized algorithm, making it infeasible to calculate the minimal dependency length for larger sentences. The results we present below were obtained with the fully optimized algorithm from the sentences with a maximum arity of 10, using 49,176 of the 49,208 WSJ sentences and 20,563 of the 20,602 NEGRA sentences. Summary results over all sentences from each corpus are shown in Table 1. We can see that for both corpora, the oberved dependency length is smaller than the dependency length of random orderings, even when the random ordering is subject to the projectivity constraint. Relaxing the projectivity constraint by allowing crossing dependencies introduces a slightly lower optimal dependency length. The average sentence dependency lengths for the three random orderings are significantly higher than the observed values. 
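The random baselines can be approximated by sampling linearizations and averaging their dependency lengths. The sketch below is our own illustration, under the assumption that shuffling the head and its dependents' contiguous blocks uniformly at each node is an acceptable stand-in for a random projective ordering (the paper does not spell out its exact sampling scheme); the small scoring helper is repeated so the snippet stands alone.

```python
import random

def random_projective_order(children, node, rng):
    """Sample a projective order by shuffling, at each head, the head itself
    and its dependents' (recursively sampled) contiguous blocks."""
    blocks = [[node]] + [random_projective_order(children, c, rng)
                         for c in children.get(node, [])]
    rng.shuffle(blocks)
    return [w for block in blocks for w in block]

def total_dependency_length(heads, order):
    pos = {w: k for k, w in enumerate(order)}
    return sum(abs(pos[i] - pos[h]) for i, h in enumerate(heads, 1) if h)

heads = [4, 3, 4, 0, 7, 7, 3, 9, 7]          # toy tree from the earlier sketches
children = {}
for i, h in enumerate(heads, 1):
    children.setdefault(h, []).append(i)
rng = random.Random(0)
samples = [total_dependency_length(heads,
                                   random_projective_order(children, 4, rng))
           for _ in range(1000)]
print(sum(samples) / len(samples))           # mean DL over sampled projective orders
```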
It is interesting to note that the random orderings given the projectivity constraint and the two-cluster constraint have very similar dependency lengths, whereas a totally random ordering increases the dependency length significantly.

[Figure 8: Average sentence DL as a function of sentence length, for English/Surface, English/Deep, German/Surface, and German/Deep. Legend is ordered top curve to bottom curve: Unconstrained Random, 2-component Random, Projective Random, Observed, Projective Optimal, 2-component Optimal.]

[Figure 9: Average sentence DL as a function of sentence arity, for the same four conditions. Legend is ordered top curve to bottom curve: Unconstrained Random, 2-component Random, Projective Random, Observed, Projective Optimal, 2-component Optimal.]

NEGRA generally has shorter sentences than WSJ, so we need a more detailed picture of dependency length as a function of sentence length; this is shown in Figure 8. As in Table 1, we see that English, which has fewer crossing dependency structures than German, has observed DL closer to optimal DL and farther from random DL. We also see that the random and observed DLs behave very similarly across different sentence lengths in English and German, but observed DL grows faster in German. Perhaps surprisingly, optimal projective DL and gap-degree 1 DL tend to be very similar even for longer sentences. The picture as a function of sentence arity is largely the same (Figure 9).

6 Conclusion

In this paper, we have presented an efficient dynamic programming algorithm which finds minimum-length dependency-tree linearizations subject to constraints of mild context-sensitivity. For the gap-degree 1 case, we have proven several properties of these linearizations, and have used these properties to optimize our algorithm. This made it possible to find minimal dependency lengths for sentences from the English Penn Treebank WSJ and German NEGRA corpora. The results show that for both languages, using surface dependencies and deep dependencies leads to generally similar conclusions, but that minimal lengths for deep dependencies are consistently slightly higher for English and slightly lower for German. This may be because German has many more crossing dependencies than English.
Another finding is that the difference between average sentence DL does not change much between optimizing for the projectivity constraint and the twocluster constraint: projectivity seems to give natural language almost all the flexibility it needs to minimize DL. For both languages, the observed linearization is much closer in DL to optimal linearizations than to random linearizations; but crucially, we see that English is closer to the optimal linearization and farther from random linearization than German. This finding is resonant with the fact that German has richer morphology and overall greater variability in observed word order, and with psycholinguistic results suggesting that dependencies of greater linear distance do not always pose the same increased processing load in German sentence comprehension as they do in English (Konieczny, 2000). References Chung, F. R. K. (1984). On optimal linear arrangements of trees. Computers and Mathematics with Applications, 10:43­60. Collins, M. (1999). Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania. Gibson, E. (1998). Linguistic complexity: Locality of syntactic dependencies. Cognition, 68:1­76. Gildea, D. and Temperley, D. (2007). Optimizing grammars for minimum dependency length. In Proceedings of ACL. Hawkins, J. A. (1994). A Performance Theory of Order and Constituency. Cambridge. Hawkins, J. A. (2004). Efficiency and Complexity in Grammars. Oxford University Press. Jaeger, T. F. (2006). Redundancy and Syntactic Reduction in Spontaneous Speech. PhD thesis, Stanford University, Stanford, CA. Joshi, A. K. (1985). How much context-sensitivity is necessary for characterizing structural descriptions ­ Tree Adjoining Grammars. In Dowty, D., Karttunen, L., and Zwicky, A., editors, Natural Language Processing ­ Theoretical, Computational, and Psychological Perspectives. Cambridge. Joshi, A. K., Levy, L. S., and Takahashi, M. (1975). Tree adjunct grammars. Journal of Computer and System Sciences, 10(1). Kobele, G. M. (2006). Generating Copies: An investigation into Structural Identity in Language and Grammar. PhD thesis, UCLA. Konieczny, L. (2000). Locality and parsing complexity. Journal of Psycholinguistic Research, 29(6):627­645. Kruijff, G.-J. M. and Vasishth, S. (2003). Quantifying word order freedom in natural language: Implications for sentence processing. Proceedings of the Architectures and Mechanisms for Language Processing conference. Kuhlmann, M. (2007). Dependency Structures and Lexicalized Grammars. PhD thesis, Saarland University. 343 Kuhlmann, M. and M¨ hl, M. (2007). Mildly o context-sensitive dependency languages. In Proceedings of ACL. Kuhlmann, M. and Nivre, J. (2006). Mildly nonprojective dependency structures. In Proceedings of COLING/ACL. Levy, R. and Manning, C. (2004). Deep dependencies from context-free statistical parsers: correcting the surface dependency approximation. In Proceedings of ACL. McDonald, R., Crammer, K., and Pereira, F. (2005a). Online large-margin training of dependency parsers. In Proceedings of ACL. McDonald, R., Pereira, F., Ribarov, K., and Haji , c J. (2005b). Non-projective dependency parsing using spanning tree algorithms. In Proceedings of ACL. Miller, P. (2000). Strong Generative Capacity: The Semantics of Linguistic Formalism. Cambridge. Pollard, C. (1984). Generalized Phrase Structure Grammars, Head Grammars, and Natural Languages. PhD thesis, Stanford. Vijay-Shanker, K., Weir, D. J., and Joshi, A. K. (1987). 
Characterizing structural descriptions produced by various grammatical formalisms. In Proceedings of ACL. Positive Results for Parsing with a Bounded Stack using a Model-Based Right-Corner Transform William Schuler Dept. of Computer Science and Engineering Minneapolis, MN schuler@cs.umn.edu Abstract Statistical parsing models have recently been proposed that employ a bounded stack in timeseries (left-to-right) recognition, using a rightcorner transform defined over training trees to minimize stack use (Schuler et al., 2008). Corpus results have shown that a vast majority of naturally-occurring sentences can be parsed in this way using a very small stack bound of three to four elements. This suggests that the standard cubic-time CKY chart-parsing algorithm, which implicitly assumes an unbounded stack, may be wasting probability mass on trees whose complexity is beyond human recognition or generation capacity. This paper first describes a version of the rightcorner transform that is defined over entire probabilistic grammars (cast as infinite sets of generable trees), in order to ensure a fair comparison between bounded-stack and unbounded PCFG parsing using a common underlying model; then it presents experimental results that show a bounded-stack right-corner parser using a transformed version of a grammar significantly outperforms an unboundedstack CKY parser using the original grammar. 1 Introduction Statistical parsing models have recently been proposed that employ a bounded stack in time-series (left-to-right) recognition, in order to directly and tractably incorporate incremental phenomena such as (co-)reference or disfluency into parsing decisions (Schuler et al., 2008; Miller and Schuler, 2008). These models make use of a right-corner tree transform, based on the left-corner transform described by Johnson (1998), and are supported by 344 corpus results suggesting that most sentences (in English, at least) can be parsed using a very small stack bound of three to four elements (Schuler et al., 2008). This raises an interesting question: if most sentences can be recognized with only three or four elements of stack memory, is the standard cubic-time CKY chart-parsing algorithm, which implicitly assumes an unbounded stack, wasting probability mass on trees whose complexity is beyond human recognition or generation capacity? This paper presents parsing accuracy results using transformed and untransformed versions of a corpus-trained probabilistic context-free grammar suggesting that this is indeed the case. Experimental results show a bounded-memory time-series parser using a transformed version of a grammar significantly outperforms an unbounded-stack CKY parser using the original grammar. Unlike the tree-based transforms described previously, the model-based transform described in this paper does not introduce additional context from corpus data beyond that contained in the original probabilistic grammar, making it possible to present a fair comparison between bounded- and unbounded-stack versions of the same model. Since this transform takes a probabilistic grammar as input, it can also easily accommodate horizontal and vertical Markovisation (annotating grammar symbols with parent and sibling categories) as described by Collins (1997) and subsequently. 
The remainder of this paper is organized as follows: Section 2 describes related approaches to parsing with stack bounds; Section 3 describes an existing bounded-stack parsing framework using a rightcorner transform defined over individual trees; Section 4 describes a redefinition of this transform to ap- Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 344­352, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics ply to entire probabilistic grammars, cast as infinite sets of generable trees; and Section 5 describes an evaluation of this transform on the Wall Street Journal corpus of the Penn Treebank showing improved results for a transformed bounded-stack version of a probabilistic grammar over the original unbounded grammar. 2 Related Work The model examined here is formally similar to Combinatorial Categorial Grammar (CCG) (Steedman, 2000). But the CCG account is a competence model as well as a performance model, in that it seeks to unify category representations used in processing with learned generalizations about argument structure; whereas the model described in this paper is exclusively a performance model, allowing generalizations about lexical argument structures to be learned in some other representation, then combined with probabilistic information about parsing strategies to yield a set of derived incomplete constituents. As a result, the model described in this paper has a freer hand to satisfy strict working memory bounds, which may not permit some of the alternative composition operations proposed in the CCG account, thought to be associated with available prosody and quantifier scope analyses.1 Other models (Abney and Johnson, 1991; Gibson, 1991) seek to explain human processing difficulties as a result of memory capacity limits in parsing ordinary phrase structure trees. The Abney-Johnson and Gibson models adopt a left-corner parsing strategy, of which the right-corner transform described in this paper is a variant, in order to minimize memory usage. But the transform-based model described in this paper exploits a conception of chunking (Miller, 1956) -- in this case, grouping recognized words into stacked-up incomplete constituents -- to operate within much stricter estimates of human shortterm memory bounds (Cowan, 2001) than assumed by Abney and Johnson. The lack of support for some of these available scope analyses may not necessarily be problematic for the present model. The complexity of interpreting nested raised quantifiers may place them beyond the capability of human interactive incremental interpretation, but not beyond the capability of post-hoc interpretation (understood after the listener has had time to think about it). 1 Several existing incremental systems are organized around a left-corner parsing strategy (Roark, 2001; Henderson, 2004). But these systems generally keep large numbers of constituents open for modifier attachment in each hypothesis. This allows modifiers to be attached as right children of any such open constituent. But if any number of open constituents are allowed, then either the assumption that stored elements have fixed syntactic (and semantic) structure will be violated, or the assumption that syntax operates within a bounded memory store will be violated, both of which are psycholinguistically attractive as simplifying assumptions. 
The HHMM model examined in this paper upholds both the fixed-element and boundedmemory assumptions by hypothesizing fixed reductions of right child constituents into incomplete parents in the same memory element, to make room for new constituents that may be introduced at a later time. These in-element reductions are defined naturally on phrase structure trees as the result of aligning right-corner transformed constituent structures to sequences of random variables in a factored timeseries model. 3 Background The recognition model examined in this paper is a factored time-series model, based on a Hierarchic Hidden Markov Model (Murphy and Paskin, 2001), which probabilistically estimates the contents of a memory store of three to four partially-completed constituents over time. Probabilities for expansions, transitions and reductions in this model can be defined over trees in a training corpus, transformed and mapped to the random variables in an HHMM (Schuler et al., 2008). In Section 4 these probabilities will be computed directly from a probabilistic context-free grammar, in order to evaluate the contribution of stack bounds without introducing additional corpus context into the model. 3.1 A Bounded-Stack Model HHMMs are factored HMMs which mimic a bounded-memory pushdown automaton (PDA), supporting simple push and pop operations on a bounded stack-like memory store. HMMs characterize speech or text as a sequence 345 of hidden states qt (in this case, stacked-up syntactic categories) and observed states ot (in this case, words) at corresponding time steps t. A most likely sequence of hidden states q1..T can then be hypothe^ sized given any sequence of observed states o1..T : q1..T = argmax P(q1..T | o1..T ) ^ q1..T 1 ft-1 ft1 1 qt-1 1 qt ... 2 ft-1 ft2 2 qt-1 2 qt ... 3 ft-1 (1) (2) ft3 3 qt-1 3 qt ... ... = argmax P(q1..T )·P(o1..T | q1..T ) q1..T def T ot-1 ot = argmax q1..T t=1 PA(qt | qt-1 )·PB(ot | qt ) (3) Model (B ) likelihood probability P(o1..T | q1..T ) = t PB (ot | qt ). Transition probabilities PA (qt | qt-1 ) over complex hidden states qt can be modeled using synchronized levels of stacked-up component HMMs in an HHMM. HHMM transition probabilities are calculated in two phases: a reduce phase (resulting in an intermediate, marginalized state ft ), in which component HMMs may terminate; and a shift phase (resulting in a modeled state qt ), in which unterminated HMMs transition, and terminated HMMs are reinitialized from their parent HMMs. Variables over intermediate ft and modeled qt states are factored into sequences of depth-specific variables ­ one for each of D levels in the HHMM hierarchy: ft = ft1 . . . ftD qt = 1 D qt . . . qt using Bayes' Law (Equation 2) and Markov independence assumptions (Equation 3) to define a full P(q1..T | o1..T ) probability as the product of a Transition Model (A ) prior probability def P(q1..T ) = t PA (qt | qt-1 ) and an Observation Figure 1: Graphical representation of a Hierarchic Hidden Markov Model. Circles denote random variables, and edges denote conditional dependencies. Shaded circles are observations. def (4) (5) Transition probabilities are then calculated as a product of transition probabilities at each level, using level-specific reduce R,d and shift S,d models: PA(qt |qt-1 ) = def ft P(ft |qt-1 )·P(qt |ft qt-1 ) D d d-1 PR,d(ftd |ftd+1 qt-1 qt-1 )· (6) maintaining competing analyses of the entire memory store. A graphical representation of an HHMM with three levels is shown in Figure 1. 
Shift and reduce probabilities can then be defined in terms of finitely recursive Finite State Automata (FSAs) with probability distributions over transition, recursive expansion, and final-state status of states at each hierarchy level. In the version of HHMMs used in this paper, each intermediate variable is a reduction or non-reduction state ftd G {1, 0} (indicating, respectively, a complete reduced constituent of some grammatical category from domain G, or a failure to reduce due to an `active' transition being performed, or a failure to reduce due to an `awaited' transition being performed, as defined in Section 4.3); and each modeled variable is a synd tactic state qt G × G (describing an incomplete constituent consisting of an active grammatical category from domain G and an awaited grammatical category from domain G). An intermediate variable ftd at depth d may indicate reduction or nonreduction according to F-Rd,d if there is a reduction at the depth level immediately below d, but must indicate non-reduction (0) with probability 1 if there was no reduction below:2 d d-1 PR,d (ftd | ftd+1 qt-1 qt-1 ) = def = 1..D d=1 d d d-1 ft PS,d(qt |ftd+1 ftd qt-1 qt ) (7) with and defined as constants. In Viterbi decoding, the sums are replaced with argmax operators. This decoding process preserves ambiguity by 346 ftD+1 0 qt if ftd+1 G : [ftd = 0] d d-1 if ftd+1 G : PF-Rd,d (ftd | qt-1 , qt-1 ) (8) Here [·] is an indicator function: [] = 1 if is true, 0 otherwise. 2 0 where ftD+1 G and qt = ROOT. d Shift probabilities over the modeled variable qt at each level are defined using level-specific transition Q-Tr,d and expansion Q-Ex,d models: def Rewrite rules for the right-corner transform are shown below:4 · Beginning case: the top of a right-expanding sequence in an ordinary phrase structure tree is mapped to the bottom of a left-expanding sequence in a right-corner transformed tree: A A·0 A·1 A d d d-1 PS,d (qt | ftd+1 ftd qt-1 qt ) = d+1 d d if ft G, ftd G : [qt = qt-1 ] d+1 G, f d G : P d d+1 d d d-1 if f Q-Tr,d (qt | ft ft qt-1 qt ) t td+1 d-1 d if ft G, ftd G : PQ-Ex,d (qt | qt ) (9) A /A·1 A·0 (10) = ROOT. This model is and where conditioned on reduce variables at and immediately below the current FSA level. If there is no reduction immediately below the current level (the first case above), it deterministically copies the current FSA state forward to the next time step. If there is a reduction immediately below the current level but no reduction at the current level (the second case above), it transitions the FSA state at the current level, according to the distribution Q-Tr,d . And if there is a reduction at the current level (the third case above), it re-initializes this state given the state at the level above, according to the distribution Q-Ex,d . The overall effect is that higher-level FSAs are allowed to transition only when lower-level FSAs terminate. An HHMM therefore behaves like a probabilistic implementation of a pushdown automaton (or shift­reduce parser) with a finite stack, where the maximum stack depth is equal to the number of levels in the HHMM hierarchy. ftD+1 G 0 qt This case of the right-corner transform may be considered a constrained version of CCG type raising. 
· Middle case: each subsequent branch in a right-expanding sequence of an ordinary phrase structure tree is mapped to a branch in a leftexpanding sequence of the transformed tree: A A·µ A A·µ·0 A·µ·1 A /A·µ A·µ·0 A /A·µ·1 (11) This case of the right-corner transform may be considered a constrained version of CCG forward function composition. · Ending case: the bottom of a right-expanding sequence in an ordinary phrase structure tree is mapped to the top of a left-expanding sequence in a right-corner transformed tree: A A·µ a·µ A 3.2 Tree-Based Transforms The right-corner transform used in this paper is simply the left-right dual of a left-corner transform (Johnson, 1998). It transforms all right branching sequences in a phrase structure tree into left branching sequences of symbols of the form A /A·µ , denoting an incomplete instance of an `active' category A lacking an instance of an `awaited' category A·µ yet to come.3 These incomplete constituent categories have the same form and much of the same meaning as non-constituent categories in a Combinatorial Categorial Grammar (Steedman, 2000). Here and µ are node addresses in a binary-branching tree, defined as paths of left (0) or right (1) branches from the root. 3 A /A·µ A·µ a·µ (12) This case of the right-corner transform may be considered a constrained version of CCG forward function application. 4 These rules can be applied recursively from bottom up on a source tree, synchronously associating subtree structures matched to variables , , and on the left side of each rule with transformed representations of these subtree structures on the right. 347 a) binary-branching phrase structure tree: S NP NP JJ strong NN demand IN for NNP NNP new NNP york b) result of right-corner transform: S/NP S/VP NP NP/NNS NP/NNS NP/NNS NP/NP NP/PP NP NP/NN JJ strong IN for NN demand NPpos NPpos/POS NNP NNP/NNP NNP/NNP NNP new NNP york NNP city JJ NN NNS bonds VBN VBN/PRT VBN propped NNP NNP city S/NN DT the PRT up NPpos POS 's JJ general NN obligation PP NP NNS VBN propped NNS NNS bonds S/NN JJ S NN market VBN PRT up DT the JJ municipal VP NP NN NN market municipal obligation general POS 's Figure 2: Trees resulting from a) a sample phrase structure tree for the sentence Strong demand for New York City's general obligations bonds propped up the municipal market, and b) a right-corner transform of this tree. Sequences of left children are recognized from the bottom up through in-element transitions in a Hierarchic Hidden Markov Model. Right children are recognized by expanding to additional stack elements. The completeness of the above transform rules can be demonstrated by the fact that they cover all possible subtree configurations (with the exception of bare terminals, which are simply copied). The soundness of the above transform rules can be demonstrated by the fact that each rule transforms a right-branching subtree into a left-branching subtree labeled with an incomplete constituent. An example of a right-corner transformed tree is shown in Figure 2(b). An important property of this transform is that it is reversible. Rewrite rules for reversing a right-corner transform are simply the converse of those shown above. 348 Sequences of left children in the resulting mostlyleft-branching trees are recognized from the bottom up, through transitions at the same stack element. Right children, which are much less frequent in the resulting trees, are recognized through crosselement expansions in a bounded-stack recognizer. 
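The three rewrite cases can be implemented as a single walk down each right spine, recursively transforming the left children along the way. The following sketch is our own rendering of the tree-based transform (assuming binarized trees whose preterminals directly dominate word strings), not the code behind the reported experiments.

```python
def right_corner(tree):
    """Right-corner transform of a phrase-structure tree.

    A tree is (label, children); a preterminal's single child is a word
    string, and every other node is assumed to have exactly two subtree
    children.  Implements the beginning/middle/ending cases described above."""
    label, children = tree
    if len(children) != 2:               # terminal, preterminal, or unary: copy
        return (label, [c if isinstance(c, str) else right_corner(c)
                        for c in children])
    # Walk the right spine, remembering each spine category and the
    # transformed left child introduced at that step.
    spine_cats, lefts, node = [], [], tree
    while len(node[1]) == 2:
        left, right = node[1]
        lefts.append(right_corner(left))
        spine_cats.append(right[0])
        node = right                      # ends as the lowest right-spine child
    # Beginning case: A/A.1 immediately dominates the transformed first left child.
    result = ("%s/%s" % (label, spine_cats[0]), [lefts[0]])
    # Middle cases: each further left child extends the left-branching spine.
    for cat, left_tree in zip(spine_cats[1:], lefts[1:]):
        result = ("%s/%s" % (label, cat), [result, left_tree])
    # Ending case: the lowest right-spine child completes the constituent.
    return (label, [result, right_corner(node)])

tree = ("S", [("NP", ["we"]),
              ("VP", [("V", ["saw"]),
                      ("NP", [("DT", ["the"]), ("NN", ["dog"])])])])
print(right_corner(tree))
# ('S', [('S/NN', [('S/NP', [('S/VP', [('NP', ['we'])]), ('V', ['saw'])]),
#                  ('DT', ['the'])]),
#        ('NN', ['dog'])])
```

On this small example the output has the mostly left-branching shape of Figure 2(b): the subject is absorbed into an incomplete S/VP constituent first, and each further word extends the chain of A/A·µ categories until the final right child completes the S.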
4 Model-Based Transforms

In order to compare bounded- and unbounded-stack versions of the same model, the formulation of the right-corner and bounded-stack transforms introduced in this paper does not map trees to trees, but rather maps probability models to probability models. This eliminates complications in comparing models with different numbers of dependent variables (and thus different numbers of free parameters), because the model which ordinarily has more free parameters (the HHMM, in this case) is derived from the model that has fewer (the PCFG). Since they are derived from a simpler underlying model, the additional parameters of the HHMM are not free.

Mapping probability models from one format to another can be thought of as mapping the infinite sets of trees that are defined by these models from one format to another. Probabilities in the transformed model are therefore defined by calculating probabilities for the relevant substructures in the source model, then marginalizing out the values of nodes in these structures that do not appear in the desired expression in the target model.

A bounded-stack HHMM θQ, θF can therefore be derived from an unbounded PCFG θG by:

1. organizing the rules in the source PCFG model θG into direction-specific versions (distinguishing rules for expanding left and right children, which occur respectively as active and awaited constituent categories in incomplete constituent labels);

2. enforcing depth limits on these direction-specific rules; and

3. mapping these probabilities to HHMM random variable positions at the appropriate depth.

4.1 Direction-specific rules

An inspection of the tree-based right-corner transform rewrites defined in Section 3.2 will show two things: first, that constituents occurring as left children in an original tree (with addresses ending in '0') always become active constituents (occurring before the slash, or without a slash) in incomplete constituent categories, and constituents occurring as right children in an original tree (with addresses ending in '1') always become awaited constituents (occurring after the slash); and second, that left children expand locally downward in the transformed tree (so each A_{η·0}/... locally dominates A_{η·0·0}/...), whereas right children expand locally upward (so each .../A_{η·1} is locally dominated by .../A_{η·1·1}). This means that rules from the original grammar, if distinguished into rules applying only to left and right children (active and awaited constituents), can still be locally modeled following a right-corner transform. A transformed tree can be generated in this way by expanding downward along the active constituents in a transformed tree, then turning around and expanding upward to fill in the awaited constituents, then turning around again to generate the active constituents at the next depth level, and so on.
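One way to picture step 1 is as a counting pass over a binarized treebank that keeps separate tallies for rules rewriting left children (the active side of an incomplete constituent) and rules rewriting right children (the awaited side). The sketch below is illustrative bookkeeping under that reading, with a minimal (label, children) tuple encoding of trees and a made-up example sentence; it is not the paper's implementation, which derives the direction-specific probabilities from the PCFG itself.

```python
from collections import defaultdict

def direction_specific_counts(node, side="L", counts=None):
    """Tally binary rules separately by the direction of the child being
    rewritten: 'L' for left children (active constituents), 'R' for right
    children (awaited constituents).  A node is a (label, children) pair
    and the root is treated as a left child.  Illustrative sketch only."""
    if counts is None:
        counts = {"L": defaultdict(float), "R": defaultdict(float)}
    label, children = node
    if len(children) == 2:
        (l_lab, _), (r_lab, _) = children
        counts[side][(label, l_lab, r_lab)] += 1.0
        direction_specific_counts(children[0], "L", counts)
        direction_specific_counts(children[1], "R", counts)
    elif len(children) == 1:
        direction_specific_counts(children[0], side, counts)
    return counts

def relative_frequencies(rule_counts):
    """Normalize each direction-specific table into rule probabilities."""
    totals = defaultdict(float)
    for (parent, _, _), c in rule_counts.items():
        totals[parent] += c
    return {rule: c / totals[rule[0]] for rule, c in rule_counts.items()}

# Toy, hypothetical tree: S -> NP VP, with NP a left child and VP a right child.
tree = ("S", [("NP", [("DT", [("the", [])]), ("NN", [("market", [])])]),
              ("VP", [("VBD", [("fell", [])]), ("ADVP", [("RB", [("sharply", [])])])])])
counts = direction_specific_counts(tree)
probs = {side: relative_frequencies(c) for side, c in counts.items()}
```

Step 2 then restricts each table to the depth levels at which it can apply, which is where the renormalization of Section 4.2 becomes necessary.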
4.2 Depth bounds

The locality of the original grammar rules in a right-corner transformed tree allows memory limits on incomplete constituents to be applied directly as depth bounds in the zig-zag generation traversal defined above. These depth limits correspond directly to the depth levels in an HHMM. In the experiments described in Section 5, direction-specific and depth-specific versions of the original grammar rules are implemented in an ordinary CKY-style dynamic-programming parser, and can therefore simply be cut off at a particular depth level with no renormalization.

But in an HHMM, this will result in label-bias effects, in which expanded constituents may have no valid reduction, forcing the system to define distributions for composing constituents that are not compatible. For example, if a constituent is expanded at depth D, and that constituent has no expansions that can be completely processed within depth D, it will not be able to reduce, and will remain incompatible with the incomplete constituent above it. Probabilities for depth-bounded rules must therefore be renormalized to the domain of allowable trees that can be generated within D depth levels, in order to guarantee consistent probabilities for HHMM recognition.

This is done by determining the (depth- and direction-specific) probability P_{θB-L,d}(1 | A_{η·0}) or P_{θB-R,d}(1 | A_{η·1}) that a tree generated at each depth d and rooted by a left or right child will fit within depth D. These probabilities are then estimated using an approximate inference algorithm, similar to that used in value iteration (Bellman, 1957), which estimates probabilities of infinite trees by exploiting the fact that increasingly longer trees contribute exponentially decreasing probability mass (since each non-terminal expansion must avoid generating a terminal with some probability at each step from the top down), so a sum over probabilities of trees with increasing length k is guaranteed to converge. The algorithm calculates probabilities of trees with increasing length k until convergence, or to some arbitrary limit K:

  P_{θB-L,d,k}(1 | A_{η·0}) ≝ Σ_{A_{η·0·0}, A_{η·0·1}} P_{θG}(A_{η·0} → A_{η·0·0} A_{η·0·1}) · P_{θB-L,d,k-1}(1 | A_{η·0·0}) · P_{θB-R,d,k-1}(1 | A_{η·0·1})    (13)

  P_{θB-R,d,k}(1 | A_{η·1}) ≝ Σ_{A_{η·1·0}, A_{η·1·1}} P_{θG}(A_{η·1} → A_{η·1·0} A_{η·1·1}) · P_{θB-L,d+1,k-1}(1 | A_{η·1·0}) · P_{θB-R,d,k-1}(1 | A_{η·1·1})    (14)

Normalized probability distributions for depth-bounded expansions θG-L,d and θG-R,d can now be calculated using converged θB-L,d and θB-R,d estimates:

  P_{θG-L,d}(A_{η·0} → A_{η·0·0} A_{η·0·1}) ≝ P_{θG}(A_{η·0} → A_{η·0·0} A_{η·0·1}) · P_{θB-L,d}(1 | A_{η·0·0}) · P_{θB-R,d}(1 | A_{η·0·1})    (15)

  P_{θG-R,d}(A_{η·1} → A_{η·1·0} A_{η·1·1}) ≝ P_{θG}(A_{η·1} → A_{η·1·0} A_{η·1·1}) · P_{θB-L,d+1}(1 | A_{η·1·0}) · P_{θB-R,d}(1 | A_{η·1·1})    (16)

4.3 HHMM probabilities

Converting PCFGs to HHMMs requires the calculation of expected frequencies F_{θG-L*,d}(A_η ⇒ A_{η·µ}) of generating symbols A_{η·µ} in the left-progeny of a nonterminal symbol A_η (in other words, of A_{η·µ} being a left child of A_η, or a left child of a left child of A_η, etc.). This is done by summing over subtrees of increasing length k using the same approximate inference technique described in Section 4.2, which guarantees convergence since each subtree of increasing length contributes exponentially decreasing probability mass to the sum:

  F_{θG-L*,d}(A_η ⇒ A_{η·µ}) = Σ_{k=0}^{∞} F_{θG-L*,d}(A_η ⇒^k A_{η·µ})    (17)

where:

  F_{θG-L*,d}(A_η ⇒^k A_{η·0^k}) = Σ_{A_{η·0^{k-1}}, A_{η·0^{k-1}·1}} F_{θG-L*,d}(A_η ⇒^{k-1} A_{η·0^{k-1}}) · P_{θG-L,d}(A_{η·0^{k-1}} → A_{η·0^k} A_{η·0^{k-1}·1})    (18)

and F_{θG-L*,d}(A_η ⇒^0 A_{η'}) ≝ [A_η = A_{η'}].

A complete HHMM can now be defined using depth-bounded right-corner PCFG probabilities. HHMM probabilities will be defined over syntactic states consisting of incomplete constituent categories A_η/A_{η·µ}. Expansions depend only on the incomplete constituent category ../A_η (for any active category '..') at q_t^{d-1}:

  P_{θQ-Ex,d}(a_{η·0·µ} | ../A_η) ≝
    [ Σ_{A_{η·0}, A_{η·1}} P_{θG-R,d-1}(A_η → A_{η·0} A_{η·1}) · F_{θG-L*,d}(A_{η·0} ⇒ a_{η·0·µ}) ]
    / [ Σ_{A_{η·0}, A_{η·1}, a_{η·0·µ}} P_{θG-R,d-1}(A_η → A_{η·0} A_{η·1}) · F_{θG-L*,d}(A_{η·0} ⇒ a_{η·0·µ}) ]    (19)
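The fixed-point computations in equations (13)-(14) (and, in the same spirit, (17)-(18)) can be pictured as a small value-iteration loop over the grammar. The sketch below is one way to set it up; the handling of lexical rules (treated as always fitting within the bound) and of categories pushed past depth D are assumptions of the sketch, not details given in the text.

```python
def depth_bound_probs(binary, lex, D, K=50, tol=1e-9):
    """Approximate P_B-L,d(1|A) and P_B-R,d(1|A): the probability that a tree
    rooted in category A, generated as a left or right child at depth d,
    fits within D depth levels (eqs. 13-14, value-iteration style).

    binary[A] maps (B, C) to P(A -> B C); lex[A] is the total probability of
    A rewriting directly to a terminal (an assumed convention)."""
    cats = set(binary) | set(lex)
    # Iteration k = 0: only an immediate terminal expansion is sure to fit.
    B_L = {d: {A: lex.get(A, 0.0) for A in cats} for d in range(1, D + 2)}
    B_R = {d: {A: lex.get(A, 0.0) for A in cats} for d in range(1, D + 1)}
    for _ in range(K):
        new_L = {d: dict(v) for d, v in B_L.items()}
        new_R = {d: dict(v) for d, v in B_R.items()}
        delta = 0.0
        for d in range(1, D + 1):
            for A in cats:
                rules = binary.get(A, {})
                # Eq. (13): both children of a left child stay at depth d.
                new_L[d][A] = lex.get(A, 0.0) + sum(
                    p * B_L[d].get(B, 0.0) * B_R[d].get(C, 0.0)
                    for (B, C), p in rules.items())
                # Eq. (14): the left child of a right child is pushed to d+1.
                new_R[d][A] = lex.get(A, 0.0) + sum(
                    p * B_L[d + 1].get(B, 0.0) * B_R[d].get(C, 0.0)
                    for (B, C), p in rules.items())
                delta = max(delta, abs(new_L[d][A] - B_L[d][A]),
                            abs(new_R[d][A] - B_R[d][A]))
        B_L, B_R = new_L, new_R
        if delta < tol:   # increasingly long trees add vanishing mass
            break
    return B_L, B_R
```

Multiplying each source rule probability by the converged estimates for its children, as in (15) and (16), then yields the depth-bounded rule models θG-L,d and θG-R,d.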
Transitions depend on whether an 'active' or 'awaited' transition was performed at the current level. If an active transition was performed (where f_t^d = 1), the transition depends only on the incomplete constituent category A_{η·0·µ·0}/.. (for any awaited category '..') at q_{t-1}^d, and the incomplete constituent category ../A_η (for any active category '..') at q_{t-1}^{d-1}:

  P_{θQ-Tr,d}(A_{η·0·µ}/A_{η·0·µ·1} | 1, A_{η·0·µ·0}/.., ../A_η) ≝
    [ Σ_{A_{η·0}, A_{η·1}} P_{θG-R,d-1}(A_η → A_{η·0} A_{η·1}) · F_{θG-L*,d}(A_{η·0} ⇒ A_{η·0·µ}) / (F_{θG-L*,d}(A_{η·0} ⇒ A_{η·0·µ·0}) − F_{θG-L*,d}(A_{η·0} ⇒^0 A_{η·0·µ·0})) · P_{θG-L,d}(A_{η·0·µ} → A_{η·0·µ·0} A_{η·0·µ·1}) ]
    / [ Σ_{A_{η·0}, A_{η·1}, A_{η·0·µ}, A_{η·0·µ·1}} P_{θG-R,d-1}(A_η → A_{η·0} A_{η·1}) · F_{θG-L*,d}(A_{η·0} ⇒ A_{η·0·µ}) / (F_{θG-L*,d}(A_{η·0} ⇒ A_{η·0·µ·0}) − F_{θG-L*,d}(A_{η·0} ⇒^0 A_{η·0·µ·0})) · P_{θG-L,d}(A_{η·0·µ} → A_{η·0·µ·0} A_{η·0·µ·1}) ]    (20)

If an awaited transition was performed (where f_t^d = 0), the transition depends only on the complete constituent category A_{η·µ·0} at f_t^{d+1}, and the incomplete constituent category A_η/A_{η·µ} at q_{t-1}^d:

  P_{θQ-Tr,d}(A_η/A_{η·µ·1} | 0, A_{η·µ·0}, A_η/A_{η·µ}) ≝ P_{θG-R,d}(A_{η·µ} → A_{η·µ·0} A_{η·µ·1}) / Σ_{A_{η·µ·1}} P_{θG-R,d}(A_{η·µ} → A_{η·µ·0} A_{η·µ·1})    (21)

Reduce probabilities depend on the complete constituent category at f_t^{d+1}, the incomplete constituent category A_{η·0·µ}/.. (for any awaited category '..') at q_{t-1}^d, and the incomplete constituent category ../A_η (for any active category '..') at q_{t-1}^{d-1}. If the complete constituent category at f_t^{d+1} does not match the awaited category of q_{t-1}^d, the probability is [f_t^d = 0]. If the complete constituent category at f_t^{d+1} does match the awaited category of q_{t-1}^d:

  P_{θF-Rd,d}(1 | A_{η·0·µ}/.., ../A_η) ≝
    [ Σ_{A_{η·0}, A_{η·1}} P_{θG-R,d-1}(A_η → A_{η·0} A_{η·1}) · (F_{θG-L*,d}(A_{η·0} ⇒ A_{η·0·µ}) − F_{θG-L*,d}(A_{η·0} ⇒^0 A_{η·0·µ})) ]
    / [ Σ_{A_{η·0}, A_{η·1}} P_{θG-R,d-1}(A_η → A_{η·0} A_{η·1}) · F_{θG-L*,d}(A_{η·0} ⇒ A_{η·0·µ}) ]    (22)

and:

  P_{θF-Rd,d}(A_{η·0·µ} | A_{η·0·µ}/.., ../A_η) ≝
    [ Σ_{A_{η·0}, A_{η·1}} P_{θG-R,d-1}(A_η → A_{η·0} A_{η·1}) · F_{θG-L*,d}(A_{η·0} ⇒^0 A_{η·0·µ}) ]
    / [ Σ_{A_{η·0}, A_{η·1}} P_{θG-R,d-1}(A_η → A_{η·0} A_{η·1}) · F_{θG-L*,d}(A_{η·0} ⇒ A_{η·0·µ}) ]    (23)

The correctness of the above distributions can be demonstrated by the fact that all terms other than θG-L,d and θG-R,d probabilities will cancel out in any sequence of transitions between an expansion and a reduction, leaving only those terms that would appear as factors in an ordinary PCFG parse.⁵

5 Results

A PCFG model was extracted from sections 2-21 of the Wall Street Journal Treebank. In order to keep the transform process manageable, punctuation was removed from the corpus, and rules occurring less frequently than 10 times in the corpus were deleted from the PCFG. The right-corner and bounded-stack transforms described in the previous section were then applied to the PCFG. The original and bounded PCFG models were evaluated in a CKY recognizer on sections 22-24 of the Treebank, with results shown in Table 1.⁶ Results were significant only for sentences longer than 40 words. On these sentences, the bounded PCFG model achieves about a .15% reduction of error over the original PCFG (p < .1 using a one-tailed pairwise t-test). This suggests that on long sentences the probability mass wasted due to parsing with an unbounded stack is substantial enough to impact parsing accuracy.

  model (sections 22-24, length > 40)   F
  unbounded PCFG                        66.03
  bounded PCFG (D=4)                    66.08

  Table 1: Results of CKY parsing using bounded and unbounded PCFG.

6 Conclusion

Previous work has explored bounded-stack parsing using a right-corner transform defined on trees to minimize stack usage. HHMM parsers trained on applications of this tree-based transform of training corpora have shown improvements over ordinary PCFG models, but this may have been attributable to the richer dependencies of the HHMM. This paper has presented an approximate inference algorithm for transforming entire PCFGs, rather than individual trees, into equivalent right-corner bounded-stack HHMMs. Moreover, a comparison with an untransformed PCFG model suggests that the probability mass wasted due to parsing with an unbounded stack is substantial enough to impact parsing accuracy.

Acknowledgments

This research was supported by NSF CAREER award 0447685 and by NASA under award NNX08AC36A.
The views expressed are not necessarily endorsed by the sponsors.

⁵ It is important to note, however, that these probabilities are not necessarily incrementally balanced, so this correctness only applies to parsing with an infinite beam.

⁶ A CKY recognizer was used in both cases in order to avoid introducing errors due to model approximation or beam limits necessary for incremental processing with large grammars.

References

Steven P. Abney and Mark Johnson. 1991. Memory requirements and local ambiguities of parsing strategies. J. Psycholinguistic Research, 20(3):233-250.
Richard Bellman. 1957. Dynamic Programming. Princeton University Press, Princeton, NJ.
Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL '97).
Nelson Cowan. 2001. The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24:87-185.
Edward Gibson. 1991. A computational theory of human linguistic processing: Memory limitations and processing breakdown. Ph.D. thesis, Carnegie Mellon.
James Henderson. 2004. Lookahead in deterministic left-corner parsing. In Proc. Workshop on Incremental Parsing: Bringing Engineering and Cognition Together, Barcelona, Spain.
Mark Johnson. 1998. Finite state approximation of constraint-based grammars using left-corner grammar transforms. In Proceedings of COLING/ACL, pages 619-623.
Tim Miller and William Schuler. 2008. A syntactic time-series model for parsing fluent and disfluent speech. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING'08).
George A. Miller. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63:81-97.
Kevin P. Murphy and Mark A. Paskin. 2001. Linear time inference in hierarchical HMMs. In Proc. NIPS, pages 833-840.
Brian Roark. 2001. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2):249-276.
William Schuler, Samir AbdelRahman, Tim Miller, and Lane Schwartz. 2008. Toward a psycholinguistically-motivated model of language. In Proceedings of COLING, Manchester, UK, August.
Mark Steedman. 2000. The syntactic process. MIT Press/Bradford Books, Cambridge, MA.

Hierarchical Text Segmentation from Multi-Scale Lexical Cohesion

Jacob Eisenstein
Beckman Institute for Advanced Science and Technology
University of Illinois
Urbana, IL 61801
jacobe@illinois.edu

Abstract

This paper presents a novel unsupervised method for hierarchical topic segmentation. Lexical cohesion, the workhorse of unsupervised linear segmentation, is treated as a multi-scale phenomenon, and formalized in a Bayesian setting. Each word token is modeled as a draw from a pyramid of latent topic models, where the structure of the pyramid is constrained to induce a hierarchical segmentation. Inference takes the form of a coordinate-ascent algorithm, iterating between two steps: a novel dynamic program for obtaining the globally-optimal hierarchical segmentation, and collapsed variational Bayesian inference over the hidden variables. The resulting system is fast and accurate, and compares well against heuristic alternatives.
1 Introduction

Recovering structural organization from unformatted texts or transcripts is a fundamental problem in natural language processing, with applications to classroom lectures, meeting transcripts, and chatroom logs. In the unsupervised setting, a variety of successful systems have leveraged lexical cohesion (Halliday and Hasan, 1976), the idea that topically-coherent segments display consistent lexical distributions (Hearst, 1994; Utiyama and Isahara, 2001; Eisenstein and Barzilay, 2008). However, such systems almost invariably focus on linear segmentation, while it is widely believed that discourse displays a hierarchical structure (Grosz and Sidner, 1986). This paper introduces the concept of multi-scale lexical cohesion, and leverages this idea in a Bayesian generative model for hierarchical topic segmentation.

The idea of multi-scale cohesion is illustrated by the following two examples, drawn from the Wikipedia entry for the city of Buenos Aires.

  There are over 150 city bus lines called Colectivos ... Colectivos in Buenos Aires do not have a fixed timetable, but run from 4 to several per hour, depending on the bus line and time of the day.

  The Buenos Aires metro has six lines, 74 stations, and 52.3 km of track. An expansion program is underway to extend existing lines into the outer neighborhoods. Track length is expected to reach 89 km...

The two sections are both part of a high-level segment on transportation. Words in bold are characteristic of the subsections (buses and trains, respectively), and do not occur elsewhere in the transportation section; words in italics occur throughout the high-level section, but not elsewhere in the article. This paper shows how multi-scale cohesion can be captured in a Bayesian generative model and exploited for unsupervised hierarchical topic segmentation.

Latent topic models (Blei et al., 2003) provide a powerful statistical apparatus with which to study discourse structure. A consistent theme is the treatment of individual words as draws from multinomial language models indexed by a hidden "topic" associated with the word. In latent Dirichlet allocation (LDA) and related models, the hidden topic for each word is unconstrained and unrelated to the hidden topic of neighboring words (given the parameters). In this paper, the latent topics are constrained to produce a hierarchical segmentation structure, as shown in Figure 1.

[Figure 1: Each word w_t is drawn from a mixture of the language models located above t in the pyramid.]

These structural requirements simplify inference, allowing the language models to be analytically marginalized. The remaining hidden variables are the scale-level assignments for each word token. Given marginal distributions over these variables, it is possible to search the entire space of hierarchical segmentations in polynomial time, using a novel dynamic program. Collapsed variational Bayesian inference is then used to update the marginals. This approach achieves high quality segmentation on multiple levels of the topic hierarchy. Source code is available at http://people.csail.mit.edu/jacobe/naacl09.html.

2 Related Work

The use of lexical cohesion (Halliday and Hasan, 1976) in unsupervised topic segmentation dates back to Hearst's seminal TEXTTILING system (1994).
Lexical cohesion was placed in a probabilistic (though not Bayesian) framework by Utiyama and Isahara (2001). The application of Bayesian topic models to text segmentation was investigated first by Blei and Moreno (2001) and later by Purver et al. (2006), using HMM-like graphical models for linear segmentation. Eisenstein and Barzilay (2008) extend this work by marginalizing the language models using the Dirichlet compound multinomial distribution; this permits efficient inference to be performed directly in the space of segmentations. All of these papers consider only linear topic segmentation; we introduce multi-scale lexical cohesion, which posits that the distribution of some words changes slowly with high-level topics, while others change rapidly with lower-level subtopics. This gives a principled mechanism to model hierarchical topic segmentation.

The literature on hierarchical topic segmentation is relatively sparse. Hsueh et al. (2006) describe a supervised approach that trains separate classifiers for topic and sub-topic segmentation; more relevant for the current work is the unsupervised method of Yaari (1997). As in TEXTTILING, cohesion is measured using cosine similarity, and agglomerative clustering is used to induce a dendrogram over paragraphs; the dendrogram is transformed into a hierarchical segmentation using a heuristic algorithm. Such heuristic approaches are typically brittle, as they include a number of parameters that must be hand-tuned. These problems can be avoided by working in a Bayesian probabilistic framework.

We note two orthogonal but related approaches to extracting nonlinear discourse structures from text. Rhetorical structure theory posits a hierarchical structure of discourse relations between spans of text (Mann and Thompson, 1988). This structure is richer than hierarchical topic segmentation, and the base level of analysis is typically more fine-grained, at the level of individual clauses. Unsupervised approaches based purely on cohesion are unlikely to succeed at this level of granularity. Elsner and Charniak (2008) propose the task of conversation disentanglement from internet chatroom logs. Unlike hierarchical topic segmentation, conversational threads may be disjoint, with unrelated threads interposed between two utterances from the same thread. Elsner and Charniak present a supervised approach to this problem, but the development of cohesion-based unsupervised methods is an interesting possibility for future work.

3 Model

Topic modeling is premised on a generative framework in which each word w_t is drawn from a multinomial θ_{y_t}, where y_t is a hidden topic indexing the language model that generates w_t. From a modeling standpoint, linear topic segmentation merely adds the constraint that y_t ∈ {y_{t-1}, y_{t-1} + 1}. Segmentations that draw boundaries so as to induce compact, low-entropy language models will achieve a high likelihood. Thus topic models situate lexical cohesion in a probabilistic setting.

For hierarchical segmentation, we take the hypothesis that lexical cohesion is a multi-scale phenomenon. This is represented with a pyramid of language models, shown in Figure 1. Each word may be drawn from any language model above it in the pyramid. Thus, the high-level language models will be required to explain words throughout large parts of the document, while the low-level language models will be required to explain only a local set of words. A hidden variable z_t indicates which level is responsible for generating the word w_t.
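To make the pyramid picture concrete, the small sketch below lists, for one toy hierarchical segmentation, the language models sitting above each token; the hidden variable z_t then chooses among exactly these. The flat (level, segment) indexing of models and the toy segmentation are illustrative assumptions, not part of the model specification.

```python
def models_above(y):
    """For a hierarchical segmentation y, where y[l][t] is the index of the
    segment containing token t at level l (higher l = coarser level), return
    for each token the (level, segment) pairs naming the language models it
    may be drawn from."""
    levels = sorted(y)
    T = len(y[levels[0]])
    return [[(l, y[l][t]) for l in levels] for t in range(T)]

# Toy pyramid over eight tokens: one top-level segment whose two halves form
# the level-1 subsegments (hypothetical example).
y = {1: [0, 0, 0, 0, 1, 1, 1, 1],
     2: [0, 0, 0, 0, 0, 0, 0, 0]}
for t, models in enumerate(models_above(y)):
    print(t, models)   # e.g. 0 [(1, 0), (2, 0)]
```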
Ideally we would like to choose the segmentation ŷ = argmax_y p(w|y) p(y). However, we must deal with the hidden language models θ and scale-level assignments z. The language models can be integrated out analytically (Section 3.1). Given marginal likelihoods for the hidden variables z, the globally-optimal segmentation ŷ can be found using a dynamic program (Section 4.1). Given a segmentation, we can estimate marginals for the hidden variables, using collapsed variational inference (Section 4.2). We iterate between these procedures in an EM-like coordinate-ascent algorithm (Section 4.4) until convergence.

3.1 Language models

We begin the formal presentation of the model with some notation. Each word w_t is modeled as a single draw from a multinomial language model θ_j. The language models in turn are drawn from symmetric Dirichlet distributions with parameter α. The number of language models is written K; the number of words is W; the length of the document is T; and the depth of the hierarchy is L. For hierarchical segmentation, the vector y_t indicates the segment index of t at each level of the topic hierarchy; the specific level of the hierarchy responsible for w_t is given by the hidden variable z_t. Thus, y_t^{(z_t)} is the index of the language model that generates w_t.

With these pieces in place, we can write the observation likelihood,

  p(w | y, z, θ) = ∏_{t}^{T} p(w_t | θ_{y_t^{(z_t)}}) = ∏_{j}^{K} ∏_{{t : y_t^{(z_t)} = j}} p(w_t | θ_j),

where we have merely rearranged the product to group terms that are drawn from the same language model. As the goal is to obtain the hierarchical segmentation and not the language models, the search space can be reduced by marginalizing θ. The derivation is facilitated by a notational convenience: x_j represents the lexical counts induced by the set of words {w_t : y_t^{(z_t)} = j}.

  p(w | y, z) = ∏_{j}^{K} ∫ dθ_j p(θ_j | α) p(x_j | θ_j)
              = ∏_{j}^{K} p_dcm(x_j; α)
              = ∏_{j}^{K} [ Γ(Wα) / Γ(Σ_{i}^{W} x_{ji} + Wα) ] ∏_{i}^{W} [ Γ(x_{ji} + α) / Γ(α) ].    (1)

Here, p_dcm indicates the Dirichlet compound multinomial distribution (Madsen et al., 2005), which is the closed-form solution to the integral over language models. Also known as the multivariate Polya distribution, the probability density function can be computed exactly as a ratio of gamma functions. Here we use a symmetric Dirichlet prior α, though asymmetric priors can easily be applied.

Thus far we have treated the hidden variables z as observed. In fact we will compute approximate marginal probabilities Q_{z_t}(z_t), written γ_t^ℓ ≝ Q_{z_t}(z_t = ℓ). Writing ⟨x⟩_{Q_z} for the expectation of x under the distribution Q_z, we approximate,

  p_dcm(x_j; α) ≈ p_dcm(⟨x_j⟩_{Q_z}; α)

  ⟨x_j(i)⟩_{Q_z} = Σ_{{t : j ∈ y_t}} Σ_{ℓ}^{L} δ(w_t = i) δ(y_t^{(ℓ)} = j) γ_t^ℓ,

where x_j(i) indicates the count for word type i generated from segment j. In the outer sum, we consider all t that could possibly be drawn from segment j. The inner sum goes over all levels of the pyramid. The delta functions take the value one if the enclosed Boolean expression is true and zero otherwise, so we are adding the fractional counts γ_t^ℓ only when w_t = i and y_t^{(ℓ)} = j.
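Equation (1) and the fractional-count approximation are straightforward to compute; the sketch below shows one way, using log-gamma functions. The flat indexing of language models and the q[t][l] encoding of the marginals Q_{z_t} are assumptions of the sketch, not notation from the paper.

```python
from math import lgamma

def dcm_log_likelihood(x, alpha):
    """Log Dirichlet-compound-multinomial density of equation (1) for one
    segment's (possibly fractional) count vector x over W word types, with
    symmetric Dirichlet parameter alpha."""
    W = len(x)
    return (lgamma(W * alpha) - lgamma(sum(x) + W * alpha)
            + sum(lgamma(xi + alpha) - lgamma(alpha) for xi in x))

def expected_counts(tokens, y, q, j, W):
    """Expected counts <x_j> for language model j: token t contributes its
    marginal probability q[t][l] of being generated at level l whenever its
    level-l segment is model j.  tokens[t] is an integer word type and
    y[l][t] is the model index of token t's segment at level l."""
    x = [0.0] * W
    for t, w in enumerate(tokens):
        for l in y:
            if y[l][t] == j:
                x[w] += q[t][l]
    return x
```

The segmentation objective in Section 4.1 scores each candidate segment with a DCM log-likelihood of fractional counts of this kind, plus prior terms.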
3.2 Prior on segmentations

Maximizing the joint probability p(w, y) = p(w|y) p(y) leaves the term p(y) as a prior on segmentations. This prior can be used to favor segmentations with the desired granularity.

Consider a prior of the form p(y) = ∏_{ℓ=1}^{L} p(y^{(ℓ)} | y^{(ℓ-1)}); for notational convenience, we introduce a base level such that y_t^{(0)} = t, where every word is a segmentation point. At every level ℓ > 0, the prior is a Markov process, p(y^{(ℓ)} | y^{(ℓ-1)}) = ∏_{t}^{T} p(y_t^{(ℓ)} | y_{t-1}^{(ℓ)}, y^{(ℓ-1)}). The constraint y_t^{(ℓ)} ∈ {y_{t-1}^{(ℓ)}, y_{t-1}^{(ℓ)} + 1} ensures a linear segmentation at each level. To enforce hierarchical consistency, each y_t^{(ℓ)} can be a segmentation point only if t is also a segmentation point at the lower level ℓ − 1. Zero probability is assigned to segmentations that violate these constraints.

To quantify the prior probability of legal segmentations, assume a set of parameters d_ℓ, indicating the expected segment duration at each level. If t is a valid potential segmentation point at level ℓ (i.e., y_t^{(ℓ-1)} = 1 + y_{t-1}^{(ℓ-1)}), then the prior probability of a segment transition is r_ℓ = d_{ℓ-1}/d_ℓ, with d_0 = 1. If there are N segments in level ℓ and M ≥ N segments in level ℓ − 1, then the prior p(y^{(ℓ)} | y^{(ℓ-1)}) = r_ℓ^N (1 − r_ℓ)^{M−N}, as long as the hierarchical segmentation constraint is obeyed.

For the purposes of inference it will be preferable to have a prior that decomposes over levels and segments. In particular, we do not want to have to commit to a particular segmentation at level ℓ before segmenting level ℓ + 1. The above prior can be approximated by replacing M with its expectation ⟨M⟩ = T / d_{ℓ-1}. Then a single segment ranging from w_u to w_v (inclusive) will contribute log r_ℓ + ((v − u)/d_{ℓ-1}) log(1 − r_ℓ) to the log of the prior.

4 Inference

This section describes the inference for the segmentation y, the approximate marginals Q_Z, and the hyperparameter α.

4.1 Dynamic programming for hierarchical segmentation

While the model structure is reminiscent of a factorial hidden Markov model (HMM), there are important differences that prevent the direct application of HMM inference. Hidden Markov models assume that the parameters of the observation likelihood distributions are available directly, while we marginalize them out. This has the effect of introducing dependencies throughout the state space: the segment assignment for each y_t contributes to lexical counts which in turn affect the observation likelihoods for many other t. However, due to the left-to-right nature of segmentation, efficient inference of the optimal hierarchical segmentation (given the marginals Q_Z) is still possible.

Let B^{(ℓ)}[u, v] represent the log-likelihood of grouping together all contiguous words w_u . . . w_{v−1} at level ℓ of the segmentation hierarchy. Using x_t to indicate a vector of zeros with a one at the position w_t, we can express B more formally:

  B^{(ℓ)}[u, v] = log p_dcm( Σ_{t=u}^{v−1} x_t γ_t^ℓ ) + log r_ℓ + ((v − u − 1)/d_{ℓ-1}) log(1 − r_ℓ).

The last two terms are from the prior p(y), as explained in Section 3.2. The value of B^{(ℓ)}[u, v] is computed for all u, all v > u, and all ℓ. Next, we compute the log-likelihood of the optimal segmentation, which we write as A^{(L)}[0, T]. This matrix can be filled in recursively:

  A^{(ℓ)}[u, v] = max_{u ≤ t} B^{(ℓ)}[t, v] + A^{(