You enter a
dark forest. Standing in front of you is:
A professor named Hal Daumé III (he/him).
He wields appointments in
Computer Science where he is a
Perotto Professor, as well as
Language Science at
UMD (in Fall 2019 he taught Computational Linguistics I); he also
spends time in the machine learning and fairness
groups at Microsoft Research NYC.
He and his wonderful advisees
like to study
questions related to how to get machines to become more adept at
human language (and artificial intelligence tasks more broadly),
by developing models and algorithms that allow them
to learn from data. (Keywords: natural language processing and machine
learning.)
The two major questions that really drive their research these days are:
(1) how can we get computers to learn
through natural interaction with people/users?
and (2) how can we do this in a way that minimizes harms
in the learned models?
He's discussed interactive learning informally in a Talking Machines Podcast
and more technically in recent talks;
and has discussed fairness/bias in broad terms in a (now somewhat outdated) blog post.
He is the author of the online textbook A Course in Machine Learning,
which is fully open source.
Hal is super fortunate to have awesome colleagues in the Computational
Linguistics and Information Processing Lab (which he formerly
directed) and the Center for Machine Learning.
If you want to contact him, email is your best bet; you can
also find him as @haldaume3 on Twitter, or in person in his office
(IRB 4150).
If you're a prospective grad student or grad applicant, please read
his FAQ to answer some common questions.
If you're thinking of inviting him for a talk or event, please ensure
that the event is organized in an inclusive manner (inclusion rider).
More generally, if you are organizing a conference, workshop or other
event, you may wish to read the NeurIPS D&I survey
results (joint with Katherine Heller),
Humberto Corona's collection of resources/advice,
or two blog posts on this topic.
I acknowledge that I live and work on the ancestral and unceded lands of the Piscataway People, who were among the first in the Western Hemisphere to encounter European colonists, as well as the lands of the Lenape and Nacotchtank people.
Recent Publications:
Operationalizing the Legal Principle of Data Minimization for Personalization
Asia J. Biega, Peter Potash, Hal Daumé III, Fernando Diaz and Michèle Finck
Conference on Research and Developments in Information Retrieval (SIGIR), 2020
[Abstract] [BibTeX]
Article 5(1)(c) of the European Union’s General Data Protection Regulation (GDPR) requires that "personal data shall be [...] adequate, relevant, and limited to what is necessary in relation to the purposes for which they are processed (`data minimisation')". To date, the legal and computational definitions of `purpose limitation' and `data minimization' remain largely unclear. In particular, the interpretation of these principles is an open issue for information access systems that optimize for user experience through personalization and do not strictly require personal data collection for the delivery of basic service. In this paper, we identify a lack of a homogeneous interpretation of the data minimization principle and explore two operational definitions applicable in the context of personalization. The focus of our empirical study in the domain of recommender systems is on providing foundational insights about the (i) feasibility of different data minimization definitions, (ii) robustness of different recommendation algorithms to minimization, and (iii) performance of different minimization strategies. We find that the performance decrease incurred by data minimization might not be substantial, but that it might disparately impact different users--a finding which has implications for the viability of different formal minimization definitions. Overall, our analysis uncovers the complexities of the data minimization problem in the context of personalization and maps the remaining computational and regulatory challenges.
@inproceedings{daume20minimization,
title = {Operationalizing the Legal Principle of Data Minimization for
Personalization},
author = {Asia J. Biega and Peter Potash and Daum\'e, III, Hal and Fernando Diaz
and Mich\`ele Finck},
booktitle = {Proceedings of the Conference on Research and Developments in
Information Retrieval (SIGIR)},
year = {2020},
url = {http://hal3.name/docs/#daume20minimization},
}
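(For readers curious what a "data minimization strategy" could look like in code, here is a purely illustrative Python sketch. The popularity-based recommender and the "keep only each user's k most recent interactions" rule are assumptions made for this example, not the definitions or systems studied in the paper.)

# Illustrative only: one hypothetical minimization strategy for a toy recommender.
from collections import Counter

def recommend(histories, n=3):
    """Recommend the n globally most popular items a user has not yet seen."""
    popularity = Counter(item for h in histories.values() for item in h)
    ranked = [item for item, _ in popularity.most_common()]
    return {user: [i for i in ranked if i not in h][:n]
            for user, h in histories.items()}

def minimize(histories, k):
    """Hypothetical minimization strategy: keep each user's k most recent interactions."""
    return {user: h[-k:] for user, h in histories.items()}

# Toy comparison: how many recommendations survive after minimizing the data?
histories = {"u1": ["a", "b", "c", "d"], "u2": ["b", "c", "e"], "u3": ["c", "f"]}
full = recommend(histories)
reduced = recommend(minimize(histories, k=2))
preserved = sum(len(set(full[u]) & set(reduced[u])) for u in histories)
print("recommendations preserved after minimization:", preserved)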
Toward Gender-Inclusive Coreference Resolution
Yang Trista Cao and Hal Daumé III
Conference of the Association for Computational Linguistics (ACL), 2020
[Abstract] [BibTeX]
Correctly resolving textual mentions of people fundamentally entails making inferences about those people. Such inferences raise the risk of systemic biases in coreference resolution systems, including biases that can harm binary and non-binary trans and cis stakeholders. To better understand such biases, we foreground nuanced conceptualizations of gender from sociology and sociolinguistics, and develop two new datasets for interrogating bias in crowd annotations and in existing coreference resolution systems. Through these studies, conducted on English text, we confirm that without acknowledging and building systems that recognize the complexity of gender, we build systems that lead to many potential harms.
@inproceedings{daume20gicoref,
title = {Toward Gender-Inclusive Coreference Resolution},
author = {Yang Trista Cao and Daum\'e, III, Hal},
booktitle = {Proceedings of the Conference of the Association for Computational
Linguistics (ACL)},
year = {2020},
url = {http://hal3.name/docs/#daume20gicoref},
}
Language (technology) is Power: A Critical Survey of "Bias" in NLP
Su Lin Blodgett, Solon Barocas, Hal Daumé III and Hanna Wallach
Conference of the Association for Computational Linguistics (ACL), 2020
[Abstract] [BibTeX]
We survey 146 papers analyzing "bias" in NLP systems, finding that their motivations are often vague, inconsistent, and lacking in normative reasoning, despite the fact that analyzing "bias" is an inherently normative process. We further find that these papers’ proposed quantitative techniques for measuring or mitigating "bias" are poorly matched to their motivations and do not engage with the relevant literature outside of NLP. Based on these findings, we describe the beginnings of a path forward by proposing three recommendations that should guide work analyzing "bias" in NLP systems. These recommendations rest on a greater recognition of the relationships between language and social hierarchies, encouraging researchers and practitioners to articulate their conceptualizations of "bias"--i.e., what kinds of system behaviors are harmful, in what ways, to whom, and why, as well as the normative reasoning underlying these statements--and to center work around the lived experiences of members of communities affected by NLP systems, while interrogating and reimagining the power relations between technologists and such communities.
@inproceedings{daume20power,
title = {Language (technology) is Power: A Critical Survey of ``Bias'' in NLP},
author = {Su Lin Blodgett and Solon Barocas and Daum\'e, III, Hal and Hanna
Wallach},
booktitle = {Proceedings of the Conference of the Association for Computational
Linguistics (ACL)},
year = {2020},
url = {http://hal3.name/docs/#daume20power},
}
Active Imitation Learning with Noisy Guidance
Kianté Brantley, Amr Sharaf and Hal Daumé III
Conference of the Association for Computational Linguistics (ACL), 2020
[BibTeX]
@inproceedings{daume20active,
title = {Active Imitation Learning with Noisy Guidance},
author = {Kiant\'e Brantley and Amr Sharaf and Daum\'e, III, Hal},
booktitle = {Proceedings of the Conference of the Association for Computational
Linguistics (ACL)},
year = {2020},
url = {http://hal3.name/docs/#daume20active},
}
Global Voices: Crossing Borders in Automatic News Summarization
Khanh Nguyen and Hal Daumé III
EMNLP Summarization Workshop, 2019
[Abstract] [BibTeX]
We construct Global Voices, a multilingual dataset for evaluating cross-lingual summarization methods. We extract social-network descriptions of Global Voices news articles to cheaply collect evaluation data for into-English and from-English summarization in 15 languages. In particular, for the into-English summarization task, we crowd-source a high-quality evaluation dataset based on guidelines that emphasize accuracy, coverage, and understandability. To ensure the quality of this dataset, we collect human ratings to filter out bad summaries, and conduct a human survey, which shows that the remaining summaries are preferred over the social-network summaries. We study the effect of translation quality in cross-lingual summarization, comparing a translate-then-summarize approach with several baselines. Our results highlight the limitations of the ROUGE metric that are overlooked in monolingual summarization.
@inproceedings{daume19global,
title = {Global Voices: Crossing Borders in Automatic News Summarization},
author = {Khanh Nguyen and Daum\'e, III, Hal},
booktitle = {EMNLP Summarization Workshop},
year = {2019},
url = {http://hal3.name/docs/#daume19global},
}
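(Also purely illustrative: the ROUGE metric whose blind spots this paper points out is, at its simplest, n-gram overlap. A toy unigram ROUGE-1 recall in Python might look like the sketch below; real evaluations use full ROUGE implementations with stemming and several n-gram variants.)

# Illustrative only: minimal unigram ROUGE-1 recall between two summaries.
from collections import Counter

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also appear in the candidate summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

# Made-up example: heavy word overlap yields a high score whether or not the
# candidate is actually faithful or readable -- one of ROUGE's known limitations.
print(rouge1_recall(
    "protests erupted in the capital over new election laws",
    "large protests over election laws erupted in the capital"))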
More papers please!
Recent Talks:
(Meta-)Learning from Interaction
NYU Machine Learning Reading Group 2019
[PDF]
[ODP]
Out of Order! Flexible neural language generation
NAACL 2019 NeuralGen Workshop
[PPTx]
Beyond demonstrations: Learning behavior from higher-level supervision
ICML 2019 I3 Workshop
[PPTx]
Imitation Learning
Vector Institute Reinforcement Learning Summer School 2018
[PDF]
[ODP]
Learning language through interaction
December 2016, Georgetown, Amazon, USC, GATech, UW, ...
[PDF]
[ODP]
[Video]
Bias in AI
November 2016, UMD MCWIC Diversity Summit
[PDF]
[ODP]
[PPTx (exported)]
[Blog Post]
More talks please!
Contact information:
email: me AT hal3 DOT name
phone: 301-405-1073
office: IRB 4150
skype: haldaume3
twitter: haldaume3
github: hal3
I can't reply to every email from
prospective students; please
read this before emailing me.
credits: design and font inspired by Seth Able's LoRD, some images converted to ANSI using ManyTools, original drawing of me by anonymous.
last updated on eight june, two thousand twenty.