I am a full professor at the University of Maryland, with appointments in the Computer Science Department (tenure home), the Institute for Advanced Computer Studies, the College of Information (INFO), and the Language Science Center.

My research focuses on making machine learning more useful, more interpretable, and able to learn from and interact with humans. This helps users sift through decades of documents; discover when individuals lie, reframe, or change the topic in a conversation; or compete against humans in games based on natural language.

Book a meeting with me (collaborators and UMD students).

Recent Publications

  • Yoo Yeon Sung, Maharshi Gor, Eve Fleisig, Ishani Mondal, and Jordan Lee Boyd-Graber. ADVSCORE: A Metric for the Evaluation and Creation of Adversarial Benchmarks. Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025. [Bibtex] This was one of ten papers selected as an Outstanding Paper at NAACL 2025.
    Accessible Abstract: Adversarial datasets should validate AI robustness by presenting samples that humans handle well but models struggle with. However, as models advance, these datasets risk becoming obsolete. Assessing whether a dataset remains adversarial is challenging due to the absence of a standardized metric for adversarialness. To address this, we introduce AdvScore, a human-grounded evaluation metric that quantifies a dataset's adversarial nature by accounting for the differing abilities of models and humans while also identifying low-quality examples.
  • Wichayaporn Wongkamjan, Yanze Wang, Feng Gu, Denis Peskoff, Jonathan K. Kummerfeld, Jonathan May, and Jordan Boyd-Graber. Findings of the Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: TBD
  • Nishant Balepur, Vishakh Padmakumar, Fumeng Yang, Shi Feng, Rachel Rudinger, and Jordan Lee Boyd-Graber. Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas. Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: Language models are optimized to learn which responses you prefer, but they don't learn why you preferred a particular response. This limits their ability to tailor to personalized requests (e.g., "What should I eat for dinner? I'm vegetarian"), so we introduce a simple fix: have models infer personas that explain why users could prefer responses. We show training on these inferred personas leads to responses that are significantly more personalized for user needs.
  • Nishant Balepur, Rachel Rudinger, and Jordan Lee Boyd-Graber. Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above. Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: Most people dislike taking multiple-choice tests, so why are they the default way we evaluate NLP systems? This position paper argues that, despite its simplicity and popularity, multiple-choice evaluation is flawed, both in its format and the datasets it relies on. Drawing from educational testing theory, we propose practical fixes for these issues, helping us build evaluations that better test knowledge and reflect how humans use NLP systems.
  • Neha Srikanth, Rachel Rudinger, and Jordan Lee Boyd-Graber. No Questions are Stupid but Some are Poorly Posed: Understanding Poorly-Posed Information-Seeking Questions. Association for Computational Linguistics, 2025. [Bibtex]
  • Yoo Yeon Sung, Eve Fleisig, Yu Hope, Ishan Upadhyay, and Jordan Lee Boyd-Graber. GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration. Association for Computational Linguistics, 2025. [Bibtex]
  • Nishant Balepur, Feng Gu, Abhilasha Ravichander, Shi Feng, Jordan Boyd-Graber, and Rachel Rudinger. Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer? Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: Language models like ChatGPT are pretty good at answering questions (e.g. "What is 12 * 12?"), but we show they can surprisingly struggle when asked to do the reverse task: generating questions for answers (e.g. "Give me a question with the answer 144"). We study when these errors happen, what might be causing them, and how they can be addressed.
  • Feng Gu, Wichayaporn Wongkamjan, Jonathan K. Kummerfeld, Denis Peskoff, Jonathan May, and Jordan Boyd-Graber. Personalized Help for Optimizing Low-Skilled Users' Strategy. Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: AIs can beat humans in game environments; however, how helpful those agents are to humans remains understudied. We augment CICERO, a natural language agent that demonstrates superhuman performance in Diplomacy, to generate both move and message advice based on player intentions. A dozen Diplomacy games with novice and experienced players, under varying advice settings, show that some of the generated advice is beneficial: it helps novices compete with experienced players and, in some instances, even surpass them. The mere presence of advice can be advantageous, even if players do not follow it.
  • Ryan A. Cook, John P. Lalor, and Ahmed Abbasi. No Simple Answer to Data Complexity: An Examination of Instance-Level Complexity Metrics for Classification Tasks. Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: Instance-level complexity scores can be used for tasks such as filtering out noisy observations and subsampling informative examples. However, there exists a diverse taxonomy of complexity metrics that can be used for a classification task, making metric selection itself difficult. We examine the relationship between these metrics and find that simply storing training loss provides similar complexity rankings as other more computationally intensive techniques. Metric similarity allows us to subsample data with higher aggregate complexity along several metrics using a single a priori available meta-feature.
  • Nishant Balepur, Alexa Siu, Nedim Lipka, Franck Dernoncourt, Tong Sun, Jordan Lee Boyd-Graber, and Puneet Mathur. MoDS: Moderating a Mixture of Document Speakers to Summarize Debatable Queries in Document Collections. Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: When you ask ChatGPT for advice on questions with multiple perspectives (e.g., "Is pineapple good on pizza?"), you likely want a response that fairly represents all viewpoints. We formulate this task, collect a dataset to test it, and develop MoDS—a system where multiple ChatGPT instances debate like a panel discussion—to generate balanced answers to questions based on multiple sources.
  • Benjamin Börschinger, Jordan Boyd-Graber, Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Michelle Chen Huebscher, Wojciech Gajewski, Yannic Kilcher, Rodrigo Nogueira, and Lierni Sestorain Saralegu. Meta Answering for Machine Reading. ArXiv, 2020. [Preprint] [Bibtex]
  • Pedro Rodriguez, Shi Feng, Mohit Iyyer, He He, and Jordan Boyd-Graber. Quizbowl: The Case for Incremental Question Answering. ArXiv, 2020. [Webpage] [Bibtex]