I am a full professor at the University of Maryland in the Computer Science Department (tenure home), the Institute for Advanced Computer Studies, INFO, and the Language Science Center.

My research focuses on making machine learning more useful, more interpretable, and able to learn from and interact with humans. This helps users sift through decades of documents; discover when individuals lie, reframe, or change the topic in a conversation; or compete against humans in games based on natural language.

Book a meeting with me (collaborators and UMD students).

Recent Publications

  • Yoo Yeon Sung, Maharshi Gor, Eve Fleisig, Ishani Mondal, and Jordan Lee Boyd-Graber. ADVSCORE: A Metric for the Evaluation and Creation of Adversarial Benchmarks. Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025. [Bibtex] This was one of ten papers selected as an Outstanding Paper at NAACL 2025.
    Accessible Abstract: Adversarial datasets should validate AI robustness by presenting samples that humans handle well but models struggle with. However, as models advance, these datasets risk becoming obsolete. Assessing whether a dataset remains adversarial is challenging due to the absence of a standardized metric for adversarialness. To address this, we introduce AdvScore, a human-grounded evaluation metric that quantifies a dataset's adversarial nature by accounting for the differing abilities of models and humans while also identifying low-quality examples. A toy sketch of this intuition appears after the publication list.
  • Feng Gu, Wichayaporn Wongkamjan, Jonathan K. Kummerfeld, Denis Peskoff, Jonathan May, and Jordan Boyd-Graber. Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL. Findings of the Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: When determining whether an offer sounds "too good to be true", it helps to consider what the person sending the message has to gain. When we provide this information to classifiers tasked with determining if a message is deceptive in the online game of Diplomacy, it dramatically improves their ability to detect deception. A small sketch of this feature-augmentation idea appears after the publication list.
  • Nishant Balepur, Vishakh Padmakumar, Fumeng Yang, Shi Feng, Rachel Rudinger, and Jordan Lee Boyd-Graber. Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas. Association for Computational Linguistics, 2025. [Code/Data] [Bibtex]
    Accessible Abstract: Language models are optimized to learn which responses you prefer, but they don't learn why you preferred a particular response. This limits their ability to tailor responses to personalized requests (e.g., "What should I eat for dinner? I'm vegetarian"), so we introduce a simple fix: have models infer personas that explain why users could prefer responses. We show that training on these inferred personas leads to responses that are significantly more personalized to user needs. A minimal sketch of this data construction appears after the publication list.
  • Nishant Balepur, Rachel Rudinger, and Jordan Boyd-Graber. Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above. Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: Most people dislike taking multiple-choice tests, so why are they the default way we evaluate NLP systems? This position paper argues that, despite its simplicity and popularity, multiple-choice evaluation is flawed, both in its format and the datasets it relies on. Drawing from educational testing theory, we propose practical fixes for these issues, helping us build evaluations that better test knowledge and reflect how humans use NLP systems.
  • Alexander Hoyle, Lorena Calvo-Bartolomé, Jordan Lee Boyd-Graber, and Philip Resnik. ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering. Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: Topic models are tools to help people navigate large document collections. However, testing whether a topic model is good or not is notoriously hard: it's subjective and requires asking real people whether the outputs make sense. We show that a language model can recreate those human answers, correlating better with ground truth than previous evaluations.
  • Neha Punklik Srikanth, Rachel Rudinger, and Jordan Lee Boyd-Graber. No Questions are Stupid but Some are Poorly Posed: Understanding Poorly-Posed Information-Seeking Questions. Association for Computational Linguistics, 2025. [Code/Data] [Bibtex]
    Accessible Abstract: Often, the questions users ask search engines or chatbots aren't perfect: they have errors, are vague, or lack context. Humans are able to deftly navigate these issues, but computers still struggle. We analyze the differences in how humans and computers repair imperfect questions to suggest how to improve AI's question answering abilities.
  • Zongxia Li, Lorena Calvo-Bartolomé, Alexander Miserlis Hoyle, Paiheng Xu, Daniel Kofi Stephens, Juan Francisco Fung, Alden Dima, and Jordan Boyd-Graber. LLMs Struggle to Describe the Haystack without Human Help: A Social Science-Inspired Evaluation of Topic Models. Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: Understanding large document collections is what topic models are built for, but evaluating how well automated methods describe such collections is hard. Drawing on social-science practice, we evaluate LLM-based approaches and find that, without human help, they struggle to describe the "haystack" of documents.
  • Yoo Yeon Sung, Eve Fleisig, Yu Hou, Ishan Upadhyay, and Jordan Boyd-Graber. GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration. Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: As AI use becomes more common, it's important to measure not just whether the systems are correct but whether they know when they're incorrect. We propose a new metric to measure this mismatch between correctness and confidence, compare computer ability with human ability, and show that computers have a long way to go before they're well-calibrated. A short calibration-error sketch illustrating this kind of mismatch appears after the publication list.
  • Nishant Balepur, Feng Gu, Abhilasha Ravichander, Shi Feng, Jordan Boyd-Graber, and Rachel Rudinger. Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer? Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: Language models like ChatGPT are pretty good at answering questions (e.g. "What is 12 * 12?"), but we show they can surprisingly struggle when asked to do the reverse task: generating questions for answers (e.g. "Give me a question with the answer 144"). We study when these errors happen, what might be causing them, and how they can be addressed.
  • Feng Gu, Wichayaporn Wongkamjan, Jonathan K. Kummerfeld, Denis Peskoff, Jonathan May, and Jordan Boyd-Graber. Personalized Help for Optimizing Low-Skilled Users' Strategy. Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: AIs can beat humans in game environments; however, how helpful those agents are to humans remains understudied. We augment CICERO, a natural language agent that demonstrates superhuman performance in Diplomacy, to generate both move and message advice based on player intentions. A dozen Diplomacy games with novice and experienced players, under varying advice settings, show that some of the generated advice is beneficial: it helps novices compete with experienced players and in some instances even surpass them. The mere presence of advice can be advantageous, even if players do not follow it.
  • Nishant Balepur, Alexa Siu, Nedim Lipka, Franck Dernoncourt, Tong Sun, Jordan Lee Boyd-Graber, and Puneet Mathur. MoDS: Moderating a Mixture of Document Speakers to Summarize Debatable Queries in Document Collections. Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: When you ask ChatGPT for advice on questions with multiple perspectives (e.g. "Is pineapple good on pizza?"), you likely want a response that fairly represents all viewpoints. We formulate this task, collect a dataset to test it, and develop MoDS, a system where multiple ChatGPTs debate like a panel discussion, to generate balanced answers for questions based on multiple sources. A rough sketch of this moderator-and-speakers control flow appears after the publication list.
  • Benjamin Börschinger, Jordan Boyd-Graber, Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Michelle Chen Huebscher, Wojciech Gajewski, Yannic Kilcher, Rodrigo Nogueira, and Lierni Sestorain Saralegu. Meta Answering for Machine Reading. ArXiv, 2020. [Preprint] [Bibtex]
  • Pedro Rodriguez, Shi Feng, Mohit Iyyer, He He, and Jordan Boyd-Graber. Quizbowl: The Case for Incremental Question Answering. ArXiv, 2020. [Webpage] [Bibtex]
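
Illustrative Sketches

A few of the ideas above lend themselves to short illustrative sketches (Python, written for this page, not taken from the papers). For ADVSCORE, the intuition is that an item is adversarial when humans handle it well but models do not; the toy score below just averages the per-item gap between human and model accuracy on made-up data, whereas the published metric is more sophisticated and also accounts for example quality.

```python
# Toy illustration only: score a benchmark's "adversarialness" as the average
# per-item gap between human and model accuracy. This is NOT the published
# AdvScore formula, which also models ability and example quality.

def toy_adversarial_score(human_acc: dict, model_acc: dict) -> float:
    """Average per-item (human accuracy - model accuracy)."""
    items = human_acc.keys() & model_acc.keys()
    gaps = [human_acc[i] - model_acc[i] for i in items]
    return sum(gaps) / len(gaps)

# Hypothetical per-item accuracies (fraction of humans / models answering correctly).
human_acc = {"q1": 0.90, "q2": 0.80, "q3": 0.95}
model_acc = {"q1": 0.20, "q2": 0.90, "q3": 0.10}

print(f"toy adversarialness: {toy_adversarial_score(human_acc, model_acc):.2f}")
```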
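
For the deception work, the key move is to give the classifier a signal for what the sender stands to gain if the message is believed. The sketch below fakes that signal with a single hand-made feature and a scikit-learn logistic regression (assuming scikit-learn is installed); the paper's actual signal comes from counterfactual RL over Diplomacy game states and is not reproduced here.

```python
# Minimal sketch: add a "what the sender gains if believed" feature to a
# deception classifier. Features and labels here are made up.
from sklearn.linear_model import LogisticRegression

# Each row: [message length, number of promises, sender's gain if believed]
X = [
    [40, 2, 0.9],   # sender gains a lot if the message is believed
    [35, 1, 0.1],
    [60, 3, 0.8],
    [20, 0, 0.0],
]
y = [1, 0, 1, 0]    # 1 = deceptive, 0 = truthful (toy labels)

clf = LogisticRegression().fit(X, y)
print(clf.predict([[50, 2, 0.85]]))  # messages with high sender gain look suspicious
```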
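
For the persona paper, the recipe is to infer a persona that explains a preference and then train on preference pairs that include it. Below is a minimal data-construction sketch, with infer_persona as a hypothetical stand-in for the LLM prompt that produces the explanation.

```python
# Sketch of building persona-augmented preference-tuning examples.
# `infer_persona` is a hypothetical stand-in for an LLM prompt that explains
# why the chosen response beat the rejected one; here it returns a canned string.

def infer_persona(prompt: str, chosen: str, rejected: str) -> str:
    return "The user is vegetarian and wants quick weeknight meals."

def build_example(prompt: str, chosen: str, rejected: str) -> dict:
    persona = infer_persona(prompt, chosen, rejected)
    return {
        # Prepend the inferred persona so training sees *why* this response
        # was preferred, not just *that* it was preferred.
        "prompt": f"{persona}\n\n{prompt}",
        "chosen": chosen,
        "rejected": rejected,
    }

example = build_example(
    "What should I eat for dinner?",
    "A 20-minute chickpea curry.",
    "Grill a steak.",
)
print(example["prompt"])
```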
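
For GRACE, the quantity of interest is the mismatch between how confident a system is and how often it is right. The sketch below computes a standard expected calibration error on made-up predictions; it is a generic stand-in for that idea, not the paper's granular, human-grounded metric.

```python
# Standard expected calibration error (ECE) on made-up predictions: bin answers
# by confidence and compare average confidence to accuracy in each bin.
# This is a generic calibration measure, not the GRACE metric itself.

def expected_calibration_error(confidences, correct, n_bins=5):
    ece, n = 0.0, len(confidences)
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(avg_conf - accuracy)
    return ece

confidences = [0.95, 0.90, 0.80, 0.60, 0.55]   # how sure the system said it was
correct     = [1,    0,    1,    0,    1   ]   # whether it was actually right
print(f"ECE: {expected_calibration_error(confidences, correct):.3f}")
```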
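
For MoDS, the architecture is a moderator that orchestrates one "speaker" per source document and merges their views into a balanced answer. The rough control-flow sketch below uses ask_llm as a hypothetical placeholder for a real LLM API; it only echoes prompts so the script runs as-is.

```python
# Rough control-flow sketch of a moderator over per-document "speakers".
# `ask_llm` is a hypothetical placeholder for a real LLM API call; it just
# echoes its prompt so the script runs end to end.

def ask_llm(prompt: str) -> str:
    return f"[model response to: {prompt[:60]}...]"

def answer_debatable_query(query: str, documents: list[str]) -> str:
    # One speaker per source document argues that document's perspective.
    speaker_views = [
        ask_llm(f"Using only this document, answer '{query}':\n{doc}")
        for doc in documents
    ]
    # The moderator merges the speakers' views into one balanced summary.
    panel = "\n".join(f"Speaker {i + 1}: {view}" for i, view in enumerate(speaker_views))
    return ask_llm(f"Summarize all viewpoints on '{query}' fairly:\n{panel}")

docs = ["Pineapple adds sweetness to pizza.", "Pineapple ruins the savory balance."]
print(answer_debatable_query("Is pineapple good on pizza?", docs))
```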