I am a full professor at the University of Maryland, with appointments in the Computer Science Department (tenure home), the Institute for Advanced Computer Studies, the College of Information (INFO), and the Language Science Center.

My research focuses on making machine learning more useful, more interpretable, and able to learn from and interact with humans. This helps users sift through decades of documents; discover when individuals lie, reframe, or change the topic in a conversation; or compete against humans in games based on natural language.

Book a meeting with me (collaborators and UMD students).

Recent Publications

  • Yoo Yeon Sung, Maharshi Gor, Eve Fleisig, Ishani Mondal, and Jordan Lee Boyd-Graber. ADVSCORE: A Metric for the Evaluation and Creation of Adversarial Benchmarks. Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025. [Bibtex] This was one of ten papers selected as an Outstanding Paper at NAACL 2025.
    Accessible Abstract: Adversarial datasets should validate AI robustness by presenting samples that humans handle well but models struggle with. However, as models advance, these datasets risk becoming obsolete. Assessing whether a dataset remains adversarial is challenging due to the absence of a standardized metric for adversarialness. To address this, we introduce AdvScore, a human-grounded evaluation metric that quantifies a dataset's adversarial nature by accounting for the differing abilities of models and humans while also identifying low-quality examples.
  • Wichayaporn Wongkamjan, Yanze Wang, Feng Gu, Denis Peskoff, Jonathan K. Kummerfeld, Jonathan May, and Jordan Boyd-Graber. Findings of the Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: TBD
  • Nishant Balepur, Vishakh Padmakumar, Fumeng Yang, Shi Feng, Rachel Rudinger, and Jordan Lee Boyd-Graber. Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas. Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: Language models are optimized to learn which responses you prefer, but they don't learn why you preferred a particular response. This limits their ability to tailor to personalized requests (e.g., "What should I eat for dinner? I'm vegetarian"), so we introduce a simple fix: have models infer personas that explain why users could prefer responses. We show training on these inferred personas leads to responses that are significantly more personalized for user needs.
  • Nishant Balepur, Rachel Rudinger, and Jordan Lee Boyd-Graber. Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above. Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: Most people dislike taking multiple-choice tests, so why are they the default way we evaluate NLP systems? This position paper argues that, despite its simplicity and popularity, multiple-choice evaluation is flawed, both in its format and the datasets it relies on. Drawing from educational testing theory, we propose practical fixes for these issues, helping us build evaluations that better test knowledge and reflect how humans use NLP systems.
  • Neha Srikanth, Rachel Rudinger, and Jordan Lee Boyd-Graber. No Questions are Stupid but Some are Poorly Posed: Understanding Poorly-Posed Information-Seeking Questions. Association for Computational Linguistics, 2025. [Bibtex]
  • Yoo Yeon Sung, Eve Fleisig, Yu Hope, Ishan Upadhyay, and Jordan Lee Boyd-Graber. GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration. Association for Computational Linguistics, 2025. [Bibtex]
  • Nishant Balepur, Feng Gu, Abhilasha Ravichander, Shi Feng, Jordan Boyd-Graber, and Rachel Rudinger. Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer? Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: Language models like ChatGPT are pretty good at answering questions (e.g. "What is 12 * 12?"), but we show they can surprisingly struggle when asked to do the reverse task: generating questions for answers (e.g. "Give me a question with the answer 144"). We study when these errors happen, what might be causing them, and how they can be addressed.
  • Feng Gu, Wichayaporn Wongkamjan, Jonathan K. Kummerfeld, Denis Peskoff, Jonathan May, and Jordan Boyd-Graber. Personalized Help for Optimizing Low-Skilled Users' Strategy. Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: AIs can beat humans in game environments; however, how helpful those agents are to humans remains understudied. We augment CICERO, a natural language agent that demonstrates superhuman performance in Diplomacy, to generate both move and message advice based on player intentions. A dozen Diplomacy games with novice and experienced players, under varying advice settings, show that some of the generated advice is beneficial: it helps novices compete with experienced players and, in some instances, even surpass them. The mere presence of advice can be advantageous, even if players do not follow it.
  • Ryan A. Cook, John P. Lalor, and Ahmed Abbasi. No Simple Answer to Data Complexity: An Examination of Instance-Level Complexity Metrics for Classification Tasks. Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: Instance-level complexity scores can be used for tasks such as filtering out noisy observations and subsampling informative examples. However, there exists a diverse taxonomy of complexity metrics that can be used for a classification task, making metric selection itself difficult. We examine the relationship between these metrics and find that simply storing training loss provides similar complexity rankings as other more computationally intensive techniques. Metric similarity allows us to subsample data with higher aggregate complexity along several metrics using a single a priori available meta-feature.
  • Nishant Balepur, Alexa Siu, Nedim Lipka, Franck Dernoncourt, Tong Sun, Jordan Lee Boyd-Graber, and Puneet Mathur. MoDS: Moderating a Mixture of Document Speakers to Summarize Debatable Queries in Document Collections. Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025. [Bibtex]
    Accessible Abstract: When you ask ChatGPT for advice on questions with multiple perspectives (e.g., "Is pineapple good on pizza?"), you likely want a response that fairly represents all viewpoints. We formulate this task, collect a dataset to test it, and develop MoDS—a system where multiple ChatGPT instances debate like a panel discussion—to generate balanced answers to questions based on multiple sources.
  • Benjamin Börschinger, Jordan Boyd-Graber, Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Michelle Chen Huebscher, Wojciech Gajewski, Yannic Kilcher, Rodrigo Nogueira, and Lierni Sestorain Saralegu. Meta Answering for Machine Reading. ArXiv, 2020. [Preprint] [Bibtex]
  • Pedro Rodriguez, Shi Feng, Mohit Iyyer, He He, and Jordan Boyd-Graber. Quizbowl: The Case for Incremental Question Answering. ArXiv, 2020. [Webpage] [Bibtex]