I am a full professor in the University of Maryland Computer Science Department (tenure home), Institute for Advanced Computer Studies, INFO, and Language Science Center.

My research focuses on making machine learning more useful, more interpretable, and able to learn from and interact with humans. This helps users sift through decades of documents; discover when individuals lie, reframe, or change the topic in a conversation; or compete against humans in games that are based in natural language.

Book a meeting with me (collaborators and UMD students).

Recent Publications

  • Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, and Dong Yu. Self-Rewarding Vision-Language Model via Reasoning Decomposition. International Conference on Learning Representations, 2026. [Bibtex]
  • Feng Gu, Zongxia Li, Carlos R. Colon, Benjamin Evans, Ishani Mondal, and Jordan Boyd-Graber. Large Language Models Are Effective Human Annotation Assistants, But Not Good Independent Annotators. Findings of the Association for Computational Linguistics, 2026. [Arxiv] [Bibtex]
    Accessible Abstract: Event annotation is important for identifying, monitoring, and understanding sociological trends. Although expert annotators set the gold standard, they are expensive and inefficient. While state-of-the-art NLP models are an attractive alternative, they are often evaluated on standalone subtasks rather than entire workflows. Thus, we evaluate a holistic workflow that summarizes news with event coreference resolution and argument extraction in three modes: AI-only, AI-assisted, and human-only. Although AI's recall is seven times higher than the tf-idf baseline at coreference resolution, it is far from replacing experts. However, experts adopt AI-extracted arguments 60% of the time, reducing extraction time by 25%.
  • Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar, Liam Dorn, Ahmed Haj Ahmed, and Jordan Boyd-Graber. AUDITA: A New Dataset to Audit Humans or AI is Better at Audio QA. Findings of the Association for Computational Linguistics, 2026. [Bibtex]
    Accessible Abstract: We do a lot of evaluation of how well AI can answer questions, but what if it has to listen to the question? While other datasets measure this, they are often overly simplistic: they don't measure reasoning or what humans care about. Our new dataset, AUDITA, harvests difficult questions from the web. We then ask humans to answer both existing Audio QA questions and our new questions: this new dataset is much harder, and existing audio models struggle on it.
  • Maharshi Gor, Yoo Yeon Sung, Yu Hou, Eve Fleisig, Zhu Irene Ying, Tianyi Zhou, and Jordan Boyd-Graber. AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering? Findings of the Association for Computational Linguistics, 2026. [Bibtex]
  • Ishani Mondal, Meera Bharadwaj, Ayush Roy, Aparna Garimella, and Jordan Boyd-Graber. SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity. European Association for Computational Linguistics, 2026. [Bibtex]
  • HyoJung Han, Nishant Balepur, Jordan Boyd-Graber, and Marine Carpuat. Measuring User's Mental Models of Speech Translation in Human-MT Collaboration. Association for Computational Linguistics, 2026. [Bibtex]
  • Nishant Balepur, Malachi Hamada, Varsha Kishore, Sergey Feldman, Amanpreet Singh, Pao Siangliulue, Joseph Chee Chang, Eunsol Choi, Jordan Boyd-Graber, and Aakanksha Naik. Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users. Association for Computational Linguistics, 2026. [Bibtex]
    Accessible Abstract: Deep Research systems help scientists discover more relevant research papers, but existing tools have no understanding of their users. We design MyScholarQA, the first personalized deep research system that learns from a researcher's interests to suggest more relevant papers. We evaluate our system with a mix of offline evaluations, using LLMs that simulate users, and online interviews, ultimately showing that LLMs cannot replace the insights gained from speaking with real humans.
  • Nishant Balepur, Bhavya Rajasekaran, Hyunjin Jane Oh, Michael Xie, Atrey Desai, Vipul Gupta, Steven James Moore, Eunsol Choi, Rachel Rudinger, and Jordan Boyd-Graber. BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks. Association for Computational Linguistics, 2026. [Bibtex]
    Accessible Abstract: Multiple-choice questions are a standard way to evaluate NLP systems, but they are riddled with flaws that limit their validity. Extending our previous position paper, we draw on educational testing theory to design BenchMarker, a toolkit that detects faulty MCQs that exist on the Internet, have guessable shortcuts, or have writing issues that confuse students and LLMs. We show how BenchMarker can detect and help fix flaws in NLP benchmarks.
  • Benjamin Börschinger, Jordan Boyd-Graber, Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Michelle Chen Huebscher, Wojciech Gajewski, Yannic Kilcher, Rodrigo Nogueira, and Lierni Sestorain Saralegu. Meta Answering for Machine Reading. ArXiv, 2020. [Preprint] [Bibtex]
  • Pedro Rodriguez, Shi Feng, Mohit Iyyer, He He, and Jordan Boyd-Graber. Quizbowl: The Case for Incremental Question Answering. ArXiv, 2020. [Webpage] [Bibtex]
Jordan Boyd-Graber