Overview
Project Team
Publications
Software
Project funded by the National Science Foundation (IIS-2403436)
PI: Jordan Boyd-Graber,
Overview
Tens of millions of Americans interact with AI tools to find information, answer questions, or get help solving problems. One key drawback of these systems is a lack of personalization: because modern AI systems do not know whom they are talking to, they can only give generic answers to user questions. But the answer to the question “why is the sky blue?” should be different if the person asking is a college student or a young child. This project aims to enable an AI model to provide more appropriate responses to users depending on their unique backgrounds, experiences, and needs. It will first gather a diverse dataset to characterize what kinds of responses different people prefer. The project will then use these data to develop AI systems that can tailor their answers to individual users, as well as evaluate how well those systems personalize their responses. To achieve this personalization, the AI systems will learn to explicitly represent the kind of person they are talking to, based on that person's background or previous interactions, and then use this representation to generate an appropriate response. This project will result in AIs that can provide personalized, specific responses based on the person asking the question, as well as resources that will help others personalize AIs. These resources will include datasets of personalized questions and answers; interfaces and visualizations to understand why an AI provides certain responses over others; interviews and discussions with community members to understand their needs; and code and models that will allow others to build, train, and deploy personalized AI systems.
While large language models (LLMs) trained on massive datasets have shown impressive performance on a variety of tasks, they still exhibit biases and struggle to be equally useful for everyone. Although initially pre-trained on a language modeling objective, most LLMs are further fine-tuned to align their outputs with human preferences. However, existing techniques assume a “one size fits all” approach, ignoring diversity in user needs. This project will first construct probes to detect cases where models fail to adapt to the diverse needs of different users. Then, this project will develop Personalized Feedback for Diverse Populations (PFDP) to identify when models should be sensitive to the unique needs, knowledge, and background of users by examining the training trajectory of models and comparing models' answers to human preferences. PFDP will enable the development of models that can detect examples that are difficult for computers but not for humans, explain why such disparities in difficulty exist, and represent users’ needs and preferences within the model. To correct those shortcomings in the data, we focus on data curation: we propose techniques that, with a human in the loop, automatically create adversarial prompt and response pairs, including new examples that ask questions about under-represented groups or require targeted responses. Finally, with these new data, we develop techniques that allow modern architectures to make the most of these difficult (but few) examples. These techniques will allow fine-tuning LLMs with a small curated subset of data, producing models that are robust to variations in prompts and generate acceptable answers for a diverse population of users.
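A central step in PFDP is noticing where a model's own preferences diverge from the preferences a human expressed. The minimal sketch below, assuming a HuggingFace causal LM and a toy preference pair (the model name, example data, and threshold are placeholders, not the project's code), flags pairs where the model assigns higher likelihood to the response the human rejected; such disagreements are natural candidates for the human-in-the-loop curation described above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; substitute the LLM being audited
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def sequence_logprob(prompt: str, response: str) -> float:
    """Sum of token log-probabilities the model assigns to `response` given `prompt`
    (tokenization boundary effects are ignored in this sketch)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # prediction for each next token
    targets = full_ids[0, 1:]
    token_scores = log_probs[torch.arange(targets.numel()), targets]
    return token_scores[prompt_len - 1:].sum().item()       # keep only response tokens

# Hypothetical preference pair: (prompt, response the user chose, response the user rejected).
pairs = [
    ("Why is the sky blue? (asked by a young child)",
     "Sunlight bounces around in the air, and the blue light bounces the most!",
     "Rayleigh scattering attenuates shorter wavelengths proportionally to the inverse fourth power of wavelength."),
]

for prompt, chosen, rejected in pairs:
    margin = sequence_logprob(prompt, chosen) - sequence_logprob(prompt, rejected)
    if margin < 0:  # the model "prefers" the answer the human rejected
        print(f"Disagreement (margin={margin:.2f}): {prompt!r} -> route to human curation")
```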
Project Team
- Jordan Boyd-Graber, Assistant Professor, Computer Science (Maryland)
- Alvin Grissom II, Associate Professor, Haverford College
- Robin Jia, Assistant Professor, University of Southern California
- John P. Lalor, Assistant Professor, University of Notre Dame
- Swabha Swayamdipta, Assistant Professor, University of Southern California
- Nishant Balepur, PhD Student, University of Maryland
- Maharshi Gor, PhD Student, University of Maryland
- John Kanu, PhD Student, University of Maryland
- Yoo Yeon Sung, PhD Student, University of Maryland
- Ryan Cook, PhD Student, University of Notre Dame
Publications
@inproceedings{Cook:Lalor:Abbasi-2025,
  Title = {No Simple Answer to Data Complexity: An Examination of Instance-Level Complexity Metrics for Classification Tasks},
  Author = {Ryan A Cook and John P Lalor and Ahmed Abbasi},
  Booktitle = {Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics},
  Year = {2025},
  Location = {Albuquerque},
  Url = {http://cs.umd.edu/~jbg//docs/2025_naacl_answercomplexity.pdf},
}
Accessible Abstract: Instance-level complexity scores can be used for tasks such as filtering out noisy observations and subsampling informative examples. However, there exists a diverse taxonomy of complexity metrics that can be used for a classification task, making metric selection itself difficult. We examine the relationship between these metrics and find that simply storing training loss provides similar complexity rankings as other more computationally intensive techniques. Metric similarity allows us to subsample data with higher aggregate complexity along several metrics using a single a priori available meta-feature.
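A minimal sketch of the paper's cheapest signal, assuming a generic PyTorch classifier and a data loader that also yields example ids (the model, loader, and the one-quarter retention rate are illustrative choices, not the paper's setup): record each example's loss during training, then rank instances by how hard they stayed.

```python
import torch
from torch import nn
from collections import defaultdict

def train_and_rank(model, loader, epochs=3, lr=1e-3):
    """`loader` yields (example_ids, inputs, labels); returns example ids sorted hardest-first."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss(reduction="none")  # keep one loss value per example
    history = defaultdict(list)
    for _ in range(epochs):
        for ids, x, y in loader:
            losses = loss_fn(model(x), y)
            for i, l in zip(ids.tolist(), losses.detach().tolist()):
                history[i].append(l)                 # the stored training-loss signal
            losses.mean().backward()
            opt.step()
            opt.zero_grad()
    # Higher mean training loss ~ higher instance-level complexity.
    return sorted(history, key=lambda i: -sum(history[i]) / len(history[i]))

# e.g. keep the hardest quarter of the data as a complexity-weighted subsample:
# hard_ids = train_and_rank(model, loader)[: num_examples // 4]
```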
@inproceedings{Balepur:Gu:Ravichander:Feng:Boyd-Graber:Rudinger-2025,
  Title = {Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?},
  Author = {Nishant Balepur and Feng Gu and Abhilasha Ravichander and Shi Feng and Jordan Boyd-Graber and Rachel Rudinger},
  Booktitle = {Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics},
  Year = {2025},
  Location = {Albuquerque},
  Url = {http://cs.umd.edu/~jbg//docs/2025_naacl_reverseqa.pdf},
}
Accessible Abstract: Language models like ChatGPT are pretty good at answering questions (e.g. "What is 12 * 12?"), but we show they can surprisingly struggle when asked to do the reverse task: generating questions for answers (e.g. "Give me a question with the answer 144"). We study when these errors happen, what might be causing them, and how they can be addressed.
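A sketch of the round-trip probe the abstract describes: have the model write a question for a target answer, then check whether the same model can answer its own question. The `llm` helper is a hypothetical stand-in for any text-in, text-out model call, and exact-match consistency is a deliberately crude check.

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to whatever language model is being probed."""
    raise NotImplementedError("plug in a model call here")

def reverse_qa_probe(answer: str) -> dict:
    """Generate a question for `answer`, then test whether the model recovers `answer`."""
    question = llm(f"Write a trivia question whose answer is exactly: {answer}")
    round_trip = llm(f"Answer with a single value only.\nQuestion: {question}")
    return {
        "answer": answer,
        "generated_question": question,
        "round_trip_answer": round_trip,
        "consistent": round_trip.strip().lower() == answer.strip().lower(),
    }

# reverse_qa_probe("144") surfaces the failure mode studied in the paper: a model
# that writes a question it then cannot answer correctly.
```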
@inproceedings{Balepur:Rudinger:Boyd-Graber-2025,
  Title = {Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above},
  Author = {Nishant Balepur and Rachel Rudinger and Jordan Lee Boyd-Graber},
  Booktitle = {Association for Computational Linguistics},
  Location = {Vienna, Austria},
  Year = {2025},
  Url = {http://cs.umd.edu/~jbg//docs/2025_acl_mcqa_bad.pdf},
}
Accessible Abstract: Most people dislike taking multiple-choice tests, so why are they the default way we evaluate NLP systems? This position paper argues that, despite its simplicity and popularity, multiple-choice evaluation is flawed, both in its format and the datasets it relies on. Drawing from educational testing theory, we propose practical fixes for these issues, helping us build evaluations that better test knowledge and reflect how humans use NLP systems.
@inproceedings{Balepur:Padmakumar:Yang:Feng:Rudinger:Boyd-Graber-2025,
  Title = {Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas},
  Author = {Nishant Balepur and Vishakh Padmakumar and Fumeng Yang and Shi Feng and Rachel Rudinger and Jordan Lee Boyd-Graber},
  Booktitle = {Association for Computational Linguistics},
  Location = {Vienna, Austria},
  Year = {2025},
  Url = {http://cs.umd.edu/~jbg//docs/2025_acl_boat.pdf},
}
Accessible Abstract: Language models are optimized to learn which responses you prefer, but they don't learn why you preferred a particular response. This limits their ability to tailor responses to personalized requests (e.g., "What should I eat for dinner? I'm vegetarian"), so we introduce a simple fix: have models infer personas that explain why users could prefer responses. We show that training on these inferred personas leads to responses that are significantly more personalized for user needs.
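A sketch of the general recipe (not the paper's released pipeline): infer a one-sentence persona that would explain the observed preference, then prepend it to the prompt so that standard preference tuning conditions on the why, not just the which. The `llm` helper and field names are placeholders.

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to the model used to infer personas."""
    raise NotImplementedError("plug in a model call here")

def add_inferred_persona(example: dict) -> dict:
    """`example` has keys 'prompt', 'chosen', 'rejected' from a preference dataset."""
    persona = llm(
        "A user asked: {prompt}\n"
        "They preferred response A over response B.\n"
        "A: {chosen}\nB: {rejected}\n"
        "In one sentence, describe the kind of user who would prefer A.".format(**example)
    )
    return {
        "prompt": f"User persona: {persona}\n\n{example['prompt']}",  # persona-conditioned prompt
        "chosen": example["chosen"],
        "rejected": example["rejected"],
    }
```

The returned tuples drop straight into any standard preference-tuning objective (e.g., DPO), so the model is trained on an explicit hypothesis about the user rather than an unexplained preference label.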
@inproceedings{Balepur:Shu:Hoyle:Robey:Feng:Goldfarb-Tarrant:Boyd-Graber-2024,
  Title = {A SMART Mnemonic Sounds like "Glue Tonic": Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick},
  Author = {Nishant Balepur and Matthew Shu and Alexander Hoyle and Alison Robey and Shi Feng and Seraphina Goldfarb-Tarrant and Jordan Boyd-Graber},
  Booktitle = {Empirical Methods in Natural Language Processing},
  Year = {2024},
  Location = {Miami},
  Url = {http://cs.umd.edu/~jbg//docs/2024_emnlp_mnemonic.pdf},
}
Accessible Abstract: Learning vocabulary (e.g., benevolent) can be tedious, but using mnemonics (e.g., benevolent sounds like "benefits," and a kind boss gives benefits) makes it more engaging and effective. This paper introduces SMART, a large language model trained to produce mnemonics based on feedback from flashcard learners. Students struggle to predict which mnemonics will help them most. Still, by training SMART on both student preferences and learning outcomes, we can generate mnemonics as effectively as GPT-4, but at a much lower cost.
@inproceedings{Gor:Daume-III:Boyd-Graber-2024,
  Title = {Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA},
  Author = {Maharshi Gor and Hal {Daum\'{e} III} and Tianyi Zhou and Jordan Boyd-Graber},
  Booktitle = {Empirical Methods in Natural Language Processing},
  Year = {2024},
  Location = {Miami},
  Url = {http://cs.umd.edu/~jbg//docs/2024_emnlp_caimira.pdf},
}
Accessible Abstract: CAIMIRA discovers the skills that humans and AIs use to answer questions. By scraping websites where trivia nerds answer really difficult questions and posing those questions to AI models like GPT-4 and LLaMA-3-70B, we find that while humans excel in knowledge-based abductive reasoning, AI outperforms them on fact-based historical recall. This research suggests future challenges should focus on more complex reasoning and nuanced language tasks to better align AI development with human cognitive strengths.
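For readers who want the flavor of the underlying model, here is a hedged sketch of the item-response-theory idea CAIMIRA builds on: each agent (human or AI) gets a latent skill vector, each question gets skill-relevance weights and a difficulty, and correctness follows a logistic link. The dimensions and training details below are illustrative, not the paper's exact specification.

```python
import torch
from torch import nn

class LatentSkillModel(nn.Module):
    def __init__(self, n_agents: int, n_questions: int, n_skills: int = 5):
        super().__init__()
        self.skill = nn.Embedding(n_agents, n_skills)         # per-agent skill vector
        self.relevance = nn.Embedding(n_questions, n_skills)  # which skills a question needs
        self.difficulty = nn.Embedding(n_questions, 1)

    def forward(self, agent_ids, question_ids):
        match = (self.skill(agent_ids) * self.relevance(question_ids)).sum(-1)
        return match - self.difficulty(question_ids).squeeze(-1)  # logit of answering correctly

# Fit with binary cross-entropy on (agent, question, correct) records; inspecting the
# learned relevance vectors is what surfaces skill clusters like abductive reasoning
# versus fact-based recall.
model = LatentSkillModel(n_agents=100, n_questions=500)
loss_fn = nn.BCEWithLogitsLoss()
# loss = loss_fn(model(agent_ids, question_ids), correct.float())
```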
@inproceedings{Kabir:Sung:Bandyopadhyay:Zou:Chandra:Boyd-Graber-2024,
  Title = {You Make me Feel like a Natural Question: Training QA Systems on Transformed Trivia Questions},
  Author = {Tasnim Kabir and Yoo Yeon Sung and Saptarashmi Bandyopadhyay and Hao Zou and Abhranil Chandra and Jordan Lee Boyd-Graber},
  Booktitle = {Empirical Methods in Natural Language Processing},
  Location = {Miami},
  Year = {2024},
  Url = {http://cs.umd.edu/~jbg//docs/2024_emnlp_natural.pdf},
}
Accessible Abstract: Many of the questions used to train AIs to answer questions come from the queries users type into search engines (like Google's Natural Questions). Is there a cheaper---perhaps even better---way? We propose a "naturalization" technique to turn high-quality, rigorously edited trivia questions into examples that resemble Natural Questions. Training on our naturalized questions and testing on Natural Questions comes close to the results of training on Natural Questions itself, and we can improve results on MMLU (a standard modern evaluation set) by using our data.
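To make the transformation concrete, here is a hypothetical few-shot prompt for the naturalization step; the example rewrite and the `llm` helper are illustrative, not the dataset's actual construction prompt.

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to the model doing the rewriting."""
    raise NotImplementedError("plug in a model call here")

FEW_SHOT = """Rewrite each trivia question as a short, search-style natural question.
Trivia: This "Sun King" built the palace of Versailles and reigned for 72 years. Name him.
Natural: who was the sun king of france
Trivia: {trivia}
Natural:"""

def naturalize(trivia_question: str) -> str:
    """Turn a rigorously edited trivia question into a Natural Questions-style query."""
    return llm(FEW_SHOT.format(trivia=trivia_question)).strip()
```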
@article{Wu:Guan:Li:Huang:Liu:Wang:Xian:Shrivastava:Huang:Boyd-Graber:Zhou:Manocha-2024,
  Title = {AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models},
  Author = {Xiyang Wu and Tianrui Guan and Dianqi Li and Shuaiyi Huang and Xiaoyu Liu and Xijun Wang and Ruiqi Xian and Abhinav Shrivastava and Furong Huang and Jordan Boyd-Graber and Tianyi Zhou and Dinesh Manocha},
  Journal = {Findings of the Empirical Methods in Natural Language Processing},
  Year = {2024},
  Location = {Miami},
  Url = {https://arxiv.org/abs/2406.10900},
}

@article{Li:Mondal:Nghiem:Liang:Boyd-Graber-2024,
  Title = {PEDANTS (Precise Evaluations of Diverse Answer Nominee Text for Skinflints): Use Evaluation Metrics Wisely---Efficient Evaluation Analysis and Benchmarking for Open-Domain Question Answering},
  Author = {Zongxia Li and Ishani Mondal and Huy Nghiem and Yijun Liang and Jordan Boyd-Graber},
  Journal = {Findings of the Empirical Methods in Natural Language Processing},
  Location = {Miami},
  Year = {2024},
  Url = {https://arxiv.org/abs/2402.11161},
}

@article{Staff-2024,
  Author = {Maryland Today Staff},
  Year = {2024},
  Title = {At New AI Institute’s Celebration, a Question of ‘Who’s at the Table’},
  Journal = {Maryland Today},
  Url = {https://today.umd.edu/at-new-ai-institutes-celebration-a-question-of-whos-at-the-table},
}
This work is supported by the National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the researchers and do not necessarily reflect the views of the National Science Foundation.