Hard Data to the Model: Personalized, Diverse Preferences for Language Models

Project funded by the National Science Foundation (IIS-2403436)
PI: Jordan Boyd-Graber,

Overview

Tens of millions of Americans interact with AI tools to find information, answer questions, or help them solve problems. One key drawback of these systems is lack of personalization: since modern AI systems do not know whom they are talking to, they can only give generic answers to user questions. But the answer to the question “why is the sky blue?” should be different if the person asking the question is a college student or a young child. This project aims to enable an AI model to provide more appropriate responses to users depending on their unique backgrounds, experiences, and needs. It will first gather a diverse dataset in order to characterize what kinds of responses are preferred by different people. The project will then use these data to develop AI systems that can tailor their answers to individual users, as well as evaluate how well the AI systems personalize responses. To achieve this personalization, the AI systems will learn to explicitly represent the kind of person they are talking to, based on their background or previous interactions, and then use this representation to generate an appropriate response. This project will result in AIs that can provide personalized, specific responses based on the person asking the question as well as resources that will help other personalize AIs. These resources will include datasets of personalized questions and answers, interfaces and visualizations to understand why AI provides specific responses over others; interviews and discussions with community members to understand their needs; and code and models that will allow others to build, train, and deploy personalized AI systems.

While large language models (LLMs) trained on massive datasets have shown impressive performance on a variety of tasks, they still exhibit biases and struggle to be equally useful for everyone. While initially pre-trained on a language modeling objective, most LLMs are further fine-tuned to align their outputs with human preferences. However, existing techniques assume a “one size fits all” approach, ignoring diversity in user needs. This project will first construct probes to detect cases where models fail to adapt to the diverse needs of different users. Then, this project will develop Personalized Feedback for Diverse Populations (PFDP) to identify when models should be sensitive to the unique needs, knowledge, and background of users by examining the training trajectory of models and comparing models' answers to human preferences. PFDP will enable the development of models that can detect examples that are difficult for computers but not for humans, explain why such disparities in difficulty exist, and represent users’ needs and preferences within the model. To correct those shortcomings in the data, we focus on data curation: we propose techniques to automatically create new examples that ask questions about under-represented groups or require targeted responses to create adversarial prompt and response pairs with a human in the loop. Finally, with these new data, we develop techniques to allow modern architectures to make the most of these difficult (but few) examples. These techniques will allow for fine-tuning LLMs with a small curated subset of data that is robust to variations in prompts and will lead to the generation of acceptable answers for a diverse population of users.

<< back to top

Project Team

	Jordan Boyd-Graber Assistant Professor, Computer Science (Maryland)
	Alvin Grissom II Associate Professor, Haverford College

<< back to top

Publications (Selected)

Maharshi Gor, Hal Daumé III Tianyi Zhou, and Jordan Boyd-Graber. Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA. Empirical Methods in Natural Language Processing, 2024. [Talk] [Code] [Data] [Bibtex]
```
@inproceedings{Gor:Daume-III:Boyd-Graber-2024,
	Title = {Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA},
	Author = {Maharshi Gor and Hal {Daum\'{e} III} Tianyi Zhou and Jordan Boyd-Graber},
	Booktitle = {Empirical Methods in Natural Language Processing},
	Year = {2024},
	Location = {Miami},
	Url = {http://cs.umd.edu/~jbg//docs/2024_emnlp_caimira.pdf},
}
```
Accessible Abstract: CAIMIRA discovers the skills that humans and AIs use to answer questions. By scraping websites where trivia nerds answer really difficult questions and posing those questions to AI models like GPT-4 and LLaMA-3-70B, while humans excel in knowledge-based abductive reasoning, AI outperforms on fact-based historical recall. This research suggests future challenges should focus on more complex reasoning and nuanced language tasks to better align AI development with human cognitive strengths.
Tasnim Kabir, Yoo Yeon Sung, Saptarashmi Bandyopadhyay, Hao Zou, Abhranil Chandra, and Jordan Lee Boyd-Graber. You Make me Feel like a Natural Question: Training QA Systems on Transformed Trivia Questions. Empirical Methods in Natural Language Processing, 2024. [ArXiv] [Research Talk] [Code] [Bibtex]
```
@inproceedings{Kabir:Sung:Bandyopadhyay:Zou:Chandra:Boyd-Graber-2024,
	Title = {You Make me Feel like a Natural Question: Training QA Systems on Transformed Trivia Questions},
	Author = {Tasnim Kabir and Yoo Yeon Sung and Saptarashmi Bandyopadhyay and Hao Zou and Abhranil Chandra and Jordan Lee Boyd-Graber},
	Booktitle = {Empirical Methods in Natural Language Processing},
	Location = {Miami},
	Year = {2024},
	Url = {http://cs.umd.edu/~jbg//docs/2024_emnlp_natural.pdf},
}
```
Accessible Abstract: Many of the questions for training AIs how to answer questions come from the queries users type into search engines (like Google's Natural Questions). Is there a cheaper---perhaps even better---way? We propose a "naturalization" technique to turn high-quality, rigorously edited trivia questions into examples that resembles Natural Questions. Training on our naturalized questions and testing on natural questions comes close to the results with using Natural Questions, and we can improve results on MMLU (a standard modern evaluation set) by using our data.

Datasets

You Make me Feel Like a Natural Question

Acknowledgments

This work is supported by the National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the researchers and do not necessarily reflect the views of the National Science Foundation.