Project funded by IARPA
PI: Jordan Boyd-Graber
The Better Extraction from Text Towards Enhanced Retrieval (BETTER) Program will develop methods for extracting increasingly fine-grained semantic information, with a focus on events in the form of who-did-what-to-whom-when-where, across multiple languages and problem domains. This extracted information will be applied to an information retrieval task. An additional area of focus is human-in-the-loop computation: performer systems will need the ability to incorporate human judgments for metrics such as the relevancy and accuracy of extracted or retrieved information.
The UMD team focused on improving representations for low-resource languages by using related languages and human interaction.
Jordan Boyd-Graber, Assistant Professor, Computer Science (UMD)
Mozhi Zhang, PhD Student, Computer Science (UMD)
@inproceedings{Fujinuma:Boyd-Graber:Kann-2022, Author = {Yoshinari Fujinuma and Jordan Boyd-Graber and Katharina Kann}, Title = {How Does Multilingual Pretraining Affect Cross-Lingual Transferability?}, Booktitle = {Association for Computational Linguistics}, Year = {2022}, Location = {Dublin}, Url = {http://umiacs.umd.edu/~jbg//docs/2022_acl_multilingbert.pdf}, }
@article{He:Mao:Boyd-Graber-2022, Title = {Cheater's Bowl: Human vs. Computer Search Strategies for Open-Domain QA}, Author = {Wanrong He and Andrew Mao and Jordan Boyd-Graber}, Journal = {Findings of Empirical Methods in Natural Language Processing}, Year = {2022}, Location = {Abu Dhabi}, Url = {http://umiacs.umd.edu/~jbg//docs/2022_emnlp_cheaters.pdf}, }
Accessible Abstract: When the Covid pandemic hit, trivia games moved online. With them came cheating: people tried to quickly Google answers. This is bad for sportsmanship, but a good source of training data for teaching computers how to find answers. We built an interface to harvest this training data from trivia players, fed the resulting queries into retrieval-based QA systems, and showed that they were better than the automatically generated queries used by the current state of the art.
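At its core, the comparison is: retrieve with human-written queries versus automatically generated ones, and measure which finds the gold document more often. Below is a minimal, purely illustrative sketch of that evaluation loop; the corpus, queries, and term-overlap "retriever" are invented stand-ins, not the paper's data or system.

```python
# Toy comparison of human-written vs. automatically generated search queries.
corpus = {
    "d1": "thomas edison invented the phonograph in 1877",
    "d2": "the eiffel tower was completed in 1889 in paris",
    "d3": "marie curie won two nobel prizes in physics and chemistry",
}

def retrieve(query, k=1):
    """Rank documents by how many query terms they contain."""
    terms = set(query.lower().split())
    scores = {doc: len(terms & set(text.split())) for doc, text in corpus.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def recall_at_k(queries, k=1):
    """Fraction of (query, gold document) pairs whose gold doc is in the top k."""
    hits = sum(gold in retrieve(q, k) for q, gold in queries)
    return hits / len(queries)

human_queries = [("who invented the phonograph", "d1"),
                 ("marie curie nobel prizes", "d3")]
auto_queries = [("phonograph", "d1"),
                ("prize winner", "d3")]

print("human queries, recall@1:", recall_at_k(human_queries))
print("auto queries,  recall@1:", recall_at_k(auto_queries))
```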
@inproceedings{Zhang:Fujinuma:Paul:Boyd-Graber-2020, Author = {Mozhi Zhang and Yoshinari Fujinuma and Michael J. Paul and Jordan Boyd-Graber}, Title = {Why Overfitting Isn't Always Bad: Retrofitting Cross-Lingual Word Embeddings to Dictionaries}, Booktitle = {Association for Computational Linguistics}, Year = {2020}, Location = {The Cyberverse Simulacrum of Seattle}, Url = {http://umiacs.umd.edu/~jbg//docs/2020_acl_refine.pdf}, }
Accessible Abstract: Computers need to represent words in a computer-readable way. This work shows that nudging these representations, after the usual machine learning is done, so that words in different languages move closer to a small list of translations (like entries from a dictionary) helps on downstream tasks (e.g., guessing the grammatical category of a word) but hurts when asking the algorithm for translations of unseen words.
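In the spirit of that idea, here is a minimal retrofitting-style sketch that repeatedly averages each word's vector with its dictionary translations. The function name, weights, and toy data are illustrative assumptions, not the paper's exact procedure (which deliberately overfits to the dictionary).

```python
import numpy as np

def retrofit_to_dictionary(emb, dictionary, alpha=1.0, beta=1.0, iters=10):
    """Pull each word's vector toward its dictionary translations.

    emb: {word: np.ndarray}, cross-lingual embeddings (keys may mix languages).
    dictionary: {word: [translations]} from a small bilingual dictionary.
    """
    new = {w: v.copy() for w, v in emb.items()}
    for _ in range(iters):
        for w, translations in dictionary.items():
            nbrs = [t for t in translations if t in new]
            if w not in new or not nbrs:
                continue
            # Weighted average of the original vector and its translations' vectors.
            total = alpha * emb[w] + beta * sum(new[t] for t in nbrs)
            new[w] = total / (alpha + beta * len(nbrs))
    return new

# Toy usage: "dog" (English) and "perro" (Spanish) should end up closer together.
emb = {"dog": np.array([1.0, 0.0]), "perro": np.array([0.0, 1.0])}
refined = retrofit_to_dictionary(emb, {"dog": ["perro"], "perro": ["dog"]})
print(refined["dog"], refined["perro"])
```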
@inproceedings{Yuan:Lin:Boyd-Graber-2020, Title = {Cold-start Active Learning through Self-Supervised Language Modeling}, Author = {Michelle Yuan and Hsuan-Tien Lin and Jordan Boyd-Graber}, Booktitle = {Empirical Methods in Natural Language Processing}, Year = {2020}, Location = {The Cyberverse Simulacrum of Punta Cana, Dominican Republic}, Url = {http://umiacs.umd.edu/~jbg//docs/2020_emnlp_alps.pdf}, }
Accessible Abstract: Labeling data is a fundamental bottleneck in machine learning, especially for NLP, due to annotation cost and time. For medical text, obtaining labeled data is challenging because of privacy issues or a shortage of expertise. Thus, active learning can be employed to recognize the most relevant examples and then query labels from an oracle. However, developing a strategy for selecting examples to label is non-trivial. Active learning is difficult to use in a cold-start setting: all examples confuse the model because it has not trained on enough data. Fortunately, modern NLP provides an additional source of information: pre-trained language models. In our paper, we propose an active learning strategy called ALPS to find sentences that perplex the language model. We evaluate our approach on sentence classification datasets spanning different domains. Results show that ALPS is an efficient active learning strategy that is competitive with state-of-the-art approaches.
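As a rough illustration of "finding sentences that perplex the language model," the sketch below scores each unlabeled sentence with a masked-language-model pseudo-perplexity and sends the most surprising ones for labeling first. This is a simplified proxy under assumed model and pool choices; ALPS itself clusters surprisal embeddings rather than taking a simple top-k.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_perplexity(sentence):
    """Average masked-LM loss when masking one token at a time."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    losses = []
    for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        labels = torch.full_like(input_ids, -100)  # -100 = ignore position
        labels[i] = input_ids[i]
        with torch.no_grad():
            out = model(masked.unsqueeze(0), labels=labels.unsqueeze(0))
        losses.append(out.loss.item())
    return sum(losses) / max(len(losses), 1)

# Hypothetical unlabeled pool; label the sentences the LM finds most surprising.
pool = ["The patient reported acute dyspnea.", "Cats are cute.", "Stocks fell today."]
to_label = sorted(pool, key=pseudo_perplexity, reverse=True)[:2]
print(to_label)
```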
@inproceedings{Yuan:Zhang:Van-Durme:Findlater:Boyd-Graber-2020, Title = {Interactive Refinement of Cross-Lingual Word Embeddings}, Author = {Michelle Yuan and Mozhi Zhang and Benjamin {Van Durme} and Leah Findlater and Jordan Boyd-Graber}, Booktitle = {Empirical Methods in Natural Language Processing}, Year = {2020}, Location = {The Cyberverse Simulacrum of Punta Cana, Dominican Republic}, Url = {http://umiacs.umd.edu/~jbg//docs/2020_emnlp_clime.pdf}, }
Accessible Abstract: Language technologies sometimes need to be quickly deployed in low-resource languages. For example, in the 2010 Haiti earthquake, researchers used machine learning models to analyze social media and text messages to gain situational awareness. We introduce CLIME, an interactive system that can help in these scenarios: users see which task-related words the system thinks are similar, and correct the model by pushing similar words together and dissimilar words apart.
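A toy version of that feedback loop is sketched below: given user-marked similar and dissimilar keyword pairs, it nudges unit-normalized word vectors together or apart. The update rule, learning rate, and data are illustrative assumptions and do not reproduce CLIME's actual objective.

```python
import numpy as np

def refine_embeddings(emb, similar, dissimilar, lr=0.1, steps=20):
    """Nudge word vectors based on user feedback on keyword pairs.

    emb: {word: np.ndarray}; similar/dissimilar: lists of (word, word) pairs.
    """
    emb = {w: v / np.linalg.norm(v) for w, v in emb.items()}
    for _ in range(steps):
        for a, b in similar:      # pull the pair together
            diff = emb[b] - emb[a]
            emb[a] = emb[a] + lr * diff
            emb[b] = emb[b] - lr * diff
        for a, b in dissimilar:   # push the pair apart
            diff = emb[b] - emb[a]
            emb[a] = emb[a] - lr * diff
            emb[b] = emb[b] + lr * diff
        emb = {w: v / np.linalg.norm(v) for w, v in emb.items()}
    return emb

# Toy usage: mark "quake"/"tremblement" as similar, "quake"/"banana" as dissimilar.
emb = {"quake": np.array([1.0, 0.0]),
       "tremblement": np.array([0.6, 0.8]),
       "banana": np.array([0.8, 0.6])}
refined = refine_embeddings(emb, [("quake", "tremblement")], [("quake", "banana")])
print({w: np.round(v, 2) for w, v in refined.items()})
```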
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the researchers and do not necessarily reflect the views of the sponsor.