Project funded by IARPA
PI: Jordan Boyd-Graber
The Better Extraction from Text Towards Enhanced Retrieval (BETTER) Program will develop methods for extracting increasingly fine-grained semantic information, with a focus on events in the form of who-did-what-to-whom-when-where, across multiple languages and problem domains. This extracted information will be applied to an information retrieval task. An additional area of focus is human-in-the-loop computation: performer systems will need the ability to incorporate human judgments for metrics such as the relevancy and accuracy of extracted or retrieved information.
The UMD team focused on improving representations for low-resource languages by using related languages and human interaction.
Jordan Boyd-Graber, Assistant Professor, Computer Science (UMD)
Mozhi Zhang, PhD Student, Computer Science (UMD)
@inproceedings{Fujinuma:Boyd-Graber:Kann-2022, Author = {Yoshinari Fujinuma and Jordan Boyd-Graber and Katharina Kann}, Title = {How Does Multilingual Pretraining Affect Cross-Lingual Transferability?}, Booktitle = {Association for Computational Linguistics}, Year = {2022}, Location = {Dublin}, Url = {http://umiacs.umd.edu/~jbg//docs/2022_acl_multilingbert.pdf}, }
@article{He:Mao:Boyd-Graber-2022, Title = {Cheater's Bowl: Human vs. Computer Search Strategies for Open-Domain QA}, Author = {Wanrong He and Andrew Mao and Jordan Boyd-Graber}, Journal = {Findings of Empirical Methods in Natural Language Processing}, Year = {2022}, Location = {Abu Dhabi}, Url = {http://umiacs.umd.edu/~jbg//docs/2022_emnlp_cheaters.pdf}, }
Accessible Abstract: When the Covid pandemic hit, trivia games moved online. With them came cheating: people tried to quickly Google answers. This is bad for sportsmanship, but a good source of training data for teaching computers how to find answers. We built an interface to harvest this training data from trivia players, fed the resulting queries into retrieval-based QA systems, and showed that they were better than the automatically generated queries used by the current state of the art.
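At its core, the comparison is: retrieve with human-written queries versus automatically generated ones, and measure which finds the gold document more often. Below is a minimal, purely illustrative sketch of that evaluation loop; the corpus, queries, and term-overlap "retriever" are invented stand-ins, not the paper's data or system.

```python
# Toy comparison of human-written vs. automatically generated search queries.
corpus = {
    "d1": "thomas edison invented the phonograph in 1877",
    "d2": "the eiffel tower was completed in 1889 in paris",
    "d3": "marie curie won two nobel prizes in physics and chemistry",
}

def retrieve(query, k=1):
    """Rank documents by how many query terms they contain."""
    terms = set(query.lower().split())
    scores = {doc: len(terms & set(text.split())) for doc, text in corpus.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def recall_at_k(queries, k=1):
    """Fraction of (query, gold document) pairs whose gold doc is in the top k."""
    hits = sum(gold in retrieve(q, k) for q, gold in queries)
    return hits / len(queries)

human_queries = [("who invented the phonograph", "d1"),
                 ("marie curie nobel prizes", "d3")]
auto_queries = [("phonograph", "d1"),
                ("prize winner", "d3")]

print("human queries, recall@1:", recall_at_k(human_queries))
print("auto queries,  recall@1:", recall_at_k(auto_queries))
```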
@inproceedings{Zhang:Fujinuma:Paul:Boyd-Graber-2020, Author = {Mozhi Zhang and Yoshinari Fujinuma and Michael J. Paul and Jordan Boyd-Graber}, Title = {Why Overfitting Isn't Always Bad: Retrofitting Cross-Lingual Word Embeddings to Dictionaries}, Booktitle = {Association for Computational Linguistics}, Year = {2020}, Location = {The Cyberverse Simulacrum of Seattle}, Url = {http://umiacs.umd.edu/~jbg//docs/2020_acl_refine.pdf}, }
Accessible Abstract: Computers need to represent words in a computer-readable way. This work shows that nudging these representations, after the usual machine learning is done, so that words in different languages move closer to a small list of translations (like entries from a dictionary) helps on downstream tasks (e.g., guessing the grammatical category of a word) but hurts when asking the algorithm for translations of unseen words.
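In the spirit of that idea, here is a minimal retrofitting-style sketch that repeatedly averages each word's vector with its dictionary translations. The function name, weights, and toy data are illustrative assumptions, not the paper's exact procedure (which deliberately overfits to the dictionary).

```python
import numpy as np

def retrofit_to_dictionary(emb, dictionary, alpha=1.0, beta=1.0, iters=10):
    """Pull each word's vector toward its dictionary translations.

    emb: {word: np.ndarray}, cross-lingual embeddings (keys may mix languages).
    dictionary: {word: [translations]} from a small bilingual dictionary.
    """
    new = {w: v.copy() for w, v in emb.items()}
    for _ in range(iters):
        for w, translations in dictionary.items():
            nbrs = [t for t in translations if t in new]
            if w not in new or not nbrs:
                continue
            # Weighted average of the original vector and its translations' vectors.
            total = alpha * emb[w] + beta * sum(new[t] for t in nbrs)
            new[w] = total / (alpha + beta * len(nbrs))
    return new

# Toy usage: "dog" (English) and "perro" (Spanish) should end up closer together.
emb = {"dog": np.array([1.0, 0.0]), "perro": np.array([0.0, 1.0])}
refined = retrofit_to_dictionary(emb, {"dog": ["perro"], "perro": ["dog"]})
print(refined["dog"], refined["perro"])
```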
@inproceedings{Yuan:Lin:Boyd-Graber-2020, Title = {Cold-start Active Learning through Self-Supervised Language Modeling}, Author = {Michelle Yuan and Hsuan-Tien Lin and Jordan Boyd-Graber}, Booktitle = {Empirical Methods in Natural Language Processing}, Year = {2020}, Location = {The Cyberverse Simulacrum of Punta Cana, Dominican Republic}, Url = {http://umiacs.umd.edu/~jbg//docs/2020_emnlp_alps.pdf}, }
Accessible Abstract: Labeling data is a fundamental bottleneck in machine learning, especially for NLP, due to annotation cost and time. For medical text, obtaining labeled data is challenging because of privacy issues or a shortage of expertise. Thus, active learning can be employed to recognize the most relevant examples and then query labels from an oracle. However, developing a strategy for selecting examples to label is non-trivial. Active learning is difficult to use in a cold-start setting: all examples confuse the model because it has not trained on enough data. Fortunately, modern NLP provides an additional source of information: pre-trained language models. In our paper, we propose an active learning strategy called ALPS to find sentences that perplex the language model. We evaluate our approach on sentence classification datasets spanning different domains. Results show that ALPS is an efficient active learning strategy that is competitive with state-of-the-art approaches.
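As a rough illustration of "finding sentences that perplex the language model," the sketch below scores each unlabeled sentence with a masked-language-model pseudo-perplexity and sends the most surprising ones for labeling first. This is a simplified proxy under assumed model and pool choices; ALPS itself clusters surprisal embeddings rather than taking a simple top-k.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_perplexity(sentence):
    """Average masked-LM loss when masking one token at a time."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    losses = []
    for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        labels = torch.full_like(input_ids, -100)  # -100 = ignore position
        labels[i] = input_ids[i]
        with torch.no_grad():
            out = model(masked.unsqueeze(0), labels=labels.unsqueeze(0))
        losses.append(out.loss.item())
    return sum(losses) / max(len(losses), 1)

# Hypothetical unlabeled pool; label the sentences the LM finds most surprising.
pool = ["The patient reported acute dyspnea.", "Cats are cute.", "Stocks fell today."]
to_label = sorted(pool, key=pseudo_perplexity, reverse=True)[:2]
print(to_label)
```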
@inproceedings{Yuan:Zhang:Van-Durme:Findlater:Boyd-Graber-2020, Title = {Interactive Refinement of Cross-Lingual Word Embeddings}, Author = {Michelle Yuan and Mozhi Zhang and Benjamin {Van Durme} and Leah Findlater and Jordan Boyd-Graber}, Booktitle = {Empirical Methods in Natural Language Processing}, Year = {2020}, Location = {The Cyberverse Simulacrum of Punta Cana, Dominican Republic}, Url = {http://umiacs.umd.edu/~jbg//docs/2020_emnlp_clime.pdf}, }
Accessible Abstract: Language technologies sometimes need to be quickly deployed in low-resource languages. For example, in the 2010 Haiti earthquake, researchers used machine learning models to analyze social media and text messages to gain situational awareness. We introduce CLIME, an interactive system that can help in these scenarios: users see which task-related words the system thinks are similar, and correct the model by pushing similar words together and dissimilar words apart.
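A toy version of that feedback loop is sketched below: given user-marked similar and dissimilar keyword pairs, it nudges unit-normalized word vectors together or apart. The update rule, learning rate, and data are illustrative assumptions and do not reproduce CLIME's actual objective.

```python
import numpy as np

def refine_embeddings(emb, similar, dissimilar, lr=0.1, steps=20):
    """Nudge word vectors based on user feedback on keyword pairs.

    emb: {word: np.ndarray}; similar/dissimilar: lists of (word, word) pairs.
    """
    emb = {w: v / np.linalg.norm(v) for w, v in emb.items()}
    for _ in range(steps):
        for a, b in similar:      # pull the pair together
            diff = emb[b] - emb[a]
            emb[a] = emb[a] + lr * diff
            emb[b] = emb[b] - lr * diff
        for a, b in dissimilar:   # push the pair apart
            diff = emb[b] - emb[a]
            emb[a] = emb[a] - lr * diff
            emb[b] = emb[b] + lr * diff
        emb = {w: v / np.linalg.norm(v) for w, v in emb.items()}
    return emb

# Toy usage: mark "quake"/"tremblement" as similar, "quake"/"banana" as dissimilar.
emb = {"quake": np.array([1.0, 0.0]),
       "tremblement": np.array([0.6, 0.8]),
       "banana": np.array([0.8, 0.6])}
refined = refine_embeddings(emb, [("quake", "tremblement")], [("quake", "banana")])
print({w: np.round(v, 2) for w, v in refined.items()})
```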
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the researchers and do not necessarily reflect the views of the sponsor.