Project # 1618695
Participants:
Douglas W. Oard, PI, University of Maryland
Katie Shilton, Co-PI, University of Maryland
Jimmy Lin, Co-PI, University of Maryland and University of Waterloo
Mossaab Bagdouri, Ph.D. Student, University of Maryland
Caitlin Christian-Lamb, Ph.D. Student, University of Maryland
Hua He, Ph.D. Student, University of Maryland
Mahmoud Sayed, Ph.D. Student, University of Maryland
Jyothi Vinjumur, Ph.D. Student, University of Maryland
Yulu Wang, Ph.D. Student, University of Maryland
Amy Wickner, Ph.D. Student, University of Maryland
Rashmi Sankepally, M.S. Student, University of Maryland
Will Cox, B.S. Student, University of Maryland
Jonah Rivera, B.S. Student, University of Maryland
Paulina Zheng, B.S. Student, University of Maryland
Today's search engines are designed principally to help people find what they want to see. Paradoxically, the fact that search engines do this well means that there are many collections that can't be searched. Citizens cannot yet search some government records because of intermixed information that may need to be protected. Scholars are not yet allowed to see much of the growing backlog of unprocessed archival collections for similar reasons. These limitations, and many more, are direct consequences of the fact that current search engines can only protect sensitive content if that sensitive content has been marked in advance. As the volume of digital content continues to increase, current approaches based on manually finding and marking all of the sensitive content in a collection simply cannot affordably accommodate the scale of the challenge. This project will address that challenge by creating a new class of search algorithms that are designed to balance the searcher's interest in finding relevant content with the content provider's interest in protecting sensitive content. This technology will benefit society by dramatically altering the way we approach challenges such as government transparency, personal and enterprise information management, civil litigation and regulatory investigations, and scholarly access to archival materials.
The project will leverage evaluation-driven information retrieval techniques to optimize a unified objective function that balances the value of finding relevant content with the imperative to protect sensitive information. This will require developing a new class of evaluation measures that are sensitive to both value (relevance) and cost (sensitivity). Factorial vignette survey techniques will be used to elicit the context-appropriate balance of access and restriction for representative applications. The survey results will then be used to inform the design of the feature sets on which evaluation-driven information retrieval techniques depend. Initial experiments will be conducted in protected environments, both locally and as shared-task evaluations on collections that can be licensed for research use under terms that preclude inappropriate disclosure. Ultimately, the project seeks to develop a process for evaluating algorithms for search among sensitive content using an algorithm deposit model in which the executable search algorithm is sent to the protected data and only manually vetted evaluation results are returned to participants.
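To make the idea of a measure that is sensitive to both value and cost concrete, the following is a minimal illustrative sketch and not the project's actual measure. It assumes binary relevance and sensitivity judgments, a logarithmic rank discount, and hypothetical `gain` and `penalty` weights; the names `Judgment` and `balanced_score` are invented for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """Hypothetical per-document judgment: is it relevant, is it sensitive?"""
    relevant: bool
    sensitive: bool


def balanced_score(ranking, gain=1.0, penalty=2.0):
    """Score a ranked list by rewarding relevant, non-sensitive documents
    and penalizing any sensitive document that is shown.

    Both the reward and the penalty are discounted by rank, so a sensitive
    document shown at the top costs more than one shown lower down.
    The weights are assumptions; in practice they would be set per application.
    """
    total = 0.0
    for rank, judgment in enumerate(ranking, start=1):
        discount = 1.0 / math.log2(rank + 1)
        if judgment.sensitive:
            total -= penalty * discount   # cost of disclosing sensitive content
        elif judgment.relevant:
            total += gain * discount      # value of finding relevant content
    return total


# Two candidate result lists for the same query.
safe = [Judgment(relevant=True, sensitive=False),
        Judgment(relevant=False, sensitive=False)]
leaky = [Judgment(relevant=True, sensitive=False),
         Judgment(relevant=True, sensitive=True)]

print(balanced_score(safe) > balanced_score(leaky))
```

Under these assumed weights, the list that withholds the sensitive document scores higher even though the withheld document is relevant. The relative size of `penalty` and `gain` is exactly the kind of context-dependent balance that the survey work described above is meant to inform.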
Page created: December 11, 2017
This material is based upon work supported by the National Science Foundation under Grant No. 1618695. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.