Safely Searching Among Sensitive Content

Participants:

Douglas W. Oard, PI, University of Maryland

Katie Shilton, Co-PI, University of Maryland

Jimmy Lin, Co-PI, University of Maryland and University of Waterloo

Mossaab Bagdouri, Ph.D. Student, University of Maryland

Caitlin Christian-Lamb, Ph.D. Student, University of Maryland

Hua He, Ph.D. Student, University of Maryland

Mahmoud Sayed, Ph.D. Student, University of Maryland

Jyothi Vinjumur, Ph.D. Student, University of Maryland

Yulu Wang, Ph.D. Student, University of Maryland

Amy Wickner, Ph.D. Student, University of Maryland

Rashmi Sankepally, M.S. Student, University of Maryland

Will Cox, B.S. Student, University of Maryland

Nishanth Mallekav, B.S. Student, University of Maryland

Jonah Rivera, B.S. Student, University of Maryland

Paulina Zheng, B.S. Student, University of Maryland

Today's search engines are designed principally to help people find what they want to see. Paradoxically, the fact that search engines do this well means that there are many collections that cannot be searched. Citizens cannot yet search some government records because of intermixed information that may need to be protected. Scholars are not yet allowed to see much of the growing backlog of unprocessed archival collections for similar reasons. These limitations, and many more, are direct consequences of the fact that current search engines can only protect sensitive content if that sensitive content has been marked in advance. As the volume of digital content continues to increase, current approaches based on manually finding and marking all of the sensitive content in a collection simply cannot affordably accommodate the scale of the challenge. This project will address that challenge by creating a new class of search algorithms that are designed to balance the searcher's interest in finding relevant content with the content provider's interest in protecting sensitive content. This technology will benefit society by dramatically altering the way we approach challenges such as government transparency, personal and enterprise information management, civil litigation and regulatory investigations, and scholarly access to archival materials.

The project will leverage evaluation-driven information retrieval techniques to optimize a unified objective function that balances the value of finding relevant content with the imperative to protect sensitive information. This will require developing a new class of evaluation measures that are sensitive to both value (relevance) and cost (sensitivity). Factorial vignette survey techniques will be used to elicit the context-appropriate balance of access and restriction for representative applications. The survey results will then be used to inform the design of the feature sets on which evaluation-driven information retrieval techniques depend. Initial experiments will be conducted in protected environments, both locally and as shared-task evaluations on collections that can be licensed for research use under terms that preclude inappropriate disclosure. Ultimately, the project seeks to develop a process for evaluating algorithms for search among sensitive content using an algorithm deposit model in which the executable search algorithm is sent to the protected data and only manually vetted evaluation results are returned to participants.
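To make the idea of an evaluation measure that is sensitive to both value and cost more concrete, the following is a minimal illustrative sketch, not the measure developed by the project. It assumes binary relevance and sensitivity labels, a logarithmic rank discount borrowed from discounted cumulative gain, and a hypothetical `sensitivity_cost` weight; the function name `cost_sensitive_dcg` and all parameter values are invented for illustration.

```python
import math
from typing import List, Tuple


def cost_sensitive_dcg(
    ranking: List[Tuple[bool, bool]],
    sensitivity_cost: float = 2.0,  # hypothetical weight, not from the project
) -> float:
    """Score a ranked list of (is_relevant, is_sensitive) labels, top first.

    Each relevant document adds a gain of 1; each sensitive document that is
    shown subtracts `sensitivity_cost`. Both are discounted by log2(rank + 1),
    which is an assumption made here for simplicity.
    """
    score = 0.0
    for rank, (is_relevant, is_sensitive) in enumerate(ranking, start=1):
        discount = 1.0 / math.log2(rank + 1)
        gain = 1.0 if is_relevant else 0.0
        cost = sensitivity_cost if is_sensitive else 0.0
        score += (gain - cost) * discount
    return score


# A ranking that shows one relevant, non-sensitive document first.
safe = [(True, False), (False, False)]
# A ranking that shows a relevant but sensitive document first.
leaky = [(True, True), (True, False)]

safe_score = cost_sensitive_dcg(safe)
leaky_score = cost_sensitive_dcg(leaky)
```

Under these assumptions the ranking that exposes a sensitive document scores lower than the one that does not, even though it returns more relevant content. In practice the appropriate weight would depend on the application context, which is the kind of balance the survey work described above is meant to elicit.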

Project Pages:

Related Links:

Page created: December 11, 2017

This material is based upon work supported by the National Science Foundation under Grant No. 1618695. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.