If you are interested in obtaining this dataset, please note that your application must provide documentation that your project plan has been reviewed and approved by an Institutional Review Board or equivalent ethical review panel at your institution or organization, as specified in Part 2 of the application. We cannot review applications that lack such documentation. If you are a graduate student, please note that documentation of review by an advisor or supervisor does not satisfy this requirement.
We introduced Version 1 of the dataset in Shing et al. (2018). As reported there, expert annotation of users in this dataset for level of suicide risk (on a four-point scale of no, low, moderate, and severe risk) yielded what is, to our knowledge, the first demonstration of reliable risk assessment by clinicians based on social media postings. The paper also introduces and demonstrates the value of a new, detailed rubric for assessing suicide risk, compares crowdsourced with expert performance, and presents baseline predictive modeling experiments using the new dataset.
Subsequently, we updated the dataset for the shared task on predicting degree of suicide risk from Reddit posts, run as part of the 2019 Computational Linguistics and Clinical Psychology Workshop (CLPsych 2019), held at the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) (Zirikly et al., 2019). Updates included automatic de-identification of post titles and bodies, the definition of a standard training/test split to facilitate head-to-head comparisons of system performance during the shared task, and the removal of some Version 1 posts because of encoding issues.
The currently available Version 2 of the dataset includes the training and test data from the 2019 CLPsych shared task (with consensus annotations based on crowdsourcing) plus the expert-annotated data (which was not used in the shared task). We recommend using the crowdsourced train/test split for direct comparison with the 2019 shared task papers, and reserving the full expert-annotated dataset for final testing, since the expert annotations have strong inter-rater reliability.
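To make that protocol concrete, here is a minimal sketch of one way to organize it, assuming a simple scikit-learn text-classification baseline; the loading functions (load_crowd_split, load_expert_set) are hypothetical placeholders for code you would write against the released files and are not part of the dataset distribution.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def evaluate(train_texts, train_labels, test_texts, test_labels):
    # Fit a simple TF-IDF + logistic regression baseline and report macro-F1.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)
    return f1_score(test_labels, model.predict(test_texts), average="macro")

# Develop and compare against CLPsych 2019 systems on the crowdsourced split,
# then run a single final evaluation on the expert-annotated set.
# (load_crowd_split and load_expert_set are hypothetical loaders, not dataset code.)
# (x_train, y_train), (x_test, y_test) = load_crowd_split()
# print("crowd test macro-F1:", evaluate(x_train, y_train, x_test, y_test))
# x_expert, y_expert = load_expert_set()
# print("expert set macro-F1:", evaluate(x_train, y_train, x_expert, y_expert))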
The dataset is accompanied by documentation about its format. Briefly, it contains one subdirectory with data pertaining to 11,129 users who posted on SuicideWatch, and another for 11,129 users who did not. For each user, we have full longitudinal data from the 2015 Full Reddit Submission Corpus, including, for each post, the post ID, anonymized user ID, timestamp, subreddit, de-identified post title, and de-identified post body. In addition, we have two sets of human risk-level annotations for subsets of the users, obtained via crowdsourced annotation (621 users who posted on SuicideWatch and 621 who did not) and expert annotation (245 users who posted on SuicideWatch, paired with 245 control users who did not). In both cases we generated a user-level consensus label using the Dawid and Skene (1979) model for inferring true item labels from multiple noisy annotations (Passonneau and Carpenter, 2014; see discussion in Shing et al., 2018).
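For readers unfamiliar with that consensus step, the following is a minimal, illustrative EM sketch of the Dawid and Skene model in Python with NumPy. It is not the code used to produce the released labels, and the input format (a list of (item, annotator, label) index triples) is an assumption made for this example.

import numpy as np

def dawid_skene(triples, n_items, n_annotators, n_classes, n_iter=50):
    # counts[i, a, c] = number of times annotator a assigned class c to item i
    counts = np.zeros((n_items, n_annotators, n_classes))
    for item, annotator, label in triples:
        counts[item, annotator, label] += 1

    # Initialize item-class posteriors with (smoothed) per-item vote proportions.
    T = counts.sum(axis=1) + 1e-6
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and per-annotator confusion matrices.
        priors = T.mean(axis=0)
        conf = np.einsum("ik,iac->akc", T, counts) + 1e-6  # conf[a, true, observed]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over the true class of each item.
        log_T = np.log(priors) + np.einsum("iac,akc->ik", counts, np.log(conf))
        log_T -= log_T.max(axis=1, keepdims=True)
        T = np.exp(log_T)
        T /= T.sum(axis=1, keepdims=True)

    return T.argmax(axis=1), T  # consensus label and posterior for each item

The returned argmax over the item posteriors plays the role of the user-level consensus label described above.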
In addition to reading and citing the papers below, people using this dataset may wish to read Gaffney and Matias (2018). Published subsequent to Shing et al. (2018), this article provides caveats regarding the use of the 2015 Reddit Corpus related to missing data, which we discuss in Zirikly et al. (2019).
Han-Chin Shing, Suraj Nair, Ayah Zirikly, Meir Friedenberg, Hal Daumé III, and Philip Resnik. 2018. Expert, Crowdsourced, and Machine Assessment of Suicide Risk via Online Postings. In Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, pages 25–36, New Orleans, Louisiana, June 5, 2018.

@inproceedings{shing2018expert,
  title={Expert, Crowdsourced, and Machine Assessment of Suicide Risk via Online Postings},
  author={Shing, Han-Chin and Nair, Suraj and Zirikly, Ayah and Friedenberg, Meir and {Daum{\'e} III}, Hal and Resnik, Philip},
  booktitle={Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic},
  pages={25--36},
  year={2018}
}

Ayah Zirikly, Philip Resnik, Özlem Uzuner, and Kristy Hollingshead. 2019. CLPsych 2019 Shared Task: Predicting the Degree of Suicide Risk in Reddit Posts. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology (CLPsych'19), Minneapolis, Minnesota, June 6, 2019.

@inproceedings{zirikly2019clpsych,
  title={{CLPsych} 2019 Shared Task: Predicting the Degree of Suicide Risk in {Reddit} Posts},
  author={Zirikly, Ayah and Resnik, Philip and Uzuner, {\"O}zlem and Hollingshead, Kristy},
  booktitle={Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology},
  address={Minneapolis, Minnesota},
  month={June},
  year={2019}
}
Even with IRB approval for sharing, and even for an anonymized and/or de-identified dataset, particular care needs to be taken with sensitive data of this kind (Benton et al., 2017; Chancellor et al., 2019). We have therefore established a collaboration with the American Association of Suicidology (AAS) to put in place a governance process for researcher access to the dataset, described below. The governance process involves review of applications for access by a governance committee of five volunteers established by AAS, which includes Philip Resnik (lead investigator at the University of Maryland) and four people affiliated with and/or designated by AAS. The AAS contact person regarding this dataset is Tony Wood, chair of the AAS Board of Directors.
Three of the five members of the governance committee, selected based on availability, will review each request for access submitted in the format specified below. Outcomes of the review include the following responses:
The governance committee will attend to and encourage diversity and inclusion with respect to the set of reviewers and the community of researchers using the dataset.