If you are interested in obtaining this dataset, please note that your application must provide documentation that your project plan has been reviewed and approved by an Institutional Review Board or equivalent ethical review panel at your institution or organization, as specified in Part 2 of the application. We cannot review applications that lack such documentation. If you are a graduate student, please note that documentation of review by an advisor or supervisor does not satisfy this requirement.
We introduced Version 1 of the dataset in Shing et al. (2018). As reported there, expert annotation of users in this dataset for level of suicide risk (on a four-point scale of no, low, moderate, and severe risk) yielded what is, to our knowledge, the first demonstration of reliable risk assessment by clinicians based on social media postings. The paper also introduces and demonstrates the value of a new, detailed rubric for assessing suicide risk, compares crowdsourced with expert performance, and presents baseline predictive modeling experiments using the new dataset.
Subsequently, we updated the dataset for the shared task on predicting degree of suicide risk from Reddit posts, run as part of the 2019 Computational Linguistics and Clinical Psychology Workshop (CLPsych 2019), held at the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) (Zirikly et al., 2019). Updates included automatic de-identification of post titles and bodies, the definition of a standard training/test split to facilitate head-to-head comparisons of system performance during the shared task, and the removal of some Version 1 posts because of encoding issues.
The currently available Version 2 of the dataset includes the training and test data from the 2019 CLPsych shared task (with consensus annotations based on crowdsourcing) plus the expert-annotated data (which was not used in the shared task). We recommend using the crowdsourced train/test split for direct comparison with the 2019 shared task papers, and reserving the full expert-annotated dataset for final testing, since the expert annotations have strong inter-rater reliability.
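To make that protocol concrete, here is a minimal sketch of one way to organize it, assuming a simple scikit-learn text-classification baseline; the loading functions (load_crowd_split, load_expert_set) are hypothetical placeholders for code you would write against the released files and are not part of the dataset distribution.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def evaluate(train_texts, train_labels, test_texts, test_labels):
    # Fit a simple TF-IDF + logistic regression baseline and report macro-F1.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)
    return f1_score(test_labels, model.predict(test_texts), average="macro")

# Develop and compare against CLPsych 2019 systems on the crowdsourced split,
# then run a single final evaluation on the expert-annotated set.
# (load_crowd_split and load_expert_set are hypothetical loaders, not dataset code.)
# (x_train, y_train), (x_test, y_test) = load_crowd_split()
# print("crowd test macro-F1:", evaluate(x_train, y_train, x_test, y_test))
# x_expert, y_expert = load_expert_set()
# print("expert set macro-F1:", evaluate(x_train, y_train, x_expert, y_expert))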
The dataset is accompanied by documentation about its format. Briefly, it contains one subdirectory with data pertaining to 11,129 users who posted on SuicideWatch, and another for 11,129 users who did not. For each user, we have full longitudinal data from the 2015 Full Reddit Submission Corpus, including, for each post, the post ID, anonymized user ID, timestamp, subreddit, de-identified post title, and de-identified post body. In addition, we have two sets of human risk-level annotations for subsets of the users, obtained via crowdsourced annotation (621 users who posted on SuicideWatch and 621 who did not) and expert annotation (245 users who posted on SuicideWatch, paired with 245 control users who did not). In both cases we generated a user-level consensus label using the Dawid and Skene (1979) model for inferring true item labels from multiple noisy annotations (Passonneau and Carpenter, 2014; see discussion in Shing et al., 2018).
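For readers unfamiliar with that consensus step, the following is a minimal, illustrative EM sketch of the Dawid and Skene model in Python with NumPy. It is not the code used to produce the released labels, and the input format (a list of (item, annotator, label) index triples) is an assumption made for this example.

import numpy as np

def dawid_skene(triples, n_items, n_annotators, n_classes, n_iter=50):
    # counts[i, a, c] = number of times annotator a assigned class c to item i
    counts = np.zeros((n_items, n_annotators, n_classes))
    for item, annotator, label in triples:
        counts[item, annotator, label] += 1

    # Initialize item-class posteriors with (smoothed) per-item vote proportions.
    T = counts.sum(axis=1) + 1e-6
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and per-annotator confusion matrices.
        priors = T.mean(axis=0)
        conf = np.einsum("ik,iac->akc", T, counts) + 1e-6  # conf[a, true, observed]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over the true class of each item.
        log_T = np.log(priors) + np.einsum("iac,akc->ik", counts, np.log(conf))
        log_T -= log_T.max(axis=1, keepdims=True)
        T = np.exp(log_T)
        T /= T.sum(axis=1, keepdims=True)

    return T.argmax(axis=1), T  # consensus label and posterior for each item

The returned argmax over the item posteriors plays the role of the user-level consensus label described above.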
In addition to reading and citing the papers below, people using this dataset may wish to read Gaffney and Matias (2018). Published subsequent to Shing et al. (2018), this article provides caveats regarding the use of the 2015 Reddit Corpus related to missing data, which we discuss in Zirikly et al. (2019).
Han-Chin Shing, Suraj Nair, Ayah Zirikly, Meir Friedenberg, Hal Daumé III, and Philip Resnik. 2018. Expert, Crowdsourced, and Machine Assessment of Suicide Risk via Online Postings. In Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, pages 25–36, New Orleans, Louisiana, June 5, 2018.

@inproceedings{shing2018expert,
  title={Expert, Crowdsourced, and Machine Assessment of Suicide Risk via Online Postings},
  author={Shing, Han-Chin and Nair, Suraj and Zirikly, Ayah and Friedenberg, Meir and {Daum{\'e} III}, Hal and Resnik, Philip},
  booktitle={Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic},
  pages={25--36},
  year={2018}
}

Ayah Zirikly, Philip Resnik, Özlem Uzuner, and Kristy Hollingshead. 2019. CLPsych 2019 Shared Task: Predicting the Degree of Suicide Risk in Reddit Posts. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology (CLPsych'19), Minneapolis, Minnesota, June 6, 2019.

@inproceedings{zirikly2019clpsych,
  title={{CLPsych} 2019 Shared Task: Predicting the Degree of Suicide Risk in {Reddit} Posts},
  author={Zirikly, Ayah and Resnik, Philip and Uzuner, {\"O}zlem and Hollingshead, Kristy},
  booktitle={Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology},
  address={Minneapolis, Minnesota},
  month={June},
  year={2019}
}
Even with IRB approval for sharing, and even for an anonymized and/or de-identified dataset, particular care needs to be taken with sensitive data of this kind (Benton et al., 2017; Chancellor et al., 2019). We have therefore established a collaboration with the American Association of Suicidology (AAS) to put in place a governance process for researcher access to the dataset, described below. The governance process involves review of applications for access by a governance committee of five volunteers established by AAS, which includes Philip Resnik (lead investigator at the University of Maryland) and four people affiliated with and/or designated by AAS. The AAS contact person regarding this dataset is Tony Wood, chair of the AAS Board of Directors.
Three of the five members of the governance committee, selected based on availability, will review each request for access submitted in the format specified below. Outcomes of the review include the following responses:
The governance committee will attend to and encourage diversity and inclusion with respect to the set of reviewers and the community of researchers using the dataset.