BurnoutText - Frequent Words in Texts about Burnout, Depression and a Control Group

This dataset was generated as part of a research project funded by the Swiss National Science Foundation (grant no. 196483, see https://data.snf.ch/grants/grant/196483). The project applies natural language processing to develop new methods for burnout detection in clinical psychology and psychiatry. For details, refer to: https://www.bfh.ch/en/research/research-projects/2021-288-996-826/

The source data for this derived dataset was collected from Reddit and consists of a "Burnout" dataset with 352 samples, a "No burnout" dataset with 13,216 samples and a "Depression" dataset with 979 samples. More details about the original dataset can be found in the following publication: https://doi.org/10.3389/fdata.2022.863100

All contractions were expanded (e.g. "I'm" to "I am") using the contractions Python library. We used the spaCy en_core_web_sm pre-trained English language pipeline to tokenize each text sample, remove stopwords and punctuation, and lemmatize the remaining tokens. For example, the text "I feel like I have been working too much. Everything is exhausting." would be converted to "feel like work exhausting".

The dataset presented here was then compiled by counting the lemmatized tokens in each class (Burnout, No burnout and Depression) and keeping the 20 most frequent tokens per class. The words are ordered from most frequent to least frequent.
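The preprocessing and counting steps described above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the function names (preprocess, top_lemmas, build_top_words) and the input layout (a dict mapping class name to a list of raw texts) are assumptions made for this example, as is the filtering of whitespace tokens. The third-party packages contractions and spaCy, plus the en_core_web_sm model, must be installed separately.

```python
from collections import Counter


def preprocess(text, nlp):
    """Expand contractions, then keep lemmas of non-stopword, non-punctuation tokens."""
    # Imported lazily so the counting helper below works without this package.
    import contractions  # third-party: pip install contractions

    expanded = contractions.fix(text)  # e.g. "I'm" -> "I am"
    return [
        tok.lemma_
        for tok in nlp(expanded)
        # Dropping whitespace tokens is an assumption of this sketch.
        if not (tok.is_stop or tok.is_punct or tok.is_space)
    ]


def top_lemmas(token_lists, n=20):
    """Count lemmas over all samples of one class; return the n most frequent first."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)


def build_top_words(samples_by_class, n=20):
    """Hypothetical driver: maps each class name to its n most frequent lemmas."""
    import spacy  # third-party; model: python -m spacy download en_core_web_sm

    nlp = spacy.load("en_core_web_sm")
    return {
        label: top_lemmas([preprocess(text, nlp) for text in texts], n)
        for label, texts in samples_by_class.items()
    }
```

Calling build_top_words({"Burnout": [...], "No burnout": [...], "Depression": [...]}) with the raw Reddit texts would return one frequency-ordered list of (lemma, count) pairs per class. The exact lemmas depend on the spaCy model version, so results may differ slightly from the published lists.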

Organizational unit: BFH - Institute for Data Applications and Security (IDAS)

Keywords: burnout, natural language processing, machine learning, augmented intelligence, ensemble classifier, psychology, mental health

Publication date: 06/21/2022

Retention date: ∞

Access level: Public

Sensitivity: Blue
