Social media platforms such as Instagram, Twitter and Reddit provide invaluable spaces for building communities and seeking peer support with mental health issues. However, adequately moderating content on these platforms and their sub-communities can be both time-intensive and emotionally draining. This project demonstrates how to (i) scrape data from Reddit in Python; (ii) clean and format this data; and (iii) build a spaCy textcat model that can predict the label (trigger warning) of potentially sensitive content.
More details of the project can be found on the slides included in this repo.
The scraped dataset comprises around 142,000 documents after cleaning. There is a slight class imbalance: ED and OCD have the most observations, while bipolar and ADHD have the fewest.
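A class imbalance like this can be checked with a quick label count. The sketch below is illustrative only: the helper `label_distribution` is hypothetical, and the toy labels are made up rather than taken from the real dataset.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, share)} ordered from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}


# Toy labels for illustration only; these are NOT the real dataset counts.
toy_labels = ["ED", "ED", "ED", "OCD", "OCD", "OCD", "bipolar", "ADHD"]

for label, (n, share) in label_distribution(toy_labels).items():
    print(f"{label:<8} {n:>3}  {share:.1%}")
```

On the real data, the same function could be applied to the label column of the cleaned dataset to see how far the smallest classes fall below the largest.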
## spaCy Model
The following commands are defined in this project. They can be executed using `spacy project run [command]`.
| Command | Description |
| --- | --- |
| `convert` | Convert the data to spaCy's binary format |
| `train` | Train the textcat model |
| `evaluate` | Evaluate the model and export metrics |
| `package` | Package the trained model as a pip package |
| `visualize-model` | Visualize the model's output interactively using Streamlit |
The following workflow is defined by the project. It can be executed using `spacy project run all` and will run the specified commands in order.
| Workflow | Steps |
| --- | --- |
| `all` | `convert` → `train` → `evaluate` → `package` |
The following assets are necessary to run the project. They can be generated by running the Jupyter notebook `scraping_reddit.ipynb` found in the project directory.
| File | Source | Description |
| --- | --- | --- |
| `assets/reddit-train.jsonl` | Local | Training data scraped from Reddit |
| `assets/reddit-dev.jsonl` | Local | Development data scraped from Reddit |
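The exact record layout of the `.jsonl` assets is not documented here. As a rough sketch, assuming one label per post and the `text`/`cats` layout commonly used in spaCy textcat demo projects, a single line could be built as follows. The helpers `make_cats` and `to_jsonl_line` are hypothetical, and `LABELS` is an illustrative subset of the label set, not the project's full list.

```python
import json


def make_cats(label, all_labels):
    """Build a one-hot category dict for a single-label (exclusive) task."""
    if label not in all_labels:
        raise ValueError(f"unknown label: {label!r}")
    return {name: 1.0 if name == label else 0.0 for name in all_labels}


def to_jsonl_line(text, label, all_labels):
    """Serialise one post as a JSON line with assumed 'text' and 'cats' keys."""
    record = {"text": text, "cats": make_cats(label, all_labels)}
    return json.dumps(record, ensure_ascii=False)


# Illustrative subset of labels only; the project's full label set may differ.
LABELS = ["ADHD", "ED", "OCD", "bipolar"]

line = to_jsonl_line("example post text", "OCD", LABELS)
record = json.loads(line)
print(line)
```

Keeping every label present in `cats`, with the others set to `0.0`, makes each record explicit about which categories a post does not belong to.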
The following CSV files can also be generated by running the Jupyter notebook `scraping_reddit.ipynb` found in the project directory.
| File | Source | Description |
| --- | --- | --- |
| `assets/raw_reddit_dataset.csv` | Local | Whole raw dataset scraped from Reddit, exported as CSV |
| `assets/cleaned_reddit_dataset.csv` | Local | Whole cleaned dataset from Reddit, exported as CSV |
| `assets/reddit-train.csv` | Local | Training data exported as CSV |
| `assets/reddit-dev.csv` | Local | Development data exported as CSV |
| `assets/reddit-test.csv` | Local | Test data exported as CSV |
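How the notebook produces the train, dev and test files is not described here. One common approach, sketched below under that caveat, is a stratified split that keeps each label's share roughly equal across the three sets. The function `stratified_split`, the `label` key and the 80/10/10 fractions are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", fractions=(0.8, 0.1), seed=0):
    """Split rows into train/dev/test while preserving per-label proportions.

    `fractions` gives the train and dev shares; the remainder goes to test.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, dev, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)  # shuffle within each label before slicing
        n_train = int(len(group) * fractions[0])
        n_dev = int(len(group) * fractions[1])
        train.extend(group[:n_train])
        dev.extend(group[n_train:n_train + n_dev])
        test.extend(group[n_train + n_dev:])
    return train, dev, test


# Toy rows for illustration only.
toy_rows = [{"text": f"post {i}", "label": "ED"} for i in range(10)]
toy_rows += [{"text": f"post {i}", "label": "OCD"} for i in range(10, 20)]

train, dev, test = stratified_split(toy_rows)
print(len(train), len(dev), len(test))
```

With a slight class imbalance, splitting per label avoids the smallest classes being under-represented in the dev or test set by chance.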
The final model performed well on unseen data, with a macro F1 score of 80.79 and an average ROC-AUC score of 0.96. This model could be used by platform moderators to help streamline the process of sifting through content and reduce their workload. Alternatively, it could be applied automatically to posts so that users can see what a post is about before they read it and act accordingly.
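For readers unfamiliar with the metric, macro F1 is the unweighted mean of the per-class F1 scores, so small classes such as ADHD count as much as large ones. The sketch below is a plain-Python illustration of that definition, not the project's evaluation code. It returns a value between 0 and 1, whereas the figure reported above appears to be on a 0 to 100 scale.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy predictions for illustration only.
y_true = ["ED", "ED", "OCD", "OCD"]
y_pred = ["ED", "OCD", "OCD", "OCD"]

score = macro_f1(y_true, y_pred)
print(f"macro F1: {score:.4f}")
```

In this toy case ED has an F1 of 2/3 and OCD an F1 of 4/5, and the macro score is their simple average.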
A potential limitation of this project is that we do not collect any information on user demographics, so it is difficult to say how well the model developed here would generalise to other (non-Reddit) data.
Future work could also focus on adding an NER component, or some other form of keyword extraction, to the spaCy pipeline in order to further assist moderators in processing content.
[1] "Classification of 'Triggering' Content on Social Media" by Keelin Sekerka-Bajbus. Available: https://github.com/ksek87/trigger-warning-classification
[2] "spaCy Project: Demo Multilabel Textcat (Text Classification)" by ExplosionAI. Available: https://github.com/explosion/projects/tree/v3/pipelines/textcat_multilabel_demo
[3] "Reddit," reddit. [Online]. Available: https://www.reddit.com/
[4] Pushshift.io. (2019). Pushshift.io. Available: https://pushshift.io/