⚠️ Trigger Warning Prediction: Text Classification of Posts scraped from Reddit ⚠️

Overview

Social media platforms such as Instagram, Twitter and Reddit provide invaluable spaces for building communities and seeking peer support with mental health issues. However, adequately moderating content on these platforms and their sub-communities can be both time-intensive and emotionally draining. This project demonstrates how to (i) scrape data from Reddit in Python; (ii) clean and format this data; and (iii) build a spaCy textcat model that can predict the label (trigger warning) of potentially sensitive content.

More details of the project can be found in the slides included in this repo (`reddit-textcat-ppt.pdf`).

Dataset

The scraped dataset comprises around 142,000 documents after cleaning. There is a slight class imbalance, with ED and OCD having the most observations while bipolar and ADHD have the fewest.
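A quick way to inspect the imbalance is to count documents per label. The sketch below assumes each record is a dict with a `label` key holding its class; that field name, the `class_shares` helper, and the toy rows are illustrative and not taken from the notebook.

```python
from collections import Counter


def class_shares(records, label_key="label"):
    """Return {label: share of documents}, largest class first."""
    counts = Counter(rec[label_key] for rec in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}


# Toy rows standing in for the cleaned dataset (hypothetical values).
sample = [
    {"text": "...", "label": "ED"},
    {"text": "...", "label": "ED"},
    {"text": "...", "label": "OCD"},
    {"text": "...", "label": "ADHD"},
]
shares = class_shares(sample)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```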

spaCy Model

⏯ Commands

The following commands are defined in this project. They can be executed using `spacy project run [command]`.

| Command | Description |
| --- | --- |
| `convert` | Convert the data to spaCy's binary format |
| `train` | Train the textcat model |
| `evaluate` | Evaluate the model and export metrics |
| `package` | Package the trained model as a pip package |
| `visualize-model` | Visualize the model's output interactively using Streamlit |
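Once a pipeline has been trained, spaCy's text categorizer exposes its predictions as `doc.cats`, a dict mapping each label to a score. A minimal sketch of turning those scores into a single trigger warning follows. The `pick_warning` helper, the 0.5 threshold, and the example scores are assumptions for illustration and are not part of this project.

```python
def pick_warning(cats, threshold=0.5):
    """Return the highest-scoring label, or None if no score reaches the threshold."""
    if not cats:
        return None
    label, score = max(cats.items(), key=lambda item: item[1])
    return label if score >= threshold else None


# Example scores shaped like `doc.cats` (made-up numbers).
scores = {"ED": 0.91, "OCD": 0.05, "ADHD": 0.04}
print(pick_warning(scores))
print(pick_warning({"ED": 0.3, "OCD": 0.3}))
```

With a loaded pipeline `nlp`, this would be called as `pick_warning(nlp(text).cats)`. Returning `None` below the threshold lets a moderation tool leave low-confidence posts unlabelled instead of guessing.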

⏭ Workflows

The following workflow is defined by the project. It can be executed using `spacy project run all` and will run the specified commands in order.

| Workflow | Steps |
| --- | --- |
| `all` | `convert` → `train` → `evaluate` → `package` |

🗂 Assets

The following assets are necessary to run the project. They can be generated by running the Jupyter notebook `scraping_reddit.ipynb` found in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/reddit-train.jsonl` | Local | Training data scraped from Reddit |
| `assets/reddit-dev.jsonl` | Local | Development data scraped from Reddit |
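spaCy's text classification examples commonly store one JSON object per line, with the text and a `cats` dict of label flags. The sketch below writes records in that shape. Whether the notebook uses exactly these keys is an assumption, and `to_record`, `write_jsonl`, the label list, and the example posts are all hypothetical.

```python
import io
import json


def to_record(text, label, labels):
    """Build one record with a one-hot `cats` dict over all labels."""
    return {"text": text, "cats": {name: float(name == label) for name in labels}}


def write_jsonl(records, handle):
    """Write one JSON object per line to an open text handle."""
    for rec in records:
        handle.write(json.dumps(rec, ensure_ascii=False) + "\n")


labels = ["ADHD", "ED", "OCD"]  # illustrative subset of classes
rows = [("example post about food", "ED"), ("example post about rituals", "OCD")]

buffer = io.StringIO()  # stands in for an opened .jsonl file
write_jsonl((to_record(text, label, labels) for text, label in rows), buffer)
lines = buffer.getvalue().splitlines()
print(lines[0])
```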

🗂 Other Data

These files can be generated by running the Jupyter notebook `scraping_reddit.ipynb` found in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/raw_reddit_dataset.csv` | Local | Whole raw dataset scraped from Reddit, exported as CSV |
| `assets/cleaned_reddit_dataset.csv` | Local | Whole cleaned dataset from Reddit, exported as CSV |
| `assets/reddit-train.csv` | Local | Training data exported as CSV |
| `assets/reddit-dev.csv` | Local | Development data exported as CSV |
| `assets/reddit-test.csv` | Local | Test data exported as CSV |

Result

The final model performed well on unseen data, with a macro F1 score of 80.79 and an average ROC-AUC score of 0.96. Platform moderators could use this model to streamline sifting through content and reduce their workload. Alternatively, it could be applied automatically to posts so that users can see what a post is about before they read it and act accordingly.
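Macro F1 is the unweighted mean of the per-class F1 scores, so small classes count as much as large ones. The sketch below computes it from scratch on toy labels. It is not the project's evaluation script, the labels are made up, and it returns a value between 0 and 1 where the figure above is reported on a 0 to 100 scale.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, with F1 = 2TP / (2TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy gold and predicted labels (hypothetical).
y_true = ["ED", "ED", "OCD", "ADHD"]
y_pred = ["ED", "OCD", "OCD", "ADHD"]
result = macro_f1(y_true, y_pred)
print(f"macro F1 = {result:.4f}")
```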

A potential limitation of this project is that we do not collect any information on user demographics, so it is difficult to say how well the model developed here would generalise to other (non-Reddit) data.

Future work could also focus on adding an NER or other keyword extraction component to the spaCy pipeline to further assist moderators in processing content.

📚 References

[1] "Classification of 'Triggering' Content on Social Media" by Keelin Sekerka-Bajbus. Available: https://github.com/ksek87/trigger-warning-classification.

[2] "spaCy Project: Demo Multilabel Textcat (Text Classification)" by ExplosionAI. Available: https://github.com/explosion/projects/tree/v3/pipelines/textcat_multilabel_demo.

[3] “Reddit,” reddit. [Online]. Available: https://www.reddit.com/.

[4] Pushshift.io. (2019). Pushshift.io. Available: https://pushshift.io/.
