New dataset
Jonathon Belotti edited this page Aug 27, 2017
·
2 revisions
- Comment - the actual text of the comment
- Date - timestamp of comment, e.g. 20120618192155Z (because the original dataset has this)
- Insult - classification, i.e. is it an insult or not (1 -> true, 0 -> false)
- Usage - ??? it's in the original dataset
- Previous comment - humans would use context to classify comments. (this may not be available on all comments)
- Labels - metadata about comment, to help with dataset and model performance analysis (eg. sexist, racist, sarcasm)
- Difficulty - rough judgement about how obvious the classification would be to a human
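As a rough illustration, the fields above could be modelled as a single record type. This is only a sketch under assumptions: the names `CommentRecord`, `parse_comment_date` and `make_record` are hypothetical, the types chosen for Usage and Difficulty are guesses (the page does not define them), and treating an empty date string as missing is an assumption rather than something stated here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


def parse_comment_date(raw: str) -> datetime:
    """Parse a timestamp such as 20120618192155Z into a UTC datetime."""
    # The trailing "Z" is matched literally, then UTC is attached explicitly.
    parsed = datetime.strptime(raw, "%Y%m%d%H%M%SZ")
    return parsed.replace(tzinfo=timezone.utc)


@dataclass
class CommentRecord:
    comment: str                              # the actual text of the comment
    date: Optional[datetime]                  # timestamp of the comment, if known
    insult: bool                              # True for 1, False for 0
    usage: Optional[str] = None               # meaning unclear; kept as an opaque string
    previous_comment: Optional[str] = None    # context; not available on all comments
    labels: List[str] = field(default_factory=list)  # e.g. "sexist", "racist", "sarcasm"
    difficulty: Optional[str] = None          # assumed free-form; no scale is defined


def make_record(comment: str, date: str, insult: str, **extra) -> CommentRecord:
    """Build a record from raw string values (hypothetical helper)."""
    if insult not in ("0", "1"):
        raise ValueError(f"insult must be '0' or '1', got {insult!r}")
    return CommentRecord(
        comment=comment,
        # Assumption: an empty date string means the timestamp is missing.
        date=parse_comment_date(date) if date else None,
        insult=(insult == "1"),
        **extra,
    )
```

For example, `make_record("some comment text", "20120618192155Z", "1", labels=["sarcasm"])` yields a record whose `insult` is `True` and whose `previous_comment` stays `None` until context is attached.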
Reddit is probably the easiest to scrape; the only challenge will be maintaining a diverse set of comments. Facebook should be included because, well, it's the biggest social media platform and should have some representation. Twitter should be included because language in tweets is often abbreviated or distorted to fit in 140 chars, and that presents an interesting challenge.