New dataset

Data Fields

Comment - the actual text of the comment
Date - timestamp of the comment, e.g. 20120618192155Z (kept because the original dataset has this field)
Insult - classification, i.e. whether the comment is an insult (1 -> true, 0 -> false)
Usage - purpose unclear (???); kept because it is in the original dataset
Previous comment - the preceding comment, included because humans would use context to classify comments (this may not be available for all comments)
Labels - metadata about the comment, to help with dataset and model performance analysis (e.g. sexist, racist, sarcasm)
Difficulty - rough judgement of how obvious the classification would be to a human
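
As a rough illustration of the fields above, here is a minimal Python sketch of one record. The names `CommentRecord` and `parse_date`, the Python types, and the sample values are all assumptions made for this example, not part of the original dataset. In particular, the type of Usage and the scale for Difficulty are not defined yet, so both are left as optional strings, and the trailing `Z` in the timestamp is assumed to mean UTC.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Format matching the example timestamp 20120618192155Z.
DATE_FORMAT = "%Y%m%d%H%M%SZ"


def parse_date(raw: str) -> datetime:
    """Parse a timestamp such as 20120618192155Z (assumed to be UTC)."""
    return datetime.strptime(raw, DATE_FORMAT).replace(tzinfo=timezone.utc)


@dataclass
class CommentRecord:
    """Hypothetical container for one row of the new dataset."""
    comment: str                             # Comment: the actual text
    insult: int                              # Insult: 1 -> true, 0 -> false
    date: Optional[str] = None               # Date: raw timestamp string
    usage: Optional[str] = None              # Usage: meaning unclear, kept as-is
    previous_comment: Optional[str] = None   # Previous comment: may be missing
    labels: List[str] = field(default_factory=list)  # e.g. ["sarcasm"]
    difficulty: Optional[str] = None         # Difficulty: scale not yet defined

    def __post_init__(self) -> None:
        # Reject anything other than the two documented class values.
        if self.insult not in (0, 1):
            raise ValueError("insult must be 0 or 1")
        # Fail early if the timestamp does not match the expected format.
        if self.date is not None:
            parse_date(self.date)


# Made-up sample row, for illustration only.
example = CommentRecord(
    comment="Nice try, genius.",
    insult=1,
    date="20120618192155Z",
    labels=["sarcasm"],
)
```

Keeping Previous comment, Labels and Difficulty optional reflects the note that context may not be available for every comment, and it leaves room to settle the label vocabulary and difficulty scale later.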

Sources

Reddit is probably the easiest to scrape; the only challenge will be maintaining a diverse set of comments. Facebook should be included because, well, it's the biggest social media platform and should have some representation. Twitter should be included because language in comments/tweets is often corrupted to fit within the character limit (originally 140 characters, 280 since 2017), and that presents an interesting challenge.
