
New dataset

Jonathon Belotti edited this page Aug 27, 2017 · 2 revisions

Data Fields

  1. Comment - the actual text of the comment
  2. Date - timestamp of the comment, e.g. 20120618192155Z (kept because the original dataset has it)
  3. Insult - classification, i.e. whether the comment is an insult (1 -> true, 0 -> false)
  4. Usage - meaning unclear, but it's in the original dataset
  5. Previous comment - context for the comment, since humans would use context to classify comments (may not be available for every comment)
  6. Labels - metadata about the comment to help with dataset and model performance analysis (e.g. sexist, racist, sarcasm)
  7. Difficulty - rough judgement of how obvious the classification would be to a human
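
As a rough illustration of the fields above, one record could be sketched in Python as follows. The class name `CommentRecord`, the helper names, the choice of optional fields, and the Python types are all assumptions made for the example, not part of the original dataset. In particular, the scale for Difficulty isn't defined here, so it's kept as a free-form string, and the trailing `Z` in the timestamp is assumed to mean UTC.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Matches timestamps like 20120618192155Z (the trailing "Z" is a literal here).
DATE_FORMAT = "%Y%m%d%H%M%SZ"


@dataclass
class CommentRecord:
    """Hypothetical in-memory shape for one row of the new dataset."""
    comment: str                              # 1. the actual text of the comment
    insult: bool                              # 3. True if the comment is an insult
    date: Optional[datetime] = None           # 2. timestamp, if known
    usage: Optional[str] = None               # 4. carried over from the original dataset
    previous_comment: Optional[str] = None    # 5. not available for every comment
    labels: List[str] = field(default_factory=list)  # 6. e.g. ["sexist", "sarcasm"]
    difficulty: Optional[str] = None          # 7. scale not defined; free-form (assumption)


def parse_date(raw: str) -> datetime:
    """Parse a timestamp such as 20120618192155Z, assuming Z means UTC."""
    return datetime.strptime(raw, DATE_FORMAT).replace(tzinfo=timezone.utc)


def parse_insult(raw: str) -> bool:
    """Map the 1/0 encoding of the Insult field to a boolean."""
    if raw not in ("0", "1"):
        raise ValueError(f"expected '0' or '1', got {raw!r}")
    return raw == "1"


record = CommentRecord(
    comment="example comment text",
    insult=parse_insult("0"),
    date=parse_date("20120618192155Z"),
    labels=["sarcasm"],
)
```

Defaulting Previous comment, Labels and Difficulty to empty values keeps rows imported from the original dataset valid, since those rows would not have these fields.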

Sources

Reddit is probably the easiest to scrape; the only challenge will be maintaining a diverse set of comments. Facebook should be included because it's the biggest social media platform and should have some representation. Twitter should be included because language in comments/tweets is often corrupted to fit in 140 chars, which presents an interesting challenge.
