Invalid annotation: singleton character tsd_trial.csv#L1126 and tsd_train.csv#L7895 #2

Open
GillesJ opened this issue Oct 27, 2020 · 1 comment
Labels
invalid This doesn't seem right

Comments


GillesJ commented Oct 27, 2020

I found two invalid annotations. Each contains a singleton character offset that is not connected to the rest of the span:

  • tsd_trial.csv line 1126, instance 658:
"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 21]","Ridiculous logic!

G&M sure seem hooked to Real Estate industry cash (propaganda pieces in exchange of ad cash), Trudeau and paying interest on massive Federal debt."

-> corrected label by removing singleton 21: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

  • tsd_train.csv line 7895, instance 4616:
"[94, 95, 96, 97, 98, 241]","He went on a 'traveling the country vacation' there. I hope they have a swift court and swift death penalty. He is immigrated here, non citizen living with parents, in Colorado. 

DO NOT give him back to us. No matter how much Hickenlooper pleads."

-> corrected label by removing singleton 241: [94, 95, 96, 97, 98]

I found these when unitizing annotations from character-level to token-level.
My script found no other singleton characters.
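
For anyone who wants to reproduce the check, here is a minimal sketch of one way to find such singletons. It is not the script mentioned above; the function names `find_singletons` and `drop_singletons` are hypothetical, and it assumes a "singleton" means an offset with neither `offset - 1` nor `offset + 1` present in the same label. Note that this definition would also flag a legitimately annotated one-character span, so flagged rows still need a manual look.

```python
import ast


def find_singletons(offsets):
    """Return offsets that have no adjacent neighbour in the same label.

    Assumption: a singleton is an offset o where neither o - 1 nor o + 1
    is annotated.
    """
    present = set(offsets)
    return [o for o in sorted(present)
            if o - 1 not in present and o + 1 not in present]


def drop_singletons(offsets):
    """Return the label with singleton offsets removed, order preserved."""
    singles = set(find_singletons(offsets))
    return [o for o in offsets if o not in singles]


# The label column is stored as a string, so parse it into a list first.
trial_label = ast.literal_eval("[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 21]")
train_label = ast.literal_eval("[94, 95, 96, 97, 98, 241]")

print(find_singletons(trial_label))  # [21]
print(drop_singletons(trial_label))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(find_singletons(train_label))  # [241]
print(drop_singletons(train_label))  # [94, 95, 96, 97, 98]
```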

GillesJ changed the title from "Invalid annotation: singleton character l" to "Invalid annotation: singleton character tsd_trial.csv#L658 and tsd_train.csv#L4616" on Oct 27, 2020

avidale commented Oct 27, 2020

I found similar problems in several more lines.
E.g. line 1143 of tsd_trial.csv (668 if each quoted multi-line text is counted as a single record) is
"[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]",Because Trudeau knows that his federal Liberal Party survives on voters with brain damage from recreational pot use.
Here the span starts at characters 5 and 6, which are the last two letters of the word "Because".
How should such cases be handled?
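
To get a list of rows with this second kind of problem, a rough sketch like the following could be used. It is only an illustration: `contiguous_runs` and `partial_word_runs` are hypothetical helper names, and "word character" is assumed to mean anything for which `str.isalnum()` is true, which is cruder than a real tokenizer. It only detects spans that begin or end inside a word; whether to trim them or expand them to the word boundary is left open.

```python
def contiguous_runs(offsets):
    """Group character offsets into inclusive (start, end) runs."""
    runs = []
    for o in sorted(set(offsets)):
        if runs and o == runs[-1][1] + 1:
            runs[-1][1] = o          # extend the current run
        else:
            runs.append([o, o])      # start a new run
    return [(start, end) for start, end in runs]


def is_word_char(text, i):
    """Assumption: a word character is any alphanumeric character."""
    return 0 <= i < len(text) and text[i].isalnum()


def partial_word_runs(text, offsets):
    """Return (start, end, substring) for runs that cut through a word."""
    flagged = []
    for start, end in contiguous_runs(offsets):
        starts_inside = is_word_char(text, start) and is_word_char(text, start - 1)
        ends_inside = is_word_char(text, end) and is_word_char(text, end + 1)
        if starts_inside or ends_inside:
            flagged.append((start, end, text[start:end + 1]))
    return flagged


text = ("Because Trudeau knows that his federal Liberal Party survives on "
        "voters with brain damage from recreational pot use.")
label = list(range(5, 21))  # [5, 6, ..., 20], as in the row quoted above

print(partial_word_runs(text, label))  # [(5, 20, 'se Trudeau knows')]
```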

GillesJ changed the title from "Invalid annotation: singleton character tsd_trial.csv#L658 and tsd_train.csv#L4616" to "Invalid annotation: singleton character tsd_trial.csv#L1126 and tsd_train.csv#L7895" on Oct 27, 2020
ipavlopoulos added the "invalid" label (This doesn't seem right) on Oct 28, 2020