Fix instances where the debate text ends up in the speaker field #81

Open
stephbuon opened this issue Jun 18, 2021 · 4 comments

@stephbuon
Owner

Sometimes debate text ends up in the speaker field:

mr swift macneill said he wished to call attention to the escape of two indentured chinese labourers from johannesburg to pretoria and their capture in the latter place when he referred to the matter upon a previous occasion the colonial secretary said that he had no information in regard to the imprisonment of these men naturally upon all questions affecting personal liberty irishmen sympathised with the victims of oppression and he wished to draw attention to the way the right hon gentleman had acted in the matter on friday last he asked the colonial secretary whether he had received the information which was conveyed to the public in a reuters telegram stating that two of these chinese labourers had escaped from the compounds at the mines and had managed to get from johannesburg to pretoria which was a distance of forty miles upon the point he did not think the colonial secretary had treated the house respectfully because hon members were anxious to have the fullest information in regard to the administration of the ordinance in south africa the right hon gentleman replied to him stating that he would not telegraph but he promised to communicate with lord milner by despatch he had not telegraphed and therefore it was the right hon gentlemans duty to have sent a despatch to lord milner by the mail which left for south africa at two oclock on saturday but there was something far worse connected with the mater and they had only to look at the ordinance itself in order to see the very stringent conditions under which these chinese laboured

We need to find instances like this and fix them by:
a) isolating the speaker name in its own column
b) separating the text so that each sentence has its own row (like our existing csv file)
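
A minimal sketch of step (a), assuming the speaker name ends just before the first reporting verb. The helper `split_speaker`, the `" said "` marker, and the `spilled_text` column name are hypothetical and only illustrate the idea; real rows will likely need more markers ("asked", "moved", and so on). Step (b) is not covered here.

```r
# Hypothetical helper: split an over-long speaker value into a name and the
# debate text that follows it. Assumes the name ends just before the first
# occurrence of `marker` (only " said " is handled by default).
split_speaker <- function(speaker, marker = " said ") {
  pos <- regexpr(marker, speaker, fixed = TRUE)  # -1 where the marker is absent
  found <- pos > 0
  # Keep everything before the marker as the name; leave clean values untouched.
  name <- ifelse(found, substr(speaker, 1, pos - 1), speaker)
  # Keep the marker word and everything after it as the spilled debate text.
  text <- ifelse(found, substr(speaker, pos + 1, nchar(speaker)), NA_character_)
  data.frame(new_speaker = name, spilled_text = text, stringsAsFactors = FALSE)
}

example <- c(
  "mr swift macneill said he wished to call attention to the escape",
  "mr balfour"
)
result <- split_speaker(example)
```

Values without the marker come back unchanged with `NA` in `spilled_text`, so the helper can be run over the whole column and only the affected rows reviewed.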

I bet we can find these instances by checking to see if any speakers are more than, say, 30 letters long?
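
The length check could look something like this on toy data. The cutoff name `max_speaker_chars` is hypothetical, and 30 is just the guess from above, not a tested value.

```r
# Toy speaker values; the second is debate text that spilled into the field.
speakers <- c(
  "mr swift macneill",
  "mr swift macneill said he wished to call attention to the escape of two indentured chinese labourers",
  "the colonial secretary"
)

# Hypothetical cutoff taken from the suggestion above; tune against real data.
max_speaker_chars <- 30

# Count characters in each speaker value and flag the ones over the cutoff.
speaker_len <- nchar(speakers)
suspect <- speaker_len > max_speaker_chars

flagged <- data.frame(speaker_len, suspect)
```

Long official titles can exceed a cutoff this low ("the chancellor of the exchequer" is 31 characters), so flagged rows are candidates for review rather than confirmed errors.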

@stephbuon
Owner Author

Here are some debate IDs for instances where the debate text ended up as the debate title:

  • 122992
  • 139356 (same as the above example)
  • 99118
  • 5293
  • 77639
  • 69847
  • 88022
  • 70978

@stephbuon
Owner Author

@stephbuon check alexander's work

@stephbuon
Owner Author

@stephbuon check out the kind of sentence ID assigned to each "returned" sentence.

@stephbuon
Owner Author

I need to go back in and see how the sentences are being handled:

library(data.table)
library(tidyverse)

a <- fread("~/data/hansard_c19_improved_speaker_names_2.csv") %>%
  select(sentence_id, speaker, new_speaker, text)

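# Record the character length of each speaker value
a <- a %>%
  mutate(len = str_length(speaker))

# Keep rows whose speaker value is long enough to be debate text
data3 <- filter(a, len > 160)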
