ES actor dictionaries contain 25% to 30% duplicate entries #58

Open
philip-schrodt opened this issue Dec 10, 2018 · 0 comments

The various Spanish actor dictionaries all contain about 25% to 30% duplicate entries. Specifically:

  • File: ELBOW_SPANISH_Phoenix.Countries.actors_UPDATED_noaccent.txt. Total actors: 58126. Total duplicates: 16144 (27.77%)
  • File: ELBOW_SPANISH_Phoenix_MilNonState_actors_UPDATED_mod.txt. Total actors: 4310. Total duplicates: 1058 (24.55%)
  • File: ELBOW_SPANISH_Phoenix_International_actors_UPDATED_noaccent.txt. Total actors: 12821. Total duplicates: 3740 (29.17%)
  • File: Agents_ESP_Bablenet_20171114_mod.txt. Total actors: 16062. Total duplicates: 4570 (28.45%)
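For anyone who wants to reproduce counts like these, here is a minimal sketch. It is not the script used to produce the figures above. It assumes each dictionary entry is a single string, that entries are compared verbatim after trimming surrounding whitespace, and that a "duplicate" is every occurrence of an entry beyond its first. The function name `duplicate_stats` is hypothetical, and reading and parsing the dictionary files is left out because the exact file layout is not described here.

```python
from collections import Counter


def duplicate_stats(entries):
    """Return (total, duplicates, percent) for an iterable of entry strings.

    Assumption: a duplicate is any occurrence of an entry after the first,
    with entries compared verbatim once surrounding whitespace is trimmed.
    """
    # Trim whitespace and drop lines that are empty after trimming.
    cleaned = [e.strip() for e in entries if e.strip()]
    counts = Counter(cleaned)
    total = len(cleaned)
    # Each distinct entry contributes (occurrences - 1) duplicates.
    duplicates = sum(n - 1 for n in counts.values())
    percent = 100.0 * duplicates / total if total else 0.0
    return total, duplicates, percent


if __name__ == "__main__":
    # Made-up sample entries, for illustration only.
    sample = ["A_ [X]", "B_ [Y]", "A_ [X]", "", "A_ [X] "]
    total, dups, pct = duplicate_stats(sample)
    print(f"Total actors: {total} Total duplicates {dups} ({pct:6.2f}%)")
```

If duplicates should instead be counted as the number of distinct entries that appear more than once, replace the sum with `sum(1 for n in counts.values() if n > 1)`.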

There's a particularly extreme case in ELBOW_SPANISH_Phoenix.Countries.actors_UPDATED_noaccent.txt, where there are 718 repetitions of

AL-YUMHURIYYA_AL-YAZAIIRIYYA_AD-DIMUQRATIYYA_ASH-SHA`BIYYA_TIGDUDA_TAMEGDAYT_TAGERFANT_TAZZAYRIT_REPUBLIQUE_ALGERIENNE_DEMOCRATIQUE_ET_POPULAIRE_ [DZA]

Some of these repetitions differ by one or more accents/diacritics, but there are several combinations which are repeated in identical form 51 times (though perhaps in some earlier iteration of the file they also differed in some diacritics?). This is by far the extreme case: most repetitions occur fewer than ten times, and the most common situation is only a single repetition.
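One way to separate exact repeats from repeats that differ only in diacritics is to group entries by an accent-stripped key and then count the exact forms within each group. The sketch below is a rough illustration under that assumption, not the method used for the numbers above. The helper names `strip_diacritics` and `group_by_folded_form` are hypothetical. Stripping relies on Unicode NFD decomposition, so it only removes combining marks and leaves standalone characters such as the backtick in `ASH-SHA`BIYYA` untouched.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Remove combining marks (accents) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def group_by_folded_form(entries):
    """Map each accent-stripped entry to a Counter of its exact spellings."""
    groups = defaultdict(Counter)
    for entry in entries:
        entry = entry.strip()
        if entry:
            groups[strip_diacritics(entry)][entry] += 1
    return dict(groups)


if __name__ == "__main__":
    # Made-up sample entries, for illustration only.
    sample = ["REPÚBLICA_ [X]", "REPUBLICA_ [X]", "REPUBLICA_ [X]", "OTRO_ [Y]"]
    for folded, variants in group_by_folded_form(sample).items():
        if sum(variants.values()) > 1:
            # Variants with a count above 1 are identical repeats; several
            # keys in one Counter means the repeats differ only in diacritics.
            print(folded, dict(variants))
```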
