Is it possible to add a few pre-1918 Russian characters to the RUS language files? #3
Comments
Yes, this is possible. I think the resulting model should not replace I assume that the missing section sign will be needed for
Your first step should be finding/making ground truth text from images of pre-1918 Russian books and/or newspapers.
The Byelorussian-Ukrainian I (upper and lower case) is included in
@stweil and @amitdo, thank you for the comments. As I understand it now, a new rus_old model is the better solution. I shall try to prepare a set of words in pre-1918 Russian for training and come back to this issue after that. I am not sure I will be able to decipher the training instructions on my own, but they are of no use anyway without a good deal of text to apply them to.
In 1917-1918, Russian orthography was reformed in many ways, including but not limited to the elimination of four letters: I-decimal (now known as "Byelorussian-Ukrainian I"), Yat, Fita, and Izhitsa. The need to OCR texts published in Russia from 1708 through 1918 (and somewhat later) is widely recognised among scholars, but they are largely unfamiliar with the ways tesseract can be trained to recognise these missing characters (and, I have to confess, the vast majority of ordinary people will be unable to train tesseract even after reading the instructions: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ). See also: https://en.wikipedia.org/wiki/Russian_alphabet#Letters_eliminated_in_1918
Would it be possible to include the following glyphs in the desired characters list for Russian ( langdata_lstm/rus/desired_characters )?
§ : Section sign ; Unicode number: U+00A7
І : Cyrillic Capital Letter Byelorussian-Ukrainian I ; Unicode number: U+0406
і : Cyrillic Small Letter Byelorussian-Ukrainian I ; Unicode number: U+0456
Ѣ : Cyrillic Capital Letter Yat ; Unicode number: U+0462
ѣ : Cyrillic Small Letter Yat ; Unicode number: U+0463
Ѳ : Cyrillic Capital Letter Fita ; Unicode number: U+0472
ѳ : Cyrillic Small Letter Fita ; Unicode number: U+0473
Ѵ : Cyrillic Capital Letter Izhitsa ; Unicode number: U+0474
ѵ : Cyrillic Small Letter Izhitsa ; Unicode number: U+0475
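For anyone who wants to double-check the code points listed above, here is a minimal sketch using Python's standard `unicodedata` module. The names `REQUESTED` and `describe` are made up for this illustration and are not part of tesseract or langdata_lstm.

```python
import unicodedata

# The nine requested glyphs, written as escapes so that look-alike
# Latin letters cannot sneak in (e.g. Latin "i" vs. Cyrillic U+0456).
REQUESTED = [
    "\u00A7",            # section sign
    "\u0406", "\u0456",  # Byelorussian-Ukrainian I (capital, small)
    "\u0462", "\u0463",  # Yat
    "\u0472", "\u0473",  # Fita
    "\u0474", "\u0475",  # Izhitsa
]

def describe(chars):
    """Return (code point, official Unicode name) pairs for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]

for code, name in describe(REQUESTED):
    print(code, name)
```

This only confirms that each glyph matches its stated Unicode number; it says nothing about whether a given traineddata file can recognise it.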
What else should be provided to add these few characters? A list of words containing these letters? How long should that list be? I am currently working on a project that processes lots of geographic names in pre-1918 Russian (and some other texts), so I can provide at least a word list of considerable length. For now, I have to resort to OCRing the pre-1918 text as post-1918 Russian and inserting the four missing letters manually (mostly two of them, as Fita and, especially, Izhitsa were much less frequent).
Or would this rather require a much larger effort, like creating a special rus-old model?
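To give an idea of how well a word list covers the four letters, here is a rough sketch that counts the words containing each of them. The function name `count_letter_coverage` and the sample words are hypothetical and chosen only for illustration; the sketch does not answer how many words training would actually need.

```python
from collections import Counter

# Lower-case forms of the four eliminated letters:
# I-decimal, Yat, Fita, Izhitsa.
OLD_LETTERS = "\u0456\u0463\u0473\u0475"

def count_letter_coverage(words):
    """Count, for each old letter, how many words contain it (case-insensitive)."""
    counts = Counter({letter: 0 for letter in OLD_LETTERS})
    for word in words:
        lowered = word.lower()
        for letter in OLD_LETTERS:
            if letter in lowered:
                counts[letter] += 1
    return counts

# Illustrative sample only; the old letters are written as escapes
# to avoid confusion with look-alike characters.
sample = [
    "м\u0456ръ",     # contains I-decimal
    "хл\u0463бъ",    # contains Yat
    "\u0472едоръ",   # contains capital Fita
    "м\u0475ро",     # contains Izhitsa
    "Росс\u0456я",   # contains I-decimal
    "домъ",          # none of the four letters
]

for letter, n in count_letter_coverage(sample).items():
    print(f"U+{ord(letter):04X}: {n} word(s)")
```

A count like this would show at a glance whether Fita and Izhitsa are represented at all in a list dominated by I-decimal and Yat.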