
NO fas.unicharset and fas.xheights file for Persian Language #60

Open
AinazRafiei opened this issue May 19, 2024 · 6 comments

@AinazRafiei

There are no fas.xheights and fas.unicharset files for the Persian language. Without these, how can we train Tesseract with LSTM on Persian? Could you please add them, or explain how we can make them?

@amitdo

amitdo commented May 19, 2024

@stweil,

If you want to fix this issue, the fas.unicharset file can be extracted from fas.traineddata.

IIRC, the xheights file is not needed for LSTM training.
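For reference, one way to get the unicharset out of an existing traineddata file is Tesseract's combine_tessdata tool. This is a sketch, assuming you have downloaded fas.traineddata (e.g. from the tessdata_best repository); the exact component names it produces can vary with the traineddata version:

```shell
# List the components packed inside the traineddata file.
combine_tessdata -d fas.traineddata

# Unpack all components into files prefixed with "fas." —
# for an LSTM model this typically yields a fas.lstm-unicharset file,
# among others.
combine_tessdata -u fas.traineddata fas.
```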

stweil changed the title from "NO fas.unicharset and fas.xheights file for Persian Lnaguage." to "NO fas.unicharset and fas.xheights file for Persian Language" on May 19, 2024
@stweil

stweil commented May 19, 2024

Is this an issue? There are also no eng.xheights and eng.unicharset files in the repository, and the same is true for all other languages.

@AinazRafiei, why do you think that you need those files?

@amitdo

amitdo commented May 20, 2024

@stweil

stweil commented May 20, 2024

Ah, yes, sorry. So what remains to be fixed? Or can this issue be closed?

@AinazRafiei

Is this an issue? There are also no eng.xheights and eng.unicharset files in the repository, and the same is true for all other languages.

@AinazRafiei, why do you think that you need those files?
Why is the xheights file not needed for LSTM training?

I want to train Tesseract 4 with LSTM on a custom Persian dataset; in other words, fine-tune Tesseract on my data to fix the errors it makes when recognizing the text in the images in my database. I followed the training instructions at https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#building-the-training-tools; the Tesstutorial section there says we need the unicharset and xheights files. I trained Tesseract 4 without them, using the imperfect unicharset file that is generated during training, and as I expected I got "Encoding of string failed" errors and the training error was far too high, because characters in my training data did not exist in the unicharset file and were ignored by the model during training, so training could not be done accurately. I believe the high training error rate is caused by the absence of unicharset and xheights files for the Persian language.

I don't want to train from scratch because I don't have enough data for that. There is a method called cutting off the top layer mentioned in the Tesseract documentation, but I didn't understand it. If there is any way to fine-tune in my case, could you please tell me?
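As a quick diagnostic for the "Encoding of string failed" errors described above, one can check which characters in the training text are missing from a given unicharset. This is a minimal sketch, not part of the Tesseract tools: it assumes the plain-text unicharset format (a count on the first line, then one line per unichar whose first space-separated field is the unichar itself), and the data below is a toy example standing in for real files:

```python
# Sketch: report characters in a training text that a unicharset does not cover.
# Assumes the plain-text unicharset format: first line is the unichar count,
# each following line starts with the unichar itself (space-separated fields).

def load_unicharset(lines):
    """Return the set of unichars from unicharset file lines (skipping the count line)."""
    chars = set()
    for line in lines[1:]:
        if line.strip():
            chars.add(line.split(" ", 1)[0])
    return chars

def missing_chars(training_text, unicharset_chars):
    """Characters that appear in the training text but not in the unicharset."""
    return sorted(set(training_text) - unicharset_chars - {" ", "\n"})

# Toy example with a few Persian letters (real files would be read from disk).
unicharset_lines = [
    "4",
    "NULL 0 NULL 0",
    "ا 1 0,255 ...",
    "ب 1 0,255 ...",
    "ر 1 0,255 ...",
]
covered = load_unicharset(unicharset_lines)
print(missing_chars("ابر پر", covered))  # 'پ' is not in the unicharset
```

Any character this reports would be silently dropped or would fail to encode during training, which matches the symptom described.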

@AinazRafiei

fas.unicharset

eng.unicharset

The unicharset files inside the language folders like eng and fas are very different from the files outside those folders. The unicharset files outside are much more complete and contain many more of the language's unichars, covering different fonts. The unicharset in a language folder is the file generated while training on the Tesseract dataset for that language. It is not usable when you want to train Tesseract on your own dataset, because your dataset differs from the one Tesseract used.
