
NO fas.unicharset and fas.xheights file for Persian Language #60

Open
AinazRafiei opened this issue May 19, 2024 · 6 comments

@AinazRafiei

There are no fas.xheights and fas.unicharset files for the Persian language. Without these, how can we train Tesseract with LSTM on Persian? Could you please add them, or explain how we can make them?

@amitdo

amitdo commented May 19, 2024

@stweil,

If you want to fix this issue, the fas.unicharset file can be extracted from fas.traineddata.

IIRC, the xheights file is not needed for LSTM training.
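For reference, one way to get the unicharset out of an existing traineddata file is Tesseract's combine_tessdata tool. This is a sketch, assuming you have downloaded fas.traineddata (e.g. from the tessdata_best repository); the exact component names it produces can vary with the traineddata version:

```shell
# List the components packed inside the traineddata file.
combine_tessdata -d fas.traineddata

# Unpack all components into files prefixed with "fas." —
# for an LSTM model this typically yields a fas.lstm-unicharset file,
# among others.
combine_tessdata -u fas.traineddata fas.
```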

stweil changed the title from "NO fas.unicharset and fas.xheights file for Persian Lnaguage." to "NO fas.unicharset and fas.xheights file for Persian Language" on May 19, 2024
@stweil

stweil commented May 19, 2024

Is this an issue? There are also no eng.xheights and eng.unicharset files in the repository, and the same is true for all other languages.

@AinazRafiei, why do you think that you need those files?

@amitdo

amitdo commented May 20, 2024

@stweil

stweil commented May 20, 2024

Ah, yes, sorry. So what remains to be fixed? Or can this issue be closed?

@AinazRafiei

Is this an issue? There are also no eng.xheights and eng.unicharset files in the repository, and the same is true for all other languages.

@AinazRafiei, why do you think that you need those files?
Why is the xheights file not needed for LSTM training?

I want to train Tesseract 4 with LSTM on a custom Persian dataset; in other words, fine-tune Tesseract on my data to fix the errors it makes when recognizing the text in the images in my database. I followed the training instructions at https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#building-the-training-tools; the Tesstutorial section there says we need the unicharset and xheights files. I trained Tesseract 4 without them, using the imperfect unicharset file that is generated during training, and as I expected I got "Encoding of string failed" errors and the training error was far too high, because characters in my training data did not exist in the unicharset file and were ignored by the model during training, so training could not be done accurately. I believe the high training error rate is caused by the absence of unicharset and xheights files for the Persian language.

I don't want to train from scratch because I don't have enough data for that. There is a method called cutting off the top layer mentioned in the Tesseract documentation, but I didn't understand it. If there is any way to fine-tune in my case, could you please tell me?
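As a quick diagnostic for the "Encoding of string failed" errors described above, one can check which characters in the training text are missing from a given unicharset. This is a minimal sketch, not part of the Tesseract tools: it assumes the plain-text unicharset format (a count on the first line, then one line per unichar whose first space-separated field is the unichar itself), and the data below is a toy example standing in for real files:

```python
# Sketch: report characters in a training text that a unicharset does not cover.
# Assumes the plain-text unicharset format: first line is the unichar count,
# each following line starts with the unichar itself (space-separated fields).

def load_unicharset(lines):
    """Return the set of unichars from unicharset file lines (skipping the count line)."""
    chars = set()
    for line in lines[1:]:
        if line.strip():
            chars.add(line.split(" ", 1)[0])
    return chars

def missing_chars(training_text, unicharset_chars):
    """Characters that appear in the training text but not in the unicharset."""
    return sorted(set(training_text) - unicharset_chars - {" ", "\n"})

# Toy example with a few Persian letters (real files would be read from disk).
unicharset_lines = [
    "4",
    "NULL 0 NULL 0",
    "ا 1 0,255 ...",
    "ب 1 0,255 ...",
    "ر 1 0,255 ...",
]
covered = load_unicharset(unicharset_lines)
print(missing_chars("ابر پر", covered))  # 'پ' is not in the unicharset
```

Any character this reports would be silently dropped or would fail to encode during training, which matches the symptom described.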

@AinazRafiei

fas.unicharset

eng.unicharset

The unicharset files inside the language folders like eng and fas are very different from the files outside those folders. The unicharset files outside are much more complete and contain many more of the language's unichars, covering different fonts. The unicharset in a language folder is the file generated while training on the Tesseract dataset for that language. It is not usable when you want to train Tesseract on your own dataset, because your dataset differs from the one Tesseract used.
