Training an existing Tesseract model ara #406
Comments
Also, here is the GitHub repo containing all the files used to generate the dataset:
I see no critical errors in your training, but it did not run long enough. Increase And fix or remove the GT line with 'cptleslakUgh3yyAyygueSaladnypayple'.
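To find other ground-truth lines like the one mentioned above, a small scan over the `.gt.txt` files may help. This is only a sketch: the function names and the rule "four or more consecutive ASCII letters" are assumptions for illustration, and a political corpus may legitimately contain some Latin-script words, so every hit should be checked by hand.

```python
import re
from pathlib import Path

# Assumption: a run of four or more ASCII letters in an Arabic
# ground-truth line is unusual enough to deserve a manual look.
LATIN_RUN = re.compile(r"[A-Za-z]{4,}")


def is_suspicious(line: str) -> bool:
    """Return True if the line contains a long run of ASCII letters."""
    return LATIN_RUN.search(line) is not None


def find_suspicious_gt(gt_dir: str) -> list[tuple[str, str]]:
    """List (file name, text) for every *.gt.txt file that looks suspicious."""
    hits = []
    for path in sorted(Path(gt_dir).glob("*.gt.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if is_suspicious(text):
            hits.append((path.name, text))
    return hits
```

Calling `find_suspicious_gt` on the ground-truth folder returns the files to fix or remove before restarting the training.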
Thank you for your feedback and suggestions. I appreciate your guidance! I’ve noticed that while there is some progress in BCER and BWER, the BWER in particular is improving very slowly. Considering the results after 20,000 iterations, it seems like achieving good accuracy will be quite challenging at this rate. Do you think adjusting the dataset (e.g., increasing its size or refining the ground truth) or tweaking training parameters like learning_rate or --max_iterations could help accelerate improvement? Looking forward to your thoughts! Thanks again for your time and help.
Can you check this:
This is after 46,000 iterations; the values are almost the same as they were at 5,000 iterations.
Could this cause a problem?
How do you read in the additional information regarding the word list (and the like), which causes the failures reported in your log? Please note that the official Arabic model was surely trained with several dozen fonts and at least 100,000 lines of text. You use only three(?) fonts and only about 16K lines of text to generate your input. Why do you limit your input to 8 words per line? Further, regardless of the learning progress that tesstrain reports, and depending on your final target scenario, it might be worthwhile to evaluate the resulting model afterwards with data it has never seen before. At least this is how we do it in the context of mass digitization of Arabic/Hebrew/Farsi prints. To provide some more background: I'm from a German institution called FID MENA, and we are always looking for collaborations on this topic. In the past we tried to fine-tune the official Tesseract standard model for Arabic, not with a synthetic dataset like yours, but with snippets generated from retro-digitized materials originating from real prints of MENAlib.
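The evaluation on unseen data suggested above could be sketched roughly as follows. It assumes you already have (reference, OCR output) text pairs for held-out lines; the function names are made up for this example, and the numbers it produces are plain edit-distance rates, which need not match exactly how Tesseract computes the BCER/BWER values in the training log.

```python
from typing import Callable, Iterable, Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance (insert, delete, substitute) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def error_rate(pairs: Iterable[tuple[str, str]],
               tokenize: Callable[[str], Sequence]) -> float:
    """Total edit distance divided by total reference length."""
    errors = total = 0
    for reference, hypothesis in pairs:
        ref, hyp = tokenize(reference), tokenize(hypothesis)
        errors += levenshtein(ref, hyp)
        total += len(ref)
    return errors / total if total else 0.0


def cer(pairs: Iterable[tuple[str, str]]) -> float:
    """Character error rate over all pairs."""
    return error_rate(pairs, list)


def wer(pairs: Iterable[tuple[str, str]]) -> float:
    """Word error rate over all pairs, splitting on whitespace."""
    return error_rate(pairs, str.split)
```

Running `cer` and `wer` on lines from fonts or documents that were never part of the training set gives a more honest picture than the training log alone.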
Please check the order of RTL text in your groundtruth files. See issue
I am working on training the Tesseract OCR model for the Arabic language (ara) using a custom dataset focused on political content. The workflow involves generating .gt.txt and .tif files for the dataset and running the make training command. However, I am encountering issues during the training process that result in unexpected errors or incomplete logs.
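Before running the training, it can be worth confirming that every `.tif` image has a matching `.gt.txt` file and vice versa. The following is a minimal sketch under the assumption that both file types sit in one folder and share a base name (for example `line_001.tif` and `line_001.gt.txt`); the function name is hypothetical.

```python
from pathlib import Path


def unpaired_files(gt_dir: str) -> tuple[list[str], list[str]]:
    """Return (images without text, texts without image) as base names."""
    root = Path(gt_dir)
    # Strip the full suffix so "x.tif" and "x.gt.txt" both map to "x".
    images = {p.name[: -len(".tif")] for p in root.glob("*.tif")}
    texts = {p.name[: -len(".gt.txt")] for p in root.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)
```

Two empty lists mean that every line image has a transcription and every transcription has an image.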
log file:
This result is after more than 20,000 iterations.
I would be very grateful if someone could help me with this.