Merge pull request #135 from johnlockejrr/main
Update TrainingTesseract-5.md
zdenop authored Sep 7, 2024
2 parents af52e2a + d93cfc8 commit 9f87063
Showing 1 changed file with 18 additions and 0 deletions.
18 changes: 18 additions & 0 deletions tess5/TrainingTesseract-5.md
@@ -485,6 +485,24 @@ cannot be encoded using the given unicharset. Possible causes are:
1. A stray unprintable character (like tab or a control character) in the text.
1. There is an un-represented Indic grapheme/aksara in the text.

You can remove the offending characters described in item 2 (CHARACTER TABULATION,
CARRIAGE RETURN, LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK, ZERO WIDTH NO-BREAK SPACE,
POP DIRECTIONAL FORMATTING, ZERO WIDTH NON-JOINER) and replace NO-BREAK SPACE with a
regular space using a sed script:

`remove_control_chars.sed`
```
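# Patterns are UTF-8 byte sequences; assumes UTF-8 input and GNU sed (\xHH escapes).
s/\x09//g
s/\x0d//g
# U+00A0 NO-BREAK SPACE becomes a regular space
s/\xc2\xa0/ /g
# U+200E, U+200F, U+FEFF, U+202C, U+200C
s/\xe2\x80\x8e//g
s/\xe2\x80\x8f//g
s/\xef\xbb\xbf//g
s/\xe2\x80\xac//g
s/\xe2\x80\x8c//g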
```

`sed -i -f remove_control_chars.sed data/lang-ground-truth/*.gt.txt`
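
Before editing files in place, it can help to see which ground-truth files actually contain any of the characters listed above. The following is a minimal sketch, not part of the training tools: the function name `find_control_chars` is hypothetical, and it assumes UTF-8 encoded files and a GNU grep built with PCRE support (`-P`).

```shell
# Hypothetical helper: print the names of files that contain at least one
# of the characters listed above, matched as raw UTF-8 byte sequences.
# LC_ALL=C makes grep -P work on bytes, independent of the user's locale.
find_control_chars() {
    LC_ALL=C grep -lP \
        '\x09|\x0d|\xc2\xa0|\xe2\x80[\x8c\x8e\x8f\xac]|\xef\xbb\xbf' "$@"
}
```

Call it with the same `.gt.txt` files you would pass to sed. If it prints nothing, none of the listed characters are present in those files.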

In any case, the trainer will ignore that training image.
If the error is infrequent, it is harmless, but it may indicate that your
unicharset is inadequate for representing the language that you are training.