trained a new OCRopus model
Jesper Zedlitz committed Oct 17, 2017
1 parent 8d30998 commit 1656704
Showing 54 changed files with 88 additions and 15 deletions.
28 changes: 18 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
@@ -11,19 +11,21 @@ In addition to the pyrnn.gz model to be used with `ocropus-rpred` I have also tr

The ground truth images for this model were selected from these historic books:

- Das astronomische Weltbild im Wandel der Zeit (1912) urn:nbn:at:AT-OOeLB-4113427
- Topographische Chronik von Breslau (1805)
- Egger: Die christliche Mutter (1914)
- Frapan: Bittersüß (1891)
- Gartenlaube Heft 1, S. 8 (1897)
- Menzel: Der praktische Maurer (1847) urn:nbn:de:kobv:b4-25126-9
- Kiel city directory (1888)
- Was sollen wir kochen? (1915) urn:nbn:at:AT-OOeLB-1184253
- Köln city directory (1891)
- Ludendorff: Kriegserinnerungen (1921)
- Menzel: Der praktische Maurer (1847)
- Kreis-Kalender für den Kreis Plön (1909) urn:nbn:de:gbv:8:2-2533517
- Preuschen: Yoshiwara (1920)
- Gartenlaube Heft 1, S. 8 (1897)
- Munzinger: Die Japaner (1898) urn:nbn:de:kobv:b4-200905194219
- Frapan: Bittersüß (1891) urn:nbn:de:kobv:b4-200905191440
- Schiller und Oberösterreich (1905) urn:nbn:at:AT-OOeLB-1099695
- Kreis-Kalender für den Kreis Plön (1909) urn:nbn:de:gbv:8:2-2533517
- Das astronomische Weltbild im Wandel der Zeit (1912) urn:nbn:at:AT-OOeLB-4113427
- Egger: Die christliche Mutter (1914) urn:nbn:de:kobv:b4-30011-9
- Was sollen wir kochen? (1915) urn:nbn:at:AT-OOeLB-1184253
- Der Große Krieg in Einzeldarstellungen - Die Winterschlacht in Masuren (1918) urn:nbn:at:AT-OOeLB-1691328
- Preuschen: Yoshiwara (1920) urn:nbn:de:kobv:b4-30090-0
- Ludendorff: Kriegserinnerungen (1921)
- Ehrenbuch der Gefallenen Stuttgarts (1925)

For some rare characters (Q, Y, Ä, Ö and Ü) I have generated some synthetic training data with OCRopus-linegen using the Walbaum Fraktur font and words from a German dictionary.
@@ -50,7 +52,7 @@ Using a language model you can easily replace the wrong characters. In German an

### s, tz, ch, ck and others

Fraktur has a few special characters that usually do not occur in 'normal' text, such as the long-S or the tz ligature. This model (and the ground truth) translates them into 'normal' characters. Therefore a long-S will be interpreted as s and the ligatures will be interpreted as two characters.
Fraktur has a few special characters that usually do not occur in 'normal' text, such as the long-S or the tz ligature. This model (and the ground truth) recognizes the long-S as the correct Unicode character LATIN SMALL LETTER LONG S (U+017F). The tz ligature is interpreted as two characters.

![Wasser](images/long-s.png)
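
If a downstream tool expects the modern round s, the long-S can be replaced after recognition. The following is only a minimal sketch in Python; the function name `normalize_long_s` is a hypothetical helper and not part of OCRopus.

```python
import unicodedata

# U+017F is the code point the model emits for the long s.
LONG_S = "\u017f"


def normalize_long_s(text: str) -> str:
    """Replace every long s with a round s; all other characters stay untouched."""
    return text.replace(LONG_S, "s")


# A ground-truth line from this repository, written with escapes for the long s.
line = "Durch die getroffenen Vor\u017fichtsma\u00dfregeln war es gelungen,"
print(unicodedata.name(LONG_S))
print(normalize_long_s(line))
```

Whether to apply such a step depends on the use case: for full-text search the round s is usually more convenient, while a diplomatic transcription would keep the long-S.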

@@ -76,3 +78,9 @@ To train the model from scratch you can use this command:
(Make sure that the OCRopus directory is included in PATH.) If you want to improve the existing model you can start with it:

ocropus-rtrain -c codec.txt -F 2000 --load fraktur.pyrnn.gz -o fraktur "training/*.bin.png"

You can see the training progress in this image:

![diagram showing the training progress](images/accuracy.png)

If you are interested in the amount of ground-truth data and the character accuracy for each book used for the model training, take a look at the `overview-*` diagrams in the `images` folder.
4 changes: 2 additions & 2 deletions accuracy.gnuplot
Original file line number Diff line number Diff line change
@@ -9,5 +9,5 @@ set xrange [0:]
set yrange [0.95:1]

plot 'errors.csv' u 1:(1-$2) with lines title 'testing',\
'' u 1:(1-$3) title 'training',\
'' u 1:(1-$3) title 'exclusive testing'
'' u 1:(1-$3) with lines title 'training',\
'' u 1:(1-$4) with lines title 'exclusive testing'
24 changes: 24 additions & 0 deletions errors.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
# iteration testing training exclusive testing
2000 0.142825631411 0.14939694083 0.124767801858
4000 0.0636373662733 0.0596750420105 0.0597523219814
6000 0.0379949266571 0.0323524182587 0.0266253869969
8000 0.0291717216279 0.0224648860697 0.0238390092879
10000 0.0292268666593 0.0223788576574 0.0213622291022
12000 0.0259181647734 0.0186050779704 0.0167182662539
14000 0.0226094628874 0.0150033551081 0.0164086687307
16000 0.0268004852763 0.0188516927524 0.015479876161
18000 0.0207896768501 0.0138677800655 0.0123839009288
20000 0.0219477225102 0.0146936528237 0.0130030959752
22000 0.0205690967244 0.0127952925253 0.012693498452
24000 0.0207345318187 0.013615430056 0.0133126934985
26000 0.0231609132017 0.0144183619043 0.0145510835913
28000 0.0201830815044 0.012462649331 0.0139318885449
30000 0.0195213411272 0.0111148708714 0.0123839009288
32000 0.0184735855299 0.0108969322268 0.0120743034056
34000 0.0180324252785 0.010346350388 0.0136222910217
36000 0.0177015550899 0.0107764924496 0.0108359133127
38000 0.0175361199956 0.00975562195674 0.00959752321981
40000 0.0227748979817 0.0153417335299 0.0139318885449
42000 0.0200727914415 0.0137989573356 0.0148606811146
44000 0.0199073563472 0.0137588107432 0.012693498452
46000 0.0149443035183 0.0100481185586 0.0130030959752
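
The error table can also be inspected without gnuplot. The sketch below is an illustration in Python, not part of this commit: it parses a few rows copied from `errors.csv`, converts errors to accuracy the same way the gnuplot script does (`1 - error`), and picks the iteration with the lowest error in a given column. The names `parse_errors`, `best_iteration` and the dictionary keys are hypothetical, and the sample assumes whitespace-separated columns.

```python
# A few rows copied from errors.csv (columns assumed to be whitespace-separated).
SAMPLE = """\
# iteration testing training exclusive testing
2000 0.142825631411 0.14939694083 0.124767801858
20000 0.0219477225102 0.0146936528237 0.0130030959752
38000 0.0175361199956 0.00975562195674 0.00959752321981
46000 0.0149443035183 0.0100481185586 0.0130030959752
"""


def parse_errors(text):
    """Return one dict per data row; comment lines and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        iteration, testing, training, exclusive = line.split()
        rows.append({
            "iteration": int(iteration),
            "testing": float(testing),
            "training": float(training),
            "exclusive": float(exclusive),
        })
    return rows


def best_iteration(rows, column):
    """Return the iteration number with the lowest error in the given column."""
    return min(rows, key=lambda row: row[column])["iteration"]


rows = parse_errors(SAMPLE)
for row in rows:
    # Same transformation as the gnuplot expression (1-$2).
    print(row["iteration"], round(1 - row["testing"], 4))
print("lowest testing error at iteration", best_iteration(rows, "testing"))
```

Note that the lowest error on one data set does not have to coincide with the lowest error on another, so the choice of column matters when selecting a checkpoint.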
Binary file added fraktur.pyrnn.gz
Binary file not shown.
Binary file added images/accuracy.png
Binary file added images/overview-characters.png
Binary file added images/overview-lines.png
Binary file added images/overview-testing.png
Binary file added images/overview-training.png
20 changes: 20 additions & 0 deletions overview.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
# book training lines training characters training error test lines test characters test error
AC09345259 116 5119 0.00655391120507 6 190 0.00564971751412
astro1912 75 4993 0.0117195823567 11 707 0.0345345345345
breslau1805 21 920 0.0689252336449 2 62 0.107142857143
egger 419 16510 0.00754962577286 48 1910 0.00964265456608
frapan 504 25074 0.00335570469799 57 2852 0.00265957446809
gartenlaube 69 4409 0.00896535013327 6 388 0.0301369863014
grosserkrieg20 43 2316 0.0221811460259 18 1116 0.0191938579655
kiel1888 373 13065 0.0188726305288 33 960 0.0368239355581
kochen1915 119 6502 0.00713337757133 4 237 0.00448430493274
koeln1891 44 3306 0.0175382653061 2 148 0.013986013986
linegen 190 1702 0.0500747384155 3 20 0.214285714286
ludendorf 351 23000 0.00475158001568 33 1891 0.00789622109419
menzel 213 13479 0.0167192429022 20 1233 0.0128755364807
munzinger 202 10576 0.010013744355 62 3286 0.0148593107809
nn 19 861 0.0718711276332 3 196 0.0641711229947
ploen1909 87 3478 0.0436667698978 10 438 0.0565110565111
preuschen 19 953 0.010067114094 4 194 0.00549450549451
schiller 364 20494 0.00508339952343 37 2076 0.00831600831601
stuttgart 537 29414 0.00777550010603 41 2338 0.0120751341682
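
When reading the per-book numbers, keep in mind that some books contribute very few test characters, so their test error is only a rough indication. The following Python sketch ranks books by test error and flags small samples. It is an illustration only: the class `BookStats`, the function names and the threshold of 100 test characters are assumptions and do not come from this repository.

```python
from dataclasses import dataclass

# A few rows copied from overview.csv (columns assumed to be whitespace-separated).
SAMPLE = """\
# book training lines training characters training error test lines test characters test error
breslau1805 21 920 0.0689252336449 2 62 0.107142857143
frapan 504 25074 0.00335570469799 57 2852 0.00265957446809
linegen 190 1702 0.0500747384155 3 20 0.214285714286
stuttgart 537 29414 0.00777550010603 41 2338 0.0120751341682
"""

MIN_TEST_CHARS = 100  # hypothetical threshold for a "small" test sample


@dataclass
class BookStats:
    book: str
    train_lines: int
    train_chars: int
    train_error: float
    test_lines: int
    test_chars: int
    test_error: float


def parse_overview(text):
    """Parse the table into BookStats records, skipping the comment header."""
    stats = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split()
        stats.append(BookStats(f[0], int(f[1]), int(f[2]), float(f[3]),
                               int(f[4]), int(f[5]), float(f[6])))
    return stats


def rank_by_test_error(stats):
    """Return the records sorted from highest to lowest test error."""
    return sorted(stats, key=lambda s: s.test_error, reverse=True)


def small_samples(stats, min_chars=MIN_TEST_CHARS):
    """Return the names of books with fewer test characters than min_chars."""
    return [s.book for s in stats if s.test_chars < min_chars]


stats = parse_overview(SAMPLE)
for s in rank_by_test_error(stats):
    note = " (small test sample)" if s.test_chars < MIN_TEST_CHARS else ""
    print(f"{s.book}: {s.test_error:.4f}{note}")
```

In this sample the two books with the highest test error are also the ones with the fewest test characters, which is why the sketch reports both pieces of information together.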
11 changes: 9 additions & 2 deletions overview.sh
Original file line number Diff line number Diff line change
@@ -2,6 +2,11 @@

# Generate a table with an overview (number of lines, number of characters, recognition error)
# over the books used in the training process.
#
# After running this script you can use
# gnuplot overview.gnuplot
# to generate a few nice diagrams showing the amount of ground-truth data and the character accuracy
# for each of the books used in the model training process.

ocropus-rpred -h > /dev/null 2>/dev/null
if [ $? != 0 ]; then
@@ -10,9 +15,11 @@ if [ $? != 0 ]; then
fi

echo "Recognizing training data..."
#ocropus-rpred -q -n -m `pwd`/fraktur.pyrnn.gz -Q6 "training/*.bin.png"
ocropus-rpred -q -n -m `pwd`/fraktur.pyrnn.gz -Q6 "training/*.bin.png"
ocropus-rpred -q -n -m `pwd`/fraktur.pyrnn.gz -Q6 "training/*.nrm.png"
echo "Recognizing testing data..."
#ocropus-rpred -q -n -m `pwd`/fraktur.pyrnn.gz -Q6 "testing/*.bin.png"
ocropus-rpred -q -n -m `pwd`/fraktur.pyrnn.gz -Q6 "testing/*.bin.png"
ocropus-rpred -q -n -m `pwd`/fraktur.pyrnn.gz -Q6 "testing/*.nrm.png"

ls training/*.gt.txt | cut -c 10- | sed 's/_.*//' | uniq | while read book; do echo -ne "$book\t"; LANG= wc -m -l training/${book}_*.gt.txt |grep total | sed 's/[^ 0-9].*//' ; done > r1.txt
ls training/*.gt.txt | cut -c 10- | sed 's/_.*//' | uniq | while read book; do echo -ne "$book\t"; LANG= wc -m -l testing/${book}_*.gt.txt |grep total | sed 's/[^ 0-9].*//' ; done > r2.txt
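
# The two pipelines above derive a book name from everything before the first
# underscore in each file name and then sum up lines and characters per book.
# The same idea can be sketched in Python. This is an illustration only and is
# not part of overview.sh; the function name `count_ground_truth` is hypothetical.
# It counts Unicode code points, whereas the result of `wc -m` depends on the locale.

```python
from collections import defaultdict
from pathlib import Path


def count_ground_truth(directory):
    """Return {book: (lines, characters)} for all *.gt.txt files in directory."""
    totals = defaultdict(lambda: [0, 0])
    for path in sorted(Path(directory).glob("*.gt.txt")):
        # Same rule as sed 's/_.*//': keep everything before the first underscore.
        book = path.name.split("_", 1)[0]
        text = path.read_text(encoding="utf-8")
        totals[book][0] += text.count("\n")  # like wc -l: number of newlines
        totals[book][1] += len(text)         # code points, newlines included
    return {book: (lines, chars) for book, (lines, chars) in totals.items()}


# Usage, assuming the repository layout with a "training" directory:
#     for book, (lines, chars) in count_ground_truth("training").items():
#         print(book, lines, chars, sep="\t")
```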
Original file line number Diff line number Diff line change
@@ -1 +1 @@
Durch die getroffenen. Vorſichtsmaßregeln war es gelungen,
Durch die getroffenen Vorſichtsmaßregeln war es gelungen,
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
heranzuwerfen. Tatſächlich gingen die Ruſſen bei Sztabin und
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
hinter dem Njemen lebhafter, doch hatte auch hier die deutſche
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
ſal überlaſſen hatte, als er keinen Ausweg mehr ſah, mag von
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Grodno aus die ruſſiſche Heeresleitung beſchworen haben, alles
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
General Siewers, der vor Tagen bereits ſeine Armee ihrem Schick-
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Führung bereits genügend Kräfte zur Abwehr eines etwaigen
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
(ſiehe Skizze 2). Auch in Gegend der Feſtung Olita wurde es
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Geſpenſt der Schlacht von Tannenberg haben heraufſteigen ſehen.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
und in blutigem Ringen unter ſtarken Verluſten zurükgeworfen
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
das XL. Reſervekorps und die 4. Kavalleriediviſion aufgehalten
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
feindlichen Entlaſtungsvorſtoßes gegen den Rücken der Ein-
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
öſtlich zum Angriff über den Bobr vor, wurden aber hier durch
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
irgend Erreichbare zum Einſaz ſeiner eingeſchloſſenen Diviſionen
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
ſchließungsarmee bereitgeſtellt. In weiſer Vorausſicht war die
