Workflow Guide recommendations

Konstantin Baierer edited this page Nov 17, 2020 · 18 revisions

To make OCR-D easier to use and workflows easier to configure, we provide two workflows that can serve as a starting point for your own OCR-D tests. They were determined by testing the processors listed above on selected pages of prints from the 17th and 18th centuries.

The results vary quite a lot from page to page. In most cases, segmentation is a problem.

Note that for our test pages, not all steps described above were needed to obtain the best results. Depending on your particular images, you might want to include those processors again for better results.

We are currently working on regression tests that will allow us to provide better-founded workflows soon. These will replace the interim solutions given here.

Best results for selected pages

The following workflow produced the best results for 'simple' pages (e.g. this page), with a character error rate (CER) of about 1%.

| Step | Processor | Parameter |
| --- | --- | --- |
| 1 | ocrd-cis-ocropy-binarize | |
| 2 | ocrd-anybaseocr-crop | |
| 3 | ocrd-skimage-binarize | -P method li |
| 4 | ocrd-skimage-denoise | -P level-of-operation page |
| 5 | ocrd-tesserocr-deskew | -P operation_level page |
| 7 | ocrd-cis-ocropy-segment | -P level-of-operation page |
| 9 | ocrd-tesserocr-deskew | |
| 13 | ocrd-cis-ocropy-dewarp | |
| 14 | ocrd-calamari-recognize | -P checkpoint /path/to/models/\*.ckpt.json |
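
The table gives only processor names and parameters. As the ocrd process example on this page shows, each step also takes an input and an output file group (-I, -O), and the task strings drop the ocrd- prefix. The following shell sketch makes that relationship explicit. Note that step_command and step_task are hypothetical helper names used only for illustration; they print command lines and do not run any processor.

```shell
# Print the standalone call for one row of the table.
# Arguments: processor, input file group, output file group, optional -P parameters.
step_command() {
  processor=$1
  input_grp=$2
  output_grp=$3
  shift 3
  printf '%s -I %s -O %s' "$processor" "$input_grp" "$output_grp"
  if [ "$#" -gt 0 ]; then
    printf ' %s' "$@"
  fi
  printf '\n'
}

# Print the same step as an `ocrd process` task string (no "ocrd-" prefix).
step_task() {
  cmd=$(step_command "$@")
  printf '%s\n' "${cmd#ocrd-}"
}

# Step 3 of the table, with the file group names used in the example on this page.
step_command ocrd-skimage-binarize OCR-D-CROP OCR-D-BIN2 -P method li
step_task ocrd-skimage-binarize OCR-D-CROP OCR-D-BIN2 -P method li
```

Printing a single step this way can help when one stage, typically segmentation, needs to be rerun or inspected on its own.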

Example with ocrd-process

ocrd process \
  "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
  "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
  "cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-SEG-REG -O OCR-D-SEG-REG-DESKEW" \
  "cis-ocropy-dewarp -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-LINE-RESEG-DEWARP" \
  "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint /path/to/models/\*.ckpt.json"

Notes:
(1) This workflow expects your images to be stored in a folder called OCR-D-IMG. If your images are saved in a different folder, replace OCR-D-IMG in the second line of the call above (-I OCR-D-IMG) with the name of your folder, e.g. -I MAX.
(2) For the last processor in this workflow, ocrd-calamari-recognize, you must pass the local path to your model as the value of the checkpoint parameter. The last line of the ocrd-process call above could, for example, look like this:

  "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint /test/data/calamari_models/\*.ckpt.json"

All the other lines can just be copied and pasted.
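
Since only the input folder and the model path change between setups, the call can be wrapped so that these two values are passed in once. The following is a minimal sketch under stated assumptions: run_simple_workflow and the DRY_RUN switch are hypothetical names introduced here, not part of OCR-D. By default the function only prints the task strings; set DRY_RUN=0 to hand them to ocrd process.

```shell
# Sketch: the first workflow with the input folder and checkpoint as arguments.
# $1: input file group (default OCR-D-IMG), $2: Calamari checkpoint path or pattern.
run_simple_workflow() {
  input_grp=${1:-}
  checkpoint=${2:-}
  [ -n "$input_grp" ] || input_grp='OCR-D-IMG'
  # Placeholder path from this page; replace it with your local model path.
  [ -n "$checkpoint" ] || checkpoint='/path/to/models/\*.ckpt.json'

  # Store the nine tasks as the function's positional parameters.
  set -- \
    "cis-ocropy-binarize -I $input_grp -O OCR-D-BIN" \
    "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
    "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
    "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
    "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
    "cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG -P level-of-operation page" \
    "tesserocr-deskew -I OCR-D-SEG-REG -O OCR-D-SEG-REG-DESKEW" \
    "cis-ocropy-dewarp -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-LINE-RESEG-DEWARP" \
    "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint $checkpoint"

  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$@"      # dry run: one task string per line
  else
    ocrd process "$@"       # real run inside an OCR-D workspace
  fi
}

# Dry run with the folder name and model path from the notes above.
run_simple_workflow MAX '/test/data/calamari_models/\*.ckpt.json'
```

The dry-run default is a deliberate choice: it lets you check the substituted folder name and model path before starting a long OCR run.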

Good results for slower computers

If your computer is not that powerful, you may want to try this workflow. It works well for simple pages and still produces good results in less time.

| Step | Processor | Parameter |
| --- | --- | --- |
| 1 | ocrd-cis-ocropy-binarize | |
| 2 | ocrd-anybaseocr-crop | |
| 3 | ocrd-skimage-binarize | -P method li |
| 4 | ocrd-skimage-denoise | -P level-of-operation page |
| 5 | ocrd-tesserocr-deskew | -P operation_level page |
| 7 | ocrd-tesserocr-segment-region | |
| 7a | ocrd-segment-repair | -P plausibilize true |
| 9 | ocrd-tesserocr-deskew | |
| 10 | ocrd-cis-ocropy-clip | |
| 11 | ocrd-tesserocr-segment-line | |
| 12 | ocrd-cis-ocropy-clip | -P level-of-operation line |
| 13 | ocrd-cis-ocropy-dewarp | |
| 14 | ocrd-tesserocr-recognize | -P textequiv_level glyph -P overwrite_words true -P model GT4HistOCR_50000000.997_191951 |

Example with ocrd-process

ocrd process \
  "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
  "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
  "tesserocr-segment-region -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG" \
  "segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true" \
  "tesserocr-deskew -I OCR-D-SEG-REPAIR -O OCR-D-SEG-REG-DESKEW" \
  "cis-ocropy-clip -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-REG-DESKEW-CLIP" \
  "tesserocr-segment-line -I OCR-D-SEG-REG-DESKEW-CLIP -O OCR-D-SEG-LINE" \
  "cis-ocropy-clip -I OCR-D-SEG-LINE -O OCR-D-SEG-LINE-CLIP -P level-of-operation line" \
  "cis-ocropy-dewarp -I OCR-D-SEG-LINE-CLIP -O OCR-D-SEG-LINE-RESEG-DEWARP" \
  "tesserocr-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P textequiv_level glyph -P overwrite_words true -P model GT4HistOCR_50000000.997_191951"

Notes:
(1) This workflow expects your images to be stored in a folder called OCR-D-IMG. If your images are saved in a different folder, replace OCR-D-IMG in the second line of the call above (-I OCR-D-IMG) with the name of your folder, e.g. -I my_images.
(2) For the last processor in this workflow, ocrd-tesserocr-recognize, the environment variable TESSDATA_PREFIX must point to the directory where the models are stored if they are not in the default location.
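
A quick check before starting the workflow can catch a wrong TESSDATA_PREFIX early. The sketch below assumes the model is stored as a .traineddata file, as is usual for Tesseract models. The directory $HOME/tessdata is a hypothetical example location and check_model is a helper name introduced here for illustration.

```shell
# Model name as used in the workflow above.
model_name="GT4HistOCR_50000000.997_191951"

# Succeeds if the given directory contains <model_name>.traineddata.
check_model() {
  if [ -f "$1/${model_name}.traineddata" ]; then
    printf 'found %s in %s\n' "$model_name" "$1"
  else
    printf 'missing %s.traineddata in %s\n' "$model_name" "$1" >&2
    return 1
  fi
}

# Hypothetical model directory; keep an already exported value if there is one.
TESSDATA_PREFIX="${TESSDATA_PREFIX:-$HOME/tessdata}"
export TESSDATA_PREFIX

check_model "$TESSDATA_PREFIX" || echo "Adjust TESSDATA_PREFIX before running ocrd process."
```

Because the variable is exported, an ocrd process call started from the same shell session sees the same model directory.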
