Stefan Weil edited this page Oct 4, 2020 · 4 revisions

Introduction to OCR-D workflows

Introduction

From a perspective of software, OCR-D consists of a multitude of processors that focus on doing a single step in a complete OCR workflow, such as binarization, region segmentation or post-correction.

Since all OCR-D processors implement the same command line interface, it is relatively straightforward to define a sequence of processor calls in a shell script file, with the data flow defined by the input/output file groups, e.g.

#!/bin/sh
ocrd-olena-binarize -I MAX -O BNI
ocrd-tesserocr-segment-region -I BIN -O SEG

This workflow will fail because of a typo: -O BNI should be -O BIN. Moreover, since the workflow steps are not validated in their entirety before execution, this will fail at runtime, leaving you with an inconsistent, half-processed workspace.

To avoid such pitfalls, a workflow should be defined and executed in such a way that ensures, before execution, that it is

  • well-formed, i.e. there are no syntax errors
  • resolvable, i.e. all OCR-D processors used are available on the system and the parameters passed are valid
  • consistent, i.e. all input file groups are either in the METS or produced by other steps
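The consistency criterion can be illustrated with a minimal shell sketch. It is not how OCR-D itself validates workflows: check_wiring is a hypothetical helper, and the sketch assumes that each task is a plain string with -I/-O options and that multiple file groups are separated by commas.

```shell
#!/bin/sh
# Sketch only: check_wiring is a hypothetical helper, not part of OCR-D.
# $1: space-separated file groups already present in the METS
# remaining arguments: one task string each
check_wiring() {
    known=" $1 "
    shift
    for task in "$@"; do
        in_grps=""; out_grps=""; prev=""
        # Split the task into words (default IFS) and pick up the
        # values that follow -I and -O.
        for word in $task; do
            case "$prev" in
                -I) in_grps=$word ;;
                -O) out_grps=$word ;;
            esac
            prev=$word
        done
        # Every input group must already be known (assumes comma-separated lists).
        for grp in $(echo "$in_grps" | tr ',' ' '); do
            case "$known" in
                *" $grp "*) ;;
                *) echo "unknown input file group: $grp"; return 1 ;;
            esac
        done
        # Output groups become available to the following tasks.
        for grp in $(echo "$out_grps" | tr ',' ' '); do
            known="$known$grp "
        done
    done
    echo "ok"
}

# The workflow from above, with MAX as the only file group in the METS:
check_wiring MAX \
    'olena-binarize -I MAX -O BNI' \
    'tesserocr-segment-region -I BIN -O SEG' \
    || echo "fix the file group wiring before running anything"
```

Because the check runs before any processor is started, the BNI/BIN typo is caught without touching the workspace.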

Therefore, we recommend that you do not build your own workflows as shell scripts. Instead, use ocrd process, which comes bundled with OCR-D/core, or @bertsky's Makefile-based workflow-configuration to run OCR-D workflows. At the end of this page, we give a sneak preview of an upcoming mechanism for defining OCR-D workflows in a straightforward shell-script-like syntax, independent of the workflow engine.

ocrd process

ocrd process expects a list of sequential tasks in the form of OCR-D processor command line calls. Every task must be quoted to define the boundary between tasks, and the ocrd- prefix that all OCR-D processors share can be omitted for conciseness, e.g.

ocrd process \
  -m /path/to/mets.xml \
  'olena-binarize -I MAX -O BNI' \
  'tesserocr-segment-region -I BIN -O SEG'

Before executing this workflow, ocrd process will report the error 'Input file group not contained in METS or produced by previous steps: BIN' and allow the user to debug the input/output file group wiring.

Every task supports a subset of the OCR-D processor CLI:

  • -I: input file group(s)
  • -O: output file group(s)
  • -p PARAM_JSON: Parameter(s) as a JSON object
  • -P NAME VALUE: Set parameter NAME to VALUE
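The -p and -P options carry the same information in two notations. As a rough sketch of how the two relate, the hypothetical helper params_to_json below turns NAME VALUE pairs into a JSON object. The rule that purely numeric values stay unquoted is an assumption made for this illustration, not a description of OCR-D's own parsing.

```shell
#!/bin/sh
# Sketch only: params_to_json is a hypothetical helper, not part of OCR-D.
# It takes NAME VALUE pairs (as passed with -P) and prints one JSON object
# (as passed with -p). Values made up only of digits and dots are emitted
# unquoted; everything else is quoted as a string. No escaping is done,
# so values containing quotes or backslashes are out of scope here.
params_to_json() {
    json=""
    while [ "$#" -ge 2 ]; do
        name=$1
        value=$2
        shift 2
        case "$value" in
            ''|*[!0-9.]*) value="\"$value\"" ;;
        esac
        json="$json${json:+, }\"$name\": $value"
    done
    printf '{%s}\n' "$json"
}

params_to_json impl sauvola-ms-split
params_to_json level-of-operation page noise_maxsize 3.0
```

For a handful of simple values, -P is usually easier to type; -p is more convenient when the parameters already exist as a JSON object.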

ocrd process itself supports a few options:

$ ocrd process --help
Usage: ocrd process [OPTIONS] TASKS...

  Process a series of tasks

Options:
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  -m, --mets TEXT                 METS to process
  -g, --page-id TEXT              ID(s) of the pages to process
  --overwrite                     Remove output pages/images if they already
                                  exist

The -m/--mets option allows you to define the path to the mets.xml if it is not in the current working directory.

The -l/--log-level option overrides the log level for all processors. Run with -l DEBUG to get detailed log output.

The main benefit of the --overwrite option is in re-processing a workflow. Normally, a processor aborts with an error message if any of the output files it intends to create already exist. With --overwrite, existing output pages/images are removed instead, so these errors no longer occur. While useful for debugging and developing a workflow, you should never use --overwrite in a production environment! If you are not re-processing, these errors indicate genuine bugs; please report them instead of hiding them with --overwrite.

To see two complete workflows, one fast, one slow, expressed in ocrd process, see the bottom of the OCR-D workflow guide.

workflow-configuration

workflow-configuration is an OCR-D workflow engine based on GNU make, a widely used software build tool that is very versatile. There are many advantages of a Makefile over pure shell script for managing and building software (in fact, we use Makefiles in most OCR-D projects) that also come in handy when processing OCR-D workflows, such as:

  • built-in dependency logic that detects whether steps have already run and can skip them
  • conditionals to adapt processing dynamically to the data at hand
  • multi-target rules that allow very concise formulation of similar workflow steps (think: text recognition with tesseract with 10 different models)

To reduce the complexity of creating such workflows, workflow-configuration defines a Makefile-based domain-specific language that allows expressing steps in a syntactically simple manner, and finally includes a master makefile to "do the plumbing".

Here's an example extract from one of the many bundled pre-defined workflows:

BIN = $(INPUT)-BINPAGE-sauvola

$(BIN): $(INPUT)
$(BIN): TOOL = ocrd-olena-binarize
$(BIN): PARAMS = "impl": "sauvola-ms-split"

DEN = $(BIN)-DENOISE-ocropy

$(DEN): $(BIN)
$(DEN): TOOL = ocrd-cis-ocropy-denoise
$(DEN): PARAMS = "level-of-operation": "page", "noise_maxsize": 3.0

Assuming the $(INPUT) filegroup is MAX, this workflow snippet is equivalent to this ocrd process call:

ocrd process \
  'olena-binarize -I MAX -O MAX-BINPAGE-sauvola -P impl sauvola-ms-split' \
  'cis-ocropy-denoise -I MAX-BINPAGE-sauvola -O MAX-BINPAGE-sauvola-DENOISE-ocropy -P level-of-operation page -P noise_maxsize 3.0'

As you can see, the workflow-configuration variant is much more structured and less repetitive (and therefore: less error-prone).
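The long file group names in the ocrd process call come from expanding the Makefile variables one after the other. The same chaining can be mimicked in plain shell, as a sketch that only builds and prints the names and runs no processor:

```shell
#!/bin/sh
# Each file group name is derived from the one it depends on,
# mirroring the BIN and DEN definitions in the Makefile extract.
INPUT=MAX
BIN="${INPUT}-BINPAGE-sauvola"
DEN="${BIN}-DENOISE-ocropy"

echo "binarization output: $BIN"
echo "denoising output:    $DEN"
```

Changing INPUT in one place renames every derived file group consistently. The ocrd process variant, by contrast, repeats each name by hand.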

Note: Irrespective of dependency checking (for incremental/fail-over processing), the declarative content of a makefile-based configuration can also be extracted into plain shell syntax via make -Bsn -f workflow.mk > workflow.sh.

While the advanced features of GNU make used by workflow-configuration workflows involve a learning curve, it is easy to get started by adapting the various predefined workflows.

workflow-configuration comes with the ocrd-make command line tool that you can use by specifying the workflow you want to run with the -f option:

ocrd-make -f /path/to/my-workflow-config.mk

OCRD-WF

OCR-D is currently developing a specification for OCRD-WF, a workflow-engine-independent definition of OCR-D workflows, with the implementation and tooling in OCR-D/core developed in parallel.

Syntactically, it will be similar to ocrd process, e.g.

#!/usr/bin/env ocrd-wf
olena-binarize -I MAX -O BIN
tesserocr-segment-region -I BIN -O SEG
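Since the OCRD-WF syntax is not final, the following is only an illustration of its similarity to ocrd process, not an official adapter. The hypothetical helper wf_to_process_args reads such a file and prints the argument list of an equivalent ocrd process call, one argument per line. It assumes that every non-empty line not starting with # is one task.

```shell
#!/bin/sh
# Sketch only: wf_to_process_args is a hypothetical helper, and the
# line-per-task reading of OCRD-WF is an assumption based on the example above.
wf_to_process_args() {
    set -- ocrd process
    while IFS= read -r line; do
        case "$line" in
            ''|'#'*) continue ;;   # skip blank lines, the shebang and comments
        esac
        set -- "$@" "$line"        # each remaining line becomes one task argument
    done
    printf '%s\n' "$@"
}

wf_to_process_args <<'EOF'
#!/usr/bin/env ocrd-wf
olena-binarize -I MAX -O BIN
tesserocr-segment-region -I BIN -O SEG
EOF
```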

Once we have agreed on the exact syntactical/semantical details of this domain-specific language, we will create adapters to convert an OCRD-WF workflow to

We will keep you posted in the OCR-D Gitter chat.
