Intro workflows
From a software perspective, OCR-D consists of a multitude of processors, each of which focuses on a single step in a complete OCR workflow, such as binarization, region segmentation or post-correction.
Since all OCR-D processors implement the same command line interface, it is relatively straightforward to define a sequence of processor calls in a shell script file, with the data flow defined by the input/output file groups, e.g.
#!/bin/sh
ocrd-olena-binarize -I MAX -O BNI
ocrd-tesserocr-segment-region -I BIN -O SEG
This workflow will fail because of a typo: -O BNI should be -O BIN. Moreover, since the workflow steps are not validated in their entirety before execution, the failure only occurs at runtime, leaving you with an inconsistent, half-processed workspace.
To avoid such pitfalls, a workflow should be defined and executed in a way that ensures, before execution, that it is
- well-formed, i.e. there are no syntax errors
- resolvable, i.e. all OCR-D processors used are available on the system and the parameters passed are valid
- consistent, i.e. all input file groups are either in the METS or produced by other steps
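The consistency criterion can be illustrated with a minimal sketch in plain shell. It is not how ocrd process or workflow-configuration implement validation; the function name check_wiring and the INPUT:OUTPUT step notation are made up for this illustration, and each step is assumed to have exactly one input and one output file group.

```shell
#!/bin/sh
# check_wiring METS_GROUPS STEP...   (hypothetical helper, for illustration only)
#   METS_GROUPS: space-separated file groups already present in the METS
#   STEP:        "INPUT_GROUP:OUTPUT_GROUP" for each step, in execution order
# Prints the first unresolved input file group and returns 1, otherwise returns 0.
check_wiring() {
    available=" $1 "
    shift
    for step in "$@"; do
        input=${step%%:*}     # text before the first colon
        output=${step#*:}     # text after the first colon
        case "$available" in
            *" $input "*) ;;  # input is in the METS or was produced earlier
            *)
                echo "unresolved input file group: $input"
                return 1
                ;;
        esac
        # From now on the output of this step may serve as an input.
        available="$available$output "
    done
    return 0
}

# The faulty script above: the first step writes BNI, the second reads BIN.
check_wiring "MAX" "MAX:BNI" "BIN:SEG" || echo "workflow is inconsistent"
```

Because the check only looks at file group names, it can run before any processor is started, which is the point of validating a workflow in its entirety.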
Therefore, we recommend that you do not build your own workflows as shell scripts but rather use ocrd process, which comes bundled with OCR-D/core, or @bertsky's Makefile-based workflow-configuration to run OCR-D workflows. Finally, we'll give you a sneak preview of an upcoming mechanism to define OCR-D workflows in a straightforward shell-script-like syntax, independent of the workflow engine.
ocrd process expects a list of sequential tasks in the form of OCR-D processor command line calls. Every task must be quoted to define the boundary between tasks, and the ocrd- prefix that all OCR-D processors share can be omitted for conciseness, e.g.
ocrd process \
-m /path/to/mets.xml \
'olena-binarize -I MAX -O BNI' \
'tesserocr-segment-region -I BIN -O SEG'
Before executing this workflow, ocrd process will raise an issue
Input file group not contained in METS or produced by previous steps: BIN'
and allow the user to debug the input/output file group wiring.
Every task supports a subset of the OCR-D processor CLI:
- -I: input file group(s)
- -O: output file group(s)
- -p PARAM_JSON: parameter(s) as a JSON object
- -P NAME VALUE: set parameter NAME to VALUE
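As a sketch of how the two parameter options relate, the same task can be spelled with -P or with -p. The parameter impl and its value sauvola-ms-split are taken from the workflow-configuration example later in this article; the output group BIN and the variable names are chosen for illustration. The inner single quotes assume that each task string is split using shell-like quoting rules.

```shell
#!/bin/sh
# Two spellings of the same task (illustrative variable names).

# Variant 1: one -P NAME VALUE pair per parameter.
task_short='olena-binarize -I MAX -O BIN -P impl sauvola-ms-split'

# Variant 2: all parameters as one JSON object passed to -p.
# The JSON is wrapped in single quotes so that it stays a single token
# inside the task string (assumes shell-like splitting of the task).
task_json="olena-binarize -I MAX -O BIN -p '{\"impl\": \"sauvola-ms-split\"}'"

# Print the task strings; each would be passed as one quoted argument,
# e.g. ocrd process "$task_short"
printf '%s\n' "$task_short" "$task_json"
```

The -P form avoids nested quoting and is usually easier to read for a handful of scalar values, while -p is convenient when the parameters already exist as JSON.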
ocrd process itself supports a few options:
$ ocrd process --help
Usage: ocrd process [OPTIONS] TASKS...
Process a series of tasks
Options:
-l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
Log level
-m, --mets TEXT METS to process
-g, --page-id TEXT ID(s) of the pages to process
--overwrite Remove output pages/images if they already
exist
The -m/--mets option allows you to define the path to the mets.xml if it is not in the current working directory.
The -l/--log-level option overrides the log level for all processors. Run with -l DEBUG to get detailed log output.
The main benefit of the --overwrite option is in re-processing a workflow. Normally, a processor will abort with an error message if any of the output files it intends to create already exist. With --overwrite these errors go away. While useful for debugging and developing a workflow, you should never use --overwrite in a production environment! If you are not re-processing, these errors are genuine bugs. Please report them instead of hiding them with --overwrite.
To see two complete workflows, one fast, one slow, expressed as ocrd process calls, see the bottom of the OCR-D workflow guide.
workflow-configuration is an OCR-D workflow engine based on GNU make, a widely used and very versatile software build tool. A Makefile has many advantages over a pure shell script for managing and building software (in fact, we use Makefiles in most OCR-D projects), and these also come in handy when processing OCR-D workflows, such as:
- built-in dependency logic that detects whether steps have already run and can skip them
- conditionals to adapt processing dynamically to the data at hand
- multi-target rules that allow very concise formulation of similar workflow steps (think: text recognition with tesseract with 10 different models)
To reduce the complexity of creating such workflows, workflow-configuration defines a sort of Makefile-based domain-specific language that allows expressing steps in a syntactically simple manner and finally includes a master makefile to "do the plumbing".
Here's an example extract from one of the many bundled pre-defined workflows:
BIN = $(INPUT)-BINPAGE-sauvola
$(BIN): $(INPUT)
$(BIN): TOOL = ocrd-olena-binarize
$(BIN): PARAMS = "impl": "sauvola-ms-split"
DEN = $(BIN)-DENOISE-ocropy
$(DEN): $(BIN)
$(DEN): TOOL = ocrd-cis-ocropy-denoise
$(DEN): PARAMS = "level-of-operation": "page", "noise_maxsize": 3.0
Assuming the $(INPUT) file group is MAX, this workflow snippet is equivalent to this ocrd process call:
ocrd process \
'olena-binarize -I MAX -O MAX-BINPAGE-sauvola -P impl sauvola-ms-split' \
'cis-ocropy-denoise -I MAX-BINPAGE-sauvola -O MAX-BINPAGE-sauvola-DENOISE-ocropy -P level-of-operation page -P noise_maxsize 3.0'
As you can see, the workflow-configuration variant is much more structured and less repetitive (and therefore less error-prone).
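To see how the chained file group names in the makefile extract expand, the same naming scheme can be mimicked with plain shell variables. This only illustrates the string expansion, assuming the input file group is MAX; it is not how workflow-configuration itself evaluates the rules.

```shell
#!/bin/sh
# Each name is derived from the name of the step it depends on,
# mirroring BIN = $(INPUT)-BINPAGE-sauvola and DEN = $(BIN)-DENOISE-ocropy.
INPUT=MAX
BIN="${INPUT}-BINPAGE-sauvola"
DEN="${BIN}-DENOISE-ocropy"

# Print the expanded names, one per line.
printf '%s\n' "$BIN" "$DEN"
```

Each name is written once and reused, so renaming the input file group or an intermediate step changes every dependent name consistently.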
Note: Irrespective of dependency checking (for incremental/fail-over processing), the declarative content of a makefile-based configuration can also be extracted into plain shell syntax via make -Bsn -f workflow.mk > workflow.sh.
While making use of the advanced features of GNU make in workflow-configuration workflows involves a learning curve, it is easy to get started by adapting the various predefined workflows.
workflow-configuration comes with the ocrd-make command line tool, which you can use by specifying the workflow you want to run with the -f option:
ocrd-make -f /path/to/my-workflow-config.mk
OCR-D is currently developing a specification for OCRD-WF, a workflow-engine-independent definition of OCR-D workflows, with the implementation and tooling in OCR-D/core developed in parallel.
Syntactically, it will be similar to ocrd process, e.g.:
#!/usr/bin/env ocrd-wf
olena-binarize -I MAX -O BIN
tesserocr-segment-region -I BIN -O SEG
Once we have agreed on the exact syntactic and semantic details of this domain-specific language, we will create adapters to convert an OCRD-WF workflow to
- ocrd process call
- workflow-configuration makefile
- possibly Taverna workflow
- possibly BPML
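What such an adapter might look like for the ocrd process target can be sketched in a few lines of shell. This is purely hypothetical: the OCRD-WF syntax is not final, the function name wf_to_process is invented here, and the sketch assumes one task per line, with blank lines and lines starting with # (including the shebang) ignored.

```shell
#!/bin/sh
# wf_to_process: read an OCRD-WF-like script on stdin and print an
# equivalent `ocrd process` call on stdout (hypothetical helper).
# Limitation: tasks that themselves contain single quotes are not escaped.
wf_to_process() {
    printf 'ocrd process'
    # The `|| [ -n "$line" ]` part also handles a last line without newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ''|'#'*) continue ;;   # skip blank lines, comments and the shebang
        esac
        printf " '%s'" "$line"     # one quoted argument per task
    done
    printf '\n'
}

# Convert the example workflow shown above.
wf_to_process <<'EOF'
#!/usr/bin/env ocrd-wf
olena-binarize -I MAX -O BIN
tesserocr-segment-region -I BIN -O SEG
EOF
```

The sketch only prints the resulting command, so the output can be inspected before anything is executed.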
We will keep you posted in the OCR-D Gitter chat.