add 1st draft line GT/training specs #105

Open
wants to merge 17 commits into
base: master
10 changes: 10 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -3,3 +3,13 @@
# Specification of the technical architecture, interface definitions and data exchange format(s)

See [https://ocr-d.github.io/](https://ocr-d.github.io/).

## Line Ground Truth

* [Spec](./gt-spec.md)
* [BagIt profile](./gt-profile.yml)

## Engine training

* [Spec](./training-spec.md)
* [JSON schema](./training-schema.yml)
1 change: 1 addition & 0 deletions gt-profile.json
@@ -0,0 +1 @@
{"BagIt-Profile-Info":{"BagIt-Profile-Identifier":"https://ocr-d.github.io/gt-profile.json","BagIt-Profile-Version":"1.2.0","Source-Organization":"OCR-D","External-Description":"BagIt profile for OCR line Ground Truth","Contact-Name":"Konstantin Baierer","Contact-Email":"[email protected]","Version":0.1},"Bag-Info":{"Bagging-Date":{"required":false},"Source-Organization":{"required":false},"Gt-Transcription-Extension":{"required":false,"default":".gt.txt"},"Gt-Transcription-Media-Type":{"required":false,"default":"text/plain"},"Gt-Prediction-Directory":{"required":false,"default":"pred"},"Gt-Prediction-Extension":{"required":false,"default":".pred.txt"},"Gt-Prediction-Media-Type":{"required":false,"default":"text/plain"},"Gt-Transcription-Directory":{"required":false,"default":"text"},"Gt-Transcription-Normalization":{"required":true,"values":["NFD","NFKD","NFC","NFKC","non-normalized"]},"Gt-Color-Image-Extension":{"required":false,"default":".color.png"},"Gt-Color-Image-Media-Type":{"required":false,"default":"image/png","values":["image/png","image/tiff","image/jpeg"]},"Gt-Color-Image-Directory":{"required":false,"default":"img"},"Gt-Grayscale-Image-Extension":{"required":false,"default":".nrm.png"},"Gt-Grayscale-Image-Media-Type":{"required":false,"default":"image/png","values":["image/png","image/tiff","image/jpeg"]},"Gt-Grayscale-Image-Directory":{"required":false,"default":"grayscale"},"Gt-Bitonal-Image-Extension":{"required":false,"default":".bin.png"},"Gt-Bitonal-Image-Media-Type":{"required":false,"default":"image/png","values":["image/png","image/tiff"]},"Gt-Bitonal-Image-Directory":{"required":false,"default":"bin"},"Gt-Line-Metadata-Extension":{"required":false,"default":".json"},"Gt-Line-Metadata-Media-Type":{"required":false,"default":"application/json","values":["application/json","text/vnd.yaml"]},"Gt-Line-Metadata-Directory":{"required":false,"default":"meta"},"Gt-Directory":{"required":false,"default":"ground-truth"},"Gt-Directory-Structure":{"required":false,"default":"flat","values":["flat","flat-nested","subfolders","subfolders-nested"]}},"Manifests-Required":["sha512"],"Tag-Manifests-Required":[],"Tag-Files-Required":[],"Tag-Files-Allowed":["README.md","build.sh"],"Allow-Fetch.txt":false,"Serialization":"allowed","Accept-Serialization":"application/zip","Accept-BagIt-Version":["1.0"]}
115 changes: 115 additions & 0 deletions gt-profile.yml
@@ -0,0 +1,115 @@
BagIt-Profile-Info:
  BagIt-Profile-Identifier: https://ocr-d.github.io/gt-profile.json
  BagIt-Profile-Version: '1.2.0'
  Source-Organization: OCR-D
Contributor

What about information on the origin of the digitized lines?

  • A minimal bibliographic record based on Dublin Core (DC)?
  • What about artificially generated lines (plus degradation)?
  • What about the degradation algorithm?

This comment may be in the wrong place here; it probably belongs under `## Line metadata`.

Member Author

See https://github.com/OCR-D/spec/pull/105/files/6827085d051e945062203b82ef921e54025cfbda#diff-ee256e83a17cfe309565c88ab376091a for the definition of what is currently supposed to be in there. Bibliographic metadata would be in the METS referred to by `metsUrl`. I am not sure how to encode provenance at the line level, though. @VolkerHartmann?

  External-Description: BagIt profile for OCR line Ground Truth
  Contact-Name: Konstantin Baierer
  Contact-Email: [email protected]
  Version: 0.1
Bag-Info:
  Bagging-Date:
    required: false
  Source-Organization:
    required: false
  Gt-Transcription-Extension:
    required: false
    default: '.gt.txt'
  Gt-Transcription-Media-Type:
    required: false
    default: 'text/plain'
  Gt-Prediction-Directory:
    required: false
    default: 'pred'
  Gt-Prediction-Extension:
    required: false
    default: '.pred.txt'
  Gt-Prediction-Media-Type:
    required: false
    default: 'text/plain'
  Gt-Transcription-Directory:
    required: false
    default: 'text'
  Gt-Transcription-Normalization:
    required: true
    values:
      - NFD
      - NFKD
      - NFC
      - NFKC
      - non-normalized
  Gt-Color-Image-Extension:
    required: false
    default: '.color.png'
  Gt-Color-Image-Media-Type:
    required: false
    default: 'image/png'
    values:
      - 'image/png'
      - 'image/tiff'
      - 'image/jpeg'
  Gt-Color-Image-Directory:
    required: false
    default: 'img'
  Gt-Grayscale-Image-Extension:
    required: false
    default: '.nrm.png'
  Gt-Grayscale-Image-Media-Type:
    required: false
    default: 'image/png'
    values:
      - 'image/png'
Contributor

Would it make more sense to differentiate between compressed TIFF and JPEG 2000?

Member Author

Do you mean additionally allowing `image/jp2`? Do engines accept JPEG 2000 input for training?

      - 'image/tiff'
      - 'image/jpeg'
  Gt-Grayscale-Image-Directory:
    required: false
    default: 'grayscale'
  Gt-Bitonal-Image-Extension:
    required: false
    default: '.bin.png'
  Gt-Bitonal-Image-Media-Type:
    required: false
    default: 'image/png'
    values:
      - 'image/png'
      - 'image/tiff'
  Gt-Bitonal-Image-Directory:
    required: false
    default: 'bin'
  Gt-Line-Metadata-Extension:
    required: false
    default: '.json'
  Gt-Line-Metadata-Media-Type:
    required: false
    default: 'application/json'
    values:
      - 'application/json'
      - 'text/vnd.yaml'
  Gt-Line-Metadata-Directory:
    required: false
    default: 'meta'
  Gt-Directory:
    required: false
    default: 'ground-truth'
  Gt-Directory-Structure:
    required: false
    default: 'flat'
    values:
      # img and transcription in the Gt-Directory
      - 'flat'
      # img and transcription in the same dir below Gt-Directory
      - 'flat-nested'
      # img and transcription in subfolders Gt-Bitonal-Image-Directory and Gt-Transcription-Directory of Gt-Directory
      - 'subfolders'
      # img and transcription in subfolders Gt-Bitonal-Image-Directory and Gt-Transcription-Directory in the same dir below Gt-Directory
      - 'subfolders-nested'
Manifests-Required: ['sha512']
Tag-Manifests-Required: []
Tag-Files-Required: []
Tag-Files-Allowed:
  - README.md
  - build.sh
Allow-Fetch.txt: false
kba marked this conversation as resolved.
Serialization: allowed
Accept-Serialization: application/zip
Accept-BagIt-Version:
  - '1.0'
218 changes: 218 additions & 0 deletions gt-spec.md
@@ -0,0 +1,218 @@
# linegt

> An exchange format for line-based ground truth for OCR

<!-- BEGIN-MARKDOWN-TOC -->
* [Rationale](#rationale)
* [BagIt](#bagit)
* [BagIt profile](#bagit-profile)
    * [Gt-Transcription-Extension](#gt-transcription-extension)
    * [Gt-Transcription-Media-Type](#gt-transcription-media-type)
    * [Gt-Transcription-Directory](#gt-transcription-directory)
    * [Gt-Transcription-Normalization](#gt-transcription-normalization)
    * [Gt-Prediction-Extension](#gt-prediction-extension)
    * [Gt-Prediction-Media-Type](#gt-prediction-media-type)
    * [Gt-Prediction-Directory](#gt-prediction-directory)
    * [Gt-Grayscale-Image-Extension](#gt-grayscale-image-extension)
    * [Gt-Grayscale-Image-Media-Type](#gt-grayscale-image-media-type)
    * [Gt-Grayscale-Image-Directory](#gt-grayscale-image-directory)
    * [Gt-Color-Image-Extension](#gt-color-image-extension)
    * [Gt-Color-Image-Media-Type](#gt-color-image-media-type)
    * [Gt-Color-Image-Directory](#gt-color-image-directory)
    * [Gt-Bitonal-Image-Extension](#gt-bitonal-image-extension)
    * [Gt-Bitonal-Image-Media-Type](#gt-bitonal-image-media-type)
    * [Gt-Bitonal-Image-Directory](#gt-bitonal-image-directory)
    * [Gt-Line-Metadata-Extension](#gt-line-metadata-extension)
    * [Gt-Line-Metadata-Media-Type](#gt-line-metadata-media-type)
    * [Gt-Line-Metadata-Directory](#gt-line-metadata-directory)
    * [Gt-Directory](#gt-directory)
    * [Gt-Directory-Structure](#gt-directory-structure)
* [Line metadata](#line-metadata)

<!-- END-MARKDOWN-TOC -->

## Rationale

Recent OCR (optical character recognition) engines are not actually
character-based anymore but use neural networks that operate on lines. These
engines can be trained with images of text lines and their transcription
("ground truth"), plus engine-specific configurations.

This specification defines a standardized format for bundling such ground
truth, based on the BagIt conventions.

## BagIt

A `linegt` bag must be a valid BagIt bag:

* Root folder must contain a file `bagit.txt`
* Root folder must contain a file `bag-info.txt` with metadata about the bag
* All payload files must be under a folder `/data`
* Every file in `/data` along with its `<algo>` checksum must be listed in a
file `manifest-<algo>.txt`
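The requirements above can be sketched as a small check in Python. This is only an illustration using the standard library, not a full BagIt validator: the function name `check_bag_layout` is hypothetical, and the sketch assumes each manifest line has the form `<checksum> <path relative to the bag root>`.

```python
import hashlib
from pathlib import Path


def check_bag_layout(bag_root, algo="sha512"):
    """Return a list of problems found in the bag layout (empty if none)."""
    root = Path(bag_root)
    problems = []

    # The two tag files named in the list above.
    for name in ("bagit.txt", "bag-info.txt"):
        if not (root / name).is_file():
            problems.append(f"missing {name}")

    data_dir = root / "data"
    if not data_dir.is_dir():
        problems.append("missing data/ folder")
        return problems

    manifest = root / f"manifest-{algo}.txt"
    if not manifest.is_file():
        problems.append(f"missing {manifest.name}")
        return problems

    # Assumption: one "<checksum> <path>" entry per non-empty line.
    listed = {}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if line.strip():
            checksum, path = line.split(maxsplit=1)
            listed[path.strip()] = checksum.lower()

    # Every payload file must be listed with a matching checksum.
    for file in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        rel = file.relative_to(root).as_posix()
        actual = hashlib.new(algo, file.read_bytes()).hexdigest()
        if rel not in listed:
            problems.append(f"{rel} not listed in {manifest.name}")
        elif listed[rel] != actual:
            problems.append(f"checksum mismatch for {rel}")
    return problems
```

An empty result means the layout checks passed; the profile-specific requirements are not covered by this sketch.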

## BagIt profile

In addition to the requirements of BagIt, an `ocr_linegt` bag must adhere to
the `ocr_linegt` BagIt profile.
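As a rough sketch of how a consumer might apply the profile, the following Python snippet reads the `Gt-*` tags from the text of `bag-info.txt`, fills in profile defaults and rejects a missing or unknown `Gt-Transcription-Normalization`. The function names are hypothetical, only a subset of the defaults is shown, and the parser assumes simple `Label: value` lines without continuation lines.

```python
# Subset of the defaults from the profile; the remaining Gt-* tags work the same way.
PROFILE_DEFAULTS = {
    "Gt-Transcription-Extension": ".gt.txt",
    "Gt-Transcription-Directory": "text",
    "Gt-Bitonal-Image-Extension": ".bin.png",
    "Gt-Bitonal-Image-Directory": "bin",
    "Gt-Directory": "ground-truth",
    "Gt-Directory-Structure": "flat",
}
NORMALIZATIONS = {"NFD", "NFKD", "NFC", "NFKC", "non-normalized"}
STRUCTURES = {"flat", "flat-nested", "subfolders", "subfolders-nested"}


def parse_bag_info(text):
    """Parse 'Label: value' lines (assumption: no continuation lines)."""
    tags = {}
    for line in text.splitlines():
        if ":" in line and not line[0].isspace():
            label, value = line.split(":", 1)
            tags[label.strip()] = value.strip()
    return tags


def resolve_settings(tags):
    """Merge Gt-* tags over the profile defaults and check the constrained values."""
    settings = dict(PROFILE_DEFAULTS)
    settings.update({k: v for k, v in tags.items() if k.startswith("Gt-")})

    normalization = settings.get("Gt-Transcription-Normalization")
    if normalization is None:
        raise ValueError("Gt-Transcription-Normalization is required")
    if normalization not in NORMALIZATIONS:
        raise ValueError(f"unknown normalization: {normalization}")
    if settings["Gt-Directory-Structure"] not in STRUCTURES:
        raise ValueError(f"unknown structure: {settings['Gt-Directory-Structure']}")
    return settings
```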

### Gt-Transcription-Extension

Extension of the transcription files. Default: `.gt.txt`.

### Gt-Transcription-Media-Type

Media type of the transcription files. Default: `text/plain`.

### Gt-Transcription-Directory

Name of the subfolder containing transcriptions if [`Gt-Directory-Structure`] is `subfolders` or `subfolders-nested`. Default: `text`.

### Gt-Transcription-Normalization

**Required**

All transcriptions MUST be UTF-8 encoded Unicode. This property defines the
Unicode normalization form of the transcriptions.

One of `NFC`, `NFKC`, `NFD`, `NFKD` or `non-normalized`.

![Illustration unicode normalization](http://unicode.org/reports/tr15/images/UAX15-NormFig6.jpg)
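
To illustrate what the declared normalization form means for a transcription, here is a minimal Python sketch based on the standard library's `unicodedata.normalize`. The helper `is_normalized` is a hypothetical name, not something defined by this spec.

```python
import unicodedata


def is_normalized(text, form):
    """Check a transcription against a Gt-Transcription-Normalization value."""
    if form == "non-normalized":
        # Nothing is promised about the text, so there is nothing to check.
        return True
    return unicodedata.normalize(form, text) == text


composed = "\u00e4"      # "ä" as a single code point
decomposed = "a\u0308"   # "a" followed by a combining diaeresis

# Canonical forms: NFC composes, NFD decomposes.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# Compatibility forms also fold characters such as the long s (U+017F),
# which matters for historical prints: NFC keeps it, NFKC turns it into "s".
assert unicodedata.normalize("NFC", "\u017f") == "\u017f"
assert unicodedata.normalize("NFKC", "\u017f") == "s"
```

Because `NFKC` and `NFKD` discard distinctions such as the long s, the choice of form determines which characters a model trained on the bag can learn to output.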

### Gt-Prediction-Extension

Extension of the prediction files. Used for evaluation. Default: `.pred.txt`.

### Gt-Prediction-Media-Type

Media type of the prediction files. Default: `text/plain`.

### Gt-Prediction-Directory

Name of the subfolder containing predictions if [`Gt-Directory-Structure`] is `subfolders` or `subfolders-nested`. Default: `pred`.

### Gt-Grayscale-Image-Extension

Extension of the grayscale image files. Default: `.nrm.png`.

### Gt-Grayscale-Image-Media-Type

Media type of the grayscale image files. Default: `image/png`.

### Gt-Grayscale-Image-Directory

Name of the subfolder containing grayscale images if [`Gt-Directory-Structure`] is `subfolders` or `subfolders-nested`. Default: `grayscale`.

### Gt-Color-Image-Extension

Extension of the color image files. Default: `.color.png`.

### Gt-Color-Image-Media-Type

Media type of the color image files. Default: `image/png`.

### Gt-Color-Image-Directory

Name of the subfolder containing color images if [`Gt-Directory-Structure`] is `subfolders` or `subfolders-nested`. Default: `img`.

### Gt-Bitonal-Image-Extension

Extension of the bitonal image files. Default: `.bin.png`.

### Gt-Bitonal-Image-Media-Type

Media type of the bitonal image files. Default: `image/png`.

### Gt-Bitonal-Image-Directory

Name of the subfolder containing bitonal images if [`Gt-Directory-Structure`] is `subfolders` or `subfolders-nested`. Default: `bin`.

### Gt-Line-Metadata-Extension

Extension of the [line metadata] files. Default: `.json`.

### Gt-Line-Metadata-Media-Type

Media type of the [line metadata] files. Default: `application/json`.

### Gt-Line-Metadata-Directory

Name of the subfolder containing [line metadata] if [`Gt-Directory-Structure`] is `subfolders` or `subfolders-nested`. Default: `meta`.

### Gt-Directory

Directory below `/data` containing the ground truth. Default: `ground-truth`.

### Gt-Directory-Structure

Layout of the files below [`Gt-Directory`]. Default: `flat`. One of:

- `flat`: images and transcriptions directly in the [`Gt-Directory`]
- `flat-nested`: images and transcriptions together in the same directory below [`Gt-Directory`]
- `subfolders`: images and transcriptions in the subfolders [`Gt-Bitonal-Image-Directory`] and [`Gt-Transcription-Directory`] of [`Gt-Directory`]
- `subfolders-nested`: images and transcriptions in the subfolders [`Gt-Bitonal-Image-Directory`] and [`Gt-Transcription-Directory`] within the same directory below [`Gt-Directory`]
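
The four layouts can be sketched as a path-building function. This is one possible reading of the list above, not a normative definition: the function `line_paths`, the `line_id` naming and the `group` argument (the name of the intermediate directory in the `-nested` variants, for example one directory per page) are assumptions made for illustration. The defaults are those of the profile for bitonal images and transcriptions.

```python
from pathlib import PurePosixPath

STRUCTURES = ("flat", "flat-nested", "subfolders", "subfolders-nested")


def line_paths(structure, line_id, group=None, gt_dir="ground-truth",
               image_dir="bin", text_dir="text",
               image_ext=".bin.png", text_ext=".gt.txt"):
    """Return (image path, transcription path) relative to the bag root."""
    if structure not in STRUCTURES:
        raise ValueError(f"unknown structure: {structure}")

    base = PurePosixPath("data") / gt_dir
    if structure.endswith("-nested"):
        # Assumption: nested layouts add one grouping directory per set of lines.
        if group is None:
            raise ValueError("nested structures need a group directory")
        base = base / group

    if structure.startswith("subfolders"):
        image_base, text_base = base / image_dir, base / text_dir
    else:
        image_base = text_base = base

    return (str(image_base / (line_id + image_ext)),
            str(text_base / (line_id + text_ext)))
```

For example, `line_paths("subfolders", "0001")` yields `data/ground-truth/bin/0001.bin.png` and `data/ground-truth/text/0001.gt.txt` under these assumptions.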

## Line metadata

In addition to the bag-wide metadata defined by the [BagIt profile], metadata
can be saved per line to preserve the provenance of every single line.

Line metadata can be encoded in JSON or YAML (depending on
[`Gt-Line-Metadata-Extension`] and [`Gt-Line-Metadata-Media-Type`]).

Line metadata MUST adhere to this JSON schema:

<!-- BEGIN-EVAL -w '```yaml' '```' -- cat single-line.yml -->
```yaml
description: Schema for provenance of single lines
type: object
required:
  - imageUrl
properties:
  coords:
    description: Coordinates as array of x-y-pairs
    type: array
    items:
      type: array
      length: 2
      items:
        type: number
  pageUrl:
    description: URL of the page (resp. URL of the PAGE-XML file)
    type: string
  imageUrl:
    description: URL of the image (resp. the `pg:imageFilename` in the PAGE-XML file)
    type: string
  bagUrl:
    description: URL of the bag that contains the page
    type: string
  metsUrl:
    description: URL of the METS document that contains the page
    type: string
  lineId:
    description: ID of the line within the PAGE-XML doc
    type: string
  teiUrl:
    description: URL of the TEI document that contains the page
    type: string
  xpath:
    description: XPath to the line if no `fileId` was provided
    type: string
```

<!-- END-EVAL -->
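
A line metadata record can be checked against the schema above with a few lines of Python. The following is a hand-written sketch using only the standard library rather than a JSON Schema validator; `check_line_metadata` is a hypothetical name and the URLs in the example record are placeholders. The sketch reads `length: 2` as "exactly two numbers per pair". Note that `length` is not a standard JSON Schema keyword, so a generic validator would ignore it.

```python
import json

STRING_FIELDS = ("pageUrl", "imageUrl", "bagUrl", "metsUrl",
                 "lineId", "teiUrl", "xpath")


def check_line_metadata(record):
    """Return a list of schema violations for one line metadata record."""
    if not isinstance(record, dict):
        return ["record must be an object"]

    errors = []
    if "imageUrl" not in record:
        errors.append("imageUrl is required")
    for field in STRING_FIELDS:
        if field in record and not isinstance(record[field], str):
            errors.append(f"{field} must be a string")

    coords = record.get("coords", [])
    if not isinstance(coords, list):
        errors.append("coords must be an array")
    else:
        for pair in coords:
            # A pair is a two-element array of numbers (booleans excluded).
            ok = (isinstance(pair, list) and len(pair) == 2
                  and all(isinstance(n, (int, float)) and not isinstance(n, bool)
                          for n in pair))
            if not ok:
                errors.append(f"invalid coordinate pair: {pair!r}")
    return errors


# Placeholder record; the URL and IDs are made up for illustration.
example = json.loads("""
{
  "imageUrl": "https://example.org/images/page-0001.png",
  "lineId": "line_0001",
  "coords": [[10, 20], [300, 20], [300, 60], [10, 60]]
}
""")
assert check_line_metadata(example) == []
```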

<!--
==================================================================
Reference links
==================================================================
--->
[`Gt-Directory`]: #gt-directory
[`Gt-Bitonal-Image-Directory`]: #gt-bitonal-image-directory
[`Gt-Transcription-Directory`]: #gt-transcription-directory
[`Gt-Directory-Structure`]: #gt-directory-structure
[`Gt-Line-Metadata-Directory`]: #gt-line-metadata-directory
[`Gt-Line-Metadata-Extension`]: #gt-line-metadata-extension
[`Gt-Line-Metadata-Media-Type`]: #gt-line-metadata-media-type
[BagIt Profile]: #bagit-profile
[line metadata]: #line-metadata
1 change: 1 addition & 0 deletions model-evaluation-schema.json
@@ -0,0 +1 @@
{"$id":"https://ocr-d.github.io/schemas/v1/model-evaluation-schema.json","type":"object","required":["engineName","engineVersion","groundTruthBag","model"],"properties":{"engineName":{"type":"string","enum":["ocropus","kraken","tesseract","calamari"]},"engineVersion":{"type":"string"},"recognizerArguments":{"description":"Command line arguments passed to the CLI recognition tool","type":"array","default":[]},"groundTruthBag":{"description":"A bag of line ground truth adhering to https://ocr-d.github.io/gt-profile.json","type":"string"},"model":{"description":"URL/path to model to use","type":"string"},"measures":{"description":"which evaluation measures to produce","type":"array","items":{"type":"string","enum":["cer-per-line","cer-total","ler","wer-per-line","wer-total","confusion-matrix"]}}}}
39 changes: 39 additions & 0 deletions model-evaluation-schema.yml
@@ -0,0 +1,39 @@
$id: https://ocr-d.github.io/schemas/v1/model-evaluation-schema.json
type: object
required:
  - engineName
  - engineVersion
  - groundTruthBag
  - model
properties:
  engineName:
Member Author

Delete. Derived from model.

    type: string
    enum:
      - ocropus
      - kraken
      - tesseract
      - calamari
  engineVersion:
Member Author

Delete. Derived from model.

    type: string
  recognizerArguments:
    description: Command line arguments passed to the CLI recognition tool
    type: array
    default: []
  groundTruthBag:
    description: A bag of line ground truth adhering to https://ocr-d.github.io/gt-profile.json
    type: string
  model:
    description: URL/path to model to use
    type: string
  measures:
    description: which evaluation measures to produce
    type: array
    items:
      type: string
      enum:
        - cer-per-line
        - cer-total
        - ler
        - wer-per-line
        - wer-total
        - confusion-matrix
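
The schema names the measures but does not define them. As a sketch of one common reading, the snippet below computes `cer-per-line` and `cer-total` from pairs of ground truth and prediction strings. It assumes that the character error rate is the Levenshtein distance divided by the length of the ground truth; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer_per_line(pairs):
    """One error rate per (ground truth, prediction) pair."""
    return [edit_distance(gt, pred) / max(len(gt), 1) for gt, pred in pairs]


def cer_total(pairs):
    """Errors summed over all lines, divided by all ground truth characters."""
    errors = sum(edit_distance(gt, pred) for gt, pred in pairs)
    chars = sum(len(gt) for gt, _ in pairs)
    return errors / max(chars, 1)
```

Under this definition `cer-total` weights long lines more heavily than the mean of the per-line values would, which is one reason to offer both measures.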
1 change: 1 addition & 0 deletions single-line.json
@@ -0,0 +1 @@
{"description":"Schema for provenance of single lines","type":"object","required":["imageUrl"],"properties":{"coords":{"description":"Coordinates as array of x-y-pairs","type":"array","items":{"type":"array","length":2,"items":{"type":"number"}}},"pageUrl":{"description":"URL of the page (resp. URL of the PAGE-XML file)","type":"string"},"imageUrl":{"description":"URL of the image (resp. the `pg:imageFilename` in the PAGE-XML file)","type":"string"},"bagUrl":{"description":"URL of the bag that contains the page","type":"string"},"metsUrl":{"description":"URL of the METS document that contains the page","type":"string"},"lineId":{"description":"ID of the line within the PAGE-XML doc","type":"string"},"teiUrl":{"description":"URL of the TEI document that contains the page","type":"string"},"xpath":{"description":"XPath to the line if no `fileId` was provided","type":"string"}}}