Training Fraktur

Stefan Weil edited this page Dec 17, 2019 · 5 revisions

This is a collection of sources for training OCR models which can be used to recognize Fraktur.

Austrian Newspapers

Austrian Newspapers is a ground truth data set created from Austrian newspapers by the Austrian National Library (Österreichische Nationalbibliothek). See https://github.com/tesseract-ocr/tesstrain/wiki/AustrianNewspapers for more information.

GT4HistOCR

GT4HistOCR is ground truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin. See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR for more information.

Ocropus Fraktur

https://github.com/jze/ocropus-model_fraktur provides ground truth data of good quality: 3852 lines for training and 414 lines for testing.

Open issues

Some umlauts might be replaced by aͤ, oͤ, uͤ.
The ground truth uses the minus sign instead of ⸗ (the double oblique hyphen, U+2E17).
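
If a more historical transcription is wanted, the two issues above could be handled in a preprocessing step before training. The following is only a minimal sketch, not part of the ocropus-model_fraktur repository. It assumes that every lowercase ä, ö, ü should become the base letter plus U+0364 (combining small letter e), that the "minus sign" in the data is the ASCII hyphen-minus, and that a hyphen at the very end of a line is a line-break hyphen. The function name `to_historical_form` is hypothetical.

```python
import unicodedata

# Assumption: only lowercase umlauts are converted; capitals are left alone.
UMLAUT_MAP = str.maketrans({
    "ä": "a\u0364",  # a + COMBINING LATIN SMALL LETTER E
    "ö": "o\u0364",
    "ü": "u\u0364",
})

DOUBLE_OBLIQUE_HYPHEN = "\u2e17"  # ⸗


def to_historical_form(line: str) -> str:
    """Rewrite one ground truth line (hypothetical helper)."""
    # Normalize to NFC first so that decomposed umlauts (u + U+0308)
    # are also caught by the translation table.
    text = unicodedata.normalize("NFC", line).translate(UMLAUT_MAP)
    # Assumption: a trailing ASCII hyphen-minus marks a line-break hyphen.
    if text.endswith("-"):
        text = text[:-1] + DOUBLE_OBLIQUE_HYPHEN
    return text
```

Whether such a conversion is appropriate depends on the printed source, so the result should be checked against the line images before it is used for training.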