Akkadian Cuneiform
Akkadian is the language that was spoken and written in ancient Mesopotamia from around 2600 to 600 BCE and remained in academic and liturgical use until about 100 CE. It was written in the Cuneiform script. The two most important variants of Akkadian are Babylonian and Assyrian. Cuneiform signs have different forms in these two variants, which becomes important in the training sections below.
The data set was scraped from the website of ORACC (Open Richly Annotated Cuneiform Corpus). Only those ORACC projects that contain texts written in Cuneiform were scraped: RIBo (Royal Inscriptions of Babylonia online), RINAP (Royal Inscriptions of the Neo-Assyrian Period), and SAAo (State Archives of Assyria Online).
The scraper was written in Python 3.5 and is uploaded here. For each scraped document, it opened the popup with the cuneiform text, extracted its contents, cleaned them up and saved them to a text file. All saved text files were then concatenated into a single file, which serves as the Tesseract training text corpus.
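The concatenation step can be sketched in a few lines of Python. This is only an illustration, not the scraper's actual code: the function name build_corpus, the *.txt file pattern and the directory layout are assumptions.

```python
from pathlib import Path


def build_corpus(text_dir, corpus_path):
    """Concatenate all *.txt files in text_dir into one corpus file.

    Hypothetical helper: the *.txt pattern and the paths are assumptions.
    Returns the number of files that were concatenated.
    """
    corpus_file = Path(corpus_path).resolve()
    # Sort the files so the result is reproducible, and skip the corpus
    # itself in case it is stored in the same directory.
    text_files = [
        path for path in sorted(Path(text_dir).glob("*.txt"))
        if path.resolve() != corpus_file
    ]
    with open(corpus_file, "w", encoding="utf-8") as corpus:
        for text_file in text_files:
            content = text_file.read_text(encoding="utf-8")
            corpus.write(content)
            # Keep documents on separate lines even if a file
            # does not end with a newline.
            if content and not content.endswith("\n"):
                corpus.write("\n")
    return len(text_files)
```

For example, build_corpus("scraped", "akk.training_text") would merge every text file from a (hypothetical) directory scraped into one file.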
The text corpus was then processed with the create_dictdata tool from the pytesstrain package, which created the additional files needed for training a Tesseract model. These training files were added both to the LSTM-based language data repository (here) and to the legacy language data repository (here).
The data set was used for training Tesseract models for the Akkadian language. Below you can read about training the LSTM-based Tesseract model and then, for the sake of completeness, about training the legacy Tesseract model.
This setup was first created by Shreeshrii here and updated by the author here.
The training runs on Ubuntu 22.04 or any other Linux distribution with Tesseract 4 or above.
First, update the package database and install the required packages (replace vim with your favourite editor):
apt update
apt install -y tesseract-ocr git vim bc python3-pip python3-venv
Then, check out the language data and the training repository:
git clone https://github.com/wincentbalin/tesstrain-akk
cd tesstrain-akk
# Get a cup of tea after executing the next line
git clone https://github.com/tesseract-ocr/langdata_lstm
Create a Python virtual environment, activate it and install the required packages:
python3 -m venv venv
. ./venv/bin/activate
pip3 install kraken pytesstrain
Add ISRI Analytic Tools for OCR Evaluation with UTF-8 support:
apt install -y build-essential libutf8proc-dev
git clone https://github.com/eddieantonio/ocreval
cd ocreval
make
make install
cd ..
Then install the ocrevalUAtion tool:
apt install -y wget default-jre
mkdir ~/ocrevaluation
wget -O ~/ocrevaluation/ocrevaluation.jar https://github.com/impactcentre/ocrevalUAtion/releases/download/v1.3.4/ocrevalUAtion-1.3.4-jar-with-dependencies.jar
Now install the fonts: copy the *.[ot]tf files of the fonts in the following list to /usr/share/fonts:
- Akkadian (Old Babylonian, by George Douros)
- Assurbanipal (Neo-Assyrian, by Sylvie Vanséveren)
- Assyrian (Neo-Assyrian, by George Douros)
- CuneiformComposite (Neo-Assyrian, from ORACC)
- CuneiformNA (Neo-Assyrian, from ORACC)
- CuneiformOB (Old Babylonian, from ORACC)
- Santakku (Old Babylonian, by Sylvie Vanséveren)
- SantakkuM (Old Babylonian, by Sylvie Vanséveren)
- Segoe UI Historic (Neo-Assyrian, from Microsoft)
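Copying the font files can also be scripted. The following sketch assumes that the downloaded fonts have been unpacked somewhere below a single directory; the function name copy_fonts and that directory are hypothetical. Writing to /usr/share/fonts usually requires root privileges.

```python
import shutil
from pathlib import Path


def copy_fonts(source_dir, target_dir="/usr/share/fonts"):
    """Copy all .otf and .ttf files found below source_dir into target_dir.

    Hypothetical helper; returns the names of the copied files.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for font in sorted(Path(source_dir).rglob("*")):
        # Same idea as the shell pattern *.[ot]tf, but case-insensitive
        # and including subdirectories.
        if font.is_file() and font.suffix.lower() in (".otf", ".ttf"):
            shutil.copy2(font, target / font.name)
            copied.append(font.name)
    return copied
```

Calling copy_fonts("downloads/fonts") would then place every OpenType and TrueType file from that directory tree into /usr/share/fonts.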
List the real font names and, if needed, update the font list (you can use any installed editor instead of vim):
text2image --fonts_dir /usr/share/fonts --list_available_fonts
vim akk.fontslist.txt
Copy the language data for Akkadian:
cp -a langdata_lstm/akk langdata
If needed, adjust the Tesseract configuration by adding the file akk.config to the language data:
cp akk.config langdata
To adjust the iteration count, go to the last line in the script trainlayer.sh, which starts with MAX_ITERATIONS=, and change the value.
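If you prefer to change the value without opening an editor, the edit can be sketched in Python. The helper set_max_iterations is hypothetical, and the exact shape of the line in trainlayer.sh is an assumption: the sketch replaces only the value directly after MAX_ITERATIONS= and leaves the rest of the line untouched.

```python
import re


def set_max_iterations(script_text, iterations):
    """Return script_text with the value of the last line that starts
    with MAX_ITERATIONS= replaced by iterations (hypothetical helper)."""
    lines = script_text.splitlines(keepends=True)
    # Walk backwards so that the last matching line is the one changed.
    for index in range(len(lines) - 1, -1, -1):
        if lines[index].startswith("MAX_ITERATIONS="):
            # Replace only the value; anything after it on the
            # same line is kept as it is.
            lines[index] = re.sub(
                r"^MAX_ITERATIONS=\S*",
                f"MAX_ITERATIONS={iterations}",
                lines[index],
                count=1,
            )
            return "".join(lines)
    raise ValueError("no line starting with MAX_ITERATIONS= found")
```

To apply it, read trainlayer.sh into a string, pass it through set_max_iterations together with the desired count, and write the result back to the file.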
Then execute the scripts of Shreeshrii:
# Prepare training data
bash txt2img.sh | tee txt2img.log
bash img2lstmf.sh | tee img2lstmf.log
# Perform training
bash trainlayer.sh | tee trainlayer.log
# Evaluate results
bash checkpointeval.sh | tee reports/checkpointeval.txt
The models are created from checkpoint files during training and are copied to the directories data/akk/tessdata_best and data/akk/tessdata_fast. Study the evaluation results in the file reports/checkpointeval-summary.txt (more information is in the file reports/checkpointeval.txt) and choose the model most suitable for your needs.
Training for Tesseract 3 was done using this Makefile. It performs a conventional Tesseract 3 training workflow. The fonts used are CuneiformNA, CuneiformOB and CuneiformComposite (all three downloaded from ORACC), as well as Segoe UI Historic (supplied with current Windows OS). The exposures used ranged from -3 to 3.
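To get a feeling for the amount of training data, the combinations of fonts and exposures can be enumerated as follows. This is a sketch only: the base names follow the usual Tesseract 3 convention lang.fontname.expN, while the font names as written here and the removal of spaces are assumptions and may differ from what the Makefile does.

```python
from itertools import product

# Font names as given in the text; the real names reported by
# text2image may differ (assumption).
FONTS = ["CuneiformNA", "CuneiformOB", "CuneiformComposite", "Segoe UI Historic"]
EXPOSURES = range(-3, 4)  # -3 to 3 inclusive


def training_image_names(lang="akk", fonts=FONTS, exposures=EXPOSURES):
    """Return one base name per font/exposure pair (hypothetical helper)."""
    names = []
    for font, exposure in product(fonts, exposures):
        # Spaces are dropped only to obtain names without spaces (assumption).
        names.append(f"{lang}.{font.replace(' ', '')}.exp{exposure}")
    return names


print(len(training_image_names()))  # 4 fonts x 7 exposures = 28
```

With 4 fonts and 7 exposure levels this gives 28 rendered variants of the training text.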
The training runs on Ubuntu 16.04 or any other Linux distribution with Tesseract 3.04 or 3.05.
First, update the package database and install the required packages (replace vim with your favourite editor):
apt update
apt install -y tesseract-ocr git vim wget python3-pip python3-venv
Create a Python virtual environment, activate it and install the required packages (the package versions are pinned for Ubuntu 16.04, which ships Python 3.5):
python3 -m venv venv
. ./venv/bin/activate
pip3 install wheel
pip3 install Pillow==7.2 pytesseract==0.3.6 pytesstrain
To be able to calculate additional font metrics, install Nick White's tools:
apt install -y libpango1.0-dev
git clone http://ancientgreekocr.org/grctraining.git
cd grctraining
make tools/addmetrics tools/xheight
cp tools/addmetrics tools/xheight /usr/local/bin
cd ..
Add the fonts as described in the previous section and list their names:
text2image --fonts_dir /usr/share/fonts --list_available_fonts
Get the Makefile:
wget https://gist.githubusercontent.com/wincentbalin/9329a6e994852ed477ba30ef4c29e71c/raw/0d2a8a76eea42698ede299406accca8b361c00fe/Makefile
Clone language data:
git clone https://github.com/tesseract-ocr/langdata.git ../langdata
To change the fonts to train with, edit the variables FONTS and FONTSJOINED (the latter is needed only because make does not process values with spaces correctly). Add the corresponding font rules at the bottom of the Makefile.
To change the training corpus, edit the variable CORPUS or set it on the command line (e.g. make CORPUS=another_textfile.txt).
Run make or, to utilise 4 processors, make -j 4.
The training takes about 1.5 days with the supplied training text and 4 fonts. The resulting file is akk.traineddata in the current directory.
The .zip archive with models for legacy Tesseract and for LSTM-based Tesseract, the latter in both the best and the fast variant, is available here. All models within this archive were trained with 9 fonts.
The models for LSTM-based Tesseract with 4 fonts are available here.