OCR quality. Extract, examine, clean, evaluate.

A set of scripts for working with PDF files after OCR. They help to extract the plaintext from the PDFs, examine a frequency list of words/tokens, clean the output with regular expressions (regex), run keyword-in-context (KWIC) searches, and evaluate the character error rate (CER).

For the R scripts: open ocr-quality.Rproj so that the working directory is already set to the project folder.

pdf2text.R

Extract the plaintext from a set of already searchable PDFs.

pdf2text_clean.R

Clean the plaintext with regex.
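The kind of regex clean-up meant here can be sketched as follows. This is a minimal Python illustration and not the content of pdf2text_clean.R; the function name `clean_ocr_text` and the four rules are assumptions chosen as typical OCR fixes.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Apply a few typical regex clean-up steps to OCR plaintext (illustrative rules)."""
    # Rejoin words split by a hyphen at a line break: "exam-\nple" -> "example".
    # Caution: this also joins genuine hyphenated compounds that happen to wrap.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that consist of a number only (assumed to be page numbers).
    text = re.sub(r"^[ \t]*\d+[ \t]*$\n?", "", text, flags=re.MULTILINE)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    print(clean_ocr_text("An exam-\nple  text\n12\nnext   line"))
```

The order of the rules matters: hyphenated words are rejoined before whitespace is normalised, so the line break that marks the split is still present when the first rule runs.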

regex_example.R

A simple example of using regex within R.

wordlist.R

A table of word frequencies from the extracted text.
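A word frequency table of this kind can be sketched in a few lines of Python. This is an illustration only, not the logic of wordlist.R; the function name `word_frequencies` and the tokenisation rule (lowercase, runs of word characters) are assumptions.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> list[tuple[str, int]]:
    """Return (word, count) pairs, most frequent first."""
    # Assumed tokenisation: lowercase the text and take runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common()


if __name__ == "__main__":
    for word, count in word_frequencies("The cat and the hat"):
        print(f"{word}\t{count}")
```

Scanning the low-frequency end of such a list is often a quick way to spot OCR errors, since misrecognised words tend to occur only once.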

kwic.R

A further exploration of token/word frequencies, plus KWIC (keyword in context).
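A keyword-in-context view lists every hit of a keyword together with a few tokens to its left and right. The Python sketch below shows the idea; it is not the code of the R script, and the function name `kwic`, the `window` parameter and the tokenisation rule are assumptions.

```python
import re


def kwic(text: str, keyword: str, window: int = 3) -> list[tuple[str, str, str]]:
    """Return (left context, hit, right context) for each occurrence of keyword."""
    # Assumed tokenisation: runs of word characters; matching is case-insensitive.
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    for left, hit, right in kwic(sample, "fox", window=2):
        print(f"{left:>20} | {hit} | {right}")
```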

txt2tei.R

Converts a plaintext file into a simple XML-TEI file, taking the metadata from the filename.
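The conversion can be sketched in Python as below. This is an illustration under stated assumptions and not the logic of txt2tei.R: the filename convention `author_title_year.txt`, the function name `txt_to_tei` and the choice of header elements are all hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

TEI_NS = "http://www.tei-c.org/ns/1.0"


def txt_to_tei(filename: str, text: str) -> str:
    """Wrap plaintext in a minimal TEI document; metadata comes from the filename."""
    # Hypothetical convention: author_title_year.txt
    # (raises ValueError if the stem has fewer than three parts).
    author, title, year = Path(filename).stem.split("_", 2)

    ET.register_namespace("", TEI_NS)

    def sub(parent, tag, content=None):
        # Create a child element in the TEI namespace.
        element = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}")
        element.text = content
        return element

    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    file_desc = sub(sub(tei, "teiHeader"), "fileDesc")
    title_stmt = sub(file_desc, "titleStmt")
    sub(title_stmt, "title", title)
    sub(title_stmt, "author", author)
    sub(sub(file_desc, "publicationStmt"), "p", "Unpublished OCR transcription.")
    sub(sub(file_desc, "sourceDesc"), "p", f"Converted from {filename} (year: {year}).")

    # One <p> per block of text separated by a blank line.
    body = sub(sub(tei, "text"), "body")
    for paragraph in text.split("\n\n"):
        if paragraph.strip():
            sub(body, "p", paragraph.strip())
    return ET.tostring(tei, encoding="unicode")


if __name__ == "__main__":
    print(txt_to_tei("Muster_Bericht_1900.txt", "First paragraph.\n\nSecond paragraph."))
```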

cer.R and cer_jiwer.ipynb

Evaluate CER (Character Error Rate) using scripts in R and Python (Jupyter Notebook).

Usage: load the ground-truth text (proofread, checked,...) into the "reference" variable and the OCR output into the "hypothesis" variable.
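CER is commonly defined as the character-level edit distance (substitutions, deletions and insertions) between hypothesis and reference, divided by the number of characters in the reference. The sketch below implements that definition with the Python standard library only; it is not the code of cer.R or of the notebook, and the function names are assumptions. Only the variable names "reference" and "hypothesis" follow the usage note above.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character substitutions, deletions and insertions."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "OCR quality"   # ground truth
    hypothesis = "0CR qualitv"  # text as returned by OCR
    print(f"CER: {cer(reference, hypothesis):.3f}")
```

Because the distance is divided by the reference length, the CER can exceed 1 when the OCR output contains many inserted characters.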


RISE-UNIBAS/ocr-quality
