diff --git a/README.md b/README.md index 23a8727..1792a1d 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # pero-enhance -Tool for text-guided textual document scan quality enhancement. The method works on lines of text that can be input through a PAGE XML or detected automatically by a buil-in OCR. By using text input along with the image, the results can be correctly readable even with parts of the original text missing or severly degraded in the source image. The tool includes functionality for cropping the text lines, processing them with our provided models for either text enhancement and inpainting, and for blending the enhanced text lines back into the source document image. We currently provide models for OCR and enhancement of czech newspapers optimized for low-quality scans from micro-films. +Tool for text-guided textual document scan quality enhancement. The method works on lines of text that can be input through a PAGE XML or detected automatically by a built-in OCR. By using text input along with the image, the results can be correctly readable even with parts of the original text missing or severely degraded in the source image. The tool includes functionality for cropping the text lines, processing them with our provided models for either text enhancement or inpainting, and for blending the enhanced text lines back into the source document image. We currently provide models for OCR and enhancement of Czech newspapers optimized for low-quality scans from micro-films. This package can be used as a standalone commandline tool to process document pages in bulk. Alternatively, the package provides a python class that can be integrated in third-party software. @@ -11,9 +11,9 @@ The method is based on Generative Adversarial Neural Networks (GAN) that are tra ## Installation The module requires python 3 and CUDA capable GPU. 
-Clone the repository (which clones pero-ocr as submodule) and add the pero_enhance and pero_ocr package to your `PYTHONPATH`: +Clone the repository (which clones pero-ocr as a submodule) and add the pero-enhance and pero-ocr packages to your `PYTHONPATH`: ``` -clone --recursive https://github.com/DCGM/pero-enhance.git +git clone --recursive https://github.com/DCGM/pero-enhance.git cd pero-enhance export PYTHONPATH=/abs/path/to/repo/pero-enhance:/abs/path/to/repo/pero-enhance/pero-ocr:$PYTHONPATH ``` @@ -33,7 +33,7 @@ Images in a folder can be enhanced by running following: ``` python repair_page.py -i ../example/ -x ../example/ -o /path/to/outputs ``` -The above command runs OCR, stores the OCR output in ./example/, and stores the enhance images in /path/to/outputs. The generated OCR Page XML files can be manualy revised if the OCR quality is not satisfactory, and the command can be repeated to use these changes for better image enhancement. +The above command runs OCR, stores the OCR output in ./example/, and stores the enhanced images in /path/to/outputs. The generated OCR Page XML files can be manually revised if the OCR quality is not satisfactory, and the command can be repeated to use these changes for better image enhancement. Alternatively, you can run interactive demo by running the following, where the xml file is optional: ``` @@ -41,7 +41,7 @@ python demo.py -i ../example/82f4ac84-6f1e-43ba-b1d5-e2b28d69508d.jpg -x ../exam ``` When Page XML file is not provided, automatic text detection and OCR is done using `PageParser` from the pero-ocr package. -The commands use by default models and settings optimized for czech newspapers downloaded during instalation. The models can be changed Different models for enhancement can be specified by `-r /path/to/enhancement-model/repair_engine.json` and OCR models by `-p /path/to/ocr-model/config.ini`. +The commands by default use models and settings optimized for Czech newspapers, downloaded during installation. 
Different models for enhancement can be specified by `-r /path/to/enhancement-model/repair_engine.json` and OCR models by `-p /path/to/ocr-model/config.ini`. ### EngineRepairCNN class In your code, you can directly use the EngineRepairCNN class to enhance individual text line images normalized to height of 32 pixels or of whole page images when the content is defined by pero.layout class. The processed images should have three channels represented as numpy arrays. diff --git a/training/ocr_engine/line_ocr_engine.py b/training/ocr_engine/line_ocr_engine.py index db2153c..f65e6c2 100644 --- a/training/ocr_engine/line_ocr_engine.py +++ b/training/ocr_engine/line_ocr_engine.py @@ -72,7 +72,7 @@ def process_lines(self, lines): if line.shape[0] == self.line_px_height: ValueError("Line height needs to be {} for this ocr network and is {} instead.".format(self.line_px_height, line.shape[0])) if line.shape[2] == 3: - ValueError("Line crops need three color channes, but this one has {}.".format(line.shape[2])) + ValueError("Line crops need three color channels, but this one has {}.".format(line.shape[2])) all_transcriptions = [None]*len(lines) all_logits = [None]*len(lines) diff --git a/training/transformer/compute_bleu.py b/training/transformer/compute_bleu.py index c0dc9da..4d93b53 100644 --- a/training/transformer/compute_bleu.py +++ b/training/transformer/compute_bleu.py @@ -65,7 +65,7 @@ def bleu_tokenize(string): except when a punctuation is preceded and followed by a digit (e.g. a comma/dot as a thousand/decimal separator). - Note that a numer (e.g. a year) followed by a dot at the end of sentence + Note that a number (e.g. a year) followed by a dot at the end of sentence is NOT tokenized, i.e. the dot stays with the number because `s/(\p{P})(\P{N})/ $1 $2/g` does not match this case (unless we add a space after each sentence). 
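The tokenization rule described in the `bleu_tokenize` docstring above can be illustrated with a simplified sketch. This is not the actual implementation (which uses Unicode property classes like `\p{P}` and `\p{N}`); the ASCII-only regexes and the function name are assumptions for illustration:

```python
import re

# ASCII-only sketch of the rule from the bleu_tokenize docstring: split
# punctuation from words, except when the punctuation is both preceded and
# followed by a digit (e.g. a comma/dot as a thousand/decimal separator).
nondigit_punct = re.compile(r"([^\d\s])([.,;:!?])")  # punct preceded by non-digit
punct_nondigit = re.compile(r"([.,;:!?])([^\d\s])")  # punct followed by non-digit

def bleu_tokenize_sketch(string):
    string = nondigit_punct.sub(r"\1 \2 ", string)
    string = punct_nondigit.sub(r" \1 \2", string)
    return string.split()
```

As the docstring notes, a trailing dot after a number stays attached: `bleu_tokenize_sketch("It cost 1,000.")` keeps `1,000.` as one token, while `bleu_tokenize_sketch("Hello, world!")` splits out the comma and exclamation mark.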
diff --git a/training/transformer/data_download.py b/training/transformer/data_download.py index c53d573..7a2ee58 100644 --- a/training/transformer/data_download.py +++ b/training/transformer/data_download.py @@ -83,7 +83,7 @@ _TARGET_THRESHOLD = 327 # Accept vocabulary if size is within this threshold VOCAB_FILE = "vocab.ende.%d" % _TARGET_VOCAB_SIZE -# Strings to inclue in the generated files. +# Strings to include in the generated files. _PREFIX = "wmt32k" _TRAIN_TAG = "train" _EVAL_TAG = "dev" # Following WMT and Tensor2Tensor conventions, in which the diff --git a/training/transformer/model/beam_search.py b/training/transformer/model/beam_search.py index d720adf..df6e717 100644 --- a/training/transformer/model/beam_search.py +++ b/training/transformer/model/beam_search.py @@ -521,7 +521,7 @@ def _gather_beams(nested, beam_indices, batch_size, new_beam_size): Nested structure containing tensors with shape [batch_size, new_beam_size, ...] """ - # Computes the i'th coodinate that contains the batch index for gather_nd. + # Computes the i'th coordinate that contains the batch index for gather_nd. # Batch pos is a tensor like [[0,0,0,0,],[1,1,1,1],..]. batch_pos = tf.range(batch_size * new_beam_size) // new_beam_size batch_pos = tf.reshape(batch_pos, [batch_size, new_beam_size]) diff --git a/training/transformer/model/embedding_layer.py b/training/transformer/model/embedding_layer.py index d966b09..6dd3f2e 100644 --- a/training/transformer/model/embedding_layer.py +++ b/training/transformer/model/embedding_layer.py @@ -32,7 +32,7 @@ def __init__(self, vocab_size, hidden_size, method="gather"): hidden_size: Dimensionality of the embedding. (Typically 512 or 1024) method: Strategy for performing embedding lookup. "gather" uses tf.gather which performs well on CPUs and GPUs, but very poorly on TPUs. 
"matmul" - one-hot encodes the indicies and formulates the embedding as a sparse + one-hot encodes the indices and formulates the embedding as a sparse matrix multiplication. The matmul formulation is wasteful as it does extra work, however matrix multiplication is very fast on TPUs which makes "matmul" considerably faster than "gather" on TPUs. diff --git a/training/transformer/model/transformer.py b/training/transformer/model/transformer.py index 3fc8190..fdbf5ba 100644 --- a/training/transformer/model/transformer.py +++ b/training/transformer/model/transformer.py @@ -40,7 +40,7 @@ class Transformer(object): Implemented as described in: https://arxiv.org/pdf/1706.03762.pdf The Transformer model consists of an encoder and decoder. The input is an int - sequence (or a batch of sequences). The encoder produces a continous + sequence (or a batch of sequences). The encoder produces a continuous representation, and the decoder uses the encoder output to generate probabilities for the output sequence. """ diff --git a/training/transformer/utils/tokenizer.py b/training/transformer/utils/tokenizer.py index 0613171..18af1bc 100644 --- a/training/transformer/utils/tokenizer.py +++ b/training/transformer/utils/tokenizer.py @@ -498,7 +498,7 @@ def _gen_new_subtoken_list( subtoken_counts, min_count, alphabet, reserved_tokens=None): """Generate candidate subtokens ordered by count, and new max subtoken length. - Add subtokens to the candiate list in order of length (longest subtokens + Add subtokens to the candidate list in order of length (longest subtokens first). When a subtoken is added, the counts of each of its prefixes are decreased. Prefixes that don't appear much outside the subtoken are not added to the candidate list. 
@@ -516,7 +516,7 @@ def _gen_new_subtoken_list( Args: subtoken_counts: defaultdict mapping str subtokens to int counts - min_count: int minumum count requirement for subtokens + min_count: int minimum count requirement for subtokens alphabet: set of characters. Each character is added to the subtoken list to guarantee that all tokens can be encoded. reserved_tokens: list of tokens that will be added to the beginning of the
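Beyond the typo fixed in the `line_ocr_engine.py` hunk above, the surrounding validation in `process_lines` looks suspect: both `ValueError`s are constructed but never raised, and both conditions fire when the shape is correct rather than when it is wrong. A minimal sketch of how the checks were presumably intended (the standalone function and default height here are assumptions for illustration, not the repo's API):

```python
import numpy as np

def validate_line(line, line_px_height=32):
    # Raise the error (the original only constructs it) and invert the
    # comparisons: the failure case is a height that does NOT match the
    # expected line height, and a channel count that is NOT three.
    if line.shape[0] != line_px_height:
        raise ValueError("Line height needs to be {} for this ocr network "
                         "and is {} instead.".format(line_px_height, line.shape[0]))
    if line.shape[2] != 3:
        raise ValueError("Line crops need three color channels, "
                         "but this one has {}.".format(line.shape[2]))
```

With this shape, a conforming crop such as `np.zeros((32, 120, 3))` passes silently, while a wrong height or channel count actually aborts processing instead of being silently ignored.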