Fix some typos (most of them found by codespell) #30

Open · wants to merge 4 commits into master

README.md: 10 changes (5 additions, 5 deletions)
@@ -1,6 +1,6 @@
# pero-enhance

- Tool for text-guided textual document scan quality enhancement. The method works on lines of text that can be input through a PAGE XML or detected automatically by a buil-in OCR. By using text input along with the image, the results can be correctly readable even with parts of the original text missing or severly degraded in the source image. The tool includes functionality for cropping the text lines, processing them with our provided models for either text enhancement and inpainting, and for blending the enhanced text lines back into the source document image. We currently provide models for OCR and enhancement of czech newspapers optimized for low-quality scans from micro-films.
+ Tool for text-guided textual document scan quality enhancement. The method works on lines of text that can be input through a PAGE XML or detected automatically by a built-in OCR. By using text input along with the image, the results can be correctly readable even with parts of the original text missing or severely degraded in the source image. The tool includes functionality for cropping the text lines, processing them with our provided models for either text enhancement and inpainting, and for blending the enhanced text lines back into the source document image. We currently provide models for OCR and enhancement of czech newspapers optimized for low-quality scans from micro-films.

This package can be used as a standalone commandline tool to process document pages in bulk. Alternatively, the package provides a python class that can be integrated in third-party software.

@@ -11,9 +11,9 @@ The method is based on Generative Adversarial Neural Networks (GAN) that are tra
## Installation
The module requires python 3 and CUDA capable GPU.

- Clone the repository (which clones pero-ocr as submodule) and add the pero_enhance and pero_ocr package to your `PYTHONPATH`:
+ Clone the repository (which clones pero-ocr as submodule) and add the pero-enhance and pero-ocr package to your `PYTHONPATH`:
```
- clone --recursive https://github.com/DCGM/pero-enhance.git
+ git clone --recursive https://github.com/DCGM/pero-enhance.git
cd pero-enhance
export PYTHONPATH=/abs/path/to/repo/pero-enhance:/abs/path/to/repo/pero-enhance/pero-ocr:$PYTHONPATH
```

Author suggested change:

```
- export PYTHONPATH=/abs/path/to/repo/pero-enhance:/abs/path/to/repo/pero-enhance/pero-ocr:$PYTHONPATH
+ export PYTHONPATH=$PWD:$PWD/pero-ocr:$PYTHONPATH
```

@@ -33,15 +33,15 @@ Images in a folder can be enhanced by running following:
```
python repair_page.py -i ../example/ -x ../example/ -o /path/to/outputs
```
- The above command runs OCR, stores the OCR output in ./example/, and stores the enhance images in /path/to/outputs. The generated OCR Page XML files can be manualy revised if the OCR quality is not satisfactory, and the command can be repeated to use these changes for better image enhancement.
+ The above command runs OCR, stores the OCR output in ./example/, and stores the enhance images in /path/to/outputs. The generated OCR Page XML files can be manually revised if the OCR quality is not satisfactory, and the command can be repeated to use these changes for better image enhancement.

Alternatively, you can run interactive demo by running the following, where the xml file is optional:
```
python demo.py -i ../example/82f4ac84-6f1e-43ba-b1d5-e2b28d69508d.jpg -x ../example/82f4ac84-6f1e-43ba-b1d5-e2b28d69508d.xml
```
When Page XML file is not provided, automatic text detection and OCR is done using `PageParser` from the pero-ocr package.

- The commands use by default models and settings optimized for czech newspapers downloaded during instalation. The models can be changed Different models for enhancement can be specified by `-r /path/to/enhancement-model/repair_engine.json` and OCR models by `-p /path/to/ocr-model/config.ini`.
+ The commands use by default models and settings optimized for czech newspapers downloaded during installation. The models can be changed Different models for enhancement can be specified by `-r /path/to/enhancement-model/repair_engine.json` and OCR models by `-p /path/to/ocr-model/config.ini`.

### EngineRepairCNN class
In your code, you can directly use the EngineRepairCNN class to enhance individual text line images normalized to height of 32 pixels or of whole page images when the content is defined by pero.layout class. The processed images should have three channels represented as numpy arrays.
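For instance, a minimal sketch of that class-based usage could look like the following; the import paths, the `PageLayout(file=...)` constructor, the `EngineRepairCNN` constructor argument, and the `enhance_page`/`repair_line` method names are assumptions for illustration only and may differ from the actual pero-enhance API:

```
# Minimal sketch; the import paths, constructor arguments and the
# enhance_page / repair_line method names below are assumed, not confirmed.
import cv2

from pero_ocr.document_ocr.layout import PageLayout   # assumed import path
from repair_engine import EngineRepairCNN              # assumed import path

# Page image as an (H, W, 3) numpy array plus its Page XML layout.
page_img = cv2.imread("../example/82f4ac84-6f1e-43ba-b1d5-e2b28d69508d.jpg")
layout = PageLayout(file="../example/82f4ac84-6f1e-43ba-b1d5-e2b28d69508d.xml")

enhancer = EngineRepairCNN("/path/to/enhancement-model/repair_engine.json")

# Enhance the whole page, guided by the layout and its line transcriptions.
enhanced_page = enhancer.enhance_page(page_img, layout)   # assumed method

# Or enhance one line crop: three channels, normalized to a height of 32 px.
# line_img = cv2.resize(line_crop, (line_crop.shape[1], 32))
# enhanced_line = enhancer.repair_line(line_img, "line transcription")

cv2.imwrite("/path/to/outputs/enhanced.jpg", enhanced_page)
```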
training/ocr_engine/line_ocr_engine.py: 2 changes (1 addition, 1 deletion)
@@ -72,7 +72,7 @@ def process_lines(self, lines):
if line.shape[0] == self.line_px_height:
ValueError("Line height needs to be {} for this ocr network and is {} instead.".format(self.line_px_height, line.shape[0]))
if line.shape[2] == 3:
ValueError("Line crops need three color channes, but this one has {}.".format(line.shape[2]))
ValueError("Line crops need three color channels, but this one has {}.".format(line.shape[2]))

all_transcriptions = [None]*len(lines)
all_logits = [None]*len(lines)
training/transformer/compute_bleu.py: 2 changes (1 addition, 1 deletion)
@@ -65,7 +65,7 @@ def bleu_tokenize(string):
except when a punctuation is preceded and followed by a digit
(e.g. a comma/dot as a thousand/decimal separator).

- Note that a numer (e.g. a year) followed by a dot at the end of sentence
+ Note that a number (e.g. a year) followed by a dot at the end of sentence
is NOT tokenized,
i.e. the dot stays with the number because `s/(\p{P})(\P{N})/ $1 $2/g`
does not match this case (unless we add a space after each sentence).
training/transformer/data_download.py: 2 changes (1 addition, 1 deletion)
@@ -83,7 +83,7 @@
_TARGET_THRESHOLD = 327 # Accept vocabulary if size is within this threshold
VOCAB_FILE = "vocab.ende.%d" % _TARGET_VOCAB_SIZE

- # Strings to inclue in the generated files.
+ # Strings to include in the generated files.
_PREFIX = "wmt32k"
_TRAIN_TAG = "train"
_EVAL_TAG = "dev" # Following WMT and Tensor2Tensor conventions, in which the
training/transformer/model/beam_search.py: 2 changes (1 addition, 1 deletion)
@@ -521,7 +521,7 @@ def _gather_beams(nested, beam_indices, batch_size, new_beam_size):
Nested structure containing tensors with shape
[batch_size, new_beam_size, ...]
"""
- # Computes the i'th coodinate that contains the batch index for gather_nd.
+ # Computes the i'th coordinate that contains the batch index for gather_nd.
# Batch pos is a tensor like [[0,0,0,0,],[1,1,1,1],..].
batch_pos = tf.range(batch_size * new_beam_size) // new_beam_size
batch_pos = tf.reshape(batch_pos, [batch_size, new_beam_size])
training/transformer/model/embedding_layer.py: 2 changes (1 addition, 1 deletion)
@@ -32,7 +32,7 @@ def __init__(self, vocab_size, hidden_size, method="gather"):
hidden_size: Dimensionality of the embedding. (Typically 512 or 1024)
method: Strategy for performing embedding lookup. "gather" uses tf.gather
which performs well on CPUs and GPUs, but very poorly on TPUs. "matmul"
- one-hot encodes the indicies and formulates the embedding as a sparse
+ one-hot encodes the indices and formulates the embedding as a sparse
matrix multiplication. The matmul formulation is wasteful as it does
extra work, however matrix multiplication is very fast on TPUs which
makes "matmul" considerably faster than "gather" on TPUs.
training/transformer/model/transformer.py: 2 changes (1 addition, 1 deletion)
@@ -40,7 +40,7 @@ class Transformer(object):
Implemented as described in: https://arxiv.org/pdf/1706.03762.pdf

The Transformer model consists of an encoder and decoder. The input is an int
- sequence (or a batch of sequences). The encoder produces a continous
+ sequence (or a batch of sequences). The encoder produces a continuous
representation, and the decoder uses the encoder output to generate
probabilities for the output sequence.
"""
training/transformer/utils/tokenizer.py: 4 changes (2 additions, 2 deletions)
@@ -498,7 +498,7 @@ def _gen_new_subtoken_list(
subtoken_counts, min_count, alphabet, reserved_tokens=None):
"""Generate candidate subtokens ordered by count, and new max subtoken length.

- Add subtokens to the candiate list in order of length (longest subtokens
+ Add subtokens to the candidate list in order of length (longest subtokens
first). When a subtoken is added, the counts of each of its prefixes are
decreased. Prefixes that don't appear much outside the subtoken are not added
to the candidate list.
@@ -516,7 +516,7 @@ def _gen_new_subtoken_list(

Args:
subtoken_counts: defaultdict mapping str subtokens to int counts
- min_count: int minumum count requirement for subtokens
+ min_count: int minimum count requirement for subtokens
alphabet: set of characters. Each character is added to the subtoken list to
guarantee that all tokens can be encoded.
reserved_tokens: list of tokens that will be added to the beginning of the