Document NER

Master's thesis defense project.

General info

This project combines classic optical character recognition (OCR), which processes an image and extracts its text, with named entity recognition (NER), using a model trained on a set of business cards.

Technologies

 Python
 Jupyter Notebook
 OCR
  OpenCV
  Tesseract OCR
 NER
  Pandas
  SpaCy
  RegEx

Solution architecture

Computer vision scans the document, locates the text, and extracts it from the image; natural language processing then extracts named entities from that text. The document image is first read with OCR to obtain text in editable form. The extracted text is cleaned and passed to a model trained to recognize named entities, which returns the entities found in the document.
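The flow described above (OCR, cleaning, NER) can be sketched roughly as follows. This is a minimal illustration and not the code from predictions.py: the function names `ocr_image`, `clean_text`, `extract_entities` and `run_pipeline` are hypothetical, and the sketch assumes the `opencv-python`, `pytesseract` and `spacy` packages plus a trained spaCy model directory are available.

```python
import re
from typing import Callable, List, Tuple

Entity = Tuple[str, str]  # (entity text, entity label)


def ocr_image(image_path: str) -> str:
    """Read an image from disk and return the raw text found by Tesseract."""
    # Imported lazily so the rest of the sketch runs without OCR installed.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # grayscale often helps OCR
    return pytesseract.image_to_string(gray)


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace and drop empty lines."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def extract_entities(text: str, model_dir: str) -> List[Entity]:
    """Run a trained spaCy pipeline; model_dir is a hypothetical path."""
    import spacy

    nlp = spacy.load(model_dir)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


def run_pipeline(
    image_path: str,
    ner: Callable[[str], List[Entity]],
    ocr: Callable[[str], str] = ocr_image,
    clean: Callable[[str], str] = clean_text,
) -> List[Entity]:
    """OCR the image, clean the text, then extract named entities."""
    return ner(clean(ocr(image_path)))


# Possible usage, assuming a trained model saved under ./model (hypothetical):
# from functools import partial
# entities = run_pipeline("card.jpg", ner=partial(extract_entities, model_dir="./model"))
```

Passing each stage as a callable keeps the stages independent, so the OCR or NER step can be swapped or stubbed out without touching the others.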

Scheme of the process

The process and the operation of the application can be described in the following steps:

  1. Documents are submitted from desktop or mobile devices.
  2. The input consists of paper documents, submissions, and emails containing scans or photos of documents.
  3. A set of these files is collected as the document base.
  4. OCR analyzes the photos and scans of the documents.
  5. Text is extracted from the document base.
  6. The text data is generated, preprocessed, and cleaned.
  7. The text data is labeled with the BIO scheme for training the NER model.
  8. The NER model is trained.
  9. Named entities are extracted from the text of the documents.
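The BIO labeling step can be illustrated with a small sketch. In the BIO scheme the first token of an entity is tagged `B-<label>`, the following tokens of the same entity `I-<label>`, and every other token `O`. The helper `bio_tags`, the sample tokens, and the labels `NAME`, `DES`, `ORG` and `EMAIL` are assumptions made for illustration; the label set actually used in the project is not stated here.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token index, end index exclusive, label)


def bio_tags(tokens: List[str], spans: List[Span]) -> List[str]:
    """Return one BIO tag per token for the given entity spans."""
    tags = ["O"] * len(tokens)  # tokens outside any entity stay "O"
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span ({start}, {end}) is out of range")
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # continuation tokens
    return tags


# Hypothetical business-card tokens and annotations.
tokens = ["John", "Smith", "CEO", "Acme", "Ltd", "john@acme.example"]
spans = [(0, 2, "NAME"), (2, 3, "DES"), (3, 5, "ORG"), (5, 6, "EMAIL")]

for token, tag in zip(tokens, bio_tags(tokens, spans)):
    print(f"{token}\t{tag}")
```

The resulting token and tag pairs are the kind of rows that could be stored in a Pandas table and later converted into training examples for the NER model.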

Setup

The whole pipeline lives in a single file, predictions.py, which contains all the necessary functions. The code is commented.