Welcome to the ScaleDP library

An Open-Source Library for Processing Documents using AI/ML in Apache Spark.

Source Code: https://github.com/StabRise/ScaleDP

Tutorials: https://github.com/StabRise/ScaleDP-Tutorials

Welcome to the ScaleDP library

ScaleDP is library allows you to process documents using AI/ML capabilities and scale it using Apache Spark.

LLM (Large Language Models) and VLM (Vision Language Models) models are used to extract data from text and images in combination with OCR engines.

Discover pre-trained models for your projects or play with the thousands of models hosted on the Hugging Face Hub.

Key features

Document processing:

Load PDF documents/Images to the Spark DataFrame
Extract text from PDF documents/Images
Extract images from PDF documents
Extract structured data from text/images using LLM and ML models

OCR:

Support various open-source OCR engines:

CV:

Object detection on images using YOLO models
Text detection on images

LLM:

Support OpenAI compatible API for call LLM/VLM models (GPT, Gemini, GROQ, etc.)

OCR Images/PDF documents using Vision LLM models
Extract data from the image using Vision LLM models
Extract data from the text/images using LLM models
Extract data using DSPy framework
Extract data from the text/images using NLP models from the Hugging Face Hub
Visualize results

Installation

Prerequisites

Python 3.10 or higher
Apache Spark 3.5 or higher
Java 8
Tesseract 4.0 or higher

Installation using pip

Install the ScaleDP package with pip:

pip install scaledp

Installation using Docker

Build image:

  docker build -t scaledp .

Run container:

  docker run -p 8888:8888 scaledp:latest

Open Jupyter Notebook in your browser:

  http://localhost:8888

Qiuckstart

Start a Spark session with ScaleDP:

from scaledp import *
spark = ScaleDPSession()
spark

Read example image file:

image_example = files('resources/images/Invoice.png')
df = spark.read.format("binaryFile") \
    .load(image_example)

df.show_image("content")

Output:

Define pipeline for extract text from the image and run NER:

pipeline = PipelineModel(stages=[
    DataToImage(inputCol="content", outputCol="image"),
    TesseractOcr(inputCol="image", outputCol="text", psm=PSM.AUTO, keepInputData=True),
    Ner(model="obi/deid_bert_i2b2", inputCol="text", outputCol="ner", keepInputData=True),
    ImageDrawBoxes(inputCols=["image", "ner"], outputCol="image_with_boxes", lineWidth=3, 
                   padding=5, displayDataList=['entity_group'])
])

result = pipeline.transform(df).cache()

result.show_text("text")

Output:

Show NER results:

result.show_ner(limit=20)

Output:

+------------+-------------------+----------+-----+---+--------------------+
|entity_group|              score|      word|start|end|               boxes|
+------------+-------------------+----------+-----+---+--------------------+
|        HOSP|  0.991257905960083|  Hospital|    0|  8|[{Hospital:, 0.94...|
|         LOC|  0.999171257019043|    Dutton|   10| 16|[{Dutton,, 0.9609...|
|         LOC| 0.9992585778236389|        MI|   18| 20|[{MI, 0.93335297,...|
|          ID| 0.6838774085044861|        26|   29| 31|[{26-123123, 0.90...|
|       PHONE| 0.4669836759567261|         -|   31| 32|[{26-123123, 0.90...|
|       PHONE| 0.7790696024894714|    123123|   32| 38|[{26-123123, 0.90...|
|        HOSP|0.37445762753486633|      HOPE|   39| 43|[{HOPE, 0.9525460...|
|        HOSP| 0.9503226280212402|     HAVEN|   44| 49|[{HAVEN, 0.952546...|
|         LOC| 0.9975488185882568|855 Howard|   59| 69|[{855, 0.94682700...|
|         LOC| 0.9984399676322937|    Street|   70| 76|[{Street, 0.95823...|
|        HOSP| 0.3670221269130707|  HOSPITAL|   77| 85|[{HOSPITAL, 0.959...|
|         LOC| 0.9990363121032715|    Dutton|   86| 92|[{Dutton,, 0.9647...|
|         LOC|  0.999313473701477|  MI 49316|   94|102|[{MI, 0.94589012,...|
|       PHONE| 0.9830010533332825|   ( 123 )|  110|115|[{(123), 0.595334...|
|       PHONE| 0.9080978035926819|       456|  116|119|[{456-1238, 0.955...|
|       PHONE| 0.9378324151039124|         -|  119|120|[{456-1238, 0.955...|
|       PHONE| 0.8746233582496643|      1238|  120|124|[{456-1238, 0.955...|
|     PATIENT|0.45354968309402466|hopedutton|  132|142|[{hopedutton@hope...|
|       EMAIL|0.17805588245391846| hopehaven|  143|152|[{hopedutton@hope...|
|        HOSP|  0.505658745765686|   INVOICE|  157|164|[{INVOICE, 0.9661...|
+------------+-------------------+----------+-----+---+--------------------+

Visualize NER results:

result.visualize_ner(labels_list=["DATE", "LOC"])

Original image with NER results:

result.show_image("image_with_boxes")

Ocr engines

	Bbox level	Support GPU	Separate model for text detection	Processing time 1 page (CPU/GPU) secs	Support Handwritten Text
Tesseract OCR	character	no	no	0.2/no	not good
Tesseract OCR CLI	character	no	no	0.2/no	not good
Easy OCR	word	yes	yes
Surya OCR	line	yes	yes
DocTR	word	yes	yes

Disclaimer

This project is not affiliated with, endorsed by, or connected to the Apache Software Foundation or Apache Spark.

Name		Name	Last commit message	Last commit date
Latest commit History 156 Commits
.github/workflows		.github/workflows
docs		docs
examples/jupyter		examples/jupyter
images		images
scaledp		scaledp
tests		tests
tutorials @ 35fadbc		tutorials @ 35fadbc
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
.readthedocs.yaml		.readthedocs.yaml
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Welcome to the ScaleDP library

Key features

Document processing:

OCR:

CV:

LLM:

Installation

Prerequisites

Installation using pip

Installation using Docker

Qiuckstart

Ocr engines

Disclaimer

About

Contributors 2

Languages

License

StabRise/ScaleDP

Folders and files

Latest commit

History

Repository files navigation

Welcome to the ScaleDP library

Key features

Document processing:

OCR:

CV:

LLM:

Installation

Prerequisites

Installation using pip

Installation using Docker

Qiuckstart

Ocr engines

Disclaimer

About

Topics

Resources

License

Stars

Watchers

Forks

Contributors 2

Languages