Skip to content

ScaleDP is an Open-Source extension of Apache Spark for Document Processing

License

Notifications You must be signed in to change notification settings

StabRise/ScaleDP

Repository files navigation


ScaleDP

An Open-Source Library for Processing Documents using AI/ML in Apache Spark.

GitHub StabRise Codacy Badge


Source Code: https://github.com/StabRise/ScaleDP

Quickstart: 1.QuickStart.ipynb

Tutorials: https://github.com/StabRise/ScaleDP-Tutorials


Welcome to the ScaleDP library

ScaleDP is library allows you to process documents using AI/ML capabilities and scale it using Apache Spark.

LLM (Large Language Models) and VLM (Vision Language Models) models are used to extract data from text and images in combination with OCR engines.

Discover pre-trained models for your projects or play with the thousands of models hosted on the Hugging Face Hub.

Key features

Document processing:

  • Load PDF documents/Images to the Spark DataFrame
  • Extract text from PDF documents/Images
  • Extract images from PDF documents
  • Extract structured data from text/images using LLM and ML models

OCR:

Support various open-source OCR engines:

CV:

  • Object detection on images using YOLO models
  • Text detection on images

LLM:

Support OpenAI compatible API for call LLM/VLM models (GPT, Gemini, GROQ, etc.)

  • OCR Images/PDF documents using Vision LLM models
  • Extract data from the image using Vision LLM models
  • Extract data from the text/images using LLM models
  • Extract data using DSPy framework
  • Extract data from the text/images using NLP models from the Hugging Face Hub
  • Visualize results

Installation

Prerequisites

  • Python 3.10 or higher
  • Apache Spark 3.5 or higher
  • Java 8
  • Tesseract 4.0 or higher

Installation using pip

Install the ScaleDP package with pip:

pip install scaledp

Installation using Docker

Build image:

  docker build -t scaledp .

Run container:

  docker run -p 8888:8888 scaledp:latest

Open Jupyter Notebook in your browser:

  http://localhost:8888

Qiuckstart

Start a Spark session with ScaleDP:

from scaledp import *
spark = ScaleDPSession()
spark

Read example image file:

image_example = files('resources/images/Invoice.png')
df = spark.read.format("binaryFile") \
    .load(image_example)

df.show_image("content")

Output:

Define pipeline for extract text from the image and run NER:

pipeline = PipelineModel(stages=[
    DataToImage(inputCol="content", outputCol="image"),
    TesseractOcr(inputCol="image", outputCol="text", psm=PSM.AUTO, keepInputData=True),
    Ner(model="obi/deid_bert_i2b2", inputCol="text", outputCol="ner", keepInputData=True),
    ImageDrawBoxes(inputCols=["image", "ner"], outputCol="image_with_boxes", lineWidth=3, 
                   padding=5, displayDataList=['entity_group'])
])

result = pipeline.transform(df).cache()

result.show_text("text")

Output:

Show NER results:

result.show_ner(limit=20)

Output:

+------------+-------------------+----------+-----+---+--------------------+
|entity_group|              score|      word|start|end|               boxes|
+------------+-------------------+----------+-----+---+--------------------+
|        HOSP|  0.991257905960083|  Hospital|    0|  8|[{Hospital:, 0.94...|
|         LOC|  0.999171257019043|    Dutton|   10| 16|[{Dutton,, 0.9609...|
|         LOC| 0.9992585778236389|        MI|   18| 20|[{MI, 0.93335297,...|
|          ID| 0.6838774085044861|        26|   29| 31|[{26-123123, 0.90...|
|       PHONE| 0.4669836759567261|         -|   31| 32|[{26-123123, 0.90...|
|       PHONE| 0.7790696024894714|    123123|   32| 38|[{26-123123, 0.90...|
|        HOSP|0.37445762753486633|      HOPE|   39| 43|[{HOPE, 0.9525460...|
|        HOSP| 0.9503226280212402|     HAVEN|   44| 49|[{HAVEN, 0.952546...|
|         LOC| 0.9975488185882568|855 Howard|   59| 69|[{855, 0.94682700...|
|         LOC| 0.9984399676322937|    Street|   70| 76|[{Street, 0.95823...|
|        HOSP| 0.3670221269130707|  HOSPITAL|   77| 85|[{HOSPITAL, 0.959...|
|         LOC| 0.9990363121032715|    Dutton|   86| 92|[{Dutton,, 0.9647...|
|         LOC|  0.999313473701477|  MI 49316|   94|102|[{MI, 0.94589012,...|
|       PHONE| 0.9830010533332825|   ( 123 )|  110|115|[{(123), 0.595334...|
|       PHONE| 0.9080978035926819|       456|  116|119|[{456-1238, 0.955...|
|       PHONE| 0.9378324151039124|         -|  119|120|[{456-1238, 0.955...|
|       PHONE| 0.8746233582496643|      1238|  120|124|[{456-1238, 0.955...|
|     PATIENT|0.45354968309402466|hopedutton|  132|142|[{hopedutton@hope...|
|       EMAIL|0.17805588245391846| hopehaven|  143|152|[{hopedutton@hope...|
|        HOSP|  0.505658745765686|   INVOICE|  157|164|[{INVOICE, 0.9661...|
+------------+-------------------+----------+-----+---+--------------------+

Visualize NER results:

result.visualize_ner(labels_list=["DATE", "LOC"])

Original image with NER results:

result.show_image("image_with_boxes")

Ocr engines

Bbox level Support GPU Separate model for text detection Processing time 1 page (CPU/GPU) secs Support Handwritten Text
Tesseract OCR character no no 0.2/no not good
Tesseract OCR CLI character no no 0.2/no not good
Easy OCR word yes yes
Surya OCR line yes yes
DocTR word yes yes

Disclaimer

This project is not affiliated with, endorsed by, or connected to the Apache Software Foundation or Apache Spark.