This example builds a pipeline that extracts text, tables, and figures from a PDF document. It embeds the extracted text, tables, and images and writes them into ChromaDB. It also provides an alternative, OSS-friendly approach that uses Docling for document parsing and Elasticsearch as the vector store.
The pipeline is hosted on a server endpoint in one of the containers. The endpoint can be called from any Python application.
docker compose up
The Graph contains all the code that parses the PDF, generates the embeddings, and writes them to the vector database. We deploy the Graph on the server to create an endpoint for the workflow. Make sure to deploy the right graph before running the example.
pip install indexify
python workflow.py
This stage deploys the workflow on the server. At this point, you can also open the UI to see the deployed Graph.
After this, you can call the endpoint with PDFs to make Indexify start parsing documents.
from indexify import RemoteGraph
graph = RemoteGraph.by_name("Extract_pages_tables_images_pdf")
# url is the location of the PDF to parse; it is left empty here.
invocation_id = graph.run(block_until_done=True, url="")
You can read the output of every function of the Graph. For example,
chunks = graph.output(invocation_id, "chunk_text")
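To process several PDFs in one go, the two calls above can be wrapped in a small loop. The following is a minimal sketch and not part of the example's code: process_pdfs is a hypothetical helper, and it assumes only the run and output methods shown above. Any object that exposes those two methods works, so the function can be tried without a running server.

```python
from typing import Any, Dict, Iterable


def process_pdfs(graph: Any, urls: Iterable[str], output_fn: str = "chunk_text") -> Dict[str, Any]:
    """Invoke the deployed graph once per PDF URL and collect one function's output.

    `graph` is expected to expose `run(block_until_done=..., url=...)` and
    `output(invocation_id, fn_name)`, as in the snippet above.
    """
    results: Dict[str, Any] = {}
    for url in urls:
        # Block until the invocation finishes so the output is ready to read.
        invocation_id = graph.run(block_until_done=True, url=url)
        results[url] = graph.output(invocation_id, output_fn)
    return results
```

With the deployed graph, this would be called as process_pdfs(RemoteGraph.by_name("Extract_pages_tables_images_pdf"), urls), where urls is your own list of PDF locations.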
The ChromaDB tables are populated automatically by the ChromaDBWriter class.
The databases used in the example are named text_embeddings and image_embeddings. The database running inside the container on port 8000 is forwarded to the host for convenience.
For Elasticsearch, the service in this example is set up using docker-compose.yaml. elastic_writer.py relies on Docker networking to connect to it and index the generated vectors.
Once the documents are processed, you can query ChromaDB for vector search. Here is some sample code for that:
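A minimal sketch of such a query, assuming the chromadb Python client is installed on the host and that text_embeddings is the name of a ChromaDB collection. The query vector must come from the same embedding model the workflow uses; that model is not named here, so producing the embedding is left to the caller. The helper names connect_collection and search are hypothetical, and whether the writer stores the original text alongside each vector is also an assumption.

```python
from typing import Any, List, Sequence


def connect_collection(name: str = "text_embeddings", host: str = "localhost", port: int = 8000) -> Any:
    """Open a collection on the ChromaDB server forwarded to the host."""
    import chromadb  # imported lazily so this module loads without chromadb installed

    client = chromadb.HttpClient(host=host, port=port)
    return client.get_collection(name)


def search(collection: Any, query_embedding: Sequence[float], n_results: int = 5) -> List[Any]:
    """Return the stored documents closest to a single query embedding."""
    response = collection.query(
        query_embeddings=[list(query_embedding)],
        n_results=n_results,
    )
    # Chroma returns one result list per query embedding; exactly one was sent.
    documents = response.get("documents") or [[]]
    return documents[0]
```

Typical use would be search(connect_collection(), my_query_vector), where my_query_vector is an embedding you compute yourself. The same pattern applies to image_embeddings by passing that name to connect_collection.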
For Elasticsearch, es_retrieve.py has some sample Python code to query the indexes.
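For orientation, an approximate kNN search against Elasticsearch 8.x generally looks like the sketch below. This is not the content of es_retrieve.py: the vector field name embedding, the index name passed by the caller, and the helper names build_knn and knn_search are placeholders (assumptions), so check elastic_writer.py for the names the example actually uses.

```python
from typing import Any, Dict, List, Sequence


def build_knn(query_vector: Sequence[float], field: str = "embedding", k: int = 5) -> Dict[str, Any]:
    """Build the `knn` clause for an approximate nearest-neighbour search."""
    return {
        "field": field,  # hypothetical dense_vector field name
        "query_vector": list(query_vector),
        "k": k,
        # Consider more candidates per shard than k to improve recall.
        "num_candidates": max(10 * k, 50),
    }


def knn_search(es: Any, index: str, query_vector: Sequence[float], k: int = 5) -> List[Dict[str, Any]]:
    """Run the search with an elasticsearch-py 8.x style client and return the hits."""
    response = es.search(index=index, knn=build_knn(query_vector, k=k))
    return response["hits"]["hits"]
```

Here es would be an Elasticsearch client that you construct yourself against the service started by docker-compose.yaml, and the query vector must be produced by the same embedding model that was used at indexing time.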
Copy the folder, modify the code as you like, and upload the new Graph by running the workflow script again.
python workflow.py
You have to make a couple of changes to use GPUs for PDF parsing.
- Uncomment the lines in the pdf-parser-executor block whose comments say that uncommenting them enables GPUs.
- Use the gpu_image in the PDFParser, extract_chunks and extract_images class/functions so that the workflow routes the PDFParser into the GPU-enabled image.