Retrieval Augmented Generation for EIC

This project, currently under development, aims to build a RAG-based system for the upcoming Electron-Ion Collider (EIC).

There are three main parts to the RAG pipeline.

Ingestion

Ingestion in Retrieval-Augmented Generation (RAG) prepares and organizes the data that the model will use. It can be broken down into three main steps: chunking the information, encoding the chunks with an embedding model, and storing the encodings in a vector database.

Chunking
Encoding the chunked information into vectors using an embedding model (e.g. BERT, seq2seq, text2vec)
Storing the encoded information in a vector database.
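The encoding and storing steps can be sketched in a few lines. This is only an illustration under stated assumptions: `embed` is a toy hashed bag-of-words stand-in for a real embedding model, and `InMemoryVectorStore` is a hypothetical in-memory replacement for a vector database. Neither name comes from this project.

```python
import hashlib
import math

DIM = 64  # toy embedding dimension (assumption, not a project setting)


def embed(text: str) -> list[float]:
    """Toy embedding: hashed bag-of-words, L2-normalised.

    A real pipeline would call an embedding model here instead.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], str]] = []

    def add(self, chunk: str) -> None:
        # Store the vector next to the chunk it was computed from.
        self._items.append((embed(chunk), chunk))

    def search(self, query: str, k: int = 1) -> list[tuple[float, str]]:
        # Vectors are unit length, so the dot product is the cosine similarity.
        q = embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), chunk)
            for vec, chunk in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


if __name__ == "__main__":
    store = InMemoryVectorStore()
    store.add("detector design for the electron ion collider")
    store.add("simple web interface built with streamlit")
    for score, chunk in store.search("collider detector", k=2):
        print(f"{score:.3f}  {chunk}")
```

Swapping `embed` for a real model and the store for a real vector database keeps the same add-then-search shape.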

Chunking

This is the first step in the ingestion process. The raw data, which can come in various forms such as a large corpus of text, is divided into manageable chunks or segments. The chunk size can vary depending on the requirements of the task at hand. Chunking reduces the complexity of the data and makes it easier for the model to process the information.
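As a minimal sketch of the idea, the function below splits text into fixed-size character windows with some overlap between neighbours. The function name, the character-based splitting, and the default sizes are assumptions chosen for illustration; the project may chunk differently (for example by tokens or by document section).

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at
    a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = "The Electron-Ion Collider will study the structure of nucleons. " * 5
    for i, chunk in enumerate(chunk_text(sample, chunk_size=80, overlap=20)):
        print(i, repr(chunk))
```

Larger chunks keep more context per vector, while smaller chunks make retrieval more precise; the overlap is a common way to soften the boundary between the two.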

Retrieval

Content Fusion and Generation

Types of RAG system

A very recent survey paper summarizes the types of RAG system¹. Broadly, there are three types of RAG architecture, based on where the LLM is used in the pipeline.

Project Milestones

Building a Naive RAG for EIC using the 200 papers from arXiv on EIC. ✅
- The backend is a relatively straightforward RAG architecture, where ingestion of data is done using PyPDF.
- The frontend is a simple web interface that allows the user to upload a PDF and get back a list of papers that are relevant to the input.
- Report evaluated RAGAS metrics for the built architecture.
- Publish this in the proceedings for AI4EIC-2023. 🧑‍🏭
Going beyond Naive RAG: towards building a RAG architecture with testable evaluation metrics. 🧑‍🏭
- This requires going beyond
Multimodal output as a proof of concept.
- Storing metadata about tables etc.
- Using Agents in LangChain to build a LaTeX report.

References

How tos

Running the webapp

To run the Streamlit app, do the following:

`git clone https://github.com/wmdataphys/EIC-RAG-Project.git && cd EIC-RAG-Project`
It is better to use a separate Python environment in case of version conflicts with other projects. I use conda: `conda create --name env_RAG-EIC python=3.10`. This creates an environment with Python 3.10, which was the stable version when I started building the app. Once it is created, activate the environment with `conda activate env_RAG-EIC`.
Install pip before installing any other packages: `conda install pip`
Install all the requirements: `pip install -r requirements.txt`
Ask karthik18495@gmail.com about the `secrets.toml` and `config.toml` files.
Create a folder named `.streamlit` in the parent directory and move the files `secrets.toml` and `config.toml` into it.
Run `streamlit run streamlit_app/AI4EIC-RAGAS4EIC.py`. The app should be served at http://localhost:8050
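Before launching the app, it can help to confirm that the two configuration files are where Streamlit expects them. The helper below is a hypothetical check, not part of the repository, and it assumes that "parent directory" means the root of the cloned repository.

```python
from pathlib import Path

# File names taken from the steps above; the helper itself is hypothetical.
REQUIRED_FILES = ("secrets.toml", "config.toml")


def missing_streamlit_files(repo_root: str = ".") -> list[str]:
    """Return the required files that are absent from <repo_root>/.streamlit."""
    config_dir = Path(repo_root) / ".streamlit"
    return [name for name in REQUIRED_FILES if not (config_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_streamlit_files(".")
    if missing:
        print("Missing from .streamlit:", ", ".join(missing))
    else:
        print("Configuration files found; the app can be started.")
```

Run it from the repository root; an empty result means both files are in place.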

Updating `requirements.txt`

If the app uses any new library that must be installed through pip, make sure to use `--format freeze` when updating `requirements.txt`.
The command is `pip list --format freeze > requirements.txt`

¹ Types of RAG
