This project is under active development. Its goal is to build a RAG-based system for the upcoming Electron-Ion Collider (EIC).
There are three main parts to the RAG pipeline.
Ingestion in Retrieval-Augmented Generation (RAG) is the process of preparing and organizing the data that the model will use. It can be broken down into three main steps: chunking the information, encoding the chunks with an embedding model, and storing the result in a vector database.
- Chunking
- Encoding chunked information into a vector using an embedding model (e.g. BERT, seq2seq, text2vec)
- Storing the encoded information in a vector database.
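The encoding and storing steps can be illustrated with a small, dependency-free sketch. This is not the project's implementation: the hashed bag-of-words `embed` function only stands in for a real embedding model, and `InMemoryVectorStore` is a hypothetical stand-in for a real vector database. All names and the vector size are assumptions made for illustration.

```python
import hashlib
import math
import re


def embed(text, dim=256):
    """Toy embedding: hashed bag-of-words, L2-normalised.

    A stand-in for a real embedding model, used only to keep the
    sketch self-contained.
    """
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # Use a stable hash (unlike built-in hash()) so runs are reproducible.
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self._items = []  # list of (vector, chunk_text) pairs

    def add(self, chunk):
        self._items.append((embed(chunk), chunk))

    def search(self, query, k=1):
        """Return the k stored chunks most similar to the query."""
        q = embed(query)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(q, vec)), chunk)
            for vec, chunk in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in scored[:k]]


store = InMemoryVectorStore()
for chunk in [
    "The Electron-Ion Collider will collide electrons with ions.",
    "Streamlit is a Python framework for building web apps.",
]:
    store.add(chunk)

top = store.search("electrons and ions in the collider", k=1)
```

In a real pipeline the same two operations (add a chunk, search by similarity) would be delegated to an embedding model and a vector database.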
Chunking is the first step in the ingestion process. The raw data, which can come in various forms such as a large corpus of text, is divided into manageable chunks or segments. The size of these chunks can vary depending on the requirements of the task at hand. Chunking reduces the complexity of the data and makes it easier for the model to process the information.
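As a rough illustration of chunking, the sketch below splits text into fixed-size character chunks with some overlap between neighbours. It is only one of several possible strategies (real splitters often work on sentences or tokens) and is not necessarily what this project uses. The function name `chunk_text` and the default sizes are assumptions.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that overlap.

    The overlap keeps content that straddles a boundary visible in
    both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


# Tiny sizes so the overlap is easy to see.
chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
# chunks == ["abcd", "cdef", "efgh", "ghij"]
```

A larger overlap lowers the chance of cutting a relevant passage in half, at the cost of storing more chunks.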
A very recent survey paper summarizes the types of RAG systems [1]. Broadly, there are three types of RAG architecture, distinguished by where the LLM is used in the pipeline.
- Building a Naive RAG for EIC using the 200 papers from arXiv on EIC.
- The backend is a relatively straightforward RAG architecture, where ingestion of data is done using PyPDF.
- The frontend is a simple web interface that allows the user to upload a PDF and get back a list of papers that are relevant to the input.
- Report evaluated RAGAS metrics for the built architecture.
- Publish this in the proceedings for AI4EIC-2023.
- Going beyond Naive RAG, towards building a RAG architecture with testable evaluation metrics.
- This requires going beyond
- Multimodal output as a proof of concept.
- Storing metadata about tables, etc.
- Using Agents in LangChain to build a LaTeX report.
To run the Streamlit app, do the following:
git clone https://github.com/wmdataphys/EIC-RAG-Project.git
cd EIC-RAG-Project
- It is better to use a separate Python environment to avoid version conflicts with other projects. I use a conda env:
conda create --name env_RAG-EIC python=3.10
This creates an environment with Python 3.10, which was the stable version when I started building the app. Once it is created, activate the env:
conda activate env_RAG-EIC
- Now install pip before installing all other packages:
conda install pip
- Now install all the requirements
pip install -r requirements.txt
- Ask karthik18495@gmail.com about the secrets.toml and config.toml files.
- Create a folder named .streamlit in the parent directory and move the files secrets.toml and config.toml into it.
- Now run
streamlit run streamlit_app/AI4EIC-RAGAS4EIC.py
The app should then be running at http://localhost:8050
- If the app uses any new library that has to be installed through pip, make sure to use --format freeze when updating requirements.txt. The command is
pip list --format freeze > requirements.txt
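If you prefer to update requirements.txt from Python rather than the shell, the following is a minimal sketch under stated assumptions. It runs the same `pip list --format freeze` command for the current interpreter and writes its output to a file. The helper names `parse_freeze` and `write_requirements` are hypothetical, and the package names in the example are made up.

```python
import subprocess
import sys


def parse_freeze(text):
    """Parse `name==version` lines into a {name: version} dict."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip blanks, comments, and non-pinned entries
        name, version = line.split("==", 1)
        packages[name.strip().lower()] = version.strip()
    return packages


def write_requirements(path="requirements.txt"):
    """Run `pip list --format freeze` and save the output to `path`."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "list", "--format", "freeze"],
        check=True,
        capture_output=True,
        text=True,
    )
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)
    return parse_freeze(result.stdout)


# Made-up package names and versions, only to show the parsed shape.
example = parse_freeze("package-a==1.0.0\n# a comment\npackage-b==2.3.1\n")
```

Calling `write_requirements()` inside the activated env_RAG-EIC environment should produce the same file as the shell command above, since it invokes pip through the interpreter that is currently running.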