This project aims to build a Retrieval-Augmented Generation (RAG)-based chatbot system. The chatbot utilizes context-aware chunking for efficient document processing and leverages open-source models for embeddings and language generation.
-
Context-Aware Chunking: Implements an optimal, manual chunking strategy that uses special chunk markers to separate chunks within documents. This allows for easy extraction of chunks using Python's split() method. Make sure the documents users upload have been chunked using a special chunk marker separator for effective processing.
-
Data Processing Tool: - Data Processing Tool: Converts tables in documents into HTML tables to handle challenges such as long tables and merged cells. The tool requires users to bold headers in tables for clarity and automatically identifies table types, including those with 1 header or more than 2 headers.
-
Open-Source Models: Uses open-source models instead of proprietary ones like those from OpenAI, providing a cost-effective and flexible solution.
- Embedding Model: Utilizes
intfloat/multilingual-e5-small
, which is highly efficient and particularly effective for Vietnamese text. - Language Model: Uses
Viet-Mistral/Vistral-7B-Chat
, a language model based on Mistral, with continued pretraining on Vietnamese for better generation performance.
- Embedding Model: Utilizes
- Clone the repository:
git clone https://github.com/quoctata2911/RAG-based-ChatBot-System.git
- Navigate to the project directory:
cd RAG-Based-Chatbot-System
- Install the required dependencies:
pip install -r requirements.txt
Upload your Word .docx documents into the data folder. Ensure that each document has been chunked using a special chunk marker separator as specified in the config.yaml file.
- Configure the chunk marker:
- Open the
config.yaml
file located in the project directory. - Locate the parameter defining the chunk marker and adjust it as needed for your document segmentation requirements.
- Prepare the data:
python prepare_data.py
- Run the chatbot:
python chat.py
- prepare_data.py: Script to preprocess and chunk documents, converting tables into HTML and segmenting them with chunk markers.
- chat.py: Main script to run the chatbot system.
-
Embedding Model: We use the
intfloat/multilingual-e5-small
model for generating embeddings. This model is particularly effective for Vietnamese text, outperforming other models in our benchmarks. -
Language Model: The language model used is Vistral, a variant of the Mistral model that has been further pre-trained on Vietnamese text for improved performance in language generation tasks.
Through extensive benchmarking, the intfloat/multilingual-e5-small
model has proven to be the best choice for Vietnamese embeddings, offering a balance of efficiency and performance. The Vistral model enhances language generation capabilities, ensuring the chatbot responds accurately and naturally in Vietnamese.
We welcome contributions to improve the RAG-ChatBot. Please fork the repository and create a pull request with your changes. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License. See the LICENSE file for more details.
For any questions or suggestions, please contact me at [email protected]