A Retrieval Augmented Generation (RAG) implementation for medical domain queries using BioMistral-7B and other open-source components. This application provides accurate medical information retrieval and response generation while keeping all processing local and private.
- Fully Local Processing: All computations run on-premise without external API calls
- Domain-Specific Models: Uses medical-specialized language and embedding models
- Self-hosted Vector Database: Scalable vector storage using Qdrant
- Interactive Web Interface: Clean UI for easy interaction with the system
- Document Source Tracking: Provides source context for all generated responses
- LLM: BioMistral-7B (medical domain-specific model)
- Embeddings: PubMedBERT-based embeddings (medical domain-specific)
- Vector Database: Qdrant (self-hosted)
- Framework: LangChain + LlamaCpp
- API: FastAPI
- Frontend: HTML/JavaScript with Bootstrap
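At a high level, these components wire together as a standard LangChain retrieval chain. The sketch below is illustrative rather than a copy of `app.py`; the PubMedBERT embedding model name (`NeuML/pubmedbert-base-embeddings`) and the use of `RetrievalQA` are assumptions, while the generation and retrieval parameters match the configuration section further down.

```python
# Illustrative wiring of the stack (assumed names; see app.py for the actual code).
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import Qdrant
from langchain.chains import RetrievalQA
from qdrant_client import QdrantClient

# PubMedBERT-based embeddings (768-dimensional) -- assumed model name
embeddings = HuggingFaceEmbeddings(model_name="NeuML/pubmedbert-base-embeddings")

# Self-hosted Qdrant collection populated by ingest.py
client = QdrantClient(url="http://localhost:6333")
vector_store = Qdrant(client=client, collection_name="vector_db", embeddings=embeddings)

# Quantized BioMistral-7B served locally through llama.cpp
llm = LlamaCpp(
    model_path="biomistral-7b.Q4_K_M.gguf",
    temperature=0.1,
    max_tokens=2048,
    n_ctx=2048,
)

# Retrieval-augmented QA: the top-2 retrieved chunks are stuffed into the prompt
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_store.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
)

result = qa.invoke({"query": "What is the mechanism of action of metformin?"})
```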
- Python 3.8+
- Docker
- 16GB+ RAM
- CPU with AVX2 support (for LlamaCpp)
- Clone the repository:

  ```bash
  git clone https://github.com/rudra-singh1/ragChatbot
  cd ragChatbot
  ```
- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: .\venv\Scripts\activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Download the BioMistral model:
  - Get the quantized GGUF model (e.g. `biomistral-7b.Q4_K_M.gguf`) from TheBloke's HuggingFace repository
  - Place it in the project root directory
- Start Qdrant:

  ```bash
  docker pull qdrant/qdrant
  docker run -p 6333:6333 qdrant/qdrant
  ```
- Place your medical documents in the `data/` directory (PDF and TXT are supported)
- Update the model path in `app.py` if needed:

  ```python
  LOCAL_LLM_PATH = "biomistral-7b.Q4_K_M.gguf"
  ```
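Before ingesting, an optional sanity check (a small sketch using the default paths and ports above) confirms the model file is present and the Qdrant container is reachable:

```python
# Optional sanity check: GGUF model present and Qdrant reachable on the default port.
from pathlib import Path
from qdrant_client import QdrantClient

model_path = Path("biomistral-7b.Q4_K_M.gguf")
assert model_path.is_file(), f"GGUF model not found at {model_path.resolve()}"

client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # should return without raising (empty before ingestion)
```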
- First, ingest your documents to create vectors (a rough sketch of the ingestion flow appears after this list):

  ```bash
  python ingest.py
  ```

- Start the application:

  ```bash
  uvicorn app:app --reload
  ```

- Access the web interface at http://localhost:8000
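For reference, the ingestion step amounts to the following flow. This is a rough sketch rather than the actual contents of `ingest.py`; the loader and the embedding model name are assumptions, while the chunking and collection settings come from the configuration section below.

```python
# Rough sketch of ingestion: load PDFs from data/, chunk, embed, store in Qdrant.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain.text_splitter import RecursiveCharacterTextSplitter

documents = PyPDFDirectoryLoader("data/").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=70).split_documents(documents)

embeddings = HuggingFaceEmbeddings(model_name="NeuML/pubmedbert-base-embeddings")  # assumed model name
Qdrant.from_documents(
    chunks,
    embeddings,
    url="http://localhost:6333",
    collection_name="vector_db",
)
```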
- `GET /`: Main web interface
- `POST /get_response`: Query endpoint
  - Input: `{"query": "your medical question here"}`
  - Returns: `{"answer": "response", "context": "source context", "source": "document name"}`
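For example, the query endpoint can be exercised from Python (assuming `app.py` accepts a JSON body as documented above; if it expects form fields, send `data=` instead):

```python
# Example request against a locally running instance.
import requests

response = requests.post(
    "http://localhost:8000/get_response",
    json={"query": "What are the common side effects of metformin?"},
)
print(response.json())  # {"answer": ..., "context": ..., "source": ...}
```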
- Collection Name: `vector_db`
- Vector Dimension: 768 (PubMedBERT embedding size)
- Distance Metric: Cosine Similarity
- Temperature: 0.1
- Max Tokens: 2048
- Model: BioMistral-7B (4-bit quantized)
- Chunk Size: 700 tokens
- Chunk Overlap: 70 tokens
- Top-k retrieval: 2 documents
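Created explicitly with the Qdrant client, the collection configuration above would look roughly like the sketch below (in practice LangChain's `Qdrant.from_documents` can create the collection for you during ingestion):

```python
# Explicit collection creation matching the configuration above.
# Note: create_collection raises an error if the collection already exists.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="vector_db",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),  # 768-d PubMedBERT vectors, cosine similarity
)
```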
- Average response time: 30-40 seconds on CPU
- RAM usage: ~8GB during operation
- Storage: Depends on document volume (vectors typically 20% of raw text size)
- CPU-only implementation (can be extended to GPU)
- No chat memory/history (stateless queries)
- Response time dependent on CPU capabilities
- Limited to medical domain queries
- Add chat memory for contextual conversations (one possible approach is sketched after this list)
- Implement streaming responses
- Add GPU support
- Improve document preview functionality
- Add more medical document formats support
- Implement authentication
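For the chat-memory item, one possible direction is LangChain's conversational retrieval chain. This is purely a hypothetical sketch, not part of the current app, and it reuses the `llm` and `vector_store` objects from the wiring example earlier in this README:

```python
# Hypothetical sketch of adding chat memory; not implemented in the current app.
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
chat_qa = ConversationalRetrievalChain.from_llm(
    llm=llm,                                                      # LlamaCpp instance from the sketch above
    retriever=vector_store.as_retriever(search_kwargs={"k": 2}),  # same Qdrant-backed retriever
    memory=memory,
)

# Follow-up questions can now rely on earlier turns stored in memory.
print(chat_qa.invoke({"question": "And what about dosage in renal impairment?"})["answer"])
```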