The Deployment Toolkit is a Large Language Model (LLM) application optimized for Intel CPUs and GPUs. It features Retrieval-Augmented Generation (RAG), which improves the accuracy and contextual relevance of generated responses by retrieving information from external documents.
- CPU: 13th generation Intel® Core™ processors or newer
- GPU: Intel® Arc™ graphics
- RAM: 32GB
- DISK: 128GB
- OpenVINO: 2024.6.0
- NodeJS: v22.13.0 LTS
Please ensure that these ports are available before running the applications.

| Apps | Port |
|---|---|
| UI | 8010 |
| Backend | 8011 |
| LLM Service | 8012 |
| Text to Speech Service | 8013 |
| Speech to Text Service | 8014 |
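The availability check can be scripted. A minimal bash sketch (it relies on bash's `/dev/tcp` pseudo-device, so no extra tools are needed; the `port_free` helper is ours, not part of the toolkit):

```shell
#!/usr/bin/env bash
# Returns success if nothing is listening on the given TCP port on localhost.
# A failed connect through bash's /dev/tcp pseudo-device means the port is free.
port_free() {
  ! (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

for port in 8010 8011 8012 8013 8014; do
  if port_free "$port"; then
    echo "Port $port is free"
  else
    echo "Port $port is IN USE"
  fi
done
```

Any port reported as in use must be freed (or the conflicting service stopped) before starting the stack.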
Ubuntu 22.04 LTS
# Install OpenVINO library
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install openvino==2024.6 optimum-intel[openvino,nncf]==1.21.0 --extra-index-url https://download.pytorch.org/whl/cpu
# Download & Convert LLM Model
optimum-cli export openvino --model Qwen/Qwen2.5-7B-Instruct --weight-format int4 --sym --ratio 1.0 --group-size -1 ./data/models/llm
# Download & Convert Embedding Model
optimum-cli export openvino --model BAAI/bge-large-en-v1.5 --task feature-extraction --weight-format fp16 ./data/models/embeddings/bge-large-en-v1.5
# Download & Convert Reranker Model
optimum-cli export openvino --model BAAI/bge-reranker-large --task text-classification --weight-format fp16 ./data/models/reranker/bge-reranker-large
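After the three exports finish, it is worth confirming that each output directory actually contains an OpenVINO IR: `optimum-cli export openvino` writes an `openvino_model.xml`/`openvino_model.bin` pair into the target directory. A small sketch (the `check_model_dir` helper is illustrative, not part of the toolkit):

```shell
#!/usr/bin/env bash
# Report whether a directory contains the OpenVINO IR pair produced by optimum-cli.
check_model_dir() {
  if [ -f "$1/openvino_model.xml" ] && [ -f "$1/openvino_model.bin" ]; then
    echo "OK: $1"
  else
    echo "MISSING IR: $1"
  fi
}

check_model_dir ./data/models/llm
check_model_dir ./data/models/embeddings/bge-large-en-v1.5
check_model_dir ./data/models/reranker/bge-reranker-large
```

If any directory reports a missing IR, re-run the corresponding `optimum-cli export openvino` command before proceeding.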
This step downloads all the necessary files from the internet; please ensure you have a working network connection.
docker compose build
# Expose the host's `render` group ID so the containers can access the GPU
export RENDER_GROUP_ID=$(getent group render | cut -d: -f3)
docker compose up -d
Windows 11
Double-click `install-backend.bat`
Double-click `install-ui.bat`
5.1 Start Ollama by following the doc
5.2 Start Text to speech by following the doc
5.3 Start Speech to text by following the doc
Double-click `run.bat`
- Changing the inference device for the embedding model. Supported devices: `CPU`, `GPU`
# Example: Loading embedding model on GPU device
export EMBEDDING_DEVICE=GPU
- Changing the inference device for the reranker model. Supported devices: `CPU`, `GPU`
# Example: Loading reranker model on GPU device
export RERANKER_DEVICE=GPU
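Since only `CPU` and `GPU` are supported values, a small guard can fail fast on typos before the services start. This is an illustrative sketch; the `validate_device` helper is an assumption, not part of the toolkit:

```shell
#!/usr/bin/env bash
# Accept only the supported device names; default to CPU when unset.
validate_device() {
  case "$1" in
    CPU|GPU) echo "Using device: $1" ;;
    *) echo "Unsupported device: $1 (expected CPU or GPU)" >&2; return 1 ;;
  esac
}

validate_device "${EMBEDDING_DEVICE:-CPU}"
validate_device "${RERANKER_DEVICE:-CPU}"
```

Running the guard before `docker compose up -d` surfaces a bad value immediately instead of at model-load time.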
- The speech-to-text feature currently works only on localhost.
- RAG uses all uploaded documents for retrieval.