
LLM Deployment Toolkit

The LLM Deployment Toolkit is a Large Language Model (LLM) application built to leverage Intel CPUs and GPUs. It features Retrieval Augmented Generation (RAG), which improves the accuracy and contextual relevance of generated responses by retrieving external information and supplying it to the model at generation time.
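At a high level, RAG retrieves documents relevant to a query and injects them into the prompt before generation. A minimal, self-contained sketch of that flow (toy keyword-overlap retrieval stands in for the toolkit's real embedding/reranker pipeline; all function names here are illustrative, not this toolkit's API):

```python
# Toy RAG flow: retrieve relevant documents, then build an augmented prompt.
# Illustration only -- the toolkit uses OpenVINO-backed embedding and
# reranker models rather than keyword overlap.

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Score each document by word overlap with the query; return the best."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Prepend retrieved context so the LLM can ground its answer."""
    context_block = "\n".join(f"- {c}" for c in context)
    return f"Answer using this context:\n{context_block}\n\nQuestion: {query}"

docs = [
    "OpenVINO accelerates inference on Intel CPUs and GPUs.",
    "The reranker model reorders retrieved passages by relevance.",
    "NodeJS is used for the UI service.",
]
question = "How does OpenVINO use Intel GPUs?"
prompt = build_prompt(question, retrieve(question, docs))
print(prompt)
```

The augmented prompt is then what gets sent to the LLM service instead of the bare question.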


Requirements

Validated hardware

  • CPU: 13th generation Intel® Core™ processors and newer
  • GPU: Intel® Arc™ graphics
  • RAM: 32GB
  • DISK: 128GB

Validated software version

  • OpenVINO: 2024.6.0
  • NodeJS: v22.13.0 LTS

Application ports

Please ensure these ports are available before running the applications.

| App                    | Port |
|------------------------|------|
| UI                     | 8010 |
| Backend                | 8011 |
| LLM Service            | 8012 |
| Text to Speech Service | 8013 |
| Speech to Text Service | 8014 |
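To confirm the ports are free before starting, a loop like the following can be used (assumes `ss` from iproute2 is available; not part of the toolkit itself):

```shell
# Report whether each application port is already bound on this host
for port in 8010 8011 8012 8013 8014; do
  if ss -tln 2>/dev/null | grep -q ":${port}\b"; then
    echo "port ${port}: in use"
  else
    echo "port ${port}: free"
  fi
done
```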

Quick Start

Ubuntu 22.04 LTS

1. Install prerequisites

2. Install GPU driver

3. Download the LLM model (skip if you already have an LLM model in the data folder)

```shell
# Install OpenVINO library
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install openvino==2024.6 optimum-intel[openvino,nncf]==1.21.0 --extra-index-url https://download.pytorch.org/whl/cpu

# Download & Convert LLM Model
optimum-cli export openvino --model Qwen/Qwen2.5-7B-Instruct --weight-format int4 --sym --ratio 1.0 --group-size -1 ./data/models/llm
```
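After the export completes, the target folder should contain the OpenVINO IR model files. A small sanity check (the file names below are the usual optimum-intel output, stated here as an assumption; adjust the path if you exported elsewhere):

```shell
# Verify the export produced the OpenVINO IR model files
MODEL_DIR=./data/models/llm
if [ -f "$MODEL_DIR/openvino_model.xml" ] && [ -f "$MODEL_DIR/openvino_model.bin" ]; then
  echo "LLM export looks complete"
else
  echo "LLM export incomplete or not run yet"
fi
```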

4. Download embedding model and reranker model

```shell
# Download & Convert Embedding Model
optimum-cli export openvino --model BAAI/bge-large-en-v1.5 --task feature-extraction --weight-format fp16 ./data/models/embeddings/bge-large-en-v1.5

# Download & Convert Reranker Model
optimum-cli export openvino --model BAAI/bge-reranker-large --task text-classification --weight-format fp16 ./data/models/reranker/bge-reranker-large
```
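The embedding model maps text to vectors, and retrieval ranks documents by cosine similarity between those vectors; the reranker then re-scores the top hits. A toy illustration of the similarity step (hand-made 3-dimensional vectors stand in for the real 1024-dimensional bge embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": the query vector points in nearly the same
# direction as doc_a, so doc_a should rank first.
query = [1.0, 0.0, 1.0]
doc_a = [0.9, 0.1, 0.8]
doc_b = [0.0, 1.0, 0.1]
print(cosine_similarity(query, doc_a))  # high similarity
print(cosine_similarity(query, doc_b))  # low similarity
```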

5. Build docker images

This step downloads all necessary files from the internet, so ensure you have a working network connection.

```shell
docker compose build
```

6. Start docker container

```shell
export RENDER_GROUP_ID=$(getent group render | cut -d: -f3)
docker compose up -d
```
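The `RENDER_GROUP_ID` export passes the host's `render` group ID into the container so it can access the GPU device nodes. A quick pre-flight check that the group exists:

```shell
# Look up the render group; GPU access from the container needs its GID
gid=$(getent group render | cut -d: -f3)
if [ -n "$gid" ]; then
  echo "render group id: $gid"
else
  echo "render group not found - is the GPU driver installed?"
fi
```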
Windows 11

1. Install prerequisites

2. Install GPU driver

3. Follow the documentation to install the following services from the microservices folder.

  • Ollama: doc
  • Text to speech: doc
  • Speech to text: doc

4. Install RAG Toolkit

4.1 Install backend

Double-click install-backend.bat

4.2 Install UI

Double-click install-ui.bat

5. Run application

5.1 Start Ollama by following the doc

5.2 Start Text to speech by following the doc

5.3 Start Speech to text by following the doc

5.4 Start RAG Toolkit

Double-click run.bat

FAQ

  1. Changing the inference device for the embedding model. Supported devices: ["CPU", "GPU"]

     ```shell
     # Example: Loading embedding model on GPU device
     export EMBEDDING_DEVICE=GPU
     ```

  2. Changing the inference device for the reranker model. Supported devices: ["CPU", "GPU"]

     ```shell
     # Example: Loading reranker model on GPU device
     export RERANKER_DEVICE=GPU
     ```
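Inside a Python service, a setting like this is typically read with an environment lookup plus a CPU default. A hedged sketch (the variable names match the FAQ, but `resolve_device` and its validation logic are illustrative, not the toolkit's actual code):

```python
import os

SUPPORTED_DEVICES = {"CPU", "GPU"}

def resolve_device(env_var: str, default: str = "CPU") -> str:
    """Read an inference-device override from the environment, falling back to CPU."""
    device = os.environ.get(env_var, default).upper()
    if device not in SUPPORTED_DEVICES:
        raise ValueError(
            f"{env_var} must be one of {sorted(SUPPORTED_DEVICES)}, got {device!r}"
        )
    return device

os.environ["EMBEDDING_DEVICE"] = "GPU"   # simulate `export EMBEDDING_DEVICE=GPU`
os.environ.pop("RERANKER_DEVICE", None)  # leave the reranker variable unset
print(resolve_device("EMBEDDING_DEVICE"))  # GPU
print(resolve_device("RERANKER_DEVICE"))   # CPU (unset, falls back to default)
```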

Limitations

  1. The current speech-to-text feature only works with localhost.
  2. RAG uses all uploaded documents for retrieval; selecting a subset of documents per query is not supported.