
LLM Deployment Toolkit

The LLM Deployment Toolkit is a Large Language Model (LLM) application built to leverage Intel CPUs and GPUs. It features Retrieval Augmented Generation (RAG), which improves the accuracy and contextual relevance of generated responses by retrieving external information and supplying it to the model at generation time.
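At a high level, RAG retrieves documents relevant to a query and injects them into the prompt before generation. A minimal, self-contained sketch of that flow (toy keyword-overlap retrieval stands in for the toolkit's real embedding/reranker pipeline; all function names here are illustrative, not this toolkit's API):

```python
# Toy RAG flow: retrieve relevant documents, then build an augmented prompt.
# Illustration only -- the toolkit uses OpenVINO-backed embedding and
# reranker models rather than keyword overlap.

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Score each document by word overlap with the query; return the best."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Prepend retrieved context so the LLM can ground its answer."""
    context_block = "\n".join(f"- {c}" for c in context)
    return f"Answer using this context:\n{context_block}\n\nQuestion: {query}"

docs = [
    "OpenVINO accelerates inference on Intel CPUs and GPUs.",
    "The reranker model reorders retrieved passages by relevance.",
    "NodeJS is used for the UI service.",
]
question = "How does OpenVINO use Intel GPUs?"
prompt = build_prompt(question, retrieve(question, docs))
print(prompt)
```

The augmented prompt is then what gets sent to the LLM service instead of the bare question.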


Requirements

Validated hardware

  • CPU: 13th generation Intel® Core™ processors and newer
  • GPU: Intel® Arc™ graphics
  • RAM: 32GB
  • DISK: 128GB

Validated software version

  • OpenVINO: 2024.6.0
  • NodeJS: v22.13.0 LTS

Application ports

Please ensure these ports are available before running the applications.

| App                    | Port |
|------------------------|------|
| UI                     | 8010 |
| Backend                | 8011 |
| LLM Service            | 8012 |
| Text to Speech Service | 8013 |
| Speech to Text Service | 8014 |
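To confirm the ports are free before starting, a loop like the following can be used (assumes `ss` from iproute2 is available; not part of the toolkit itself):

```shell
# Report whether each application port is already bound on this host
for port in 8010 8011 8012 8013 8014; do
  if ss -tln 2>/dev/null | grep -q ":${port}\b"; then
    echo "port ${port}: in use"
  else
    echo "port ${port}: free"
  fi
done
```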

Quick Start

Ubuntu 22.04 LTS

1. Install prerequisites

2. Install GPU driver

3. Download the LLM model (skip if you already have an LLM model in the data folder)

```shell
# Install OpenVINO library
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install openvino==2024.6 optimum-intel[openvino,nncf]==1.21.0 --extra-index-url https://download.pytorch.org/whl/cpu

# Download & Convert LLM Model
optimum-cli export openvino --model Qwen/Qwen2.5-7B-Instruct --weight-format int4 --sym --ratio 1.0 --group-size -1 ./data/models/llm
```
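After the export completes, the target folder should contain the OpenVINO IR model files. A small sanity check (the file names below are the usual optimum-intel output, stated here as an assumption; adjust the path if you exported elsewhere):

```shell
# Verify the export produced the OpenVINO IR model files
MODEL_DIR=./data/models/llm
if [ -f "$MODEL_DIR/openvino_model.xml" ] && [ -f "$MODEL_DIR/openvino_model.bin" ]; then
  echo "LLM export looks complete"
else
  echo "LLM export incomplete or not run yet"
fi
```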

4. Download embedding model and reranker model

```shell
# Download & Convert Embedding Model
optimum-cli export openvino --model BAAI/bge-large-en-v1.5 --task feature-extraction --weight-format fp16 ./data/models/embeddings/bge-large-en-v1.5

# Download & Convert Reranker Model
optimum-cli export openvino --model BAAI/bge-reranker-large --task text-classification --weight-format fp16 ./data/models/reranker/bge-reranker-large
```
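The embedding model maps text to vectors, and retrieval ranks documents by cosine similarity between those vectors; the reranker then re-scores the top hits. A toy illustration of the similarity step (hand-made 3-dimensional vectors stand in for the real 1024-dimensional bge embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": the query vector points in nearly the same
# direction as doc_a, so doc_a should rank first.
query = [1.0, 0.0, 1.0]
doc_a = [0.9, 0.1, 0.8]
doc_b = [0.0, 1.0, 0.1]
print(cosine_similarity(query, doc_a))  # high similarity
print(cosine_similarity(query, doc_b))  # low similarity
```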

5. Build docker images

This step downloads all necessary files from the internet, so ensure you have a working network connection.

```shell
docker compose build
```

6. Start docker container

```shell
export RENDER_GROUP_ID=$(getent group render | cut -d: -f3)
docker compose up -d
```
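The `RENDER_GROUP_ID` export passes the host's `render` group ID into the container so it can access the GPU device nodes. A quick pre-flight check that the group exists:

```shell
# Look up the render group; GPU access from the container needs its GID
gid=$(getent group render | cut -d: -f3)
if [ -n "$gid" ]; then
  echo "render group id: $gid"
else
  echo "render group not found - is the GPU driver installed?"
fi
```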
Windows 11

1. Install prerequisites

2. Install GPU driver

3. Follow the documentation to install the following services from the microservices folder.

  • Ollama: doc
  • Text to speech: doc
  • Speech to text: doc

4. Install RAG Toolkit

4.1 Install backend

Double-click install-backend.bat

4.2 Install UI

Double-click install-ui.bat

5. Run application

5.1 Start Ollama by following the doc

5.2 Start Text to speech by following the doc

5.3 Start Speech to text by following the doc

5.4 Start RAG Toolkit

Double-click run.bat

FAQ

  1. Changing the inference device for the embedding model. Supported devices: ["CPU", "GPU"]

     ```shell
     # Example: Loading embedding model on GPU device
     export EMBEDDING_DEVICE=GPU
     ```

  2. Changing the inference device for the reranker model. Supported devices: ["CPU", "GPU"]

     ```shell
     # Example: Loading reranker model on GPU device
     export RERANKER_DEVICE=GPU
     ```
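Inside a Python service, a setting like this is typically read with an environment lookup plus a CPU default. A hedged sketch (the variable names match the FAQ, but `resolve_device` and its validation logic are illustrative, not the toolkit's actual code):

```python
import os

SUPPORTED_DEVICES = {"CPU", "GPU"}

def resolve_device(env_var: str, default: str = "CPU") -> str:
    """Read an inference-device override from the environment, falling back to CPU."""
    device = os.environ.get(env_var, default).upper()
    if device not in SUPPORTED_DEVICES:
        raise ValueError(
            f"{env_var} must be one of {sorted(SUPPORTED_DEVICES)}, got {device!r}"
        )
    return device

os.environ["EMBEDDING_DEVICE"] = "GPU"   # simulate `export EMBEDDING_DEVICE=GPU`
os.environ.pop("RERANKER_DEVICE", None)  # leave the reranker variable unset
print(resolve_device("EMBEDDING_DEVICE"))  # GPU
print(resolve_device("RERANKER_DEVICE"))   # CPU (unset, falls back to default)
```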

Limitations

  1. The current speech-to-text feature only works with localhost.
  2. RAG uses all uploaded documents for retrieval; selecting a subset of documents per query is not supported.