Add dockerfile for tensorrt build #221

Closed
peldszus wants to merge 1 commit

Conversation

peldszus (Contributor) commented May 30, 2024

Notes

  • The image assumes you already have a TensorRT engine in a local folder that you can mount into the container.
  • Alternatively, you can compile a Whisper model to a TensorRT engine using this image; see the how-to below.
  • I'm not super happy with the series of pip install commands, especially the workaround with the double torch install. A good next step would be to provide a requirements.txt for TensorRT servers (a sketch follows this list); it would define which packages to take from which index URL, and we could also exclude installing the dependencies of faster_whisper.
  • It remains to be tested which GPU architectures this can serve. For me it worked on an RTX 3090, and I'm about to test it on an A5000 and an RTX 4000.
  • TensorRT allocates a lot of VRAM for the KV cache, more than I'd like (up to 17 GB). I'm working on a way to make that configurable.
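
For illustration, a minimal sketch of what such a requirements file might look like. The version pins and index URL are assumptions (TensorRT-LLM 0.7.1 is inferred from the path used in step 2 below) and would need to be verified against the Dockerfile:

# requirements-tensorrt.txt -- sketch only; versions and index URL are assumptions
--extra-index-url https://pypi.nvidia.com
tensorrt_llm==0.7.1
# pin torch to the CUDA build that tensorrt_llm expects, avoiding the double install
torch==2.1.2
# faster_whisper would still need a separate `pip install --no-deps faster-whisper`,
# since a requirements file cannot apply --no-deps to a single package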

Size

The resulting docker image is considerably smaller than the existing one (14.5 GB vs. 25.1 GB):

REPOSITORY                        TAG       IMAGE ID       CREATED         SIZE
wl-new                            0.4.1-trt 54891a4d1b98   11 hours ago    14.5GB
ghcr.io/collabora/whisperbot-base latest    ef3dd12abc9a   4 months ago    25.1GB

How to use

Step 1: build the docker image

docker build -t wl-new:0.4.1-trt -f docker/Dockerfile.tensorrt .

Step 2: run the image once to compile the TensorRT engine

mkdir models
docker run -v ./models:/models --rm --runtime=nvidia --gpus all -p 9090:9090 --entrypoint /bin/bash -it wl-new:0.4.1-trt

... within the container ...

# diagnostics
python --version
python -c "import torch; print('Torch version:',torch.__version__); print('Cuda available:',torch.cuda.is_available())"
python -c "import tensorrt as trt; print('TensorRT version:',trt.__version__)"
python -c "import tensorrt_llm; print('TensorRT LLM',tensorrt_llm.__version__)"

# download the whisper model and compile it to a tensor rt engine
cd /TensorRT-LLM-0.7.1/examples/whisper/
wget --directory-prefix=assets https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt
python3 build.py --output_dir /models/whisper_large_v3 --use_gpt_attention_plugin --use_gemm_plugin --use_layernorm_plugin  --use_bert_attention_plugin
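
If the build succeeds, the engine files land in the mounted volume and so survive the container. A quick sanity check (the exact file names depend on the TensorRT-LLM version):

# still inside the container: the output dir is the host-mounted /models
ls -lh /models/whisper_large_v3/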

Step 3: run the service with the model

docker run -v ./models:/models --rm --runtime=nvidia --gpus all -p 9090:9090 wl-new:0.4.1-trt python3 run_server.py --backend tensorrt --trt_model_path /models/whisper_large_v3 --trt_multilingual
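
To verify the server end to end, you can connect with the project's Python client. The snippet below is only a sketch: the constructor arguments are assumptions based on the whisper_live client API, and test.wav is a hypothetical local audio file:

# sketch: send a local file to the TensorRT-backed server on port 9090
# (argument names are assumptions; check the whisper_live client for the exact signature)
from whisper_live.client import TranscriptionClient

client = TranscriptionClient("localhost", 9090, lang="en", translate=False)
client("test.wav")  # hypothetical audio file to transcribe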

peldszus (Contributor, Author) commented Jun 5, 2024

Closed in favour of #227.

peldszus closed this Jun 5, 2024