Add dockerfile for tensorrt build #221

Closed
peldszus wants to merge 1 commit

Conversation

peldszus (Contributor) commented May 30, 2024

Notes

  • The image assumes you already have a TensorRT engine in a local folder that you can mount into the container.
  • Alternatively, you can compile a Whisper model to a TensorRT engine using this image; see the how-to below.
  • I'm not super happy with the series of pip install commands, especially the workaround with the double torch install. A good next step would be to provide a requirements.txt for TensorRT servers (a sketch follows this list); it would define which packages to take from which index URL, and we could also exclude installing the dependencies of faster_whisper.
  • It remains to be tested which GPU architectures this can serve. For me it worked on an RTX 3090, and I'm about to test it on an A5000 and an RTX 4000.
  • TensorRT allocates a lot of VRAM for the KV cache, more than I'd like (up to 17 GB). I'm working on a way to make that configurable.
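
For illustration, a minimal sketch of what such a requirements file might look like. The version pins and index URL are assumptions (TensorRT-LLM 0.7.1 is inferred from the path used in step 2 below) and would need to be verified against the Dockerfile:

# requirements-tensorrt.txt -- sketch only; versions and index URL are assumptions
--extra-index-url https://pypi.nvidia.com
tensorrt_llm==0.7.1
# pin torch to the CUDA build that tensorrt_llm expects, avoiding the double install
torch==2.1.2
# faster_whisper would still need a separate `pip install --no-deps faster-whisper`,
# since a requirements file cannot apply --no-deps to a single package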

Size

The resulting docker image is considerably smaller than the existing one (14.5 GB vs. 25.1 GB):

REPOSITORY                        TAG       IMAGE ID       CREATED         SIZE
wl-new                            0.4.1-trt 54891a4d1b98   11 hours ago    14.5GB
ghcr.io/collabora/whisperbot-base latest    ef3dd12abc9a   4 months ago    25.1GB

How to use

Step 1: build the docker image

docker build -t wl-new:0.4.1-trt -f docker/Dockerfile.tensorrt .

Step 2: run the image once to compile the TensorRT engine

mkdir models
docker run -v ./models:/models --rm --runtime=nvidia --gpus all -p 9090:9090 --entrypoint /bin/bash -it wl-new:0.4.1-trt

... within the container ...

# diagnostics
python --version
python -c "import torch; print('Torch version:',torch.__version__); print('Cuda available:',torch.cuda.is_available())"
python -c "import tensorrt as trt; print('TensorRT version:',trt.__version__)"
python -c "import tensorrt_llm; print('TensorRT LLM',tensorrt_llm.__version__)"

# download the whisper model and compile it to a tensor rt engine
cd /TensorRT-LLM-0.7.1/examples/whisper/
wget --directory-prefix=assets https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt
python3 build.py --output_dir /models/whisper_large_v3 --use_gpt_attention_plugin --use_gemm_plugin --use_layernorm_plugin  --use_bert_attention_plugin
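
If the build succeeds, the engine files land in the mounted volume and so survive the container. A quick sanity check (the exact file names depend on the TensorRT-LLM version):

# still inside the container: the output dir is the host-mounted /models
ls -lh /models/whisper_large_v3/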

Step 3: run the service with the model

docker run -v ./models:/models --rm --runtime=nvidia --gpus all -p 9090:9090 wl-new:0.4.1-trt python3 run_server.py --backend tensorrt --trt_model_path /models/whisper_large_v3 --trt_multilingual
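
To verify the server end to end, you can connect with the project's Python client. The snippet below is only a sketch: the constructor arguments are assumptions based on the whisper_live client API, and test.wav is a hypothetical local audio file:

# sketch: send a local file to the TensorRT-backed server on port 9090
# (argument names are assumptions; check the whisper_live client for the exact signature)
from whisper_live.client import TranscriptionClient

client = TranscriptionClient("localhost", 9090, lang="en", translate=False)
client("test.wav")  # hypothetical audio file to transcribe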

peldszus (Contributor, Author) commented Jun 5, 2024

Closed in favour of #227.

peldszus closed this Jun 5, 2024