Adding TRT-LLM + Triton truss #55
base: main
Conversation
def predict(self, model_input):
We should try to use async predict. Sync predict runs on a thread pool, which has a limited number of threads and can limit concurrency. Plus, creating a new thread per request is not ideal. cc @squidarth who may know of examples of where we use async predict.
Ah, I missed the `yield` before. So this `predict` function is a generator, right?
It's not obvious to me that there would be a big perf increase from switching this to async (it's true that doing things this way does spin up another thread per request). It's at least a medium lift to switch, since we'd have to change the `TritonClient` implementation to also be async.
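For context, a minimal sketch of what the suggested async path could look like. `AsyncTritonClient` here is a hypothetical stand-in for an async variant of the existing `TritonClient`; it is not part of this PR:

```python
from typing import Any, AsyncGenerator, Optional


class AsyncTritonClient:
    """Hypothetical async client; a real one would stream results from Triton."""

    async def infer_stream(
        self, prompt: str, max_tokens: int
    ) -> AsyncGenerator[str, None]:
        # Placeholder stream so the sketch runs end to end.
        for token in ("Hello", ",", " world"):
            yield token


class Model:
    def __init__(self, **kwargs: Any) -> None:
        self._client: Optional[AsyncTritonClient] = None

    def load(self) -> None:
        # A real implementation would launch Triton and connect here.
        self._client = AsyncTritonClient()

    async def predict(self, model_input: dict) -> AsyncGenerator[str, None]:
        # An async generator keeps the request on the event loop instead of
        # pinning a worker thread for the lifetime of the stream.
        assert self._client is not None
        async for chunk in self._client.infer_stream(
            prompt=model_input["prompt"],
            max_tokens=model_input.get("max_tokens", 512),
        ):
            yield chunk
```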
TODO(Abu): __fill__
For your reference: it looks like setting both `max_tokens_in_paged_kv_cache` and `kv_cache_free_gpu_mem_fraction` is redundant. I didn't find docs on `kv_cache_free_gpu_mem_fraction`, but it sounds like 85% of the free GPU memory is preallocated for the KV cache by default if `max_tokens_in_paged_kv_cache` is not specified.
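For reference, a sketch of how these two knobs typically appear in the TRT-LLM backend's `config.pbtxt` (values are placeholders and the exact schema may vary by `tensorrtllm_backend` version); normally only one of them should be set:

```
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "10000"   # explicit KV-cache size in tokens (placeholder value)
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.85"    # fraction of free GPU memory used if the above is unset
  }
}
```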
Overview
This PR adds support for Triton + TRT-LLM engines. Users can specify a Hugging Face repository containing the pre-built engines and tokenizers. We leverage the C++ TRT runtime and the Triton Inference Server to provide high-performance model serving with streaming enabled.
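As an illustrative sketch of the workflow (the `model_metadata` keys below are assumptions for illustration, not necessarily the exact names this PR introduces), the truss `config.yaml` could point at the pre-built engine and tokenizer repositories like so:

```yaml
# Hypothetical config.yaml snippet; the actual keys may differ in this PR.
model_name: llama-7b-trt-llm
resources:
  accelerator: A100
  use_gpu: true
model_metadata:
  engine_repository: your-org/llama-7b-trt-engine     # HF repo with pre-built TRT-LLM engines (placeholder)
  tokenizer_repository: your-org/llama-7b-tokenizer   # HF repo with the tokenizer (placeholder)
```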