Adding TRT-LLM + Triton truss #55
base: main
Conversation
def predict(self, model_input):
We should try to use async predict. Sync predict runs on a thread pool, which has a limited number of threads and can limit concurrency. Plus, creating a new thread per request is not ideal. cc @squidarth who may know of examples of where we use async predict.
Ah, I missed the `yield` before. So this `predict` function is a generator, right?
It's not obvious to me that there would be a big perf increase from switching this to async (it's true that doing things this way does spin up another thread per request). It's at least a medium lift to switch, since we'd have to change the `TritonClient` implementation to also be async.
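For context, a minimal sketch of what the suggested async path could look like. `AsyncTritonClient` here is a hypothetical stand-in for an async variant of the existing `TritonClient`; it is not part of this PR:

```python
from typing import Any, AsyncGenerator, Optional


class AsyncTritonClient:
    """Hypothetical async client; a real one would stream results from Triton."""

    async def infer_stream(
        self, prompt: str, max_tokens: int
    ) -> AsyncGenerator[str, None]:
        # Placeholder stream so the sketch runs end to end.
        for token in ("Hello", ",", " world"):
            yield token


class Model:
    def __init__(self, **kwargs: Any) -> None:
        self._client: Optional[AsyncTritonClient] = None

    def load(self) -> None:
        # A real implementation would launch Triton and connect here.
        self._client = AsyncTritonClient()

    async def predict(self, model_input: dict) -> AsyncGenerator[str, None]:
        # An async generator keeps the request on the event loop instead of
        # pinning a worker thread for the lifetime of the stream.
        assert self._client is not None
        async for chunk in self._client.infer_stream(
            prompt=model_input["prompt"],
            max_tokens=model_input.get("max_tokens", 512),
        ):
            yield chunk
```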
TODO(Abu): __fill__
For your reference: it looks like setting both `max_tokens_in_paged_kv_cache` and `kv_cache_free_gpu_mem_fraction` is redundant. I didn't find docs on `kv_cache_free_gpu_mem_fraction`, but it sounds like 85% of the free GPU memory is preallocated for the KV cache by default if `max_tokens_in_paged_kv_cache` is not specified.
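For reference, a sketch of how these two knobs typically appear in the TRT-LLM backend's `config.pbtxt` (values are placeholders and the exact schema may vary by `tensorrtllm_backend` version); normally only one of them should be set:

```
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "10000"   # explicit KV-cache size in tokens (placeholder value)
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.85"    # fraction of free GPU memory used if the above is unset
  }
}
```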
Overview
This PR adds support for Triton + TRT-LLM engines. Users can specify a Hugging Face repository containing the pre-built engines and tokenizers. We leverage the C++ TRT runtime and the Triton Inference Server to provide high-performance model serving with streaming enabled.
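As an illustrative sketch of the workflow (the `model_metadata` keys below are assumptions for illustration, not necessarily the exact names this PR introduces), the truss `config.yaml` could point at the pre-built engine and tokenizer repositories like so:

```yaml
# Hypothetical config.yaml snippet; the actual keys may differ in this PR.
model_name: llama-7b-trt-llm
resources:
  accelerator: A100
  use_gpu: true
model_metadata:
  engine_repository: your-org/llama-7b-trt-engine     # HF repo with pre-built TRT-LLM engines (placeholder)
  tokenizer_repository: your-org/llama-7b-tokenizer   # HF repo with the tokenizer (placeholder)
```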