
Adding TRT-LLM + Triton truss #55

Open · wants to merge 20 commits into base: main

Conversation

@aspctu (Contributor) commented Oct 31, 2023

Overview

This PR adds support for Triton + TRT-LLM engines. Users point the truss at a Huggingface repository containing the pre-built engines and tokenizer. We leverage the C++ TRT runtime and the Triton Inference Server to provide high-performance model serving with streaming enabled.
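As a rough sketch of the flow described above (not the PR's actual code), the load step might pull the pre-built engines and tokenizer from the configured Huggingface repository and then launch Triton against them. The repo id, port, and helper name below are illustrative assumptions.

```python
# Illustrative sketch only: download pre-built TRT-LLM engines/tokenizer from a
# Huggingface repo and start the Triton Inference Server against them.
import subprocess

from huggingface_hub import snapshot_download


def start_triton(engine_repo: str) -> subprocess.Popen:
    # Pull the pre-built engines + tokenizer into a local model repository.
    model_repo = snapshot_download(engine_repo)

    # Launch Triton against that repository; the C++ TRT-LLM runtime is loaded
    # by the tensorrtllm backend inside Triton. Port is a placeholder.
    return subprocess.Popen(
        ["tritonserver", f"--model-repository={model_repo}", "--http-port=8003"]
    )
```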

trtllm-truss/config.yaml (outdated review thread, resolved)
}
)

def predict(self, model_input):
Contributor:
We should try to use async predict. Sync predict runs on a thread pool with a limited number of threads, which can limit concurrency. Plus, creating a new thread per request is not ideal. cc @squidarth, who may know of examples where we use async predict.

Contributor:

Ah, I missed the yield before. So this predict function is a generator, right?

Contributor:

It's not obvious to me that there would be a big perf increase from switching this to async (it's true that doing things this way produces another thread per request). It's at least a medium lift to switch, since we'd have to change the TritonClient implementation to also be async.
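For reference, a minimal sketch of the async-generator shape being discussed, assuming the serving framework accepts `async def predict` as an async generator. `AsyncTritonClient` and its `stream_infer` method are hypothetical stand-ins for an async version of the TritonClient in this PR, which is exactly the part that would need to change.

```python
from typing import Any, AsyncIterator, Dict


class AsyncTritonClient:
    """Hypothetical placeholder for an async TritonClient (not in this PR)."""

    async def stream_infer(self, model_input: Dict[str, Any]) -> AsyncIterator[str]:
        # A real implementation would stream tokens back from Triton.
        for token in ("hello", " world"):
            yield token


class Model:
    def __init__(self) -> None:
        self._client = AsyncTritonClient()

    async def predict(self, model_input: Dict[str, Any]) -> AsyncIterator[str]:
        # An async generator: each token is yielded as soon as it arrives, and
        # the event loop handles concurrency without a thread per request.
        async for token in self._client.stream_infer(model_input):
            yield token
```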

}
}
```
TODO(Abu): __fill__
Contributor:

For your reference: it looks like using both max_tokens_in_paged_kv_cache and kv_cache_free_gpu_mem_fraction is redundant. I didn't find docs on kv_cache_free_gpu_mem_fraction, but it sounds like by default they preallocate 85% of the free GPU memory for the KV cache if max_tokens_in_paged_kv_cache is not specified.
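A hypothetical sanity check capturing this point (the parameter names are from the comment above; the helper itself is not part of the PR):

```python
from typing import Optional


def check_kv_cache_settings(
    max_tokens_in_paged_kv_cache: Optional[int],
    kv_cache_free_gpu_mem_fraction: Optional[float],
) -> None:
    # Only one of the two knobs should be set; if neither is, the backend
    # reportedly defaults to ~85% of free GPU memory for the KV cache.
    if (
        max_tokens_in_paged_kv_cache is not None
        and kv_cache_free_gpu_mem_fraction is not None
    ):
        raise ValueError(
            "Set either max_tokens_in_paged_kv_cache or "
            "kv_cache_free_gpu_mem_fraction, not both."
        )
```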
