add BEI doc #405
Conversation
Overall this is really good, just left some notes mostly around formatting.
One thing on organization -- this doesn't need to be a numbered folder. We'll clean up truss-examples as part of the docs re-org work.
This is a collection of BEI deployments with Baseten. BEI is Baseten's solution for production-grade deployments via TensorRT-LLM.

With BEI you get the following benefits:
- *lowest-latency inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama)*1
Suggested change:
- *Lowest-latency inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama)<sup>1</sup>
- *highest-throughput inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama) - thanks to XQA kernels, FP8 and dynamic batching.*2
Suggested change:
- *Highest-throughput inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama) - thanks to XQA kernels, FP8 and dynamic batching.<sup>2</sup>
- high parallelism: up to 1400 client embeddings per second
Suggested change:
- High parallelism: up to 1400 client embeddings per second
- cached model weights for fast vertical scaling and high availability - no Hugging Face hub dependency at runtime
Suggested change:
- Cached model weights for fast vertical scaling and high availability - no Hugging Face hub dependency at runtime
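As a side note for readers of these examples: below is a minimal sketch of querying one of the deployed BEI embedding models over HTTP. The model ID and payload shape are hypothetical placeholders for illustration; each example's README documents the exact request format.

```python
# Minimal sketch: call a deployed BEI embedding model on Baseten.
# MODEL_ID and the payload shape are hypothetical placeholders --
# check the specific example's README for the exact schema.
import os
import requests

MODEL_ID = "abcd1234"  # hypothetical model ID from the Baseten dashboard
API_KEY = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={"input": ["What is the capital of France?"]},  # assumed payload shape
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # embedding vector(s) in the response body
```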
(Two resolved comment threads on 11-embeddings-reranker-classification-tensorrt/BEI-baai-bge-en-icl-embedding/README.md, marked outdated.)
Advanced:
You may also use Baseten's async jobs API, which returns a request_id, which you can use to query the status of the job and get the results.
Suggested change:
You may also use Baseten's async jobs API, which returns a request_id, which you can use to query the status of the job and get the results.\n
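To make the async flow concrete, here is a rough sketch: submit a request, receive a request_id, then poll for status and results. The endpoint paths, payload, and response fields below are assumptions to verify against Baseten's async API documentation.

```python
# Rough sketch of the async jobs flow described above. Endpoint paths,
# payload shape, and response fields are assumptions -- confirm against
# Baseten's async API docs before relying on them.
import os
import time
import requests

MODEL_ID = "abcd1234"  # hypothetical model ID
API_KEY = os.environ["BASETEN_API_KEY"]
HEADERS = {"Authorization": f"Api-Key {API_KEY}"}

# Submit the embedding job asynchronously (assumed async_predict path).
submit = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/async_predict",
    headers=HEADERS,
    json={"model_input": {"input": ["hello world"]}},  # assumed payload shape
    timeout=30,
)
submit.raise_for_status()
request_id = submit.json()["request_id"]

# Poll the job status by request_id until it reaches a terminal state
# (assumed status endpoint and status values).
while True:
    status = requests.get(
        f"https://api.baseten.co/v1/async_request/{request_id}",
        headers=HEADERS,
        timeout=30,
    )
    status.raise_for_status()
    body = status.json()
    if body.get("status") in ("SUCCEEDED", "FAILED", "EXPIRED"):
        print(body)
        break
    time.sleep(2)
```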
(Two more resolved comment threads on 11-embeddings-reranker-classification-tensorrt/BEI-baai-bge-en-icl-embedding/README.md, marked outdated.)
All suggestions applied!
They are AUTOGENERATED! Please look at generate.py.