add BEI doc #405

Merged: 10 commits from embeddings-models-purge into main on Feb 11, 2025
Conversation

@michaelfeil (Contributor) commented Feb 7, 2025

The added files are AUTOGENERATED! Please look at generate.py.
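
For readers unfamiliar with the setup, here is a minimal sketch of what a generator script like this might do: render each model entry into its own directory and README from a template. The model list, template, and output paths below are invented placeholders, not generate.py's actual interface.

```python
# Illustrative sketch only -- the real logic lives in generate.py in this repo.
# The model list, template, and output paths are placeholders.
from pathlib import Path

MODELS = [  # placeholder entries, not the repo's actual model list
    {"name": "BAAI/bge-large-en-v1.5", "dir": "bei-bge-large-embedding"},
]

README_TEMPLATE = """# BEI deployment for {name}

This file is AUTOGENERATED -- edit the generator, not this README.
"""

for model in MODELS:
    out_dir = Path(model["dir"])
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "README.md").write_text(README_TEMPLATE.format(name=model["name"]))
```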

@philipkiely-baseten (Member) left a comment

Overall this is really good; I just left some notes, mostly around formatting.

One thing on organization -- this doesn't need to be a numbered folder. We'll clean up truss-examples as part of the docs re-org work.

This is a collection of BEI deployments with Baseten. BEI is Baseten's solution for production-grade deployments via TensorRT-LLM.

With BEI you get the following benefits:
- *lowest-latency inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama)*1

Suggested change
old: - *lowest-latency inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama)*1
new: - *Lowest-latency inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama)<sup>1</sup>


With BEI you get the following benefits:
- *lowest-latency inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama)*1
- *highest-throughput inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama) - thanks to XQA kernels, FP8 and dynamic batching.*2

Suggested change
old: - *highest-throughput inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama) - thanks to XQA kernels, FP8 and dynamic batching.*2
new: - *Highest-throughput inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama) - thanks to XQA kernels, FP8 and dynamic batching.<sup>2</sup>

With BEI you get the following benefits:
- *lowest-latency inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama)*1
- *highest-throughput inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama) - thanks to XQA kernels, FP8 and dynamic batching.*2
- high parallelism: up to 1400 client embeddings per second

Suggested change
old: - high parallelism: up to 1400 client embeddings per second
new: - High parallelism: up to 1400 client embeddings per second

- *lowest-latency inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama)*1
- *highest-throughput inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama) - thanks to XQA kernels, FP8 and dynamic batching.*2
- high parallelism: up to 1400 client embeddings per second
- cached model weights for fast vertical scaling and high availability - no Hugging Face hub dependency at runtime

Suggested change
old: - cached model weights for fast vertical scaling and high availability - no Hugging Face hub dependency at runtime
new: - Cached model weights for fast vertical scaling and high availability - no Hugging Face hub dependency at runtime
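
To make the throughput and parallelism claims quoted above concrete, here is a hedged client-side sketch that fans embedding requests out concurrently. The base URL, auth header, payload shape, and the OpenAI-compatible /v1/embeddings route are assumptions for illustration only; the generated per-model READMEs show the actual invocation.

```python
# Hedged sketch: fan many embedding requests out to a deployed BEI model.
# BASE_URL, the auth header, and the /v1/embeddings payload shape are
# assumptions, not the verified interface of this repo's deployments.
import os
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "https://model-xxxxxxxx.api.baseten.co/production"  # placeholder
API_KEY = os.environ["BASETEN_API_KEY"]


def embed(text: str) -> list[float]:
    resp = requests.post(
        f"{BASE_URL}/v1/embeddings",  # assumed OpenAI-compatible route
        headers={"Authorization": f"Api-Key {API_KEY}"},
        json={"input": text, "model": "my-embedding-model"},  # placeholder model name
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]


texts = [f"document {i}" for i in range(256)]
with ThreadPoolExecutor(max_workers=64) as pool:
    embeddings = list(pool.map(embed, texts))
print(f"received {len(embeddings)} embeddings")
```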

Advanced:
You may also use Baseten's async jobs API, which returns a request_id, which you can use to query the status of the job and get the results.

Suggested change (adds a trailing newline)
old: You may also use Baseten's async jobs API, which returns a request_id, which you can use to query the status of the job and get the results.
new: You may also use Baseten's async jobs API, which returns a request_id, which you can use to query the status of the job and get the results.\n
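
For context on the quoted sentence, here is a hedged sketch of that submit-then-poll flow: the async request returns a request_id, which is then used to check status and fetch the result. The endpoint paths and response fields below are placeholders, not Baseten's verified API; consult the async inference docs for the real routes.

```python
# Hedged sketch of the async flow: submit a job, get a request_id, poll for results.
# The URLs and the "status"/"request_id" fields are assumptions for illustration.
import os
import time

import requests

API_KEY = os.environ["BASETEN_API_KEY"]
HEADERS = {"Authorization": f"Api-Key {API_KEY}"}

# 1. Submit the embedding job asynchronously (placeholder URL and payload).
submit = requests.post(
    "https://model-xxxxxxxx.api.baseten.co/production/async_predict",
    headers=HEADERS,
    json={"model_input": {"input": "Hello world", "model": "my-embedding-model"}},
    timeout=30,
)
submit.raise_for_status()
request_id = submit.json()["request_id"]

# 2. Poll the job status until it reaches a terminal state (placeholder route).
while True:
    status = requests.get(
        f"https://api.baseten.co/async_request_status/{request_id}",  # placeholder
        headers=HEADERS,
        timeout=30,
    ).json()
    if status.get("status") in ("SUCCEEDED", "FAILED"):  # assumed terminal states
        break
    time.sleep(1)

print(status)
```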

@michaelfeil (Contributor, Author) commented:

All suggestions applied!

@michaelfeil merged commit a272977 into main on Feb 11, 2025
1 check passed
@michaelfeil deleted the embeddings-models-purge branch on February 11, 2025 at 03:22