add BEI doc #405
Conversation
Overall this is really good, just left some notes mostly around formatting.
One thing on organization -- this doesn't need to be a numbered folder. We'll clean up truss-examples as part of the docs re-org work.
This is a collection of BEI deployments with Baseten. BEI is Baseten's solution for production-grade deployments via TensorRT-LLM.

With BEI you get the following benefits:
- *lowest-latency inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama)*1
Suggested change:
- *Lowest-latency inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama)<sup>1</sup>
- *highest-throughput inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama) - thanks to XQA kernels, FP8 and dynamic batching.*2
Suggested change:
- *Highest-throughput inference* across any embedding solution (vLLM, SGlang, Infinity, TEI, Ollama) - thanks to XQA kernels, FP8 and dynamic batching.<sup>2</sup>
- high parallelism: up to 1400 client embeddings per second
Suggested change:
- High parallelism: up to 1400 client embeddings per second
- cached model weights for fast vertical scaling and high availability - no Hugging Face hub dependency at runtime
Suggested change:
- Cached model weights for fast vertical scaling and high availability - no Hugging Face hub dependency at runtime
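As a side note for readers of these examples: below is a minimal sketch of querying one of the deployed BEI embedding models over HTTP. The model ID and payload shape are hypothetical placeholders for illustration; each example's README documents the exact request format.

```python
# Minimal sketch: call a deployed BEI embedding model on Baseten.
# MODEL_ID and the payload shape are hypothetical placeholders --
# check the specific example's README for the exact schema.
import os
import requests

MODEL_ID = "abcd1234"  # hypothetical model ID from the Baseten dashboard
API_KEY = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={"input": ["What is the capital of France?"]},  # assumed payload shape
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # embedding vector(s) in the response body
```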
(Two resolved comment threads on 11-embeddings-reranker-classification-tensorrt/BEI-baai-bge-en-icl-embedding/README.md, marked outdated.)
Advanced:
You may also use Baseten's async jobs API, which returns a request_id, which you can use to query the status of the job and get the results.
Suggested change:
You may also use Baseten's async jobs API, which returns a request_id, which you can use to query the status of the job and get the results.\n
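To make the async flow concrete, here is a rough sketch: submit a request, receive a request_id, then poll for status and results. The endpoint paths, payload, and response fields below are assumptions to verify against Baseten's async API documentation.

```python
# Rough sketch of the async jobs flow described above. Endpoint paths,
# payload shape, and response fields are assumptions -- confirm against
# Baseten's async API docs before relying on them.
import os
import time
import requests

MODEL_ID = "abcd1234"  # hypothetical model ID
API_KEY = os.environ["BASETEN_API_KEY"]
HEADERS = {"Authorization": f"Api-Key {API_KEY}"}

# Submit the embedding job asynchronously (assumed async_predict path).
submit = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/async_predict",
    headers=HEADERS,
    json={"model_input": {"input": ["hello world"]}},  # assumed payload shape
    timeout=30,
)
submit.raise_for_status()
request_id = submit.json()["request_id"]

# Poll the job status by request_id until it reaches a terminal state
# (assumed status endpoint and status values).
while True:
    status = requests.get(
        f"https://api.baseten.co/v1/async_request/{request_id}",
        headers=HEADERS,
        timeout=30,
    )
    status.raise_for_status()
    body = status.json()
    if body.get("status") in ("SUCCEEDED", "FAILED", "EXPIRED"):
        print(body)
        break
    time.sleep(2)
```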
(Two more resolved comment threads on 11-embeddings-reranker-classification-tensorrt/BEI-baai-bge-en-icl-embedding/README.md, marked outdated.)
All suggestions applied!
They are AUTOGENERATED! Please look at generate.py.