
Update batching explanation in docs (#36)
* Update embeddings documentation to be more clear

* Update README.md

* Fix markdown formatting

* Update README.md
matthewkotila authored Aug 13, 2024
1 parent e1455e0 commit a812e25
Showing 2 changed files with 19 additions and 0 deletions.
7 changes: 7 additions & 0 deletions genai-perf/README.md
@@ -335,6 +335,13 @@ You can optionally set additional model inputs with the following option:
model with a singular value, such as `stream:true` or `max_tokens:5`. This
flag can be repeated to supply multiple extra inputs.

For [Large Language Models](docs/tutorial.md), there is no batch size (i.e.
the batch size is always `1`): each request includes the inputs for one
individual inference. Other endpoints, such as [embeddings](docs/embeddings.md)
and [rankings](docs/rankings.md), support client-side batching, where
`--batch-size N` means that each request sent will include the inputs for `N`
separate inferences, allowing them to be processed together.
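As a sketch of the distinction, a client-side batch of `N` inputs can travel in one request whose `input` field carries all the texts. The field names below follow the OpenAI embeddings API (`model`, `input`, which accepts either a single string or a list of strings); the exact payload genai-perf emits may differ:

```python
import json

def build_embeddings_request(model, texts):
    # One request carrying the inputs for len(texts) separate inferences.
    # The OpenAI embeddings API accepts a single string or a list of strings
    # for "input"; passing a list is what client-side batching amounts to.
    return {"model": model, "input": texts}

batched = build_embeddings_request(
    "intfloat/e5-mistral-7b-instruct",
    ["first passage to embed", "second passage to embed"],
)
print(json.dumps(batched))
```

With `--batch-size 1` (the default for these endpoints), each request would instead carry a single text.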

<br/>

<!--
12 changes: 12 additions & 0 deletions genai-perf/docs/embeddings.md
@@ -68,6 +68,18 @@ genai-perf profile \
--input-file embeddings.jsonl
```

* `-m intfloat/e5-mistral-7b-instruct` specifies the model to run
  (`intfloat/e5-mistral-7b-instruct`)
* `--service-kind openai` specifies that the server is OpenAI-API
  compatible
* `--endpoint-type embeddings` specifies that requests should be formatted to
  follow the [embeddings
  API](https://platform.openai.com/docs/api-reference/embeddings/create)
* `--batch-size 2` specifies that each request will contain the inputs for 2
  individual inferences, making a client-side batch size of 2
* `--input-file embeddings.jsonl` specifies the input data to be used for
  inferencing
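The input file referenced above is a JSONL file with one JSON object per line. As a minimal sketch of generating such a file, assuming each line carries a single `"text"` field (check this tutorial's input-file section for the exact schema genai-perf expects):

```python
import json

# Hypothetical sample passages; in practice these would be your own texts.
passages = [
    "What was the first car ever driven?",
    "Who discovered penicillin?",
]

# Write one JSON object per line, i.e. the JSONL format.
with open("embeddings.jsonl", "w") as f:
    for passage in passages:
        f.write(json.dumps({"text": passage}) + "\n")
```

With `--batch-size 2`, genai-perf would then pack the inputs for two such lines into each request it sends.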

This will use default values for optional arguments. You can also pass in
additional arguments with the `--extra-inputs` [flag](../README.md#input-options).
For example, you could use this command:
