# Mistral 7B Chat

This is a [Truss](https://truss.baseten.co/) for Mistral 7B Chat. This Truss is compatible with our [bridge endpoint for OpenAI ChatCompletion users](https://docs.baseten.co/api-reference/openai).

For the standard deployment of Mistral 7B, see [Mistral 7B Instruct](https://app.baseten.co/explore/mistral_7b_instruct).
Mistral 7B is a seven billion parameter language model released by [Mistral](https://mistral.ai/). Its instruct-tuned chat variant performs well versus other open source LLMs of the same size, and the model is capable of writing code.

## Deployment

First, clone this repository:

```sh
git clone https://github.com/basetenlabs/truss-examples/
cd truss-examples/mistral/mistral-7b-chat
```

Before deployment:

1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
2. Install the latest version of Truss: `pip install --upgrade truss`

With `mistral-7b-chat` as your working directory, you can deploy the model with:

```sh
truss push
```

Paste your Baseten API key if prompted.

For more information, see [Truss documentation](https://truss.baseten.co).
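
Once the deployment finishes, you can sanity-check the model from the command line with `truss predict` (a minimal sketch; the `messages` payload follows the API documented below):

```sh
truss predict -d '{"messages": [{"role": "user", "content": "What is a mistral?"}]}'
```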


## Mistral 7B Chat API documentation

This section provides an overview of the Mistral 7B Chat model, its parameters, and how to use it. The API consists of a single route named `predict`, which you can invoke to generate text.

### API route: `predict`

The `predict` route is the primary method for generating text based on a list of messages. It takes several parameters:

- __messages__ (required): A list of JSON objects representing a conversation.
- __stream__ (optional, default=False): A boolean determining whether the model should stream a response back. When `True`, the API returns generated text as it becomes available.
- __max_tokens__ (optional, default=512): The maximum number of tokens to return, including input tokens, up to a maximum of 4096.
- __temperature__ (optional, default=1.0): Controls the randomness of the generated text. Higher values produce more diverse results, while lower values produce more deterministic results.
- __top_p__ (optional, default=0.95): The cumulative probability threshold for nucleus sampling. The model samples only from the smallest set of tokens whose cumulative probability reaches this threshold.
- __top_k__ (optional, default=50): The number of top tokens to consider when sampling. The model will only consider the top_k highest-probability tokens.
- __repetition_penalty__ (optional, default=1.0): Penalizes tokens that have already appeared in the text; values above 1.0 discourage repetition.
- __no_repeat_ngram_size__ (optional, default=0): If set above 0, n-grams of this size can appear at most once in the output text.
- __use_cache__ (optional, default=True): A boolean determining whether the model should reuse previously computed hidden states during text generation rather than recomputing them.
- __do_sample__ (optional, default=True): Controls the sampling strategy during decoding. When False, the model always picks the highest-probability next token (greedy decoding); when True, it samples using the top_k and top_p settings above.

Here is an example of the `messages` list the model takes as input:
```json
[
{"role": "user", "content": "What is a mistral?"},
{"role": "assistant", "content": "A mistral is a type of cold, dry wind that blows across the southern slopes of the Alps from the Valais region of Switzerland into the Ligurian Sea near Genoa. It is known for its strong and steady gusts, sometimes reaching up to 60 miles per hour."},
{"role": "user", "content": "How does the mistral wind form?"}
]
```
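
Combining `messages` with the sampling parameters above, a complete request body might look like this (a sketch; the values are illustrative, and the parameter names follow the list above):

```json
{
  "messages": [
    {"role": "user", "content": "What is a mistral?"}
  ],
  "stream": false,
  "max_tokens": 512,
  "temperature": 0.7,
  "top_p": 0.95,
  "top_k": 50
}
```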

## Example usage of the API in Python

This model is designed to work with the OpenAI Chat Completions format. Here is an example of how to stream tokens using chat completions:

```python
from openai import OpenAI
import os

# Replace the empty string with your model id below
model_id = ""

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url=f"https://bridge.baseten.co/{model_id}/v1"
)

# Call model endpoint
res = client.chat.completions.create(
    model="mistral-7b",
    messages=[
        {"role": "user", "content": "What is a mistral?"},
        {"role": "assistant", "content": "A mistral is a type of cold, dry wind that blows across the southern slopes of the Alps from the Valais region of Switzerland into the Ligurian Sea near Genoa. It is known for its strong and steady gusts, sometimes reaching up to 60 miles per hour."},
        {"role": "user", "content": "How does the mistral wind form?"}
    ],
    temperature=0.5,
    max_tokens=50,
    top_p=0.95,
    stream=True
)

# Print the generated tokens as they get streamed
# (the final chunk's delta may carry no content, so fall back to an empty string)
for chunk in res:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```

Similarly, here is a non-streaming example:

```python
from openai import OpenAI
import os

# Replace the empty string with your model id below
model_id = ""

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url=f"https://bridge.baseten.co/{model_id}/v1"
)

# Call model endpoint
res = client.chat.completions.create(
    model="mistral-7b",
    messages=[
        {"role": "user", "content": "What is a mistral?"},
        {"role": "assistant", "content": "A mistral is a type of cold, dry wind that blows across the southern slopes of the Alps from the Valais region of Switzerland into the Ligurian Sea near Genoa. It is known for its strong and steady gusts, sometimes reaching up to 60 miles per hour."},
        {"role": "user", "content": "How does the mistral wind form?"}
    ],
    temperature=0.5,
    max_tokens=50,
    top_p=0.95
)

# Print the output of the model
print(res.choices[0].message.content)
```

It's not necessary to use the OpenAI Chat Completions client. You can also invoke the model directly with HTTP requests.

Streaming example:

```python
import os

import requests

# Replace the empty string with your model id below
model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]

messages = [
    {"role": "user", "content": "What is a mistral?"},
    {"role": "assistant", "content": "A mistral is a type of cold, dry wind that blows across the southern slopes of the Alps from the Valais region of Switzerland into the Ligurian Sea near Genoa. It is known for its strong and steady gusts, sometimes reaching up to 60 miles per hour."},
    {"role": "user", "content": "How does the mistral wind form?"},
]
data = {
    "messages": messages,
    "stream": True,
    "max_new_tokens": 100,
    "temperature": 0.9,
    "top_p": 0.85,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "no_repeat_ngram_size": 3
}

# Call model endpoint
res = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json=data,
    stream=True
)

# Print the generated tokens as they get streamed
for content in res.iter_content():
    print(content.decode("utf-8"), end="", flush=True)
```
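
If you also want the complete response once streaming finishes, you can accumulate the chunks as you print them, replacing the loop in the example above (a minimal sketch):

```python
# Collect the streamed chunks into one string while printing them live
full_text = ""
for content in res.iter_content():
    piece = content.decode("utf-8")
    full_text += piece
    print(piece, end="", flush=True)

print("\nFull response:", full_text)
```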

Non-streaming example:

```python
import os

import requests

# Replace the empty string with your model id below
model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]

messages = [
    {"role": "user", "content": "What is a mistral?"},
    {"role": "assistant", "content": "A mistral is a type of cold, dry wind that blows across the southern slopes of the Alps from the Valais region of Switzerland into the Ligurian Sea near Genoa. It is known for its strong and steady gusts, sometimes reaching up to 60 miles per hour."},
    {"role": "user", "content": "How does the mistral wind form?"},
]
data = {
    "messages": messages,
    "max_new_tokens": 100,
    "temperature": 0.9,
    "top_p": 0.85,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "no_repeat_ngram_size": 3
}

# Call model endpoint
res = requests.post(
f"https://model-{model_id}.api.baseten.co/production/predict",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json=data
)

# Print the output of the model
print(res.json())
```

When the model is invoked without streaming, the output is a string containing the generated text.
Here is an example of the LLM output:

```
"[INST] What is a mistral? [/INST]A mistral is a type of cold, dry wind that blows across the southern slopes of the Alps from the Valais region of Switzerland into the Ligurian Sea near Genoa. It is known for its strong and steady gusts, sometimes reaching up to 60 miles per hour. [INST] How does the mistral wind form? [/INST]The mistral wind forms as a result of the movement of cold air from the high mountains of the Swiss Alps towards the sea. The cold air collides with the warmer air over the Mediterranean Sea, causing the cold air to rise rapidly and creating a cyclonic circulation. As the warm air rises, the cold air flows into the valley, creating a strong, steady wind known as the mistral.\n\nThe mistral is typically strongest during the winter months when the air is cold."
```
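
The returned string includes the full prompt in Mistral's `[INST]` instruction format. If you only need the newest assistant reply, you can split on the final closing tag (a minimal sketch, assuming `res.json()` returns a string in the format shown above):

```python
# res.json() returns the full prompt-plus-completion string shown above
raw_output = res.json()

# Everything after the last [/INST] tag is the newest assistant reply
reply = raw_output.split("[/INST]")[-1].strip()
print(reply)
```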