[Inference] Add validated models for Gaudi #225

Closed: wants to merge 34 commits (changes shown from 30 commits).

Commits
ff52ed5
add validated models for Gaudi
Deegue May 16, 2024
cd543a5
nit
Deegue May 16, 2024
55595f7
fix
Deegue May 17, 2024
0a72f39
remove
Deegue May 17, 2024
0b17988
add config
Deegue May 17, 2024
ef33763
nit
Deegue May 17, 2024
c1b2a2d
remove prompt and add gpt2
Deegue May 17, 2024
28acd73
check and add all template, remove bloom-560m, add mixtral, change Qw…
Deegue May 20, 2024
ecf40e6
nit
Deegue May 21, 2024
f4e02ff
fix
Deegue May 21, 2024
8218531
fix
Deegue May 21, 2024
0baa303
fix
Deegue May 22, 2024
02ab927
Merge branch 'intel:main' into add_validated_models
Deegue May 22, 2024
96b36bd
remove default template
Deegue May 22, 2024
1de2ebb
fix when list length is 1
Deegue May 23, 2024
9b8e57d
fix
Deegue May 23, 2024
14e0199
Merge branch 'main' into add_validated_models
Deegue May 27, 2024
8940d0d
fix target
Deegue May 27, 2024
50c4988
change cache dir
Deegue May 27, 2024
762e84c
remove Mixtral
Deegue May 29, 2024
54e1550
Merge branch 'intel:main' into add_validated_models
Deegue May 29, 2024
012bac2
change to 8 cards
Deegue May 29, 2024
4496e73
remove Qwen and fix
Deegue May 30, 2024
a732a1c
Merge branch 'intel:main' into add_validated_models
Deegue Jun 3, 2024
33a1478
revert and add Qwen&Mixtral back
Deegue Jun 4, 2024
0830f2d
Merge branch 'intel:main' into add_validated_models
Deegue Jun 5, 2024
43a75bc
nit
Deegue Jun 5, 2024
9634202
Merge branch 'intel:main' into add_validated_models
Deegue Jun 11, 2024
2b868ca
add Qwen1.5-7B-Chat
Deegue Jun 12, 2024
7555935
add Qwen2-7B-Instruct
Deegue Jun 12, 2024
53187b5
remove several models
Deegue Jun 13, 2024
6d16dd4
add falcon qwen linear all reduce to hpu_predictor
Deegue Jun 18, 2024
d368e2e
Merge branch 'main' into add_validated_models
Deegue Jun 18, 2024
2d4cea1
Merge branch 'intel:main' into add_validated_models
Deegue Jul 17, 2024
40 changes: 23 additions & 17 deletions .github/workflows/workflow_inference_gaudi2.yml
@@ -17,7 +17,7 @@ on:
default: '/home/ci/actions-runner/_work/llm-on-ray/llm-on-ray'
model_cache_path:
type: string
default: '/mnt/DP_disk1/huggingface/cache'
default: '/scratch-2/huggingface/cache'

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}-inf-gaudi2
@@ -28,16 +28,29 @@ jobs:
name: inference
strategy:
matrix:
model: [ llama-2-7b-chat-hf, llama-2-70b-chat-hf, llama-2-7b-chat-hf-vllm ]
model: [ bloom-7b1, CodeLlama-7b-hf, falcon-7b, falcon-40b, gemma-2b, gpt-j-6b, gpt2, llama-2-7b-chat-hf, llama-2-70b-chat-hf, meta-llama-3-8b-instruct, meta-llama-3-70b-instruct, Qwen1.5-110B, mistral-7b-v0.1, Mixtral-7B, mpt-7b, Qwen1.5-7B-Chat, Qwen2-7B-Instruct, llama-2-7b-chat-hf-vllm ]
isPR:
- ${{inputs.ci_type == 'pr'}}

exclude:
- { isPR: true }

include:
- { model: "bloom-7b1"}
- { model: "CodeLlama-7b-hf"}
- { model: "falcon-7b"}
- { model: "falcon-40b"}
- { model: "gemma-2b"}
- { model: "gpt-j-6b"}
- { model: "gpt2"}
- { model: "llama-2-7b-chat-hf"}
- { model: "llama-2-70b-chat-hf"}
- { model: "meta-llama-3-8b-instruct"}
- { model: "meta-llama-3-70b-instruct"}
- { model: "mistral-7b-v0.1"}
- { model: "mpt-7b"}
- { model: "Qwen1.5-7B-Chat"}
- { model: "Qwen2-7B-Instruct"}
- { model: "llama-2-7b-chat-hf-vllm"}

runs-on: gaudi2
@@ -60,12 +73,10 @@ jobs:
id: "target"
run: |
target="inference"
if [[ ${{ matrix.model }} == "llama-2-7b-chat-hf" ]]; then
target="${target}_gaudi2"
elif [[ ${{ matrix.model }} == "llama-2-70b-chat-hf" ]]; then
target="${target}_gaudi2"
elif [[ ${{ matrix.model }} == "llama-2-7b-chat-hf-vllm" ]]; then
if [[ ${{ matrix.model }} == "llama-2-7b-chat-hf-vllm" ]]; then
target="${target}_vllm_gaudi2"
else
target="${target}_gaudi2"
fi
echo "target is ${target}"
echo "target=$target" >> $GITHUB_OUTPUT
@@ -109,11 +120,8 @@ jobs:
TARGET=${{steps.target.outputs.target}}
CMD=$(cat << EOF
import yaml
if ("${{ matrix.model }}" == "llama-2-7b-chat-hf"):
conf_path = "llm_on_ray/inference/models/hpu/llama-2-7b-chat-hf-hpu.yaml"
elif ("${{ matrix.model }}" == "llama-2-70b-chat-hf"):
conf_path = "llm_on_ray/inference/models/hpu/llama-2-70b-chat-hf-hpu.yaml"
elif ("${{ matrix.model }}" == "llama-2-7b-chat-hf-vllm"):
conf_path = "llm_on_ray/inference/models/hpu/" + "${{ matrix.model }}" + "-hpu.yaml"
if ("${{ matrix.model }}" == "llama-2-7b-chat-hf-vllm"):
conf_path = "llm_on_ray/inference/models/hpu/llama-2-7b-chat-hf-vllm-hpu.yaml"
with open(conf_path, encoding="utf-8") as reader:
result = yaml.load(reader, Loader=yaml.FullLoader)
@@ -123,13 +131,11 @@
EOF
)
docker exec "${TARGET}" python -c "$CMD"
if [[ ${{ matrix.model }} == "llama-2-7b-chat-hf" ]]; then
docker exec "${TARGET}" bash -c "llm_on_ray-serve --config_file llm_on_ray/inference/models/hpu/llama-2-7b-chat-hf-hpu.yaml --keep_serve_terminal"
elif [[ ${{ matrix.model }} == "llama-2-70b-chat-hf" ]]; then
docker exec "${TARGET}" bash -c "llm_on_ray-serve --config_file llm_on_ray/inference/models/hpu/llama-2-70b-chat-hf-hpu.yaml --keep_serve_terminal"
elif [[ ${{ matrix.model }} == "llama-2-7b-chat-hf-vllm" ]]; then
if [[ ${{ matrix.model }} == "llama-2-7b-chat-hf-vllm" ]]; then
docker exec "${TARGET}" bash -c "huggingface-cli login --token ${{ env.HF_ACCESS_TOKEN }}"
docker exec "${TARGET}" bash -c "llm_on_ray-serve --config_file llm_on_ray/inference/models/hpu/llama-2-7b-chat-hf-vllm-hpu.yaml --keep_serve_terminal"
else
docker exec "${TARGET}" bash -c "llm_on_ray-serve --config_file llm_on_ray/inference/models/hpu/${{ matrix.model }}-hpu.yaml --keep_serve_terminal"
fi
echo Streaming query:
docker exec "${TARGET}" bash -c "python examples/inference/api_server_openai/query_http_requests.py --model_name ${{ matrix.model }} --streaming_response"
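With these changes the workflow derives both the docker target and the serving config path from the matrix entry, so validating an additional model only needs a new per-model HPU YAML plus a matrix entry. As a reference, the sketch below replays the serve-and-query steps the workflow runs inside the Gaudi container; it is illustrative only and assumes a local environment with llm-on-ray installed and an HPU device visible, using the gpt2 config added in this PR as the example model.

# Minimal sketch (assumptions noted above; in CI these commands run via docker exec
# inside the Gaudi container built by the workflow):

# Terminal 1: start serving from the model's HPU config.
llm_on_ray-serve --config_file llm_on_ray/inference/models/hpu/gpt2-hpu.yaml --keep_serve_terminal

# Terminal 2: send a streaming query through the OpenAI-compatible API,
# mirroring the workflow's query step.
python examples/inference/api_server_openai/query_http_requests.py --model_name gpt2 --streaming_response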
13 changes: 13 additions & 0 deletions llm_on_ray/inference/models/hpu/CodeLlama-7b-hf-hpu.yaml
@@ -0,0 +1,13 @@
port: 8000
name: CodeLlama-7b-hf
route_prefix: /CodeLlama-7b-hf
num_replicas: 1
cpus_per_worker: 8
hpus_per_worker: 1
device: hpu
model_description:
model_id_or_path: codellama/CodeLlama-7b-hf
tokenizer_name_or_path: codellama/CodeLlama-7b-hf
chat_template: "llm_on_ray/inference/models/templates/template_codellama.jinja"
config:
use_auth_token: ''
14 changes: 14 additions & 0 deletions llm_on_ray/inference/models/hpu/Mixtral-7B-hpu.yaml
@@ -0,0 +1,14 @@
port: 8000
name: Mixtral-7B
route_prefix: /Mixtral-7B
num_replicas: 1
cpus_per_worker: 8
hpus_per_worker: 1
deepspeed: true
workers_per_group: 8
device: hpu
model_description:
model_id_or_path: mistralai/Mixtral-8x7B-Instruct-v0.1
tokenizer_name_or_path: mistralai/Mixtral-8x7B-Instruct-v0.1
config:
use_auth_token: ''
14 changes: 14 additions & 0 deletions llm_on_ray/inference/models/hpu/Qwen1.5-110B-hpu.yaml
@@ -0,0 +1,14 @@
port: 8000
name: Qwen1.5-110B
route_prefix: /Qwen1.5-110B
num_replicas: 1
cpus_per_worker: 8
hpus_per_worker: 1
deepspeed: true
workers_per_group: 8
device: hpu
model_description:
model_id_or_path: Qwen/Qwen1.5-110B
tokenizer_name_or_path: Qwen/Qwen1.5-110B
config:
use_auth_token: ''
12 changes: 12 additions & 0 deletions llm_on_ray/inference/models/hpu/Qwen1.5-7B-Chat-hpu.yaml
@@ -0,0 +1,12 @@
port: 8000
name: Qwen1.5-7B-Chat
route_prefix: /Qwen1.5-7B-Chat
num_replicas: 1
cpus_per_worker: 8
hpus_per_worker: 1
device: hpu
model_description:
model_id_or_path: Qwen/Qwen1.5-7B-Chat
tokenizer_name_or_path: Qwen/Qwen1.5-7B-Chat
config:
use_auth_token: ''
12 changes: 12 additions & 0 deletions llm_on_ray/inference/models/hpu/Qwen2-7B-Instruct-hpu.yaml
@@ -0,0 +1,12 @@
port: 8000
name: Qwen2-7B-Instruct
route_prefix: /Qwen2-7B-Instruct
num_replicas: 1
cpus_per_worker: 8
hpus_per_worker: 1
device: hpu
model_description:
model_id_or_path: Qwen/Qwen2-7B-Instruct
tokenizer_name_or_path: Qwen/Qwen2-7B-Instruct
config:
use_auth_token: ''
12 changes: 12 additions & 0 deletions llm_on_ray/inference/models/hpu/bloom-7b1-hpu.yaml
@@ -0,0 +1,12 @@
port: 8000
name: bloom-7b1
route_prefix: /bloom-7b1
num_replicas: 1
cpus_per_worker: 8
hpus_per_worker: 1
device: hpu
model_description:
model_id_or_path: bigscience/bloom-7b1
tokenizer_name_or_path: bigscience/bloom-7b1
config:
use_auth_token: ''
14 changes: 14 additions & 0 deletions llm_on_ray/inference/models/hpu/falcon-40b-hpu.yaml
@@ -0,0 +1,14 @@
port: 8000
name: falcon-40b
route_prefix: /falcon-40b
num_replicas: 1
cpus_per_worker: 8
hpus_per_worker: 1
deepspeed: true
workers_per_group: 8
device: hpu
model_description:
model_id_or_path: tiiuae/falcon-40b
tokenizer_name_or_path: tiiuae/falcon-40b
config:
use_auth_token: ''
12 changes: 12 additions & 0 deletions llm_on_ray/inference/models/hpu/falcon-7b-hpu.yaml
@@ -0,0 +1,12 @@
port: 8000
name: falcon-7b
route_prefix: /falcon-7b
num_replicas: 1
cpus_per_worker: 8
hpus_per_worker: 1
device: hpu
model_description:
model_id_or_path: tiiuae/falcon-7b
tokenizer_name_or_path: tiiuae/falcon-7b
config:
use_auth_token: ''
13 changes: 13 additions & 0 deletions llm_on_ray/inference/models/hpu/gemma-2b-hpu.yaml
@@ -0,0 +1,13 @@
port: 8000
name: gemma-2b
route_prefix: /gemma-2b
num_replicas: 1
cpus_per_worker: 8
hpus_per_worker: 1
device: hpu
model_description:
model_id_or_path: google/gemma-2b
tokenizer_name_or_path: google/gemma-2b
chat_template: "llm_on_ray/inference/models/templates/template_gemma.jinja"
config:
use_auth_token: ' '
13 changes: 13 additions & 0 deletions llm_on_ray/inference/models/hpu/gpt-j-6b-hpu.yaml
@@ -0,0 +1,13 @@
port: 8000
name: gpt-j-6b
route_prefix: /gpt-j-6b
num_replicas: 1
cpus_per_worker: 8
hpus_per_worker: 1
device: hpu
model_description:
model_id_or_path: EleutherAI/gpt-j-6b
tokenizer_name_or_path: EleutherAI/gpt-j-6b
gpt_base_model: true
config:
use_auth_token: ''
14 changes: 14 additions & 0 deletions llm_on_ray/inference/models/hpu/gpt2-hpu.yaml
@@ -0,0 +1,14 @@
port: 8000
name: gpt2
route_prefix: /gpt2
num_replicas: 1
cpus_per_worker: 8
hpus_per_worker: 1
device: hpu
model_description:
model_id_or_path: gpt2
tokenizer_name_or_path: gpt2
chat_template: "llm_on_ray/inference/models/templates/template_gpt2.jinja"
gpt_base_model: true
config:
use_auth_token: ''
llm_on_ray/inference/models/hpu/llama-2-70b-chat-hf-hpu.yaml
@@ -10,5 +10,6 @@ device: hpu
model_description:
model_id_or_path: meta-llama/Llama-2-70b-chat-hf
tokenizer_name_or_path: meta-llama/Llama-2-70b-chat-hf
chat_template: "llm_on_ray/inference/models/templates/template_llama2.jinja"
config:
use_auth_token: ''
llm_on_ray/inference/models/hpu/llama-2-7b-chat-hf-hpu.yaml
@@ -8,5 +8,6 @@ device: hpu
model_description:
model_id_or_path: meta-llama/Llama-2-7b-chat-hf
tokenizer_name_or_path: meta-llama/Llama-2-7b-chat-hf
chat_template: "llm_on_ray/inference/models/templates/template_llama2.jinja"
config:
use_auth_token: ''
13 changes: 13 additions & 0 deletions llm_on_ray/inference/models/hpu/mistral-7b-v0.1-hpu.yaml
@@ -0,0 +1,13 @@
port: 8000
name: mistral-7b-v0.1
route_prefix: /mistral-7b-v0.1
num_replicas: 1
cpus_per_worker: 8
hpus_per_worker: 1
device: hpu
model_description:
model_id_or_path: mistralai/Mistral-7B-v0.1
tokenizer_name_or_path: mistralai/Mistral-7B-v0.1
chat_template: "llm_on_ray/inference/models/templates/template_mistral.jinja"
config:
use_auth_token: ''
13 changes: 13 additions & 0 deletions llm_on_ray/inference/models/hpu/mpt-7b-hpu.yaml
@@ -0,0 +1,13 @@
port: 8000
name: mpt-7b
route_prefix: /mpt-7b
num_replicas: 1
cpus_per_worker: 8
hpus_per_worker: 1
device: hpu
model_description:
model_id_or_path: EleutherAI/gpt-neox-20b
tokenizer_name_or_path: EleutherAI/gpt-neox-20b
config:
use_auth_token: ''
trust_remote_code: true
19 changes: 19 additions & 0 deletions llm_on_ray/inference/models/hpu/neural-chat-7b-v3-3-hpu.yaml
@@ -0,0 +1,19 @@
port: 8000
name: neural-chat-7b-v3-3
route_prefix: /neural-chat-7b-v3-3
num_replicas: 1
cpus_per_worker: 0
gpus_per_worker: 0
hpus_per_worker: 1
deepspeed: false
workers_per_group: 2
device: hpu
ipex:
enabled: false
precision: bf16
model_description:
model_id_or_path: Intel/neural-chat-7b-v3-3
tokenizer_name_or_path: Intel/neural-chat-7b-v3-3
chat_template: "llm_on_ray/inference/models/templates/template_neuralchat.jinja"
config:
use_auth_token: ''
2 changes: 2 additions & 0 deletions llm_on_ray/inference/models/hpu/neural-chat-7b-v3-3.yaml
@@ -15,3 +15,5 @@ model_description:
model_id_or_path: Intel/neural-chat-7b-v3-3
tokenizer_name_or_path: Intel/neural-chat-7b-v3-3
chat_template: "llm_on_ray/inference/models/templates/template_neuralchat.jinja"
config:
use_auth_token: ''