Inference Mixtral on Gaudi #249
Please only run this on a single card; multiple cards are not supported according to Habana's documentation.
The result of running with a single card is noted above.
By the way, the config for running the Mixtral model on Habana without Ray is:
Model: mistralai/Mixtral-8x7B-Instruct-v0.1
Deployed with a single card, it reports an OOM error:
Before the error occurred, memory usage looked like:
With 8 cards and DeepSpeed, the model deploys successfully.
Memory usage looked like:
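A rough back-of-the-envelope estimate (my own figures, not from this thread) is consistent with the behavior above: the bf16 weights of Mixtral-8x7B alone nearly fill one Gaudi2 card's HBM, leaving no headroom for activations or KV cache, while sharding across 8 cards with DeepSpeed leaves ample room per card. The parameter count and HBM size below are assumptions for illustration.

```python
# Hypothetical memory estimate for Mixtral-8x7B on Gaudi2 (assumed numbers).
PARAMS = 46.7e9          # approximate total parameter count of Mixtral-8x7B
BYTES_PER_PARAM = 2      # bf16 weights
HBM_PER_CARD_GB = 96     # Gaudi2 HBM per card (assumption)
NUM_CARDS = 8

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
per_card_gb = weights_gb / NUM_CARDS

# Weights alone consume nearly all of a single card's 96 GB, so a single-card
# deployment OOMs once activations and KV cache are added; sharded over 8
# cards, each card holds only a small fraction of the weights.
print(f"weights alone:          {weights_gb:.1f} GB")
print(f"sharded over {NUM_CARDS} cards:    {per_card_gb:.1f} GB per card")
```

This matches the observed pattern: near-full memory usage just before the single-card OOM, and modest per-card usage in the 8-card DeepSpeed run.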
I suspect queries sometimes fail because there are not enough cards available for deployment; it runs well once I kill all other parallel tasks.
A correct result looks like: