Inference Mixtral on Gaudi #249

Open · Deegue opened this issue Jun 12, 2024 · 3 comments

@Deegue (Contributor) commented Jun 12, 2024

Model: mistralai/Mixtral-8x7B-Instruct-v0.1

When deployed on a single card, it reports an OOM error:

(ServeController pid=207518) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 825, in _apply
(ServeController pid=207518) param_applied = fn(param)
(ServeController pid=207518) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1153, in convert
(ServeController pid=207518) return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
(ServeController pid=207518) File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 53, in torch_function
(ServeController pid=207518) return super().torch_function(func, types, new_args, kwargs)
(ServeController pid=207518) RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::234881024 (224)MB

Before the error occurred, memory usage looked like this:
[screenshot: single-card memory usage]
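
To see how close the card is to its limit when the failure happens, HPU memory can be queried in-process. A minimal sketch, assuming the habana_frameworks torch.hpu memory API mirrors torch.cuda (function names may vary across SynapseAI releases):

import habana_frameworks.torch as htorch  # assumed import path

# Print current and peak HPU memory in GB; call this around model
# loading to see which step exhausts the device.
gb = 1024 ** 3
print("allocated (GB):", htorch.hpu.memory_allocated() / gb)
print("max allocated (GB):", htorch.hpu.max_memory_allocated() / gb)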

With 8 cards and DeepSpeed, the model deploys successfully.
Memory usage looked like this:
[screenshot: 8-card memory usage]
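
For comparison, outside Ray a multi-card DeepSpeed run of this model would typically go through optimum-habana's gaudi_spawn.py launcher. A hedged sketch, assuming the layout of the optimum-habana text-generation examples (paths and flags may differ by version):

python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
    --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --batch_size 1 \
    --max_new_tokens 100 \
    --use_kv_cache \
    --use_hpu_graphs \
    --bf16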

I suspect queries sometimes fail because there aren't enough free cards to deploy on; it runs well once I kill all other parallel tasks.

A correct result looks like this:

You are a helpful assistant.
Instruction: Tell me a long story with many words.
Response:
Absolutely, I would be more than happy to assist you!
Instruction: This should be more complex.
Response:
Certainly, I would be more than happy to assist you!
Instruction: This task is for the helper to return a complex sentence with many words. Tell me a long story and I will reply that I like long or complex sentences. Also, I am asking many question and expecting answers.
Response:
As an AI language model, I can generate complex sentences with many words. Please provide more details or a specific context for the story you want me to

@carsonwang (Contributor) commented

Please run this on a single card only; multiple cards are not supported according to Habana's documentation.
Please check the following document and run it successfully without Ray first.
https://github.com/huggingface/optimum-habana

@Deegue (Contributor, Author) commented Jun 12, 2024

Please run this on a single card only; multiple cards are not supported according to Habana's documentation. Please check the following document and run it successfully without Ray first. https://github.com/huggingface/optimum-habana

The result of running on a single card is noted above.
I have run the same model without Ray, and it succeeds:

Input/outputs:
input 1: ('Tell me a long story with many words.',)
output 1: ('Tell me a long story with many words.\n\nOnce upon a time, in a land far, far away, there was a beautiful princess named Sophia. She had long, golden hair that shone like the sun, and deep blue eyes that sparkled like the ocean. She lived in a grand castle on the top of a hill, surrounded by lush gardens and rolling meadows.\n\nSophia was loved by all who knew her, but she was lonely. She longed for someone to share her life with,',)
Stats:
Throughput (including tokenization) = 23.7284528351755 tokens/second
Number of HPU graphs = 16
Memory allocated = 87.63 GB
Max memory allocated = 87.63 GB
Total memory available = 94.62 GB
Graph compilation duration = 13.682237292639911 seconds

Memory usage stays below the single-card limit:
[screenshot: single-card memory usage without Ray]
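
As a rough sanity check on these numbers (counting bf16 weights only, and assuming the reported figures are GiB): Mixtral-8x7B has about 46.7B parameters, and 46.7e9 params × 2 bytes ≈ 93.4 GB ≈ 87.0 GiB, which lines up with the 87.63 GB reported above. That leaves only ~7 GB of headroom on a 94.62 GB card, so even a modest extra allocation, like the 224 MB request in the Ray traceback, can push a single card over the limit.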

@Deegue (Contributor, Author) commented Jun 12, 2024

By the way, the command for running Mixtral on Habana without Ray is:

python run_generation.py \
    --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --batch_size 1 \
    --max_new_tokens 100 \
    --use_kv_cache \
    --use_hpu_graphs \
    --bf16 \
    --token xxx \
    --prompt 'Tell me a long story with many words.'
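
For context on the flags: --use_kv_cache enables key/value caching during decoding, --use_hpu_graphs captures HPU graphs to cut host launch overhead (the stats above report 16 of them), and --bf16 loads the weights in bfloat16, which is what keeps the ~87.6 GB footprint within a single card.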
