Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) #12010
This PR adds support for FP8 quantization and inference on Intel Gaudi (HPU) using INC (Intel Neural Compressor).
Currently, quantization is validated only on Llama models.
Measurements are device-dependent: do not use measurements collected on Gaudi3 with Gaudi2 accelerators, as doing so may cause accuracy issues.
Running Inference in FP8 with INC:
Specify the quantization method "inc" and the KV cache dtype "fp8_inc" as parameters to the LLM object.
Set the environment variable "QUANT_CONFIG" to point to a JSON config file (https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html#supported-json-config-file-options) in QUANTIZE mode. Make sure the folder specified as "dump_stats_path" in the JSON config file contains measurement files or scale files. (If no scale files exist, they are generated from the measurement files during the inference run.)
At the end of the run, the model executor's shutdown method must be called.
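A minimal sketch of these steps, assuming a Gaudi build of vLLM with this PR applied; the model name, config path, and the exact shutdown accessor (`llm.llm_engine.model_executor.shutdown()`) are illustrative assumptions, not confirmed by this description:

```python
import os

from vllm import LLM, SamplingParams

# QUANT_CONFIG must point to an INC JSON config file in QUANTIZE mode; its
# "dump_stats_path" folder should already contain measurement or scale files.
os.environ.setdefault("QUANT_CONFIG", "/path/to/quant_config.json")  # placeholder path

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    quantization="inc",        # FP8 quantization via Intel Neural Compressor
    kv_cache_dtype="fp8_inc",  # keep the KV cache in FP8 as well
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)

# The model executor must be shut down explicitly at the end of the run.
llm.llm_engine.model_executor.shutdown()
```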
More information on vLLM quantization using INC is available in the documentation added in this PR: https://github.com/vllm-project/vllm/blob/main/docs/source/features/quantization/inc.md
This PR also adds a new flag, "weights_load_device", which allows loading the model's (unquantized) weights onto a different device than the one the model will run on. If not provided, the existing behavior is preserved and the device specified in the device config is used.
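As a rough illustration of the new flag (the "cpu" device string, the model name, and passing the flag directly through the LLM constructor are assumptions for this sketch):

```python
from vllm import LLM

# Sketch only: load the unquantized weights on the CPU, while the model itself
# runs on the device from the device config (HPU). "cpu" is an assumed value.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    quantization="inc",
    kv_cache_dtype="fp8_inc",
    weights_load_device="cpu",
)
```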