Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) #12010
This PR adds support for FP8 quantization and inference on Intel Gaudi (HPU) using INC (Intel Neural Compressor).
Currently, quantization is validated only on Llama models.
Measurements are device-dependent: do not use measurements collected on Gaudi3 with Gaudi2 accelerators, as doing so may cause accuracy issues.
Running Inference in FP8 with INC:
Specify the quantization method "inc" and the KV cache dtype "fp8_inc" as parameters to the LLM object.
Set the environment variable "QUANT_CONFIG" to point to a JSON config file (https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html#supported-json-config-file-options) in QUANTIZE mode. Make sure the folder specified as "dump_stats_path" in the JSON config file contains measurement files or scale files. (If no scale files exist, they are generated from the measurement files during the inference run.)
At the end of the run, the model executor's shutdown method must be called.
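A minimal sketch of these steps, assuming a Gaudi build of vLLM with this PR applied; the model name, config path, and the exact shutdown accessor (`llm.llm_engine.model_executor.shutdown()`) are illustrative assumptions, not confirmed by this description:

```python
import os

from vllm import LLM, SamplingParams

# QUANT_CONFIG must point to an INC JSON config file in QUANTIZE mode; its
# "dump_stats_path" folder should already contain measurement or scale files.
os.environ.setdefault("QUANT_CONFIG", "/path/to/quant_config.json")  # placeholder path

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    quantization="inc",        # FP8 quantization via Intel Neural Compressor
    kv_cache_dtype="fp8_inc",  # keep the KV cache in FP8 as well
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)

# The model executor must be shut down explicitly at the end of the run.
llm.llm_engine.model_executor.shutdown()
```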
More information on vLLM quantization using INC is available in the documentation added in this PR: https://github.com/vllm-project/vllm/blob/main/docs/source/features/quantization/inc.md
This PR also adds a new flag, "weights_load_device", which allows loading the model's (unquantized) weights onto a different device than the one the model will run on. If not provided, the existing behavior is preserved and the device specified in the device config is used.
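As a rough illustration of the new flag (the "cpu" device string, the model name, and passing the flag directly through the LLM constructor are assumptions for this sketch):

```python
from vllm import LLM

# Sketch only: load the unquantized weights on the CPU, while the model itself
# runs on the device from the device config (HPU). "cpu" is an assumed value.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    quantization="inc",
    kv_cache_dtype="fp8_inc",
    weights_load_device="cpu",
)
```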