
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) #12010

Open
nirda7 wants to merge 18 commits into main from dev/hpu_fp8
Conversation

@nirda7 (Contributor) commented Jan 13, 2025

This PR adds support for FP8 quantization and inference on Intel Gaudi (HPU) using INC (Intel Neural Compressor).
Currently, quantization has been validated only on Llama models.

Running inference in FP8 with INC:
Specify the quantization method "inc" and the KV cache dtype "fp8_inc" as parameters to the LLM object.
You must also set the environment variable "QUANT_CONFIG" to point to a JSON config file (https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html#supported-json-config-file-options) in QUANTIZE mode. Make sure measurement files/scale files exist in the folder specified as the "dump_stats_path" in the JSON config file. (If no scale files exist, they are generated during the inference run from the measurement files.)
At the end of the run, the model executor's shutdown method must be called. A minimal usage sketch is shown below.

More information on vLLM quantization with INC is available in the documentation added by this PR: https://github.com/vllm-project/vllm/blob/main/docs/source/features/quantization/inc.md

This PR also adds a new flag, "weights_load_device", which allows loading the model's (unquantized) weights onto a different device than the one the model will run on. If not provided, the previous behavior is preserved and the device specified in the device config is used. A hedged example follows.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small but essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the documentation Improvements or additions to documentation label Jan 13, 2025
@mgoin mgoin self-requested a review January 13, 2025 19:12
@mergify mergify bot added the ci/build label Jan 22, 2025
@nirda7 nirda7 force-pushed the dev/hpu_fp8 branch 3 times, most recently from 5c04292 to ab1c832 Compare January 27, 2025 00:12
@nirda7 nirda7 force-pushed the dev/hpu_fp8 branch 2 times, most recently from d1662df to 8a2ce5f Compare February 6, 2025 15:13
@nirda7 nirda7 force-pushed the dev/hpu_fp8 branch 2 times, most recently from 58e7d72 to 0c1f134 Compare February 16, 2025 10:09

mergify bot commented Feb 19, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @nirda7.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
