forked from vllm-project/vllm
Rebase 2025-02-10 #810
Open
kzawora-intel wants to merge 251 commits into habana_main from private/kzawora/rebase_2025_02_10
+50,436 −10,666
Conversation
) Signed-off-by: Woosuk Kwon <[email protected]>
Signed-off-by: Divakar Verma <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
…-project#12560) Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: [email protected] <[email protected]> Signed-off-by: mgoin <[email protected]> Co-authored-by: mgoin <[email protected]> Co-authored-by: simon-mo <[email protected]>
…project#12555) Signed-off-by: npanpaliya <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: Aleksandr Malyshev <[email protected]> Co-authored-by: Aleksandr Malyshev <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: simon-mo <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: simon-mo <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Zhuohan Li <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Alexander Matveev <[email protected]> Co-authored-by: simon-mo <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Co-authored-by: simon-mo <[email protected]>
It's very annoying when I forget to add `-s` in `git commit` to sign off, because I then need to `git rebase HEAD~1 --signoff` and `git push -f` to fix the DCO. This PR adds a hook that signs off commits automatically when `-s` is missing. The only change on the user side is that users now have to install two hooks, so instead of just ``` pre-commit install ``` we now need ``` pre-commit install --hook-type pre-commit --hook-type commit-msg ``` Note that even if users still only install the pre-commit hook, they won't get any error in `git commit`; the sign-off hook just won't run. cc @hmellor @youkaichao --------- Signed-off-by: Cody Yu <[email protected]>
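As a rough illustration of what such a commit-msg hook can do, here is a minimal sketch; the script, the author lookup via `git var GIT_AUTHOR_IDENT`, and the trailer handling are assumptions for illustration, not the hook this PR actually installs:

```python
#!/usr/bin/env python3
# Hypothetical commit-msg hook sketch: append a Signed-off-by trailer when it is missing.
# The real hook installed via `pre-commit install --hook-type commit-msg` may differ.
import subprocess
import sys


def main(msg_path: str) -> int:
    # git passes the path of the commit message file as the first argument.
    ident = subprocess.check_output(["git", "var", "GIT_AUTHOR_IDENT"], text=True)
    # GIT_AUTHOR_IDENT looks like "Name <email> 1234567890 +0000"; keep "Name <email>".
    author = " ".join(ident.split()[:-2])
    signoff = f"Signed-off-by: {author}"

    with open(msg_path, "r+", encoding="utf-8") as f:
        message = f.read()
        if signoff not in message:
            f.write(("" if message.endswith("\n") else "\n") + signoff + "\n")
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```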
- Create v1 design document section in docs. - Add prefix caching design doc. @WoosukKwon @ywang96 --------- Signed-off-by: Cody Yu <[email protected]>
…oject#12603) This PR adds an extra key to the block hash, to generate different hash values for two blocks with the same token string but different extra_keys in their parent blocks. For example, it generates different hash values for the second block of the following two requests: ```python request1 = make_request( request_id=0, prompt_token_ids=[_ for _ in range(6)], mm_positions=[{ "offset": 0, "length": 3 }, { "offset": 3, "length": 3 }], mm_hashes=["hash1", "hash2"], ) request2 = make_request( request_id=1, prompt_token_ids=[_ for _ in range(6)], mm_positions=[{ "offset": 0, "length": 3 }, { "offset": 3, "length": 3 }], mm_hashes=["hash3", "hash2"], ) ``` --------- Signed-off-by: Chen Zhang <[email protected]>
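To make the collision concrete, here is a self-contained sketch of the hashing idea; `hash_block` and its arguments are hypothetical and do not mirror vLLM's actual prefix-caching code:

```python
# Illustrative only: shows why extra_keys must be folded into the block hash.
from typing import Optional, Tuple


def hash_block(parent_hash: Optional[int],
               token_ids: Tuple[int, ...],
               extra_keys: Tuple[str, ...] = ()) -> int:
    # Including extra_keys (e.g. multimodal content hashes) distinguishes blocks
    # whose token ids are otherwise identical.
    return hash((parent_hash, token_ids, extra_keys))


# Block size 3: block 1 holds tokens 0-2 (mm hash "hash1" vs "hash3"),
# block 2 holds tokens 3-5 (mm hash "hash2" in both requests).
parent1 = hash_block(None, (0, 1, 2), ("hash1",))
parent2 = hash_block(None, (0, 1, 2), ("hash3",))

# Without the extra keys both parents would hash identically and the second
# blocks would collide; with them the second blocks get distinct hashes.
assert hash_block(parent1, (3, 4, 5), ("hash2",)) != hash_block(parent2, (3, 4, 5), ("hash2",))
```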
Instead of having to create a new build with the release version passed in as an env var.
SUMMARY: * previous PR for pulling in block configs also changed defaults (https://github.com/vllm-project/vllm/pull/11589/files) for FP8 * this broke L4 MoE since there was not enough SHM for the default configuration * this reverts the non-block example to the default Signed-off-by: [email protected] <[email protected]>
…DeepSeekV3 (vllm-project#12587) Integrates the block-quantized kernels introduced in vllm-project#11868 for use in linear layers. Signed-off-by: Tyler Michael Smith <[email protected]>
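For readers unfamiliar with block quantization, the sketch below shows the basic idea of per-block (e.g. 128x128) weight scales; it uses int8 and an unfused dequantize-then-matmul purely for illustration, whereas the actual kernels work on FP8 weights and fuse the scaling into the GEMM:

```python
# Rough sketch of block-wise weight scaling (illustrative shapes and dtypes).
import torch

block = 128
w_q = torch.randint(-127, 128, (1024, 1024), dtype=torch.int8)   # quantized weight
scales = torch.rand(1024 // block, 1024 // block)                 # one scale per 128x128 tile

# Expand each tile's scale over its 128x128 region and dequantize.
w = w_q.float() * scales.repeat_interleave(block, 0).repeat_interleave(block, 1)

x = torch.randn(4, 1024)
y = x @ w.t()                                                      # (4, 1024) output
```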
…2563) **[Guided decoding performance optimization]** Sending the guided decoding bitmask in xgrammar to the GPU (`self.token_bitmask.to(scores.device)`) is a blocking operation that prevents the CPU from pre-launching the sampler kernels. The CPU waits until decode is complete, then copies the bitmask over. This PR makes the operation async by setting `non_blocking=True`. Currently, the CPU is blocked on a `cudaStreamSynchronize` and only launches the sampling kernels after the bitmask is applied. Below is the Nsys profile for one decode phase from Llama 3.1 8B. ![image](https://github.com/user-attachments/assets/8997eae1-b822-4f52-beb8-ef19a7c6b824) With the optimization, this is no longer the case: ![image](https://github.com/user-attachments/assets/6d5ea83f-f169-4f98-a8c1-41c719b3e1e7) --------- Signed-off-by: Ryan N <[email protected]>
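A minimal sketch of the change in host-to-device copy behavior, assuming a CUDA device and a pinned-memory bitmask (tensor names are illustrative, not the actual xgrammar integration):

```python
import torch

scores = torch.randn(4, 32000, device="cuda")                        # logits on the GPU
token_bitmask = torch.zeros(4, 32000, dtype=torch.bool).pin_memory()

# Blocking copy: the CPU stalls until the transfer finishes before it can
# enqueue the sampling kernels.
mask_blocking = token_bitmask.to(scores.device)

# Non-blocking copy: the transfer is enqueued on the stream and the CPU keeps
# pre-launching sampler kernels; stream ordering keeps the result correct.
mask_async = token_bitmask.to(scores.device, non_blocking=True)
scores.masked_fill_(~mask_async, float("-inf"))
```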
- Make device tab names more explicit - Add comprehensive list of devices to https://docs.vllm.ai/en/latest/getting_started/installation/index.html - Add `attention` blocks to the intro of all devices that don't have pre-built wheels/images --------- Signed-off-by: Harry Mellor <[email protected]>
Based on a request by @mgoin , with @kylesayrs we have added an example doc for int4 w4a16 quantization, following the pre-existing int8 w8a8 quantization example and the example available in [`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py) FIX #n/a (no issue created) @kylesayrs and I have discussed a couple additional improvements for the quantization docs. We will revisit at a later date, possibly including: - A section for "choosing the correct quantization scheme/ compression technique" - Additional vision or audio calibration datasets --------- Signed-off-by: Brian Dellabetta <[email protected]> Co-authored-by: Michael Goin <[email protected]>
SUMMARY: * avoid crashing the engine when we get an input longer than max_model_len FIX vllm-project#12567
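A sketch of the per-request validation idea; the exception and function names here are hypothetical, not the engine's actual code:

```python
class RequestValidationError(ValueError):
    """Raised for a single bad request; the serving layer catches it and returns
    an error response instead of letting the engine crash."""


def validate_prompt_len(prompt_token_ids: list[int], max_model_len: int) -> None:
    if len(prompt_token_ids) > max_model_len:
        raise RequestValidationError(
            f"Prompt has {len(prompt_token_ids)} tokens, but max_model_len is "
            f"{max_model_len}; rejecting the request instead of aborting the engine.")
```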
…llm-project#11161) FIX issue vllm-project#9688 vllm-project#11086 vllm-project#12487 --------- Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: weilong.yu <[email protected]> Co-authored-by: Jee Jee Li <[email protected]>
…oject#12617) Without this PR --------------- Quantizing models with llm-compressor and a recipe that explicitly lists names of layers produces a model that is not loadable by vLLM (i.e. `vllm serve <model>` fails with `raise ValueError(f"Unable to find matching target for {module} in the ...`). Example recipe: ``` recipe = """ quantization_stage: run_type: oneshot quantization_modifiers: GPTQModifier: ignore: ["lm_head"] config_groups: group_0: weights: num_bits: 4 type: "int" symmetric: true strategy: "group" group_size: 128 targets: [ "model.layers.0.mlp.down_proj", "model.layers.2.mlp.down_proj", "model.layers.3.mlp.down_proj", "model.layers.4.mlp.down_proj", "model.layers.5.mlp.down_proj", "model.layers.6.mlp.down_proj", "model.layers.7.mlp.down_proj", "model.layers.8.mlp.down_proj", "model.layers.9.mlp.down_proj", "model.layers.10.mlp.down_proj", "model.layers.11.mlp.down_proj", "model.layers.12.mlp.down_proj", "model.layers.13.mlp.down_proj", "model.layers.14.mlp.down_proj", "model.layers.15.mlp.down_proj", "model.layers.16.mlp.down_proj", "model.layers.17.mlp.down_proj", "model.layers.19.mlp.down_proj", "model.layers.21.mlp.down_proj", "model.layers.22.mlp.down_proj", . . . ] """ ``` To reproduce the vLLM error: ```bash vllm serve nm-testing/eldar-test ``` With this PR ------------ Models are loaded correctly without any errors.
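A sketch of the matching behavior this fix enables; `matches_target` and its structure are hypothetical, not vLLM's actual compressed-tensors utilities, and the `re:` prefix convention for regex-style targets is assumed here:

```python
import re


def matches_target(module_name: str, module_class: str, targets: list[str]) -> bool:
    """A target entry may be a regex ("re:..."), a layer class name ("Linear"),
    or an explicit module path copied from the recipe."""
    for target in targets:
        if target.startswith("re:"):
            if re.match(target[3:], module_name):
                return True
        elif target in (module_class, module_name):
            return True
    return False


# With explicit-name matching, a recipe listing "model.layers.0.mlp.down_proj"
# no longer triggers "Unable to find matching target ...".
assert matches_target("model.layers.0.mlp.down_proj", "Linear",
                      ["model.layers.0.mlp.down_proj"])
```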
Fixes `is_marlin` not being passed into `get_default_config`. Also allows `--tensor-parallel-size` in addition to `-tp` and `--tp-size`. Signed-off-by: Tyler Michael Smith <[email protected]>
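For the flag-aliasing half of the change, a hypothetical argparse sketch (not the actual tuning script's parser) showing how all three spellings can resolve to the same value:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--tensor-parallel-size", "-tp", "--tp-size",
                    dest="tp_size", type=int, default=1)

args = parser.parse_args(["--tensor-parallel-size", "2"])
assert args.tp_size == 2
```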
…oject#12517) This PR addresses a bug in the Cutlass integration where the `sparsity_config.ignore` list was not being respected. When only a subset of modules were configured as Sparse24, the system incorrectly selected Cutlass for non-sparse modules as well. This update ensures the correct scheme is selected for non-sparse modules, fixing this behavior. --- ### Changes - Updated logic to correctly respect `sparsity_config.ignore`. - Ensured non-sparse modules use the appropriate scheme instead of defaulting to Cutlass. --- <details> <summary>Testing Setup</summary> The fix has been tested on top of [this diff](vllm-project#12097). #### Steps to Test: ```bash git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support git revert --no-edit aa2cd2c # revert Tyler's commit to turn off Cutlass for W16A16 git cherry-pick ca624cd # this branch ``` #### Additional Patch Required: ```diff diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py index a54177c1c..f916dd0c9 100644 --- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py +++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py @@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs, QuantizationStrategy, QuantizationType) from pydantic import BaseModel - +from vllm.logger import init_logger from vllm.model_executor.layers.fused_moe import FusedMoE from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase, UnquantizedLinearMethod) @@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import ( should_ignore_layer) from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod from vllm.platforms import current_platform - +logger = init_logger(__name__) __all__ = ["CompressedTensorsLinearMethod"] SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config" ``` Apply using: ```bash git apply logging-patch.patch ``` </details> --- <details> <summary>Models Tested</summary> - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24` - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24` - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed` - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed` </details> --- <details> <summary>Example Output</summary> #### Layers 0-5 (Sparse24) ``` Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj ... ``` #### Layers 6+ (Non-Sparse, FP8) ``` Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj ... ``` </details> **Note:** Assumed all modules in fused layers such as `QKV_proj` and `Gate_up_proj` follow the same quantization/pruning scheme. --- For related tasks using the Asana app for GitHub, refer to [[this link](https://app.asana.com/0/0/1209227810815160)](https://app.asana.com/0/0/1209227810815160). Signed-off-by: Rahul Tuli <[email protected]>
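A condensed sketch of the selection rule described above; `select_scheme` is hypothetical, with scheme names borrowed from the example output:

```python
def select_scheme(module_name: str, sparsity_ignore: list[str]) -> str:
    # Modules listed in sparsity_config.ignore must not be routed to the
    # Sparse24 Cutlass scheme; fall back to the non-sparse FP8 scheme instead.
    if module_name in sparsity_ignore:
        return "CompressedTensorsW8A8Fp8"
    return "CompressedTensors24"


print(select_scheme("model.layers.6.mlp.down_proj",
                    ["model.layers.6.mlp.down_proj"]))    # CompressedTensorsW8A8Fp8
print(select_scheme("model.layers.0.mlp.down_proj", []))  # CompressedTensors24
```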
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: mgoin <[email protected]>
…oject#13198) Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Gregory Shtrasberg <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]>
…ere is a finished request in batch (vllm-project#13126)
… in run_batch (vllm-project#12927) Signed-off-by: Pooya Davoodi <[email protected]>