[Bug]: Cutlass 2:4 Sparsity + FP8/Int8 Quant RuntimeError: Error Internal #11763
Hi @leoyuppieqnew, is there any information you could share on how you prepared the compressed-tensors checkpoint? Details like the model itself, or even just its config.json, would be useful. Unfortunately the error message is unclear, and we are unsure how to reproduce it at the moment.
Sure, I used the llm-compressor one-shot method for sparsity and FP8 quantization. The model base is Qwen2-72B-Instruct, and its config is as follows:

```json
{
  "_name_or_path": "/home/qwen2_72b_tuwen_mix_24_10000/stage_sparsity",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 29568,
  "max_position_embeddings": 32768,
  "max_window_layers": 70,
  "model_type": "qwen2",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "output_router_logits": false,
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "input_activations": {
          "actorder": null,
          "block_structure": null,
          "dynamic": true,
          "group_size": null,
          "num_bits": 8,
          "observer": null,
          "observer_kwargs": {},
          "strategy": "token",
          "symmetric": true,
          "type": "float"
        },
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": null,
          "num_bits": 8,
          "observer": "minmax",
          "observer_kwargs": {},
          "strategy": "channel",
          "symmetric": true,
          "type": "float"
        }
      }
    },
    "format": "float-quantized",
    "global_compression_ratio": 1.4643128654975015,
    "ignore": [
      "lm_head"
    ],
    "kv_cache_scheme": null,
    "quant_method": "compressed-tensors",
    "quantization_status": "compressed",
    "sparsity_config": {
      "format": "dense",
      "global_sparsity": 0.48285508867414173,
      "ignore": [
        "lm_head"
      ],
      "registry_requires_subclass": false,
      "sparsity_structure": "2:4",
      "targets": [
        "Linear"
      ]
    }
  },
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.44.1",
  "use_cache": false,
  "use_sliding_window": false,
  "vocab_size": 152064
}
```
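For anyone hitting the same error, the relevant fields in the config above can be sanity-checked programmatically: the 2:4 sparse FP8 path discussed in this issue requires `quant_method: compressed-tensors`, float 8-bit weights, and a `sparsity_config` declaring a `2:4` structure. Below is a minimal sketch (a hypothetical helper, not vLLM's actual dispatch code) that inspects these fields from a loaded config.json:

```python
import json

def matches_24_sparse_fp8(config: dict) -> bool:
    """Return True if a model config declares the compressed-tensors
    2:4 sparsity + FP8 weight layout described in this issue."""
    qc = config.get("quantization_config", {})
    if qc.get("quant_method") != "compressed-tensors":
        return False
    # The sparsity structure must be semi-structured 2:4.
    if qc.get("sparsity_config", {}).get("sparsity_structure") != "2:4":
        return False
    # Weights must be 8-bit float (FP8) for the Cutlass sparse FP8 kernel.
    weights = qc.get("config_groups", {}).get("group_0", {}).get("weights", {})
    return weights.get("type") == "float" and weights.get("num_bits") == 8

# Trimmed-down version of the config posted above.
config = json.loads("""{
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "config_groups": {
      "group_0": {
        "weights": {"type": "float", "num_bits": 8, "strategy": "channel"}
      }
    },
    "sparsity_config": {"sparsity_structure": "2:4"}
  }
}""")
print(matches_24_sparse_fp8(config))  # → True
```

This only confirms the checkpoint layout matches what the kernel expects; it says nothing about whether the kernel itself runs on a given GPU, which is where the `Error Internal` appears to originate.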
Is this using the vLLM wheel?
Nope, it is compiled from source; the commit ID is a491d6f.
Thanks, we will take a look.
@leoyuppieqnew thanks for reporting the bug! Since you are building from source, could you apply the following patch, rebuild, and then provide the output?