[Bug]: Unable to load deepseek r1 on 8 x AMD MI300X AssertionError: FP8 weight padding is not supported in block quantization #375
Comments
With DeepSeek V3, please set the VLLM_FP8_PADDING=0 environment variable, as its quantization method is currently incompatible with FP8 padding.
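For anyone landing here later, a minimal sketch of how that variable can be passed when launching the server with the ROCm image mentioned in this issue. The model name, device flags, and serve options below are common-case assumptions, not the exact setup from this thread:

```bash
# Minimal sketch (assumed flags; adjust for your environment):
# disable FP8 weight padding before starting the vLLM OpenAI-compatible server.
docker run --rm --ipc=host --group-add video \
  --device=/dev/kfd --device=/dev/dri \
  -e VLLM_FP8_PADDING=0 \
  rocm/vllm-dev:nightly_main_20250120 \
  vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --trust-remote-code
```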
Getting much further now. Thanks for the quick suggestion:
I will post an update on whether or not this ends up fully working.
Hmm, it looks like there is a failure and it tries to write a GPU dump, but writing the dump fails because the disk runs out of space. The core issue, however, is something else. Any idea on how to debug further? Current flags used in my KubeAI custom object:
Current logs:
I got the same issue when trying to load DeepSeek V3 on 8x AMD MI300X GPUs. I set the VLLM_FP8_PADDING=0 environment variable and am now stuck at the following line while running. Thanks for any help. (VllmWorkerProcess pid=949) INFO 01-21 23:11:24 model_runner.py:1100] Loading model weights took 79.3596 GB
It's now working after I switched to these flags:
Note that I decreased the context length and reduced the number of flags compared to what I set before. Maybe this works for you too, @billcsm.
The memory access fault is a new issue that was discovered today; we hope to have a fix ready within the coming days.
I can reproduce the memory access fault if I change my working config to use a 120k context length, so it seems to be related to larger context lengths. I'm trying to find something between 8k and 120k that doesn't result in a memory access fault.
Leaving the other parameters at their default values, max-model-len=32768 should work.
Testing this now:
Maybe it has to be a multiple of 8? Note that's just a wild guess, since I know nothing about the internals of vLLM or GPUs.
Getting the same memory access fault when using a 64768 context length.
@samos123 and @gshtras, max-model-len=32768 failed with HIP out of memory, and max-model-len=64768 failed with a memory access fault: Memory access fault by GPU node-13 (Agent handle: 0x479962f0) on address 0x7ee718e79000. Reason: Unknown.
32k context length is working for me with this config:
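The exact config wasn't captured above, so purely as an illustration (assumed model name and flag values, not the poster's actual command), a 32k-context launch across 8 GPUs could look like:

```bash
# Illustration only; not the exact config from this comment.
export VLLM_FP8_PADDING=0        # still needed for DeepSeek's block-quantized FP8 weights
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --trust-remote-code
```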
performance when sending 1000 requests all at once:
It's unclear why 4 requests failed, since no errors were printed in the vLLM logs.
@samos123 and @gshtras, thank you for your help!
@samos123, I ran the performance benchmark with a 32k context length on DeepSeek-V3. Here is the result for 1000 requests (request-rate 10):
My test lost 52 requests. No error was printed.
Were you by chance running benchmark_serving with a random dataset as a param?
No, I was using the ShareGPT dataset with a fixed seed.
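For context, the benchmark being discussed is vLLM's benchmarks/benchmark_serving.py. A sketch of the ShareGPT invocation follows; the dataset path, seed value, and model name are assumptions, and flag availability may vary by vLLM version (check --help on your build):

```bash
# Sketch of the 1000-request ShareGPT benchmark run; paths and seed are assumed.
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model deepseek-ai/DeepSeek-R1 \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000 \
  --request-rate 10 \
  --seed 42
```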
For the previous test, I used the ShareGPT dataset (ShareGPT_V3_unfiltered_cleaned_split.json). Here are the results of two runs of benchmark_serving with a random dataset, 1000 requests (request-rate 10); the two "Serving Benchmark Result" tables were not captured here. No error was printed.
I changed to an 8k context length and re-ran benchmark_serving with a random dataset, 1000 requests (request-rate 10). Still got lost requests; no error was printed. (The "Serving Benchmark Result" table was not captured here.)
This seems to be an AMD-specific issue. I've run the exact same benchmark with the same flags on NVIDIA GPUs and always get 1000 successful requests.
Could you provide the complete logs from both the server and the client benchmark?
Yes, let me file a separate bug for this, since it's unrelated to DeepSeek R1. I've seen this on Llama 3.1 70B as well: https://substratus.ai/blog/benchmarking-llama-3.1-70b-amd-mi300x For reference, here is the same benchmark on L4: https://substratus.ai/blog/benchmarking-llama-3.1-70b-on-l4 Funnily enough, the 405B model on AMD seems to work fine: https://substratus.ai/blog/benchmarking-llama-3.1-405b-amd-mi300x
Your current environment
Image I'm using:
rocm/vllm-dev:nightly_main_20250120
vLLM version and flags used:
Model Input Dumps
No response
🐛 Describe the bug
Error: