A750: Stable diffusion 2x slow down comparing IPEX-XPU 2.5 with IPEX-XPU 2.0 on Windows #749
Comments
Update: confirmed vae decode is the bottleneck

```python
# sd15.py
import time

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 (registers the xpu backend)
from diffusers import DEISMultistepScheduler, StableDiffusionPipeline

model_path = "Lykon/dreamshaper-8"
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DEISMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("xpu")

# Monkey-patch vae.decode with a timing wrapper; synchronize before stopping
# the clock so asynchronously launched kernels are included in the measurement.
func = pipe.vae.decode
def profile(*args, **kwargs):
    start = time.time()
    result = func(*args, **kwargs)
    torch.xpu.synchronize()
    end = time.time()
    print("Time taken for vae.decode: ", end - start)
    return result
pipe.vae.decode = profile

prompt = "portrait photo of muscular bearded guy in a worn mech suit, light bokeh, intricate, steel metal, elegant, sharp focus, soft lighting, vibrant colors"
generator = torch.manual_seed(33)

# warmup
image = pipe(prompt, generator=generator, num_inference_steps=20).images[0]

start = time.time()
for i in range(10):
    image = pipe(prompt, generator=generator, num_inference_steps=20).images[0]
end = time.time()
print("Time taken per image: ", (end - start) / 10)
image.save("./image.png")
```

Running with IPEX 2.0:
Running with IPEX 2.5:
5x VAE decoding slowdown.
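The monkey-patch timing used in the reproducer above generalizes to any pipeline component. A stdlib-only sketch of the same pattern, where `synchronize` is a placeholder for a device flush such as `torch.xpu.synchronize` on real hardware:

```python
import functools
import time

def timed(func, synchronize=None, label=None):
    """Wrap func so each call prints its wall-clock duration.

    synchronize: optional callable (e.g. torch.xpu.synchronize) that flushes
    pending device work before the clock stops; without it, asynchronous
    kernel launches make GPU timings meaningless.
    """
    name = label or getattr(func, "__name__", "call")

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        if synchronize is not None:
            synchronize()
        wrapper.last = time.perf_counter() - start
        print(f"Time taken for {name}: {wrapper.last:.4f}s")
        return result

    wrapper.last = None  # duration of the most recent call, in seconds
    return wrapper
```

With this helper, the patch in the script becomes `pipe.vae.decode = timed(pipe.vae.decode, synchronize=torch.xpu.synchronize, label="vae.decode")`.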
I've also noticed a considerable speed regression after installing 2.5.10 on my A770: almost 3x slower than 2.3.1 when using Fooocus or Forge. Interesting that it seems to be the VAE decoding...
I am not sure if this is an IPEX issue or a PyTorch 2.5 issue, because I am seeing this behavior with multiple vendors. Also, add this line for GPUs that don't support Flash Attention or Memory Efficient Attention (like Intel and older AMD) when using PyTorch 2.5 and above:
After installing IPEX 2.5, I tried ComfyUI on Linux with my A770 and encountered a similar issue.
The performance slowdown of Stable Diffusion on IPEX 2.3 and 2.5 is expected. During our testing on IPEX 2.3, we discovered that images generated by Stable Diffusion occasionally exhibited accuracy issues, often resulting in meaningless output images. Through debugging, we identified the root cause: on ARC, the IPEX SDPA kernel uses the "math" implementation path, which is a naive approach (https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html). With FP16 data, this path can produce GEMM values that exceed the representable range of FP16, ultimately leading to NaNs and meaningless images.

To address this problem and ensure functional correctness, we convert the data to FP32 before performing the SDPA calculation and convert back to FP16 afterwards. While this effectively resolves the NaN issue, it comes at the cost of performance. This trade-off is necessary to maintain Stable Diffusion's functionality.

We have addressed this limitation on the next-generation BMG hardware by adopting Flash Attention, which delivers significantly better performance. However, due to architectural differences between ARC and BMG, this improvement cannot be backported to ARC. Consequently, ARC hardware has to accept this performance compromise.
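The FP16 overflow described above can be illustrated without any GPU. The sketch below uses Python's `struct` half-precision format to emulate FP16 rounding at every accumulation step of a dot product; the head dimension and logit magnitudes are made-up illustrative values, not taken from the actual SDPA kernel (which also applies a 1/sqrt(d) scale):

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE binary16 ('e' struct format).

    Values beyond the FP16 range (max finite value 65504) raise
    OverflowError when packed; we map that to +/-inf, mimicking
    FP16 arithmetic overflowing to infinity.
    """
    try:
        return struct.unpack('<e', struct.pack('<e', x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def dot(q, k, accum=to_fp16):
    """Dot product with every intermediate value rounded by `accum`."""
    s = 0.0
    for a, b in zip(q, k):
        s = accum(accum(a * b) + s)
    return s

head_dim = 64
q = [32.0] * head_dim  # each product is 1024; 64 of them sum to 65536 > 65504
k = [32.0] * head_dim
print(dot(q, k))                # FP16 accumulation overflows: inf
print(dot(q, k, accum=float))   # wider accumulation is fine: 65536.0
```

Once a single attention logit overflows to inf, the softmax that follows produces NaNs, which is consistent with the meaningless images described above; upcasting to FP32 (as in the second call) avoids the overflow.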
Casting to FP32 doesn't fix the root cause; it is just a workaround for the deeper accuracy issues in IPEX.
The FP32-upcasting SDPA should still honor this flag and be disabled when it is set to true:
Yes, converting data from FP16 to FP32 does not fundamentally resolve the accuracy issues; it serves as a trade-off to prevent the NaN problems that result in meaningless images. On the ARC architecture, we observed that the "math" path in FP16 mode can easily produce values exceeding the FP16 representable range. This limitation in IPEX is currently mitigated by converting to FP32.
Regarding comparisons with other vendors, I'm confident that for Stable Diffusion v2.1 (i.e., the case where we observed NaN issues), the SDPA results on the IPEX FP16 math path, before NaNs occur, are almost identical to those on CUDA. I conducted a detailed investigation of Stable Diffusion v2.1 on this matter, and if you're interested, I can share my findings here.
As for the NaN issues that occur even with BF16 and FP32, I believe this is a different case from the FP16 NaN issue; the FP16 problem arises specifically from the GEMM operation in the IPEX FP16 SDPA math path. Since future generations of Intel GPUs will support Flash Attention, we evaluated the options and concluded that Flash Attention better balances performance and accuracy. Thus, for next-generation Intel GPUs, we recommend using Flash Attention in performance-sensitive scenarios.
This is an excellent suggestion, and we will discuss internally whether to incorporate this flag in future updates.
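An opt-out flag like the one suggested could be wired up along these lines. This is only a sketch: the flag name `IPEX_DISABLE_SDPA_FP32_UPCAST` is hypothetical (no such flag exists in IPEX today), and string dtype names stand in for real dtype objects:

```python
import os

# Hypothetical flag name, for illustration only; IPEX does not define it.
_DISABLE_UPCAST = os.environ.get("IPEX_DISABLE_SDPA_FP32_UPCAST", "0") == "1"

def sdpa_compute_dtype(input_dtype: str) -> str:
    """Choose the dtype the SDPA math-path GEMMs should run in.

    By default, FP16 inputs are upcast to FP32 to avoid the overflow/NaN
    issue; setting the flag opts back into native FP16, accepting the
    NaN risk in exchange for the original performance.
    """
    if input_dtype == "float16" and not _DISABLE_UPCAST:
        return "float32"
    return input_dtype
```

The point of the design is that the accuracy workaround stays the safe default, while users who know their workload tolerates FP16 can reclaim the speed.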
Describe the bug
Description
We are experiencing a significant performance regression when running the txt2img stable diffusion pipeline with diffusers. The time taken for each 512x512 image with 20 inference steps has increased from 2.7s/image to 6.5s/image when comparing IPEX-XPU 2.5 with IPEX-XPU 2.0.
Script for SD15 with diffusers
Using the model Lykon/dreamshaper-8, but I think this is a general issue for all stable diffusion workloads.
Steps to Reproduce
2.7s/image --> 6.5s/image, which is not acceptable -- I would rather stay on IPEX 2.0.
The unet diffusion actually seems to get faster (12 it/s --> 14 it/s), so I assume the bottleneck in IPEX 2.5 is VAE decoding.
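As a rough consistency check, subtracting the UNet time (20 steps at the reported it/s) from the per-image totals attributes nearly all of the regression to VAE decode. A back-of-the-envelope sketch using only the numbers reported in this issue:

```python
steps = 20
total_old, total_new = 2.7, 6.5               # s/image: IPEX 2.0 vs 2.5
unet_old, unet_new = steps / 12, steps / 14   # s/image, from 12 -> 14 it/s

vae_old = total_old - unet_old   # roughly 1.0 s
vae_new = total_new - unet_new   # roughly 5.1 s
print(f"VAE decode slowdown: {vae_new / vae_old:.1f}x")
```

This lands at roughly 5x, matching the directly measured vae.decode slowdown reported in the comments.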
Additional Notes
Versions