
A750: Stable Diffusion 2x slowdown comparing IPEX-XPU 2.5 with IPEX-XPU 2.0 on Windows #749

Nuullll opened this issue Dec 18, 2024 · 7 comments

Nuullll commented Dec 18, 2024

Describe the bug

Description

We are experiencing a significant performance regression when running the txt2img Stable Diffusion pipeline with diffusers. The time taken per 512x512 image with 20 inference steps has increased from 2.7s/image on IPEX-XPU 2.0 to 6.5s/image on IPEX-XPU 2.5.

Script for SD15 with diffusers

Using the model Lykon/dreamshaper-8, but I think this is a general issue for all Stable Diffusion workloads.

# sd15.py
import torch
import intel_extension_for_pytorch as ipex
from diffusers import DEISMultistepScheduler, StableDiffusionPipeline
import time

model_path = "Lykon/dreamshaper-8"
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DEISMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("xpu")

prompt = "portrait photo of muscular bearded guy in a worn mech suit, light bokeh, intricate, steel metal, elegant, sharp focus, soft lighting, vibrant colors"

generator = torch.manual_seed(33)
# warmup
image = pipe(prompt, generator=generator, num_inference_steps=20).images[0]

start = time.time()
for i in range(10):
    image = pipe(prompt, generator=generator, num_inference_steps=20).images[0]
end = time.time()

print("Time taken per image: ", (end - start) / 10)

image.save("./image.png")
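
Side note on timing: each pipe(...) call returns decoded PIL images, so the host is implicitly synchronized with the device before the clock is read. When timing device-side work in isolation, an explicit synchronize is needed around the measured region. A minimal sketch, assuming the torch.xpu namespace that IPEX exposes:

import time
import torch

def timed(fn, iters=10):
    # Flush queued XPU work so the clock doesn't start early, then
    # wait again at the end so all submitted kernels are counted.
    torch.xpu.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.xpu.synchronize()
    return (time.time() - start) / iters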

Steps to Reproduce

  1. Create and activate a conda environment for IPEX 2.0:
    conda create -p .\ipex20-acm python=3.10 setuptools libuv -y
    conda activate .\ipex20-acm
    python -m pip install torch==2.0.0a0 intel-extension-for-pytorch==2.0.110+gitba7f6c1 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
    pip install dpcpp-cpp-rt==2023.2 mkl-dpcpp==2023.2
    pip install diffusers==0.31.0 transformers==4.47 accelerate==1.2.1 numpy==1.26
    python sd15.py
Loading pipeline components...:  14%|█▍        | 1/7 [00:00<00:00,  8.90it/s]
D:\reproducer\ipex-perf-20-25\ipex20-acm\lib\site-packages\transformers\models\clip\feature_extraction_clip.py:28: FutureWarning: The class CLIPFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use CLIPImageProcessor instead.
  warnings.warn(
Loading pipeline components...: 100%|██████████| 7/7 [00:00<00:00, 17.81it/s]
100%|██████████| 20/20 [00:04<00:00,  4.59it/s]
100%|██████████| 20/20 [00:01<00:00, 11.82it/s]
100%|██████████| 20/20 [00:01<00:00, 12.26it/s]
100%|██████████| 20/20 [00:01<00:00, 12.26it/s]
100%|██████████| 20/20 [00:01<00:00, 11.52it/s]
100%|██████████| 20/20 [00:01<00:00, 11.99it/s]
100%|██████████| 20/20 [00:01<00:00, 11.98it/s]
100%|██████████| 20/20 [00:01<00:00, 11.79it/s]
100%|██████████| 20/20 [00:01<00:00, 11.69it/s]
100%|██████████| 20/20 [00:01<00:00, 12.10it/s]
100%|██████████| 20/20 [00:01<00:00, 12.04it/s]
Time taken per image:  2.704702615737915
  2. Create and activate a conda environment for IPEX 2.5:
    conda create -p .\ipex25-acm python=3.10 setuptools libuv -y
    conda activate .\ipex25-acm
    python -m pip install torch==2.5.1+cxx11.abi torchvision==0.20.1+cxx11.abi torchaudio==2.5.1+cxx11.abi intel-extension-for-pytorch==2.5.10+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
    pip install diffusers==0.31.0 transformers==4.47 accelerate==1.2.1 numpy==1.26
    python sd15.py
D:\reproducer\ipex-perf-20-25\ipex25-acm\lib\site-packages\torchvision\io\image.py:14: UserWarning: Failed to load image Python extension: 'Could not find module 'D:\reproducer\ipex-perf-20-25\ipex25-acm\Lib\site-packages\torchvision\image.pyd' (or one of its dependencies). Try using the full path with constructor syntax.'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
[W1217 22:50:27.000000000 OperatorEntry.cpp:162] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at C:\Jenkins\workspace\IPEX-WW-BUILDS\private-gpu\build\aten\src\ATen\RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at C:\Jenkins\workspace\IPEX-WW-BUILDS\private-gpu\build\aten\src\ATen\RegisterCPU.cpp:30476
       new kernel: registered at C:\Jenkins\workspace\IPEX-WW-BUILDS\ipex-gpu\build\Release\csrc\gpu\csrc\aten\generated\ATen\RegisterXPU.cpp:2971 (function operator ())
Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]
D:\reproducer\ipex-perf-20-25\ipex25-acm\lib\site-packages\transformers\models\clip\feature_extraction_clip.py:28: FutureWarning: The class CLIPFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use CLIPImageProcessor instead.
  warnings.warn(
Loading pipeline components...: 100%|██████████| 7/7 [00:00<00:00, 13.98it/s]
100%|██████████| 20/20 [00:04<00:00,  4.00it/s]
100%|██████████| 20/20 [00:01<00:00, 13.99it/s]
100%|██████████| 20/20 [00:01<00:00, 14.33it/s]
100%|██████████| 20/20 [00:01<00:00, 14.41it/s]
100%|██████████| 20/20 [00:01<00:00, 14.66it/s]
100%|██████████| 20/20 [00:01<00:00, 14.45it/s]
100%|██████████| 20/20 [00:01<00:00, 14.22it/s]
100%|██████████| 20/20 [00:01<00:00, 14.34it/s]
100%|██████████| 20/20 [00:01<00:00, 14.19it/s]
100%|██████████| 20/20 [00:01<00:00, 14.56it/s]
100%|██████████| 20/20 [00:01<00:00, 13.80it/s]
Time taken per image:  6.521866035461426

2.7s/image --> 6.5s/image, which is not acceptable -- I would rather stay on IPEX 2.0.
The UNet denoising loop actually got faster (12it/s --> 14it/s), so I assume the bottleneck in IPEX 2.5 is VAE decoding.

Additional Notes

  1. The first inference for IPEX-XPU 2.0 would take up to 5-10 minutes because that official release doesn't have AOT support for Arc A-Series: [IPEX][XPU][Windows 11] It takes forever to run the first pass #399
  2. Device: Intel Arc A750 Graphics, with driver 32.0.101.6325

Versions

  1. IPEX 2.0 env
Package                     Version
--------------------------- ------------------
accelerate                  1.2.0
asttokens                   2.2.1
backcall                    0.2.0
certifi                     2024.12.14
charset-normalizer          3.4.0
colorama                    0.4.6
comm                        0.1.3
debugpy                     1.6.7
decorator                   5.1.1
diffusers                   0.31.0
dpcpp-cpp-rt                2023.2.0
executing                   1.2.0
filelock                    3.16.1
fsspec                      2024.10.0
huggingface-hub             0.27.0
idna                        3.10
importlib_metadata          8.5.0
inquirerpy                  0.3.4
intel-cmplr-lib-rt          2023.2.0
intel-cmplr-lic-rt          2023.2.0
intel-extension-for-pytorch 2.0.110+gitba7f6c1
intel-opencl-rt             2023.2.0
intel-openmp                2023.2.0
ipykernel                   6.23.3
ipython                     8.14.0
jedi                        0.18.2
Jinja2                      3.1.4
jupyter_client              8.3.0
jupyter_core                5.3.1
MarkupSafe                  3.0.2
matplotlib-inline           0.1.6
mkl                         2023.2.0
mkl-dpcpp                   2023.2.0
mpmath                      1.3.0
networkx                    3.4.2
numpy                       1.26.0
packaging                   24.2
parso                       0.8.3
pfzy                        0.3.4
pickleshare                 0.7.5
pillow                      11.0.0
pip                         24.3.1
prompt-toolkit              3.0.38
psutil                      6.1.0
pure-eval                   0.2.2
Pygments                    2.15.1
pywin32                     306
PyYAML                      6.0.2
pyzmq                       25.1.0
regex                       2024.11.6
requests                    2.32.3
safetensors                 0.4.5
setuptools                  75.6.0
stack-data                  0.6.2
sympy                       1.13.3
tbb                         2021.13.1
tokenizers                  0.21.0
torch                       2.0.0a0+gitc6a572f
tornado                     6.3.2
tqdm                        4.67.1
traitlets                   5.9.0
transformers                4.47.0
typing_extensions           4.12.2
urllib3                     2.2.3
wcwidth                     0.2.13
zipp                        3.21.0
  2. IPEX 2.5 env
Package                     Version
--------------------------- ----------------
accelerate                  1.2.1
annotated-types             0.7.0
asttokens                   2.2.1
backcall                    0.2.0
certifi                     2024.12.14
charset-normalizer          3.4.0
colorama                    0.4.6
comm                        0.1.3
debugpy                     1.6.7
decorator                   5.1.1
diffusers                   0.31.0
dpcpp-cpp-rt                2025.0.4
executing                   1.2.0
filelock                    3.16.1
fsspec                      2024.10.0
huggingface-hub             0.27.0
idna                        3.10
importlib_metadata          8.5.0
intel-cmplr-lib-rt          2025.0.4
intel-cmplr-lib-ur          2025.0.4
intel-cmplr-lic-rt          2025.0.4
intel_extension_for_pytorch 2.5.10+xpu
intel-opencl-rt             2025.0.4
intel-openmp                2025.0.4
intel-sycl-rt               2025.0.4
ipykernel                   6.23.3
ipython                     8.14.0
jedi                        0.18.2
Jinja2                      3.1.4
jupyter_client              8.3.0
jupyter_core                5.3.1
MarkupSafe                  3.0.2
matplotlib-inline           0.1.6
mkl                         2025.0.1
mkl-dpcpp                   2025.0.1
mpmath                      1.3.0
networkx                    3.4.2
numpy                       1.26.0
onemkl-sycl-blas            2025.0.1
onemkl-sycl-datafitting     2025.0.1
onemkl-sycl-dft             2025.0.1
onemkl-sycl-lapack          2025.0.1
onemkl-sycl-rng             2025.0.1
onemkl-sycl-sparse          2025.0.1
onemkl-sycl-stats           2025.0.1
onemkl-sycl-vm              2025.0.1
packaging                   24.2
parso                       0.8.3
pickleshare                 0.7.5
pillow                      11.0.0
pip                         24.3.1
prompt-toolkit              3.0.38
psutil                      6.1.0
pure-eval                   0.2.2
pydantic                    2.10.3
pydantic_core               2.27.1
Pygments                    2.15.1
pywin32                     306
PyYAML                      6.0.2
pyzmq                       25.1.0
regex                       2024.11.6
requests                    2.32.3
ruamel.yaml                 0.18.6
ruamel.yaml.clib            0.2.12
safetensors                 0.4.5
setuptools                  75.6.0
stack-data                  0.6.2
sympy                       1.13.1
tbb                         2022.0.0
tcmlib                      1.2.0
tokenizers                  0.21.0
torch                       2.5.1+cxx11.abi
torchaudio                  2.5.1+cxx11.abi
torchvision                 0.20.1+cxx11.abi
tornado                     6.3.2
tqdm                        4.67.1
traitlets                   5.9.0
transformers                4.47.0
typing_extensions           4.12.2
umf                         0.9.1
urllib3                     2.2.3
zipp                        3.21.0

Nuullll commented Dec 18, 2024

Update: confirmed that VAE decode is the bottleneck.

# sd15.py
import torch
import intel_extension_for_pytorch as ipex
from diffusers import DEISMultistepScheduler, StableDiffusionPipeline
import time

model_path = "Lykon/dreamshaper-8"
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DEISMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("xpu")

# Monkey-patch vae.decode with a timing wrapper; torch.xpu.synchronize()
# waits for the decode kernels to actually finish before reading the clock.
func = pipe.vae.decode
def profile(*args, **kwargs):
    start = time.time()
    result = func(*args, **kwargs)
    torch.xpu.synchronize()
    end = time.time()
    print("Time taken for vae.decode: ", end - start)
    return result
pipe.vae.decode = profile

prompt = "portrait photo of muscular bearded guy in a worn mech suit, light bokeh, intricate, steel metal, elegant, sharp focus, soft lighting, vibrant colors"

generator = torch.manual_seed(33)
# warmup
image = pipe(prompt, generator=generator, num_inference_steps=20).images[0]

start = time.time()
for i in range(10):
    image = pipe(prompt, generator=generator, num_inference_steps=20).images[0]
end = time.time()

print("Time taken per image: ", (end - start) / 10)

image.save("./image.png")

Running with IPEX 2.0:

Loading pipeline components...:  86%|████████▌ | 6/7 [00:00<00:00, 16.79it/s]
D:\reproducer\ipex-perf-20-25\ipex20-acm\lib\site-packages\transformers\models\clip\feature_extraction_clip.py:28: FutureWarning: The class CLIPFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use CLIPImageProcessor instead.
  warnings.warn(
Loading pipeline components...: 100%|██████████| 7/7 [00:00<00:00, 17.35it/s]
100%|██████████| 20/20 [00:04<00:00,  4.53it/s]
Time taken for vae.decode:  0.8171584606170654
100%|██████████| 20/20 [00:01<00:00, 11.39it/s]
Time taken for vae.decode:  0.9818477630615234
100%|██████████| 20/20 [00:01<00:00, 11.56it/s]
Time taken for vae.decode:  1.0179715156555176
100%|██████████| 20/20 [00:01<00:00, 11.60it/s]
Time taken for vae.decode:  1.0243005752563477
100%|██████████| 20/20 [00:01<00:00, 11.69it/s]
Time taken for vae.decode:  1.017223596572876
100%|██████████| 20/20 [00:01<00:00, 11.78it/s]
Time taken for vae.decode:  1.0186982154846191
100%|██████████| 20/20 [00:01<00:00, 11.65it/s]
Time taken for vae.decode:  1.0014166831970215
100%|██████████| 20/20 [00:01<00:00, 11.66it/s]
Time taken for vae.decode:  1.034074306488037
100%|██████████| 20/20 [00:01<00:00, 11.66it/s]
Time taken for vae.decode:  1.0356333255767822
100%|██████████| 20/20 [00:01<00:00, 11.78it/s]
Time taken for vae.decode:  1.0222344398498535
100%|██████████| 20/20 [00:01<00:00, 11.66it/s]
Time taken for vae.decode:  0.9820561408996582
Time taken per image:  2.8001481533050536

Running with IPEX 2.5:

D:\reproducer\ipex-perf-20-25\ipex25-acm\lib\site-packages\torchvision\io\image.py:14: UserWarning: Failed to load image Python extension: 'Could not find module 'D:\reproducer\ipex-perf-20-25\ipex25-acm\Lib\site-packages\torchvision\image.pyd' (or one of its dependencies). Try using the full path with constructor syntax.'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
[W1218 21:02:18.000000000 OperatorEntry.cpp:162] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at C:\Jenkins\workspace\IPEX-WW-BUILDS\private-gpu\build\aten\src\ATen\RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at C:\Jenkins\workspace\IPEX-WW-BUILDS\private-gpu\build\aten\src\ATen\RegisterCPU.cpp:30476
       new kernel: registered at C:\Jenkins\workspace\IPEX-WW-BUILDS\ipex-gpu\build\Release\csrc\gpu\csrc\aten\generated\ATen\RegisterXPU.cpp:2971 (function operator ())
Loading pipeline components...:  71%|███████▏  | 5/7 [00:00<00:00, 16.39it/s]
D:\reproducer\ipex-perf-20-25\ipex25-acm\lib\site-packages\transformers\models\clip\feature_extraction_clip.py:28: FutureWarning: The class CLIPFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use CLIPImageProcessor instead.
  warnings.warn(
Loading pipeline components...: 100%|██████████| 7/7 [00:00<00:00, 16.94it/s]
100%|██████████| 20/20 [00:04<00:00,  4.57it/s]
Time taken for vae.decode:  5.066839694976807
100%|██████████| 20/20 [00:01<00:00, 13.61it/s]
Time taken for vae.decode:  5.169049263000488
100%|██████████| 20/20 [00:01<00:00, 13.47it/s]
Time taken for vae.decode:  4.754204511642456
100%|██████████| 20/20 [00:01<00:00, 13.93it/s]
Time taken for vae.decode:  4.803436756134033
100%|██████████| 20/20 [00:01<00:00, 13.96it/s]
Time taken for vae.decode:  4.808917284011841
100%|██████████| 20/20 [00:01<00:00, 13.92it/s]
Time taken for vae.decode:  4.817259788513184
100%|██████████| 20/20 [00:01<00:00, 13.58it/s]
Time taken for vae.decode:  4.765727996826172
100%|██████████| 20/20 [00:01<00:00, 13.74it/s]
Time taken for vae.decode:  5.21826171875
100%|██████████| 20/20 [00:01<00:00, 13.58it/s]
Time taken for vae.decode:  5.15399432182312
100%|██████████| 20/20 [00:01<00:00, 12.56it/s]
Time taken for vae.decode:  4.843561410903931
100%|██████████| 20/20 [00:01<00:00, 14.04it/s]
Time taken for vae.decode:  4.896574020385742
Time taken per image:  6.55001814365387

That's a ~5x VAE decode slowdown (~1.0s --> ~4.9s per decode).

@JT-Gresham

I've also noticed a considerable speed regression after installing 2.5.10 on my A770... almost 3x slower than 2.3.1 when using Fooocus or Forge. Interesting that it seems to be the VAE decoding...

@huiyan2021 self-assigned this Dec 19, 2024

Disty0 commented Dec 19, 2024

I am not sure if this is an IPEX issue or a PyTorch 2.5 issue, because I am seeing this behavior with multiple vendors.
VAE decode is significantly slower on PyTorch 2.5 and 2.6 with AMD GPUs as well; in AMD's case it takes a whole minute just to start decoding.

Also, add this line for GPUs that don't support Flash Attention or Memory Efficient Attention (like Intel and older AMD) when using PyTorch 2.5 and above:

torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
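
In the sd15.py reproducer above, that flag would go right after the imports. A minimal sketch, assuming PyTorch >= 2.5 (where this setter exists); note that it lives under torch.backends.cuda even though the suggestion here targets non-Nvidia backends:

import torch
import intel_extension_for_pytorch as ipex  # registers the xpu device

# Allow the math SDPA backend to keep its reductions in FP16/BF16
# instead of upcasting to FP32 (PyTorch >= 2.5).
torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)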

@NineMeowICT

After installing IPEX 2.5, I tried ComfyUI on Linux with my A770 and encountered a similar issue.
SDXL, Euler A, 832x1216, native K sampler
Before: ~1.1it/s
Now: ~0.5it/s
I have tested it several times and got this average speed, so a 2x slowdown as well.
Wondering what the root cause is.

@zhuyuhua-v

The performance slowdown of Stable Diffusion on IPEX 2.3 and 2.5 is expected.

During our testing on IPEX 2.3, we discovered that the images generated by Stable Diffusion occasionally exhibited accuracy issues, often resulting in meaningless output images. Through debugging, we identified the root cause: on ARC, the IPEX SDPA kernel uses the "math" implementation path, which is a naive approach (https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html). When operating on FP16 data, this path can produce GEMM values that exceed the representable range of FP16, ultimately leading to NaNs and meaningless images.

To address this problem and ensure functional correctness, we convert the data type to FP32 before performing SDPA calculations, and convert back to FP16 after SDPA. While this effectively resolves the NaN issue, it comes at the cost of performance. This trade-off is necessary to maintain Stable Diffusion’s functionality.
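
At the Python level, this workaround is roughly equivalent to the sketch below (illustrative only; the actual change lives inside the IPEX SDPA kernel, not in user code):

import torch
import torch.nn.functional as F

def sdpa_fp32_upcast(q, k, v):
    # Upcast FP16 inputs to FP32, run SDPA, then cast back,
    # trading performance for a NaN-free result.
    out = F.scaled_dot_product_attention(q.float(), k.float(), v.float())
    return out.to(q.dtype)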

We have addressed this limitation in the next-generation BMG hardware by adopting Flash Attention, which delivers significantly better performance. However, due to architectural differences between ARC and BMG, this improvement cannot be backported to ARC. Consequently, ARC hardware has to accept this performance compromise.


Disty0 commented Dec 26, 2024

> During our testing on IPEX 2.3, we discovered that the images generated by Stable Diffusion occasionally exhibited accuracy issues, often resulting in meaningless output images. Through debugging, we identified the root cause: on ARC, the IPEX SDPA kernel uses the "math" implementation path, which is a naive approach (https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html). When operating on FP16 data, this path can produce GEMM values that exceed the representable range of FP16, ultimately leading to NaNs and meaningless images.

Casting to FP32 doesn't fix the root cause; it is just a workaround for the deeper accuracy issues in IPEX.
This doesn't happen on Nvidia or AMD with math SDPA and forced FP16/BF16 reduction; it only happens on Intel.
You also have to run the CLIP text encoders on the CPU to get accurate results on IPEX 2.3, so SDPA is not the root issue.
Accuracy and NaN issues still happen with BF16 and FP32 on Intel. BF16 and FP32 shouldn't go NaN, but they do on Intel: #529

> To address this problem and ensure functional correctness, we convert the data type to FP32 before performing SDPA calculations, and convert back to FP16 after SDPA. While this effectively resolves the NaN issue, it comes at the cost of performance. This trade-off is necessary to maintain Stable Diffusion’s functionality.

The FP32-upcasting SDPA should still honor this flag and be disabled when it is set to true:

torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)

@zhuyuhua-v

> Casting to FP32 doesn't fix the root cause; it is just a workaround for the deeper accuracy issues in IPEX.

Yes, converting data from FP16 to FP32 does not fundamentally resolve the accuracy issues but serves as a trade-off to prevent NaN problems that result in meaningless images. For the ARC architecture, we observed that the "math" path in FP16 mode can easily lead to computation values exceeding the FP16 representable range. This limitation in IPEX is currently mitigated by converting to FP32.

> This doesn't happen on Nvidia or AMD with math SDPA and forced FP16/BF16 reduction; it only happens on Intel. You also have to run the CLIP text encoders on the CPU to get accurate results on IPEX 2.3, so SDPA is not the root issue.

Regarding comparisons with other vendors, I’m confident that for Stable Diffusion v2.1 (i.e., the case where we observed NaN issues), the results of SDPA on the IPEX FP16 math path before NaN issues occur are almost identical to those on CUDA. I conducted a detailed investigation of Stable Diffusion v2.1 on this matter, and if you're interested, I can share my findings here.

> Accuracy and NaN issues still happen with BF16 and FP32 on Intel. BF16 and FP32 shouldn't go NaN, but they do on Intel: #529

For the NaN issues that occur even with BF16 and FP32, I believe this is a different case from the FP16 NaN issue. The FP16 problem specifically arises due to the GEMM operation in the IPEX FP16 SDPA math path.
https://github.com/intel/intel-extension-for-pytorch/blob/release/xpu/2.5.10/csrc/gpu/aten/operators/transformers/attention.cpp#L210
In this matmul, matrix elements in the query can have absolute values up to ~500, and in the key, up to ~70. The products can therefore reach absolute values of ~35,000, and summing 9,216 such products can lead to results well beyond the FP16 range (~64,000). According to the PyTorch documentation, the inputs are scaled by 1/√9,216 = 1/96, but even with this scaling, the final results can easily exceed the FP16 range. When we tested this by converting the tensors to FP32, the results were indeed outside the FP16 range (~100,000). Our investigation found this to be a math-path-only issue.
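
Those magnitudes are easy to sanity-check with a tiny standalone script (a sketch using synthetic tensors at the magnitudes quoted above, not the actual SD v2.1 activations):

import torch

d = 9216                               # reduction length of the matmul above
q = torch.full((d,), 500.0)            # |query| elements up to ~500
k = torch.full((d,), 70.0)             # |key| elements up to ~70
scale = 1.0 / (d ** 0.5)               # 1/sqrt(9216) = 1/96

dot_fp32 = (q * k).sum() * scale       # ~3.36e6, representable in FP32
dot_fp16 = dot_fp32.to(torch.float16)  # exceeds FP16 max (65504) -> inf

print(dot_fp32.item(), dot_fp16.item())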

Since future generations of Intel GPUs will support Flash Attention, we evaluated the options and concluded that Flash Attention balances performance and accuracy better. Thus, for future generations of Intel GPUs, we recommend using Flash Attention in performance-sensitive scenarios.
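
For reference, recent PyTorch also exposes an explicit backend selector for SDPA. A sketch of how code could prefer Flash Attention where available (PyTorch >= 2.3 API; whether a given XPU build honors this selection is an assumption here):

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(1, 8, 4096, 64)  # FP32 toy tensors for portability
k = torch.randn_like(q)
v = torch.randn_like(q)

# Ask for Flash Attention first, falling back to the math path on
# hardware or dtypes where it is not implemented.
with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.MATH]):
    out = F.scaled_dot_product_attention(q, k, v)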

> The FP32-upcasting SDPA should still honor this flag and be disabled when it is set to true:
>
> torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)

This is an excellent suggestion, and we will discuss internally whether to incorporate this flag in future updates.
