
Add option to build CUDA backend without Flash attention #11946

Open
slaren opened this issue Feb 18, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@slaren
Member

slaren commented Feb 18, 2025

@slaren Honestly, I think Flash Attention should be an optional feature in ggml: it doesn't introduce significant performance improvements, the binary size has increased considerably, and so has the compilation time, which still takes 20 minutes on an i5-12400 even though I only compile for my GPU architecture. It is not related to this PR, but it would be good to take it into account.

Originally posted by @FSSRepo in #11867 (comment)
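For context, the requested feature would most likely be a CMake option that gates compilation of the FlashAttention CUDA kernels. A minimal sketch of what that could look like, assuming hypothetical option and macro names (`GGML_CUDA_FA`, `GGML_CUDA_NO_FA` are illustrative, not the actual ggml build flags):

```cmake
# Hypothetical sketch of an opt-out build flag for the FA kernels.
# Option and macro names are assumptions, not the real ggml build interface.
option(GGML_CUDA_FA "ggml: compile CUDA FlashAttention kernels" ON)

if (NOT GGML_CUDA_FA)
    # The CUDA sources would check this macro and skip the FA kernel
    # instantiations; the runtime would then have to report the FA op
    # as unsupported so it can fall back to the regular attention path.
    add_compile_definitions(GGML_CUDA_NO_FA)
endif()
```

Since the FA kernels are instantiated for many head sizes and quantization types, excluding them at compile time is what would recover most of the binary size and build time mentioned above.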

@slaren slaren added the enhancement New feature or request label Feb 18, 2025
@bssrdf
Contributor

bssrdf commented Feb 18, 2025

I can see that FA could be made optional to reduce build time, but saying "it doesn't introduce significant performance improvements" is a bit misleading. On my 4090, I get 47 t/s with FA on and 37 t/s with it off. Stable Diffusion generation also gets a speedup with FA.
