softmax with EVT #177

jiyang1011 · 2024-12-27T02:06:02Z

No description provided.

examples/sycl/pvc/pvc_gemm_with_epilogue_softmax.cpp

include/cutlass/epilogue/collective/xe_epilogue.hpp

include/cutlass/epilogue/fusion/xe_vistor_softmax.hpp

aacostadiaz · 2025-01-27T12:52:17Z

include/cutlass/epilogue/collective/xe_epilogue.hpp

@@ -312,7 +313,7 @@ class CollectiveEpilogue<
    bool is_C_load_needed = is_source_supported && fusion_callbacks.is_C_load_needed();

    Tensor trC = make_tensor<typename TiledMma::ValTypeC>(Shape<Int<FragmentSize>>{});
-    Tensor trD = make_tensor<typename TiledMma::ValTypeD>(Shape<Int<FragmentSize>>{});
+    Tensor trD = make_tensor<typename TiledMma::ValTypeD>(Shape<Int<FragmentSize>, Int<FragsM>, Int<FragsN>>{});


If a callback needs to store all the values for Int<FragsM> and Int<FragsN>, that should be done inside the callback instead of in the generic path.

aacostadiaz · 2025-01-27T12:53:19Z

include/cutlass/epilogue/collective/xe_epilogue.hpp

        }
-        copy(params.xe_store_d, trD, rw_coord(_, epi_m, epi_n));
+        cst_callbacks.reduce(nullptr, synchronize, epi_m, epi_n, (epi_m == FragsM - 1 && epi_n == FragsN - 1), trD(_, epi_m, epi_n));


aacostadiaz · 2025-01-27T13:06:23Z

include/cutlass/epilogue/collective/xe_epilogue.hpp

+    CUTLASS_PRAGMA_UNROLL
+    for (int epi_n = 0; epi_n < FragsN; epi_n++) {
+      CUTLASS_PRAGMA_UNROLL
+      for (int epi_m = 0; epi_m < FragsM; epi_m++) {
+        copy(params.xe_store_d, trD(_, epi_m, epi_n), rw_coord(_, epi_m, epi_n));
+      }
+    }


This change will group all the store operations to run at the end.

In the SM90 implementation, there’s a condition around the copy operation to check if the D matrix is available. We can apply the same logic here.

If D isn't available, the softmax output should be stored in a different matrix, which will be added directly to the callback. This way, the generic path won’t know about that matrix. Since the reduce method is the only one that knows when the data is ready to be copied, I suggest we handle the copy operations within that method.

I've added the if condition to skip the copy in PR #199. You should now be able to revert the changes in the generic epilogue path and implement all the softmax functionality inside the callback.
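To illustrate the suggestion above, here is a minimal host-side sketch of a callback that stages fragments itself and performs the store only on the last iteration, so the generic path never sees the extra storage. It is not CUTLASS code: `SoftmaxCallbackSketch`, `staged`, `dst` and the simplified `reduce` signature are hypothetical, and for brevity all staged values are treated as a single softmax row.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for a fusion callback; none of these names are CUTLASS APIs.
template <int FragsM, int FragsN, int FragmentSize>
struct SoftmaxCallbackSketch {
  static constexpr std::size_t kTotal =
      static_cast<std::size_t>(FragsM) * FragsN * FragmentSize;

  // Staging storage lives in the callback, not in the generic epilogue path.
  std::array<float, kTotal> staged{};
  // Destination owned by the callback (assumption: one contiguous row).
  float* dst = nullptr;

  // Called once per (epi_m, epi_n) fragment, mirroring the loop order in the diff
  // (epi_n outer, epi_m inner).
  void reduce(int epi_m, int epi_n, bool is_last_iteration,
              const std::array<float, FragmentSize>& frag) {
    assert(epi_m >= 0 && epi_m < FragsM && epi_n >= 0 && epi_n < FragsN);
    const std::size_t base =
        (static_cast<std::size_t>(epi_n) * FragsM + epi_m) * FragmentSize;
    std::copy(frag.begin(), frag.end(), staged.begin() + base);

    // Only the callback knows when the data is complete, so it also does the copy.
    if (!is_last_iteration) return;

    // Numerically stable softmax over everything staged so far.
    const float row_max = *std::max_element(staged.begin(), staged.end());
    float row_sum = 0.0f;
    for (float& v : staged) {
      v = std::exp(v - row_max);
      row_sum += v;
    }

    assert(dst != nullptr);
    for (std::size_t i = 0; i < kTotal; ++i) {
      dst[i] = staged[i] / row_sum;
    }
  }
};
```

With this shape, the generic epilogue can keep a single `FragmentSize` fragment for `trD` and simply skip its own copy when D is not available.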

mehdi-goli · 2025-01-29T16:41:56Z

examples/sycl/pvc/pvc_gemm_with_epilogue_softmax.cpp

+      float cute_time = timer.seconds() / options.iterations;
+      double tflops = (2.0 * options.m * options.n * options.k * options.l) * 1e-12;
+      std::cout << "Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << 'x' << options.l << std::endl;
+      printf("Cutlass GEMM Performance:     [%4.3f]GB/s  (%6.4f)ms\n", io / cute_time, cute_time*1000);


Suggested change

-      printf("Cutlass GEMM Performance: [%4.3f]GB/s (%6.4f)ms\n", io / cute_time, cute_time*1000);
+      printf("Cutlass GEMM Performance: %4.3f GB/s , %4.3f TF/s , %6.4f ms\n", io / cute_time, tflops/cute_time, cute_time*1000);
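As a small standalone sketch of the arithmetic behind the suggested line: the `tflops` variable in the diff holds the total work in TFLOP, so dividing it by the per-iteration time gives TF/s. The `GemmPerf` struct and `gemm_perf` helper are hypothetical, and `io_gbytes` assumes the example's `io` value is already expressed in gigabytes.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper types; not part of the example under review.
struct GemmPerf {
  double gbytes_per_s;
  double tflops_per_s;
  double ms;
};

// m, n, k, l are the problem extents; seconds is the average time per iteration.
GemmPerf gemm_perf(double m, double n, double k, double l,
                   double io_gbytes, double seconds) {
  assert(seconds > 0.0);
  // 2*m*n*k floating-point operations per batch entry, scaled to TFLOP.
  const double tflop = 2.0 * m * n * k * l * 1e-12;
  return {io_gbytes / seconds, tflop / seconds, seconds * 1000.0};
}

// Prints in the format proposed in the suggested change.
void print_perf(const GemmPerf& p) {
  std::printf("Cutlass GEMM Performance: %4.3f GB/s , %4.3f TF/s , %6.4f ms\n",
              p.gbytes_per_s, p.tflops_per_s, p.ms);
}
```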

mehdi-goli · 2025-01-29T16:56:30Z

include/cute/arch/mma_xe.hpp

+SYCL_DEVICE_OCL(float sub_group_reduce_add(float i));
+SYCL_DEVICE_OCL(float sub_group_reduce_max(float i));


The reduce operation is not part of MMA, so these declarations should not live in mma_xe.hpp.

jiyang1011 requested review from aacostadiaz, muhammad-tanvir-1211, taozha2 and t4c1 December 27, 2024 05:14

jiyang1011 force-pushed the jiyang/softmax branch 3 times, most recently from a435631 to ad00c03 Compare December 31, 2024 01:20

jiyang1011 requested review from tdeng5, mehdi-goli and rolandschulz December 31, 2024 02:05

t4c1 reviewed Jan 3, 2025

jiyang1011 force-pushed the jiyang/softmax branch from ad00c03 to bf7c5e0 Compare January 7, 2025 01:50

mehdi-goli reviewed Jan 13, 2025
