🎉 Modern CUDA Learn Notes with PyTorch for Beginners: fp32/tf32, fp16/bf16, fp8/int8, Tensor/CUDA Cores, flash_attn, rope, embedding, sgemm, sgemv, hgemm, hgemv, warp/block reduce, dot prod, elementwise, sigmoid, relu, gelu, softmax, layernorm, rmsnorm, hist and some CUDA optimization techniques (pack LDST, cp.async, warp gemv, sliced_k/split_k/pipeline gemm, bank conflicts reduce, WMMA/MMA, block/warp swizzle, etc).
- `/` = not supported yet.
- ✔️ = known to work and already supported.
- ❔ = planned, but not coming soon; likely a few weeks away.
- workflow: custom CUDA kernel implementation -> PyTorch Python binding -> run tests.
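The workflow above can be sketched with a minimal elementwise-add kernel. This is an illustrative example only; the function names (`elementwise_add`, `elementwise_add_kernel`) and layout are hypothetical, not taken from this repo's kernels:

```cuda
#include <torch/extension.h>
#include <cuda_runtime.h>

// Step 1: custom CUDA kernel — one thread per element (CUDA Cores).
__global__ void elementwise_add_kernel(const float* a, const float* b,
                                       float* c, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) c[idx] = a[idx] + b[idx];
}

// Step 2: C++ launcher exposed to PyTorch; tensors are assumed to
// already be contiguous float32 tensors on the GPU.
void elementwise_add(torch::Tensor a, torch::Tensor b, torch::Tensor c) {
  const int n = a.numel();
  const int block = 256;
  const int grid = (n + block - 1) / block;
  elementwise_add_kernel<<<grid, block>>>(
      a.data_ptr<float>(), b.data_ptr<float>(), c.data_ptr<float>(), n);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("elementwise_add", &elementwise_add, "elementwise add (CUDA)");
}
```

Step 3 (run tests) would then build this file with `torch.utils.cpp_extension.load` and compare the result against the PyTorch reference `a + b`.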
- How to contribute? Please check 🌤🌤 Kernel Trace & Goals & Code Style & Acknowledgements 🎉🎉
👉TIPS: * indicates the kernel uses Tensor Cores (MMA PTX); otherwise, it uses CUDA Cores by default.
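To make the distinction concrete: a Tensor Core kernel issues matrix-multiply-accumulate instructions via the WMMA API (which compiles down to MMA PTX), rather than per-thread scalar FMAs on CUDA Cores. A minimal sketch for a single 16x16x16 fp16 tile, assuming row-major A and col-major B with leading dimension 16:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes C (16x16, fp32) = A (16x16, fp16) * B (16x16, fp16)
// using Tensor Cores via WMMA fragments.
__global__ void wmma_tile_16x16x16(const half* A, const half* B, float* C) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);          // zero the accumulator
  wmma::load_matrix_sync(a_frag, A, 16);      // ld = 16 (tile leading dim)
  wmma::load_matrix_sync(b_frag, B, 16);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // Tensor Core MMA
  wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

A CUDA Core version of the same tile would instead loop over k with scalar `c[i][j] += a[i][k] * b[k][j]` per thread.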
💡Note: the articles these experts have written are truly excellent, and I have learned a lot from them. Everyone is welcome to submit PRs recommending more great articles!
GNU General Public License v3.0
Welcome to 🌟👆🏻 star this repo & submit a PR!