This document serves as a log of the progress and knowledge I gained while working on GPU programming and studying the PMPP (Programming Massively Parallel Processors) book.
Mentor: https://github.com/hkproj/
Bro in the 100 days challenge: https://github.com/1y33/100Days
Check out my blog: https://hamdi.bearblog.dev/
Summary:
Implemented vector addition by writing a simple CUDA program. Explored how to launch a kernel to perform a parallelized addition of two arrays, where each thread computes the sum of a pair of values.
Learned:
- Basics of writing a CUDA kernel.
- Understanding of grid, block, and thread hierarchy in CUDA.
- How to allocate and manage device (GPU) memory using `cudaMalloc`, `cudaMemcpy`, and `cudaFree`.
- Read Chapter 1 of the PMPP book.
- Learned about the fundamentals of parallel programming, CUDA architecture, and the GPU execution model.
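For reference, below is a minimal sketch of this kind of vector-addition kernel and its host-side memory management. The names (`vecAddKernel`, `vecAdd`) and the block size are illustrative choices, not necessarily the ones used in my code.

```cuda
#include <cuda_runtime.h>

__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) C[i] = A[i] + B[i];                  // one element per thread
}

void vecAdd(const float *h_A, const float *h_B, float *h_C, int n) {
    size_t size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size); cudaMalloc(&d_B, size); cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;       // enough blocks to cover n
    vecAddKernel<<<blocks, threads>>>(d_A, d_B, d_C, n);
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
```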
Summary:
Worked on matrix addition using CUDA. Designed the grid and block layout to handle 2D matrices in parallel, with each element processed by an individual thread.
Learned:
- How to map 2D matrix data onto multiple threads.
- Understanding thread indexing in 2D grids and blocks using `threadIdx`, `blockIdx`, `blockDim`, and `gridDim`.
- Synchronizing threads and avoiding race conditions when writing results to an output matrix.
- Read Chapter 2 of the PMPP book.
- Learned about scalability of GPUs, massive parallelism, and how to configure problem data to match GPU thread hierarchies.
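A small sketch of the 2D index mapping described above (illustrative names; assumes row-major storage for an M x N matrix):

```cuda
__global__ void matAddKernel(const float *A, const float *B, float *C,
                             int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {
        int idx = row * cols + col;   // row-major flattening
        C[idx] = A[idx] + B[idx];
    }
}

// Example launch configuration for an M x N matrix:
// dim3 block(16, 16);
// dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
// matAddKernel<<<grid, block>>>(d_A, d_B, d_C, M, N);
```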
Summary:
Implemented matrix-vector multiplication using CUDA. Each thread was set up to compute the dot product between a matrix row and the given vector. Optimized performance using shared memory.
Learned:
- How to perform dot products in parallel.
- Efficiently handling shared memory to avoid excessive global memory accesses and improve memory coalescing.
- Launching kernels for 1D or 2D thread configurations based on input data.
- Read half of Chapter 3 of the PMPP book.
- Learned about Scalable Parallel Execution.
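As a reference, here is a simplified row-per-thread sketch of matrix-vector multiplication; the actual implementation also used shared memory, which this sketch omits for brevity, and the names are illustrative:

```cuda
__global__ void matVecKernel(const float *A, const float *x, float *y,
                             int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows) {
        float sum = 0.0f;
        for (int j = 0; j < cols; ++j)
            sum += A[row * cols + j] * x[j];   // dot product of row with x
        y[row] = sum;
    }
}
```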
Summary:
Worked on parallel reduction to compute the partial sum of an array. Implemented a tree-based reduction algorithm, minimizing warp divergence for better performance.
Learned:
- The concept of reduction in parallel programming.
- Techniques for minimizing warp divergence and balancing workload across threads.
- How to use shared memory effectively in reduction operations.
- Finished Chapter 3 of the PMPP book.
- Learned about Scalable Parallel Execution, including Resource Assignment, Thread Scheduling, and Latency Tolerance.
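A minimal sketch of the tree-based, shared-memory reduction pattern described above (block size and names are my illustrative choices):

```cuda
#define BLOCK_SIZE 256   // assumes blockDim.x == BLOCK_SIZE

__global__ void reduceSum(const float *in, float *partial, int n) {
    __shared__ float sdata[BLOCK_SIZE];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;        // load into shared memory
    __syncthreads();
    // Tree reduction: the stride halves each step; keeping the active
    // threads contiguous limits warp divergence.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];  // one partial sum per block
}
```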
Summary:
Implemented Layer Normalization, a technique often used in deep learning models, in CUDA. Explored normalization techniques across batches and layers using reduction operations. Addressed the challenge of maintaining numerical stability during computation.
Learned:
- How to calculate mean and variance in parallel using reduction algorithms.
- Strategies to stabilize floating-point operations to prevent overflow or underflow issues.
- CUDA kernel optimization for workloads involving tensor computation.
- Read Chapter 4 of the PMPP book.
- Learned about memory optimizations and strategies for GPU performance tuning.
Summary:
Implemented CUDA-based matrix transposition. Optimized the implementation by leveraging shared memory to minimize global memory reads and writes. Ensured proper handling of edge cases when the matrix dimensions are not multiples of the block size.
Learned:
- How to optimize memory usage when working with global and shared memory.
- Techniques to handle data alignment and padding for non-square matrices during transposition.
- The importance of coalescing memory accesses in CUDA to improve performance.
- Read Chapter 5 of the PMPP book.
- Learned about Performance Considerations, including optimizing memory access patterns, advanced use of shared memory for performance, and dynamic partitioning of resources.
- Read Chapter 6 of the PMPP book.
- Learned about Numerical Considerations, including the IEEE floating-point format, arithmetic accuracy and rounding, and linear solvers and numerical stability.
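For reference, a hedged sketch of a shared-memory tiled transpose; `TILE_DIM` and the `+1` padding to avoid bank conflicts are illustrative choices, not necessarily what my code uses:

```cuda
#define TILE_DIM 32   // assumes blockDim == (TILE_DIM, TILE_DIM)

__global__ void transposeTiled(const float *in, float *out, int rows, int cols) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 avoids bank conflicts
    int x = blockIdx.x * TILE_DIM + threadIdx.x;     // column in input
    int y = blockIdx.y * TILE_DIM + threadIdx.y;     // row in input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];   // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE_DIM + threadIdx.x;         // column in output
    y = blockIdx.x * TILE_DIM + threadIdx.y;         // row in output
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}

// Launch: dim3 block(TILE_DIM, TILE_DIM);
//         dim3 grid((cols + TILE_DIM - 1) / TILE_DIM,
//                   (rows + TILE_DIM - 1) / TILE_DIM);
```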
Summary:
Implemented a simple 1D convolution algorithm using CUDA. This involved sliding a kernel (or filter) over an input array and computing the weighted sum of elements. Each thread was assigned to compute the convolution at a specific position in the output array.
Learned:
- Basics of 1D convolution in parallel, including mapping threads to positions in the output array.
- How to handle boundary conditions (halo cells) when the kernel partially overlaps the input array bounds.
- Importance of memory layout and contiguous access for kernel weights and input arrays to maximize performance.
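A simple sketch of a naive 1D convolution kernel of this kind (signature and boundary handling are illustrative):

```cuda
__global__ void conv1D(const float *in, const float *mask, float *out,
                       int n, int maskWidth) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float sum = 0.0f;
        int start = i - maskWidth / 2;        // center the mask on element i
        for (int j = 0; j < maskWidth; ++j) {
            int idx = start + j;
            if (idx >= 0 && idx < n)          // ghost elements treated as 0
                sum += in[idx] * mask[j];
        }
        out[i] = sum;
    }
}
```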
Summary:
Implemented an optimized version of the 1D convolution algorithm using tiling and shared memory. Divided the input array into tiles and loaded data into shared memory, minimizing global memory accesses for better performance. Used halo cells to handle edge cases where kernel overlap extended into neighboring tiles.
Learned:
- Tiling in CUDA: Dividing input data into manageable chunks and leveraging shared memory to reduce global memory latency.
- Use of halo cells to ensure correctness at tile boundaries during convolution.
- How to balance computation and memory usage in tiled algorithms to improve performance.
- Proper synchronization of threads within a block (using `__syncthreads()`) to ensure data consistency in shared memory.
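Below is a hedged sketch of a tiled 1D convolution with halo cells in the spirit of the PMPP kernel; `TILE_SIZE` and `MAX_MASK` are illustrative constants, and the kernel assumes `blockDim.x == TILE_SIZE`:

```cuda
#define TILE_SIZE 256
#define MAX_MASK 7

__global__ void conv1DTiled(const float *in, const float *mask, float *out,
                            int n, int maskWidth) {
    __shared__ float tile[TILE_SIZE + MAX_MASK - 1];
    int halo = maskWidth / 2;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Left halo: loaded by the last `halo` threads of the block.
    if (threadIdx.x >= blockDim.x - halo) {
        int idx = gid - blockDim.x;
        tile[threadIdx.x - (blockDim.x - halo)] = (idx >= 0) ? in[idx] : 0.0f;
    }
    // Center elements of this tile.
    tile[halo + threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    // Right halo: loaded by the first `halo` threads of the block.
    if (threadIdx.x < halo) {
        int idx = gid + blockDim.x;
        tile[halo + blockDim.x + threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    }
    __syncthreads();   // all shared-memory loads must finish before use

    if (gid < n) {
        float sum = 0.0f;
        for (int j = 0; j < maskWidth; ++j)
            sum += tile[threadIdx.x + j] * mask[j];
        out[gid] = sum;
    }
}
```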
Summary:
Implemented a 2D convolution algorithm with tiling optimization using CUDA. Divided the input matrix into tiles and leveraged shared memory to minimize global memory accesses, ensuring efficient computation of the convolution kernel across the matrix. Handled boundary conditions using halo cells to process edges and corners correctly.
Learned:
- Extended tiling techniques from 1D to 2D data structures for efficient parallel computation.
- Optimized global memory access by using shared memory for each tile.
- Synchronization of threads for consistent shared memory usage within a block (`__syncthreads()` for proper execution order).
- Efficient handling of edge cases and boundary cells in 2D convolution.
- Read Chapter 7 of the PMPP book.
- Learned about parallel patterns for convolution, including basic algorithms, memory optimizations with constant and shared memory, and tiling techniques with halo cells for 1D and 2D convolution.
Summary:
Implemented the Brent-Kung algorithm for parallel prefix sum (scan) in CUDA, designing a work-efficient strategy to compute prefix sums across an array.
Learned:
- The fundamentals of hierarchical parallel scan algorithms and the Brent-Kung approach for work efficiency.
- How to divide the scan operation into an up-sweep (reduce) phase and a down-sweep phase using shared memory for efficient computation.
- Optimized thread synchronization and memory usage for large input arrays.
- Read Chapter 8 of the PMPP book.
- Learned about different parallel patterns for prefix sum computation, focusing on performance, memory access efficiency, and work-efficient algorithms like hierarchical scans.
- Read Chapter 9 of the PMPP book.
- Learned about different parallel patterns for Parallel Histogram Computation, focusing on Atomic Operations, Interleaved Partitioning, Privatization and Aggregation.
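For reference, a single-block Brent-Kung scan sketch following the PMPP formulation (the hierarchical version adds a second pass over per-block sums; `SECTION_SIZE` is assumed to be twice the block size):

```cuda
#define SECTION_SIZE 1024   // launch with blockDim.x = SECTION_SIZE / 2

__global__ void brentKungScan(const float *in, float *out, int n) {
    __shared__ float T[SECTION_SIZE];
    int i = 2 * blockIdx.x * blockDim.x + threadIdx.x;
    T[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    T[threadIdx.x + blockDim.x] = (i + blockDim.x < n) ? in[i + blockDim.x] : 0.0f;

    // Up-sweep (reduction) phase.
    for (int stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();
        int idx = (threadIdx.x + 1) * 2 * stride - 1;
        if (idx < SECTION_SIZE) T[idx] += T[idx - stride];
    }
    // Down-sweep (distribution) phase.
    for (int stride = SECTION_SIZE / 4; stride > 0; stride /= 2) {
        __syncthreads();
        int idx = (threadIdx.x + 1) * 2 * stride - 1;
        if (idx + stride < SECTION_SIZE) T[idx + stride] += T[idx];
    }
    __syncthreads();
    if (i < n) out[i] = T[threadIdx.x];
    if (i + blockDim.x < n) out[i + blockDim.x] = T[threadIdx.x + blockDim.x];
}
```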
Summary:
Implemented a forward pass for Flash Attention in CUDA, based on the Flash Attention paper. The code is still a work in progress and might produce incorrect results. A refined and fully functional version will be updated in the coming days.
Learned:
- Explored the fundamentals of Flash Attention, including its memory-efficient mechanism for attention computation.
- Gained insights into optimizing CUDA kernels for operations like softmax and scaling factors used in attention.
- Identified potential challenges in achieving numerical stability and correctness when implementing complex attention mechanisms.
- Read the Flash Attention paper.
- Learned about the key concepts of reducing memory overhead in attention computation, streamlining the matrix multiplication process, and ensuring efficient scaling for large models.
Optimized and corrected yesterday's forward pass for Flash Attention in CUDA, based on the Flash Attention paper. The code is still a work in progress!
Added PyTorch code to check the results of the `flash_attention_forward` kernel.
Wrote a blog post on Flash Attention (the forward algorithm) explaining the parts of my code. I'll try to make it more intuitive with drawings as soon as I have time.
Summary:
Completed the implementation of a highly optimized sparse matrix-vector multiplication (SpMV) algorithm using a hybrid approach that combines ELL (Ellpack) and COO (Coordinate) formats. This implementation focuses on minimizing memory overhead while maximizing computational efficiency across the sparsity of the input matrix.
Learned:
- Explored the principles and benefits of different sparse matrix representations, namely ELL and COO formats.
- Implemented hybrid techniques to optimize performance by balancing memory access patterns and ensuring efficient data locality.
- Benchmarked the performance of the CUDA implementation against PyTorch to evaluate the efficiency and correctness of the optimized SpMV algorithm.
- Completed Chapter 10 of the PMPP book.
- Gained insights into parallel patterns for sparse matrix computations, focusing on the background of sparse data handling, parallel SpMV using CSR formats, and padding and transposition techniques for optimization.
- Learned about utilizing hybrid approaches to manage padding effectively and methods for sorting and partitioning to enhance regularization in sparse data.
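As an illustration of the ELL part of such a hybrid SpMV (the COO leftovers would be handled in a separate kernel or on the host), here is a hedged sketch; the column-major ELL layout and the use of `-1` to mark padded entries are my assumptions:

```cuda
__global__ void spmvEll(const int *colIdx, const float *vals, int numRows,
                        int maxNnzPerRow, const float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows) {
        float dot = 0.0f;
        for (int k = 0; k < maxNnzPerRow; ++k) {
            int idx = k * numRows + row;      // column-major ELL layout:
            int col = colIdx[idx];            // consecutive threads read
            if (col >= 0)                     // consecutive memory
                dot += vals[idx] * x[col];    // padded entries marked with -1
        }
        y[row] += dot;                        // COO leftovers added separately
    }
}
```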
Summary:
Developed a benchmarking script to evaluate the performance of the custom CUDA SpMV implementation against PyTorch's built-in functions. This benchmark facilitates comparative analysis of execution times and ensures that the implementation meets expected performance standards.
- Wrote a blog post titled "Learning CUDA with a Weak GPU or No GPU at All: Yes, You Can!"
- Addressed common misconceptions regarding GPU programming and provided practical tips for learners with limited hardware resources. The blog offers insights on optimizing CPU-based implementations and highlights methods to learn CUDA fundamentals without direct access to a powerful GPU.
Link to Blog:
Learning CUDA with a Weak GPU or No GPU at All: Yes, You Can!
Summary:
Implemented the Merge Sort algorithm using CUDA. The implementation focuses on merging two sorted arrays into a single sorted array using a parallel approach. The kernel utilizes a co-rank function to find positions in the combined array for inserting elements from the two sorted input arrays efficiently.
Learned:
- Explored the fundamentals of merge sort and its parallelization strategies.
- Implemented the co-rank function which assists in finding the correct position of elements while merging two sorted arrays.
- Developed a parallel merge kernel that utilizes the GPU's capabilities for concurrent execution, enhancing performance beyond a sequential merge approach.
- Read Chapter 11 of the PMPP book.
- Covered various aspects of merge sort parallel pattern. Key sections included:
- Background: Understanding the merge sort algorithm and its significance in parallel processing.
- Sequential Merge Algorithm: Key insights into how merge operations are typically conducted sequentially.
- Parallelization Approach: Strategies for achieving parallelism in merge sort, highlighting the expected performance benefits.
- Co-Rank Function Implementation: Understanding how the co-rank function is used to determine merging positions effectively.
- Basic and Tiled Merge Kernel: Learning about different kernel designs including basic parallel merge kernels and more advanced tiled merge techniques for optimizing data access patterns.
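For reference, a device-side co-rank sketch following the standard PMPP formulation (names are illustrative; my repo's version may differ in details):

```cuda
// For output rank k of the merge of sorted arrays A (length m) and B
// (length n), find how many elements i come from A (and k - i from B).
__device__ int coRank(int k, const int *A, int m, const int *B, int n) {
    int i = (k < m) ? k : m;                 // upper bound for i
    int j = k - i;
    int iLow = (k - n > 0) ? k - n : 0;      // lower bound for i
    int jLow = (k - m > 0) ? k - m : 0;
    while (true) {
        if (i > 0 && j < n && A[i - 1] > B[j]) {          // too many from A
            int delta = (i - iLow + 1) / 2;
            jLow = j; i -= delta; j += delta;
        } else if (j > 0 && i < m && B[j - 1] >= A[i]) {  // too many from B
            int delta = (j - jLow + 1) / 2;
            iLow = i; j -= delta; i += delta;
        } else {
            return i;                                      // valid co-rank
        }
    }
}
```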
I coded an optimized breadth-first search kernel; check this for more details: BFS.
I also coded a GELU activation kernel; check this for more details: Gelu.
I also coded a full linear layer that handles batches using cuBLAS: Linear_kernel.
- Read Chapter 12 of the PMPP book.
- Explored parallel patterns for graph searches, covering:
- Background on graph structures and traversal mechanisms.
- Detailed sections on implementing both sequential and parallel BFS functions.
- Insights into optimizing graph traversal performance, including memory bandwidth considerations and load balancing strategies in parallel algorithms.
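As a reference for the BFS kernel mentioned above, here is a hedged sketch of one level of a frontier-based BFS on a CSR graph; the queue layout and names are illustrative, not the repo's exact code:

```cuda
// distances[] is initialized to -1 except for the source (0); the host
// resets *nextSize to 0 and swaps frontier buffers between levels.
__global__ void bfsLevel(const int *rowPtr, const int *colIdx,
                         const int *frontier, int frontierSize,
                         int *nextFrontier, int *nextSize,
                         int *distances, int level) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < frontierSize) {
        int u = frontier[t];
        for (int e = rowPtr[u]; e < rowPtr[u + 1]; ++e) {
            int v = colIdx[e];
            // Claim the vertex atomically so it is enqueued only once.
            if (atomicCAS(&distances[v], -1, level + 1) == -1) {
                int pos = atomicAdd(nextSize, 1);
                nextFrontier[pos] = v;
            }
        }
    }
}
```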
- Read Chapter 13 of the PMPP book.
- Learned about the fundamentals of CUDA Dynamic Parallelism, including:
- The basics and overview of dynamic parallelism in CUDA.
- How memory visibility works, especially in the context of different memory types (global, shared, local).
- Memory management strategies and the impact of nesting depth on kernel launches.
- Synchronization techniques, streams, and events for managing concurrent operations within dynamic kernels.
- Studied a more complex example about Bezier curve calculations both with and without dynamic parallelism, enhancing my understanding of recursive kernel launches.
- Optimize the BFS implementation using hierarchical queues for better memory usage and performance.
- Explore additional enhancements and optimizations discussed in Chapter 12 to refine the BFS algorithm further.
- Prepare a performance comparison between CPU and GPU implementations in the subsequent days.
Summary:
Implemented the FHD (Fully-Hybrid Domain) algorithm for non-Cartesian magnetic resonance imaging (MRI) reconstruction in CUDA. The code focuses on optimizing the parallelism structure to handle iterative reconstruction efficiently, aiming to balance computational load while reducing memory footprint.
Learned:
- Gained insights into non-Cartesian MRI imaging techniques and their relevance in modern medical imaging applications.
- Developed an understanding of iterative reconstruction methods and how parallelization can significantly improve performance in reconstructing images from non-Cartesian data.
- Implemented optimizations to address common challenges in MRI reconstruction, such as memory bandwidth limitations and computational heavy-lifting.
Summary:
Built upon the previous implementation of the FHD algorithm to include real image reading and processing capabilities. This version takes an actual image, applies the FHD reconstruction algorithm, and outputs the reconstructed image, demonstrating practical applicability of the CUDA code.
Learned:
- Expanded the previous understanding of memory management and kernel optimization by integrating real-world data processing into the workflow.
- Familiarized myself with image I/O operations in CUDA, allowing for the handling of real data as input for reconstruction algorithms.
- Completed Chapter 14 of the PMPP book.
- Delved into the case study of non-Cartesian magnetic resonance imaging, which provided:
- Background on the principles and necessities driving advancements in MRI technology.
- A comprehensive look at iterative reconstruction techniques that enhance image quality using statistical estimation methods.
- Detailed steps on optimizing the kernel parallelism structure to maximize performance and minimize memory constraints in handling MRI data.
- Insights into experimental performance tuning, particularly the advantages of leveraging hardware trigonometry functions to achieve rapid computations.
Summary:
Implemented the backpropagation for Flash Attention in CUDA, continuing from the forward pass developed earlier. The backpropagation step computes the gradients required for training the attention mechanism. However, a small issue arose where some of the gradients are outputting as zero at certain points, which will be addressed and fixed in the coming days.
Learned:
- Explored the process of backpropagation in the context of Flash Attention, including the calculation of gradients for the attention weights and input matrices.
- Worked on integrating gradient calculation with memory optimization techniques to maintain efficiency, consistent with the original forward pass.
- Identified potential issues related to numerical stability when dealing with gradient flow in CUDA, specifically in the attention layer.
Summary:
Developed a Convolutional Neural Network (CNN) implementation in CUDA, including both forward and backward passes with pooling layers. Used the unrolling trick for improved performance in the backward pass, optimizing the matrix operations involved.
Learned:
- Implemented the core components of a CNN in CUDA, including convolutions, activations, pooling layers, and backpropagation.
- Utilized the unrolling trick to optimize the backward pass, improving the performance of matrix multiplications and gradient calculations.
- Gained deeper understanding of the computational requirements for CNN training on GPUs and the importance of efficient memory access patterns and parallelism in deep learning.
- Chapter 15: Application Case Study—Molecular Visualization and Analysis
  - Delved into the background and practical aspects of molecular visualization in parallel computing.
  - Learned about the importance of thread granularity adjustments and memory coalescing in visualizing large-scale molecular structures using CUDA.
- Chapter 16: Application Case Study—Machine Learning
  - Focused on Convolutional Neural Networks (ConvNets) and their implementation in CUDA.
  - Covered key concepts such as basic layers, backpropagation, and the reduction of convolutional layers to matrix multiplication for optimization.
  - Explored the cuDNN library and its use in accelerating deep learning operations.
- Chapter 17: Parallel Programming and Computational Thinking
  - Studied the core principles of parallel computing, including problem decomposition, algorithm selection, and computational thinking.
  - Focused on strategies for optimizing memory locality and shared memory usage in parallel applications.
Summary:
Implemented a CUDA-accelerated Naive Bayes classifier, focusing on the training and inference stages. Leveraging shared memory to maximize computational efficiency, the implementation is structured to divide work among threads for parallelized data processing of feature probabilities.
Components Developed:
- `NaiveBayes.cu`:
  - This file contains the CUDA kernel responsible for calculating feature likelihoods and class probabilities in parallel. Shared memory was used where possible to minimize global memory access penalties.
  - Optimized kernel launches to balance grid and block dimensions for datasets with high dimensionality.
- `NaiveBayesKernel.cuh`:
  - Header file declaring the kernel functions, ensuring modularity and separation of concerns in the code structure.
- `NaiveBayesTrain.cuh`:
  - Declared the host-side training function, encapsulating the logic to copy data to the GPU, launch CUDA kernels, and retrieve results.
- `NaiveBayesTrain.cpp`:
  - Implemented the host-side training process, providing pre-processing for input data and managing memory transfers between CPU and GPU.
- `main.cpp`:
  - Entry point of the program, performing tasks like loading data, splitting datasets for training and testing, and evaluating model performance after training.
- Updated my blog with important information about using NVCC in Colab.
Summary:
Today, I implemented vector addition using the cuBLAS library in CUDA. By leveraging the optimized linear algebra routines provided by cuBLAS, this implementation achieves highly efficient computation of the vector addition operation `C = A + B` for two input vectors `A` and `B`. The addition was performed using the `cublasSaxpy` function, which computes scaled vector addition.
Key Concepts Implemented:
- Used the `cublasSaxpy` function to perform the vector addition in the form `C = alpha * A + B`, where `alpha` is a scaling factor. In this case, `alpha` was set to `1.0` to achieve the simple addition of `A` and `B`.
- Managed the cuBLAS library handle for the operation.
Learned:
- cuBLAS Basics:
  - Gained an introduction to the cuBLAS library and its capabilities for high-performance linear algebra operations.
  - Learned how to use cuBLAS functions like `cublasSaxpy` for vector addition and understood its parameters.
- cuBLAS Handle Management:
  - Understood how to create and destroy a cuBLAS handle using `cublasCreate` and `cublasDestroy`. This is critical for managing state across cuBLAS calls.
- Functionality of `cublasSaxpy`:
  - Reviewed the underlying algorithm and implementation of the AXPY operation, which computes `y = a*x + y` for real vectors `x` and `y` and scalar `a`.
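A minimal usage sketch of `cublasSaxpy` for `C = A + B` with `alpha = 1.0f`; since SAXPY works in place (`y = alpha*x + y`), `B` is first copied into the output buffer. Variable names are illustrative:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// d_A, d_B, d_C are device pointers of length n.
void vecAddCublas(const float *d_A, const float *d_B, float *d_C, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);                            // create cuBLAS context
    cudaMemcpy(d_C, d_B, n * sizeof(float),
               cudaMemcpyDeviceToDevice);             // C = B
    const float alpha = 1.0f;
    cublasSaxpy(handle, n, &alpha, d_A, 1, d_C, 1);   // C = alpha*A + C
    cublasDestroy(handle);                            // release the handle
}
```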
Summary:
Implemented a matrix multiplication algorithm using the cuBLAS library, which is optimized for CUDA-capable GPUs. This involved utilizing high-performance BLAS (Basic Linear Algebra Subprograms) routines for efficient computation of matrix products. The cuBLAS library abstracts much of the complexity involved in parallel programming for matrix operations, allowing for better performance due to its efficient use of memory and computations optimized for the hardware.
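For reference, a hedged sketch of the kind of cuBLAS call involved, computing `C = A * B` for column-major `M x K` and `K x N` matrices already resident on the device (names and layout assumptions are mine):

```cuda
#include <cublas_v2.h>

void matMulCublas(cublasHandle_t handle, const float *d_A, const float *d_B,
                  float *d_C, int M, int K, int N) {
    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS assumes column-major storage; the leading dimension of each
    // matrix is its number of rows.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                M, N, K,
                &alpha, d_A, M,
                        d_B, K,
                &beta,  d_C, M);
}
```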
- Completed Chapter 18 of the PMPP book: Programming a Heterogeneous Computing Cluster
- Completed Chapter 19 of the PMPP book: Parallel Programming with OpenACC
- Completed Chapter 20 of the PMPP book: More on CUDA and Graphics Processing Unit Computing
- Completed Chapter 21 of the PMPP book: Conclusion and Outlook
- Explored Appendix A (Introduction to OpenCL), Appendix B (Thrust), Appendix C (CUDA Fortran), and Appendix D (An Introduction to C++ AMP).
Summary:
Implemented a fully connected neural network (FCNet) using cuDNN in a CUDA program. This program utilizes the cuDNN library to perform forward passes through the network. The network consists of an input layer, two hidden layers, and an output layer. Each layer applies convolution, activation functions (ReLU), and includes bias terms.
Learned:
- How to leverage cuDNN to construct and optimize neural networks effectively on the GPU.
- Understanding of tensor descriptors, filter descriptors, and convolution descriptors, which are essential for defining the structure of the network.
- The process for initializing weights using the cuRAND library to generate random numbers for model training.
- Execution of the forward pass through the network with proper handling of data types, memory allocations, and error checking.
- Importance of initializing and cleaning up CUDA and cuDNN resources to prevent memory leaks.
- Used `cudnnCreate`, `cudnnSetTensor4dDescriptor`, and related functions to define the structure of inputs, outputs, and weights.
- Utilized convolution and activation layers to mimic the behavior of a feedforward neural network.
- Implemented error checking macros (`CHECK_CUDA`, `CHECK_CUDNN`) to facilitate debugging of CUDA and cuDNN calls.
- Conducted a simple forward training loop to process data; generated dummy input and label data for testing purposes.
Summary:
Implemented the Rotary Positional Encoding (RoPE) mechanism in CUDA for enhancing transformer models. RoPE is designed to introduce the notion of position into token representations, allowing the model to effectively capture the sequential nature of the input data without the limitations of traditional positional encodings.
- Continued exploring additional literature on data structures suitable for parallel computing, focusing on the use cases and performance implications of rope data structures in practical applications.
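A hedged sketch of a RoPE kernel that rotates each (even, odd) feature pair of a token by a position-dependent angle; the layout, the one-block-per-position launch, and the base `10000.0f` are my assumptions, not necessarily what my implementation does:

```cuda
// x has shape [seqLen, dim] in row-major order; launch as
// ropeKernel<<<seqLen, dim / 2>>>(d_x, seqLen, dim), assuming dim/2 <= 1024.
__global__ void ropeKernel(float *x, int seqLen, int dim) {
    int pos = blockIdx.x;                       // token position
    int i = threadIdx.x;                        // pair index, i < dim / 2
    if (pos < seqLen && i < dim / 2) {
        float theta = pos * powf(10000.0f, -2.0f * i / dim);
        float c = cosf(theta), s = sinf(theta);
        float x0 = x[pos * dim + 2 * i];
        float x1 = x[pos * dim + 2 * i + 1];
        x[pos * dim + 2 * i]     = x0 * c - x1 * s;   // rotate the pair
        x[pos * dim + 2 * i + 1] = x0 * s + x1 * c;
    }
}
```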
Summary:
Implemented a Stochastic Gradient Descent (SGD) algorithm in CUDA, focusing on optimizing calculations via shared memory and other performance tricks. This implementation is designed for linear regression, consisting of weight and bias updates, where the effectiveness of using CUDA's parallel computation capabilities significantly enhances the training speed.
Key Components:
- Compute Loss:
  - The `compute_loss` kernel calculates the predictions and corresponding squared loss for each data point. The predictions are calculated using an input matrix `X`, weights `W`, and bias `b`.
- Compute Gradients:
  - The `compute_gradients` kernel computes the gradients of the loss with respect to weights and bias. It uses shared memory (`db_shared`) for the computation of bias gradients, reducing global memory access and improving performance.
- Update Weights:
  - The `update_weights` kernel updates the weights and bias based on the calculated gradients and the learning rate.
- Training Function:
  - `train_sgd` orchestrates the memory management, kernel launches, and data transfers between host and device for the entire training loop over a specified number of epochs.
- Continued the exploration of optimization techniques for CUDA applications, focusing on strategies for improving kernel performance and reducing latency.
- Reviewed literature on best practices for applying SGD and other optimization algorithms in machine learning frameworks, considering both theoretical and practical aspects.
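As a small illustration of the update step described above, here is a sketch of what a kernel like `update_weights` could look like (the layout and any names beyond those already mentioned are my assumptions):

```cuda
__global__ void update_weights(float *W, float *b, const float *dW,
                               const float *db, float lr, int numFeatures) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numFeatures)
        W[i] -= lr * dW[i];          // gradient step on each weight
    if (i == 0)
        *b -= lr * *db;              // single bias updated by one thread
}
```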
Summary:
Developed the Expectation-Maximization (EM) algorithm in CUDA for clustering 1D data into a specified number of clusters. The implementation includes E-step and M-step kernels to maximize the expected likelihood function iteratively.
Key Components:
- E-step Kernel (`eStepKernel`): Calculates the responsibilities (probabilities) of each data point belonging to each cluster based on the current parameters (means, standard deviations, mixing coefficients).
- M-step Kernel (`mStepKernel`): Updates the parameters for each cluster (mean, variance, and mixing coefficients) based on the responsibilities calculated in the E-step.
Learnings:
- Gained practical experience with CUDA memory management, including data allocation and copying between host and device.
- Understood how to implement parallel reduction using atomic operations for accumulating sums across threads in the M-step phase.
- Familiarized with Gaussian mixture modeling concepts and the iterative nature of the EM algorithm, enhancing the ability to cluster data points effectively.
- Reviewed relevant literature on Gaussian Mixture Models and the EM algorithm's applications in various domains, including image processing and statistical analysis.
Summary:
Implemented the SwiGLU (Swish-Gated Linear Unit) activation function in CUDA. This kernel computes the activation function in a parallelized manner, enhancing the performance of neural network models that utilize the SwiGLU activation. The implementation focuses on efficient computation and memory access patterns to optimize performance on CUDA-capable GPUs.
Key Components:
- Kernel Function:
  - The `swiglu_kernel` computes the SwiGLU output by first performing matrix multiplications with weight matrices `W1` and `W2`, followed by applying the sigmoid function to the results. This approach allows the kernel to compute the outputs for multiple batches and dimensions concurrently.
- Memory Management:
  - Memory allocation and deallocation on the GPU are managed using `cudaMalloc` and `cudaFree`, ensuring efficient usage of GPU resources. Input matrices are copied to the device memory using `cudaMemcpy`.
- Debugging Information:
  - Added debugging print statements within the GPU kernel to help verify correctness of the initial output values. This assists in tracking computations and identifying potential issues during development.
- Manual Verification:
  - Performed manual computations for the first output element on the CPU to verify the correctness of the CUDA implementation, ensuring the output matches the sequential computation results.
Summary:
Implemented and tested a custom atomic addition function (`atomicAddLL`) for 64-bit integers (`long long`) in CUDA. This implementation leverages the `atomicCAS` (compare-and-swap) operation to ensure thread safety when multiple threads attempt to update the same memory location concurrently. The kernel accumulates thread indices into a shared `long long` variable using the `atomicAddLL` function.
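A sketch of this `atomicCAS`-based 64-bit atomic add (the cast to `unsigned long long` is needed because that is the type `atomicCAS` operates on):

```cuda
__device__ long long atomicAddLL(long long *address, long long val) {
    unsigned long long *addr = (unsigned long long *)address;
    unsigned long long old = *addr, assumed;
    do {
        assumed = old;
        old = atomicCAS(addr, assumed,
                        (unsigned long long)((long long)assumed + val));
    } while (assumed != old);   // retry if another thread changed the value
    return (long long)old;      // previous value, like the built-in atomics
}
```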
Summary:
Implemented a parallelized version of Monte Carlo Tree Search (MCTS) using CUDA. The goal was to accelerate the rollout phase of MCTS by leveraging GPU parallelism. The implementation includes a simple game state structure, a rollout function, and a CUDA kernel to perform multiple simulations in parallel.
Summary:
I implemented a CUDA-based Gaussian Histogram Loss (HL-Gauss), which computes probabilities for a histogram-based representation of data using the Gaussian kernel's cumulative distribution function. The implementation makes use of massively parallel computation on the GPU to efficiently handle large datasets. This approach is particularly useful in reinforcement learning, as outlined by work from DeepMind and related research papers.
Summary:
Implemented a Mirror Descent algorithm in CUDA, which applies different mirror maps to optimize gradient-based updates efficiently.
Mirror Descent is a generalization of gradient descent that allows updates in a transformed space, making it effective for constrained optimization problems.
Summary:
This CUDA implementation simulates the time evolution of a quantum wavefunction using the finite difference method. The program calculates the wavefunction's evolution based on the time-dependent Schrödinger equation in a harmonic potential.
Summary:
This CUDA program performs a series of matrix operations using GPU acceleration, demonstrating the use of CUDA kernels for matrix addition, scaling, squaring, and offsetting while measuring performance with and without CUDA Graphs. It includes element-wise matrix addition, scalar multiplication, squaring of elements, and offset addition. The program also compares execution time using traditional CUDA execution versus CUDA Graphs, leveraging CUDA streams and events for optimized performance measurement. Finally, it verifies correctness by comparing GPU-computed results with CPU-verified results.
For more details check: Cuda_graphs
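For reference, a hedged sketch of capturing a short launch sequence into a CUDA graph with stream capture and replaying it; the kernel and function names here are placeholders, not the repo's:

```cuda
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

void runWithGraph(float *d_data, int n, int iterations) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    int threads = 256, blocks = (n + threads - 1) / threads;

    // Capture the launch sequence into a graph instead of executing it.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    scaleKernel<<<blocks, threads, 0, stream>>>(d_data, 2.0f, n);
    scaleKernel<<<blocks, threads, 0, stream>>>(d_data, 0.5f, n);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    for (int i = 0; i < iterations; ++i)
        cudaGraphLaunch(graphExec, stream);   // replay with low launch overhead
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}
```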
This is a special day as it marks one month of coding GPU kernels daily! During this time, I have implemented various kernels supporting AMD GPUs, including:
- Vector Addition
- Vector-Matrix Multiplication
- GELU Activation
- Layer Normalization
- Matrix Transpose
- 2D Convolution
- Flash Attention
- Prefix Sum
- Partial Sum
- Parallel Merge
- Sparse Matrix-Vector Multiplication
- ROPE (Rotary Position Embedding)
- Matrix Addition
- rocBLAS Vector Operations
- rocBLAS Matrix Multiplication
Compiled and tested on an AMD MI250 (128 cores/node + 1 TB, 14 CPUs).
I wrote a blog post for this special day: https://hamdi.bearblog.dev/my-one-month-journey-into-gpu-programming/
- Day 15 - mandatory FA2-forward (Done)
- Day 20 - mandatory FA2-backwards (Done)