-
Notifications
You must be signed in to change notification settings - Fork 374
Using Managed Memory in Any Fortran Compiler in OpenMP and OpenACC
Managed memory is a very useful tool when porting a code to GPUs. In some cases, there are hundreds of kernels and hundreds of variables to manage, and tracking dependencies can be both burdensome and error prone. In many cases, the CUDA runtime will give faster simulations with managed memory than explicit data statements if only because it's hard for a human to track in, out, and inout dependencies over hundreds of kernels and variables. The CUDA runtime automatically minimized the amount of data movement for you, and it doesn't introduce bugs like you possibly could.
Managed memory will not give good performance in two cases:
- If you have intermittent CPU work that interrupts GPU kernels and touches the same data that's used in the GPU kernels, then that shuffling back and forth will be slower than if you put in data statements yourself
- If you do not
prefetch
your data, then it is not paged to the GPU until first touch in a GPU kernel. What this ultimately looks like in annvprof
profile is absurdly large kernel runtimes due to the latency of that fetch. But the profiler will not tell you that it's due to Managed Memory fetches because it doesn't have that information available. It just looks like kernels taking much longer than they should. The solution to this is to usecudaMemPrefetchAsync(...)
.
So long as your work is continuously on the GPU (no CPU interruptions touching data used in GPU kernels), and you prefetch your data upon allocation with cudaMallocManaged()
, then there is absolutely no performance degradation with CUDA Managed Memory. In fact, for larger complex codes, if these two criteria are met, you many notice it doing a better job than you did with your explicit data statements.
I'm glad you asked! It turns out you, in fact, can use this with GNU OpenACC and OpenMP offload, PGI OpenACC, and XL OpenMP offload. To start with, you need to stop using automatic Fortran arrays, and you need to turn them into allocatable
arrays and allocate
and deallocate
them instead. This has the benefit that valgrind
can now check that memory for incorrect memory accesses. But it's also the only way that data can be used in CUDA Managed Memory. The only data that should be in Fortran automatic arrays are small arrays that are not dimensioned the size of any of the loops. Essentially only "private" arrays should be automatic in Fortran.
Next, you'll need to change these arrays from allocatable
to pointer, contiguous
. The reason is that you're going to allocate them in C++ instead because Fortran doesn't have the capability to use CUDA Managed Memory unless you explicitly use CUDA Fortran, which has no path forward on future machines. To allocate in C++ and use it in Fortran, you'll have to use the iso_c_binding
's c_f_pointer()
routine to turn an allocation into a Fortran pointer
. The c_f_pointer()
routine basically adds the necessary Fortran array descriptors to make that allocation useful in Fortran. The point of the contiguous
specifier is to help the compiler optimize use of the pointer by declaring it cannot point into strided memory (which Fortran quite unfortunately allows).
Next, you're going to have to change the allocate( var( 0:nx+1 , 0:ny+1 ) )
with a Fortran interface to a C++ routine such as gator_allocate(var,[0,0],[nx,ny])
, assuming we're using the gator
pool allocator with managed memory. What it's doing under the hood is:
module gator_mod
use iso_c_binding
[...]
interface
function gator_allocate_c( bytes ) result(ptr) bind(C, name="gatorAllocate")
use iso_c_binding
type(c_ptr) :: ptr
integer(c_size_t), value :: bytes
end function gator_allocate_c
end interface
[...]
interface gator_allocate
[...]
module procedure :: gator_allocate_real4_2d
[...]
end interface gator_allocate
contains
[...]
subroutine gator_allocate_real4_2d( arr , dims , lbounds_in )
integer, parameter :: ndims = 2
real , pointer , intent( out) :: arr (:,:)
integer , intent(in ) :: dims (ndims)
integer, optional, intent(in ) :: lbounds_in(ndims)
integer :: lbounds(ndims)
type(c_ptr) :: data_ptr
if (present(lbounds_in)) then
lbounds = lbounds_in
else
lbounds = 1
endif
data_ptr = gator_allocate_c( int(product(dims)*sizeof(r4),c_size_t) )
call c_f_pointer( data_ptr , arr , dims )
arr(lbounds(1):,lbounds(2):) => arr
end subroutine gator_allocate_real4_2d
[...]
end module gator_mod
extern "C" void* gatorAllocate( size_t bytes ) {
void* ptr;
cudaMallocManaged(&ptr,bytes);
cudaMemPrefetchAsync(ptr,bytes,0);
cudaDeviceSynchronize();
#ifdef _OPENMP45
omp_target_associate_ptr(ptr,ptr,bytes,0,0);
#endif
#ifdef _OPENACC
acc_map_data(ptr,ptr,bytes);
#endif
return ptr;
}
What this does is allocate the memory pool using cudaMallocManaged
, then prefetch the entire pool to the GPU with cudaMemPrefetchAsync
. Then the real magic happens. With only these two routines, the OpenMP and OpenACC runtimes have no idea that the CPU pointers being used in GPU kernels are using Managed Memory. Consequently, XL, PGI, and Cray runtimes will create their own GPU copies and completely ignore that you are using Managed Memory (thus, really bad performance). To fix this, we use omp_target_associate_ptr
in OpenMP offload and acc_map_data
in OpenACC to trick the OpenMP and OpenACC runtimes into thinking that the CPU pointer is the GPU pointer. And then, the runtime doesn't do anything with the data because it thinks the data is already present on the GPU (which, in fact, it truly is but through CUDA Managed Memory rather than the OpenMP and OpenACC interfaces).
Not all arrays are created equal in Fortran. Different forms of Fortran arrays have different amounts of meta data associated with them that are handled differently by different compilers, and there is nothing in the Fortran standard that constrains how to handle these things. Pointers have the most metadata attached to them, contiguous
pointers have slightly less metadata, assumed shape arrays have even less metadata, and finally explicitly sized and automatic arrays have the least metadata. When you run a pointer in a GPU kernel, even though the data being pointed to is on the GPU, there's no way to get the metadata on the GPU in current implementations, and that will have to be copied to the GPU, which adds a fair amount of time to smaller kernels.
The problem is that when you use your fancy new pointer, contiguous
data in kernels, in nvprof
you'll notice small amounts of data being copied to the GPU before each kernel (often multiple things being copied for each pointer used). That's the array descriptors being copied by the OpenMP runtime with cudaMemcpy
for you under the hood. This isn't a problem for very large kernels, but rarely do we have that luxury in climate, where we strong scale to the point that kernels run as little as 1-10 microseconds.
To get rid of these array descriptor copies, you have to pass the pointers through a function
or subroutine
interface and accept them as normal Fortran arrays. For most compilers, accepting a pointer as an assumed shape array (i.e., arrayname(:,:,:)
) will get rid of these extraneous copies. However, some compilers require an explicitly sized array to get rid of all copies to the GPU.
- Change automatic arrays to
pointer, contiguous
, explicitly allocate, and explicitly deallocate - Change
allocate()
anddeallocate()
to Fortran bindings to C++ routines that do the allocations and deallocations for you - In the C++ allocation, use
cudaMallocManaged()
,cudaMemPrefetchAsync()
- In OpenMP, also use
omp_target_associate_ptr()
- In OpenACC, also use
acc_map_data()
-
cudaMallocManaged
is very expensive, so ideally you should use a pool allocator, which pays this cost once for a large block of memory and then hands out chunks of that block at a much lower cost.
As an example of a simple code that does this, see Pool Alligator
PGI: Note that PGI can enable this by default with -ta=nvidia,managed
, and it will turn on its own pool allocator. But if you want to use another compiler, at the moment, you'll need to use a custom C++ implementation with Fortran bindings.