-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Single-precision support for CUDA variants #91
Conversation
555366a
to
67d1265
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this looks great, definitely worth doing over replicating source for single precision. I've left a few remarks where things seemed a bit off, otherwise very happy with this.
NB: I haven't tested this myself, yet, which I would do before merging
// # pragma omp parallel for default(shared) private(b, bsize, jk) \ | ||
// reduction(min:zminval) reduction(max:zmaxval,zmaxerr) reduction(+:zerrsum,zsum) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should likely be removed entirely?
// # pragma omp parallel for default(shared) private(b, bsize, jl, jk) \ | ||
// reduction(min:zminval) reduction(max:zmaxval,zmaxerr) reduction(+:zerrsum,zsum) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
// # pragma omp parallel for default(shared) private(b, bsize, jl, jk, jm) \ | ||
// reduction(min:zminval) reduction(max:zmaxval,zmaxerr) reduction(+:zerrsum,zsum) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
src/cloudsc_cuda/cloudsc/dtype.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably needs single include guards:
#ifndef CLOUDSC_DTYPE_H
#define CLOUDSC_DTYPE_H
...
#endif
@@ -12,6 +12,7 @@ | |||
|
|||
#include <omp.h> | |||
#include "mycpu.h" | |||
// #include "dtype.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be here but has likely caused problems because of duplicate definitions without the include guards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The include is within cloudsc_driver.h
and cloudsc_driver.h
is included in cloudsc_driver.cu
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably have a #include "dtype.h"
?
|
||
#pragma omp parallel for default(shared) private(b, i, buf_start_idx, buf_idx) | ||
#pragma omp parallel for default(shared) private(b, l, i, buf_start_idx, buf_idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The l
appears unused?
|
||
#pragma omp parallel for default(shared) private(b, i, buf_start_idx, buf_idx) | ||
#pragma omp parallel for default(shared) private(b, l, i, buf_start_idx, buf_idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The l
appears to be unused?
|
||
dtype (*buffer)[nlev][nlon] = (dtype (*)[nlev][nlon]) buffer_in; | ||
dtype (*field)[nclv][nlev][nproma] = (dtype (*)[nclv][nlev][nproma]) field_in; | ||
|
||
#pragma omp parallel for default(shared) private(b, buf_start_idx, buf_idx, l, i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably also have c
as private?
} | ||
|
||
void load_and_expand_1d_int(serialboxSerializer_t *serializer, serialboxSavepoint_t* savepoint, | ||
const char *name, int nlon, int nproma, int ngptot, int nblocks, int *field) | ||
{ | ||
int buffer[nlon]; | ||
double buffer[nlon]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should load ints, why does the buffer have to be double?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no reason ... not sure why I changed that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great and tested building and running on AC. Many thanks!
Tested via:
./cloudsc-bundle build --clean --build-dir=build-sp-cuda-hdf5 --arch=arch/ecmwf/hpc2020/nvhpc/22.11 --with-cuda --with-gpu --single-precision
./cloudsc-bundle build --clean --build-dir=build-dp-cuda-hdf5 --arch=arch/ecmwf/hpc2020/nvhpc/22.11 --with-cuda --with-gpu
./cloudsc-bundle build --clean --build-dir=build-sp-cuda-serialbox --arch=arch/ecmwf/hpc2020/nvhpc/22.11 --with-cuda --with-gpu --single-precision --with-serialbox
./cloudsc-bundle build --clean --build-dir=build-dp-cuda-serialbox --arch=arch/ecmwf/hpc2020/nvhpc/22.11 --with-cuda --with-gpu --with-serialbox
and
cd build-...
and. ./env.sh
./bin/dwarf-cloudsc-c-cuda 1 65536 32
./bin/dwarf-cloudsc-c-cuda-hoist 1 65536 32
./bin/dwarf-cloudsc-c-cuda-k-caching 1 65536 32