-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for blocked offload of fields to device #83
base: main
Are you sure you want to change the base?
Conversation
…lk_bounds arg to copy methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @wertysas for this excellent piece of work 🙏 You've added very useful new functionality in a super clean way, well done! Kudos also for taking what must have been a difficult rebase in your stride 😅
I've suggested a few minor fixes and one more significant one to the CREATE_DEVICE_DATA
logic. It would also be good to expand the testing to cover the latter.
CLASS(${ftn1}$_WRAPPER_HELPER) :: SELF | ||
INTEGER(KIND=JPIM), OPTIONAL, INTENT(IN) :: BLK_BOUNDS(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we not passing the BLK_BOUNDS
through to SELF%PARENT%CREATE_DEVICE_DATA
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for catching this! I've fixed that now.
@@ -93,11 +94,12 @@ CONTAINS | |||
END SUBROUTINE | |||
|
|||
#:for what in ['HOST', 'DEVICE'] | |||
SUBROUTINE ${ftn1}$_GET_${what}$_DATA_WRAPPER_HELPER (SELF, MODE, PTR, QUEUE) | |||
SUBROUTINE ${ftn1}$_GET_${what}$_DATA_WRAPPER_HELPER (SELF, MODE, PTR, QUEUE, BLK_BOUNDS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here about passing through BLK_BOUNDS
.
@@ -250,8 +252,9 @@ CONTAINS | |||
|
|||
END SUBROUTINE ${ftn}$_GANG_${type}$_DELETE_DEVICE_DATA | |||
|
|||
SUBROUTINE ${ftn}$_GANG_${type}$_CREATE_DEVICE_DATA (SELF) | |||
SUBROUTINE ${ftn}$_GANG_${type}$_CREATE_DEVICE_DATA (SELF, BLK_BOUNDS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here about passing through BLK_BOUNDS
.
src/core/field_RANKSUFF_module.fypp
Outdated
|
||
|
||
IF ( .NOT. PRESENT(BLK_BOUNDS) ) THEN | ||
CALL CPU_TIME(START) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These timers should be around this entire conditional.
src/core/field_RANKSUFF_module.fypp
Outdated
LBOUNDS=LBOUND(SELF%PTR) | ||
IF (.NOT. ASSOCIATED (SELF%DEVPTR)) THEN | ||
CALL SELF%CREATE_DEVICE_DATA | ||
CALL SELF%CREATE_DEVICE_DATA(BLK_BOUNDS=BLK_BOUNDS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should always call CREATE_DEVICE_DATA
and inside that routine check whether the device allocation is big enough for the transfer we want to do. The current implementation will fail if we've done a "blocked" transfer for a field and then try and do a non-blocked transfer without wiping the device memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I have moved the logic to the CREATE_DEVICE_DATA
routine now and added the missing delete.
|
||
END SUBROUTINE ${ftn}$_GET_DEVICE_DATA_RDWR | ||
|
||
SUBROUTINE ${ftn}$_SYNC_DEVICE_RDWR (SELF, QUEUE) | ||
SUBROUTINE ${ftn}$_SYNC_DEVICE_RDWR (SELF, QUEUE, BLK_BOUNDS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to pass the BLK_BOUNDS
through here too.
IF ( PTR_CPU(I,3) /= 37 ) THEN | ||
OKAY =.FALSE. | ||
END IF | ||
END DO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add here another offload round but this time for the full field, to test the CREATE_DEVICE_DATA
mechanics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that should definitely be covered by the test, I have added another offload round that triggers the "reallocation".
6379731
to
27b24a5
Compare
…te_device_data and updated test
27b24a5
to
c8eb929
Compare
Thanks a lot again for this @wertysas. After a lengthy offline discussion with @mlange05 and @wertysas, we realised that the current implementation has an inconsistency when it comes to the status tracking; a partial offload of a field shouldn't set the entire field dirty/fresh on host/device. Furthermore, we realised that the existing asynchronous functionality is also incompatible with the status tracking. Consider the following example which would lead to undefined behaviour:
The status tracking should therefore only be reserved for the default use case of offloading entire fields synchronously. Anything more complex than that, e.g. asynchronous or partial offload, should go via an "expert" mode, where responsibility is ceded to the user and they are able to override the status checks. As such, @wertysas could you please prepare a PR where you enable such an expert mode that overrides the field status tracking? The current PR should be converted to a draft and rebased once the expert mode functionality is merged. |
This PR adds the ability to offload partial fields, referred to as blocks, to the device. This corresponds to PR2 described in the issue Add support for partial offloading of field API fields to device #68.
A new optional dummy argument
BLK_BOUNDS
, an array of two integers that hold the lower and upper index of the block to be offloaded, is added to all data transfer methods. If the field is not already allocated on device the new allocation of the field on the device will only be the size of the block.The implementation has been tested on the cloudsc-dwarf on GPU nodes (NVIDIA A100) of ECMWF's Atos supercomputer by running instances with 10x larger fields than what is possible without blocking. The results from such an experiment is shown in the plot below. The blocked version
SCC-FIELD
is not only able to handle larger problem loads but also displays better performance than the non-blocked OpenACC variantSCC
.