Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for blocked offload of fields to device #83

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

wertysas
Copy link
Contributor

This PR adds the ability to offload partial fields, referred to as blocks, to the device. This corresponds to PR2 described in the issue Add support for partial offloading of field API fields to device #68.

A new optional dummy argument BLK_BOUNDS, an array of two integers that hold the lower and upper index of the block to be offloaded, is added to all data transfer methods. If the field is not already allocated on device the new allocation of the field on the device will only be the size of the block.

The implementation has been tested on the cloudsc-dwarf on GPU nodes (NVIDIA A100) of ECMWF's Atos supercomputer by running instances with 10x larger fields than what is possible without blocking. The results from such an experiment is shown in the plot below. The blocked version SCC-FIELD is not only able to handle larger problem loads but also displays better performance than the non-blocked OpenACC variant SCC.

partial-offload

Copy link
Collaborator

@awnawab awnawab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @wertysas for this excellent piece of work 🙏 You've added very useful new functionality in a super clean way, well done! Kudos also for taking what must have been a difficult rebase in your stride 😅

I've suggested a few minor fixes and one more significant one to the CREATE_DEVICE_DATA logic. It would also be good to expand the testing to cover the latter.

CLASS(${ftn1}$_WRAPPER_HELPER) :: SELF
INTEGER(KIND=JPIM), OPTIONAL, INTENT(IN) :: BLK_BOUNDS(2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we not passing the BLK_BOUNDS through to SELF%PARENT%CREATE_DEVICE_DATA?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for catching this! I've fixed that now.

@@ -93,11 +94,12 @@ CONTAINS
END SUBROUTINE

#:for what in ['HOST', 'DEVICE']
SUBROUTINE ${ftn1}$_GET_${what}$_DATA_WRAPPER_HELPER (SELF, MODE, PTR, QUEUE)
SUBROUTINE ${ftn1}$_GET_${what}$_DATA_WRAPPER_HELPER (SELF, MODE, PTR, QUEUE, BLK_BOUNDS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here about passing through BLK_BOUNDS.

@@ -250,8 +252,9 @@ CONTAINS

END SUBROUTINE ${ftn}$_GANG_${type}$_DELETE_DEVICE_DATA

SUBROUTINE ${ftn}$_GANG_${type}$_CREATE_DEVICE_DATA (SELF)
SUBROUTINE ${ftn}$_GANG_${type}$_CREATE_DEVICE_DATA (SELF, BLK_BOUNDS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here about passing through BLK_BOUNDS.



IF ( .NOT. PRESENT(BLK_BOUNDS) ) THEN
CALL CPU_TIME(START)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These timers should be around this entire conditional.

LBOUNDS=LBOUND(SELF%PTR)
IF (.NOT. ASSOCIATED (SELF%DEVPTR)) THEN
CALL SELF%CREATE_DEVICE_DATA
CALL SELF%CREATE_DEVICE_DATA(BLK_BOUNDS=BLK_BOUNDS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should always call CREATE_DEVICE_DATA and inside that routine check whether the device allocation is big enough for the transfer we want to do. The current implementation will fail if we've done a "blocked" transfer for a field and then try and do a non-blocked transfer without wiping the device memory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I have moved the logic to the CREATE_DEVICE_DATA routine now and added the missing delete.


END SUBROUTINE ${ftn}$_GET_DEVICE_DATA_RDWR

SUBROUTINE ${ftn}$_SYNC_DEVICE_RDWR (SELF, QUEUE)
SUBROUTINE ${ftn}$_SYNC_DEVICE_RDWR (SELF, QUEUE, BLK_BOUNDS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to pass the BLK_BOUNDS through here too.

IF ( PTR_CPU(I,3) /= 37 ) THEN
OKAY =.FALSE.
END IF
END DO
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add here another offload round but this time for the full field, to test the CREATE_DEVICE_DATA mechanics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that should definitely be covered by the test, I have added another offload round that triggers the "reallocation".

@wertysas wertysas force-pushed the je-partial-field-offload branch from 6379731 to 27b24a5 Compare February 13, 2025 16:33
@wertysas wertysas force-pushed the je-partial-field-offload branch from 27b24a5 to c8eb929 Compare February 14, 2025 08:05
@awnawab
Copy link
Collaborator

awnawab commented Feb 18, 2025

Thanks a lot again for this @wertysas. After a lengthy offline discussion with @mlange05 and @wertysas, we realised that the current implementation has an inconsistency when it comes to the status tracking; a partial offload of a field shouldn't set the entire field dirty/fresh on host/device.

Furthermore, we realised that the existing asynchronous functionality is also incompatible with the status tracking. Consider the following example which would lead to undefined behaviour:

CALL FIELD%SYNC_DEVICE_RDWR(QUEUE=1)
CALL FIELD%SYNC_HOST_RDWR()

The status tracking should therefore only be reserved for the default use case of offloading entire fields synchronously. Anything more complex than that, e.g. asynchronous or partial offload, should go via an "expert" mode, where responsibility is ceded to the user and they are able to override the status checks.

As such, @wertysas could you please prepare a PR where you enable such an expert mode that overrides the field status tracking? The current PR should be converted to a draft and rebased once the expert mode functionality is merged.

@wertysas wertysas marked this pull request as draft February 18, 2025 21:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants