-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds API for non-caching-store and load #1476
base: develop
Are you sure you want to change the base?
Adds API for non-caching-store and load #1476
Conversation
created a new PR with both load and store. Added test cases for byte, 2-byte, 4, 8, 16. closed the old pr #1289 |
@ranapratap55 can you clarify if the non-caching load and store can work over coarse grained device memory, which is the memory type used in your tests? |
yes, it works on coarse-grained memory. I have used the hipMalloc() which allocates coarse-grained memory. |
Can you port and test https://github.com/ROCm/rccl/tree/develop/tools/p2p-latency-test to use the new API, and add the commit to this PR? This is more accurate testing for RCCL use cases. Please try changing memory from "uncached" to coarse-grained memory in these tests. |
#else | ||
#define BITS "glc slc dlc" | ||
#endif | ||
#define WAIT ((0 << 14) | (0x3f << 8) | (0x7) << 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a comment where you got the WAIT values from
__attribute__((always_inline)) | ||
__host__ __device__ T __non_caching_load(const T* p) | ||
{ | ||
#if !defined(__GFX11__) && !defined(GFX12) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two families of instructions (e.g *_u8
vs *_ubyte
). These need to have some coverage in the tests pointed out by @wenkaidu by targeting 1 arch from each.
return; | ||
} | ||
|
||
int main(int argc, char **argv) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add validation cpp?
For example, cover memory consistency cases of read-after-write, write-after-write, read-after-read and so on
RCCL provides "low latency" protocols for communication between agents, where the entire message consisting of data and flags is packed into a single L2 cache line. This is usually accomplished using atomic relaxed instructions in LLVM. But the 128-byte version of this protocol (LL-128) requires 128-bit load or store instructions that bypass the cache and are not broken up into multiple instructions. The nontemporal builtin is not always suitable for this use case.
The proposed approach is to provide a C++ function template that encapsulates an inline assembly call. This asm is intended to use the appropriate load/store parameters for each combination of data size and architecture.