Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MSCCLPP user buffer registration APIs and integrate with RCCL #1477

Merged
merged 14 commits into from
Jan 14, 2025

Conversation

nusislam
Copy link
Contributor

@nusislam nusislam commented Jan 2, 2025

Details

This PR adds user buffer registration APIs in MSCCLPP and integrates them with the corresponding RCCL APIs.

Work item: "Internal", or link to GitHub issue (if applicable).

What were the changes?
A patch has been added in ext-src. This patch will be applied to MSCCLPP during RCCL build. Also MSCCLPP related changes have been added to the RCCL user buffer registration code.

Why were the changes made?
These changes are required to use MSCCLPP in non-HIP-graph mode.

How was the outcome achieved?
Technical details behind the work. Explain any publicly-available hardware peculiarities.

Additional Documentation:
What else should the reviewer know?

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

@nusislam nusislam changed the title ext-src: add MSCCLPP memory registration APIs Add MSCCLPP user buffer registration APIs and integrate with RCCL Jan 3, 2025
@nusislam nusislam marked this pull request as ready for review January 3, 2025 03:26
+}
+
+NCCL_API ncclResult_t ncclCommDeregister(ncclComm_t comm, void* handle) {
+ if (comm && handle) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we free any resources associated with handle here?
Example:

auto it = comm->channelScratchInfos.find(buffKey);
if (it != comm->channelScratchInfos.end()) {
  comm->channelScratchInfos.erase(it);
}

Same thing can be done for channelOut and channelIn

Comment on lines 71 to 74
+NCCL_API ncclResult_t ncclCommDeregister(ncclComm_t comm, void* handle) {
+ if (comm && handle) {
+ handle = nullptr;
+ }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation nullifies the handle inside the function, but since it's passed as void*, the change doesn't reflect in the caller's context. This can lead to dangling pointers. I suggest passing handle as a void** to resolve this issue, if API change is possible

Suggested change
+NCCL_API ncclResult_t ncclCommDeregister(ncclComm_t comm, void* handle) {
+ if (comm && handle) {
+ handle = nullptr;
+ }
+NCCL_API ncclResult_t ncclCommDeregister(ncclComm_t comm, void** handle) {
+ if (comm && handle && *handle) {
+ *handle = nullptr;
+ }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a NCCL API and cannot be changed. I am still working on some of these fixes.

+ std::vector<mscclpp::RegisteredMemory> remoteMemories;
+
+ // Creating the channels
+ auto buffIt = comm->channelScratchInfos.find(buffKey);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's no guarantee this will be called from a single thread, we probably need to lock access to channelScratchInfos and the rest of the maps. I suggest using one lock if this function doesn't fall on the critical path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This map is supposed to be checked by the host process and MSCCLPP is invoked only when using one GPU per process case.

/*
* Send
*
diff --git a/apps/nccl/src/nccl.cu b/apps/nccl/src/nccl.cu
index a697be2..d8497e7 100644
--- a/apps/nccl/src/nccl.cu
+++ b/apps/nccl/src/nccl.cu
@@ -577,6 +577,58 @@ NCCL_API ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t
@@ -577,6 +577,67 @@ NCCL_API ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change does not seem to be necessary

src/register.cc Outdated
cache->population -= 1;

#ifdef ENABLE_MSCCLPP
if (comm->mscclCompatible){
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you should do the same checks that you did before registering the buffer.
if (comm->mscclCompatible && size > 0 && (size & 31) == 0 && size <= comm->mscclpp_threshold){

Comment on lines 12 to 15
+ncclResult_t ncclCommRegister(ncclComm_t comm, void* buff, size_t size, void** handle);
+ncclResult_t ncclCommDeregister(ncclComm_t comm, void* handle);
+ncclResult_t ncclBuffIsRegistered(ncclComm_t comm, const void* buff, size_t count, bool* registered);
+size_t ncclBufferSize(ncclComm_t comm, void* handle);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're inserting functions into mscclpp, you can just give them names such as mscclppCommRegister, and then there's no need to rename via mscclpp_nccl_syms.txt.

Copy link
Contributor Author

@nusislam nusislam Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ncclCommRegister() and ncclCommDeRegister() are NCCL APIs. We need to keep these names so that a workload can also be run with the mscclpp backend.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious why ncclCommRegister and ncclCommDeRegister are not implement in mscclpp? The stub functions already exist: https://github.com/microsoft/mscclpp/blob/6d26b92665383b60ffae7e4dd9ab4cc047d34721/apps/nccl/src/nccl.cu#L809C1-L817C2

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nusislam Should these changes be submitted as a mscclpp PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, yes. But at the moment, will use this patch.

@nusislam
Copy link
Contributor Author

nusislam commented Jan 8, 2025

@corey-derochie-amd - Any more reviews?

@isaki001 isaki001 requested a review from JhaShweta1 as a code owner January 8, 2025 21:53
@isaki001
Copy link
Contributor

isaki001 commented Jan 8, 2025

There is a typo I need to fix. Will push commit shortly

src/register.cc Outdated
Comment on lines 169 to 171
else{
WARN("MSCCL++: Cannot register user-buffers on managed memory");
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will do nothing, issue a warning, then fall through and return ncclSuccess. Is that the right behaviour? Should it not return an error?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the expectation is that if no registration happens in MSCCL++ due to the buffer being in managed memory, no MSCCL++ or RCCL registration will happen and execution will fall back to RCCL kernel without UBR.

I actually need to do a small adjustment to not cause a RCCL deregistration error when MSCCL++ detects no registered buffer because of this. I just tested a simple fix for this.

But since we always fallback to RCCL kernel when MSCCL++ is not enabled, I think a warning without return an error is OK. Also, I haven't seen any errors returned in any UBR related routines, except when RCCL cannot find the registration handle during deregistration. Of course, this definitely makes it harder to see if anything goes wrong.

… back to rccl deRegister which will return an error
@nusislam
Copy link
Contributor Author

@isaki001 - I think the helper routines do not have to be named as NCCL APIs; the return type can also be changed. Other than this, the PR looks good to me.

Copy link
Collaborator

@corey-derochie-amd corey-derochie-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some suggestions.

src/register.cc Outdated Show resolved Hide resolved
src/register.cc Outdated Show resolved Hide resolved
cmake/MSCCLPP.cmake Outdated Show resolved Hide resolved
cmake/MSCCLPP.cmake Outdated Show resolved Hide resolved
cmake/MSCCLPP.cmake Outdated Show resolved Hide resolved
cmake/MSCCLPP.cmake Outdated Show resolved Hide resolved
src/register.cc Outdated Show resolved Hide resolved
@isaki001
Copy link
Contributor

Thanks @corey-derochie-amd I'll add your suggestions and I'll change the names as @nusislam indicated.

isaki001 and others added 4 commits January 10, 2025 16:08
Whitespace suggestions and reducing diffs to avoid future merge conflicts

Co-authored-by: corey-derochie-amd <[email protected]>
@@ -165,11 +165,11 @@ ncclResult_t ncclCommRegister_impl(const ncclComm_t comm, void* buff, size_t siz
if(!isManagedBuffer){
INFO(NCCL_INIT, "MSCCL++: ncclCommRegister");
NCCLCHECK(mscclpp_ncclCommRegister(comm->mscclpp_comm, buff, size, handle));
return ncclSuccess;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm debating whether this is the right behaviour... Are these buffers reused in ways that might call MSCCL++ sometimes and RCCL other times? Are buffers always matched to message size, or could they use a buffer that's larger than the given message size? If so, the safest thing to do might be to register for BOTH RCCL and MSCCL++. Thoughts, @nusislam ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cases when mscclppRegister should be used vs. rcclRegister should be used are mutually exclusive.

@isaki001 isaki001 merged commit e9b6bbc into ROCm:develop Jan 14, 2025
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants