init: Avoid hang by forcing SENDRECV in case of neuron v4 API usage
The v4 API may block indefinitely when executed with the RDMA protocol,
because communicator creation (a) is a blocking operation by definition
of the v4 API and (b) performs a 4-way handshake under the RDMA
protocol. Therefore, we force the SENDRECV protocol when the
Neuron-specific API is used. We do not force the SENDRECV protocol for
the NCCL API, since there is no known platform that uses the RDMA
protocol with the v4 API. Note that on P5 instances, NCCL needs to
support a more recent API anyway.

Signed-off-by: Michael Axtmann <[email protected]>
maxtmann committed Aug 21, 2024
1 parent 7c014f0 commit 1925554
Showing 1 changed file with 8 additions and 2 deletions.
10 changes: 8 additions & 2 deletions src/nccl_ofi_interface_neuron.c
@@ -4,9 +4,16 @@

#include "config.h"

+#include <stdlib.h>
#include "nccl_ofi.h"
#include "nccl_ofi_api.h"

+static ncclResult_t init_v4(ncclDebugLogger_t logFunction)
+{
+	setenv("NCCL_OFI_PROTOCOL", "SENDRECV", 0);
+	return nccl_net_ofi_init(logFunction);
+}

static ncclResult_t getProperties_v4(int dev_id, ncclNetProperties_v4_t *props)
{
	nccl_ofi_properties_t ofi_properties;
@@ -29,10 +36,9 @@ static ncclResult_t getProperties_v4(int dev_id, ncclNetProperties_v4_t *props)
	return ncclSuccess;
}


NCCL_OFI_EXPORT_SYMBOL const ncclNet_v4_t ncclNetPlugin_v4 = {
	.name = "AWS Libfabric",
-	.init = nccl_net_ofi_init,
+	.init = init_v4,
	.devices = nccl_net_ofi_devices,
	.getProperties = getProperties_v4,
	.listen = nccl_net_ofi_listen_v4,
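Note on the `setenv` call in `init_v4`: its third argument (`overwrite`) is 0, so by POSIX semantics the variable is set only when it is not already present in the environment. The following is a minimal sketch of that behavior, not the plugin's code. `selected_protocol` and `init_v4_sketch` are hypothetical names, and the stand-in only reads the variable back, since the diff does not show how `nccl_net_ofi_init` parses `NCCL_OFI_PROTOCOL`.

```c
#define _POSIX_C_SOURCE 200112L /* for setenv() under strict -std= modes */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for nccl_net_ofi_init(): it only reports the
 * protocol value that an init routine would find in the environment. */
static const char *selected_protocol(void)
{
	const char *value = getenv("NCCL_OFI_PROTOCOL");
	return value != NULL ? value : "(unset)";
}

/* Mirrors the shape of init_v4() from the diff. With overwrite == 0,
 * setenv() leaves an existing NCCL_OFI_PROTOCOL value untouched. */
static const char *init_v4_sketch(void)
{
	if (setenv("NCCL_OFI_PROTOCOL", "SENDRECV", 0) != 0) {
		perror("setenv");
		return NULL;
	}
	return selected_protocol();
}
```

Calling `init_v4_sketch()` with the variable unset yields `SENDRECV`. If the caller has already exported `NCCL_OFI_PROTOCOL=RDMA`, the sketch returns `RDMA`, so under these assumptions the change behaves as a default for the v4 path, and an explicit user setting would still take precedence.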
