neuron: Use SENDRECV protocol on v4 net-plugin API version
Commit 4763d60 ("rdma: wait for connect response msg in
connect()") changed the RDMA protocol `connect()` to return a valid send
communicator only after a connect response message is received from
the peer. The v4 net-plugin `connect()` API is expected to
synchronously return a valid send communicator (a behaviour that
changed in v5+), so this RDMA protocol behaviour is incompatible with
the v4 `connect()` API.
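
The mismatch can be illustrated with a toy model. This is only a sketch: `toy_conn`, `toy_connect()`, and both caller functions are hypothetical names, not part of the plugin or of NCCL. It assumes a connect call that returns NULL until the peer's response has arrived, which a retrying (v5+-style) caller tolerates and a single-call (v4-style) caller does not.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model; every name here is hypothetical and not taken from the plugin. */
struct toy_conn {
	int polls_until_response; /* polls left before the peer's response "arrives" */
	int send_comm;            /* stands in for a real send communicator */
};

/* Non-blocking connect: returns NULL until the response has arrived. */
static int *toy_connect(struct toy_conn *c)
{
	if (c->polls_until_response > 0) {
		c->polls_until_response--;
		return NULL;
	}
	return &c->send_comm;
}

/* v5+-style caller: keeps calling until a communicator is available. */
static int *connect_v5_style(struct toy_conn *c, int *calls)
{
	int *comm = NULL;

	*calls = 0;
	while (comm == NULL) {
		comm = toy_connect(c);
		(*calls)++;
	}
	return comm;
}

/* v4-style caller: one call, and the result is used immediately. */
static int *connect_v4_style(struct toy_conn *c)
{
	return toy_connect(c);
}
```

With a response that needs two polls to arrive, the v4-style caller gets NULL back from its single call, while the v5+-style caller obtains the communicator on its third call.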

Because of this, commit a725e08 ("api: fail when using
connect/accept_v4 with RDMA protocol") later introduced an explicit
failure when the v4 net-plugin API is used together with the RDMA protocol.

Even though the current Neuron collectives library uses the v4 net-plugin
API, this did not cause an issue because the Neuron platform is
currently defined (by `platform_data_map[]`) to use the SENDRECV protocol
by default. However, a patch that will follow soon changes this default
for the Neuron platform to the RDMA protocol. Therefore, in
preparation for this default protocol change, we need to fix the
behaviour of the Neuron v4 net-plugin API for users whose Neuron
collectives library uses the v4 net-plugin API.

To address this, we change the Neuron v4 net-plugin `init()` API to
select the SENDRECV protocol internally if the user has not set one. If
the user has explicitly set the `OFI_NCCL_PROTOCOL` env var to the RDMA
protocol, the v4 net-plugin API will explicitly fail due to the
change introduced by a725e08 ("api: fail when using
connect/accept_v4 with RDMA protocol").
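
The "default only if unset" behaviour relies on the POSIX `setenv()` overwrite flag. A minimal standalone sketch follows; `default_protocol_to_sendrecv()` is a hypothetical helper written for illustration, not a function in the plugin, and unlike `init_v4()` it does not consult `ofi_nccl_protocol()` first.

```c
#define _POSIX_C_SOURCE 200112L /* for setenv()/unsetenv() under strict -std modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the plugin: default the protocol to
 * SENDRECV without overriding a value the user already exported.
 * Returns the effective value, or NULL if setenv() failed.
 */
static const char *default_protocol_to_sendrecv(void)
{
	/* overwrite == 0: setenv() leaves an existing value untouched */
	if (setenv("OFI_NCCL_PROTOCOL", "SENDRECV", 0) != 0) {
		return NULL;
	}
	return getenv("OFI_NCCL_PROTOCOL");
}
```

When the variable is unset, the helper returns "SENDRECV". When the user has exported `OFI_NCCL_PROTOCOL=RDMA`, the value stays "RDMA", which is the case where the v4 `connect()`/`accept()` path then fails explicitly.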

Note that a similar change is not required for the Nvidia platform v4
net-plugin `init()` API, because all the Nvidia instance types that
support the RDMA protocol also require a version of the Nvidia
collectives library (NCCL) that uses a v5+ net-plugin API.

Signed-off-by: Michael Axtmann <[email protected]>
maxtmann committed Aug 21, 2024
1 parent 7c014f0 commit cf433c8
Showing 1 changed file with 19 additions and 2 deletions.
21 changes: 19 additions & 2 deletions src/nccl_ofi_interface_neuron.c
@@ -4,8 +4,26 @@

#include "config.h"

#include <stdlib.h>
#include "nccl_ofi.h"
#include "nccl_ofi_api.h"
#include "nccl_ofi_param.h"

static ncclResult_t init_v4(ncclDebugLogger_t logFunction)
{
	/*
	 * RDMA protocol `connect()` returns a valid send communicator only
	 * after a connect response message is received from peer. Because the
	 * v4 net-plugin `connect()` API is expected to synchronously return a
	 * valid send communicator (a behaviour that was changed since v5+),
	 * this RDMA protocol behaviour is incompatible with v4 `connect()`
	 * API.
	 */
	if(ofi_nccl_protocol() == NULL) {
		setenv("OFI_NCCL_PROTOCOL", "SENDRECV", 0);
	}
	return nccl_net_ofi_init(logFunction);
}

static ncclResult_t getProperties_v4(int dev_id, ncclNetProperties_v4_t *props)
{
@@ -29,10 +47,9 @@ static ncclResult_t getProperties_v4(int dev_id, ncclNetProperties_v4_t *props)
return ncclSuccess;
}


NCCL_OFI_EXPORT_SYMBOL const ncclNet_v4_t ncclNetPlugin_v4 = {
	.name = "AWS Libfabric",
-	.init = nccl_net_ofi_init,
+	.init = init_v4,
	.devices = nccl_net_ofi_devices,
	.getProperties = getProperties_v4,
	.listen = nccl_net_ofi_listen_v4,