Commit
neuron: Use SENDRECV protocol on v4 net-plugin API version
Commit 4763d60 ("rdma: wait for connect response msg in connect()") changed the RDMA protocol `connect()` to return a valid send communicator only after a connect response message is received from the peer. The v4 net-plugin `connect()` API is expected to synchronously return a valid send communicator (a behaviour that changed in v5+), so this RDMA protocol behaviour is incompatible with the v4 `connect()` API. For this reason, commit a725e08 ("api: fail when using connect/accept_v4 with RDMA protocol") later introduced an explicit failure when the v4 net-plugin API is used together with the RDMA protocol.

Even though the current Neuron collectives library uses the v4 net-plugin API, this did not cause an issue because the Neuron platform is currently defined (by `platform_data_map[]`) to use the SENDRECV protocol by default. However, a soon-to-follow patch will change this default for the Neuron platform to the RDMA protocol. Therefore, in preparation for this Neuron default protocol change, we need to fix the behaviour of the Neuron v4 net-plugin API for cases where the user uses a Neuron collectives library that uses the v4 net-plugin API.

To address this, we change the Neuron v4 net-plugin `init()` API to internally select the SENDRECV protocol if the user has not set a protocol. If the user has explicitly set the `OFI_NCCL_PROTOCOL` env var to the RDMA protocol, the v4 net-plugin API will explicitly fail due to the change introduced by a725e08 ("api: fail when using connect/accept_v4 with RDMA protocol").

Note that a similar change is not required for the Nvidia platform v4 net-plugin `init()` API, because all the Nvidia instance types that support the RDMA protocol also require a version of the Nvidia collectives library (NCCL) that uses a v5+ net-plugin API.

Signed-off-by: Michael Axtmann <[email protected]>
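The `init()` behaviour described in this message can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from this commit: the function names `sketch_neuron_init_v4()` and `sketch_connect_v4_check()`, the `SKETCH_*` return codes, and the exact string values `"SENDRECV"` and `"RDMA"` are hypothetical. Only the `OFI_NCCL_PROTOCOL` env var name comes from the message, and the real plugin may apply the default through its own parameter handling rather than `setenv()`.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical return codes, used only in this sketch. */
enum sketch_result { SKETCH_OK = 0, SKETCH_ERR = -1 };

/*
 * Hypothetical stand-in for the Neuron v4 init() wrapper.
 * Defaults OFI_NCCL_PROTOCOL to SENDRECV only when the user has not
 * set it; an explicit user setting is left untouched.
 */
static int sketch_neuron_init_v4(void)
{
    /* overwrite == 0: setenv() keeps any value already in the environment */
    if (setenv("OFI_NCCL_PROTOCOL", "SENDRECV", 0) != 0)
        return SKETCH_ERR;
    return SKETCH_OK;
}

/*
 * Hypothetical stand-in for the v4 connect()/accept() guard described
 * for a725e08: the v4 API refuses to run with the RDMA protocol.
 * An exact, case-sensitive match on "RDMA" is an assumption.
 */
static int sketch_connect_v4_check(void)
{
    const char *proto = getenv("OFI_NCCL_PROTOCOL");

    if (proto != NULL && strcmp(proto, "RDMA") == 0)
        return SKETCH_ERR;
    return SKETCH_OK;
}
```

With the variable unset, the sketch ends up on SENDRECV and the v4 check passes; with the variable explicitly set to RDMA, `init()` leaves it alone and the v4 check fails, which matches the two cases described above.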