forked from aws/aws-ofi-nccl
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
init: Avoid hang by forcing SENDRECV in case of neuron v4 API usage
The v4 API may block indefinitely when used with the RDMA protocol because (a) communicator creation is a blocking operation by definition of the v4 API and (b) the RDMA protocol performs a 4-way handshake during communicator creation. Therefore, we force the SENDRECV protocol when the Neuron-specific API is used. We do not force the SENDRECV protocol for the NCCL API, since there is no known platform that uses the RDMA protocol with the v4 API. Note that on P5 instances, NCCL needs to support a more recent API version anyway.

Signed-off-by: Michael Axtmann <[email protected]>
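The protocol decision described in the commit message can be sketched in C as follows. This is a minimal, hypothetical illustration: `protocol_t`, `api_flavor_t`, and `select_protocol` are invented names, not the plugin's actual identifiers, and the real change may be structured differently.

```c
#include <stdio.h>

/* Hypothetical identifiers for illustration only; not the plugin's API. */
typedef enum {
    PROTOCOL_SENDRECV,
    PROTOCOL_RDMA
} protocol_t;

typedef enum {
    API_NCCL,   /* entry points called by NCCL */
    API_NEURON  /* Neuron-specific v4 entry points */
} api_flavor_t;

/*
 * Choose the transport protocol for communicator creation.
 * With the v4 API, communicator creation blocks; with RDMA it also
 * involves a 4-way handshake, so the call may block indefinitely.
 * The sketch therefore forces SENDRECV for the Neuron v4 API only.
 */
static protocol_t select_protocol(api_flavor_t api, protocol_t requested)
{
    if (api == API_NEURON && requested == PROTOCOL_RDMA) {
        fprintf(stderr, "Neuron v4 API: forcing SENDRECV instead of RDMA\n");
        return PROTOCOL_SENDRECV;
    }
    /* NCCL callers keep the requested protocol unchanged. */
    return requested;
}
```

For example, `select_protocol(API_NEURON, PROTOCOL_RDMA)` returns `PROTOCOL_SENDRECV`, while `select_protocol(API_NCCL, PROTOCOL_RDMA)` leaves RDMA in place, matching the commit's decision not to override the NCCL path.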
Showing 4 changed files with 22 additions and 3 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.