topo: Avoid grouping of multiple NICs to trainium accelerator
Since a TRN accelerator is composed of multiple cores, the number of
trainium accelerators does not necessarily reflect the number of NIC
devices that the RDMA protocol should expose to the user. Instead,
each core should have a NIC accessible for communication if that many
NICs are available.

The best approach, for now, is to remove trainium accelerators from
the list of accelerators around which NICs are grouped. Consequently,
each libfabric NIC is exposed as one NIC device to the user. This
gives trainium maximal freedom in routing data over NICs.

In the long run, a better solution might be to expose the number of
actual cores to the plugin and take that number into account when
grouping NICs.

Signed-off-by: Michael Axtmann <[email protected]>
maxtmann committed Aug 27, 2024
1 parent 3f12cad commit 414997b
Showing 1 changed file with 0 additions and 5 deletions.
5 changes: 0 additions & 5 deletions src/nccl_ofi_topo.c
@@ -17,13 +17,8 @@
 #include "nccl_ofi_math.h"
 #include "nccl_ofi_ofiutils.h"
 
-#if HAVE_CUDA
 static const uint8_t target_class_id = 0x03; /* Display controller class */
 static const unsigned short target_vendor_id = 0x10de; /* NVIDIA */
-#else
-static const uint8_t target_class_id = 0x08; /* System peripheral */
-static const unsigned short target_vendor_id = 0x1d0f; /* Amazon */
-#endif
 
 /* Maximum length of the device property read from file by function
  * get_device_property() */
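
To illustrate the effect described in the commit message, here is a minimal sketch of how a class/vendor pair like the one in this file could be used to decide whether a PCI device is an accelerator around which NICs are grouped. The struct `pci_dev_info` and the helpers `is_grouping_target()` and `count_grouping_targets()` are hypothetical names introduced only for this example; the diff does not show how the plugin actually consumes `target_class_id` and `target_vendor_id`, so treat this as an assumption rather than the plugin's implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values as they remain in nccl_ofi_topo.c once the Amazon pair is dropped. */
static const uint8_t target_class_id = 0x03;           /* Display controller class */
static const unsigned short target_vendor_id = 0x10de; /* NVIDIA */

/* Hypothetical, simplified PCI device record (not a type from the plugin). */
struct pci_dev_info {
	uint8_t class_id;         /* PCI base class code */
	unsigned short vendor_id; /* PCI vendor ID */
};

/* Hypothetical helper: true if NICs would be grouped around this device. */
static bool is_grouping_target(const struct pci_dev_info *dev)
{
	return dev->class_id == target_class_id &&
	       dev->vendor_id == target_vendor_id;
}

/* Hypothetical helper: count the grouping targets among n devices. */
static size_t count_grouping_targets(const struct pci_dev_info *devs, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (is_grouping_target(&devs[i]))
			count++;
	}
	return count;
}
```

Under these assumptions, a device with class 0x08 and vendor 0x1d0f (the pair removed by this commit) no longer matches, so a host that has only such accelerators yields zero grouping targets. Per the commit message, each libfabric NIC is then exposed as its own NIC device.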