topo: Avoid grouping of multiple NICs to trainium accelerator
Since a TRN accelerator is composed of multiple cores, the number of
trainium accelerators does not necessarily reflect the number of NIC
devices that the RDMA protocol should expose to the user. Instead,
each core should have a NIC accessible for communication if that many
NICs are available.
The best approach, for now, is to remove trainium accelerators from
the list of accelerators around which NICs are grouped. Consequently,
each libfabric NIC is exposed as one NIC device to the user. This
gives trainium maximal freedom in routing data over NICs.

In the long run, a better solution might be to expose the number of
actual cores to the plugin and take that number into account when
grouping NICs.

Signed-off-by: Michael Axtmann <[email protected]>
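
To make the effect of the change concrete, here is a minimal sketch of a simplified grouping model. It is not the plugin's actual algorithm: the helper name `exposed_nic_devices` and the even "one device per group" rule are assumptions made for illustration only. Passing 0 groups models this commit (no accelerators to group around), and passing a core count models the long-run idea from the message.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of the plugin: number of NIC devices
 * exposed to the user under a simplified grouping model.
 *
 * num_groups is the number of entities that NICs are grouped around
 * (accelerators today, possibly cores in the future). A value of 0
 * means "do not group", so every libfabric NIC is its own device.
 */
static size_t exposed_nic_devices(size_t num_nics, size_t num_groups)
{
	if (num_groups == 0 || num_groups >= num_nics) {
		/* No grouping, or no more NICs than groups: one device per NIC. */
		return num_nics;
	}
	/* Otherwise NICs are merged so that each group gets one device. */
	return num_groups;
}
```

Under this model, 8 NICs grouped around 4 accelerators collapse into 4 devices, while the same 8 NICs with grouping disabled stay visible as 8 devices.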
maxtmann committed Aug 27, 2024
1 parent 519e50d commit 2021785
Showing 1 changed file with 3 additions and 18 deletions.
21 changes: 3 additions & 18 deletions src/nccl_ofi_topo.c
@@ -17,13 +17,8 @@
 #include "nccl_ofi_math.h"
 #include "nccl_ofi_ofiutils.h"

-#if HAVE_CUDA
 static const uint8_t target_class_id = 0x03; /* Display controller class */
 static const unsigned short target_vendor_id = 0x10de; /* NVIDIA */
-#else
-static const uint8_t target_class_id = 0x08; /* System peripheral */
-static const unsigned short target_vendor_id = 0x1d0f; /* Amazon */
-#endif

 /* Maximum length of the device property read from file by function
  * get_device_property() */
@@ -160,19 +155,8 @@ static int is_accelerator_dev(hwloc_obj_t obj, bool *res)
 	   the class code. */
 	class_code = obj->attr->pcidev.class_id >> 8;

-	/*
-	 * TODO: This is still a broad match that assumes any Amazon device
-	 * registered with class "System Peripheral" is a Neuron device. While
-	 * this is true today, it might not be in the future. Filtering on this
-	 * is better than statically matching against the supported device IDs,
-	 * which we would have to manually update as newer generations get released.
-	 * In the future, we should update this to dynamically query Neuron
-	 * devices on the instance and match the hwloc node against the
-	 * discovered Neuron device BDFs.
-	 */
 	class_match = target_class_id == class_code;
 	vendor_match = obj->attr->pcidev.vendor_id == target_vendor_id;
-
 	*res = class_match && vendor_match;
 	return 0;
 }
@@ -685,8 +669,9 @@ static int propoagate_accel_group_counts(hwloc_topology_t topo)
 	hwloc_obj_t obj = NULL;

 	/* Iterate over all PCI topology nodes and find nodes
-	 * corresponding to NICs and Nvidia GPUs or Amazon Neuron devices. From
-	 * those nodes, walk up towards the root and set user data. */
+	 * corresponding to Nvidia GPUs. From those nodes, walk up
+	 * towards the root and increase group count on closest
+	 * ancestor that has NICs attached. */
 	while ((obj = hwloc_get_next_pcidev(topo, obj))) {
 		bool is_accel = false;
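
The class and vendor test that remains in is_accelerator_dev() after this change can be tried in isolation. The sketch below assumes only what the diff shows: the class code is the upper 8 bits of the 16-bit class id, and the targets are 0x03 (display controller) and 0x10de (NVIDIA). The function name `matches_accelerator` is hypothetical, and hwloc is left out so the snippet builds on its own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the diff: display controller class, NVIDIA vendor. */
static const uint8_t target_class_id = 0x03;
static const uint16_t target_vendor_id = 0x10de;

/*
 * Stand-alone version of the class/vendor test (hypothetical name).
 * pci_class_id is the 16-bit "class code:subclass" value, so the
 * class code is obtained by dropping the lower 8 bits.
 */
static bool matches_accelerator(uint16_t pci_class_id, uint16_t pci_vendor_id)
{
	uint8_t class_code = (uint8_t)(pci_class_id >> 8);

	return class_code == target_class_id &&
	       pci_vendor_id == target_vendor_id;
}
```

With the Amazon constants removed, a device reporting class 0x08 and vendor 0x1d0f no longer matches, so it is not counted as an accelerator for NIC grouping.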
