topo: Avoid grouping multiple NICs around Trainium accelerators

Since a TRN accelerator is composed of multiple cores, the number of Trainium accelerators does not necessarily reflect the number of NIC devices that the RDMA protocol should expose to the user. Instead, each core should have a NIC accessible for communication if that many NICs are available.

The best approach, for now, is to remove Trainium accelerators from the list of accelerators around which NICs are grouped. Consequently, each libfabric NIC is exposed as one NIC device to the user. This gives Trainium maximal freedom in routing data over NICs.

In the long run, a better solution might be to expose the actual number of cores to the plugin and take that number into account when grouping NICs.

Signed-off-by: Michael Axtmann <[email protected]>
maxtmann · Aug 27, 2024 · commit 2021785 · 1 parent 519e50d
Showing 1 changed file with 3 additions and 18 deletions.
diff --git a/src/nccl_ofi_topo.c b/src/nccl_ofi_topo.c
@@ -17,13 +17,8 @@
 #include "nccl_ofi_math.h"
 #include "nccl_ofi_ofiutils.h"
 
-#if HAVE_CUDA
 static const uint8_t target_class_id = 0x03;		/* Display controller class */
 static const unsigned short target_vendor_id = 0x10de;	/* NVIDIA */
-#else
-static const uint8_t target_class_id = 0x08;		/* System peripheral */
-static const unsigned short target_vendor_id = 0x1d0f;	/* Amazon */
-#endif
 
 /* Maximum length of the device property read from file by function
  * get_device_property() */
@@ -160,19 +155,8 @@ static int is_accelerator_dev(hwloc_obj_t obj, bool *res)
 	   the class code. */
 	class_code = obj->attr->pcidev.class_id >> 8;
 
-	/*
-	 * TODO: This is still a broad match that assumes any Amazon device
-	 * registered with class "System Peripheral" is a Neuron device.  While
-	 * this is true today, it might not be in the future.  Filtering on this
-	 * is better than statically matching against the supported device IDs,
-	 * which we would have to manually update as newer generations get released.
-	 * In the future, we should update this to dynamically query Neuron
-	 * devices on the instance and match the hwloc node against the
-	 * discovered Neuron device BDFs.
-	 */
 	class_match = target_class_id == class_code;
 	vendor_match = obj->attr->pcidev.vendor_id == target_vendor_id;
-
         *res = class_match && vendor_match;
         return 0;
 }
@@ -685,8 +669,9 @@ static int propoagate_accel_group_counts(hwloc_topology_t topo)
 	hwloc_obj_t obj = NULL;
 
 	/* Iterate over all PCI topology nodes and find nodes
-	 * corresponding to NICs and Nvidia GPUs or Amazon Neuron devices. From
-	 * those nodes, walk up towards the root and set user data. */
+	 * corresponding to Nvidia GPUs. From those nodes, walk up
+	 * towards the root and increase group count on closest
+	 * ancestor that has NICs attached. */
 	while ((obj = hwloc_get_next_pcidev(topo, obj))) {
 		bool is_accel = false;