PCluster 3.10.1 and 3.11.0 Slurm compute daemon node configuration differs from hardware #6449
Hi Stefan! Thank you for the detailed description. I could reproduce the same issue, and the same log entries appear on my side as well. I am actively working on this and will keep you updated! Thank you,
Hi Stefan,

ParallelCluster has never explicitly configured Sockets and Cores for Slurm nodes, so Slurm uses its defaults. This could be due to Slurm 23.11 changing the way the values for Sockets and Cores are computed. Were you able to confirm that setting the expected values for Sockets and Cores in slurm.conf resolves the performance degradation? I would not expect the lack of Sockets/Cores configuration to change scheduling behaviour enough to justify such a big regression.

Would you be able to extract some logs showing how processes are mapped to the various cores? Also, if you don't mind, can you share the cluster configuration and a potential reproducer, as well as the full Slurm config from both clusters (e.g. the output of `scontrol show config`)?

If the Sockets and Cores configuration turns out to be a red herring, here is another potential issue to look into: the Speculative Return Stack Overflow (SRSO) mitigation enabled in newer kernels.

Francesco
Hey @demartinofra - thanks for the reply! For my testing, I did set the following in the PCluster configuration to force the proper configuration:
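For reference, this is the kind of override I mean: a minimal sketch using ParallelCluster's `CustomSlurmSettings` at the compute-resource level. The queue/resource names, counts, and the exact Sockets/CoresPerSocket/ThreadsPerCore values below are illustrative (based on the physical 2-socket, 96-core, SMT-off layout of hpc6a.48xlarge), not necessarily the exact values we used:

```yaml
# Sketch only: force the Slurm node topology for a compute resource.
# Queue/resource names and counts are placeholders.
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: hpc6a
          InstanceType: hpc6a.48xlarge
          MinCount: 0
          MaxCount: 200
          Efa:
            Enabled: true
          CustomSlurmSettings:
            Sockets: 2
            CoresPerSocket: 48
            ThreadsPerCore: 1
```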
Which did yield proper configuration via slurmd (HPC6a.48xlarge on PCluster 3.11.0 with Slurm 23.11.10):
For our applications, I did note some improvement in performance and recouped a few percent of the ~40% degradation using the proper hardware configuration. So, not quite a red herring, but definitely not the solution either! Regarding the SRSO mitigation - thanks for passing this along. This is news to me and is definitely something I am going to investigate further. From what I can see, HPC6a with the PCluster 3.11 base AMI does have the patch you refer to enabled:
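(As a sketch, the mitigation state can be checked through the kernel's vulnerabilities interface; exact output varies by kernel version:)

```bash
# Check whether the SRSO (Speculative Return Stack Overflow) mitigation is active;
# this sysfs file only exists on kernels that know about the erratum.
cat /sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow

# The boot command line shows whether an explicit spec_rstack_overflow= option was passed.
cat /proc/cmdline
```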
Other than creating a custom AMI that disables this patch, do you have any suggestions for how to disable it at instance startup via PCluster? The patch seemingly can't be removed during post-install procedures because it requires an instance reboot, and once you reboot, Slurm will detect the instance as "down" and swap it out. I would rather not have to create a custom AMI if there is some other way to test this out. Thanks!
If you want to test it real quick, one option is to run the following on the compute nodes:
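(A sketch of the idea, assuming grubby is available as on AL2/AL2023: turn the mitigation off via the kernel command line for the next boot.)

```bash
# Disable the SRSO mitigation at the next boot by adding the kernel parameter
# spec_rstack_overflow=off (sketch; assumes grubby is installed).
sudo grubby --update-kernel=ALL --args="spec_rstack_overflow=off"

# Inspect the default kernel entry to confirm the argument was added.
sudo grubby --info=DEFAULT
```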
and then reboot them through the scheduler, so that Slurm does not mark the nodes as unhealthy and the reboot is successful:
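(Something along these lines, with your own node list in place of the placeholder names:)

```bash
# Reboot through Slurm so the nodes are drained first and returned to service
# automatically after the reboot completes.
sudo scontrol reboot ASAP nextstate=RESUME compute-dy-hpc6a-[1-2]
```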
Hi @demartinofra - I ran the commands you suggested to disable the SRSO mitigation and rebooted via Slurm, which resulted in the mitigation being disabled:
I then ran one of our smaller-scale hybrid MPI-openMP jobs and the performance was as expected, with no ~40% degradation (I also corrected the HPC6a node configuration, which helped performance a little). So, it definitely seems like this SRSO mitigation is the culprit for our application slowdowns... and I'll doubly confirm with our larger-scale job. What do you suggest as a more formal workaround for the SRSO mitigation in the PCluster realm? A custom AMI? Something else? When we had performance issues because of the log4j patch, it was a simple …
Hi Stefan, We will work on a Wiki page to describe the mitigation in the PCluster realm and let you know when it is done. Thank you Stefan and Francesco for discovering the issue!
Also, please avoid using 3.11.0 because of the known issue https://github.com/aws/aws-parallelcluster/wiki/(3.11.0)-Job-submission-failure-caused-by-race-condition-in-Pyxis-configuration |
Hi Stefan, We've published the Wiki page "(3.9.1 ‐ latest) Speculative Return Stack Overflow (SRSO) mitigations introducing potential performance impact on some AMD processors". Moreover, we've released ParallelCluster 3.11.1. Cheers,
A follow-up on this, as we have finally been able to do a lot more testing with newer versions of PCluster. We have disabled SRSO following the guide on PCluster 3.11.1 AMIs, for both the AL2 and AL2023 OSes. For our large-scale hybrid MPI-openMP application that runs on ~200 hpc6a, we still see substantial performance degradation compared to PCluster 3.8.0, even with SRSO disabled on both OSes. Further, the PCluster 3.8.0 AL2 AMI we currently use in production does ship with the SRSO mitigation enabled; we have never disabled it. Spinning up an hpc6a with the base us-east-2 PCluster 3.8.0 AL2 AMI (ami-03e71395f1580f16e) yields:
So, something else is going on that is causing issues with large-scale applications/jobs. It's worth reiterating that disabling SRSO in PCluster 3.11.1 AMIs DID help return performance back to near-normal for a small-scale (2 hpc6a) MPI job, but it wasn't the cure for our job using ~200 hpc6a. Were there any other foundational changes that could cause scaling issues in newer versions of PCluster?
Another update here as we've continued testing with PCluster 3.12.0. We are still seeing performance degradation at scale with PCluster 3.10+, including 3.12.0 on both AL2 and AL2023. After a lot more digging, we've noticed that the network throughput (EFA traffic) is substantially less in the newer versions. Our latest tests were with the following: Cluster 1:
Cluster 2 (current production environment):
Attaching screenshots of an instance from both test clusters showing network in/out and network packets in/out using 5-min averages (top is cluster 1, bottom is cluster 2). During the main MPI job, test cluster 2 has consistent 5-min average network in/out of 115+ Gb and packets exceeding 30M. In contrast, test cluster 1 has significantly lower 5-min average network in/out, varying between 80 and 90 Gb, with packets hovering around 24-26M. Further, the traffic is much more volatile (sawtooth pattern). This performance degradation is consistent across other instances within the cluster, but for ease of presentation in the plot we isolated it down to one compute instance from each. I am not sure what could be causing this performance drop and could use some pointers on where to dig next, if there are any EFA-related configurations that might have changed. Since it's a pretty large version bump in the EFA installer, I'm sure there are a lot of moving parts that could be the culprit.
Hi Stefan, We are still looking at the issue. We apologize for the late reply. Thank you,
Thanks Hanwen! FWIW - I am continuing to test some of our smaller-scale jobs and am not seeing any performance issues. Latest testing for a hybrid MPI-openMP job that uses 4 hpc6a: Cluster 1:
Cluster 2 (current production environment):
The total wall clock times from the jobs on cluster 1 were nearly identical to our production total wall clock times on cluster 2. So, this seems to only be an issue at scale, at least from what I have seen.
@stefan-maxar There are big changes between EFA installer 1.30.0 and 1.37.0. In our past experience, performance changes are mostly related to the Libfabric version bumps in the installer. EFA installer 1.30.0 has Libfabric 1.19.0amzn4.0; EFA installer 1.37.0 has Libfabric 1.22.0amzn4.0. I think we can start the investigation by comparing the performance between different Libfabric versions while keeping other components the same. How about doing the following: keep using EFA installer 1.30.0 (I also saw you mentioned 1.31.0 and 1.32.0, which are "good"?), but install a customized Libfabric built from source.
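A rough sketch of what that build could look like (the version number, release URL pattern, and install prefix below are placeholders):

```bash
# Sketch: build a standalone Libfabric with the EFA provider from an upstream release.
VERSION=1.22.0
PREFIX=$HOME/libfabric-${VERSION}
curl -LO "https://github.com/ofiwg/libfabric/releases/download/v${VERSION}/libfabric-${VERSION}.tar.bz2"
tar xjf "libfabric-${VERSION}.tar.bz2"
cd "libfabric-${VERSION}"
./configure --prefix="${PREFIX}" --enable-efa
make -j "$(nproc)"
make install
```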
Since you are using Intel MPI, you should be able to use this customized Libfabric by setting a couple of environment variables.
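A sketch (the install path is a placeholder for wherever the custom build landed):

```bash
# Tell Intel MPI not to use its bundled Libfabric and pick up the custom build
# from the library path instead.
export I_MPI_OFI_LIBRARY_INTERNAL=0
export LD_LIBRARY_PATH=$HOME/libfabric-1.22.0/lib:$LD_LIBRARY_PATH

# Optional sanity check that the EFA provider from the custom build is visible.
"$HOME/libfabric-1.22.0/bin/fi_info" -p efa | head
```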
(I assume you already disabled the internal OFI, as in the sketch above.) It will be great if you can share performance data between v1.22.x, v1.21.x, v1.20.x, and v1.19.x. I think v1.20.x should be mostly the same as v1.19.x. My suspicion is that your performance may be degraded by some change between v1.22.x and v1.21.x, so those are the versions I would start from.
@shijin-aws - Thanks for this; I'll do the requested testing and report back. And yes, we already have that set.
@stefan-maxar Thanks! Just corrected a typo in my earlier comment: I removed the …
I have some initial results I'd like to pass on. I spun up our standard production environment using PCluster v3.8.0 on AL2 with EFA installer 1.30.0. I then ran the following tests from that cluster. The tests utilized ~200 hpc6a instances with a hybrid MPI-openMP job. Test 1: Identical environment to production - PCluster 3.8.0 AL2 base AMI with EFA installer v1.30.0
Test 2: Like test 1, but with libfabric v1.22.x compiled and sourced
Test 3: Like test 1 but with libfabric v1.21.x compiled and sourced
All three tests had nearly identical total wall clock times (within 1 second of each other), and the EFA performance was similar in all three. So, the performance was very similar to what we see in production and was not degraded even when swapping out the Libfabric version.
@stefan-maxar Thank you very much! So you are saying that keeping everything from EFA installer 1.30.0 while changing Libfabric versions gives the same (GOOD) performance! Then the next suspect is the EFA kernel driver version:
I don't have a solid idea right now about which bump may have a performance impact, so I would suggest installing 2.13.0 (the latest) on your "good" setup to see whether it is the smoking gun. This can be done by running the following command on all your compute nodes (via srun):
You can roll back to the old driver by first removing the new one via:
and then installing the old one via
I understand it is more trouble to change driver versions than Libfabric versions. Thanks for your effort.
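However you end up swapping the driver, you can confirm which version a node is actually running with something like this (a sketch; the sysfs path is only present while the module is loaded):

```bash
# Report the loaded EFA kernel driver version on a compute node.
modinfo -F version efa
cat /sys/module/efa/version   # available once the efa module is loaded
```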
@shijin-aws Just did the requested testing; I took our production PCluster v3.8.0 AL2 base AMI with EFA installer v1.30.0 and ran two tests on ~200 hpc6a: 1) with EFA driver version 2.6.0-1 (what comes with EFA installer v1.30.0) and 2) with the EFA driver version incremented to 2.13.0-1. The total wall clock time of test (2) with the upgraded EFA driver was within seconds of test (1) and did not show any notable performance degradation or variability. So, upgrading to the latest EFA driver version does not seem to be the culprit in the performance drops I'm seeing at scale.
@stefan-maxar Thanks for the update! I realize there are some features in newer Libfabric versions that only become active when running with a newer EFA driver and rdma-core. So changing only one component at a time on top of a "good" setup may not be enough. I would suggest the following: on your production setup that has everything from EFA installer 1.30.0, try to update the following three together: the EFA kernel driver, rdma-core, and Libfabric.
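Whichever way you install them, a quick way to confirm the userspace pieces a node ends up with (the rdma-core package name is my assumption about how the EFA installer lays things out):

```bash
# Report the rdma-core and Libfabric versions currently in use on a node.
rpm -q rdma-core      # package name assumed; may differ by distro/installer
fi_info --version     # prints the Libfabric build that fi_info was linked against
```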
If this still doesn't give us any smoking guns, we may need a new approach to diagnose the regression: get a reproducer from your side and work backwards. That may take some time.
Hello,
We have been testing an upgrade from PCluster 3.8.0 to 3.11.0 and, after extensive testing of our applications, noticed some differences that impact performance. We run hybrid MPI-openMP applications on HPC6a.48xlarge instances and found that on PCluster 3.10.1 or 3.11.0 all of our applications run ~40% slower than on 3.8.0, using the out-of-the-box PCluster AMIs associated with either version. We tried to narrow down the issue by downgrading/changing versions of performance-impacting software (such as the EFA installer, downgrading to v1.32.0 or v1.33.0), switching how the job is submitted/run in Slurm (Hydra bootstrap with mpiexec vs. PMIv2 with srun; see the sketch below), and some other changes, none of which improved the degraded performance.
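For concreteness, the two submission styles we compared look roughly like this (rank/thread counts and the binary name are placeholders):

```bash
# (a) Intel MPI Hydra bootstrapped over Slurm, launched with mpiexec
export OMP_NUM_THREADS=4
export I_MPI_HYDRA_BOOTSTRAP=slurm
mpiexec -np 1200 -ppn 24 ./our_app

# (b) Direct launch through Slurm with PMI-2
export OMP_NUM_THREADS=4
srun --mpi=pmi2 --ntasks=1200 --ntasks-per-node=24 --cpus-per-task=4 ./our_app
```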
Upon investigation, we noticed that the slurmd compute daemon on the HPC6a.48xlarge instances incorrectly identifies the hardware configuration, resulting in improper job placement and degraded performance. Snapshots of the slurmd configuration from various versions of PCluster follow:
HPC6a.48xlarge on PCluster 3.8.0 with Slurm 23.02.7 (correct when considering NUMA node as socket):
HPC6a.48xlarge on PCluster 3.10.1 with Slurm 23.11.7:
HPC6a.48xlarge on PCluster 3.11.0 with Slurm 23.11.10:
lscpu from a HPC6a.48xlarge instance:
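(The comparison above can be reproduced on any compute node with commands along these lines; the node name is a placeholder:)

```bash
# Hardware topology as the OS sees it
lscpu | grep -E 'Socket|Core|Thread|NUMA'

# Topology as detected locally by the Slurm compute daemon
slurmd -C

# Topology as registered with the Slurm controller
scontrol show node compute-dy-hpc6a-1 | grep -E 'CPUTot|Boards|Sockets|CoresPerSocket|ThreadsPerCore'
```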
Is there some fix (or workaround) to properly reconfigure the node configuration in PCluster 3.11.0? It looks like some process/script that ran in 3.8.0 (e.g. the line:
[2024-10-03T09:14:54.114] Node reconfigured socket/core boundaries ...
) is either not being run or not running properly. We'd prefer not to hard-code the proper node configuration in the PCluster compute resource YAML, as we dynamically spin up/down clusters and could use different instance types in a given compute resource depending on resource availability. Thanks for any help you can provide!