PCluster 3.10.1 and 3.11.0 Slurm compute daemon node configuration differs from hardware #6449

Open
stefan-maxar opened this issue Oct 3, 2024 · 19 comments

@stefan-maxar

Hello,

We have been testing an upgrade from PCluster 3.8.0 to 3.11.0 and, after extensive testing of our applications, noticed differences that impact performance. We run hybrid MPI-OpenMP applications on HPC6a.48xlarge instances, and with PCluster 3.10.1 or 3.11.0 all of our applications run ~40% slower than with 3.8.0, using the out-of-the-box PCluster AMIs associated with each version. We tried to narrow down the issue by changing versions of performance-impacting software (such as downgrading the EFA installer to v1.32.0 or v1.33.0) and by switching how the job is submitted and run in Slurm (Hydra bootstrap and mpiexec vs. PMIv2 and srun), among other changes, but none of these improved the degraded performance.

Upon investigation, we noticed that the slurmd compute daemon on the HPC6a.48xlarge instances incorrectly identifies the hardware configuration, resulting in improper job placement and degraded performance. Snapshots of the slurmd log from the different PCluster versions follow:

HPC6a.48xlarge on PCluster 3.8.0 with Slurm 23.02.7 (correct when considering NUMA node as socket):

[2024-10-03T09:14:54.114] Considering each NUMA node as a socket
[2024-10-03T09:14:54.114] Node reconfigured socket/core boundaries SocketsPerBoard=96:4(hw) CoresPerSocket=1:24(hw)
[2024-10-03T09:14:54.116] Considering each NUMA node as a socket
[2024-10-03T09:14:54.124] CPU frequency setting not configured for this node
[2024-10-03T09:14:54.130] slurmd version 23.02.7 started
[2024-10-03T09:14:54.168] slurmd started on Thu, 03 Oct 2024 09:14:54 +0000
[2024-10-03T09:14:54.169] CPUs=96 Boards=1 Sockets=4 Cores=24 Threads=1 Memory=378805 TmpDisk=40947 Uptime=240 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

HPC6a.48xlarge on PCluster 3.10.1 with Slurm 23.11.7:

[2024-10-01T13:38:57.884] Considering each NUMA node as a socket
[2024-10-01T13:38:57.960] Considering each NUMA node as a socket
[2024-10-01T13:38:57.965] CPU frequency setting not configured for this node
[2024-10-01T13:38:58.142] slurmd version 23.11.7 started
[2024-10-01T13:38:58.221] slurmd started on Tue, 01 Oct 2024 13:38:58 +0000
[2024-10-01T13:38:58.221] CPUs=96 Boards=1 Sockets=96 Cores=1 Threads=1 Memory=378805 TmpDisk=40947 Uptime=123 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

HPC6a.48xlarge on PCluster 3.11.0 with Slurm 23.11.10:

[2024-10-03T13:56:38.733] Considering each NUMA node as a socket
[2024-10-03T13:56:38.735] Considering each NUMA node as a socket
[2024-10-03T13:56:38.740] CPU frequency setting not configured for this node
[2024-10-03T13:56:39.387] pyxis: version v0.20.0
[2024-10-03T13:56:39.388] slurmd version 23.11.10 started
[2024-10-03T13:56:39.830] slurmd started on Thu, 03 Oct 2024 13:56:39 +0000
[2024-10-03T13:56:39.831] CPUs=96 Boards=1 Sockets=96 Cores=1 Threads=1 Memory=378805 TmpDisk=40947 Uptime=377 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

lscpu from a HPC6a.48xlarge instance:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  1
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7R13 Processor
Stepping:            1
CPU MHz:             2420.130
BogoMIPS:            5299.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47
NUMA node2 CPU(s):   48-71
NUMA node3 CPU(s):   72-95
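
To make the mismatch concrete, the socket/core counts that slurmd reports can be compared against the layout implied by lscpu when each NUMA node is treated as a socket. The following is a minimal Python sketch under that assumption. The helper names (`parse_slurmd_hw`, `numa_as_socket_layout`, `layout_matches`) are hypothetical and are not part of Slurm or ParallelCluster; the two sample lines are shortened copies of the slurmd summaries above.

```python
import re

def parse_slurmd_hw(line):
    """Return the integer KEY=VALUE fields of a slurmd hardware summary line."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}

def numa_as_socket_layout(cpus, numa_nodes, threads_per_core=1):
    """Sockets and cores per socket when each NUMA node counts as one socket."""
    cores_per_socket = cpus // (numa_nodes * threads_per_core)
    return {"Sockets": numa_nodes, "Cores": cores_per_socket}

def layout_matches(slurmd_line, cpus, numa_nodes, threads_per_core=1):
    """True if slurmd's Sockets/Cores agree with the NUMA-as-socket layout."""
    reported = parse_slurmd_hw(slurmd_line)
    expected = numa_as_socket_layout(cpus, numa_nodes, threads_per_core)
    return all(reported.get(key) == value for key, value in expected.items())

# Shortened slurmd summary lines from the logs in this issue.
LINE_3_8_0 = "CPUs=96 Boards=1 Sockets=4 Cores=24 Threads=1 Memory=378805"
LINE_3_11_0 = "CPUs=96 Boards=1 Sockets=96 Cores=1 Threads=1 Memory=378805"

# lscpu on HPC6a.48xlarge reports 96 CPUs, 4 NUMA nodes, 1 thread per core.
print(layout_matches(LINE_3_8_0, cpus=96, numa_nodes=4))   # True
print(layout_matches(LINE_3_11_0, cpus=96, numa_nodes=4))  # False
```

With the lscpu values above, only the 3.8.0 line matches the 4 x 24 layout; the 3.10.1 and 3.11.0 lines report 96 sockets of one core each.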

Is there a fix (or workaround) to properly reconfigure the node configuration in PCluster 3.11.0? It looks like some process or script that ran in 3.8.0 (e.g. the line [2024-10-03T09:14:54.114] Node reconfigured socket/core boundaries ...) is either not being run or not running properly. We'd prefer not to hard-code the proper node configuration in the PCluster compute resource YAML, as we dynamically spin clusters up and down and could use different instance types in a given compute resource depending on resource availability.

Thanks for any help you can provide!

@hanwen-pcluste
Contributor

Hi Stefan!

Thank you for the detailed description. I could reproduce the same issue. The same logs appear in the slurmd.log on compute nodes.

I am actively working on this and will keep you updated!

Thank you,
Hanwen

@demartinofra
Contributor

demartinofra commented Oct 15, 2024

Hi Stefan,

ParallelCluster has never explicitly configured Sockets and Cores for Slurm nodes, so Slurm uses its defaults. The difference could be due to Slurm 23.11 changing the way the values for Sockets and Cores are computed. Were you able to confirm that the performance degradation is resolved after setting the expected values for Sockets and Cores in slurm.conf? I would not expect the lack of Sockets/Cores configuration to change scheduling behaviour enough to justify such a big regression.

Would you be able to extract some logs showing how processes are mapped to the various cores? Also, if you don't mind, can you share the cluster configuration and a potential reproducer?

Could you also share the full Slurm config from both clusters? You can retrieve it with scontrol write config /tmp/slurm.conf

If Sockets and Cores configuration turns out to be a red herring here is another potential issue to look into:
The Amazon Linux Kernel versions [v6.1.82, v5.15.152, v5.10.213] contain mitigations for CVE-2023-20569. The SRSO mitigations are enabled by default but may have a performance impact for very specific workloads. It is possible to disable these security mitigations to avoid a possible performance impact; however, users should carefully consider the security implications. To disable them, specify spec_rstack_overflow=off as a kernel boot parameter. For further details see https://docs.kernel.org/admin-guide/hw-vuln/srso.html

Francesco

@stefan-maxar
Author

Hey @demartinofra - thanks for the reply!

For my testing, I did set the following in the PCluster configuration to force the proper configuration:

      ComputeResources:
          CustomSlurmSettings:
            Sockets: 4
            CoresPerSocket: 24

Which did yield proper configuration via slurmd (HPC6a.48xlarge on PCluster 3.11.0 with Slurm 23.11.10):

[2024-10-15T19:13:36.503] Considering each NUMA node as a socket
[2024-10-15T19:13:36.505] Considering each NUMA node as a socket
[2024-10-15T19:13:36.509] CPU frequency setting not configured for this node
[2024-10-15T19:13:36.569] pyxis: version v0.20.0
[2024-10-15T19:13:36.570] slurmd version 23.11.10 started
[2024-10-15T19:13:36.619] slurmd started on Tue, 15 Oct 2024 19:13:36 +0000
[2024-10-15T19:13:36.619] CPUs=96 Boards=1 Sockets=4 Cores=24 Threads=1 Memory=378805 TmpDisk=40947 Uptime=205 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

For our applications, I did note some improvement in performance and recouped a few percent of the ~40% degradation using the proper hardware configuration. So, not quite red herring, but definitely not the solution either!
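
Since hard-coding these values is awkward when instance types vary, the same two numbers could in principle be derived from lscpu output on the instance. This is a minimal Python sketch, assuming the NUMA-node-as-socket convention seen in the slurmd logs; `parse_lscpu` and `numa_socket_settings` are hypothetical helper names, and how the result would be fed back into the cluster configuration is left open here.

```python
def parse_lscpu(text):
    """Turn 'Key:   value' lines of lscpu output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def numa_socket_settings(lscpu_text):
    """Sockets/CoresPerSocket when each NUMA node is treated as a socket."""
    fields = parse_lscpu(lscpu_text)
    cpus = int(fields["CPU(s)"])
    numa_nodes = int(fields["NUMA node(s)"])
    threads = int(fields["Thread(s) per core"])
    return {"Sockets": numa_nodes, "CoresPerSocket": cpus // (numa_nodes * threads)}

# Relevant lines of the HPC6a.48xlarge lscpu output posted in this issue.
SAMPLE_LSCPU = """\
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  1
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4
"""

print(numa_socket_settings(SAMPLE_LSCPU))  # {'Sockets': 4, 'CoresPerSocket': 24}
```

On a live node the text could come from `subprocess.run(["lscpu"], capture_output=True, text=True).stdout` instead of the sample string.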

Regarding the SRSO mitigation - thanks for passing this along. This is news to me and is definitely something I am going to investigate further. From what I can see, HPC6a with PCluster 3.11 base AMI has that patch as you refer to:

/sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow:Mitigation: safe RET, no microcode

Other than creating a custom AMI that disables this patch, do you have any suggestions for how to disable this upon instance startup via PCluster? The patch seemingly can't be removed during post-install procedures because it requires an instance reboot, and once you reboot, Slurm will detect the instance as "down" and will swap it out. I would rather not have to create a custom AMI if there is some other way to test this out. Thanks!

@demartinofra
Contributor

Other than creating a custom AMI that disables this patch, do you have any suggestions for how to disable this upon instance startup via PCluster?

If you want to test it real quick one option is to run the following on the compute nodes:

sudo grubby --update-kernel=ALL --args='spec_rstack_overflow=off'
sudo sync

and then reboot them through the scheduler, so that Slurm does not mark the nodes as unhealthy and the reboot is successful:

sudo -i scontrol reboot <nodelist>

@stefan-maxar
Author

Hi @demartinofra - I ran the commands you suggested to disable the SRSO mitigation and rebooted via Slurm, which resulted in the mitigation being disabled:

/sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow:Vulnerable, no microcode

I then ran one of our smaller-scale hybrid MPI-OpenMP jobs and performance was as expected, with no ~40% degradation (I also corrected the HPC6a configuration, which helped performance a little). So, it definitely seems like this SRSO mitigation is the culprit for our application slowdowns, and I'll doubly confirm with our larger-scale job.
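
For checking many nodes at once, the sysfs status shown above can be classified programmatically. The following is a minimal Python sketch, assuming that an active mitigation is always reported with a status beginning with "Mitigation", as in the two outputs quoted in this thread; `srso_mitigation_active` and `read_srso_status` are hypothetical helper names.

```python
from pathlib import Path

SRSO_PATH = Path("/sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow")

def srso_mitigation_active(status):
    """True if the status string reports an active SRSO mitigation."""
    # grep-style output prefixes the file path; keep only the status text.
    if status.startswith("/"):
        status = status.split(":", 1)[1]
    return status.strip().startswith("Mitigation")

def read_srso_status(path=SRSO_PATH):
    """Return the kernel's SRSO status, or None if the file is not exposed."""
    return path.read_text().strip() if path.exists() else None

# The two states reported in this issue.
print(srso_mitigation_active("Mitigation: safe RET, no microcode"))  # True
print(srso_mitigation_active("Vulnerable, no microcode"))            # False

status = read_srso_status()
if status is None:
    print("SRSO status is not exposed on this machine")
else:
    print(status, "->", srso_mitigation_active(status))
```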

What do you suggest as a more formal workaround for the SRSO mitigation in the PCluster realm? Custom AMI? Something else? When we had performance issues because of the log4j patch, it was a simple yum remove that we could run during post install. This is a bit more involved and since we spin up and down clusters daily from the base PCluster AMI, it would be great if you could provide some recommendations. Thanks for bringing this to our attention again!

@hanwen-pcluste
Contributor

Hi Stefan,

We will work on a Wiki page to describe the mitigation in pcluster realm and let you know when it is done.

Thank you Stefan and Francesco for discovering the issue!
Hanwen

@hanwen-pcluste
Contributor

Also, please avoid using 3.11.0 because of the known issue https://github.com/aws/aws-parallelcluster/wiki/(3.11.0)-Job-submission-failure-caused-by-race-condition-in-Pyxis-configuration

@hanwen-pcluste
Contributor

hanwen-pcluste commented Oct 23, 2024

Hi Stefan,

We've published a wiki page: (3.9.1 ‐ latest) Speculative Return Stack Overflow (SRSO) mitigations introducing potential performance impact on some AMD processors

Moreover, we've released ParallelCluster 3.11.1

Cheers,
Hanwen

@stefan-maxar
Author

A follow up on this as we have been finally able to do a lot more testing with newer versions of PCluster. We have disabled SRSO following the guide for PCluster 3.11.1 AMIs, both AL2 and AL2023 OSes. For our large scale hybrid MPI-openMP application that runs on ~200 hpc6a, we still see substantial performance degradation compared to PCluster 3.8.0 even with the SRSO disabled on both OSes.

Further, the PCluster 3.8.0 AL2 AMI we currently use in production does ship with the SRSO mitigation enabled; we have never disabled it. Spinning up a hpc6a with the base us-east-2 PCluster 3.8.0 AL2 AMI (ami-03e71395f1580f16e) yields:

/sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow:Mitigation: safe RET, no microcode

So, something else is going on that is causing issues with large-scale applications/jobs. It's worth reiterating: disabling SRSO in PCluster 3.11.1 AMIs DID help return performance to near-normal for a small-scale (2 hpc6a) MPI job, but it wasn't the cure for our job using ~200 hpc6a. Were there any other foundational changes that could cause scaling issues in newer versions of PCluster?

@stefan-maxar
Author

@hanwen-pcluste @demartinofra

Another update here as we've continued testing with PCluster 3.12.0. We are still seeing performance degradation at scale with PCluster 3.10+, including 3.12.0 on both AL2 and AL2023. After a lot more digging, we've noticed that the network throughput (EFA traffic) is substantially less in the newer versions. Our latest tests were with the following:

Cluster 1:

  • MPI job on ~200 hpc6a using Intel MPI
  • PCluster 3.12.0 Base AMI configured with SRSO disabled and EFA Installer 1.37.0

Cluster 2 (current production environment):

  • MPI job on ~200 hpc6a using Intel MPI
  • PCluster 3.8.0 Base AMI with EFA Installer 1.30.0 (we did not disable SRSO)

Attaching screenshots of an instance from both test clusters showing network in/out and network packets in/out using 5-min averages (top is cluster 1, bottom is cluster 2). During the main MPI job, test cluster 2 has consistent 5-min average network in/out performance of 115+ Gb and packets exceeding 30M. In contrast, test cluster 1 has significantly less 5-min average network in/out performance, varying between 80 and 90 Gb with packets hovering around 24-26M. Further, the traffic is much more volatile (sawtooth pattern). This performance degradation is consistent with other instances within the cluster, but for ease of showing in plot, we isolated it down to 1 compute instance from each.
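
As a rough worked example of the gap described above, the relative drop can be computed from the quoted 5-minute averages. This is a minimal Python sketch; `relative_drop` is a hypothetical helper, and because the cluster 2 figures are stated as "115+" and "exceeding 30M", the results are lower bounds on the actual drop.

```python
def relative_drop(baseline, observed):
    """Fractional decrease of observed relative to baseline."""
    return (baseline - observed) / baseline

# Throughput in Gb: cluster 2 baseline of 115, cluster 1 range of 80 to 90.
throughput = [relative_drop(115, gb) for gb in (90, 80)]
# Packets in millions: cluster 2 baseline of 30, cluster 1 range of 24 to 26.
packets = [relative_drop(30, millions) for millions in (26, 24)]

print([f"{drop:.0%}" for drop in throughput])  # ['22%', '30%']
print([f"{drop:.0%}" for drop in packets])     # ['13%', '20%']
```

So cluster 1 moves at least roughly 22-30% less traffic and 13-20% fewer packets per interval than cluster 2 on the sampled instance.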

I am not sure what could be causing this performance drop and could use some pointers on where to dig next, in case any EFA-related configurations might have changed. Since it's a pretty large version bump in the EFA installer, I'm sure there are a lot of moving parts that could be the culprit.

- Stefan

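[Image: cluster 1 network in/out and network packets in/out, 5-min averages]
[Image: cluster 2 network in/out and network packets in/out, 5-min averages]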

@hanwen-cluster
Contributor

Hi Stephan,

We are still looking at the issue. We apologize for the late reply.

Thank you,
Hanwen

@stefan-maxar
Author

stefan-maxar commented Jan 16, 2025

Thanks Hanwen!

FWIW - I am continuing to test some of our smaller-scale jobs and am not seeing any performance issues. Latest testing for a hybrid MPI-openMP job that uses 4 hpc6a:

Cluster 1:

  • MPI job on 4 hpc6a using Intel MPI
  • PCluster 3.12.0 AL2023 Base AMI configured with SRSO disabled and EFA Installer 1.37.0

Cluster 2 (current production environment):

  • MPI job on 4 hpc6a using Intel MPI
  • PCluster 3.8.0 AL2 Base AMI with EFA Installer 1.30.0 (we did not disable SRSO)

The total wall clock times of the jobs on cluster 1 were nearly identical to our production total wall clock times on cluster 2. So, this seems to only be an issue at scale, at least from what I have seen.

@shijin-aws

shijin-aws commented Jan 23, 2025

@stefan-maxar There are big changes between EFA installer 1.30.0 and 1.37.0. In our past experience, performance changes are mostly related to the Libfabric version bumps in the installer. EFA installer 1.30.0 has Libfabric 1.19.0amzn4.0; EFA installer 1.37.0 has Libfabric 1.22.0amzn4.0. I think we can start the investigation by comparing performance between different Libfabric versions while keeping other components the same.

How about doing the following: keep using EFA installer 1.30.0 (I also saw you mentioned 1.31.0 and 1.32.0, which are "good"?), but install a customized Libfabric via

  • Libfabric v1.22.x
git clone https://github.com/ofiwg/libfabric.git
cd libfabric
git checkout v1.22.x

./autogen.sh
./configure --prefix=</path/to/your/libfabric/installation> --disable-verbs \
   --disable-psm3 --disable-opx --disable-usnic --disable-rstream
make -j 32
make install

  • Libfabric v1.21.x, v1.20.x, v1.19.x: same installation procedure as above but change v1.22.x to the older ones.

Since you are using Intel MPI, you should be able to use these customized Libfabric by having

export LD_LIBRARY_PATH=</path/to/your/libfabric/installation>/lib:$LD_LIBRARY_PATH

(I assume you already disabled internal ofi via . vars.sh -i_mpi_ofi_internal=0)

It will be great if you can share performance data between v1.22.x, v1.21.x, v1.20.x, v1.19.x.

I think v1.20.x should be mostly the same as v1.19.x. My suspicion is that your performance may be degraded by some change between v1.21.x and v1.22.x, so those are the versions I would start from.

@stefan-maxar
Author

@shijin-aws - Thanks for this; I'll do the requested testing and report back.

And yes, we set export I_MPI_OFI_LIBRARY_INTERNAL=0 and will follow your guidance on installing and sourcing the different libfabric. I'll turn up I_MPI_DEBUG for further verification, too.

- Stefan

@shijin-aws

@stefan-maxar Thanks! Just corrected a typo in my earlier comment. I removed the --enable-debug flag in ./configure as it shouldn't be added for performance test purpose.

@stefan-maxar
Author

stefan-maxar commented Jan 23, 2025

@shijin-aws

I have some initial results I'd like to pass on. I spun up our standard production environment using PCluster v3.8.0 on AL2 with EFA installer 1.30.0. I then ran the following tests from that cluster. The tests utilized ~200 hpc6a instances with a hybrid MPI-openMP job.

Test 1: Identical environment to production - PCluster 3.8.0 AL2 base AMI with EFA installer v1.30.0

libfabric CLI:
fi_info --version
fi_info: 1.19.0amzn4.0
libfabric: 1.19.0amzn4.0
libfabric api: 1.19

Intel MPI debug (I_MPI_DEBUG=5):
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.19.0amzn4.0
[0] MPI startup(): libfabric provider: efa

Test 2: Like test 1, but with libfabric v1.22.x compiled and sourced

libfabric CLI:
fi_info --version
fi_info: 1.22.0
libfabric: 1.22.0
libfabric api: 1.22

Intel MPI debug (I_MPI_DEBUG=5):
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.22.0
[0] MPI startup(): libfabric provider: efa

Test 3: Like test 1 but with libfabric v1.21.x compiled and sourced

libfabric CLI:
fi_info --version
fi_info: 1.21.1
libfabric: 1.21.1
libfabric api: 1.21

Intel MPI debug (I_MPI_DEBUG=5):
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.21.1
[0] MPI startup(): libfabric provider: efa

All three tests had nearly identical total wall clock times (within 1 second of each other) and the EFA performance was similar in all three. So, the performance was very similar to what we see in production and not degraded even changing out the libfabric version.
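
When swapping Libfabric builds like this, it helps to confirm that the version reported by fi_info is the one Intel MPI actually loaded. The following is a minimal Python sketch that cross-checks the two outputs shown above; the helper names are hypothetical, and the regular expressions assume exactly the output formats quoted in this thread.

```python
import re

def fi_info_libfabric_version(output):
    """Version from the 'libfabric: X' line of `fi_info --version` output."""
    match = re.search(r"^libfabric:\s*(\S+)", output, re.MULTILINE)
    return match.group(1) if match else None

def impi_libfabric_version(debug_output):
    """Version from the Intel MPI 'libfabric version: X' startup line."""
    match = re.search(r"libfabric version:\s*(\S+)", debug_output)
    return match.group(1) if match else None

# Outputs copied from Test 2 above.
FI_INFO = """\
fi_info: 1.22.0
libfabric: 1.22.0
libfabric api: 1.22
"""
IMPI_DEBUG = """\
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.22.0
[0] MPI startup(): libfabric provider: efa
"""

cli = fi_info_libfabric_version(FI_INFO)
loaded = impi_libfabric_version(IMPI_DEBUG)
print(cli, loaded, cli == loaded)  # 1.22.0 1.22.0 True
```

A mismatch between the two values would suggest that LD_LIBRARY_PATH or the internal OFI setting is not taking effect for the MPI job.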

@shijin-aws

@stefan-maxar Thank you very much! So you are saying keeping everything in EFA installer 1.30.0 while changing Libfabric versions gives the same (GOOD) performance!

Then the next suspicion is the efa kernel driver version:

  • EFA installer 1.30.0 uses efa driver 2.6.0
  • Then EFA installer 1.32.0 bumps it to efa driver 2.8.0
  • Then EFA installer 1.35.0 upgrades it to efa driver 2.12.1
  • Then EFA installer 1.36.0 upgrades it to efa driver 2.13.0
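
The list above can be read as a lookup from installer version to driver version. This is a minimal Python sketch, assuming that installer releases between two listed bumps ship the driver from the earlier bump, which the list suggests but does not state outright; `DRIVER_BUMPS` contains only the four entries given here and `efa_driver_for` is a hypothetical helper.

```python
from bisect import bisect_right

# (installer version, efa driver version) pairs taken from the list above.
DRIVER_BUMPS = [
    ((1, 30, 0), "2.6.0"),
    ((1, 32, 0), "2.8.0"),
    ((1, 35, 0), "2.12.1"),
    ((1, 36, 0), "2.13.0"),
]

def version_tuple(version):
    """Convert '1.37.0' into (1, 37, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def efa_driver_for(installer_version):
    """Driver of the latest listed bump at or before the given installer."""
    installers = [installer for installer, _ in DRIVER_BUMPS]
    index = bisect_right(installers, version_tuple(installer_version)) - 1
    # Installers older than the first listed entry are unknown here.
    return DRIVER_BUMPS[index][1] if index >= 0 else None

for installer in ("1.30.0", "1.33.0", "1.37.0"):
    print(installer, "->", efa_driver_for(installer))
```

Under that assumption, the lookup returns 2.8.0 for installer 1.33.0 and 2.13.0 for installer 1.37.0.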

I don't have a solid idea right now of which bump may have a performance impact, so I would suggest installing 2.13.0 (the latest) on your "good" setup to see whether it is the smoking gun. This can be done by running the following command on all your compute nodes (via srun)

sudo yum install -y <your_efa_installer_1.37.0_repo>/RPMS/ALINUX2/x86_64/efa-driver/efa-2.13.0-1.amzn2.x86_64.rpm

You can restore the old driver by first removing the new one via

sudo yum remove -y efa

and then installing the old one via

sudo yum install -y <your_efa_installer_1.30.0_repo>/RPMS/ALINUX2/x86_64/efa-driver/efa-2.6.0-1.amzn2.x86_64.rpm

I understand it is more trouble to change driver versions than Libfabric versions. Thanks for your effort.

@stefan-maxar
Author

@shijin-aws Just did the requested test; I took our production PCluster v3.8.0 AL2 base AMI with EFA installer v1.30.0 and ran two tests on ~200 hpc6a: 1) with EFA driver version 2.6.0-1 (what comes with EFA installer v1.30.0) and 2) with the EFA driver upgraded to 2.13.0-1. The total wall clock time of test (2) with the upgraded EFA driver was within seconds of test (1) and did not show any notable performance degradation or variability. So, upgrading to the latest EFA driver version does not seem to be the culprit in the performance drops I'm seeing at scale.

@shijin-aws

@stefan-maxar Thanks for the update! I realize there are some features in newer Libfabric versions that only become active when running with a newer efa-driver and rdma-core. So changing only one component at a time on a "good" setup may not be enough. I would suggest the following: on your production setup that has everything from EFA installer 1.30.0, try updating the following three together

  • Libfabric: Installing the libfabric 1.22.0amzn4.0 from EFA installer 1.37.0 via
sudo yum install -y <your_efa_installer_1.37.0_repo>/RPMS/ALINUX2/x86_64/RPMS/ALINUX2/x86_64/libfabric-aws*.rpm
  • EFA driver: Incrementing the EFA driver to 2.13.0 as what you did earlier

  • rdma-core: Installing the rdma-core 54.0 packages from EFA installer 1.37.0 via

sudo yum install -y <your_efa_installer_1.37.0_repo>/RPMS/ALINUX2/x86_64/rdma-core/*.rpm

If this still doesn't give us any smoking guns, we may need a new approach to diagnose the regression: we get a reproducer from your side and work backwards. That may take some time.
