AWS meeting 2023 08 10
Kenneth Hoste edited this page Aug 11, 2023
- link to AWS project doc: https://docs.google.com/document/d/1CHG9fCh2LkfJ-EI8J-_Wr5NpHL5iwm8Wu6syfK9h7-c
- next meetings:
  - Thu 14 Sep 2023, 12:00 UTC
  - Thu 12 Oct 2023, 12:00 UTC
  - Thu 9 Nov 2023, 12:00 UTC
- meeting notes:
  - 10 Aug 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-08-10
  - July 2023: skipped
  - 8 Jun 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-06-08
  - 11 May 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-05-11
  - 13 Apr 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-04-13
  - 9 Mar 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-03-09
  - 11 Jan 2023: https://github.com/EESSI/meetings/wiki/AWS-meeting-2023-01-11
- status update on sponsored credits
  - currently ~$10k left in sponsored credits
  - Costs are about $3k/month for March-July'23 (up from ~$1.5k/month)
    - EFS costs are on the rise (~50% in July)
      - The build bot still leaves behind large tarballs (kept to allow debugging failing builds), and these are currently not being cleaned up
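A back-of-the-envelope runway estimate from the figures above, assuming (not stated in the notes) that spending stays flat at ~$3k/month and no further credits are added:

```math
\text{months of credits left} \approx \frac{10\text{k USD}}{3\text{k USD}/\text{month}} \approx 3.3
```

Counting from this meeting, that would put the credits running out around mid-to-late November 2023; at the earlier rate (about half the current one) they would have lasted roughly twice as long.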
- Looking into using a CDN
  - Will be Q4 before we look further into this
- Injecting OpenMPI/libfabric libraries into EESSI
  - Full discussion in https://github.com/EESSI/software-layer/issues/252 (see this comment for details on the potential way forward)
  - Basically two steps:
    - Take a copy of the host libmpi.so
      - We are using the EESSI linker, so we need to force the library to find some of its libraries (like libfabric) from the host
        - We modify the ELF header of the library to do this
        - We also inject some additional dependencies to effectively preload some other required host libraries
    - Place it in a special location where it will preferentially get picked up before the EESSI MPI library
  - Seems to work with the latest version of EESSI; GROMACS runs show a performance improvement of ~5%
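The two steps above could be scripted roughly as follows. This is a minimal sketch, not necessarily the exact procedure from the linked issue: it assumes `patchelf` is used for the ELF modifications (`--set-rpath` so the copy resolves libraries such as libfabric from host directories, `--add-needed` to inject extra dependencies). The function name `plan_mpi_injection`, the override directory, and all example paths are hypothetical. The function only builds the command lines, so they can be reviewed before anything is changed.

```python
import shlex
from pathlib import Path


def plan_mpi_injection(host_libmpi, host_lib_dirs, extra_needed, override_dir):
    """Build (but do not run) the commands for injecting a host libmpi.so.

    Hypothetical helper: step 1 copies the host library and patches the copy
    with patchelf; step 2 places the copy in override_dir, a directory that is
    *assumed* to be searched before the EESSI-provided MPI library.
    """
    src = Path(host_libmpi)
    dest = Path(override_dir) / src.name
    commands = [
        ["mkdir", "-p", str(dest.parent)],
        # Copy the real file (-L follows symlinks); the host original stays untouched
        ["cp", "-L", str(src), str(dest)],
        # Make the copy look up its direct dependencies (e.g. libfabric) in host directories
        ["patchelf", "--set-rpath", ":".join(host_lib_dirs), str(dest)],
    ]
    # Add extra DT_NEEDED entries so these host libraries get loaded together with libmpi
    for lib in extra_needed:
        commands.append(["patchelf", "--add-needed", lib, str(dest)])
    return commands


if __name__ == "__main__":
    # Placeholder paths for illustration only, not taken from the meeting notes
    for cmd in plan_mpi_injection(
        host_libmpi="/usr/lib64/openmpi/lib/libmpi.so.40",
        host_lib_dirs=["/opt/amazon/efa/lib64", "/usr/lib64"],
        extra_needed=["libfabric.so.1"],
        override_dir="/opt/eessi-mpi-override/lib",
    ):
        print(shlex.join(cmd))
```

Printing the commands instead of executing them keeps the sketch side-effect free; passing each list to `subprocess.run(cmd, check=True)` would apply them.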
- failing test suites for OpenBLAS/FFTW/numpy (only) on Graviton 3 (not seeing this on Graviton 2)
  - popping up while populating the software stack for EESSI pilot 2023.06
  - OpenBLAS: numerical errors in the LAPACK test suite
    - Some toolchains use older OpenBLAS versions which lack optimisations
    - We see an increased number of failing tests
    - Discussion on issue at https://github.com/xianyi/OpenBLAS/issues/4187
      - Note: OpenBLAS developers are only just starting to test on neoverse_v1
    - We ignored these failing tests for now, assuming they're mostly harmless
      - cfr. https://github.com/EESSI/software-layer/pull/309
  - FFTW: erratic failure of a single FFTW test (not always the same one)
    - cfr. https://github.com/EESSI/software-layer/pull/310
    - still figuring this out
  - numpy: handful of failing tests in the numpy test suite
    - cfr. https://github.com/EESSI/software-layer/pull/306
    - planning to open upstream issues for this to figure out how serious these are
  - Kenneth will send an email to Angel about this; it could be useful to get feedback from the AWS Performance Engineering team
- progress on making it easy to integrate EESSI with ParallelCluster
  - Matt is working on open-source add-ons for ParallelCluster
- talk at the AWS booth at SC'23
  - long talks (~45 min), repeated a couple of times
  - live demo of getting EESSI working on AWS
  - can cover different aspects