Sync meeting on EESSI test suite (2023 06 15)
Kenneth Hoste edited this page Jun 16, 2023
- every 2 weeks on Thursday at 14:00 CE(S)T
- next meetings:
- Thu 15 June 14:00 => OK for all
- Wed 28 June 14:00 => OK for all
- Thu 13 July 14:00 => Kenneth/Sam(?) is on summer vacation
- Thu 27 July 14:00 => Kenneth is on summer vacation
- Thu 10 Aug 14:00 => clash with monthly AWS sync meeting
- Wed 9 Aug 14:00?
- notes of previous meetings:
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-05-31)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-05-17)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-04-20)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-03-30)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-03-10) (incl. 2023-02-23)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-02-09)
- merged PRs
- refactoring PR #45
- fix GPU devices error PR #50
- more info in README on Git workflow PR #36
- Sam: could add info on sending a PR to the branch used for an open PR
- config file for Hortense @ VSC PR #24
- config file for Snellius @ SURF PR #52
- small extra change needed (PR #55)
- open PRs
- OSU Microbenchmarks (PR #54)
- separate class for pt2pt tests with limited set of scales
- focus on pt2pt test only for now, follow up with test for collectives later
- testing on various systems
- Caspar: AWS CitC Slurm cluster
- see https://github.com/EESSI/hackathons/tree/main/2022-12/citc#node-types
- config file (PR #53)
- using auto-detect for CPU features (see `remote_detect`), working well
- ReFrame feature request to control launcher used for auto-detect: https://github.com/reframe-hpc/reframe/issues/2926
- Kenneth: Vega
- PR for config file coming up
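The `remote_detect` option mentioned above is set in the `general` section of a ReFrame site configuration. A minimal sketch of enabling it (assuming a recent ReFrame version; only the relevant section is shown):

```python
# Hedged sketch: enabling remote auto-detection of CPU topology in a
# ReFrame site configuration. With 'remote_detect' enabled, ReFrame
# submits a small detection job so processor info is gathered on the
# compute nodes themselves, not on the login node.
site_configuration = {
    # ... 'systems' / 'environments' sections omitted ...
    'general': [
        {
            'remote_detect': True,
        },
    ],
}
```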
- notes
- can we control the launcher to use from the test?
- we don't want to launch non-MPI workloads with mpirun, since it will not be available without loading a module
- Sam's hack: define separate partition with 'local' launcher
- only a problem when not using srun (which is always available)
- required to make sure that tests still work even when mpirun is used as a launcher
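Sam's workaround can be sketched in a ReFrame site config: a second partition on the same nodes that uses the `local` launcher, so non-MPI tests are started directly instead of via mpirun. All names below (`example`, `cpu`, `cpu-local`, `default`) are hypothetical, not from the source:

```python
# Hedged sketch of the workaround: two partitions over the same nodes,
# differing only in the launcher they use.
site_configuration = {
    'systems': [
        {
            'name': 'example',
            'hostnames': ['login.*'],
            'partitions': [
                {
                    # regular partition: parallel workloads launched via mpirun
                    'name': 'cpu',
                    'scheduler': 'slurm',
                    'launcher': 'mpirun',
                    'environs': ['default'],
                },
                {
                    # same nodes, but with the 'local' launcher, so plain
                    # (non-MPI) executables run without any MPI module loaded
                    'name': 'cpu-local',
                    'scheduler': 'slurm',
                    'launcher': 'local',
                    'environs': ['default'],
                },
            ],
        },
    ],
}
```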
- periodic testing
- on AWS, stick to 1-2 (maybe 4) nodes for running test suite
- should open issue on making sure that thread/process to core binding is done correctly (for GROMACS)
- https://github.com/EESSI/test-suite/issues/57
- can use `affinity` tool to check binding: https://github.com/vkarak/affinity
- setting `$OMP_PROC_BIND` to `true` triggers really bad binding for TensorFlow
- see upstream issue @ https://github.com/tensorflow/tensorflow/issues/60843
- could be related to how TensorFlow installation was configured, see https://github.com/easybuilders/easybuild-easyblocks/issues/2577
- should also look into `likwid-pin`, cf. https://github.com/RRZE-HPC/likwid + https://github.com/RRZE-HPC/likwid/wiki/Likwid-Pin
- main reason to control binding is to make the runs reproducible (and not make it do something stupid), not to get best possible performance
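As a quick complement to the `affinity` tool, the effective binding can also be inspected from Python itself. A minimal Linux-only sketch (not part of the test suite; `os.sched_getaffinity` is not available on all platforms):

```python
# Minimal sketch: report which CPU cores the current process is allowed
# to run on (Linux-only), useful for sanity-checking binding settings.
import os


def current_affinity():
    """Return the sorted list of CPU core IDs this process may run on."""
    return sorted(os.sched_getaffinity(0))


if __name__ == '__main__':
    print(f'process {os.getpid()} is bound to cores: {current_affinity()}')
```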
- dashboard to show test results
- need to figure out how to collect data (and which data)
- could consider letting participating sites push that data to a Git repo (in GitHub or GitLab), which could trigger the update of a dashboard hosted in GitHub Pages or the GitLab equivalent