Sync meeting on EESSI test suite (2023 06 15)
Kenneth Hoste edited this page Jun 16, 2023
- every 2 weeks on Thursday at 14:00 CE(S)T
- next meetings:
- Thu 15 June 14:00 => OK for all
- Wed 28 June 14:00 => OK for all
- Thu 13 July 14:00 => Kenneth/Sam(?) is on summer vacation
- Thu 27 July 14:00 => Kenneth is on summer vacation
- Thu 10 Aug 14:00 => clash with monthly AWS sync meeting
- Wed 9 Aug 14:00?
- notes of previous meetings:
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-05-31)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-05-17)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-04-20)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-03-30)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-03-10) (incl. 2023-02-23)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-02-09)
- merged PRs
- refactoring PR #45
- fix GPU devices error PR #50
- more info in README on Git workflow PR #36
- Sam: could add info on sending a PR to the branch used for an open PR
- config file for Hortense @ VSC PR #24
- config file for Snellius @ SURF PR #52
- small extra change needed (PR #55)
- open PRs
- OSU Microbenchmarks (PR #54)
- separate class for pt2pt tests with limited set of scales
- focus on pt2pt test only for now, follow up with test for collectives later
- testing on various systems
- Caspar: AWS CitC Slurm cluster
- see https://github.com/EESSI/hackathons/tree/main/2022-12/citc#node-types
- config file (PR #53)
- using auto-detect for CPU features (see `remote_detect`), working well
- ReFrame feature request to control launcher used for auto-detect: https://github.com/reframe-hpc/reframe/issues/2926
- Kenneth: Vega
- PR for config file coming up
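The `remote_detect` option mentioned above is set in the `general` section of a ReFrame site configuration. A minimal sketch of enabling it (assuming a recent ReFrame version; only the relevant section is shown):

```python
# Hedged sketch: enabling remote auto-detection of CPU topology in a
# ReFrame site configuration. With 'remote_detect' enabled, ReFrame
# submits a small detection job so processor info is gathered on the
# compute nodes themselves, not on the login node.
site_configuration = {
    # ... 'systems' / 'environments' sections omitted ...
    'general': [
        {
            'remote_detect': True,
        },
    ],
}
```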
- notes
- can we control the launcher to use from the test?
- we don't want to launch non-MPI workloads with mpirun, since it will not be available without loading a module
- Sam's hack: define separate partition with 'local' launcher
- only a problem when not using srun (which is always available)
- required to make sure that tests still work even when mpirun is used as a launcher
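Sam's workaround can be sketched in a ReFrame site config: a second partition on the same nodes that uses the `local` launcher, so non-MPI tests are started directly instead of via mpirun. All names below (`example`, `cpu`, `cpu-local`, `default`) are hypothetical, not from the source:

```python
# Hedged sketch of the workaround: two partitions over the same nodes,
# differing only in the launcher they use.
site_configuration = {
    'systems': [
        {
            'name': 'example',
            'hostnames': ['login.*'],
            'partitions': [
                {
                    # regular partition: parallel workloads launched via mpirun
                    'name': 'cpu',
                    'scheduler': 'slurm',
                    'launcher': 'mpirun',
                    'environs': ['default'],
                },
                {
                    # same nodes, but with the 'local' launcher, so plain
                    # (non-MPI) executables run without any MPI module loaded
                    'name': 'cpu-local',
                    'scheduler': 'slurm',
                    'launcher': 'local',
                    'environs': ['default'],
                },
            ],
        },
    ],
}
```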
- periodic testing
- on AWS, stick to 1-2 (maybe 4) nodes for running test suite
- should open issue on making sure that thread/process to core binding is done correctly (for GROMACS)
- https://github.com/EESSI/test-suite/issues/57
- can use `affinity` tool to check binding: https://github.com/vkarak/affinity
- setting `$OMP_PROC_BIND` to `true` triggers really bad binding for TensorFlow
- see upstream issue @ https://github.com/tensorflow/tensorflow/issues/60843
- could be related to how TensorFlow installation was configured, see https://github.com/easybuilders/easybuild-easyblocks/issues/2577
- should also look into `likwid-pin`, cf. https://github.com/RRZE-HPC/likwid + https://github.com/RRZE-HPC/likwid/wiki/Likwid-Pin
- main reason to control binding is to make the runs reproducible (and not make it do something stupid), not to get best possible performance
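As a quick complement to the `affinity` tool, the effective binding can also be inspected from Python itself. A minimal Linux-only sketch (not part of the test suite; `os.sched_getaffinity` is not available on all platforms):

```python
# Minimal sketch: report which CPU cores the current process is allowed
# to run on (Linux-only), useful for sanity-checking binding settings.
import os


def current_affinity():
    """Return the sorted list of CPU core IDs this process may run on."""
    return sorted(os.sched_getaffinity(0))


if __name__ == '__main__':
    print(f'process {os.getpid()} is bound to cores: {current_affinity()}')
```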
- dashboard to show test results
- need to figure out how to collect data (and which data)
- could consider letting participating sites push that data to a Git repo (in GitHub or GitLab), which could trigger the update of a dashboard hosted in GitHub Pages or the GitLab equivalent