Update vsc_calcua #725

Merged: pmoris merged 6 commits into nf-core:master on Aug 8, 2024

Conversation

@pmoris (Contributor) commented on Aug 7, 2024

Some bugfixes and improvements for the CalcUA VSC config.

  • Bug fix for an unknown config attribute error.
  • Clearer warnings about the singularity/apptainer and Nextflow tmp and cache dirs.
  • Improved checks for whether the process is running on the correct partition (rather than a login node) and with the appropriate resources.

pmoris added 3 commits August 7, 2024 15:11
The host variable started throwing errors upon launch. The reason
was that the host variable, defined outside the function, was being
referenced inside the function's local scope, where it was not
accessible. To fix this, either the partition_checker() function
could be tweaked to take an additional parameter (String host), or
the host variable could simply be omitted in favour of the underlying
logic (i.e. fetching the content of an environment variable). The
latter option was chosen.
The function that tests whether Nextflow is running on the correct
SLURM partition was broken: a try-catch was used where conditionals
were needed, and a == comparison was missing.
The code now correctly warns the user if Nextflow is running on a
login node, and aborts the run if the current partition does not
match the requested one (from the -profile flag).
Runs launched on a login node are not aborted, to allow easier
debugging.
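
For illustration, a minimal sketch of the partition check described in the two commits above (the environment variable lookup, messages and control flow are assumptions; the actual conf/vsc_calcua.config differs in detail):

```groovy
// Minimal sketch only, not the actual config code.
def partition_checker(String requested_partition) {
    // Fetch the environment inside the function itself, so no outer-scope
    // variable is needed. SLURM_JOB_PARTITION is unset on a login node.
    def current_partition = System.getenv('SLURM_JOB_PARTITION')

    if (!current_partition) {
        // Login node: warn, but do not abort, to allow easier debugging.
        System.err.println('WARNING: Nextflow appears to be running on a login node.')
        return
    }

    if (current_partition != requested_partition) {
        // Wrong partition: abort the run.
        throw new Exception("Current partition (${current_partition}) does not match the requested profile (${requested_partition}).")
    }
}
```
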
…ables

Improves the checks for the environment variables governing the
singularity/apptainer/Nextflow cache and tmp dirs:
- APPTAINER_TMPDIR/SINGULARITY_TMPDIR,
- APPTAINER_CACHEDIR/SINGULARITY_CACHEDIR or
- NXF_APPTAINER_CACHEDIR/NXF_SINGULARITY_CACHEDIR
Only the Nextflow cache is set to a default value if it is not set in the
launch environment (via singularity.cacheDir). Previously, this location
was always set and any value in the env var was ignored.
The other two cannot be overridden in the env scope, because that scope only
affects the task environment variables, not the environment in which Nextflow
is launched.
See: https://www.nextflow.io/docs/latest/config.html#scope-env
Warnings are issued when any of these variables are not set.
Also clarifies code comments and updates the documentation.
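
A hedged sketch of what these checks can look like in a Nextflow config (the env var names come from the commit message; the warning texts and the fallback cache path are made up):

```groovy
// Simplified sketch, not the actual config code.
def tmp_dir   = System.getenv('APPTAINER_TMPDIR')       ?: System.getenv('SINGULARITY_TMPDIR')
def cache_dir = System.getenv('APPTAINER_CACHEDIR')     ?: System.getenv('SINGULARITY_CACHEDIR')
def nxf_cache = System.getenv('NXF_APPTAINER_CACHEDIR') ?: System.getenv('NXF_SINGULARITY_CACHEDIR')

if (!tmp_dir) {
    System.err.println('WARNING: APPTAINER_TMPDIR/SINGULARITY_TMPDIR is not set; container builds may fill up node-local storage.')
}
if (!cache_dir) {
    System.err.println('WARNING: APPTAINER_CACHEDIR/SINGULARITY_CACHEDIR is not set.')
}

singularity {
    enabled = true
    // Only the Nextflow cache dir gets a default when its env var is missing;
    // a value provided via the env var is respected rather than overwritten.
    // The fallback path below is a placeholder.
    cacheDir = nxf_cache ?: "${System.getenv('HOME')}/nextflow_singularity_cache"
}
```
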
@jfy133 requested review from a team, maxulysse and jemten, and removed the request for a team on August 7, 2024 at 15:57
The automatic resource allocation for local node execution
now also works when memory is specified via --mem (i.e.
total node memory), instead of only looking at --mem-per-cpu.
SLURM sets only one of these environment variables (SLURM_MEM_PER_CPU
or SLURM_MEM_PER_NODE), depending on how memory was requested;
neither is derived dynamically from the other.
As before, if these env vars are not set, there is still a
fallback to the profile's default max mem/cpu values.
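
A rough sketch of the memory-detection logic this commit describes (the fallback numbers are placeholders, not actual CalcUA partition specs):

```groovy
// Illustrative only: derive the memory available for local execution from SLURM,
// handling both --mem (SLURM_MEM_PER_NODE) and --mem-per-cpu (SLURM_MEM_PER_CPU).
def allocated_cpus = (System.getenv('SLURM_CPUS_PER_TASK') ?: '0') as int
def mem_per_node   = System.getenv('SLURM_MEM_PER_NODE')   // set when --mem is used
def mem_per_cpu    = System.getenv('SLURM_MEM_PER_CPU')    // set when --mem-per-cpu is used

def available_mem_mb
if (mem_per_node) {
    available_mem_mb = mem_per_node as long
} else if (mem_per_cpu && allocated_cpus > 0) {
    available_mem_mb = (mem_per_cpu as long) * allocated_cpus
} else {
    available_mem_mb = 240000   // placeholder: the profile's default max memory in MB
}

params.max_memory = "${available_mem_mb} MB"
params.max_cpus   = allocated_cpus ?: 64   // placeholder: the profile's default max CPUs
```
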
@pmoris (Contributor, Author) commented on Aug 7, 2024

  • Fixed automatic memory allocation. I didn't realize there were multiple possible SLURM environment variables that could keep track of the available amount of memory (SLURM_MEM_PER_NODE vs SLURM_MEM_PER_CPU).

That should be it for now...

Thanks in advance for the review!

@jfy133 (Member) left a comment
OK this is an ultra-mega-complicated config 😅, but assuming it's been tested, I don't see anything majorly flawed 👍

conf/vsc_calcua.config (review comment: outdated, resolved)
@pmoris (Contributor, Author) commented on Aug 8, 2024

> OK this is an ultra-mega-complicated config 😅, but assuming it's been tested, I don't see anything majorly flawed 👍

I might have gone a little overboard with this one...

The basic premises are:

  1. Allow the choice between SLURM-scheduled jobs and local execution on a single worker node (`*_slurm` vs `*_local`).
     * The former has more overhead and waiting time between tasks, but also allows more parallel tasks. The latter is faster, but limited to the CPU/RAM of a single node.
     * Since each of our SLURM partitions contains a different type of node (CPU, memory, wall time, etc.), I separated these out into sub-profiles (`zen2_*`, `skylake_*`, etc.). A single profile would have `max_*` values that could be too high (or too restrictive) depending on the partition a job is submitted to, wouldn't it?
  2. Dynamically set `max_*` values for the local execution profiles, based on the amount of resources requested from SLURM (defaulting to the maximum of the partition/node type otherwise).
  3. Add several checks and warnings:
     * Check whether the singularity cache, tmp and Nextflow cache dirs are set properly. Our HPC has strict file quotas and worker nodes with little local storage, which could otherwise be exceeded/filled.
     * Avoid runs on login nodes.
  4. Avoid crashing the GitHub Actions by skipping various parts of the config if the hostname does not match that of our HPC (see the sketch after this list).
  5. Lots of documentation on how to install/run Nextflow, things to avoid, what the default directories are, an overview of the different resources and matching profiles, etc.
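
The hostname guard from point 4 could look roughly like this (a sketch under assumptions: the domain pattern is a placeholder, not the real CalcUA value):

```groovy
// Sketch of the hostname guard: only run the cluster-specific checks when we
// are actually on the HPC, so parsing the config in GitHub Actions does not
// fail. The domain pattern below is a placeholder.
def hostname = 'hostname'.execute().text.trim()
if (hostname ==~ /.*\.example-hpc\.be/) {
    // partition checks, cache/tmp dir warnings, dynamic max_* detection, etc.
}
```
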

I'm open to any and all suggestions to improve this config though, especially if they can simplify this behemoth of a config!

I just noticed that the vsc_ugent.config recently switched to dynamically determining the current SLURM partition, which eliminates the need for different sub-profiles, so I might put that on my todo list. I'd still need two sub-profiles to choose between the SLURM and local executor, but if one of them were set as the default, that would still simplify things.
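
For reference, a rough sketch of what that dynamic-partition approach could look like (this is not taken from vsc_ugent.config; the partition names match the sub-profiles mentioned above, but the limits are placeholder values):

```groovy
// Sketch only: select partition-specific limits at runtime instead of via
// separate sub-profiles. The numbers below are placeholders.
def partition = System.getenv('SLURM_JOB_PARTITION') ?: 'zen2'   // fall back to a default partition

def partition_limits = [
    zen2   : [cpus: 64, memory: '240.GB', time: '72.h'],
    skylake: [cpus: 28, memory: '176.GB', time: '72.h'],
]

def limits = partition_limits[partition] ?: partition_limits['zen2']

params.max_cpus   = limits.cpus
params.max_memory = limits.memory
params.max_time   = limits.time
```
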

conf/vsc_calcua.config (review comment: outdated, resolved)
@jfy133 (Member) commented on Aug 8, 2024

I don't have time to read through fully currently (in the middle of running/teaching a summer school), so I would say you can merge in otherwise, depending on @nvnieuwk 's comment :)

pmoris added 2 commits August 8, 2024 15:29
Makes the warning messages clearer to the user and
differentiates between warnings and errors.
Exits in case the singularity/apptainer cache is not set.
Improves the docs related to cache and temp dirs.
@pmoris (Contributor, Author) commented on Aug 8, 2024

Incorporated your suggestions and tried to improve the docs/comments a bit more. Ready for merging! Thanks for the help!

@pmoris merged commit ad49f85 into nf-core:master on Aug 8, 2024
126 checks passed