Update vsc_calcua #725

Merged: pmoris merged 6 commits into nf-core:master on Aug 8, 2024

Conversation

@pmoris (Contributor) commented on Aug 7, 2024

Some bugfixes and improvements for the CalcUA VSC config.

  • Bug fix for an unknown config attribute error.
  • Clearer warnings about the singularity/apptainer and Nextflow tmp and cache dirs.
  • Improved checks for whether the process is running on the correct partition (rather than a login node) and with the appropriate resources.

pmoris added 3 commits August 7, 2024 15:11
The host variable started throwing errors upon launch. The reason
was that the host variable, defined outside the function, was being
referenced inside the function's local scope, where it was not
accessible. To fix this, either the partition_checker() function
could be tweaked to take an additional parameter (String host), or
the host variable could simply be omitted in favour of the underlying
logic (i.e. fetching the content of an environment variable). The
latter option was chosen.
The function that tests whether Nextflow is running on the correct
SLURM partition was broken: a try-catch was used where conditionals
were needed, and a == comparison was missing.
The code now correctly warns the user if Nextflow is running on a
login node, and aborts the run if the current partition does not
match the requested one (from the -profile flag).
Runs launched on a login node are not aborted, to allow easier
debugging.
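
For illustration, a minimal sketch of the partition check described in the two commits above (the environment variable lookup, messages and control flow are assumptions; the actual conf/vsc_calcua.config differs in detail):

```groovy
// Minimal sketch only, not the actual config code.
def partition_checker(String requested_partition) {
    // Fetch the environment inside the function itself, so no outer-scope
    // variable is needed. SLURM_JOB_PARTITION is unset on a login node.
    def current_partition = System.getenv('SLURM_JOB_PARTITION')

    if (!current_partition) {
        // Login node: warn, but do not abort, to allow easier debugging.
        System.err.println('WARNING: Nextflow appears to be running on a login node.')
        return
    }

    if (current_partition != requested_partition) {
        // Wrong partition: abort the run.
        throw new Exception("Current partition (${current_partition}) does not match the requested profile (${requested_partition}).")
    }
}
```
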
…ables

Improves the checks for the environment variables governing the
singularity/apptainer/Nextflow cache and tmp dirs:
- APPTAINER_TMPDIR/SINGULARITY_TMPDIR,
- APPTAINER_CACHEDIR/SINGULARITY_CACHEDIR or
- NXF_APPTAINER_CACHEDIR/NXF_SINGULARITY_CACHEDIR
Only the Nextflow cache is set to a default value if it is not set in the
launch environment (via singularity.cacheDir). Previously, this location
was always set and any value in the env var was ignored.
The other two cannot be overridden in the env scope, because that scope only
affects the task environment variables, not the environment in which Nextflow
is launched.
See: https://www.nextflow.io/docs/latest/config.html#scope-env
Warnings are issued when any of these variables are not set.
Also clarifies code comments and updates the documentation.
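
A hedged sketch of what these checks can look like in a Nextflow config (the env var names come from the commit message; the warning texts and the fallback cache path are made up):

```groovy
// Simplified sketch, not the actual config code.
def tmp_dir   = System.getenv('APPTAINER_TMPDIR')       ?: System.getenv('SINGULARITY_TMPDIR')
def cache_dir = System.getenv('APPTAINER_CACHEDIR')     ?: System.getenv('SINGULARITY_CACHEDIR')
def nxf_cache = System.getenv('NXF_APPTAINER_CACHEDIR') ?: System.getenv('NXF_SINGULARITY_CACHEDIR')

if (!tmp_dir) {
    System.err.println('WARNING: APPTAINER_TMPDIR/SINGULARITY_TMPDIR is not set; container builds may fill up node-local storage.')
}
if (!cache_dir) {
    System.err.println('WARNING: APPTAINER_CACHEDIR/SINGULARITY_CACHEDIR is not set.')
}

singularity {
    enabled = true
    // Only the Nextflow cache dir gets a default when its env var is missing;
    // a value provided via the env var is respected rather than overwritten.
    // The fallback path below is a placeholder.
    cacheDir = nxf_cache ?: "${System.getenv('HOME')}/nextflow_singularity_cache"
}
```
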
@jfy133 requested review from a team, maxulysse and jemten, and removed the request for a team on August 7, 2024 at 15:57
The automatic resource allocation for local node execution
now also works when memory is specified via --mem (i.e.
total node memory), instead of only looking at --mem-per-cpu.
SLURM sets only one of these environment variables (SLURM_MEM_PER_CPU
or SLURM_MEM_PER_NODE), depending on how memory was requested;
neither is derived dynamically from the other.
As before, if these env vars are not set, there is still a
fallback to the profile's default max mem/cpu values.
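
A rough sketch of the memory-detection logic this commit describes (the fallback numbers are placeholders, not actual CalcUA partition specs):

```groovy
// Illustrative only: derive the memory available for local execution from SLURM,
// handling both --mem (SLURM_MEM_PER_NODE) and --mem-per-cpu (SLURM_MEM_PER_CPU).
def allocated_cpus = (System.getenv('SLURM_CPUS_PER_TASK') ?: '0') as int
def mem_per_node   = System.getenv('SLURM_MEM_PER_NODE')   // set when --mem is used
def mem_per_cpu    = System.getenv('SLURM_MEM_PER_CPU')    // set when --mem-per-cpu is used

def available_mem_mb
if (mem_per_node) {
    available_mem_mb = mem_per_node as long
} else if (mem_per_cpu && allocated_cpus > 0) {
    available_mem_mb = (mem_per_cpu as long) * allocated_cpus
} else {
    available_mem_mb = 240000   // placeholder: the profile's default max memory in MB
}

params.max_memory = "${available_mem_mb} MB"
params.max_cpus   = allocated_cpus ?: 64   // placeholder: the profile's default max CPUs
```
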
@pmoris (Contributor, Author) commented on Aug 7, 2024

  • Fixed automatic memory allocation. I didn't realize there were multiple possible SLURM environment variables that could keep track of the available amount of memory (SLURM_MEM_PER_NODE vs SLURM_MEM_PER_CPU).

That should be it for now...

Thanks in advance for the review!

@jfy133 (Member) left a comment
OK this is an ultra-mega-complicated config 😅, but assuming it's been tested, I don't see anything majorly flawed 👍

conf/vsc_calcua.config (review comment: outdated, resolved)
@pmoris (Contributor, Author) commented on Aug 8, 2024

> OK this is an ultra-mega-complicated config 😅, but assuming it's been tested, I don't see anything majorly flawed 👍

I might have gone a little overboard with this one...

The basic premises are:

  1. Allow the choice between SLURM-scheduled jobs and local execution on a single worker node (`*_slurm` vs `*_local`).
     * The former has more overhead and waiting time between tasks, but also allows more parallel tasks. The latter is faster, but limited to the CPU/RAM of a single node.
     * Since each of our SLURM partitions contains a different type of node (CPU, memory, wall time, etc.), I separated these out into sub-profiles (`zen2_*`, `skylake_*`, etc.). A single profile would have `max_*` values that could be too high (or too restrictive) depending on the partition a job is submitted to, wouldn't it?
  2. Dynamically set `max_*` values for the local execution profiles, based on the amount of resources requested from SLURM (defaulting to the maximum of the partition/node type otherwise).
  3. Add several checks and warnings:
     * Check whether the singularity cache, tmp and Nextflow cache dirs are set properly. Our HPC has strict file quotas and worker nodes with little local storage, which could otherwise be exceeded/filled.
     * Avoid runs on login nodes.
  4. Avoid crashing the GitHub Actions by skipping various parts of the config if the hostname does not match that of our HPC (see the sketch after this list).
  5. Lots of documentation on how to install/run Nextflow, things to avoid, what the default directories are, an overview of the different resources and matching profiles, etc.
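
The hostname guard from point 4 could look roughly like this (a sketch under assumptions: the domain pattern is a placeholder, not the real CalcUA value):

```groovy
// Sketch of the hostname guard: only run the cluster-specific checks when we
// are actually on the HPC, so parsing the config in GitHub Actions does not
// fail. The domain pattern below is a placeholder.
def hostname = 'hostname'.execute().text.trim()
if (hostname ==~ /.*\.example-hpc\.be/) {
    // partition checks, cache/tmp dir warnings, dynamic max_* detection, etc.
}
```
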

I'm open to any and all suggestions to improve this config though, especially if they can simplify this behemoth of a config!

I just noticed that the vsc_ugent.config recently switched to dynamically determining the current SLURM partition, which eliminates the need for different sub-profiles, so I might put that on my todo list. I'd still need two sub-profiles to choose between the SLURM and local executor, but if one of them were set as the default, that would still simplify things.
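
For reference, a rough sketch of what that dynamic-partition approach could look like (this is not taken from vsc_ugent.config; the partition names match the sub-profiles mentioned above, but the limits are placeholder values):

```groovy
// Sketch only: select partition-specific limits at runtime instead of via
// separate sub-profiles. The numbers below are placeholders.
def partition = System.getenv('SLURM_JOB_PARTITION') ?: 'zen2'   // fall back to a default partition

def partition_limits = [
    zen2   : [cpus: 64, memory: '240.GB', time: '72.h'],
    skylake: [cpus: 28, memory: '176.GB', time: '72.h'],
]

def limits = partition_limits[partition] ?: partition_limits['zen2']

params.max_cpus   = limits.cpus
params.max_memory = limits.memory
params.max_time   = limits.time
```
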

conf/vsc_calcua.config (review comment: outdated, resolved)
@jfy133 (Member) commented on Aug 8, 2024

I don't have time to read through fully currently (in the middle of running/teaching a summer school), so I would say you can merge in otherwise, depending on @nvnieuwk 's comment :)

pmoris added 2 commits August 8, 2024 15:29
Makes the warning messages clearer to the user and
differentiates between warnings and errors.
Exits in case the singularity/apptainer cache is not set.
Improves the docs related to cache and temp dirs.
@pmoris (Contributor, Author) commented on Aug 8, 2024

Incorporated your suggestions and tried to improve the docs/comments a bit more. Ready for merging! Thanks for the help!

@pmoris merged commit ad49f85 into nf-core:master on Aug 8, 2024
126 checks passed