Throwaway Slurm cluster in AWS (Feb'21)
We have created a small throwaway Slurm cluster, using Cluster-in-the-Cloud, to experiment with the EESSI pilot repository, thanks to generous sponsorship by AWS.
This is a throwaway Slurm cluster: it will only be available for a short time!
We currently plan to destroy the cluster one week after the EESSI update meeting of Feb'21, i.e. on Thu Feb 11th 2021.
DO NOT USE THESE RESOURCES FOR PRODUCTION WORK!
Also, use these resources sparingly: make sure to cancel jobs as soon as you're done experimenting.
To get access, please contact Kenneth Hoste via the EESSI Slack or email ([email protected]), so he can create an account for you.
Required information:
- desired login name (for example `kehoste`)
- first and last name
- GitHub account (which is only used to grab the SSH public keys associated with it, see for example https://github.com/boegel.keys), or alternatively an SSH public key (see the sketch below if you need to generate one)
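
If you don't have an SSH key pair yet, here's a minimal sketch for generating one (the Ed25519 key type and the default file path are just common choices, not a requirement):

```
# generate a new Ed25519 key pair; the private key never leaves your machine
ssh-keygen -t ed25519

# share only the *public* key (the .pub file)
cat ~/.ssh/id_ed25519.pub
```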
To log in, use your personal account to SSH into the login node of the cluster (you should be informed which hostname to use):

```
ssh YOUR_USER_NAME_GOES_HERE@HOSTNAME_GOES_HERE
```
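
If you'd rather not retype the username and hostname every time, you could add an entry to your `~/.ssh/config`; a small sketch (the `eessi-aws` alias is made up, and the hostname/username are the placeholders from above):

```
# ~/.ssh/config -- values below are placeholders
Host eessi-aws
    HostName HOSTNAME_GOES_HERE
    User YOUR_USER_NAME_GOES_HERE
```

After that, `ssh eessi-aws` is enough.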
To get a list of all compute nodes, use the `list_nodes` command.
Nodes marked with `idle~` are idle, but currently not booted, so it will take a couple of minutes to start them when Slurm directs a job to them.
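
Node states (and the node features used for node selection further below) can also be inspected with plain Slurm commands; a small sketch using a custom `sinfo` output format (hostname, compact state such as `idle~`, and available features):

```
# show each node's hostname, compact state and available features
sinfo --Node -o "%n %t %f"
```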
There are 16 compute nodes available in this throwaway cluster, of 4 different node types:
- 4x `c4.xlarge`: Intel Haswell, 4 cores, 7.5GB RAM
- 4x `c5.xlarge`: Intel Skylake or Cascade Lake, 4 cores, 8GB RAM
- 4x `c5a.xlarge`: AMD Rome, 4 cores, 8GB RAM
- 4x `c6g.xlarge`: AWS Graviton 2 (64-bit Arm), 4 cores, 8GB RAM
See https://aws.amazon.com/ec2/instance-types/#Compute_Optimized for detailed information.
The compute nodes are only started when needed, i.e. when jobs are submitted to run on them.
Keep this in mind when submitting jobs: it may take a couple of minutes before your job starts if there are no (matching) active compute nodes!
To check for active nodes, use the `sinfo` command.
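
While a node is booting for your job, the job simply stays pending; to keep an eye on your own jobs you can use `squeue` (the `-u $USER` filter is just a convenience):

```
# list only your own jobs, including ones still waiting for a node to start
squeue -u $USER
```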
There is no high-performance network interconnect for these compute nodes, so be careful with multi-node jobs!
Your home directory is on a shared filesystem, so any files you create there are also accessible from the compute nodes.
Keep in mind that this is a slow NFS filesystem; it is not well suited for I/O-intensive work!
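
For I/O-heavy steps it may help to stage data on node-local storage inside the job and only copy results back to your home directory at the end; a sketch, assuming the compute nodes have usable space under `/tmp` (the `input.dat` and `results.out` file names are purely illustrative):

```
# create a per-job scratch directory on node-local storage (assumes /tmp has room)
SCRATCH=$(mktemp -d /tmp/${USER}_scratch_XXXX)
cp ~/input.dat "$SCRATCH"   # hypothetical input file from the shared home directory
cd "$SCRATCH"
# ... run the I/O-intensive work here ...
cp results.out ~/           # copy only the results back to the shared home
cd && rm -rf "$SCRATCH"
```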
To start a quick interactive job, use:

```
srun --pty /bin/bash
```
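
If you want a full node's worth of cores and an explicit time limit, you can pass the usual `srun` options; a sketch (the values are just examples):

```
# interactive job with 4 cores, capped at 1 hour
srun --cpus-per-task=4 --time=1:00:00 --pty /bin/bash
```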
To submit a job to a specific type of node, use the `--constraint` option and specify the node "shape".
For example, to submit to a Graviton2 compute node:

```
srun --constraint=shape=c6g.xlarge --pty /bin/bash
```
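
The same constraint works for batch jobs submitted with `sbatch`; a minimal sketch of a job script (the job name, time limit, and `lscpu` payload are just examples):

```
#!/bin/bash
# example job script: request 4 cores on an AMD Rome (c5a.xlarge) node for 15 minutes
#SBATCH --job-name=eessi-test
#SBATCH --constraint=shape=c5a.xlarge
#SBATCH --cpus-per-task=4
#SBATCH --time=0:15:00

# example payload: show the CPU details of the node the job landed on
lscpu
```

Submit it with `sbatch job.sh` (or whatever you named the file).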
The EESSI pilot repository is available on each compute node.
Keep in mind that it may not be mounted yet, since CernVM-FS uses `autofs`, so `ls /cvmfs` may show nothing.
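
Accessing the full repository path is enough to trigger the `autofs` mount, so this is a quick way to check that the repository is reachable:

```
# accessing the repository path makes autofs mount it on demand
ls /cvmfs/pilot.eessi-hpc.org
```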
To get started, just source the EESSI initialization script:

```
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash
```

Then use `module avail` to check the available software.
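
From there, loading and using software works as usual; a sketch, assuming GROMACS is among the modules listed by `module avail`:

```
# load a module provided by the EESSI pilot repository (GROMACS is just an example)
module load GROMACS
gmx --version   # the gmx command should now come from /cvmfs
```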
We recommend trying out the demo scripts available at https://github.com/EESSI/eessi-demo (see the `run.sh` script in each subdirectory).
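
A sketch of what that could look like (the `TensorFlow` subdirectory is just one example; pick any subdirectory that contains a `run.sh`):

```
git clone https://github.com/EESSI/eessi-demo.git
cd eessi-demo/TensorFlow   # example subdirectory; others are available too
./run.sh
```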