Merge pull request #103 from cedadev/remove-lsf-refs
Remove LSF references, fix typos
mjpritchard authored Jul 30, 2024
2 parents c5d8a55 + a7393b3 commit 65c4482
Showing 11 changed files with 96 additions and 98 deletions.
8 changes: 4 additions & 4 deletions content/docs/batch-computing/example-job-2-calc-md5s.md
@@ -16,7 +16,7 @@ This is a simple case because:
1. the archive only needs to be read by the code and
2. the code that we need to run involves only the basic linux commands so there are no issues with picking up dependencies from elsewhere.

### Case Description**
### Case Description

- we want to calculate the MD5 checksums of about 220,000 files. It will take a day or two to run them all in series.
- we have a text file that contains 220,000 lines - one file per line.
@@ -91,7 +91,7 @@ All jobs ran within about an hour.
A variation on Case 2 has been used for checksumming datasets in the CMIP5
archive. The Python code below will find all NetCDF files in a DRS dataset and
generate a checksums file and error log. Each dataset is submitted as a
separate bsub job.
separate Slurm job.

```python
"""
@@ -116,7 +116,7 @@ def submit_job(dataset):
if not op.exists(path):
raise Exception('%s does not exist' % path)
job_name = dataset
cmd = ('bsub -q lotus -J {job_name} '
cmd = ('sbatch -q short-serial -J {job_name} '
'-o {job_name}.checksums -e {job_name}.err '
"/usr/bin/md5sum '{path}/*/*.nc'").format(job_name=job_name,
path=path)
@@ -141,6 +141,6 @@ separate job by invoking the above script as follows:

{{<command user="user" host="sci1">}}
./checksum_dataset.py $(cat datasets_to_checksum.dat)
sbatch-q short-serial -J cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128 -o cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128.checksums -e cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128.err /usr/bin/md5sum '/badc/cmip5/data/cmip5/output1/MOHC/HadGEM2-ES/rcp85/day/seaIce/day/r1i1p1/v20111128/*/*.nc'
sbatch -q short-serial -J cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128 -o cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128.checksums -e cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128.err /usr/bin/md5sum '/badc/cmip5/data/cmip5/output1/MOHC/HadGEM2-ES/rcp85/day/seaIce/day/r1i1p1/v20111128/*/*.nc'
(out)Job <745307> is submitted to queue <lotus>. ...
{{</command>}}
@@ -59,7 +59,7 @@ sleep 5m
```

For job specification of resources please refer to Table 2 of the help article
[LSF to Slurm quick reference]({{< ref "lsf-to-slurm-quick-reference" >}})
[Slurm quick reference]({{< ref "slurm-quick-reference" >}})

## Method 2: Submit via command-line options
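
As a rough sketch, the same resources can instead be requested directly on the `sbatch` command line; the partition, time limit, and script name below are illustrative assumptions rather than values taken from this page:

```bash
# Submit the job with options given on the command line instead of
# #SBATCH directives in the script (all values here are placeholders)
sbatch --partition=short-serial --time=00:10:00 --job-name=sleep-test my_job.sh
```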

74 changes: 0 additions & 74 deletions content/docs/batch-computing/lsf-to-slurm-quick-reference.md

This file was deleted.

73 changes: 73 additions & 0 deletions content/docs/batch-computing/slurm-quick-reference.md
@@ -0,0 +1,73 @@
---
aliases:
- /article/4891-lsf-to-slurm-quick-reference
- /docs/batch-computing/lsf-to-slurm-quick-reference/
date: 2022-10-11 15:15:57
description: An overview of Slurm commands and environment variables
slug: slurm-quick-reference
tags:
- lotus
- orchid
- slurm
title: Slurm quick reference
---

## The Slurm Scheduler

[Slurm](https://slurm.schedmd.com/) is the job scheduler deployed on JASMIN. It
allows users to submit, monitor, and control jobs on the [LOTUS]({{< ref "lotus-overview" >}}) (CPU) and [ORCHID]({{< ref "orchid-gpu-cluster" >}}) (GPU) clusters.

## Essential Slurm commands

| **Slurm command** | **Description** |
| ---------------------------------- | --------------------------------------- |
| sbatch _script_file_ | Submit a job script to the scheduler |
| sinfo | Show available scheduling queues |
| squeue -u _\<username\>_ | List user's pending and running jobs |
| srun -n 1 -p test \--pty /bin/bash | Request an interactive session on LOTUS |
{.table .table-striped}
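
A typical sequence using these commands might look like the following (the job script name is a placeholder):

```bash
# List the available partitions (queues)
sinfo

# Submit a job script, then list your own pending and running jobs
sbatch my_job.sh
squeue -u $USER

# Request an interactive session on the LOTUS test partition
srun -n 1 -p test --pty /bin/bash
```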

## Job specification

<!-- Turn word wrap off to edit this table, or use a site such as https://tableconvert.com/markdown-to-markdown -->
| **Slurm parameter** | **Description** |
| ----------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| #SBATCH | Scheduler directive |
| \--partition=_queue_name_ <br> -p _queue_name_ | Specify the scheduling queue |
| \--time=_hh:mm:ss_ or -t _hh:mm:ss_ | Set the maximum runtime limit |
| \--time-min=_hh:mm:ss_ | Set an estimated runtime |
| \--job-name=_jobname_ | Specify a name for the job |
| \--output=_filename_ or -o _filename_ <br> \--error=_filename_ or -e _filename_ | Standard job output and error output. Default append. The default file name is `slurm-%j.out`, where `%j` is replaced by the job ID |
| \--open-mode=append\|truncate | Write mode for error/output files |
| %j | Job ID |
| %a | Job array index |
| \--mem=_XXX_ | Memory required for the job (_XXX_). Default units are megabytes |
| \--array= _index_ (e.g. \--array=1-10) | Specify a job array. The default file name is `slurm-%A_%a.out`, `%A` is replaced by the job ID and `%a` with the array index. |
| \--array=index% _ArrayTaskThrottle_ <br> (e.g. \--array=1-15%4 will limit the number of simultaneously running tasks from this job array to 4) | A maximum number of simultaneously running tasks from the job array may be specified using a `%` separator. |
| -D <br> \--chdir=_\<directory\>_ | Set the working directory of the batch script to _\<directory\>_ before it is executed. |
| \--exclusive | Exclusive execution mode |
| \--dependency= _\<dependency_list\>_ | Defer the start of this job until the specified dependencies have been satisfied |
| \--ntasks=_number-of-cores_ <br> -n _number-of-cores_ | Number of CPU cores |
| \--constraint="_\<host-group-name\>_" | Select a node with a specific processor model |
{.table .table-striped}
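
Put together, a minimal job script using several of these directives might look like the sketch below; the partition, runtime, memory, and the `md5sum` command are illustrative, not a prescribed configuration:

```bash
#!/bin/bash
#SBATCH --partition=short-serial
#SBATCH --job-name=md5-example
#SBATCH --time=01:00:00
#SBATCH --mem=1000
#SBATCH -o %j.out
#SBATCH -e %j.err

# Replace with the real workload; this path is a placeholder
md5sum /path/to/data/*.nc
```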

## Job control commands

| **Slurm command** | **Description** |
| ------------------------------- | ----------------------------- |
| scancel _\<jobid\>_ | Kill a job |
| scontrol show job _\<jobid\>_ | Show detailed job information |
| scontrol update job _\<jobid\>_ | Modify a pending job |
| scancel \--user=_\<username\>_ | Kill all jobs owned by a user |
{.table .table-striped}
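
For example, with an illustrative job ID of 123456, a pending job could be inspected, modified, and then cancelled:

```bash
# Inspect the job, change its time limit while it is still pending,
# then cancel it if it is no longer needed (job ID is a placeholder)
scontrol show job 123456
scontrol update JobId=123456 TimeLimit=02:00:00
scancel 123456
```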

## Job environment variables

| **Slurm variable** | **Description** |
| --------------------- | ------------------------------------ |
| $SLURM_JOBID | Job identifier number |
| $SLURM_ARRAY_JOB_ID | Job array's master job ID |
| $SLURM_ARRAY_TASK_ID | Job array index |
| $SLURM_ARRAY_TASK_MAX | Last index number within a job array |
| $SLURM_NTASKS | Number of processors allocated |
{.table .table-striped}
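
As a sketch of how these variables can be used together, an array job script along the following lines (the `filelist.txt` input and output names are assumptions) would process one file per array task:

```bash
#!/bin/bash
#SBATCH --partition=short-serial
#SBATCH --time=00:30:00
#SBATCH --array=1-10

# Take the Nth line of a list of input files, where N is this task's array index
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" filelist.txt)

echo "Job ${SLURM_JOBID}: task ${SLURM_ARRAY_TASK_ID} of ${SLURM_ARRAY_TASK_MAX}"
md5sum "${INPUT}" > "checksum_${SLURM_ARRAY_TASK_ID}.md5"
```
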
@@ -203,7 +203,7 @@ button.
## Managing groups

When you deploy a cluster through CaaS, it may create one or more access
control groups in FreeIPA as part of it's configuration. Some clusters can
control groups in FreeIPA as part of its configuration. Some clusters can
also consume additional groups created in FreeIPA. This is discussed in more
detail in the documentation for each cluster type, but the way you manage
group membership is the same in all cases.
@@ -14,7 +14,7 @@ Cluster-as-a-Service (CaaS).
[Kubernetes](https://kubernetes.io/) is an open-source system for automating
the deployment, scaling and management of containerised applications.

Kubernetes is an extremely powerful system, and a full discussion of it's
Kubernetes is an extremely powerful system, and a full discussion of its
capabilities is beyond the scope of this article - please refer to the
Kubernetes documentation. This article assumes some knowledge of Kubernetes
terminology and focuses on things that are specific to the way Kubernetes is
@@ -8,7 +8,7 @@ title: Understanding new JASMIN storage
weight: 160
---

{{<alert type="info">}}This article was originally written in 2018/19 to introdice new forms of storage which were brought into produciton at that stage. Some of the information and terminology is now out of date, pending further review of JASMIN documentation.{{</alert>}}
{{<alert type="info">}}This article was originally written in 2018/19 to introduce new forms of storage which were brought into production at that stage. Some of the information and terminology is now out of date, pending further review of JASMIN documentation.{{</alert>}}

## Introduction

21 changes: 10 additions & 11 deletions content/docs/short-term-project-storage/faqs-storage.md
@@ -8,7 +8,7 @@ tags:
title: New storage FAQs and issues
---

{{<alert type="info">}}This article was originally written in 2018/19 to introdice new forms of storage which were brought into produciton at that stage. Some of the information and terminology is now out of date, pending further review of JASMIN documentation.{{</alert>}}
{{<alert type="info">}}This article was originally written in 2018/19 to introduce new forms of storage which were brought into production at that stage. Some of the information and terminology is now out of date, pending further review of JASMIN documentation.{{</alert>}}

Workflows with some of the issues highlighted below will have a knock-on
effect for other users, so please take the time to check and change your code
@@ -64,10 +64,10 @@ starting another.

#### Opening the same file for editing in more than one editor on the same or different servers

_Here’s an example of how this shows up using “lsof” and by listing user
Here’s an example of how this shows up using “lsof” and by listing user
processes with “ps”. The same file “ISIMIPnc_to_SDGVMtxt.py” is being edited
in 2 separate “vim” editors. In this case, the system team was unable to kill
the processes on behalf of the user, so the only solution was to reboot sci1._
the processes on behalf of the user, so the only solution was to reboot sci1.

{{<command user="user" host="sci1">}}
lsof /gws/nopw/j04/gwsnnn/
@@ -91,20 +91,20 @@ be rebooted.

## 2\. Issues with small files

_The larger file systems in operation within JASMIN are suitable for storing
The larger file systems in operation within JASMIN are suitable for storing
and manipulating large datasets and not currently optimised for handling small
( <64kBytes) files. These systems are not the same as those you would find on
a desktop computer or even large server, and often involve many disks to store
the data itself and metadata servers to store the file system metadata (such
as file size, modification dates, ownership etc). If you are compiling code
from source files, or running code from python virtual environments, these are
examples of activities which can involve accessing large numbers of small
files._
files.

_Later versions of our PFS systems handled this by using SSD storage for small
Later versions of our PFS systems handled this by using SSD storage for small
files, transparent to the user. SOF however, can’t do this (until later in
2019), so in Phase 4, we introduced larger home directories based on SSD, as
well as an additional and larger scratch area._
well as an additional and larger scratch area.

**Suggested solution:** Please consider using your home directory for small-
file storage, or `/work/scratch-nopw2` for situations involving LOTUS
@@ -125,22 +125,21 @@ similar issues from writing large numbers of small files to SOF storage (known
as QB ).

**Suggested solution:** It is more efficient to write netCDF3 classic files to
another filesystem type (e.g. /work/scratch/pw* or /work/scratch-nopw2) and then move them to a SOF
GWS, rather than writing directly to SOF.
another filesystem type (e.g. `/work/scratch/pw*` or `/work/scratch-nopw2`) and then move them to a SOF GWS, rather than writing directly to SOF.
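
A minimal sketch of that pattern, with placeholder paths and a placeholder program name, is:

```bash
# Write the netCDF3 output to scratch first (paths and program are placeholders)
OUTDIR=/work/scratch-nopw2/$USER/run01
mkdir -p "$OUTDIR"
./my_model --output "$OUTDIR/result.nc"

# ...then move the finished file to the SOF-backed GWS in a single step
mv "$OUTDIR/result.nc" /gws/nopw/j04/gwsnnn/
```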

---

## 3\. "Everything's running slowly today"

_This can be due to overloading of the scientific analysis servers
This can be due to overloading of the scientific analysis servers
(`sci*.jasmin.ac.uk`) which we provide for interactive use. They’re great
for testing a code and developing a workflow, but are not designed for
actually doing the big processing. Please take this heavy-lifting or
long-running work to the LOTUS batch processing cluster, leaving the
interactive compute nodes responsive enough for everyone to use.

**Suggested solution:** When you log in via one of the `login*.jasmin.ac.uk`
nodes, you are shown a 'message of the day" a list of all the `sci*` machines,
nodes, you are shown a 'message of the day': a list of all the `sci*` machines,
along with memory usage and the number of users on each node at that time.
This can help you select a less-used machine (but don’t necessarily expect the
same machine to be the right choice next time!).
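
Once logged in, you can also gauge the load on the current `sci*` machine yourself with standard Linux commands, for example:

```bash
# Load averages and number of logged-in sessions on this machine
uptime
who | wc -l

# Free and used memory in gigabytes
free -g
```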