ai2-kit
implements a lightweight HPC executor for submitting and managing jobs on HPC clusters. Compared with other HPC schedulers or workflow frameworks that support HPC (such as DpDispatcher, parsl, DFlow, etc.), ai2-kit HPC executor
has the following characteristics:
- SSH remote execution and local execution
- Connecting to HPC clusters through jump servers
- Remote execution of Python functions
- Executing simple commands or functions directly on the login node
- More efficient and stable job status polling mechanism
- State recovery mechanism based on Checkpoint
- Synchronous or asynchronous waiting for job execution
- Easy to customize or integrate with other frameworks
If you encounter the following problems when using other solutions, you may wish to try ai2-kit HPC executor
:
- Too high learning cost
- Unstable connection
- Need to frequently copy data to the local for processing and then submit it back to the cluster for execution
Currently, ai2-kit HPC executor
only supports the Slurm job system. The support for other job systems depends on actual needs. Welcome to submit Issues or PR.
If you need a more powerful workflow engine, it is recommended to try DFlow, covalent, parsl, redun
There are two ways to initialize ai2-kit HPC executor
, using a dictionary configuration or a Pydantic
object. The configuration items of the two methods are exactly the same. The former is more suitable for direct use, and the latter is more suitable for integration with other frameworks. The following is an example of using a dictionary for configuration.
from ai2_kit.core.executor import HpcExecutor
executor = HpcExecutor.from_config({
'ssh': { # Specify ssh connection information, if omitted, local execution is used
'host': 'user01@hpc-login01',
'gateway': { # If you need to connect through a jump server, you can specify this configuration (optional)
'host': 'user01@jump-host',
}
},
'queue_system': {
'slurm': {} # Specify the job system as Slurm
},
'work_dir': '/home/user01/ai2-kit/work_dir', # Specify the working directory
'python_cmd': '/home/user01/conda/env/py39/bin/python', # Specify the Python interpreter
}, 'cheng-lab') # Specify the cluster name (optional)
executor.init() # Initialize the executor
executor.run('echo "hello world"') # Execute command on login node
The above example completes the following tasks:
- Instantiate a
HpcExecutor
object - Initialize the
HpcExecutor
object - Execute the command
echo "hello world"
on the login node
The usual mode of executing complex computing tasks on HPC is:
- Prepare the input data of the computing task on the login node through the command line or Python
- Submit the job to the queue and wait
- After the job is completed, use the command line or Python on the login node to process the output data of the computing task
In order to meet the above mode, in addition to providing the run
interface for executing commands on the login node, ai2-kit HPC executor
also provides the run_python_fn
interface for directly running Python functions on the login node.
def add(a, b):
return a + b
result = executor.run_python_fn(add)(1, 2) # run on login node
Note that in order to execute local Python functions on remote nodes, the following conditions must be met:
- The main version of the local Python environment and the remote Python environment must be consistent (such as 3.8.x)
- The configuration of the remote Python environment can be specified through the
python_cmd
parameter
- The configuration of the remote Python environment can be specified through the
- The parameters and return values of the function must be serializable (cannot contain unserializable objects such as locks and file handles)
- The software packages on which the function depends must exist in the Python environment of the login node
- For example, suppose the function executed remotely uses the
numpy
package, then thenumpy
package must exist in the Python environment of the login node
- For example, suppose the function executed remotely uses the
- If the function depends on other locally implemented methods or classes, these methods and classes need to be defined in a special way, otherwise
ModuleNotFoundError
will appear remotely- This is a limitation of cloudpickle, see: 1
ai2-kit HPC executor
provides the submit
interface for submitting jobs to HPC clusters. The following is an example
script = '''\
#! /bin/bash
#SBATCH --job-name=demo
#SBATCH -N 1
#SBATCH --ntasks-per-node=4
#SBATCH --partition=cpu-small
#SBATCH --mem=1G
echo "hello world" > output.txt
'''
job = executor.submit(script) # submit job
Except for the method of directly writing scripts mentioned above, you can also generate scripts through the tool class provided by ai2-kit
, as shown below:
from ai2_kit.core.script import BashScript, BashStep, BashTemplate
header = '''\
#SBATCH --job-name=demo
#SBATCH -N 1
#SBATCH --ntasks-per-node=4
#SBATCH --partition=cpu-small
#SBATCH --mem=1G
'''
script = BashScript(
template=BashTemplate(
header=header,
),
steps=[
BashStep(cmd='echo "hello world"'),
]
)
job = executor.submit(script.render(), cwd='/path/to/cwd')
Besides, you can specify additional parameters when submitting jobs to meet different needs, such as setting the execution directory and the checkpoint file for error recovery.
There are two ways to wait for the completion of the submitted task, synchronous and asynchronous. The following is an example:
...
state = job.result() # wait for job completion synchronously
async def main():
...
state = await job.result_async() # wait for job completion asynchronously
Synchronous waiting will block the subsequent code execution, so it is not suitable for scenarios where multiple jobs need to be submitted in parallel. At this time, asynchronous waiting can be used.
Although ai2-kit
does not provide a workflow scheduling engine, for simple tasks, with the support of Python's asynchronous support and the tool classes provided by ai2-kit
, it is easy to implement a simple workflow. Next, take the following simple task as an example:
- Pre-processing: Implement a Python function to create n working directories and an input file containing a random number
- Submit n job scripts, each job script reads the number in the input and calculates its square and writes it to the output file
- Post-processing: Implement a Python function to read the output values of all files and sum them after all jobs are completed
The specific code implementation can be found in simple-workflow.
From the code implementation, it can be seen that the workflow implemented through ai2-kit
is essentially ordinary Python code, but when it is necessary to execute functions remotely or submit jobs, it will call the interface provided by ai2-kit HPC executor
. Therefore, it is not difficult for students familiar with Python coding to get started with ai2-kit
.
More complex workflows are also implemented in the same way, but on this basis, modeling and parsing of configuration files are added, as well as the use of conditional branches and loops for flow control. If you are interested, you can refer to the code in the ai2-kit.workflow
module.