TIP: Before you run the following commands, HPE recommends that you read the HPE Swarm Learning User Guide to understand the architecture of Swarm Learning, how the nodes work, how model training happens, and so on.
For examples of how to provide options to the various run commands, see the Examples chapter in the HPE Swarm Learning User Guide.
IMPORTANT:
- Ensure that the network proxy settings are configured correctly and that the containers can communicate with each other.
- Ensure that Docker commands can be run as a non-root user by adding your current user to the `docker` group.
- Ensure that the system time is synchronized across all systems by using NTP.
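The `docker` group requirement above can be checked before you start any scripts. It is also the condition behind the `--sudo` default in the options table. The following is a minimal sketch that assumes a Linux host with Python's standard `pwd` and `grp` modules. The helper name `user_in_group` is hypothetical and is not part of Swarm Learning.

```python
import grp
import os
import pwd
from typing import Optional


def user_in_group(user: Optional[str] = None, group: str = "docker") -> bool:
    """Return True if `user` (default: the current user) is a member of `group`.

    Reads the system user and group databases, so it reflects the configured
    membership even before a new login session picks it up.
    """
    try:
        if user is None:
            user = pwd.getpwuid(os.getuid()).pw_name
        target = grp.getgrnam(group)
    except KeyError:
        # Either the current UID has no passwd entry or the group does not exist.
        return False

    # Supplementary members are listed explicitly in the group entry.
    if user in target.gr_mem:
        return True

    # A user can also hold the group as their primary group.
    try:
        return pwd.getpwnam(user).pw_gid == target.gr_gid
    except KeyError:
        return False


if __name__ == "__main__":
    print(f"Current user in docker group: {user_in_group()}")
```

On most Linux distributions, a user is added to the group with `sudo usermod -aG docker $USER`. The change takes effect only in new login sessions.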
Make sure that the License Server is started and the licenses are installed. Then start and run Swarm Learning in the following order:
- Start the Sentinel node.
- Start the Swarm Network node before starting any of the associated Swarm Learning nodes.
- After the training is completed, stop all the containers by running the `stop-swarm` script on all nodes.
The scripts in the `swarm-learning/bin` directory are used to start these components. Running the scripts requires a bash shell and a Linux environment.
NOTE: By default, Swarm Learning is installed in `/opt/hpe/swarm-learning`. If you installed it in a different directory, the run commands are located under that directory instead.
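Two requirements are stated above: a Linux environment with bash, and the scripts' location under the installation directory. A small pre-flight check that reports on both is sketched below. It is a sketch only. It uses just the Python standard library, and the helper names `swarm_bin_dir` and `prerequisite_problems` are hypothetical.

```python
import platform
import shutil
from pathlib import Path
from typing import List, Optional

# Default installation directory, as stated in the NOTE above.
DEFAULT_INSTALL_DIR = Path("/opt/hpe/swarm-learning")


def swarm_bin_dir(install_dir: Optional[str] = None) -> Path:
    """Return the bin directory under the given (or default) install root."""
    root = Path(install_dir) if install_dir else DEFAULT_INSTALL_DIR
    return root / "bin"


def prerequisite_problems() -> List[str]:
    """List unmet prerequisites for running the scripts (empty if all are met)."""
    problems = []
    if platform.system() != "Linux":
        problems.append(f"expected a Linux environment, found {platform.system()}")
    if shutil.which("bash") is None:
        problems.append("bash was not found on PATH")
    return problems


if __name__ == "__main__":
    bin_dir = swarm_bin_dir()
    print(f"Scripts directory: {bin_dir} (exists: {bin_dir.is_dir()})")
    for problem in prerequisite_problems():
        print(f"Prerequisite not met: {problem}")
```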
All start scripts accept the following common options for configuring the `docker run` command that starts the container.
NOTE: These options do not apply to the `swarm-learning/bin/stop-swarm` script. They are similar to the options of the `docker run` command.
Parameter name | Description | Default value |
---|---|---|
`--hostname <name>` | The host name assigned to the Docker container. | The value of `--name`, if it is specified; otherwise, Docker assigns a host name. |
`--name <name>` | The name assigned to the Docker container. | Docker assigns a random name to the container. |
`--network <network name>` | The Docker network that the container joins. | Docker's default bridge network. |
`--pull` | Pull the Docker image from its repository before running it. | False. The image is not pulled from its repository if it is already available locally. |
`--sudo` | Prefix the Docker commands with `sudo`. | False if the current user belongs to the `docker` group; true otherwise. |
`-d, --detach` | Run the container in the background. | A pseudo-terminal is allocated if the launcher has an associated terminal; otherwise, the container is run in the background. |
`-i, --interactive` | Keep STDIN open even if not attached to a terminal. | STDIN is kept open if a pseudo-terminal is allocated to the container; otherwise, it is closed. |
`-t, --tty` | Allocate a pseudo-terminal for the container. | A pseudo-terminal is allocated if the launcher has an associated terminal; otherwise, the container is run in the background. |
`-e, --env var=val` | Set an environment variable inside the container. | |
`-l, --label key=val` | Set metadata on the container. | |
`-p, --publish host-port:container-port` | Publish a container port to the host. | |
`-u, --user { name \| uid } [ : { group \| gid } ]` | User and group ID to use inside the container. | |
`-v, --volume host-path:container-path` | Bind mount a volume. | |
`-w, --workdir container-path` | Working directory inside the container. | |
`--rm` | Same as `--no-keep`. Request Docker to automatically remove the container when it exits. | |
`--no-rm` | Same as `--keep`. Request Docker to preserve the container after it exits. | |
`--keep` | Same as `--no-rm`. Request Docker to preserve the container after it exits. | |
`--no-keep` | Same as `--rm`. Request Docker to automatically remove the container when it exits. | |
`-h, --help` | Display the help message. | |
`--primary-apls-ip <IP address or DNS name>` | The IP address or DNS name on which the primary Autopass License Server serves license requests. | None |
`--secondary-apls-ip <IP address or DNS name>` | The IP address or DNS name on which the secondary Autopass License Server serves license requests. | None |
`--primary-apls-port <port number>` | The port number on which the primary Autopass License Server serves license requests. | 5814 |
`--secondary-apls-port <port number>` | The port number on which the secondary Autopass License Server serves license requests. | The value assigned to `--primary-apls-port` |
`--apls-pdf <path to license PD file>` | The path to the license PD file to use. | None |
`--cacert <path to certificates file>` | The path to the file containing the list of CA certificates. | None |
`--capath <path to certificates directory>` | The path to the directory containing CA certificate files. | None |
`--cert <path to certificate file>` | The path to the certificate file that provides the component's ID. | None |
`--key <path to key file>` | The path to the private key file corresponding to the certificate. | None |
`--socket-path <SPIFFE Workload API socket>` | The path, volume, or container hosting the socket on which the SPIFFE Agent serves the Workload API. | None |
`--host-ip <IP address or DNS name>` (Mandatory parameter) | The IP address or DNS name of the host system on which this Swarm Learning node is created. | |
`--sn-ip <IP address or DNS name>` | The IP address or DNS name of the host system running the Swarm Network (SN) node that this Swarm Learning node must associate with. | |
`--sn-api-port <port number>` | Host port for the API server of the associated Swarm Network node. | 30304 |
`--sl-fs-port <port number>` | Host port for the file server of this Swarm Learning node. | 30305 |
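Several defaults in the table depend on one another. `--hostname` falls back to `--name`, and `--secondary-apls-port` falls back to `--primary-apls-port`. The four keep/remove flags form two alias pairs. The following Python sketch resolves the effective settings from the options a user passed explicitly. It is illustrative only: `resolve_options` is a hypothetical helper, not part of Swarm Learning. It treats `--host-ip` as mandatory, as the table marks it. It rejects a mix of remove and keep flags, because this section does not say how the scripts resolve that case.

```python
from typing import Dict, Optional

# Defaults from the options table above.
DEFAULT_PRIMARY_APLS_PORT = 5814
DEFAULT_SN_API_PORT = 30304
DEFAULT_SL_FS_PORT = 30305

# Aliases: --rm/--no-keep remove the container on exit; --keep/--no-rm keep it.
REMOVE_FLAGS = {"--rm", "--no-keep"}
KEEP_FLAGS = {"--keep", "--no-rm"}


def resolve_options(opts: Dict[str, Optional[str]]) -> Dict[str, object]:
    """Apply the documented defaults to explicitly passed options.

    `opts` maps option names such as "--host-ip" to their values; flags that
    take no value, such as "--rm", map to None.
    """
    if not opts.get("--host-ip"):
        raise ValueError("--host-ip is mandatory")

    removes = sorted(f for f in opts if f in REMOVE_FLAGS)
    keeps = sorted(f for f in opts if f in KEEP_FLAGS)
    if removes and keeps:
        # Sketch choice: reject contradictory flags rather than guess precedence.
        raise ValueError(f"conflicting flags: {removes + keeps}")

    primary_port = int(opts.get("--primary-apls-port") or DEFAULT_PRIMARY_APLS_PORT)
    return {
        "host_ip": opts["--host-ip"],
        # --hostname falls back to --name; None means Docker assigns one.
        "hostname": opts.get("--hostname") or opts.get("--name"),
        "primary_apls_port": primary_port,
        # The secondary APLS port inherits the primary port unless overridden.
        "secondary_apls_port": int(opts.get("--secondary-apls-port") or primary_port),
        "sn_api_port": int(opts.get("--sn-api-port") or DEFAULT_SN_API_PORT),
        "sl_fs_port": int(opts.get("--sl-fs-port") or DEFAULT_SL_FS_PORT),
        # None: no keep/remove flag was given, so the script's own default applies.
        "auto_remove": True if removes else (False if keeps else None),
    }
```

For example, `resolve_options({"--host-ip": "192.0.2.10", "--primary-apls-port": "6000"})` yields a secondary APLS port of 6000, because the secondary port was not set explicitly.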
Parameter name | Description |
---|---|
`--ml-image <ML image name>` (Optional parameter) | Name of the user's Machine Learning image. |
`--ml-entrypoint <entrypoint>` (Optional parameter) | Entry point of the Machine Learning container. |
`--ml-cmd <command>` (Optional parameter) | Command passed to the Machine Learning container. |
`--ml-w <directory path>` (Optional parameter) | Working directory of the Machine Learning container. |
`--ml-name <container name>` (Optional parameter) | Name of the Machine Learning container. |
`--ml-v <host-path:container-path>` (Optional parameter) | Bind mount a volume for the Machine Learning container. |
`--ml-e <environment-variable-name=value>` (Optional parameter) | Environment variable to pass to the Machine Learning container. |
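The ML options above appear to be given to the script that starts a Swarm Learning node, alongside node options such as `--host-ip` and `--sn-ip`. The following minimal Python sketch assembles such a command line as an argument list. It rests on assumptions that this section does not confirm:

- The start script is `bin/run-sl` under the default installation directory.
- `--ml-v` and `--ml-e` can be repeated, like Docker's `-v` and `-e`.

`build_run_sl_argv`, the addresses, and the image name are hypothetical.

```python
import shlex
from pathlib import Path
from typing import Dict, List, Optional, Sequence, Tuple

# ASSUMPTION: the Swarm Learning node start script is bin/run-sl under the
# default installation directory; adjust the path for your installation.
RUN_SL = Path("/opt/hpe/swarm-learning/bin/run-sl")


def build_run_sl_argv(
    host_ip: str,
    sn_ip: str,
    ml_image: str,
    ml_volumes: Sequence[Tuple[str, str]] = (),
    ml_env: Optional[Dict[str, str]] = None,
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Assemble an argument list using only options documented in the tables."""
    argv = [str(RUN_SL), "--host-ip", host_ip, "--sn-ip", sn_ip, "--ml-image", ml_image]
    # ASSUMPTION: --ml-v and --ml-e may be repeated, like docker run -v/-e.
    for host_path, container_path in ml_volumes:
        argv += ["--ml-v", f"{host_path}:{container_path}"]
    for name, value in (ml_env or {}).items():
        argv += ["--ml-e", f"{name}={value}"]
    # Common options such as --name or --sn-api-port are passed through as-is.
    argv += list(extra_args)
    return argv


if __name__ == "__main__":
    argv = build_run_sl_argv(
        host_ip="192.0.2.11",  # documentation-only example address
        sn_ip="192.0.2.10",
        ml_image="my-ml-image:latest",  # hypothetical image name
        ml_volumes=[("/data/train", "/tmp/data")],
        ml_env={"EPOCHS": "10"},
        extra_args=["--name", "sl-1"],
    )
    # Print a shell-quoted command line for review.
    print(shlex.join(argv))
```

Building a list rather than a single string keeps each value a separate argument. To launch the command, pass the list to `subprocess.run(argv, check=True)`; no shell quoting is involved.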
Also see: