From 5dd0f1f72c9d34497ab3a7b0e22f012405113944 Mon Sep 17 00:00:00 2001 From: Nathan Rockershousen Date: Mon, 15 Apr 2024 14:25:10 -0500 Subject: [PATCH] File headings and readme updates --- README.md | 4 +- .../install/compute_install_prereqs.md | 6 +- .../install/configure_softroce.md | 14 ++-- .../install/cxi_core_driver.md | 10 +-- .../install_200gbps_nic_host_software.md | 12 ++-- .../install_or_upgrade_compute_nodes.md | 8 +-- ...ll_or_upgrade_shs_on_hpcm_compute_nodes.md | 12 ++-- .../install/install_shs_on_csm.md | 72 +++++++++---------- .../install/post_install_tasks.md | 24 +++---- .../install/sysctl_configuration_example.md | 2 +- 10 files changed, 82 insertions(+), 82 deletions(-) diff --git a/README.md b/README.md index bf4bca4..8857413 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,8 @@ -# DOCS-SHS +# shs-docs ## Overview -The docs-shs repository holds the documentation and documentation publication tooling +The shs-docs repository holds the documentation and documentation publication tooling for the HPE Slingshot Host Software (SHS) product. ## Documentation Source diff --git a/docs/portal/developer-portal/install/compute_install_prereqs.md b/docs/portal/developer-portal/install/compute_install_prereqs.md index f31f13f..4b327a8 100644 --- a/docs/portal/developer-portal/install/compute_install_prereqs.md +++ b/docs/portal/developer-portal/install/compute_install_prereqs.md @@ -1,9 +1,9 @@ -### Required material +# Required material All material will be available via the source URLs provided below as part of the HPE Slingshot Release for manufacturing and internal development systems. 
-#### Slingshot RPMs +## Slingshot RPMs | Name | Contains | Typical Install Target | |-------------------------------|-----------------------------------------------------------------------------------------------------------------|------------------------------------| @@ -18,7 +18,7 @@ All material will be available via the source URLs provided below as part of the Libfabric-devel is required on any host that a user would be able to compile an application for use with `libfabric`. -#### External vendor software +## External vendor software | Name | Contains | Typical Install Target | Recommended Version | URL | |----------------------------|---------------------------------------------|-----------------------------------------|---------------------|------------------------------------------------------------------------------------------------| diff --git a/docs/portal/developer-portal/install/configure_softroce.md b/docs/portal/developer-portal/install/configure_softroce.md index b461bb2..d98c60c 100644 --- a/docs/portal/developer-portal/install/configure_softroce.md +++ b/docs/portal/developer-portal/install/configure_softroce.md @@ -1,19 +1,19 @@ -## Configure Soft-RoCE +# Configure Soft-RoCE Remote direct memory access (RDMA) over Converged Ethernet (RoCE) is a network protocol that enables RDMA over an Ethernet network. RoCE can be implemented both in the hardware and in the software. Soft-RoCE is the software implementation of the RDMA transport. RoCE v2 is used for HPE Slingshot 200Gbps NICs. -### Soft-RoCE on HPE Slingshot 200Gbps NICs +## Soft-RoCE on HPE Slingshot 200Gbps NICs -#### Prerequisites +### Prerequisites 1. `cray-cxi-driver` RPM package must be installed. 2. `cray-rxe-driver` RPM package must be installed. 3. HPE Slingshot 200Gbps NIC Ethernet must be configured and active. 
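The RPM prerequisite checks above can be scripted. A minimal sketch follows — the package names are taken from the prerequisite list, while the `check_pkgs` helper is hypothetical:

```shell
# check_pkgs: report any packages from the list that are not installed.
# The query command is passed in so the helper can be exercised without rpm.
check_pkgs() {
    local query=$1; shift
    local missing=0
    for pkg in "$@"; do
        $query "$pkg" >/dev/null 2>&1 || { echo "missing: $pkg"; missing=1; }
    done
    return $missing
}

# On a node, verify the Soft-RoCE prerequisites with rpm:
#   check_pkgs "rpm -q" cray-cxi-driver cray-rxe-driver
```

Any `missing:` output means the corresponding RPM should be installed before proceeding with the configuration below.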
-#### Configuration +### Configuration The following configuration is on the node image, and modifying the node image varies depending on the system management solution being used (HPE Cray EX or HPCM). @@ -84,11 +84,11 @@ Follow the relevant procedures to achieve the needed configuration. Contact a sy NOTE: Soft-RoCE device creation is not persistent across reboots. The `rxe_init.sh` script must be run on every boot after the HPE Slingshot 200Gbps NIC Ethernet device is fully programmed with links up and AMAs assigned. -### Lustre Network Driver (LND) ko2iblnd configuration +## Lustre Network Driver (LND) ko2iblnd configuration The following ko2iblnd.ko module parameter changes are needed for better Soft-RoCE performance on LNDs. -#### Compute Node tuning for Soft-RoCE +### Compute Node tuning for Soft-RoCE Tuning on compute nodes can be achieved in two ways. Follow the steps that work best for the system in use. @@ -160,7 +160,7 @@ Tuning on compute node can be achieved in two ways. Follow the steps that work b /sys/module/ko2iblnd/parameters/wrq_sge:1 ``` -#### E1000 ko2iblnd tuning for Soft-RoCE +### E1000 ko2iblnd tuning for Soft-RoCE Configure clients to use Soft-RoCE and configure storage with MLX HCAs running HW RoCE. diff --git a/docs/portal/developer-portal/install/cxi_core_driver.md b/docs/portal/developer-portal/install/cxi_core_driver.md index 06edde9..8d80085 100644 --- a/docs/portal/developer-portal/install/cxi_core_driver.md +++ b/docs/portal/developer-portal/install/cxi_core_driver.md @@ -1,19 +1,19 @@ -## CXI core driver +# CXI core driver -### GPU Direct RDMA overview +## GPU Direct RDMA overview GPU Direct RDMA allows a PCIe device (the HPE Slingshot 200GbE NIC in this case) to access memory located on a GPU device. The NIC driver interfaces with a GPU's driver API to get the physical pages for virtual memory allocated on the device. 
-### Vendors supported +## Vendors supported - AMD - ROCm library, amdgpu driver - Nvidia - CUDA library, nvidia driver - Intel - Level Zero library, dmabuf kernel interface -### Special considerations +## Special considerations -#### NVIDIA driver +### NVIDIA driver The NVIDIA driver contains a feature called Persistent Memory. It does not release pinned pages when device memory is freed unless explicitly directed by the NIC driver or upon job completion. diff --git a/docs/portal/developer-portal/install/install_200gbps_nic_host_software.md b/docs/portal/developer-portal/install/install_200gbps_nic_host_software.md index f7700d2..02af311 100644 --- a/docs/portal/developer-portal/install/install_200gbps_nic_host_software.md +++ b/docs/portal/developer-portal/install/install_200gbps_nic_host_software.md @@ -1,9 +1,9 @@ -### Install 200Gbps NIC host software +# Install 200Gbps NIC host software The 200Gbps NIC software stack includes drivers and libraries to support standard Ethernet and libfabric RDMA interfaces. -#### Prerequisites for compute node installs +## Prerequisites for compute node installs The 200Gbps NIC software stack must be installed after a base compute OS install has been completed. A list of 200Gbps NIC supported distribution installs can be found in the "Support Matrix" section under "Slingshot Host Software (SHS)" in the _HPE Slingshot Release Notes_ document. Once those have been installed, proceed with the instructions for Installing 200Gbps NIC Host Software for that distribution. @@ -45,7 +45,7 @@ manually loaded with the following commands: To complete setup, follow the fabric management procedure for Algorithmic MAC Address configuration. 
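After the base install, a quick check that the NIC kernel modules came up can save debugging time later. The sketch below assumes the CXI module names begin with `cxi` (for example, `cxi_ss1`); confirm against the module list shipped with the release:

```shell
# List loaded kernel modules whose names start with "cxi". Reads
# /proc/modules by default; a file may be passed for testing.
loaded_cxi_modules() {
    awk '$1 ~ /^cxi/ {print $1}' "${1:-/proc/modules}"
}

loaded_cxi_modules || true   # empty output: CXI driver stack not loaded
```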
-#### 200Gbps NIC support in early boot +## 200Gbps NIC support in early boot If traffic must be passed over the 200Gbps NIC prior to the root filesystem being mounted (for example, for a network root filesystem using the 200Gbps NIC), @@ -68,7 +68,7 @@ Due to these caveats, it is recommended that the `cray-libcxi-dracut` RPM only be installed on systems whose configurations require 200Gbps NIC support in early boot. -#### Check 200Gbps NIC host software version +## Check 200Gbps NIC host software version Each 200Gbps NIC RPM has the HPE Slingshot version embedded in the release field of the RPM metadata. This information can be queried using standard RPM commands. The @@ -104,7 +104,7 @@ Distribution: (none) The HPE Slingshot release for this version of `cray-libcxi` is 1.2.1 (SSHOT1.2.1). This process can be repeated for all 200Gbps NIC RPMs. -#### Install validation +## Install validation The 200Gbps NIC software stack install procedure should make all 200Gbps NIC devices available for Ethernet and RDMA. Perform the following steps to validate the @@ -132,7 +132,7 @@ Check for 200Gbps NIC Ethernet network devices. hsn0 is CXI interface ``` -#### 200Gbps NIC firmware management +## 200Gbps NIC firmware management See the [Firmware Management](#firmware-management) section for more information on how to update firmware. diff --git a/docs/portal/developer-portal/install/install_or_upgrade_compute_nodes.md b/docs/portal/developer-portal/install/install_or_upgrade_compute_nodes.md index 9b4c654..0b068b5 100644 --- a/docs/portal/developer-portal/install/install_or_upgrade_compute_nodes.md +++ b/docs/portal/developer-portal/install/install_or_upgrade_compute_nodes.md @@ -1,5 +1,5 @@ -### Install or upgrade compute nodes +# Install or upgrade compute nodes The installation method will depend on what type of NIC is installed on the system. 
Select one of the following procedures depending on the NIC in use: @@ -9,7 +9,7 @@ Select one of the following procedures depending on the NIC in use: NOTE: The upgrade process is nearly identical to the installation, and the following instructions will note where the two processes differ. -#### Prerequisites for Mellanox-based system installation +## Prerequisites for Mellanox-based system installation 1. Identify the target OS distribution and distribution version for all compute targets in the cluster. Use this information to select the appropriate Mellanox OFED (MOFED) tar file to be used for install from the URL listed in the [External Vendor Software](install_metal.md#external-vendor-software) table above. The filename typically follows this pattern: `MLNX_OFED_LINUX---.tgz`. @@ -30,7 +30,7 @@ NOTE: The upgrade process is nearly identical to the installation, and the proce NOTE: If the customer requires UCX on the system, then install the HPC-X solution using the recommended version provided by the [External Vendor Software](install_metal.md#external-vendor-software) table. Ensure that the HPC-X tarball matches the installed version of Mellanox OFED. In the HPC-X package, installation instructions are provided by Mellanox. -#### Install via package managers (recommended) +## Install via package managers (recommended) 1. For each distribution and distribution version as collected in the first step of the prerequisite install, download the RPMs mentioned in the previous section in the Slingshot RPMs table above. @@ -113,7 +113,7 @@ NOTE: The upgrade process is nearly identical to the installation, and the proce c. If the host is both a compute node and a user access node, perform steps 1 and 2; otherwise, skip this step. -#### Install via command line +## Install via command line 1. 
For each distribution and distribution version as collected in the first step of the prerequisite install, download the RPMs mentioned in the previous section (Installation | Required Material | Source | RPMs). diff --git a/docs/portal/developer-portal/install/install_or_upgrade_shs_on_hpcm_compute_nodes.md b/docs/portal/developer-portal/install/install_or_upgrade_shs_on_hpcm_compute_nodes.md index 2991dac..0ee08fe 100644 --- a/docs/portal/developer-portal/install/install_or_upgrade_shs_on_hpcm_compute_nodes.md +++ b/docs/portal/developer-portal/install/install_or_upgrade_shs_on_hpcm_compute_nodes.md @@ -1,11 +1,11 @@ -### Install or upgrade Slingshot Host Software (SHS) on HPCM compute nodes +# Install or upgrade Slingshot Host Software (SHS) on HPCM compute nodes This documentation provides step-by-step instructions to install and/or upgrade the Slingshot Host Software (SHS) on compute node images on an HPE Performance Cluster Manager (HPCM) system, using SLES15-SP4 as an example. The procedure outlined here is applicable to SLES, RHEL, and COS distributions. Refer to the System Software Requirements for Fabric Manager and Host Software section in the HPE Slingshot Release Notes for exact version support for the release. -#### Process +## Process The installation and upgrade method will depend on what type of NIC is installed on the system. Select one of the following procedures depending on the NIC in use: @@ -15,7 +15,7 @@ Select one of the following procedures depending on the NIC in use: NOTE: The upgrade process is nearly identical to installation, and the following instructions will note where the two processes differ. -##### Mellanox-based system install/upgrade procedure +### Mellanox-based system install/upgrade procedure This section is for systems using Mellanox NICs. 
For systems using HPE Slingshot 200Gbps NICs, skip this section and instead proceed to the [HPE Slingshot 200Gbps CXI NIC system install/upgrade procedure](#hpe-slingshot-200gbps-cxi-nic-system-installupgrade-procedure). @@ -205,7 +205,7 @@ For systems using HPE Slingshot 200Gbps NICs, skip this section and instead proc 13. Proceed directly to the [Firmware management](#firmware-management) and [ARP settings](#arp-settings) sections of this document to complete the SHS compute install. -##### HPE Slingshot 200Gbps CXI NIC system install/upgrade procedure +### HPE Slingshot 200Gbps CXI NIC system install/upgrade procedure This section is for systems using HPE Slingshot 200Gbps CXI NICs. For systems using Mellanox NICs, skip this section and proceed to the [Mellanox-based system install procedure](#mellanox-based-system-installupgrade-procedure), followed by the [Firmware management](#firmware-management) section. @@ -401,11 +401,11 @@ For systems using Mellanox NICs, skip this section and proceed to the [Mellanox- 10. Apply the post-boot firmware and firmware configuration. General instructions are in the "Install compute nodes" section of the _HPE Slingshot Installation Guide for Bare Metal_. -### Firmware management +# Firmware management System firmware management for Mellanox NICs is done through the `slingshot-firmware` utility. -### ARP settings +# ARP settings The following settings are suggested for larger clusters to reduce the frequency of ARP cache misses during connection establishment when using the libfabric `verbs` provider, as basic/standard ARP default parameters will not scale to support large systems. 
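As a rough illustration of how the neighbor-table thresholds relate to system size, the sketch below derives `gc_thresh` values from an endpoint count. The node/NIC counts and the 2x/4x headroom factors are assumptions for illustration, not HPE-recommended values; use the sysctl configuration example shipped with the release for the supported settings.

```shell
# Size net.ipv4.neigh.default.gc_thresh* from the expected number of HSN
# endpoints (nodes x NICs per node). All values here are illustrative only.
nodes=4096
nics_per_node=4
endpoints=$((nodes * nics_per_node))

gc_thresh1=$endpoints           # entries kept without any reclaim pressure
gc_thresh2=$((endpoints * 2))   # soft limit; reclaim starts above this
gc_thresh3=$((endpoints * 4))   # hard limit on neighbor-table entries

printf 'net.ipv4.neigh.default.gc_thresh1 = %d\n' "$gc_thresh1"
printf 'net.ipv4.neigh.default.gc_thresh2 = %d\n' "$gc_thresh2"
printf 'net.ipv4.neigh.default.gc_thresh3 = %d\n' "$gc_thresh3"
```

The printed lines can be dropped into a file under `/etc/sysctl.d/` once the values have been validated for the site.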
diff --git a/docs/portal/developer-portal/install/install_shs_on_csm.md b/docs/portal/developer-portal/install/install_shs_on_csm.md index e263955..02cb009 100644 --- a/docs/portal/developer-portal/install/install_shs_on_csm.md +++ b/docs/portal/developer-portal/install/install_shs_on_csm.md @@ -1,5 +1,5 @@ -## Install SHS on CSM release 1.4 or newer - Install and Upgrade Framework +# Install SHS on CSM release 1.4 or newer - Install and Upgrade Framework The Slingshot Host Software (SHS) distribution provides firmware, diagnostics, and the network software stack for hosts which communicate using the Slingshot network. @@ -28,21 +28,21 @@ IUF will perform the following tasks for a release of SHS: IUF uses a variety of CSM and SAT tools when performing these tasks. The [IUF section](https://cray-hpe.github.io/docs-csm/en-14/operations/iuf/iuf/) of the [Cray System Management Documentation](https://cray-hpe.github.io/docs-csm/) describes how to use these tools directly if it is desirable to use them instead of IUF. -### IUF Stage Details for SHS +## IUF Stage Details for SHS This section describes any SHS details that an administrator may need to be aware of before executing IUF stages. Entries are prefixed with **Information** if no administrative action is required or **Action** if an administrator may need to perform tasks outside of IUF. -#### update-cfs-config +### update-cfs-config **Action**: Before running this stage, make any site-local SHS configuration changes so the following stages run using the desired SHS configuration values. See [Operational activities](#operational-activities) for more information. -## Install SHS on CSM release 1.3 or prior +# Install SHS on CSM release 1.3 or prior The SHS distribution provides firmware, diagnostics, and the network software stack for hosts which communicate using the Slingshot network. For upgrades, the manual steps or the Compute Node Environment (CNE) installer tool can be used. 
See [SHS upgrade with CNE installer](#shs-upgrade-with-cne-installer) for more information on the `cne-install` method. -### Common requirements of SHS +## Common requirements of SHS - SUSE Linux Enterprise Operating System for HPE Cray EX product must be installed. - System Admin Toolkit (SAT) product must be installed and configured. @@ -53,17 +53,17 @@ For upgrades, the manual steps or the Compute Node Environment (CNE) installer t - SHS CFS plays should be one of the first plays run in the configuration. - SHS CFS installation must occur before any product with dependencies on the network stack installs software. -#### Requirements for new installations or upgrades of SHS +### Requirements for new installations or upgrades of SHS - All image and node targets must be clear of software with dependencies on the network stack prior to the execution of the SHS CFS play. -### SHS upgrade with CNE installer +## SHS upgrade with CNE installer The CNE installer (`cne-install`) tool can only be used to upgrade SHS in this release. `cne-install` performs all of the manual steps shown in the [Install product stream](#install-product-stream) and [Operational activities](#operational-activities) sections of the upgrade procedure. Refer to the "Compute Node Environment (CNE) Installer" section of the [HPE Cray EX System Software Getting Started Guide (S-8000)](https://www.hpe.com/support/ex-S-8000) for more information about the tool. -### Install product stream +## Install product stream 1. Start a typescript to capture the commands and output from this installation. @@ -162,11 +162,11 @@ SHS now supports installation via HPE Cray EX System Software CFS. To install the software via HPE Cray EX System Software CFS, proceed to the next section. Otherwise, proceed to the [Legacy Install Procedure for non-CFS based installs](#legacy-install-procedure-for-non-cfs-based-installs) section. 
-### Operational activities +## Operational activities SHS uses the HPE Cray EX System Software Configuration Framework Service to install, upgrade, and configure nodes or images. The following procedures will provide instructions on how to add the SHS CFS components to your CFS configurations. -#### SHS CFS variable reference +### SHS CFS variable reference The following Ansible variables are publicly exposed for use by customers or administrators with SHS CFS playbooks: @@ -195,11 +195,11 @@ The following Ansible variables are publicly exposed for use by customers or adm type: `string` description: sets the target platform to use when defining repository URIs. Available choices are one of [`cos-2.4`, `cos-2.5`, `cos-2.6`, `csm-1.3.0`, `csm-1.4.0`, `csm-1.5.0`] -#### Setup +### Setup Create an `integration-` branch using the imported branch from the SHS installation. The imported branch will be reported in the cray-product-catalog and may be found in the cray/slingshot-host-software-config-management repository. The imported branch may be used as a base branch. The imported branch from the installation should not be modified. It is recommended that a branch be created from the imported branch to customize the provided content as necessary. The following steps create an `integration-` branch to accomplish this. The user is expected to have a basic understanding of git workflows, including how to carry a `git rebase` through to completion. -#### Authentication credentials +### Authentication credentials Obtain the authentication credentials needed for the git repository. Git will prompt for them when required. @@ -213,7 +213,7 @@ ncn-m001# VCSUSERPW=$(kubectl get secret -n services vcs-user-credentials \ ncn-m001# printf 'VCSUSER=%s\nVCSUSERPW=%s\n' "${VCSUSER}" "${VCSUSERPW}" ``` -#### Find targets +### Find targets Obtain the `release` and `import_branch` from the `cray-product-catalog`, where `` is the full or partial release version. @@ -236,7 +236,7 @@ configuration: ... 
``` -#### Clone +### Clone Clone the slingshot-host-software-config-management repository and change to that working directory. Note that the `CLONE_URL` below is different than the `clone_url` and `ssh_url` reported in the previous step. @@ -248,7 +248,7 @@ ncn-m001# git clone ${CLONE_URL} ncn-m001# cd slingshot-host-software-config-management ``` -#### References +### References Examine the references in the local git working directory using the following command. Keep this information at hand. @@ -257,7 +257,7 @@ ncn-m001# git for-each-ref \ --sort=refname 'refs/remotes/origin/integration*' 'refs/remotes/origin/cray/slingshot-host-software/*' ``` -#### Target shell variables +### Target shell variables Set shell variables that correspond to the desired release, working integration branch, and the base import branch. @@ -279,7 +279,7 @@ ncn-m001# printf 'RELEASE=%s\nBRANCH=%s\nIMPORT_BRANCH_REF=%s\n' \ "${RELEASE}" "${BRANCH}" "${IMPORT_BRANCH_REF}" ``` -#### Workflow decisions +### Workflow decisions At this point, some workflow decisions need to be made. These decisions depend on repository findings and which goals are to be achieved. @@ -350,7 +350,7 @@ At this point, some workflow decisions need to be made. These decisions depend o ncn-m001# git push -f origin ``` -##### Apply customizations +#### Apply customizations Apply any customizations and modifications to the Ansible configuration, if required. These customizations should never be made to the base release branch. @@ -367,7 +367,7 @@ ncn-m001# git commit ncn-m001# git push origin ${BRANCH} ``` -##### Identify commit hash +#### Identify commit hash Identify the commit hash for this branch and store it for later use. This will be used when creating the CFS configuration layer. @@ -376,12 +376,12 @@ This will be used when creating the CFS configuration layer. 
ncn-m001# export SHS_CONFIG_COMMIT_HASH=$(git rev-parse --verify HEAD) ``` -##### Configuration data defined +#### Configuration data defined SHS configuration data is now defined in the appropriate integration branch of the slingshot-host-software-config-management repository in VCS. It will be used when performing the operations described in the next sections. -##### Recommendations +#### Recommendations SHS ships a single configuration for all releases, and this may result in default values that are not usable for the installed release. Defaults are based on the primary development platform at the time of release, so these values are subject to change over time. @@ -436,7 +436,7 @@ shs_target_platform: "csm-1.3.0" These variables can be defined in multiple ways according to customer or administrator requirements. If they are left undefined, they will be defined by CFS plays using defaults provided in `ansible/roles/setup/defaults/main.yml`, and set by `roles/setup/tasks.yml`. -#### Non-compute Node (NCN) personalization and image customization +### Non-compute Node (NCN) personalization and image customization NCN personalization and image customization are both methods used to configure NCNs. NCN personalization is the process of applying product-specific configuration to NCNs post-boot. @@ -449,7 +449,7 @@ Select one of the following procedures depending on the version of CSM in use: - **CSM 1.2 or earlier versions**: Proceed to the [NCN personalization](#ncn-personalization) procedure. - **CSM 1.3 or later versions**: Proceed to the [NCN image customization](#ncn-image-customization) procedure. -#### NCN personalization +### NCN personalization This section is only for systems using CSM 1.2 or earlier versions. For systems using CSM 1.3 or later versions, skip this section and instead proceed to the [NCN image customization](#ncn-image-customization) instructions. 
@@ -464,7 +464,7 @@ Installation and upgrade is aimed at discussing the process and procedure for in Migration is aimed at discussing how to replace the SHS networking software stack on an NCN with a different networking stack from SHS. Only migration from systems with Mellanox NICs to systems with HPE Slingshot 200Gbps NICs is supported at this time. -##### Install or upgrade with NCN personalization +#### Install or upgrade with NCN personalization The following steps describe how to use the NCN personalization CFS configuration in conjunction with HPE Cray EX CFS software to install, update, and configure SHS provided content on NCN workers. @@ -619,7 +619,7 @@ If other HPE Cray EX software products are being installed in conjunction with S If other HPE Cray EX software products are not being installed at this time, continue to the next section of this document to configure compute content. -##### Migration +#### Migration If a fresh install of the NCN worker has occurred and SHS has never been installed before on the target node, see the `Install/Upgrade` section above. If SHS has never been installed, then the node can be considered 'clean' and does not require uninstallation of the Slingshot software stack with Mellanox NICs. @@ -775,7 +775,7 @@ These steps are necessary to provide the networking drivers, management software If the modules are not listed for each worker node and the steps above have been completed, refer to `Perform NCN personalization` in the CSM documentation for NCN Personalization details. -#### NCN image customization +### NCN image customization This section is for systems using CSM 1.3 or later versions. For systems using CSM 1.2 or earlier versions, skip this section and proceed to the [NCN personalization](#ncn-personalization) procedure, followed by the [Compute Node Configuration](#compute-node-configuration) procedure. 
@@ -857,7 +857,7 @@ At this point, SHS configuration content has been updated in HPE Cray EX System If other HPE Cray EX software products are being installed in conjunction with SHS, refer to the Install and Upgrade Framework (IUF) section of the [Cray System Management (CSM) Documentation](https://cray-hpe.github.io/docs-csm/en-14/operations/iuf/iuf/) to determine what step to perform next. If other HPE Cray EX software products are not being installed at this time, continue to the next section of this document. -#### Compute node configuration +### Compute node configuration This section provides detailed instructions on how to modify Compute CFS configurations to support installation use cases on HPE Cray EX systems. Two separate approaches are provided: @@ -870,13 +870,13 @@ It is highly recommended that `sat bootprep` be used to perform these tasks. If If `sat bootprep` is available, then follow the instructions in the "SAT Bootprep" section below and do not follow the instructions in the "Legacy Compute Node CFS procedure" section. Otherwise, if `sat bootprep` is not available, then follow the instructions in the "Legacy Compute Node CFS procedure" section below and do not follow the instructions in the "SAT Bootprep" section. -##### SAT Bootprep +#### SAT Bootprep The "SAT Bootprep" section of the _HPE Cray EX System Admin Toolkit (SAT) Guide_ provides information on how to use `sat bootprep` to create CFS configurations, build images with IMS, and create BOS session templates. To include SHS software and configuration data in these operations, ensure that the `sat bootprep` input file includes content similar to that described in the following subsections. NOTE: The `sat bootprep` input file will contain content for additional HPE Cray EX software products and not only SHS. The following examples focus on SHS entries only. 
-##### SHS configuration content +#### SHS configuration content The `sat bootprep` input file should contain sections similar to the following to ensure SHS configuration data is used when configuring the compute image prior to boot and when personalizing compute nodes after boot. Replace `` with the version of SHS desired. The version of SHS installed resides in the CSM product catalog and can be displayed with the `sat showrev` command. @@ -905,7 +905,7 @@ configurations: NOTE: The `shs-integration-` layer should precede the COS layer in the `sat bootprep` input file. -##### Legacy compute node CFS procedure +#### Legacy compute node CFS procedure This step should not be executed until after COS install/upgrade has finished on the system. COS provides the instructions for creating a CFS configuration for compute nodes. The procedure in this section aims at updating the existing CFS configuration for compute nodes. @@ -979,7 +979,7 @@ The existing configuration will likely include other Cray EX product entries. Th At this point, SHS configuration content has been updated in HPE Cray EX System Software CFS. If other HPE Cray EX software products are being installed in conjunction with SHS, refer to the Install and Upgrade Framework (IUF) section of the [Cray System Management (CSM) Documentation](https://cray-hpe.github.io/docs-csm/en-14/operations/iuf/iuf/) to determine what step to perform next. If other HPE Cray EX software products are not being installed at this time, continue to the next section of this document. -#### Application node configuration +### Application node configuration Ensure that the `Setup` section preceding this section has been completed prior to running any steps in this section. @@ -1059,11 +1059,11 @@ The example steps below reference how to modify the user access node CFS configu At this point, SHS configuration content has been updated in HPE Cray EX System Software CFS. 
If other HPE Cray EX software products are being installed in conjunction with SHS, refer to the Install and Upgrade Framework (IUF) section of the [Cray System Management (CSM) Documentation](https://cray-hpe.github.io/docs-csm/en-14/operations/iuf/iuf/) to determine what step to perform next. If other HPE Cray EX software products are not being installed at this time, continue to the next section of this document. -#### Image building +### Image building SHS provides CFS plays for the management of provided content. The process for building images, and how to create/deploy/boot them, can be found in the COS, CSM, and UAN documentation. -### Post-install operational tasks +## Post-install operational tasks The firmware must be updated with each new install. The firmware can be updated using `slingshot-firmware` as provided by the `slingshot-firmware-management` package. @@ -1090,9 +1090,9 @@ The firmware must be updated with each new install. The firmware can be updated 3. Firmware updates do not take effect immediately. Firmware updates will only go into operation after the device has been power-cycled. Before putting the server back into operation, it must be rebooted or power-cycled according to the administration guide for the target server. Reference the COS documentation for Compute node maintenance procedures, and the CSM documentation for NCN and UAN maintenance procedures. -### Legacy install procedure for non-CFS based installs +## Legacy install procedure for non-CFS based installs -#### Updating compute and UAN image recipe +### Updating compute and UAN image recipe See sub-section "Upload and Register an Image Recipe" under the "Image Management" section of the CSM documentation for general steps on how to download, modify, upload, and register an image recipe. @@ -1248,7 +1248,7 @@ For systems equipped with Mellanox NICs, follow the instructions in 1. below. 
sed -i 's/^allow_unsupported_modules 0/allow_unsupported_modules 1/' /etc/modprobe.d/10-unsupported-modules.conf ``` -#### Notes +### Notes The following steps need to occur to build compute and UAN/UAI images prior to boot with the updated Slingshot components: diff --git a/docs/portal/developer-portal/install/post_install_tasks.md b/docs/portal/developer-portal/install/post_install_tasks.md index b01f858..df3006c 100644 --- a/docs/portal/developer-portal/install/post_install_tasks.md +++ b/docs/portal/developer-portal/install/post_install_tasks.md @@ -1,11 +1,11 @@ -### Post-install tasks +# Post-install tasks The `slingshot-network-config` RPM provides template configuration files to be used to create site-specific configuration files. The configuration templates are found in the `/opt/slingshot/slingshot-network-config/default/share` directory, while the binaries and scripts are found in the `/opt/slingshot/slingshot-network-config/default/bin` directory. When `slingshot-network-config` is installed, the RPM creates a link from the specific installed version of the RPM to a `default` link so that it is easy for customers to reference files between releases. -#### Firmware management +## Firmware management HPE Slingshot provides a tool, `slingshot-firmware`, for managing the firmware of a network interface. The utility must be run as `root` since this is a privileged operation. It is recommended that the version of the firmware match the recommended values It is highly recommended that the firmware for all managed devices on all nodes be updated with this utility after a new install or upgrade of this software distribution. -##### Usage +### Usage ```screen user@host:/ # slingshot-firmware --help @@ -53,7 +53,7 @@ The `slingshot-firmware` utility limits firmware management to devices specified `slingshot-firmware` provides functionality for two actions: `update` and `query`. 
-##### Query
+### Query

`query` is the action associated with device discovery and device attribute discovery. The `query` action allows a user to query specific device attributes from a device. The list of supported attributes is given as follows:
@@ -71,7 +71,7 @@ hsn1: version: 16.28.2006
```

-##### Update
+### Update

`update` is the action associated with device firmware updates and device firmware configuration. As demonstrated above with the `-D | --device` global option with `query`, the `update` action can be run on a specific device or on all managed devices. An example using the `update` action is provided below:
@@ -132,7 +132,7 @@ Configurations:                              Default         Current    Nex
The '*' shows parameters with next value different from default/current value.
```

-#### Generic Slingshot configuration
+## Generic Slingshot configuration

The `slingshot-network-config` RPM provides example configuration files, binaries, and scripts that are used to configure the network adapters for use on an HPE Slingshot fabric. The example scripts are provided with assumptions made regarding the names of the network adapters used in the system.
@@ -142,7 +142,7 @@ For example, if two network adapters on a host are connected to the HPE Slingsho

Several aspects of the host's system and kernel configuration should be modified for optimal performance. Some modifications are required if certain criteria are met.

-#### Host adapter naming
+## Host adapter naming

On a Linux host, `udev` is responsible for naming system devices according to defined system and site-specific policies. The scripts provided by the `slingshot-network-config` RPM assume that the network devices are named according to a specific convention, `hsn`. To implement this policy, the `slingshot-network-config` RPM provides a script and an example udev rule which can be used as-is or modified to fit a site-specific configuration.
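As a hedged illustration of the `hsn` naming convention described above (this is not the shipped rule — the file name, match keys, and driver value are all assumptions), such a udev rule might look like:

```screen
# /etc/udev/rules.d/99-hsn.rules (illustrative sketch only)
# Rename matching network devices to hsn<N>, where %n is the kernel device number.
ACTION=="add", SUBSYSTEM=="net", DRIVERS=="mlx5_core", NAME="hsn%n"
```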
@@ -187,7 +187,7 @@ To integrate these files into the image:

If the resulting initrd is used for booting the host over the network, such as with a PXE boot, then the initrd from the final step in the example should be used to boot the new image.

-#### Slingshot Algorithmic MAC Addressing (AMA) configuration
+## Slingshot Algorithmic MAC Addressing (AMA) configuration

Network adapters connected to an HPE Slingshot fabric must have an algorithmic MAC address (AMA) assigned to the device. The AMA assigned to the device is required for traffic to be routed within the HPE Slingshot fabric.
@@ -272,7 +272,7 @@ root@host ~# ip link set hsn up

The HSN device is now up.

-##### Check and Modify Interface Admin Status
+### Check and Modify Interface Admin Status

These steps check and, if necessary, change the administrative status of a network interface to "rxtx" using `lldptool`. This might be necessary in some cases to ensure proper network functionality.
@@ -294,7 +294,7 @@ These steps help in checking and changing the administrative status of a network

Replace `hsn` with your HSN interface identifier (for example, hsn0).

-#### Multiple network adapters
+## Multiple network adapters

If a host has multiple network adapters connected to the HPE Slingshot fabric, it is recommended that each host run the `/usr/bin/slingshot-ifroute` script. The script assumes that the network adapters follow the recommended prefix and attempts to configure the host with a routing policy required for a multi-homed network. Every network adapter in a multi-homed configuration should be able to communicate with every other network adapter in the multi-homed configuration without the use of a bridge.
@@ -312,7 +312,7 @@ As a result of the script, new routing tables and policies should be created in

The routing script should be run after all network adapters have been named by `systemd` or `udev`.
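The admin-status check described above can be sketched with `lldptool` as follows (the interface name `hsn0` is an assumption; substitute your own):

```screen
user@host:/ # lldptool -l -i hsn0 adminStatus      # query the current status
user@host:/ # lldptool -L -i hsn0 adminStatus=rxtx # set the status to rxtx if needed
```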
As an alternative, the routing script can also be run from the `POST_UP` section of the interface's `ifcfg` configuration file.

-#### HPE Slingshot configuration with Mellanox NICs
+## HPE Slingshot configuration with Mellanox NICs

HPE Slingshot provides libfabric to accelerate HPC applications over an HPE Slingshot network. The `libfabric` RPM provides the run-time libraries, while the `libfabric-devel` RPM provides the compile-time headers and libraries for compiling user applications.
@@ -347,7 +347,7 @@ root@host ~# ln -s \
/etc/security/limits.d/99-slingshot-network.conf
```

-##### Mellanox software configuration
+### Mellanox software configuration

Specific tunable parameters should be changed when operating HPC applications at scale with `libfabric`. To avoid connection establishment stalls on Mellanox hardware when running applications at large scale, it is recommended to increase the `recv_queue_size` parameter for the `ib_core` module to `8192`.
diff --git a/docs/portal/developer-portal/install/sysctl_configuration_example.md b/docs/portal/developer-portal/install/sysctl_configuration_example.md
index 21d1258..93aa53a 100644
--- a/docs/portal/developer-portal/install/sysctl_configuration_example.md
+++ b/docs/portal/developer-portal/install/sysctl_configuration_example.md
@@ -1,5 +1,5 @@
-### `sysctl` configuration example
+# `sysctl` configuration example

The `slingshot-network-config` RPM contains an example `sysctl` configuration file shown here: