DAOS-14442 control: Make NVMe auto-faulty configurable #13548

Merged 9 commits on Jan 24, 2024

55 changes: 54 additions & 1 deletion docs/admin/administration.md
@@ -478,6 +478,59 @@ boro-11
```
#### Exclusion and Hotplug

- Automatic exclusion of an NVMe SSD:

Automatic exclusion based on faulty criteria is the default behavior in DAOS
release 2.6. The default criteria parameters are `max_io_errs: 10` and
`max_csum_errs: <uint32_max>` (i.e. eviction due to checksum errors is
effectively disabled by default).
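
For illustration only, the shipped defaults expressed in the `bdev_auto_faulty` schema shown below would look roughly like this (the literal `4294967295` stands in for `UINT32_MAX` and simply spells out the effectively disabled checksum criterion):

```yaml
engines:
- bdev_auto_faulty:
    enable: true
    max_io_errs: 10
    max_csum_errs: 4294967295
```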

Auto-faulty criteria parameters can be set by adding the following YAML to the
engine section of the server config file.

```yaml
engines:
- bdev_auto_faulty:
    enable: true
    max_io_errs: 1
    max_csum_errs: 2
```

When storage is formatted for the engine, these settings produce the following
`daos_server` log entries, indicating that the parameters have been written to
the engine's NVMe config:

```bash
DEBUG 13:59:29.229795 provider.go:592: BdevWriteConfigRequest: &{ForwardableRequest:{Forwarded:false} ConfigOutputPath:/mnt/daos0/daos_nvme.conf OwnerUID:10695475 OwnerGID:10695475 TierProps:[{Class:nvme DeviceList:0000:5e:00.0 DeviceFileSize:0 Tier:1 DeviceRoles:{OptionBits:0}}] HotplugEnabled:false HotplugBusidBegin:0 HotplugBusidEnd:0 Hostname:wolf-310.wolf.hpdd.intel.com AccelProps:{Engine: Options:0} SpdkRpcSrvProps:{Enable:false SockAddr:} AutoFaultyProps:{Enable:true MaxIoErrs:1 MaxCsumErrs:2} VMDEnabled:false ScannedBdevs:}
Writing NVMe config file for engine instance 0 to "/mnt/daos0/daos_nvme.conf"
```

The engine's NVMe config (produced during format) then contains the following
JSON to apply the criteria:

```json
[tanabarr@wolf-310 ~]$ cat /mnt/daos0/daos_nvme.conf
{
  "daos_data": {
    "config": [
      {
        "params": {
          "enable": true,
          "max_io_errs": 1,
          "max_csum_errs": 2
        },
        "method": "auto_faulty"
...
```

These engine logfile entries indicate that the settings have been read and
applied:

```bash
01/12-13:59:41.36 wolf-310 DAOS[1299350/-1/0] bio INFO src/bio/bio_config.c:1016 bio_read_auto_faulty_criteria() NVMe auto faulty is enabled. Criteria: max_io_errs:1, max_csum_errs:2
```

- Manually exclude an NVMe SSD:
```bash
$ dmg storage set nvme-faulty --help
@@ -491,7 +544,7 @@ Usage:
-f, --force Do not require confirmation
```

To manually evict an NVMe SSD (auto eviction will be supported in a future release),
To manually evict an NVMe SSD (auto eviction is covered earlier in this section),
the device state needs to be set faulty by running the following command:
```bash
$ dmg -l boro-11 storage set nvme-faulty --uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19
```
8 changes: 4 additions & 4 deletions src/bio/README.md
@@ -81,24 +81,24 @@ While monitoring this health data, an admin can now make the determination to ma

<a id="7"></a>
## Faulty Device Detection (SSD Eviction)
Faulty device detection and reaction can be referred to as NVMe SSD eviction. This involves all affected pool targets being marked as down and the rebuild of all affected pool targets being automatically triggered. A persistent device state is maintained in SMD and the device state is updated from NORMAL to FAULTY upon SSD eviction. The faulty device reaction will involve various SPDK cleanup, including all I/O channels released, SPDK allocations (termed 'blobs') closed, and the SPDK blobstore created on the NVMe SSD unloaded. Currently only manual SSD eviction is supported, and a future release will support automatic SSD eviction.
Faulty device detection and reaction can be referred to as NVMe SSD eviction. This involves all affected pool targets being marked as down and the rebuild of all affected pool targets being automatically triggered. A persistent device state is maintained in SMD and the device state is updated from NORMAL to FAULTY upon SSD eviction. The faulty device reaction involves various SPDK cleanup, including all I/O channels released, SPDK allocations (termed 'blobs') closed, and the SPDK blobstore created on the NVMe SSD unloaded. Automatic SSD eviction is enabled by default and can be disabled using the `bdev_auto_faulty` server config file engine parameter.
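
As an illustration, a minimal server config sketch that turns automatic eviction off, following the `bdev_auto_faulty` schema documented in docs/admin/administration.md (values are illustrative, not a default config):

```yaml
engines:
- bdev_auto_faulty:
    enable: false
```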

Useful admin commands to manually evict an NVMe SSD:
- <a href="#82">dmg storage set nvme-faulty</a> [used to manually set an NVMe SSD to FAULTY (ie evict the device)]

<a id="8"></a>
## NVMe SSD Hot Plug

**Full NVMe hot plug capability will be available and supported in DAOS 2.0 release. Use is currently intended for testing only and is not supported for production.**
NVMe hot plug with Intel VMD devices is supported in this release.

**Full hot plug capability when using non-Intel-VMD devices is to be supported in DAOS 2.8 release. Use is currently intended for testing only and is not supported for production.**

The NVMe hot plug feature includes device removal (an NVMe hot remove event) and device reintegration (an NVMe hotplug event) when a faulty device is replaced with a new device.

For device removal, if the device is a faulty or previously evicted device, then nothing further would be done when the device is removed. The device state would be displayed as UNPLUGGED. If a healthy device that is currently in use by DAOS is removed, then all SPDK memory stubs would be deconstructed, and the device state would also display as UNPLUGGED.

For device reintegration, if a new device is plugged to replace a faulty device, the admin would need to issue a device replacement command. All SPDK in-memory stubs would be created and all affected pool targets automatically reintegrated on the new device. The device state would be displayed as NEW initially and NORMAL after the replacement event occurred. If a faulty device or previously evicted device is re-plugged, the device will remain evicted, and the device state would display EVICTED. If a faulty device is desired to be reused (NOTE: this is not advised, mainly used for testing purposes), the admin can run the same device replacement command setting the new and old device IDs to be the same device ID. Reintegration will not occur on the device, as DAOS does not currently support incremental reintegration.

NVMe hot plug with Intel VMD devices is currently not supported in this release, but will be supported in a future release.

Useful admin commands to replace an evicted device:
- <a href="#83">dmg storage replace nvme</a> [used to replace an evicted device with a new device]
- <a href="#84">dmg storage replace nvme</a> [used to bring an evicted device back online (without reintegration)]
Expand Down
129 changes: 103 additions & 26 deletions src/bio/bio_config.c
@@ -1,5 +1,5 @@
/**
* (C) Copyright 2021-2023 Intel Corporation.
* (C) Copyright 2021-2024 Intel Corporation.
*
* SPDX-License-Identifier: BSD-2-Clause-Patent
*/
@@ -117,9 +117,6 @@ struct busid_range_info {
uint8_t end;
};

/* PCI address bus-ID range to be used to filter hotplug events */
struct busid_range_info hotplug_busid_range = {};

static struct spdk_json_object_decoder
busid_range_decoders[] = {
{"begin", offsetof(struct busid_range_info, begin), spdk_json_decode_uint8},
@@ -131,9 +128,6 @@ struct accel_props_info {
uint16_t opt_mask;
};

/* Acceleration properties to specify engine to use and optional capabilities to enable */
struct accel_props_info accel_props = {};

static struct spdk_json_object_decoder
accel_props_decoders[] = {
{"accel_engine", offsetof(struct accel_props_info, engine), spdk_json_decode_string},
@@ -145,15 +139,24 @@ struct rpc_srv_info {
char *sock_addr;
};

/* Settings to enable an SPDK JSON-RPC server to run in current process */
struct rpc_srv_info rpc_srv_settings = {};

static struct spdk_json_object_decoder
rpc_srv_decoders[] = {
{"enable", offsetof(struct rpc_srv_info, enable), spdk_json_decode_bool},
{"sock_addr", offsetof(struct rpc_srv_info, sock_addr), spdk_json_decode_string},
};

struct auto_faulty_info {
bool enable;
uint32_t max_io_errs;
uint32_t max_csum_errs;
};

static struct spdk_json_object_decoder auto_faulty_decoders[] = {
{"enable", offsetof(struct auto_faulty_info, enable), spdk_json_decode_bool},
{"max_io_errs", offsetof(struct auto_faulty_info, max_io_errs), spdk_json_decode_uint32},
{"max_csum_errs", offsetof(struct auto_faulty_info, max_csum_errs), spdk_json_decode_uint32},
};

static int
is_addr_in_allowlist(char *pci_addr, const struct spdk_pci_addr *allowlist,
int num_allowlist_devices)
@@ -419,7 +422,6 @@ add_traddrs_from_bdev_subsys(struct json_config_ctx *ctx, bool vmd_enabled,
}

if (strcmp(cfg.method, NVME_CONF_ATTACH_CONTROLLER) != 0) {
D_DEBUG(DB_MGMT, "skip config entry %s\n", cfg.method);
goto free_method;
}

@@ -479,6 +481,7 @@ add_traddrs_from_bdev_subsys(struct json_config_ctx *ctx, bool vmd_enabled,
free_method:
D_FREE(cfg.method);

/* Decode functions return positive RC for success or not-found */
if (rc > 0)
rc = 0;
return rc;
@@ -506,7 +509,6 @@ check_name_from_bdev_subsys(struct json_config_ctx *ctx)

if (strcmp(cfg.method, NVME_CONF_ATTACH_CONTROLLER) != 0 &&
strcmp(cfg.method, NVME_CONF_AIO_CREATE) != 0) {
D_DEBUG(DB_MGMT, "skip config entry %s\n", cfg.method);
goto free_method;
}

@@ -750,7 +752,7 @@ decode_daos_data(const char *nvme_conf, const char *method_name, struct config_e
if (rc != 0)
D_GOTO(out, rc);

/* Capture daos object */
/* Capture daos_data JSON object */
rc = spdk_json_find(ctx->values, "daos_data", NULL, &daos_data,
SPDK_JSON_VAL_OBJECT_BEGIN);
if (rc < 0) {
@@ -769,8 +771,8 @@ decode_daos_data(const char *nvme_conf, const char *method_name, struct config_e
/* Get 'config' array first configuration entry */
ctx->config_it = spdk_json_array_first(ctx->config);
if (ctx->config_it == NULL) {
D_DEBUG(DB_MGMT, "Empty 'daos_data' section\n");
D_GOTO(out, rc = 1); /* non-fatal */
/* Entry not-found so return positive RC */
D_GOTO(out, rc = 1);
}

while (ctx->config_it != NULL) {
@@ -789,14 +791,16 @@ }
}

if (ctx->config_it == NULL) {
D_DEBUG(DB_MGMT, "No '%s' entry\n", method_name);
rc = 1; /* non-fatal */
/* Entry not-found so return positive RC */
rc = 1;
}
out:
free_json_config_ctx(ctx);
return rc;
}

struct busid_range_info hotplug_busid_range = {};

static int
get_hotplug_busid_range(const char *nvme_conf)
{
@@ -816,11 +820,12 @@ get_hotplug_busid_range(const char *nvme_conf)
D_GOTO(out, rc = -DER_INVAL);
}

D_DEBUG(DB_MGMT, "'%s' read from config: %X-%X\n", NVME_CONF_SET_HOTPLUG_RANGE,
hotplug_busid_range.begin, hotplug_busid_range.end);
D_INFO("'%s' read from config: %X-%X\n", NVME_CONF_SET_HOTPLUG_RANGE,
hotplug_busid_range.begin, hotplug_busid_range.end);
out:
if (cfg.method != NULL)
D_FREE(cfg.method);
/* Decode functions return positive RC for success or not-found */
if (rc > 0)
rc = 0;
return 0;
@@ -846,6 +851,7 @@ hotplug_filter_fn(const struct spdk_pci_addr *addr)

/**
* Set hotplug bus-ID ranges in SPDK filter based on values read from JSON config file.
* The PCI bus-ID ranges will be used to filter hotplug events.
*
* \param[in] nvme_conf JSON config file path
*
@@ -856,6 +862,8 @@ bio_set_hotplug_filter(const char *nvme_conf)
{
int rc;

D_ASSERT(nvme_conf != NULL);

rc = get_hotplug_busid_range(nvme_conf);
if (rc != 0)
return rc;
@@ -866,7 +874,8 @@ }
}

/**
* Read optional acceleration properties from JSON config file.
* Read acceleration properties from JSON config file to specify which acceleration engine to use
* and selections of optional capabilities to enable.
*
* \param[in] nvme_conf JSON config file path
*
@@ -876,8 +885,11 @@ int
bio_read_accel_props(const char *nvme_conf)
{
struct config_entry cfg = {};
struct accel_props_info accel_props = {};
int rc;

D_ASSERT(nvme_conf != NULL);

rc = decode_daos_data(nvme_conf, NVME_CONF_SET_ACCEL_PROPS, &cfg);
if (rc != 0)
goto out;
@@ -891,7 +903,7 @@ bio_read_accel_props(const char *nvme_conf)
D_GOTO(out, rc = -DER_INVAL);
}

D_DEBUG(DB_MGMT, "'%s' read from config, setting: %s, capabilities: move=%s,crc=%s\n",
D_INFO("'%s' read from config, setting: %s, capabilities: move=%s,crc=%s\n",
NVME_CONF_SET_ACCEL_PROPS, accel_props.engine,
CHK_FLAG(accel_props.opt_mask, NVME_ACCEL_FLAG_MOVE) ? "true" : "false",
CHK_FLAG(accel_props.opt_mask, NVME_ACCEL_FLAG_CRC) ? "true" : "false");
@@ -900,13 +912,16 @@ bio_read_accel_props(const char *nvme_conf)
out:
if (cfg.method != NULL)
D_FREE(cfg.method);
/* Decode functions return positive RC for success or not-found */
if (rc > 0)
rc = 0;
return rc;
}

/**
* Set output parameters based on JSON config settings for option SPDK JSON-RPC server.
* Retrieve JSON config settings for the optional SPDK JSON-RPC server. Read flag to indicate whether to
* enable the SPDK JSON-RPC server and the socket file address from the JSON config used to
* initialize SPDK subsystems.
*
* \param[in] nvme_conf JSON config file path
* \param[out] enable Flag to enable the RPC server
@@ -918,14 +933,19 @@ int
bio_read_rpc_srv_settings(const char *nvme_conf, bool *enable, const char **sock_addr)
{
struct config_entry cfg = {};
struct rpc_srv_info rpc_srv_settings = {};
int rc;

D_ASSERT(nvme_conf != NULL);
D_ASSERT(enable != NULL);
D_ASSERT(sock_addr != NULL);
D_ASSERT(*sock_addr == NULL);

rc = decode_daos_data(nvme_conf, NVME_CONF_SET_SPDK_RPC_SERVER, &cfg);
if (rc != 0)
goto out;

rc = spdk_json_decode_object(cfg.params, rpc_srv_decoders,
SPDK_COUNTOF(rpc_srv_decoders),
rc = spdk_json_decode_object(cfg.params, rpc_srv_decoders, SPDK_COUNTOF(rpc_srv_decoders),
&rpc_srv_settings);
if (rc < 0) {
D_ERROR("Failed to decode '%s' entry: %s)\n", NVME_CONF_SET_SPDK_RPC_SERVER,
@@ -936,11 +956,68 @@ bio_read_rpc_srv_settings(const char *nvme_conf, bool *enable, const char **sock
*enable = rpc_srv_settings.enable;
*sock_addr = rpc_srv_settings.sock_addr;

D_DEBUG(DB_MGMT, "'%s' read from config: enabled=%d, addr %s\n",
NVME_CONF_SET_SPDK_RPC_SERVER, *enable, (char *)*sock_addr);
D_INFO("'%s' read from config: enabled=%d, addr %s\n", NVME_CONF_SET_SPDK_RPC_SERVER,
*enable, (char *)*sock_addr);
out:
if (cfg.method != NULL)
D_FREE(cfg.method);
/* Decode functions return positive RC for success or not-found */
if (rc > 0)
rc = 0;
return rc;
}

/**
* Set output parameters based on JSON config settings for NVMe auto-faulty feature and threshold
* criteria.
*
* \param[in] nvme_conf JSON config file path
* \param[out] enable Flag to enable the auto-faulty feature
* \param[out] max_io_errs Max IO errors (threshold) before marking as faulty
* \param[out] max_csum_errs Max checksum errors (threshold) before marking as faulty
*
* \returns Zero on success, negative on failure (DER)
*/
int
bio_read_auto_faulty_criteria(const char *nvme_conf, bool *enable, uint32_t *max_io_errs,
uint32_t *max_csum_errs)
{
struct config_entry cfg = {};
struct auto_faulty_info auto_faulty_criteria = {};
int rc;

rc = decode_daos_data(nvme_conf, NVME_CONF_SET_AUTO_FAULTY, &cfg);
if (rc != 0)
goto out;

rc = spdk_json_decode_object(cfg.params, auto_faulty_decoders,
SPDK_COUNTOF(auto_faulty_decoders), &auto_faulty_criteria);
if (rc < 0) {
D_ERROR("Failed to decode '%s' entry: %s)\n", NVME_CONF_SET_AUTO_FAULTY,
spdk_strerror(-rc));
D_GOTO(out, rc = -DER_INVAL);
}

*enable = auto_faulty_criteria.enable;
if (*enable == false) {
*max_io_errs = UINT32_MAX;
*max_csum_errs = UINT32_MAX;
goto out;
}
*max_io_errs = auto_faulty_criteria.max_io_errs;
if (*max_io_errs == 0)
*max_io_errs = UINT32_MAX;
*max_csum_errs = auto_faulty_criteria.max_csum_errs;
if (*max_csum_errs == 0)
*max_csum_errs = UINT32_MAX;

out:
D_INFO("NVMe auto faulty is %s. Criteria: max_io_errs:%u, max_csum_errs:%u\n",
*enable ? "enabled" : "disabled", *max_io_errs, *max_csum_errs);

if (cfg.method != NULL)
Contributor: Nit, not needed to test as D_FREE will do it.

Contributor Author: will fix if repushed

D_FREE(cfg.method);
/* Decode functions return positive RC for success or not-found */
if (rc > 0)
rc = 0;
return rc;