Documentation updates #8849

Open · wants to merge 14 commits into base: release/2.2
57 changes: 29 additions & 28 deletions docs/overview/architecture.md
@@ -5,10 +5,10 @@ high bandwidth and high IOPS storage containers to applications and enables
next-generation data-centric workflows combining simulation, data analytics,
and machine learning.

Unlike the traditional storage stacks that are primarily designed for
rotating media, DAOS is architected from the ground up to exploit new
NVM technologies and is extremely lightweight since it operates
End-to-End (E2E) in user space with complete OS bypass. DAOS offers a shift
away from an I/O model designed for block-based and high-latency storage
to one that inherently supports fine-grained data access and unlocks the
performance of the next-generation storage technologies.
@@ -30,10 +30,10 @@ directories over the native DAOS API is also available.

DAOS I/O operations are logged and then inserted into a persistent index
maintained in SCM. Each I/O is tagged with a particular timestamp called
epoch and associated with a specific dataset version. No
read-modify-write operations are performed internally. Write operations
are non-destructive and not sensitive to alignment. Upon read request,
the DAOS service walks through the persistent index. It creates a
complex scatter-gather Remote Direct Memory Access (RDMA) descriptor to
reconstruct the data at the requested version directly in the buffer
provided by the application.
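The epoch-tagged, non-destructive write path described above can be sketched as a toy model. This is illustrative Python only; `VersionedIndex` and its methods are hypothetical names, not the DAOS API, and the real index lives in SCM rather than a Python list:

```python
from bisect import insort

class VersionedIndex:
    """Toy model of an epoch-tagged, append-only write log (illustrative only)."""

    def __init__(self):
        # Writes are logged as (epoch, offset, data) records; never overwritten.
        self.records = []

    def write(self, epoch, offset, data):
        # No read-modify-write: the record is inserted as-is, any alignment allowed.
        insort(self.records, (epoch, offset, data))

    def read(self, offset, length, epoch):
        # Walk the index and reconstruct the requested version, taking, for each
        # byte, the newest record at or before `epoch`.
        out = [None] * length
        for rec_epoch, rec_off, rec_data in self.records:  # sorted by epoch
            if rec_epoch > epoch:
                break
            for i, byte in enumerate(rec_data):
                pos = rec_off + i - offset
                if 0 <= pos < length:
                    out[pos] = byte
        return bytes(b for b in out if b is not None)

idx = VersionedIndex()
idx.write(1, 0, b"hello")
idx.write(2, 0, b"HE")          # a later epoch overwrites the first two bytes
print(idx.read(0, 5, epoch=1))  # b'hello'
print(idx.read(0, 5, epoch=2))  # b'HEllo'
```

In the real service, the per-byte walk is replaced by building a scatter-gather RDMA descriptor so the version is reconstructed directly in the application's buffer.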
@@ -43,18 +43,18 @@ DAOS service that manages the persistent index via direct load/store.
Depending on the I/O characteristics, the DAOS service can decide to
store the I/O in either SCM or NVMe storage. As represented in Figure
2-1, latency-sensitive I/Os, like application metadata and byte-granular
data, will typically be stored in the former, whereas checkpoints and
bulk data will be stored in the latter. This approach allows DAOS to
deliver the raw NVMe bandwidth for bulk data by streaming the data to
NVMe storage and maintaining internal metadata index in SCM. The
Persistent Memory Development Kit (PMDK) allows managing
transactional access to SCM, and the Storage Performance Development Kit
(SPDK) enables user-space I/O to NVMe devices.
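A placement policy of this kind can be sketched as follows. The 4 KiB threshold and the function name are hypothetical values chosen for illustration, not the actual DAOS heuristic:

```python
SCM_THRESHOLD = 4096  # bytes; hypothetical cutoff for this sketch

def place_io(size, is_metadata):
    """Route latency-sensitive I/O to SCM and bulk data to NVMe (illustrative policy)."""
    if is_metadata or size < SCM_THRESHOLD:
        return "SCM"   # byte-granular data and metadata stay in SCM
    return "NVMe"      # bulk data streams to NVMe at raw bandwidth

print(place_io(64, is_metadata=True))        # SCM
print(place_io(1 << 20, is_metadata=False))  # NVMe
```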

![](../admin/media/image1.png)
Figure 2-1. DAOS Storage

DAOS aims to deliver:

- High throughput and IOPS at arbitrary alignment and size

@@ -96,34 +96,35 @@ DAOS aims at delivering:
## DAOS System

A data center may have hundreds of thousands of compute instances
interconnected via a scalable, high-performance network, where all, or
Collaborator: (style) trailing whitespace

Author: fixed
a subset of the instances called storage nodes, have direct access to NVM
storage. A DAOS installation involves several components that can be
either collocated or distributed.

A DAOS *system* is identified by a system name and consists of a set of
DAOS *storage nodes* connected to the same network. The DAOS storage nodes
run one DAOS *server* instance per node, which in turn starts one
DAOS *Engine* process per physical socket. Membership of the DAOS
servers is recorded into the system map, which assigns a unique integer
*rank* to each *Engine* process. Two different DAOS systems comprise
two disjoint sets of DAOS servers and do not coordinate with each other.

The DAOS *server* is a multi-tenant daemon running on a Linux instance
(either natively on the physical node or in a VM or container) of each
*storage node*. Its *Engine* sub-processes export the locally-attached
SCM and NVM storage through the network. It listens to a management port
(addressed by an IP address and a TCP port number) plus one or more fabric
endpoints (addressed by network URIs).

The DAOS server is configured through a YAML file in /etc/daos,
including the configuration of its Engine sub-processes.
The DAOS server startup can be integrated with different daemon management or
orchestration frameworks (for example, a systemd script, a Kubernetes service,
or even via a parallel launcher like pdsh or srun).

Inside a DAOS Engine, the storage is statically partitioned across
multiple *targets* to optimize concurrency. To avoid contention, each
target has its private storage, its pool of service threads, and its
dedicated network context that can be directly addressed over the fabric
independently of the other targets hosted on the same storage node.

@@ -133,24 +134,24 @@ independently of the other targets hosted on the same storage node.

!!! note
When mounting the PMem devices with the `dax` option,
the following warning is logged in dmesg:
`EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk`
This warning can be safely ignored; it is issued because
DAX does not yet support the `reflink` filesystem feature,
but DAOS does not use this feature.

* When *N* targets per engine are configured,
each target uses *1/N* of the `fsdax` SCM capacity
of that socket, independently of the other targets.

* Each target also uses a fraction of the NVMe capacity of the NVMe
drives attached to this socket. For example, in an engine
with 4 NVMe disks and 16 targets, each target will manage 1/4 of
a single NVMe disk.
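The arithmetic in these bullets can be expressed directly. This is a sketch; the function name and the 512 GiB SCM figure are assumed example values:

```python
from fractions import Fraction

def per_target_shares(targets, nvme_drives, scm_capacity_gib):
    """Each target gets 1/N of the socket's fsdax SCM and a matching slice of NVMe."""
    scm_share_gib = scm_capacity_gib / targets
    # With more targets than drives, several targets share one drive:
    nvme_fraction_of_one_drive = Fraction(nvme_drives, targets)
    return scm_share_gib, nvme_fraction_of_one_drive

scm, nvme = per_target_shares(targets=16, nvme_drives=4, scm_capacity_gib=512)
print(scm)   # 32.0  (GiB of SCM per target)
print(nvme)  # 1/4   (each target manages 1/4 of a single NVMe disk)
```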

A target does not implement any internal data protection mechanism
against storage media failure. As a result, a target is a single point
of failure and the unit of fault.
A dynamic state is associated with each target: its state can be either
"up and running" or "down and not available".

@@ -163,7 +164,7 @@ configurable, and depends on the underlying hardware (in particular,
the number of SCM modules and the number of NVMe SSDs that are served
by this engine instance). As a best practice, the number of targets
of an engine should be an integer multiple of the number of NVMe drives
that this engine serves.

## SDK and Tools

@@ -173,20 +174,20 @@ to administer a DAOS system and is intended for integration with
vendor-specific storage management and open-source
orchestration frameworks. The `dmg` CLI tool is built over the DAOS management
API. On the other hand, the DAOS library (`libdaos`) implements the
DAOS storage model. It primarily targets application and I/O
middleware developers who want to store datasets in a DAOS system. User
utilities like the `daos` command are also built over the API to allow
users to manage datasets from a CLI.

Applications can access datasets stored in DAOS either directly through
the native DAOS API, through an I/O middleware library (e.g., POSIX
emulation, MPI-IO, HDF5), or through frameworks like Spark or TensorFlow
that have already been integrated with the native DAOS storage model.

## Agent

The DAOS agent is a daemon residing on the client nodes that interacts
with the DAOS library to authenticate the application processes. It is a
trusted entity that can sign the DAOS library credentials using
certificates. The agent can support different authentication frameworks
and uses a Unix Domain Socket to communicate with the DAOS library.
66 changes: 37 additions & 29 deletions docs/overview/data_integrity.md
@@ -10,33 +10,34 @@ attempt to recover the corrupted data using data redundancy mechanisms
## End-to-end Data Integrity

In simple terms, end-to-end means that the DAOS Client library will calculate a
checksum for data sent to the DAOS Server. The DAOS Server will
store the checksum and return it upon data retrieval. Then the client verifies
the data by calculating a new checksum and comparing it to the checksum received
from the server. There are variations on this approach depending on the type of
data being protected, but the following diagram shows the basic checksum flow.
![Basic Checksum Flow](../graph/data_integrity/basic_checksum_flow.png)
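The basic flow can be sketched end to end. This is illustrative Python using zlib's CRC32 as a stand-in for the configured checksum type; the function names are hypothetical, not the libdaos API:

```python
import zlib

def client_update(data):
    # The client computes a checksum before sending the data to the server.
    return data, zlib.crc32(data)

def server_store(payload):
    data, csum = payload
    # The server keeps the checksum alongside the data and returns both on fetch.
    return {"data": data, "csum": csum}

def client_fetch_verify(stored):
    # On fetch, the client recomputes the checksum and compares it with the
    # one returned by the server.
    if zlib.crc32(stored["data"]) != stored["csum"]:
        raise IOError("checksum mismatch: data corrupted in flight or at rest")
    return stored["data"]

stored = server_store(client_update(b"daos"))
print(client_fetch_verify(stored))  # b'daos'
```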

## Configuring

Data integrity is configured for each container.
See [Storage Model](./storage.md) for more information about how data is
organized in DAOS. See the Data Integrity section in
the [Container User Guide](../user/container.md#data-integrity) for details on
setting up a container with data integrity.

## Keys and Value Objects

Because DAOS is a key/value store, the data for both keys and values is
protected; however, the approach is slightly different. For the two different
value types, single and array, the approach is also slightly different.

### Keys

On an update and fetch, the client calculates a checksum for the data used
as the distribution and attribute keys and will send it to the server within the
RPC. The server verifies the keys with the checksum.
While enumerating keys, the server will calculate checksums for the keys and
pack them within the RPC message to the client. The client will verify the keys
received.

!!! note
@@ -47,13 +48,14 @@
has reliable data integrity protection.

### Values

On an update, the client will calculate a checksum for the data of the value and
will send it to the server within the RPC. If "server verify" is enabled, the
server will calculate a new checksum for the value and compare it with the checksum
received from the client to verify the integrity of the value. If the checksums
don't match, then data corruption has occurred, and an error is returned to the
client indicating that the client should try the update again. Whether or not
"server verify" is enabled, the server will store the checksum.
See [VOS](https://github.com/daos-stack/daos/blob/release/2.2/src/vos/README.md)
for more info about checksum management and storage in VOS.
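A minimal sketch of this update path follows. The names are hypothetical and zlib's CRC32 stands in for the configured checksum type:

```python
import zlib

def server_update(data, client_csum, server_verify=True):
    """Sketch of the server side of an update with optional "server verify"."""
    if server_verify and zlib.crc32(data) != client_csum:
        # Corruption on the wire: ask the client to retry the update.
        return "retry"
    # The checksum is stored either way, for later fetch-side verification.
    return {"data": data, "csum": client_csum}

ok = server_update(b"value", zlib.crc32(b"value"))
bad = server_update(b"valXe", zlib.crc32(b"value"))
print(ok["csum"] == zlib.crc32(b"value"))  # True
print(bad)                                 # retry
```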

@@ -62,40 +64,40 @@ values fetched so the client can verify the values received. If the checksums
don't match, then the client will fetch from another replica if available in
an attempt to get uncorrupted data.

There are slight variations to this approach for the two different types
of values. The following diagram illustrates a basic example.
(See [Storage Model](storage.md) for more details about the single value
and array value types)

![Basic Checksum Flow](../graph/data_integrity/basic_checksum_flow.png)

#### Single Value

A Single Value is an atomic value, meaning that writes to a single value will
update the entire value and reads retrieve the entire value. Other DAOS features
such as Erasure Codes might split a Single Value into multiple shards to be
distributed among multiple storage nodes. Either the whole Single Value (if
going to a single node) or each shard (if distributed) will have a checksum
calculated, sent to the server, and stored on the server.

Note that it is possible for a single value, or shard of a single value, to
be smaller than the checksum derived from it. Therefore, if an
application needs many small single values, it is advised to use an Array Type instead.

#### Array Values

Unlike Single Values, Array Values can be updated and fetched at any part of
an array. In addition, updates to an array are versioned so that a fetch can include
parts from multiple versions of the array. Each of these versioned parts of an
array is called an extent. The following diagrams illustrate a couple of examples
(also see [VOS Key Array Stores](https://github.com/daos-stack/daos/blob/release/2.2/src/vos/README.md#key-array-stores) for
more information):


A single extent update (blue line) from index 2-13. A fetched extent (orange
line) from index 2-6. The fetch is only part of the original extent written.

![](../graph/data_integrity/array_example_1.png)


Many extent updates and different epochs. A fetch from index 2-13 requires parts
from each extent.
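The multi-epoch fetch in this second example can be modeled as a simple overlay, where the newest extent covering each index wins. This is a toy sketch with hypothetical names, not the VOS implementation:

```python
def fetch(extents, start, end):
    """Compose a fetch from the newest extent covering each index (toy model).

    `extents` is a list of (epoch, lo, hi, fill) tuples; ranges are inclusive.
    """
    out = {}
    for epoch, lo, hi, fill in sorted(extents):  # ascending epoch: newer wins
        for i in range(max(lo, start), min(hi, end) + 1):
            out[i] = fill
    return [out.get(i) for i in range(start, end + 1)]

# Three extents written at epochs 1-3; a fetch of 2-13 takes parts from each:
extents = [(1, 2, 13, "A"), (2, 6, 9, "B"), (3, 12, 15, "C")]
print("".join(fetch(extents, 2, 13)))  # AAAABBBBAACC
```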

@@ -106,9 +108,9 @@ The nature of the array type requires that a more sophisticated approach to
creating checksums is used. DAOS uses a "chunking" approach where each extent
will be broken up into "chunks" with a predetermined "chunk size." Checksums
will be derived from these chunks. Chunks are aligned with an absolute offset
(starting at 0), not an I/O offset. For example, the following diagram illustrates a chunk
size configured to be 4 (units are arbitrary in this example). Though not all
chunks have the full size of 4, an absolute offset alignment is maintained.
The gray boxes around the extents represent the chunks.

![](../graph/data_integrity/array_with_chunks.png)
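Under the stated assumptions (chunk size 4, inclusive index ranges), the chunk alignment can be sketched as follows; the function name is hypothetical:

```python
def chunk_ranges(lo, hi, chunk_size=4):
    """Split an extent [lo, hi] (inclusive) into chunks aligned to absolute offset 0."""
    ranges = []
    i = lo
    while i <= hi:
        # The chunk ends at the next multiple of chunk_size minus one,
        # regardless of where the extent itself starts.
        boundary = (i // chunk_size + 1) * chunk_size - 1
        ranges.append((i, min(boundary, hi)))
        i = boundary + 1
    return ranges

# An extent from index 2 to 13: the edge chunks are partial, but every
# boundary stays aligned to the absolute offsets 0, 4, 8, 12, ...
print(chunk_ranges(2, 13))  # [(2, 3), (4, 7), (8, 11), (12, 13)]
```

Each of these ranges would get its own checksum, so a later overlapping update only invalidates the chunks it actually touches.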
@@ -118,21 +120,24 @@ See [Object Layer](https://github.com/daos-stack/daos/blob/release/2.2/src/objec
for more details about the checksum process on object update and fetch)

## Checksum calculations

The actual checksum calculations are done by the
[isa-l](https://github.com/intel/isa-l)
and [isa-l_crypto](https://github.com/intel/isa-l_crypto) libraries. However,
these libraries are abstracted away from much of DAOS, and a common checksum
library is used with appropriate adapters to the actual isa-l implementations.
[common checksum library](https://github.com/daos-stack/daos/blob/release/2.2/src/common/README.md#checksum)

## Performance Impact

Calculating checksums can be CPU-intensive and will impact performance. To
mitigate this impact, checksum types with hardware acceleration should
be chosen. For example, CRC32C is supported by recent Intel CPUs, and many
checksum types are accelerated via SIMD.

## Quality

Unit and functional testing is performed at many layers.

| Test executable | What's tested | Key test files |
| --- | --- | --- |
@@ -142,15 +147,18 @@
| daos_test | daos_obj_update/fetch with checksums enabled. The -z flag can be used for specific checksum tests. Also --csum_type flag can be used to enable checksums with any of the other daos_tests | src/tests/suite/daos_checksum.c |

### Running Tests
#### With daos_server not running

```bash
./common_test
./vos_test -z
./srv_checksum_tests
```

#### With daos_server running

```bash
export DAOS_CSUM_TEST_ALL_TYPE=1
./daos_server -z
./daos_server -i --csum_type crc64