Container Service

The container service maintains container metadata, including:

  • Epoch state. This mainly consists of the container HCE and the maximum aggregated epoch (see Aggregation).
  • Container handles. Each container handle is identified by a client-generated handle UUID. The metadata associated with a container handle include its capabilities (e.g., read-only or read-write) and its per-handle epoch state, which consists of its LRE, HCE, and LHE.
  • Snapshots. These are simply the epochs that have been snapshotted.
  • Upper-layer container metadata. Attributes used by the DAOS-SR layer or even higher layers.

Note that DSM treats the object index table (DAOS Container and Scalable Object Index Table) as data, rather than metadata, of DSM containers.
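
To make the metadata list above concrete, here is a minimal C sketch of how the per-container and per-handle epoch state could be laid out; the structure and field names are illustrative assumptions, not the actual DSM definitions.

```c
#include <stdint.h>
#include <uuid/uuid.h>

/* Illustrative sketch only -- not the actual DSM data structures. */

/* Per-handle epoch state kept by the container service. */
struct cont_handle_md {
	uuid_t		ch_uuid;	/* client-generated handle UUID */
	unsigned int	ch_capas;	/* e.g. read-only or read-write */
	uint64_t	ch_lre;		/* lowest referenced epoch */
	uint64_t	ch_hce;		/* highest committed epoch */
	uint64_t	ch_lhe;		/* lowest held epoch */
};

/* Container metadata maintained by the container service. */
struct cont_md {
	uuid_t			 cm_uuid;	   /* container UUID */
	uint64_t		 cm_hce;	   /* container HCE */
	uint64_t		 cm_max_agg_epoch; /* maximum aggregated epoch */
	struct cont_handle_md	*cm_handles;	   /* open container handles */
	unsigned int		 cm_nhandles;
	uint64_t		*cm_snapshots;	   /* snapshotted epochs */
	unsigned int		 cm_nsnapshots;
	/* upper-layer (e.g. DAOS-SR) attributes would hang off here */
};
```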

A container service handles the following RPC procedures. "container_handle_uuid" is a UUID identifying a container handle. "epoch_state" contains epoch-related information such as the container HCE, the LRE, HCE, and LHE of this handle, the number of snapshotted epochs, etc.

  • CONTAINER_CREATE(pool_handle_uuid, container_uuid) error. Create a container. Sent only by the pool service in response to a POOL_CONTAINER_CREATE request. "container_uuid" is the UUID of the container to create.
  • CONTAINER_OPEN(pool_handle_uuid, container_uuid, container_handle_uuid, capabilities) (error, epoch_state). Establish a container handle. "container_uuid" is the UUID of the container. "container_handle_uuid" is generated by the client. "capabilities" indicates whether the client requests read-only or read-write access to this container.
  • CONTAINER_CLOSE(pool_handle_uuid, container_handle_uuid) error. Close a container handle.
  • CONTAINER_EPOCH_FLUSH(pool_handle_uuid, container_handle_uuid, epoch, targets) (error, epoch_state). Flush all writes in "epoch" on "targets". "targets" is the set of target addresses.
  • CONTAINER_EPOCH_DISCARD(pool_handle_uuid, container_handle_uuid, from_epoch, to_epoch, targets) (error, epoch_state). Discard all writes in epoch range ["from_epoch", "to_epoch"] on "targets". "targets" is the set of target addresses.
  • CONTAINER_EPOCH_QUERY(pool_handle_uuid, container_handle_uuid) (error, epoch_state). Query the epoch state.
  • CONTAINER_EPOCH_HOLD(pool_handle_uuid, container_handle_uuid, epoch) (error, epoch_state). Hold the epochs equal to or higher than "epoch" for this container handle. The resulting lowest held epoch, which may be different from "epoch", is returned in "epoch_state".
  • CONTAINER_EPOCH_SLIP(pool_handle_uuid, container_handle_uuid, epoch) (error, epoch_state). Slip the LRE of this container handle to "epoch". The resulting LRE, which may be different from "epoch", is returned in "epoch_state".
  • CONTAINER_EPOCH_COMMIT(pool_handle_uuid, container_handle_uuid, epoch) (error, epoch_state). Commit "epoch" of a container.
  • CONTAINER_EPOCH_WAIT(pool_handle_uuid, container_handle_uuid, epoch) (error, epoch_state). Wait for "epoch" to become committed.
  • CONTAINER_SNAPSHOT_LIST(pool_handle_uuid, container_handle_uuid) (error, epochs). List the snapshotted epochs. The result is returned in "epochs".
  • CONTAINER_SNAPSHOT_TAKE(pool_handle_uuid, container_handle_uuid, epoch) error. Take a snapshot of "epoch".
  • CONTAINER_SNAPSHOT_REMOVE(pool_handle_uuid, container_handle_uuid, epoch) error. Remove the snapshot of "epoch".
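
As an illustration of the procedure signatures above, the following sketch spells out hypothetical request and reply structures for CONTAINER_EPOCH_COMMIT, along with the epoch state most of these RPCs return; all type and field names are assumptions rather than the real DAOS wire format.

```c
#include <stdint.h>
#include <uuid/uuid.h>

/* Illustrative sketch of the epoch state returned by most container RPCs. */
struct epoch_state {
	uint64_t es_cont_hce;	/* container HCE */
	uint64_t es_lre;	/* this handle's LRE */
	uint64_t es_hce;	/* this handle's HCE */
	uint64_t es_lhe;	/* this handle's LHE */
	uint64_t es_nsnapshots;	/* number of snapshotted epochs */
};

/* CONTAINER_EPOCH_COMMIT(pool_handle_uuid, container_handle_uuid, epoch)
 * -> (error, epoch_state), expressed as request/reply structures. */
struct cont_epoch_commit_in {
	uuid_t	 eci_pool_handle_uuid;
	uuid_t	 eci_cont_handle_uuid;
	uint64_t eci_epoch;
};

struct cont_epoch_commit_out {
	int			eco_error;	/* 0 on success */
	struct epoch_state	eco_epoch_state;
};
```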

Target Service

The target service exports the VOS methods and abstracts the complexity of the underlying file system structures that store the persistent state of VOS and of the target service itself. All object I/Os coming from the DAOS-SR layer are simply passed to VOS, without altering or verifying end-to-end data integrity checksums; this avoids one level of memory copy for each I/O. In addition, the target service provides the facilities that help implement pool and container access control, in the form of pool and container handles in volatile memory.

The target service handles the following RPC procedures in addition to those required by I/O bypasses:

  • TARGET_CONTAINER_OPEN(pool_handle_uuid, container_uuid, container_handle_uuid) error. Establish a container handle authorized by the container service on this target. Sent only by the container service in response to a CONTAINER_OPEN request.
  • TARGET_CONTAINER_CLOSE(pool_handle_uuid, container_handle_uuid, epoch) error. Close a container handle on this target. Sent only by the container service in response to a CONTAINER_CLOSE request. "epoch" is the HCE of the container handle.
  • TARGET_CONTAINER_DESTROY(pool_handle_uuid, container_uuid) error. Destroy a container on this target. Sent only by the pool service in response to a POOL_CONTAINER_DESTROY request.
  • TARGET_EPOCH_FLUSH(pool_handle_uuid, container_handle_uuid, epoch) error. Flush all writes in "epoch" to this target. May be sent by both clients and the container service.
  • TARGET_EPOCH_DISCARD(pool_handle_uuid, container_handle_uuid, from_epoch, to_epoch) error. Discard all writes in epoch range ["from_epoch", "to_epoch"] on this target.
  • TARGET_EPOCH_AGGREGATE(pool_handle_uuid, container_handle_uuid, from_epoch, to_epoch) error. Aggregate all writes in epoch range ["from_epoch", "to_epoch" − 1] into epoch "to_epoch" on this target.
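
The sketch below illustrates the kind of volatile per-handle record a target service might keep after a TARGET_CONTAINER_OPEN, off which the access-control checks and the cookie-based write tracking (see Intra-Container Epochs) can hang; the structure and its field names are hypothetical.

```c
#include <stdint.h>
#include <uuid/uuid.h>

/* Illustrative volatile state kept by a target service per container
 * handle; not the actual DSM implementation. */
struct target_cont_handle {
	uuid_t		tch_pool_handle_uuid;	/* authorizing pool handle */
	uuid_t		tch_cont_uuid;		/* container UUID */
	uuid_t		tch_cont_handle_uuid;	/* handle UUID assigned by the client */
	unsigned int	tch_capas;		/* open flags, e.g. read-only or read-write */
	uint64_t	tch_cookie;		/* compact handle identifier (see Intra-Container Epochs) */
};
```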

Container Creation

Creating a container goes through the pool service. After receiving the POOL_CONTAINER_CREATE request, the pool service either selects an existing container service from the container service index or creates a new one, and sends a CONTAINER_CREATE request to that container service to initialize the container metadata. If the initialization succeeds, the container name space (if the container has a name) and the container index are updated, and the pool service replies to the POOL_CONTAINER_CREATE request.
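
A sketch of the pool-service side of this flow, assuming hypothetical helper functions for selecting a container service, forwarding CONTAINER_CREATE, and updating the pool metadata:

```c
#include <uuid/uuid.h>

/* Hypothetical pool-service internals; every function name below is an
 * assumption made for illustration only. */
int cont_svc_select_or_create(uuid_t pool_handle_uuid, uuid_t cont_svc_out);
int cont_svc_send_create(uuid_t cont_svc, uuid_t pool_handle_uuid, uuid_t cont_uuid);
int pool_md_update_cont_index(uuid_t cont_uuid, uuid_t cont_svc, const char *name);

/* Pool-service-side handling of POOL_CONTAINER_CREATE, as described above. */
int pool_container_create(uuid_t pool_handle_uuid, uuid_t cont_uuid, const char *name)
{
	uuid_t	cont_svc;
	int	rc;

	/* Pick an existing container service or create a new one. */
	rc = cont_svc_select_or_create(pool_handle_uuid, cont_svc);
	if (rc != 0)
		return rc;

	/* Ask the container service to initialize the container metadata. */
	rc = cont_svc_send_create(cont_svc, pool_handle_uuid, cont_uuid);
	if (rc != 0)
		return rc;

	/* Update the container name space (if named) and the container
	 * index; the POOL_CONTAINER_CREATE reply can then be sent. */
	return pool_md_update_cont_index(cont_uuid, cont_svc, name);
}
```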

The placement policy of container metadata involves its own tradeoffs. The major parameters are the number of container services in a pool and the mapping strategy from containers to container services. On one extreme, the metadata of all containers in a pool could be placed in one container service that shares the same Raft replicated log with the pool service. This would avoid two-phase commits for inter-container epochs (Inter-Container Epochs), but might raise performance concerns, as the state updates of all container metadata operations for all containers in the pool would be serialized by the single Raft replicated log. On the other extreme, each container could have its own container service (i.e., a dedicated Raft replicated log), which would separate metadata operations for different containers but make inter-container epochs more difficult. The initial prototype shall start with one container service (i.e., a single Raft replicated log) per pool, evaluate its performance, and implement a tunable or even run-time-determined number of container services if necessary.

Container Handles

When a client process calls the container open method, it may specify:

  1. a container name in the pool, or
  2. a container UUID, or
  3. a container UUID and a container service address.

How the client process calls the open method depends on which of these choices are applicable (e.g., a container may not have a name in the pool) and already known.

In case (1) or (2) above, the client library first sends a POOL_CONTAINER_LOOKUP request to the pool service to look up the address of the corresponding container service. Then, or directly in case (3), the client library sends a CONTAINER_OPEN request to the container service. The container service processes the request by sending a collective request with the client identifier as well as any open flags (e.g., read-only or read-write) to all enabled targets in the pool, and replies to the client with a handle containing the client identifier. As with pool handles, the client process may also share the container handle with its peers using the utility methods described in Client Library.
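
The following sketch shows the client-library side of case (2), with hypothetical names for the lookup and open RPC wrappers; only uuid_generate() is a real library call.

```c
#include <uuid/uuid.h>

/* Hypothetical client-library internals; the function and type names are
 * assumptions made for illustration. */
struct svc_addr;	/* opaque address of a (replicated) service */

int pool_container_lookup(struct svc_addr *pool_svc, uuid_t cont_uuid,
			  struct svc_addr **cont_svc_out);
int container_open_rpc(struct svc_addr *cont_svc, uuid_t pool_handle_uuid,
		       uuid_t cont_uuid, uuid_t cont_handle_uuid,
		       unsigned int capabilities);

/* Open by UUID when the container service address is not yet known
 * (case (2) in the list above). */
int cont_open_by_uuid(struct svc_addr *pool_svc, uuid_t pool_handle_uuid,
		      uuid_t cont_uuid, unsigned int capas,
		      uuid_t cont_handle_uuid_out)
{
	struct svc_addr *cont_svc;
	int		 rc;

	/* Ask the pool service where the container's metadata lives. */
	rc = pool_container_lookup(pool_svc, cont_uuid, &cont_svc);
	if (rc != 0)
		return rc;

	/* Generate the handle UUID on the client, then ask the container
	 * service to establish the handle on all enabled targets. */
	uuid_generate(cont_handle_uuid_out);
	return container_open_rpc(cont_svc, pool_handle_uuid, cont_uuid,
				  cont_handle_uuid_out, capas);
}
```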

Epoch Protocol

The epoch protocol, which involves the client library and all the three types of services, implements the epoch model described in Transactional Model. This subsection first describes epochs within a single container, and then addresses atomicity across containers in the same pool.

Intra-Container Epochs

The epochs of a container are managed by the matching container service. The container service maintains the definitive epoch state as part of the container metadata, whereas the target services have little knowledge of the global epoch state. Epoch commit, discard, and aggregate procedures are therefore all driven by the container service. This design choice greatly simplifies the commit procedure, making an intra-container commit operation local to the container service. On the other hand, this also puts more responsibility on application or middleware developers, because target services are unable to guard against buggy applications that submit write operations to epochs that have already been committed.

When a client process opens a container with write capability, the container service assigns a small cookie (i.e., up to 64 bits) to the resulting container handle and sends it along with the container handle to all the targets. The cookie then serves as a container handle identifier that is smaller than the 128-bit container handle UUID, and used only internally on the server side to track write operations belonging to a certain container handle. The container service guarantees that coexistent container handles get unique cookies and that each cookie remains constant throughout the life of its container handle.

On each target, the target service eagerly stores incoming write operations into the matching VOS container. For every write operation, the target service passes the cookie of the matching container handle and the epoch to VOS. If a container handle discards an epoch, VOS helps discard all write operations associated with the cookie of that container handle. When a write operation succeeds, it is immediately visible to conflicting operations in equal or higher epochs. A conflicting write operation in the same epoch will be rejected by VOS unless it has the same cookie and content as the one already executed. Applications that need to read uncommitted write operations from a different container handle require their own conflict resolution mechanism to ensure that their write operations use a higher epoch than the one they read from, as described in Concurrent Producers.
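
The conflict rule for same-epoch writes can be expressed as a small predicate. The record layout below is a toy model used only for illustration, not the actual VOS structures:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model of a versioned write record; illustrative only. */
struct vos_write_rec {
	uint64_t	 wr_epoch;	/* epoch of the write */
	uint64_t	 wr_cookie;	/* container-handle cookie */
	const void	*wr_data;
	size_t		 wr_len;
};

/* Decide whether a new write may land where a write in the same epoch
 * already exists: accept only an identical replay from the same container
 * handle, reject everything else. */
bool vos_write_allowed(const struct vos_write_rec *existing,
		       const struct vos_write_rec *incoming)
{
	if (existing == NULL || existing->wr_epoch != incoming->wr_epoch)
		return true;	/* no same-epoch conflict */

	return existing->wr_cookie == incoming->wr_cookie &&
	       existing->wr_len == incoming->wr_len &&
	       memcmp(existing->wr_data, incoming->wr_data,
		      existing->wr_len) == 0;
}
```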

Before committing an epoch, an application must ensure that a sufficient set of write operations for this epoch have been persisted by the target services. The application and the DAOS-SR layer may decide that losing some write operations is acceptable, depending on the redundancy scheme each of them employs. The flushing of each write operation involves a client-side stage and a server-side stage. On the client side, if the write operation is non-blocking, the application must wait for it to complete by calling the event methods in the client library on the appropriate event queues. Then, on the server side, the local NVM transaction corresponding to the write operation must commit successfully. The application ensures this by calling the flush method in the client library, which asks for the whole container to be flushed at this epoch. The DAOS-SR layer responds by sending a CONTAINER_EPOCH_FLUSH request to the container service, which then initiates a collective TARGET_EPOCH_FLUSH request to all target services. The "targets" argument of CONTAINER_EPOCH_FLUSH and the TARGET_EPOCH_FLUSH procedure enable future DAOS-SR implementations to experiment with more selective flushing on only the targets that have actually been updated in this epoch.
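
A sketch of the two-stage flush from the client's point of view, assuming hypothetical client-library calls for event completion and the container flush method:

```c
#include <stdint.h>
#include <uuid/uuid.h>

/* Hypothetical client-library calls; the names are assumptions used only
 * to illustrate the two-stage flush described above. */
struct event_queue;

int eq_wait_all(struct event_queue *eq);			/* client-side stage */
int cont_epoch_flush(uuid_t cont_handle_uuid, uint64_t epoch);	/* server-side stage */

/* Make sure every write submitted in `epoch` is persistent before commit. */
int flush_epoch(struct event_queue *eq, uuid_t cont_handle_uuid, uint64_t epoch)
{
	int rc;

	/* 1. Wait for all non-blocking writes in this epoch to complete. */
	rc = eq_wait_all(eq);
	if (rc != 0)
		return rc;

	/* 2. Ask the container service to flush the whole container at this
	 *    epoch; it fans out TARGET_EPOCH_FLUSH to the target services,
	 *    which commit the corresponding local NVM transactions. */
	return cont_epoch_flush(cont_handle_uuid, epoch);
}
```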

Committing an epoch of a container handle results in a CONTAINER_EPOCH_COMMIT request to the corresponding container service. The container service simply assembles a container metadata update that:

  • increases the container handle HCE to this epoch,
  • increases the container handle LHE to this epoch plus one, and
  • updates the container HCE to min(max(container handle HCEs), min(container handle LHEs) − 1).

When the update becomes persistent, the container service replies to the client with the new epoch state.
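
The metadata update can be captured in a few lines of C. The array-based handle state below is an assumption made only to keep the sketch self-contained; it is not how the container service actually stores its metadata.

```c
#include <stdint.h>

/* Per-handle epoch state relevant to commit. */
struct handle_epochs {
	uint64_t hce;	/* highest committed epoch of the handle */
	uint64_t lhe;	/* lowest held epoch of the handle */
};

static uint64_t min64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static uint64_t max64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/* Commit `epoch` on handle `committing` and return the new container HCE. */
uint64_t commit_epoch(struct handle_epochs *handles, unsigned int nhandles,
		      unsigned int committing, uint64_t epoch)
{
	uint64_t     max_hce, min_lhe;
	unsigned int i;

	/* Raise the committing handle's HCE to this epoch and its LHE to
	 * this epoch plus one. */
	handles[committing].hce = epoch;
	handles[committing].lhe = epoch + 1;

	/* Container HCE = min(max(handle HCEs), min(handle LHEs) - 1). */
	max_hce = handles[0].hce;
	min_lhe = handles[0].lhe;
	for (i = 1; i < nhandles; i++) {
		max_hce = max64(max_hce, handles[i].hce);
		min_lhe = min64(min_lhe, handles[i].lhe);
	}
	return min64(max_hce, min_lhe - 1);
}
```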

An epoch of a container handle may be "aborted" by first discarding the epoch and then committing it as described above. Similar to flushing an epoch, depending on how the discard method is called, discarding an epoch may result in:

  • a CONTAINER_EPOCH_DISCARD request to the corresponding container service, which triggers a collective TARGET_EPOCH_DISCARD request to all target services, or
  • a set of TARGET_EPOCH_DISCARD requests to the target services involved in this epoch of this container handle.

Once the discard method succeeds, all write operations from this container handle in this epoch are discarded. The commit method may then be called to commit this "empty" epoch for the container handle.
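
A sketch of the resulting abort helper, assuming hypothetical client-library wrappers for the discard and commit methods:

```c
#include <stdint.h>
#include <uuid/uuid.h>

/* Hypothetical client-library wrappers; names are illustrative. */
int cont_epoch_discard(uuid_t cont_handle_uuid, uint64_t epoch);
int cont_epoch_commit(uuid_t cont_handle_uuid, uint64_t epoch);

/* "Abort" an epoch of a container handle: discard every write this handle
 * submitted in the epoch, then commit the now-empty epoch. */
int epoch_abort(uuid_t cont_handle_uuid, uint64_t epoch)
{
	int rc = cont_epoch_discard(cont_handle_uuid, epoch);

	if (rc != 0)
		return rc;
	return cont_epoch_commit(cont_handle_uuid, epoch);
}
```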

When a container service is asked to close a container handle, either by the application or the pool service when evicting the corresponding pool handle, it sends a collective TARGET_CONTAINER_CLOSE request to all target services. The HCE of the container handle is sent in this request as an argument. Each target service asks VOS to discard all write operations belonging to the matching cookie in all epochs higher than the container handle HCE. When the collective request succeeds, the container service destroys the container handle.

Target Faults

Given hundreds of thousands of targets, the epoch protocol must allow progress in the presence of target faults. Since pool and container services are highly available, the problem is mainly concerned with target services. The solution is based on the assumption that losing some targets may not necessarily cause any application data loss, as there may be enough redundancy created by the DAOS-SR layer to hide the faults from applications. Moreover, an application might even want to ignore a particular data loss (which the DAOS-SR layer is unable to hide), for it has enough application-level redundancy to cope or it simply does not care.

When a write, flush, or discard operation fails, the DAOS-SR layer determines whether there is sufficient redundancy left to continue with the epoch. If the failure can be hidden, and assuming that the target in question has not already been disabled in the pool map (e.g., as a result of a RAS notification), the DAOS-SR layer must disable the target before committing the epoch. For the epoch protocol, the resulting pool map update effectively records the fact that the target may store an undefined set of write operations in the epoch, and should be avoided. This also applies to applications that would like to ignore similar failures that the DAOS-SR layer cannot hide.
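
A sketch of this decision, with hypothetical stand-ins for the DAOS-SR redundancy check and the pool map update:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical DAOS-SR-side helpers; every name below is an assumption. */
bool sr_redundancy_sufficient(uint32_t failed_target, uint64_t epoch);
int  pool_disable_target(uint32_t failed_target);	/* triggers a pool map update */

/* Decide how to proceed after a failed write, flush, or discard on
 * `failed_target` while preparing `epoch` for commit. */
int handle_target_fault(uint32_t failed_target, uint64_t epoch)
{
	/* If the remaining redundancy cannot hide the fault, the epoch (or
	 * the affected data) must be given up at a higher level. */
	if (!sr_redundancy_sufficient(failed_target, epoch))
		return -1;

	/* Record in the pool map that this target may hold an undefined set
	 * of writes for this epoch; the commit may proceed afterwards. */
	return pool_disable_target(failed_target);
}
```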

Inter-Container Epochs

When a client process needs to commit multiple epochs belonging to different containers atomically, it shall complete the write operations for all these epochs and flush them successfully before starting to commit any of the epochs. Any faulty targets encountered during the I/Os shall be disabled as explained above. Once a sufficient set of the write operations of all these epochs are persisted, the client may call the commit method with the list of <container handle, epoch> pairs. A POOL_EPOCH_COMMIT request is sent to the pool service as a result. The pool service logs the list of <container handle, epoch> pairs, and then sends one CONTAINER_EPOCH_COMMIT request with each <container handle, epoch> pair to the corresponding container services in parallel. The pool service stubbornly retries these requests if some container services are unavailable, which should rarely happen because of service replication, and at the same time makes sure that the corresponding pool handle remains even if the client process set terminates. Once all commit requests succeed, the pool service removes the list of <container handle, epoch> pairs from the log, and finally replies to the client process (if it still exists). Note that if a client process has a programming error (e.g., one of the epochs is not held before being committed), the corresponding CONTAINER_EPOCH_COMMIT request will return an error while the other requests succeed. Introducing a two-phase-commit prepare phase would help with this case; however, the performance overhead would apply not only to the error cases, but also to the common, well-behaved cases.
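
The pool-service side of this protocol can be sketched as follows. The logging and RPC helpers are assumptions, and the per-pair commits are shown sequentially for brevity even though the service issues them in parallel:

```c
#include <stdint.h>
#include <uuid/uuid.h>

/* One entry of the inter-container commit list. */
struct cont_epoch {
	uuid_t	 ce_cont_handle_uuid;
	uint64_t ce_epoch;
};

/* Hypothetical pool-service helpers; names are assumptions. */
int log_pairs(const struct cont_epoch *pairs, unsigned int n);		/* persist the list */
int log_remove_pairs(const struct cont_epoch *pairs, unsigned int n);
int send_cont_epoch_commit(const struct cont_epoch *pair);	/* retried while the service is unavailable */

/* Pool-service handling of POOL_EPOCH_COMMIT. */
int pool_epoch_commit(const struct cont_epoch *pairs, unsigned int n)
{
	unsigned int i;
	int	     rc;

	/* 1. Log the list so the commit can be driven to completion. */
	rc = log_pairs(pairs, n);
	if (rc != 0)
		return rc;

	/* 2. Drive one CONTAINER_EPOCH_COMMIT per pair (in parallel in the
	 *    real service), retrying until each container service responds. */
	for (i = 0; i < n; i++) {
		rc = send_cont_epoch_commit(&pairs[i]);
		if (rc != 0)
			return rc;	/* e.g. an epoch that was never held */
	}

	/* 3. All commits landed; drop the log entry and reply to the client. */
	return log_remove_pairs(pairs, n);
}
```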

Snapshots

The scope of a snapshot is an epoch of a container. The epoch must be equal to or higher than the handle LRE and be equal to or less than the handle HCE. A client taking a snapshot sends a CONTAINER_SNAPSHOT_TAKE request to the corresponding container service with the epoch. The container service simply generates a container metadata update that inserts the epoch into the list of snapshotted epochs. Listing and removing snapshots are similarly trivial.

Aggregation

Epoch aggregation in a container is driven by the corresponding container service as a background job. The container service maintains one bit for each snapshot e_i in the container metadata, recording whether all epochs in (e_{i-1}, e_i] (where e_{i-1} is taken to be 0 if e_i is the lowest snapshot) have been aggregated, and periodically sends collective TARGET_EPOCH_AGGREGATE requests to all target services, requesting the lowest unaggregated epoch range (e_{i-1}, e_i] to be aggregated. If all target services successfully aggregate the epoch range, the container service generates a container metadata update that sets the bit of e_i. Otherwise, the container service leaves the bit unchanged and retries later with the latest pool map, which may have the faulty targets disabled. This process also aggregates the container handle cookies, in that write operations from different container handles in an aggregated epoch are no longer identifiable by their cookies. After the bits of all snapshots have been set, the container service may schedule background aggregations for (e_max, LRE], where e_max is the highest snapshot, whenever the LRE increases. When removing a snapshot e_i, the container service generates a container metadata update that clears the bit of e_{i+1}. (If e_i is the highest snapshot, no change needs to be made, because the next aggregation up to the LRE will do the job.) The next aggregation will then cover (e_{i-1}, e_{i+1}]. In this scheme, target services do not need to track snapshot epochs, which keeps snapshots lightweight and straightforward.
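
A sketch of one pass of this background job, assuming a simple in-memory snapshot table and a helper that issues the collective TARGET_EPOCH_AGGREGATE covering (from, to]; both are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-snapshot aggregation state kept by the container service (sketch). */
struct snapshot {
	uint64_t sn_epoch;	/* snapshotted epoch e_i */
	bool	 sn_aggregated;	/* (e_{i-1}, e_i] already aggregated? */
};

/* Hypothetical collective TARGET_EPOCH_AGGREGATE to all targets covering
 * the range (from_epoch, to_epoch]; retried later on failure. */
int target_epoch_aggregate_all(uint64_t from_epoch, uint64_t to_epoch);

/* One pass of the background aggregation job. */
void aggregate_step(struct snapshot *snaps, unsigned int nsnaps)
{
	uint64_t     from = 0;	/* lower bound below the lowest snapshot */
	unsigned int i;

	for (i = 0; i < nsnaps; i++) {
		if (!snaps[i].sn_aggregated) {
			/* Aggregate the lowest unaggregated range
			 * (e_{i-1}, e_i]; on success, persist the bit. */
			if (target_epoch_aggregate_all(from, snaps[i].sn_epoch) == 0)
				snaps[i].sn_aggregated = true;
			return;	/* on failure, retry later with a fresh pool map */
		}
		from = snaps[i].sn_epoch;
	}
	/* All snapshot ranges are aggregated; aggregation of (e_max, LRE]
	 * can be scheduled here whenever the LRE increases (not shown). */
}
```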