Hello, everyone. My name is Jin Hase. I work at FSAS Technologies, Inc., where we are developing an emerging architecture called Composable Disaggregated Infrastructure. Today, we introduce dynamic scaling of GPUs for container apps with Composable Disaggregated Infrastructure for the AI era. Recently, generative AI has become very popular, and the use of AI workloads is also increasing in Kubernetes environments. So, we will introduce the importance of Composable Disaggregated Infrastructure in the AI era and how it works with Kubernetes.
Generative AI has opened the age of AI. As a result, enormous computational resources are required, and they need a lot of power to operate. On the other hand, realizing a sustainable world is an urgent issue for us, and we are always expected to reduce power consumption. That means we need to satisfy high performance and power saving simultaneously. To cope with these conflicting requirements, it is important to use expensive GPUs as efficiently as possible.
Therefore, we are happy to recommend Composable Disaggregated Infrastructure. There are mainly three methods for using GPUs efficiently, and we aim to realize the third method to achieve both high performance and power saving. Let me explain each method. The first is splitting a GPU in several ways and allowing multiple processes to share it, so that the GPU is used efficiently. However, it cannot provide more performance than the number of GPUs attached to the server. The second method is to scale nodes with GPUs horizontally. This method allows us to increase or decrease the number of GPUs, which can provide better performance than the first method. However, adding nodes takes time, so there is a possibility of a temporary lack of performance, and if a user only wants a GPU, it consumes unnecessary power for the extra nodes. The third method is to vertically scale the number of GPUs in the server. This method can quickly provide as many GPUs as the user needs, depending on the user workload. This means both high performance and power saving. So, we are aiming to realize the third method, vertical device scaling.
As an infrastructure for vertical scaling of devices, Composable Disaggregated Infrastructure has emerged. We call this infrastructure CDI. With a traditional server, shown in the left diagram, hardware resources such as CPUs, memory, and GPUs reside within the server. CDI disaggregates these hardware resources and makes them available as a resource pool. We can then combine these resources by software definition to create custom-made servers, which we call Composed Bare Metal. This infrastructure allows us to dynamically increase or decrease the number of GPUs.
This slide shows the CDI software stack. The CDI system is composed of a resource pool and the CDI manager software. In the resource pool, all components are connected to PCIe or CXL switches. The CDI manager controls the switches to create composed bare metal by software definition. It provides a CDI API and an operator, and Kubernetes can call the API. Once composed bare metal is created, the user can install any operating system or container infrastructure.
We would like to realize automatically attaching or detaching GPUs to Kubernetes nodes based on load. There are two key features. One is dynamic device scaling, which we call DDS. This feature determines whether to increase or decrease nodes or GPUs depending on the load, so it is responsible for determining the optimal resource allocation. The second one is the CDI operator. This feature accesses the CDI manager and attaches or detaches GPUs according to DDS's decision, so it is responsible for actually performing the GPU attach or detach. I will explain the simple processing flow in this diagram. The user first creates a pod. If the cluster runs out of resources, such as GPUs in this case, the Kubernetes scheduler will not be able to deploy the pod. Such a pod is called an unschedulable pod. To make this pod deployable, DDS determines whether to add just a GPU or to add a node. If DDS determines to add only a GPU, the CDI operator requests the CDI manager to add a GPU. After the CDI manager adds the GPU to the node, the Kubernetes scheduler can schedule the pod. When the Kubernetes scheduler deploys the pod, the features for using the GPU from the pod, such as DRA (dynamic resource allocation) and the device plugin, come into play.
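To make the trigger concrete, here is a minimal client-go sketch that lists pending pods marked unschedulable by the scheduler and requesting GPUs, which is the condition DDS reacts to. It assumes a cluster reachable through the default kubeconfig and is only an illustration, not the actual DDS code.

    package main

    import (
        "context"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    // requestsGPU reports whether any container in the pod asks for an NVIDIA GPU.
    func requestsGPU(pod corev1.Pod) bool {
        for _, c := range pod.Spec.Containers {
            if _, ok := c.Resources.Limits["nvidia.com/gpu"]; ok {
                return true
            }
        }
        return false
    }

    // isUnschedulable reports whether the scheduler marked the pod as unschedulable.
    func isUnschedulable(pod corev1.Pod) bool {
        for _, cond := range pod.Status.Conditions {
            if cond.Type == corev1.PodScheduled &&
                cond.Status == corev1.ConditionFalse &&
                cond.Reason == corev1.PodReasonUnschedulable {
                return true
            }
        }
        return false
    }

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        cs, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }
        // Pending pods are the candidates; the Unschedulable condition confirms them.
        pods, err := cs.CoreV1().Pods("").List(context.TODO(),
            metav1.ListOptions{FieldSelector: "status.phase=Pending"})
        if err != nil {
            panic(err)
        }
        for _, p := range pods.Items {
            if isUnschedulable(p) && requestsGPU(p) {
                fmt.Printf("unschedulable GPU pod: %s/%s\n", p.Namespace, p.Name)
            }
        }
    }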
Next, I will explain the basic operation flow. First, it is important to note that DDS leverages a feature called the cluster autoscaler. The cluster autoscaler is a Kubernetes feature that scales nodes horizontally depending on load. This feature is called CA, and DDS hooks into CA's node increase or decrease processing. DDS then determines whether to increase or decrease only GPUs or whole nodes. This means that if CA tries to add a node but DDS determines that adding a GPU is optimal, only the GPU will be added. The upper figure shows the current CA processing of horizontal node scaling. If we use CDI as the infrastructure, when CA decides to add or remove nodes, the CDI manager creates or deletes composed bare metal. The lower figure shows vertical device scaling by DDS. DDS hooks into CA's node increase or decrease processing. If DDS determines a node should be added, the CDI manager creates composed bare metal. On the other hand, if DDS determines a GPU should be added, DDS requests the CDI operator to add a GPU. This is the basic operation flow.
Let's take a look at the challenges of achieving vertical scaling of devices. Since DDS leverages the CA feature, DDS needs a mechanism to scale devices while keeping CA's concepts. However, when DDS scales devices, CA's processing is affected. Specifically, CA groups nodes that have the same spec; in CA, this is called a node group, and CA adds or removes nodes per node group. In other words, CA decides on node addition and removal assuming that all nodes in a node group have the same spec. However, if DDS scales devices, the nodes in the node group end up with different specs. In this figure, look at node group number two. When CA decides to add a node, CA does not know with which spec a node should be added to node group number two, so CA cannot make such a decision. This is the current problem.
Let me explain our solution to the problem mentioned on the previous page: we provide a node group for vertical scaling.
Let me explain the basic concept of the node group for vertical scaling. The most important point is to show CA the maximum number of GPUs that can be attached to a node as the node spec. This table shows the differences between a normal node group and a node group for vertical scaling. First, the normal node group: when CA is considering node scaling, it obtains the current node spec, such as the number of GPUs a node has. For a normal node group, CA randomly picks a node from the node group and checks its specs. Then, CA simulates whether an unschedulable pod could be deployed by adding a node with the same specs. Therefore, the node specs CA obtains are the actual node specs, and CA assumes that all nodes within the node group have the same specs. This is the current design of CA for a normal node group. Next, the node group for DDS: in this case, CA always obtains a fixed spec, meaning the actual node specs are irrelevant to CA's processing. Specifically, the spec CA obtains is the maximum number of GPUs that can be attached to a node. The reason for showing the maximum number is to handle cases where a user requests more GPUs than a node actually has. With the current CA, if the requested number of GPUs is greater than the actual number the node has, CA cannot do anything, because adding a node would not allow the pod to be deployed. In our solution, by showing the maximum number of GPUs that can be attached, CA can determine that the pod becomes deployable by adding nodes. And it does not matter if the nodes in the node group for DDS have different specs.
Now let me explain how the node spec is obtained. For node groups for DDS, we store the fixed spec of the nodes, such as the maximum number of GPUs, in the machine-set resource. When CA obtains specs for a node group for DDS, it always obtains the maximum number of GPUs listed in the machine-set. This diagram shows the actual resource relationships: the node group is linked to the machine-set resource, and the machine-set resource manages a collection of machines. For example, it has a parameter called "replicas" that indicates the number of nodes in the group. There are also machine resources and node resources: a machine resource represents a node virtually, while a node resource corresponds to the node in the Kubernetes cluster and holds the node's actual specs. So, for normal node groups, CA obtains the actual node spec from the node resource. For node groups for DDS, on the other hand, CA obtains the maximum number of GPUs from the machine-set resource.
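As a rough illustration of this idea, the sketch below builds the fixed node template that CA would simulate with from annotations on the machine-set, borrowing the capacity-annotation style that the cluster autoscaler's Cluster API provider already uses for scale-from-zero. The annotation keys and values shown here are assumptions for illustration, not the final design.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
    )

    // nodeTemplateFromMachineSet builds the resource list CA would simulate with,
    // taken from the machine-set annotations instead of a live node.
    func nodeTemplateFromMachineSet(annotations map[string]string) corev1.ResourceList {
        rl := corev1.ResourceList{}
        if v, ok := annotations["capacity.cluster-autoscaler.kubernetes.io/cpu"]; ok {
            rl[corev1.ResourceCPU] = resource.MustParse(v)
        }
        if v, ok := annotations["capacity.cluster-autoscaler.kubernetes.io/memory"]; ok {
            rl[corev1.ResourceMemory] = resource.MustParse(v)
        }
        // For a node group for DDS, this is the *maximum* number of GPUs that can
        // be attached via CDI, not the number currently attached.
        if v, ok := annotations["capacity.cluster-autoscaler.kubernetes.io/gpu-count"]; ok {
            rl["nvidia.com/gpu"] = resource.MustParse(v)
        }
        return rl
    }

    func main() {
        ann := map[string]string{
            "capacity.cluster-autoscaler.kubernetes.io/cpu":       "32",
            "capacity.cluster-autoscaler.kubernetes.io/memory":    "128Gi",
            "capacity.cluster-autoscaler.kubernetes.io/gpu-count": "8", // max attachable GPUs
        }
        fmt.Println(nodeTemplateFromMachineSet(ann))
    }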
To achieve the functionality described on the previous page, we are proposing a new option for CA, which is currently under discussion in the Kubernetes SIG Autoscaling community. CA already has the ability to obtain the specs listed in the machine-set; however, it currently refers to the machine-set specs only if there are no nodes in the node group. If there are nodes in the node group, CA obtains node specs from the node resources, not the machine-set. The reason for this design is that CA is meant to be able to scale a node group up from zero: if no node exists in the node group, no node resource exists either, so CA has to refer to the machine-set resource. Against this design, we are proposing to add an option to CA, "spec fixed", to reference the specs in the machine-set even if nodes exist in the node group. If this option is enabled, CA always refers to the specs in the machine-set. Take a look at the figure. The upper diagram shows the current CA process: if a node exists in a node group, CA refers to the actual specs of the node resource. In the lower diagram, CA refers to the specs in the machine-set even though there are nodes in the node group, because the "spec fixed" option is enabled. This is the feature we are proposing to the SIG Autoscaling community.
Next, I'll explain the DDS feature, Dynamic Device Scaling.
Dynamic Device Scaling (DDS) keeps the existing CA concept. Therefore, DDS hooks into CA processing, such as node increases, and then determines whether only a GPU should be added. The functional scope related to DDS is shown in the red block.
This slide shows how CA processing is hooked. When CA makes a decision to add or remove a node, DDS intervenes to determine whether to add or remove GPUs or nodes. DDS hooks the following four cases. Case 1 is when CA decides to add a node; DDS determines whether to add only a GPU or to add a node. Case 2 is when CA wants to add a node but cannot for some reason; DDS determines whether it can add only a GPU. Case 3 is when CA decides that a node is not needed; DDS determines whether to delete only a GPU or the node. Case 4 is when CA wants to delete a node but cannot for some reason; DDS determines whether it can delete only a GPU. As an example, let me walk through the node-addition case in this diagram, which is case 1. If a pod cannot be deployed due to a lack of resources, such as a GPU in this case, the pod is deemed unschedulable, and CA detects this unschedulable pod. If CA determines to add nodes, it increases the number of replicas in the machine-set resource, and in normal CA processing, nodes are then actually added. However, DDS hooks this process and determines whether to add nodes or only GPUs. The next slide shows how this works.
This page shows how CA processing is hooked, taking the node-increase case as an example. CA increases the number of replicas in the machine-set when adding nodes. DDS monitors the increase in the number of replicas and then goes into the process of determining whether to add a GPU or a node. The diagram on the left shows the current CA processing: first, CA increments the replica count to add a node; in this case, the number of replicas changes from 2 to 3. The infrastructure provider responsible for provisioning nodes monitors the number of replicas, and if it increases, the infrastructure provider actually creates servers. The figure on the right shows how DDS handles the hook; here, I explain what happens when DDS determines to add only GPUs. First, CA increments the replica count, in this case from 2 to 3. DDS monitors the number of replicas, and when it increases, DDS determines whether to add GPUs or nodes. If only GPUs should be added, DDS changes the replica count back to the original value, in this case from 3 to 2, and requests additional GPUs. As for the infrastructure provider, because it refers to a parameter called cdnNodes rather than the replica count, no actual node addition occurs. This is the difference between the current CA processing flow and the DDS processing flow. It also requires a modification on the part of the infrastructure provider.
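The step of changing the replica count back could look roughly like the sketch below, which patches spec.replicas of a Cluster API machine-set through the dynamic client. The namespace, object name, and the premise that DDS has already decided to add only a GPU are illustrative assumptions.

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/tools/clientcmd"
    )

    var machineSetGVR = schema.GroupVersionResource{
        Group: "cluster.x-k8s.io", Version: "v1beta1", Resource: "machinesets",
    }

    // revertReplicas writes the previous replica count back to the machine-set so
    // that the infrastructure provider never provisions a new server; GPU
    // attachment is then requested through the CDI operator instead.
    func revertReplicas(dc dynamic.Interface, ns, name string, previous int32) error {
        patch := []byte(fmt.Sprintf(`{"spec":{"replicas":%d}}`, previous))
        _, err := dc.Resource(machineSetGVR).Namespace(ns).Patch(
            context.TODO(), name, types.MergePatchType, patch, metav1.PatchOptions{})
        return err
    }

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        dc, err := dynamic.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }
        // CA raised replicas from 2 to 3; DDS decided a GPU is enough, so revert to 2.
        if err := revertReplicas(dc, "default", "worker-machineset", 2); err != nil {
            panic(err)
        }
        fmt.Println("replicas reverted; requesting GPU attach via the CDI operator instead")
    }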
Let's take a quick look at how DDS decides whether to add GPUs or nodes. There is actually a lot more to consider, but I will only explain the most basic logic here. In a nutshell, DDS virtually adds GPUs to a node one at a time and simulates whether the unschedulable pod becomes deployable. If the pod can be deployed, DDS requests only GPUs. On the other hand, if the pod cannot be deployed even after adding the maximum number of GPUs to all nodes, DDS requests additional nodes. So, this is the simple logic.
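Here is a minimal sketch of that decision loop under a deliberately simplified model in which only GPU counts matter and every GPU already attached to a node is assumed to be in use; the real DDS simulation also has to account for CPU, memory, and other scheduling constraints.

    package main

    import "fmt"

    // node models only what this sketch needs: how many GPUs are attached now and
    // how many could be attached at most through CDI.
    type node struct {
        name     string
        attached int // GPUs currently attached (assumed to be in use here)
        maxGPUs  int // maximum GPUs attachable to this node via CDI
    }

    // decision is either "attach addGPUs GPUs to node" or "add a new node".
    type decision struct {
        addNode bool
        node    string
        addGPUs int
    }

    // decide virtually attaches GPUs one at a time to each node and checks whether
    // the pending pod (needing podGPUs free GPUs) would then fit; if no node can
    // ever satisfy it, DDS falls back to adding a node.
    func decide(nodes []node, podGPUs int) decision {
        for _, n := range nodes {
            for extra := 1; n.attached+extra <= n.maxGPUs; extra++ {
                if extra >= podGPUs { // simplified "would the pod now schedule?" check
                    return decision{node: n.name, addGPUs: extra}
                }
            }
        }
        return decision{addNode: true}
    }

    func main() {
        nodes := []node{
            {name: "worker-1", attached: 1, maxGPUs: 4},
            {name: "worker-2", attached: 1, maxGPUs: 4},
        }
        fmt.Printf("%+v\n", decide(nodes, 2)) // attach 2 GPUs to worker-1
        fmt.Printf("%+v\n", decide(nodes, 8)) // more than any node can hold: add a node
    }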
These are links to the features currently under discussion with the community. We have only just started discussions with the SIG Autoscaling community, so we will continue the discussions to standardize the DDS feature from now on. If you are interested, please leave your feedback at the links above.
Next, I will explain the CDI operator feature.
The CDI operator sends requests to the CDI manager to attach or detach GPUs from the resource pool, and the CDI operator has a custom resource: DDS can change the configuration of devices on a node through this custom resource. The functional scope related to the CDI operator is shown in the red block: the custom resource, the CDI operator, and the CDI manager. This diagram shows the processing flow. First, DDS creates or updates the custom resource for the CDI operator, which is called a composability request. If the composability request changes, the CDI operator detects this, determines how many GPUs should be added or removed on which nodes, and then sends a specific configuration change request to the CDI manager. The CDI operator manages node devices according to the desired state of the composability request. Therefore, it is possible to automatically replace a GPU that has had a hardware failure, because the failure shows up as a difference between the desired state and the actual state. The CDI operator has more logic and functions, but I will not explain them here.
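The core of that reconcile idea can be sketched as follows, assuming a hypothetical CDI manager client with attach and detach calls; the interface and function names are illustrative and not the operator's real API.

    package main

    import "fmt"

    // cdiManager is a stand-in for the real CDI manager API.
    type cdiManager interface {
        AttachGPU(node, model string) error
        DetachGPU(node, model string) error
    }

    // reconcile drives the actual GPU count on a node toward the desired count in
    // the composability request. The same loop also repairs a failed GPU, because
    // a failure shows up as a gap between desired and actual state.
    func reconcile(m cdiManager, node, model string, desired, actual int) error {
        for actual < desired {
            if err := m.AttachGPU(node, model); err != nil {
                return err
            }
            actual++
        }
        for actual > desired {
            if err := m.DetachGPU(node, model); err != nil {
                return err
            }
            actual--
        }
        return nil
    }

    // fakeManager just logs the calls so the sketch can run without real hardware.
    type fakeManager struct{}

    func (fakeManager) AttachGPU(node, model string) error {
        fmt.Println("attach", model, "to", node)
        return nil
    }

    func (fakeManager) DetachGPU(node, model string) error {
        fmt.Println("detach", model, "from", node)
        return nil
    }

    func main() {
        _ = reconcile(fakeManager{}, "worker-1", "A30", 2, 1) // attach one more A30
    }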
Next, this is the composability request data format. We can specify which GPU model, and how many, are connected to which node. The composability request has the following parameters. First, the type parameter, which means the device type, such as GPU; currently, only GPUs are supported by the CDI operator, but we are going to support CXL memory and other device types in the future. The second parameter is size, which is the number of devices we want to attach. The model parameter is the model name of the devices; this is optional. The node parameter is the node to which you want to attach the devices; this is also optional. The next option is force detach. This is the option to forcibly detach a GPU even if it is in use, and it is disabled by default. This option is for the system administrator: if the administrator wants to remove a GPU but the user rejects that request and keeps using it, the administrator can enable this option so that the GPU can be removed forcibly. The last parameter is the connection policy, which applies when no target node is specified. There are two policies. The first is same node, which means the CDI operator tries to connect all devices to the same node as much as possible. The second is round-robin, which means the CDI operator tries to connect GPUs evenly across the nodes. If we use DDS, DDS automatically updates the composability request. Alternatively, if DDS is not used, the user can manually update the composability request to change the configuration of node devices.
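As a rough Go rendering of these parameters, the spec might look like the struct below; the field names, JSON tags, and policy values are illustrative, and the operator's actual CRD schema may differ.

    package composability

    // ConnectionPolicy decides node placement when no target node is specified.
    type ConnectionPolicy string

    const (
        SameNode   ConnectionPolicy = "samenode"   // pack devices onto one node where possible
        RoundRobin ConnectionPolicy = "roundrobin" // spread devices evenly across nodes
    )

    // ComposabilityRequestSpec is the desired device configuration handled by the
    // CDI operator.
    type ComposabilityRequestSpec struct {
        Type             string           `json:"type"`                       // device type; currently only "gpu"
        Size             int              `json:"size"`                       // number of devices to attach
        Model            string           `json:"model,omitempty"`            // optional device model, e.g. "A30"
        Node             string           `json:"node,omitempty"`             // optional target node
        ForceDetach      bool             `json:"forceDetach,omitempty"`      // detach even if the GPU is in use (admin use)
        ConnectionPolicy ConnectionPolicy `json:"connectionPolicy,omitempty"` // used when no node is specified
    }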
This is a link to the feature currently under discussion with the community. We are discussing and developing it together with IBM Research: they developed the base functionality, and we are now working together to improve it. If you are interested, please give your feedback at the link above. Next, we will show our demo.
Hello, everyone. My name is Lei. I would like to present our demo.
In the demo, I will show two patterns. The first one is adding a GPU automatically, and the other one is adding a node automatically.
This is detailed information about the demo. I will start from the initial state: we have two worker nodes in our cluster, each node has one GPU, and each GPU is being used by a pod. From this state, I create a pod that uses a GPU; because there is no GPU available, the pod is pending. Then DDS adds a GPU automatically, and finally the pod is running. After that, I create another pod which requires a lot of memory. Because there is not enough memory, the pod is pending, so DDS adds a node automatically, and finally the pod is running.
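For reference, the kind of GPU-requesting pod that triggers this flow could be created with a short client-go program like the one below; the pod name, namespace, and container image are illustrative and not taken from the demo.

    package main

    import (
        "context"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        cs, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }
        // A pod that requests one NVIDIA GPU; it stays Pending until a GPU is available.
        pod := &corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: "gpu-demo", Namespace: "default"},
            Spec: corev1.PodSpec{
                RestartPolicy: corev1.RestartPolicyNever,
                Containers: []corev1.Container{{
                    Name:    "cuda",
                    Image:   "nvidia/cuda:12.4.1-base-ubuntu22.04", // illustrative image
                    Command: []string{"nvidia-smi"},
                    Resources: corev1.ResourceRequirements{
                        Limits: corev1.ResourceList{
                            "nvidia.com/gpu": resource.MustParse("1"),
                        },
                    },
                }},
            },
        }
        created, err := cs.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{})
        if err != nil {
            panic(err)
        }
        fmt.Println("created pod:", created.Name)
    }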
Sorry, the font is a bit small. Before I start the demo, I would like to introduce the layout of my screen. The top left is the operation window, and the top right is used to monitor GPUs, nodes, and pods. As mentioned, in the initial state we have two worker nodes in our cluster, each node has one GPU, and each GPU is being used by a pod. The middle right is the CDI operator log, and the bottom right is the DDS log. Now, let's start.
First, I would like to demo adding a GPU automatically. I'm going to create a pod that uses a GPU. You can see the pod is pending because there is no GPU available. So, we are waiting for DDS to detect the pending pod and add a GPU by sending a request to the CDI operator. Please pay attention to the CDI operator log; it will take tens of seconds. Now you can see the CDI operator received a request to add a GPU: the model is A30, the size is one, and the target node is worker QEPS3. The CDI operator is attaching the GPU to the target node. You can see the GPU has been added, and the pod is running.
Second, I would like to demo adding a node automatically. I will create several pods that use a lot of memory to make the nodes' memory insufficient. I create the first pod, and it's running. I create the second pod, and it's running. I create the third pod, and it's also running. Then I create the fourth pod, and you can see it is pending; the reason is insufficient memory. So, DDS will detect the pending pod and add a node to the cluster. Since scaling out a node takes tens of minutes, I will fast-forward to the point where the new node has been created. It seems the third worker node does not appear here; this part may not have been recorded, sorry. We have confirmed node addition by DDS and the CDI operator in our environment.
Finally, I will explain our future outlook. Regarding CA, we need to discuss the detailed specification of the node group for vertical scaling in the SIG Autoscaling community, and we need to standardize it in parallel with development in the community. Currently, we focus on GPUs, but we will expand to other device types, such as CXL memory. We also need to realize hot-plugging of devices for vertical scaling. There are several features for using GPUs in a Kubernetes environment, such as dynamic resource allocation (DRA) and the GPU operator, and these features need to support hot-plug for vertical scaling. I did not explain these features today, but we are working on hot-plug support for them. That's it. Thank you for listening. Any questions?
Hello. Thanks for the presentation, very good stuff. I have a question about adding multiple GPUs to a node. Because it's going over PCIe and so on, what would the PCIe topology look like? Could we see NVIDIA's nvidia-smi topo -m output there?
Do you want to know the topology?
Yeah, the PCIe topology between the multiple GPUs and the CPU.
GPUs are connected to a PCIe box, and the PCIe box is connected to a PCIe switch. The topology is controlled by the PCIe switch, and the CDI manager controls the PCIe switch. So, at the Kubernetes layer, one cannot know the actual physical topology; the PCIe switches manage the topology, and the CDI manager controls it.
So there's no NUMA affinity there for Kubernetes?
Currently, the PCIe switches do not recognize the NUMA node topology, so I think we need to label the nodes and handle this manually.
Okay, I see. Another question: you mentioned the CDI operator. Is it running on the compute server, the CPU server there, or is it running in the PCIe switch inside the...?
No, the CDI operator is running on the Kubernetes cluster, so the cluster administrator can manually change the CDI operator manifest.
Got it, thank you.
Okay, thank you for the presentation. Maybe just a simple question: can the CDI operator scale the GPUs down to zero? I mean, removing all the GPUs, is that possible?
Sorry, maybe I don't fully understand your question. Do you mean whether the CDI operator can attach a GPU to a node or detach a GPU from a node?
I mean, can we scale down the number of GPUs to zero?
Yeah, yeah, it can support it.
Okay, thank you so much.