My name is Yasuno Goto from FSAS Technologies Inc. Today, I'd like to talk about exploring CXL memory configuration and emulation.

Here is the table of contents. First, I'd like to explain what CXL is, followed by a summary of QEMU emulation, basic information on CXL, preparation of the emulation environment, how to start the QEMU CXL emulation environment, operations on the guest OS side after boot, and the conclusion.

Please note one thing: though I tried to make sure there are no mistakes in my presentation, there might still be misunderstandings or inaccuracies. So, if you find any mistakes, please let me know.

Okay, let's start with the introduction. So, what is CXL? CXL is an abbreviation for Compute Express Link. It's a new interconnect specification for connecting devices, similar to PCI Express. It's suitable for connecting smart devices like GPGPUs, SmartNICs, FPGAs, computational storage, and so on. Offloading processing is necessary due to CPU performance limitations, so GPGPUs, FPGAs, and SmartNICs must handle the processing instead. CXL allows mutual access between CPUs and CXL devices by adding a cache coherence protocol, which enables more efficient data exchange between them. In addition, it's also useful for expanding memory, both volatile and persistent. Memory capacity needs to be increased: while the number of CPU cores has increased, memory capacity has not kept pace. But because DDR is a parallel interface, it's difficult to increase the number of CPU pins to connect more memory. CXL therefore also provides a way to increase memory capacity. This specification is expected to be pivotal for the next generation of servers in the AI era. I talked about its summary at last year's OSSJ; please see the slide linked below.

So, CXL memory is expected to meet the following requirements. The first one is increased memory capacity. Many software applications, such as AI, machine learning, HPC workloads, and databases, demand a huge amount of memory, and CXL allows connecting large-capacity memory for such software. The next one is increased memory bandwidth. Memory bandwidth sometimes becomes a bottleneck, so CXL offers additional memory bandwidth. The third one is more efficient resource utilization. Users want to use resources more efficiently and improve system scalability, and CXL provides a way to create memory pools accessible by multiple hosts. In addition, software and hardware support for CXL memory devices is more advanced than support for GPGPUs and SmartNICs. GPGPUs and other devices require vendor-specific implementation, whereas memory expansion benefits from common development within the CXL specification. Therefore, CXL memory is expected to be the first kind of CXL device released.

However, there is currently a significant problem with CXL. What is required to use CXL? Obviously, software and hardware. For hardware, a CPU with CXL support, which includes a CXL host bridge, is necessary. A CXL-capable platform is also necessary: CXL-capable PCI Express connections, CXL switches, and firmware with CXL support. And officially released CXL devices are also necessary. Unfortunately, officially released CXL devices are not yet available. Currently, only sample devices are provided, and only a few people can use them. However, many software developers need to prepare for the release day: OS driver developers, Linux kernel developers, management software developers, middleware developers, and so on. Additionally, researchers need to investigate CXL's characteristics. So, a CXL hardware emulation environment is crucial for such preparation.

I think QEMU is useful for CXL memory emulation. So, what is QEMU? QEMU is an open-source machine emulator. It can emulate the same or a different architecture and platform on your machine. It handles memory and I/O accesses and emulates them as needed. If you use KVM virtual machines on Linux, you are already using a part of QEMU. KVM utilizes hardware-assisted virtualization features, such as Intel VT-x, to detect and intercept CPU instructions, I/O accesses, or memory accesses that must not be executed directly by the VM guest, and QEMU emulates these actions instead. So, why is QEMU useful for CXL emulation? Many CXL emulation features are already implemented in QEMU, and its development is very active. It supports not only a simple connection but also complex CXL memory configurations. This makes it an excellent environment for software developers and researchers.

There are three problems with using QEMU CXL emulation. First, the information provided is insufficient for average users like me. While there is a document for CXL emulation on the QEMU official site, it seems to be written primarily for experts. Second, the necessary information is scattered across various places and requires extensive searching. The QEMU site describes only QEMU's options, but you need to know how to use the cxl command in the guest OS, too. Third, you may need to examine the source code or Git log of related software (QEMU, the cxl command, the kernel, etc.) to find the cause of an error. So today, I'll talk about how to emulate, configure, and use a CXL memory device environment, with many tips and my recommendations.

Next is the summary of QEMU CXL emulation. Here are the use cases of QEMU CXL emulation: what you can do and what you cannot do. The first use case is studying how CXL devices are presented in Linux. The next one is studying the behavior of new Linux kernel features for CXL memory, for example, memory tiering, weighted interleave, etc. The next one is developing and testing CXL drivers, commands, and the Linux kernel itself. Actually, a member of my team found a race condition bug in the memory hotplug feature of the Linux kernel using the emulation environment. The next one is developing upper-layer software for CXL: middleware, orchestrators, and any other new-generation software. As for what you cannot do, the first one is performance evaluation. Emulation cannot reproduce the bandwidth and latency of a real CXL memory device. Additionally, memory access speed in the emulated environment is significantly slower than in a typical KVM guest environment. Also, you should not expect perfect emulation. This emulation is still under development, some features have not been implemented yet, and there are still some limitations. For example, a bug currently requires users to fully shut down and start the guest OS instead of using a simple reboot command.

So, what features can currently be emulated? Basically, the CXL 2.0 specification or later is supported. Making a generic emulation of CXL 1.1 was difficult for some reasons, so it was skipped. CXL memory can be emulated; both volatile memory and persistent memory are available. Other devices, like GPGPUs, are not emulated yet. You can also configure more complex connections: many memory devices can be connected via CXL switches, and memory interleaving configuration is also available. Hotplug of an entire CXL memory device is available, based on the CXL 2.0 specification. However, I will not talk about it today because there is not enough time. Hotplug of a part of a CXL memory device might also be available; it is based on the dynamic capacity device feature of the CXL 3.0 specification. Unfortunately, I have not tried it yet, so I will not talk about it either. So, today I will talk about the features marked as "yes" in this table.

So, to make an emulation environment, some prerequisite knowledge is required. To follow this guide, you will need to know the following: how to operate a KVM virtual machine; how to download the QEMU upstream code, build it, and start it; and how to download, compile, and install the upstream Linux kernel. You need to change the kernel build config options, because the default kernel provided by distributors may not work well in the CXL emulation environment, and you need to build and install the kernel on the guest side of the emulation environment. Today, I will not talk about the details of the above. If you need them, please check them later. And of course, basic CXL knowledge is required. A summary of the relevant CXL components and concepts is provided in the next section.

So, this is basic information on CXL. Here are the key hardware components of CXL. QEMU can emulate the following hardware components; in other words, you need to specify them to start your emulation environment. The first one is the host bridge, the root of the connection tree for CXL devices, equivalent to the host bridge in PCI Express. Typically, it's integrated into the CPU. The next one is the root port, which is the downstream connection port of the host bridge. In the emulation settings, it is used to define how CXL objects are connected. The next one is a CXL switch, which connects an upstream port to multiple downstream ports. Its connection information is also used for emulation. And the blue "HDM" mark is an HDM decoder. Its role is the translation between the host physical address and the device physical address, which is especially important for the memory interleave feature. At a minimum, you need to find the root decoder, which is included in the host bridge, to use CXL memory.

Next, what you need to know is the concept of a region. To use a CXL memory device on Linux, you need to know about regions. A region is an area allocated from a part of a CXL memory device or from multiple memory devices that were configured for interleaving. In QEMU CXL emulation, you need to configure one or more regions to use the CXL memory devices after your guest OS boots. I'll explain how to configure it in this talk.

So, let's prepare the emulation environment. First, you need to make a guest image for QEMU CXL emulation. Please prepare a bare-metal machine for your QEMU emulation environment. If you can use virt-install, Cockpit, or virt-manager with KVM, that is an easy way to make your guest image. Create one guest image and install a Linux OS, and ensure there is enough storage to build the kernel on the guest. After installing the VM guest, I recommend configuring the serial console of the guest, because its output will be displayed on the QEMU console. After the VM guest has started, check and record the options specified for the QEMU command that is executed for your guest. Just enter ps auxw | grep qemu to check them. Most of these QEMU command options are reused for the emulation environment: CPU, DRAM, storage, and network settings are necessary. Here's an example from my environment.
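As a small aid for the recording step above, the captured command line can be split into one option per line, which makes it easier to paste into a start script later. This is only a sketch; the sample command line is hypothetical and far shorter than what libvirt typically generates, and the function name is made up for illustration.

```python
import shlex

def split_qemu_cmdline(cmdline: str) -> list[str]:
    """Group a recorded QEMU command line into 'option value' strings."""
    tokens = shlex.split(cmdline)
    groups = [tokens[0]]              # the QEMU binary itself
    for tok in tokens[1:]:
        if tok.startswith("-"):
            groups.append(tok)        # a new option starts here
        else:
            groups[-1] += " " + tok   # a value belonging to the previous option
    return groups

if __name__ == "__main__":
    # Hypothetical, shortened example of what `ps auxw | grep qemu` might show.
    sample = ("/usr/bin/qemu-system-x86_64 -machine q35,accel=kvm "
              "-m 8192 -smp 4 -enable-kvm")
    for line in split_qemu_cmdline(sample):
        print(line)
```

Note that shlex removes shell quoting, so values that contain spaces would need to be quoted again before they go into a script.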

Next, you need to prepare a new kernel that can work in the CXL emulation environment, and build and install it on your VM guest. At a minimum, the following config options must be enabled when you build the new kernel. They must be enabled even in a real CXL environment. In addition, I recommend enabling the following config option for memory hotplug. I'll explain the reason later. Additionally, the following config option is essential only for the CXL emulation environment: CONFIG_CXL_REGION_INVALIDATION_TEST. If this is not enabled, some operations on the guest OS may fail with errors. To be honest, I don't understand the details of the reason, but the Git log has a description about cache write-back and invalidation for regions. So if you want to know more details, please check this commit. This option is probably disabled in distributors' kernels. Since many developers keep enhancing the kernel and the CXL driver, using a newer kernel is better.
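A quick way to check a kernel .config before building is sketched below. The list of basic CXL options is an assumption: the names are Kconfig symbols from the kernel's CXL driver, but they are not copied from the slides, and option names can change between kernel versions. The memory hotplug and invalidation-test options are the two named in this talk.

```python
import re

# Assumed basic CXL options (not taken from the slides).
REQUIRED = ["CONFIG_CXL_BUS", "CONFIG_CXL_PCI", "CONFIG_CXL_ACPI",
            "CONFIG_CXL_MEM", "CONFIG_CXL_PORT", "CONFIG_CXL_REGION"]
# Recommended in the talk to avoid manual memory hotplug steps.
RECOMMENDED = ["CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE"]
# Needed only in the emulation environment, according to the talk.
EMULATION_ONLY = ["CONFIG_CXL_REGION_INVALIDATION_TEST"]

def enabled_options(config_text: str) -> set[str]:
    """Return the options set to y or m in the text of a kernel .config."""
    return set(re.findall(r"^(CONFIG_\w+)=[ym]$", config_text, flags=re.M))

def missing_options(config_text: str) -> list[str]:
    """List the wanted options that are not enabled."""
    on = enabled_options(config_text)
    return [o for o in REQUIRED + RECOMMENDED + EMULATION_ONLY if o not in on]

if __name__ == "__main__":
    # Hypothetical .config fragment for demonstration.
    sample = "CONFIG_CXL_BUS=y\nCONFIG_CXL_PCI=m\n# CONFIG_CXL_ACPI is not set\n"
    print("\n".join(missing_options(sample)))
```

In practice the text would come from the .config file in the kernel build tree on the guest.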

Next is getting and building the cxl command on your guest. The cxl command is used for configuring and checking the status of CXL devices. A newer version of this command is better for CXL emulation due to bug fixes and new features. The newest version is v80, which was released this month. I recommend downloading and building it from GitHub. Its source code is included in the ndctl repository; ndctl is a command for persistent memory. After preparing the kernel and the cxl command in your guest image, you can shut down the KVM guest OS. To start the CXL emulation environment, you need to start QEMU with different options compared to the normal KVM setup.

Finally, you need to download and build the newest QEMU. As mentioned before, CXL emulation development is very active, so a newer QEMU has newer CXL features and bug fixes. Since the version of QEMU included in distributions is relatively old, I recommend downloading newer QEMU source code from the official site and building it on your bare-metal machine. Unlike the Linux kernel, QEMU needs no special care when building. However, it might be better not to install the built QEMU binary, to avoid conflicts with the distributor's package. So, I recommend executing the built QEMU command from your home directory.

Next is how to start the QEMU CXL emulation environment. I'll show you three examples of starting QEMU. The first one is a simple connection of CXL volatile memory: a host bridge, a root port, and a single CXL memory device. The next one uses persistent memory instead. And the third one is a complex connection with root ports, two CXL switches, and four CXL memory devices.

This is a simple connection of CXL volatile memory. QEMU defines various devices through its command options, and the same applies to emulated CXL devices. To specify them easily, I recommend creating a simple shell script to start QEMU. Here is an example of a simple CXL memory setup, as shown in this figure: one host bridge, one root port, and one CXL volatile memory device. The text in black font can remain the same as when you ran your VM guest. The text in red font indicates what you need to add or modify for emulation. Since there are many new options, I'll describe the meaning of each option from the next page.

These two lines are used to create a CXL Type 3 memory device. The first one is the definition of a back-end device for memory emulation. 'memory-backend-ram' means that the DDR RAM of the bare-metal server is used as the back-end for the emulated device. 'size' is the allocation size, which is also the emulated memory size. 'id' is a unique name for this back-end. The next line is the definition of a CXL memory device. 'cxl-type3' means a CXL memory device. 'bus=' specifies which port this device is connected to. 'volatile-memdev' specifies the back-end ID. And 'id' is a unique name for this CXL memory device.
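The two definitions described above can be sketched as a helper that returns the corresponding QEMU arguments. The option keywords are the ones named in the talk; the IDs, the size, and the function name are illustrative assumptions, not values from the slides.

```python
def volatile_memdev_args(backend_id: str, dev_id: str, bus: str, size: str) -> list[str]:
    """Return the two option pairs that define one volatile CXL Type 3 device."""
    return [
        # Back-end: host RAM that backs the emulated device.
        "-object", f"memory-backend-ram,id={backend_id},size={size}",
        # Front-end: the CXL memory device, attached to the given port.
        "-device", f"cxl-type3,bus={bus},volatile-memdev={backend_id},id={dev_id}",
    ]

if __name__ == "__main__":
    # Illustrative IDs and size; 'root_port13' is the root port ID used in the talk.
    print(" ".join(volatile_memdev_args("vmem0", "cxl-vmem0", "root_port13", "256M")))
```

The back-end ID is passed in once and used in both definitions, because 'volatile-memdev' has to refer to the 'id' of the back-end.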

Next is the definition of the CXL host bridge and root port. This one is the host bridge. In this example, the bus number of the host bridge is 12, it is connected under PCI Express bus 0, and 'cxl.1' is the ID of this host bridge. And here is the definition of the root port of the host bridge: 'cxl-rp' with 'port=0' specifies the port number, and 'bus=cxl.1' indicates that this port belongs to the 'cxl.1' host bridge. 'id=root_port13' is the ID of this root port; devices under it are on bus 13 in this example. 'slot' and 'chassis' are mandatory, and the combination must be unique for each port. In practice, change the slot number for each one. And of course, 'cxl=on' must be specified for CXL emulation.
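A minimal sketch of these definitions follows, with a check that no chassis and slot pair is reused. The device name 'pxb-cxl' and the 'q35' machine type come from QEMU's CXL documentation rather than from this talk, and the chassis and slot numbers are illustrative assumptions.

```python
def machine_args(machine: str = "q35") -> list[str]:
    """Machine option with CXL emulation switched on."""
    return ["-M", f"{machine},cxl=on"]

def host_bridge_args(hb_id: str = "cxl.1", bus_nr: int = 12) -> list[str]:
    """CXL host bridge placed under PCI Express bus 0 (pcie.0)."""
    return ["-device", f"pxb-cxl,bus_nr={bus_nr},bus=pcie.0,id={hb_id}"]

def root_port_args(hb_id: str, ports: list[dict]) -> list[str]:
    """Root ports under one host bridge; rejects a reused chassis/slot pair."""
    seen = set()
    args = []
    for p in ports:
        key = (p["chassis"], p["slot"])
        if key in seen:
            raise ValueError(f"chassis/slot {key} is used twice")
        seen.add(key)
        args += ["-device",
                 f"cxl-rp,port={p['port']},bus={hb_id},id={p['id']},"
                 f"chassis={p['chassis']},slot={p['slot']}"]
    return args

if __name__ == "__main__":
    # Illustrative chassis/slot numbers.
    ports = [{"port": 0, "id": "root_port13", "chassis": 0, "slot": 2}]
    print(" ".join(machine_args() + host_bridge_args() + root_port_args("cxl.1", ports)))
```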

This line is for firmware emulation and reserves guest physical address space for CXL. 'cxl-fmw' is used to emulate the CXL Fixed Memory Window Structures (CFMWS), which describe the physical address map of CXL memory from the firmware to the OS. In this example, 'targets.0' specifies the CXL host bridge that this window targets. And 'cxl-fmw.0.size' must be specified with a size larger than the total size of the CXL devices that belong to this window. In this example, 4GB is reserved.
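The size rule can be checked before QEMU is started. The sketch below builds the fixed memory window option and refuses a window that is not larger than the sum of the device sizes, following the rule as stated in this talk. The helper names and the simple size parser (an integer followed by K, M, G, or T) are assumptions for illustration.

```python
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}

def to_bytes(size: str) -> int:
    """Convert a size such as '256M' or '4G' to bytes."""
    return int(size[:-1]) * UNITS[size[-1].upper()]

def fmw_args(hb_id: str, window_size: str, device_sizes: list[str],
             index: int = 0) -> list[str]:
    """Build the cxl-fmw machine option after checking the window size."""
    total = sum(to_bytes(s) for s in device_sizes)
    if to_bytes(window_size) <= total:
        raise ValueError(
            f"window {window_size} must be larger than the devices ({total} bytes)")
    return ["-M", f"cxl-fmw.{index}.targets.0={hb_id},"
                  f"cxl-fmw.{index}.size={window_size}"]

if __name__ == "__main__":
    # One 256M device behind host bridge 'cxl.1', with a 4G window.
    print(" ".join(fmw_args("cxl.1", "4G", ["256M"])))
```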

Finally, this is the most important thing: you need to disable the hardware assist for virtualization. If you don't, it can cause unpredictable issues in the emulation. In our case, an illegal instruction error occurred in the emulation environment until we disabled it. Since CXL memory is interleaved at granularities as fine as 64 bytes, the emulator needs to translate read and write accesses at this size. However, hardware assist can only work at the page size of the architecture, such as 4 kilobytes or more. So, 'accel=tcg' should be specified instead of 'accel=kvm', and '-enable-kvm' should be removed. Though this is not written on the QEMU CXL documentation page, this patch commit provides more details and the reasons.
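When the start script is derived from a recorded KVM command line, the switch to TCG can be applied mechanically. The following is a sketch with a hypothetical function name; it only handles the spellings shown in the comments.

```python
def use_tcg(args: list[str]) -> list[str]:
    """Return a copy of a QEMU argument list with KVM acceleration replaced by TCG."""
    out: list[str] = []
    for arg in args:
        if arg in ("-enable-kvm", "--enable-kvm"):
            continue                          # drop the KVM switch entirely
        if out and out[-1] in ("-accel", "--accel") and arg == "kvm":
            arg = "tcg"                       # "-accel kvm" becomes "-accel tcg"
        # "-machine q35,accel=kvm" becomes "-machine q35,accel=tcg"
        out.append(arg.replace("accel=kvm", "accel=tcg"))
    return out

if __name__ == "__main__":
    before = ["-machine", "q35,accel=kvm", "-enable-kvm", "-m", "8G"]
    print(" ".join(use_tcg(before)))
```

A recorded command line may also contain '-cpu host', which requires KVM; this sketch does not rewrite it, so the CPU model has to be changed by hand.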

If the boot succeeds, the terminal running QEMU displays the OS console as shown below. You can use this console, or log in from another terminal via SSH.

Here is an example of the persistent memory type. For persistent memory, 'memory-backend-file' is better than RAM as the back-end, because a file maintains data persistency. If the file already exists, its size must match the size specified in this line; if it does not match the actual file size, QEMU will fail to start. In this example, the file size must be 256MB. 'share=on' is recommended to save your data as persistent memory. If it's 'off', the written data will not be applied to the backing file. In addition, according to the specification, persistent memory has a label storage area, which stores the settings used when the internal area of the device is divided for use. To emulate it, you also need to add its definition with another 'memory-backend-file'.
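Because a size mismatch stops QEMU from starting, the backing files can be checked, or created at the exact size, beforehand. This is a sketch: the file paths, the IDs, and the label storage area size are illustrative assumptions, and 'persistent-memdev' and 'lsa' are the property names used in QEMU's CXL documentation.

```python
import os

def ensure_backing_file(path: str, size_bytes: int) -> None:
    """Create the file at the exact size, or fail if an existing file differs."""
    if os.path.exists(path):
        actual = os.path.getsize(path)
        if actual != size_bytes:
            raise ValueError(f"{path} is {actual} bytes, expected {size_bytes}")
        return
    with open(path, "wb") as f:
        f.truncate(size_bytes)        # sparse file of the requested size

def pmem_args(mem_path: str, lsa_path: str, size: str = "256M",
              lsa_size: str = "256M", bus: str = "root_port13") -> list[str]:
    """Back-end files for data and labels, plus the persistent Type 3 device."""
    return [
        "-object", f"memory-backend-file,id=cxl-mem1,share=on,mem-path={mem_path},size={size}",
        "-object", f"memory-backend-file,id=cxl-lsa1,share=on,mem-path={lsa_path},size={lsa_size}",
        "-device", f"cxl-type3,bus={bus},persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0",
    ]

if __name__ == "__main__":
    # Hypothetical paths; only the argument list is printed here.
    print(" ".join(pmem_args("/tmp/cxltest.raw", "/tmp/lsa.raw")))
```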

The next one is an example of CXL switch emulation, with which you can create a more complex environment. Here is an example where four CXL memory devices are connected via two CXL switches. A CXL switch definition consists of one upstream port and multiple downstream ports. In this example, two switches are defined, each as an upstream port with its downstream ports, together with two root ports, and two memory devices are connected to each switch. So, you can create such a complex configuration.
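Writing such a topology by hand is error-prone, mainly because every port needs a unique slot. A sketch that generates the arguments for the layout described above (two root ports, two switches, two volatile memory devices per switch) could look like this. The device names 'cxl-upstream' and 'cxl-downstream' come from QEMU's CXL documentation; all IDs and sizes are illustrative assumptions.

```python
from itertools import count

def switch_topology_args(n_switches: int = 2, devs_per_switch: int = 2,
                         size: str = "256M", hb_id: str = "cxl.1") -> list[str]:
    """Generate host bridge, root ports, switches, and memory devices."""
    slot = count(0)        # hands out a unique slot number to every port
    dev = count(0)         # numbers the memory devices
    args = ["-device", f"pxb-cxl,bus_nr=12,bus=pcie.0,id={hb_id}"]
    for s in range(n_switches):
        # One root port per switch, then the switch's upstream port.
        args += ["-device", f"cxl-rp,port={s},bus={hb_id},id=root_port{s},"
                            f"chassis=0,slot={next(slot)}"]
        args += ["-device", f"cxl-upstream,bus=root_port{s},id=us{s}"]
        for p in range(devs_per_switch):
            dsp = f"swport{s}_{p}"
            n = next(dev)
            # One downstream port per memory device.
            args += ["-device", f"cxl-downstream,port={p},bus=us{s},id={dsp},"
                                f"chassis=0,slot={next(slot)}"]
            args += ["-object", f"memory-backend-ram,id=vmem{n},size={size}"]
            args += ["-device", f"cxl-type3,bus={dsp},volatile-memdev=vmem{n},"
                                f"id=cxl-vmem{n}"]
    return args

if __name__ == "__main__":
    a = switch_topology_args()
    for opt, value in zip(a[0::2], a[1::2]):
        print(opt, value)
```

The machine options ('cxl=on' and the fixed memory window) are still needed in addition to these arguments.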

After boot, you need to operate on the guest OS side. What you need to do first is confirm the presence of the CXL memory, and then operate on regions. To use a CXL memory device, you need to configure one or more regions from the guest OS side. I'll show you three examples of region creation, matching the examples of QEMU options: a simple connection, persistent memory, and a complex connection.

Here is a simple example. You can find CXL memory devices and decoders with the cxl list command and these options: -M lists all memory devices, -D lists all decoders, and -u shows some data in human-readable format. You can find the names of the memory device and the root decoder in this example. These names need to be specified to create a region.
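Since the cxl command prints JSON, the names can also be picked out by a script. The sketch below walks the parsed output recursively and collects every string stored under a given key. The sample output is hypothetical and heavily shortened, and the key names "memdev" and "decoder" are assumptions about the output format that should be checked against the installed cxl version.

```python
import json

def collect(node, key: str) -> list[str]:
    """Collect all string values stored under `key` anywhere in parsed JSON."""
    found = []
    if isinstance(node, dict):
        if isinstance(node.get(key), str):
            found.append(node[key])
        for value in node.values():
            found += collect(value, key)
    elif isinstance(node, list):
        for value in node:
            found += collect(value, key)
    return found

if __name__ == "__main__":
    # Hypothetical, shortened output of `cxl list -M -D -u`.
    sample = """
    [
      {"memdevs": [{"memdev": "mem0", "ram_size": "256.00 MiB (268.44 MB)"}]},
      {"root decoders": [{"decoder": "decoder0.0"}]}
    ]
    """
    data = json.loads(sample)
    print("memdevs :", collect(data, "memdev"))
    print("decoders:", collect(data, "decoder"))
```

The recursive walk is used so that the sketch does not depend on how deeply the objects are nested.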

To create a region, specify the root decoder, the memory device name, and the memory usage type. '-t ram' means that this region is used as volatile memory. The OS and user applications can use the emulated CXL memory after this command completes. However, if the kernel of the guest OS is not compiled with the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE option, memory hotplug operations are required after the region creation. This requires knowledge of memory hotplug, and it is a bit troublesome for a first trial of CXL emulation. This is why I recommend this option.
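For scripting this step, the command can be assembled as an argument list. This is a sketch only: the decoder and memory device names are typical examples and must be replaced with the names found on the guest, and the '-m' and '-d' options are taken from the cxl create-region man page rather than from the slides.

```python
def create_region_cmd(decoder: str, memdevs: list[str],
                      region_type: str = "ram") -> list[str]:
    """Build a `cxl create-region` command line as an argument list."""
    if region_type not in ("ram", "pmem"):
        raise ValueError("region_type must be 'ram' or 'pmem'")
    if not memdevs:
        raise ValueError("at least one memory device is required")
    # -m: the trailing targets are memdev names; -d: root decoder; -t: type.
    return ["cxl", "create-region", "-m", "-d", decoder, "-t", region_type, *memdevs]

if __name__ == "__main__":
    # Example names only; check the real ones on the guest first.
    print(" ".join(create_region_cmd("decoder0.0", ["mem0"])))
```

The resulting list can be passed to subprocess.run(..., check=True) on the guest, with root privileges.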

You can confirm that the new memory is added and that a node is created by using the free command and the numactl command. In this example, an 8GB CXL memory region is created in an environment with two 8GB DDR DRAM NUMA nodes. The total memory size increases from here to here: 8GB is added, and a new node is created as a CPU-less NUMA node.
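The same check can be done from sysfs, where each NUMA node has a cpulist file that is empty for a CPU-less node. The following sketch separates the file reading from the decision so that the logic can be tried without a CXL guest; the function names are made up for illustration.

```python
from pathlib import Path

def read_cpulists(root: str = "/sys/devices/system/node") -> dict[str, str]:
    """Read the cpulist file of every NUMA node directory under sysfs."""
    return {p.name: (p / "cpulist").read_text()
            for p in Path(root).glob("node[0-9]*")}

def cpuless_nodes(cpulists: dict[str, str]) -> list[str]:
    """Return the nodes whose cpulist is empty, i.e. memory-only nodes."""
    return sorted(name for name, cpus in cpulists.items() if cpus.strip() == "")

if __name__ == "__main__":
    # Hypothetical contents for two DRAM nodes and one CXL memory node.
    sample = {"node0": "0-3\n", "node1": "4-7\n", "node2": "\n"}
    print(cpuless_nodes(sample))
```

On the guest, cpuless_nodes(read_cpulists()) should list the node that appeared after the region was created.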

The next one is the persistent memory operation. For this example, you may currently need to clear the label storage area first, because the CXL driver's LSA support is inadequate. The namespace creation operation described below may not work unless this is done. To clear it, the ndctl zero-labels command can be used as a workaround. For persistent memory, you need to specify pmem for the -t option of create-region. If you specify the size option, you can use a portion of the device. In this example, a 256MB region is created in a 2GB memory device. Because persistent memory is storage-like, this feature is useful when you need different users to access different parts of the device. After region creation, you can create a namespace in the region with the ndctl command, as with Optane persistent memory.

The next one is how to create an interleaved region. When you want to create an interleaved region, you need to specify all of the member memory devices. If all the memory devices are connected under one CXL switch, it's easy; you just need to specify the names of all the devices.

However, if you want to create an interleaved region across multiple switches, it's currently a bit more difficult. Please note that you must specify the memory devices in the correct order. If you specify a memory device under one switch first, you need to specify a memory device under the other switch next. In other words, the CXL switches must alternate in the order. The order of the memory devices under each switch does not matter. To confirm the connections, this command will be useful. It can display the topology of all the memory devices and switches in JSON format.
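The ordering rule amounts to a round-robin over the switches. A sketch of it is shown below; the mapping from switches to memory device names is hypothetical and has to be read from the topology output on the guest, because the memN numbering does not necessarily follow the order of the QEMU options.

```python
def interleave_order(by_switch: dict[str, list[str]]) -> list[str]:
    """Order memdevs so that consecutive entries sit under different switches."""
    groups = list(by_switch.values())
    if len({len(g) for g in groups}) > 1:
        raise ValueError("every switch must contribute the same number of memdevs")
    # zip takes one memdev from each switch per round: s0, s1, s0, s1, ...
    return [memdev for round_ in zip(*groups) for memdev in round_]

if __name__ == "__main__":
    # Hypothetical topology: two switches with two memory devices each.
    topology = {"switch0": ["mem0", "mem1"], "switch1": ["mem2", "mem3"]}
    print(interleave_order(topology))
```

The resulting list is the order in which the memory device names would be given to the region creation command.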

So, conclusion: I have introduced how to use QEMU CXL emulation. To make it easier for everyone to use, I showed many tips and my recommendations. I hope this is a good start for you. There are more features in QEMU CXL emulation, such as hotplug, that I couldn't introduce today. So, please refer to the official information on the QEMU official site, the man pages of the cxl command, and blog posts about it, and talk with the members of this community. I really hope that this will increase the number of software engineers interested in CXL. Please try CXL emulation on your machine. Thank you very much.

Any questions? I uploaded these slides to Speaker Deck and, of course, to the Open Source Summit Japan site, so you can refer to them after returning to your home or office. Thank you very much.