YouTube: https://www.youtube.com/watch?v=uMpfg04GeZw
Text:
Hi, good afternoon. My name is Seth Friedman. I am the CEO and co-founder of Liquid Markets Solutions. At LMS, we build high-performance network edge devices that are used primarily in the financial services sector, in particular in securities trading. We've implemented a product called UberNIC, a general-purpose Ethernet adapter built using a lot of new technology. It is the first general-purpose Ethernet adapter built entirely on an FPGA chip rather than an ASIC, and it implements the network stack wholly in the FPGA. I'm here today to talk to you about our innovative use of CXL in an uncommon use case implementation: UberNIC.
As I said, UberNIC is built as a general-purpose Ethernet adapter. It is a high-performance Ethernet adapter meant to meet the needs of ultra-low-latency securities trading on one hand and the ultra-high-capacity data-moving needs of hyperscalers on the other. Because it is built using an FPGA chip, we can create an enormous amount of custom functionality and optimize where that functionality sits relative to the processing that's occurring. UberNIC is a sub-microsecond Ethernet adapter for certain types of data, within certain throughput rates and payload sizes. Because it is implemented on an FPGA, it is also a programmable smart NIC: an end user can put actual application logic on the FPGA chip, commingled with our network stack and some of our other IP. It has hyperscale capacity: UberNIC supports 1, 10, and 25G, with or without FEC, as well as 40, 50, 100, 200, and 400G, all from the same physical device. With things like PCIe and CXL bridging and switching, we can also implement composable infrastructure I/O, allowing UberNIC to serve as a high-performance network I/O conduit to other PCIe pluggables, such as GPUs and other CXL devices. UberNIC also implements the very latest in precision time capabilities: we implement the 2019 version of PTP, and we also implement White Rabbit, created by CERN, the physics research laboratory in western Switzerland, which allows for sub-nanosecond, picosecond-level granularity in measuring link delays on fiber networks. In terms of synchronizing the host, we implement Precision Time Measurement (PTM) and something called TGPIO-PPS, which allow us to measure delays through to the host and to fine-tune and optimize the time values passed across the PCIe bus to the host real-time clock, allowing us to achieve server-to-UberNIC time synchronization down to single-digit picoseconds. In fact, a server can be synchronized to UberNIC to as low as 1.65 picoseconds, and UberNIC is, of course, synchronized to a Grandmaster, which is synchronized to GNSS. We also implement lossless line-rate packet capture, and as mentioned with White Rabbit, we will improve from the current nanosecond-level time synchronization to picosecond-level synchronization.
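[Editor's illustration] As a rough sketch of the arithmetic the standard PTP two-way exchange performs (a generic example, not LMS code, and it does not model White Rabbit's asymmetry correction), the clock offset and path delay are derived from the four exchange timestamps as follows:

```python
# Minimal sketch of the standard IEEE 1588 (PTP) two-way time exchange.
# t1: master sends Sync, t2: slave receives Sync,
# t3: slave sends Delay_Req, t4: master receives Delay_Req.
# Assumes a symmetric path; White Rabbit refines this by measuring
# link asymmetry at picosecond granularity (not modeled here).

def ptp_offset_and_delay(t1: float, t2: float, t3: float, t4: float):
    """Return (slave clock offset, one-way path delay) in seconds."""
    offset = ((t2 - t1) - (t4 - t3)) / 2.0
    delay = ((t2 - t1) + (t4 - t3)) / 2.0
    return offset, delay

# Example with made-up timestamps:
offset, delay = ptp_offset_and_delay(t1=0.0, t2=1.50e-6, t3=3.00e-6, t4=4.40e-6)
print(f"offset ≈ {offset * 1e9:.1f} ns, path delay ≈ {delay * 1e9:.1f} ns")
```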
UberNIC itself, again, why does this matter in terms of CXL? In financial services, every second, microsecond, or in this case nanosecond matters. More performance, more responsiveness matters. So being able to save any amount of time in communicating messages across the PCIe bus is beneficial to a financial services use case. Again, as mentioned, UberNIC is built using an FPGA chip. In addition to PCIe 5 (and, backwards-compatibly, 4 and 3), it supports CXL 1.1 and 2.0. With the forthcoming 6th-generation Xeons, UberNIC will support the CXL 2.0 standard, and you will be able to see UberNIC on Xeon 6 servers at a variety of events coming up in the fall, in particular at SC24 in Atlanta. UberNIC has many other features, including significant fiber density, with a maximum of eight 1, 10, or 25G fiber pairs on the single-QSFP version of the product, and up to sixteen fiber pairs on the dual-QSFP version.
We all understand FPGAs are flexible. It's important to keep in mind why: at a high level, we have the flexibility and reprogrammability of software with the outright performance of hardware. In our case, with a fully hardware network stack, all of that functionality and capability is operating literally in the network edge.
Now, one of the key aspects of UberNIC is complete network stack offload. This is not a TCP offload engine; this is not partial offload. Every single aspect of the network stack is running in the FPGA. That gives us additional advantages and empowers our clients by centralizing processing in the FPGA. No longer are full network frames, with their headers, moving back and forth across the PCIe bus. Now it is simply the message payload, with perhaps some network header fields based upon client requirement, so we make much more efficient use of the PCIe bus. Imagine a 64-byte UDP payload, which is a typical financial services message size. With a normal software network stack, that payload plus 46 bytes of headers and FCS is a 110-byte frame, and in a typical software network stack environment it's that 110-byte frame in its entirety that moves back and forth across the PCIe bus. That uses more bandwidth, consumes more flow control credits, and therefore has a corresponding limit and decrease in throughput capability relative to the UberNIC implementation, where the network stack runs solely in the FPGA chip and only the payload, plus select headers as chosen by a particular client, moves across the PCIe bus. In a financial services use case where we need to send or receive massive amounts of relatively small payloads, the reduction in data size and the corresponding reduction in flow control credit usage can be 40% or more, massively increasing the overall throughput rate on the host side through more efficient use of the resources that are present. In addition to reducing the amount of data that goes across the PCIe bus, there is one interesting aspect of TCP operations: acknowledgements never go across the PCIe bus at all, significantly reducing PCIe bus utilization.
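[Editor's illustration] To make the arithmetic concrete, here is a small back-of-the-envelope sketch (my own, not LMS code) comparing the bytes that cross the bus per message for a full frame versus a payload-only transfer, using the 46-byte overhead figure quoted above:

```python
# Per-message PCIe traffic: full frame (software stack) vs payload-only
# (stack offloaded to the FPGA). Overhead figure from the talk:
# 14 (Ethernet) + 20 (IPv4) + 8 (UDP) + 4 (FCS) = 46 bytes.

PAYLOAD = 64                  # typical financial-services UDP payload, bytes
OVERHEAD = 14 + 20 + 8 + 4    # headers + FCS

full_frame = PAYLOAD + OVERHEAD   # what a software stack moves across the bus
payload_only = PAYLOAD            # what the offloaded stack moves (headers stay on the FPGA)

reduction = 1 - payload_only / full_frame
print(f"full frame: {full_frame} B, payload only: {payload_only} B")
print(f"per-message PCIe data reduction: {reduction:.0%}")  # ~42%, consistent with "40% or more"
```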
Skip a couple of slides around precision time, and let's focus on performance, which is really what we're here to talk about. What's the benefit of CXL?
Well, in May, we completed what are called STAC-N1 audits with an organization called STAC. STAC has existed for more than 12 years; it operates at the intersection of financial services and technology and conducts third-party benchmark testing of different systems to validate what vendors say about their products. UberNIC was subjected to various STAC-N1 tests in May. It is the first wholly new network card and network stack combination tested since at least 2017. In that testing, UberNIC was tested with both its 10G and 25G (without forward error correction) network stacks, and a PCIe bus transfer mechanism built using CXL. Not PCIe, but CXL 1.1.

Why do we use CXL? Because it can help us reduce latency. Interestingly, not on the entire path. Our implementation is CXL.io on the path from FPGA to host. There are technical reasons why that gives us better latency than trying to use CXL.mem or CXL.cache. But as you will understand, CXL.io, with a PCIe 5 foundation, has literally the exact same performance as PCIe, so on the receive path there is no latency savings at all. However, on the path from host to FPGA, we implement CXL.mem as a Type 3 solution. As you'll see in a couple of slides, the overall in-and-out round trip is roughly 5% lower latency than PCIe. But that is based on the total round trip, and the savings we gain is only on the host-to-FPGA leg. It's that one half that is ultimately responsible for the total 5% savings, meaning that one leg is improving performance by more than 10 to 12%, depending upon throughput rate. So it actually offers a significant latency improvement.

And so, in May, we completed the STAC-N1 audits with STAC, and for a brand new product, UberNIC has only been available since late March, we set a number of records, beating every previous STAC-N1 result. For both 10G and 25G, UberNIC achieved the lowest minimum, 99th percentile, and maximum latency at 100,000 messages per second and 1 million messages per second, with message sizes of both 66-byte and 264-byte payloads. We have the lowest standard deviation of latency: STAC only measures to one decimal place, so at 0.1 microseconds, or 100 nanoseconds, we set the lowest official standard deviation at 1 million messages per second. UberNIC is also the first system tested by STAC to achieve a maximum latency of less than 10 microseconds, at both 100,000 messages per second and 1 million messages per second. And compared to the previous next-best results, UberNIC provides up to 67% lower latency for 10 gigabit communication and up to 59% lower latency for 25 gigabit per second communication. Again, this is as tested and validated by STAC Research, a third-party organization.
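[Editor's illustration] A quick sanity check on the round-trip arithmetic (my own sketch, with made-up leg latencies, not measured LMS numbers): if only the host-to-FPGA leg improves, a ~5% round-trip saving implies roughly a 10% or greater improvement on that single leg.

```python
# If CXL improves only one leg of the round trip, how big must the one-leg
# improvement be to yield ~5% overall? Leg latencies are illustrative placeholders.

rx_leg = 500.0   # FPGA -> host via CXL.io, ns (same as PCIe, no savings)
tx_leg = 500.0   # host -> FPGA, ns (PCIe baseline)

round_trip_pcie = rx_leg + tx_leg
round_trip_cxl = round_trip_pcie * 0.95      # observed ~5% lower overall latency
tx_leg_cxl = round_trip_cxl - rx_leg         # all of the savings land on the tx leg

one_leg_improvement = 1 - tx_leg_cxl / tx_leg
print(f"host->FPGA leg improvement: {one_leg_improvement:.0%}")  # 10% when the legs are equal
```

If the host-to-FPGA leg is a smaller share of the round trip than assumed here, the implied single-leg improvement is correspondingly larger, which is consistent with the 10 to 12% figure quoted in the talk.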
This performance is possible because of a combination of factors, but critically, CXL. I'm going to show you a couple of performance slides now. These are not based upon STAC testing; they are based upon LMS testing and cover a subset of what STAC tests. STAC tests network I/O plus certain application functionality; what I'm about to show you is pure network I/O, measured using another FPGA product we created called TASER, a testing and simulation rig that tests network-attached devices and systems with an accuracy of 3.103 nanoseconds. Here we're showing traditional PCIe Gen 5 communication. This is 95th percentile latency, tested across numerous throughput rates from 1 megabit per second all the way through full 10G line rate, incrementing at 100 megabits per second: 100, 200, 300, 400, all the way to full 10G. Each individual test here, for eight different payload sizes, is 5 million frames, so the total for a single test, for example on the left-hand side, UberNIC to the host using UberLoad, is 40 million frames, and we're reporting the 95th percentile latency. As you can see, for typical financial services messages the overall latency is at or slightly below 1 microsecond through just about 7 gigabits per second, but we support all the way through 10 gigabits per second without dropping a single frame. No software network stack solution is capable of anywhere near this type of performance. Again, the left-hand side is UberLoad, our sockets-compliant interface, very easy to implement and able to achieve market-leading performance. The right-hand side is something called UberSOC, a C API implementation with just about the same performance, slightly better than UberLoad, but really you can't differentiate the two. And this is UDP.
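[Editor's illustration] For readers who want to reproduce this style of analysis on their own captures, a minimal sketch of reporting the 95th percentile per throughput rate might look like the following (generic illustration; the synthetic data and function names are mine, not TASER output):

```python
# Report 95th-percentile latency per throughput rate from raw
# (rate, latency) samples, the way each sweep is summarized on the slides.

from collections import defaultdict
import math
import random

def percentile(values, p):
    """Nearest-rank percentile (p in [0, 100]) of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Synthetic samples: throughput rate in Mbps -> list of latencies in ns
samples = defaultdict(list)
random.seed(0)
for rate_mbps in range(100, 10_001, 100):      # 100 Mbps steps up to 10G
    for _ in range(1000):                       # stand-in for 5M frames per payload size
        samples[rate_mbps].append(random.gauss(900, 50))  # ~0.9 us typical latency

for rate_mbps in sorted(samples)[:3]:           # print the first few rates
    p95 = percentile(samples[rate_mbps], 95)
    print(f"{rate_mbps:>5} Mbps: p95 ≈ {p95:.0f} ns")
```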
Next is TCP. As I mentioned earlier, TCP processing benefits from the FPGA implementation in that acknowledgements never go back and forth to the host. Typically, TCP is considered more demanding than UDP from a host processing standpoint. But in our model, by removing acknowledgements from the host side, we actually make TCP easier for the host to handle, because for a given throughput, the amount of data going across the PCIe bus to or from the host is significantly lower than it would be in a software network stack solution. Hence the essentially flat lines: the latency is that consistent.
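[Editor's illustration] As a rough picture of why keeping acknowledgements on the FPGA helps (my own estimate, assuming one ACK per two received segments, a common delayed-ACK default but not a published LMS figure), the sketch below counts the bus crossings a host-side stack would otherwise incur:

```python
# Rough estimate of ACK traffic a host-side TCP stack would push across the bus,
# which UberNIC's in-FPGA stack keeps entirely on the card.
# Assumes delayed ACKs (one ACK per two segments) -- an illustrative assumption.

SEGMENTS_RECEIVED = 1_000_000   # data segments arriving per second
ACKS_PER_SEGMENT = 0.5          # delayed ACK: one ACK per two segments
ACK_FRAME_BYTES = 64            # minimum Ethernet frame for an empty ACK

acks = int(SEGMENTS_RECEIVED * ACKS_PER_SEGMENT)
bus_bytes_saved = acks * ACK_FRAME_BYTES

print(f"ACKs kept off the PCIe bus per second: {acks:,}")
print(f"PCIe traffic avoided: ~{bus_bytes_saved / 1e6:.0f} MB/s")
```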
Then there's CXL. Now, it's a little bit hard to see the lines and exactly how they measure against each other, so there'll be a slide in a moment that shows explicitly the difference between CXL and PCIe. But as you can see, there's the same general pattern of performance, except that CXL is actually slightly lower latency, especially at lower throughput rates with smaller payload sizes. CXL as a technology produces truly outstanding outcomes, and I'll come back in a minute to comment on actual implementation, and where implementation decisions can negatively impact the CXL outcome on a particular device. So this is UDP over CXL: again, similar performance to PCIe, albeit slightly better, 5, 6, 7% lower latency.
And TCP, same general pattern, except that you will notice at the extreme edge the latency starts to climb. I'll describe in a moment the implementation details that result in that pattern.
Now, if we look specifically at CXL versus PCIe, and this is 64-byte payloads, typically what is of most concern in financial services messaging: where the line is below zero, CXL is outperforming, it has lower latency than PCIe; where it is above zero, above the yellow dividing line, CXL has higher latency than PCIe. As you can see, for most parts of the performance envelope, certainly for minimum latency other than at the extreme edge, for 50th percentile, and even for 95th percentile, CXL is the better outcome in most situations. Where CXL becomes not necessarily the best outcome is, as you can clearly see, at the extreme edge for both minimum and 50th percentile latency, and from about seven gigabits per second for 95th percentile latency.

Now, this has nothing to do with CXL as a technology. This is related to the actual underlying hardware, and in particular to flow control credits. With the combination of UberNIC, its FPGA network stack, and CXL, the hardware makes fewer flow control credits available to CXL than to PCIe. When we are sending massive numbers of messages, as we would at higher throughput rates and with larger payload sizes, flow control credit usage directly corresponds to both the number and the size of the messages being transmitted, and so we start to exhaust flow control credits at particular moments in time. That means messages must wait; they must be queued, either on the FPGA side on their way to the host, or on the host side on their way to the FPGA, until flow control credits are released by the other side and can be reused for communication. So a critical aspect of making effective use of CXL is ensuring that the underlying hardware has sufficient flow control credit resources to fully support the messaging requirements you are trying to implement.

And this is all very early days. We've been working with Intel and Altera for more than two years on creating UberNIC. As we all know: new technology, new implementation, new use cases. This is certainly not the use case everybody envisioned when they were dreaming up CXL; it is a novel use of the technology. But we have found, as you can see, that CXL does provide a better outcome than PCIe in certain parts of the use case envelope. So, for organizations implementing cross-PCIe-bus communication who think CXL might offer benefits, it certainly can. But people should be very careful in analyzing exactly what their data-moving use cases are, and whether those use cases are amenable to CXL technology as it exists today in the form of CXL 1.1, and as it will exist in the future with CXL 2.0, 3.x, and other future generations. I thank you for the time.
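[Editor's illustration] To illustrate the flow control dynamic described above, here is a toy model of my own (not UberNIC's actual credit accounting, and with made-up credit counts and drain rates) showing how a smaller credit pool forces messages to queue once the offered rate outpaces credit returns:

```python
# Toy model of credit-based flow control: a sender consumes one credit per
# message, and credits are returned as the receiver drains its buffer.
# With a smaller credit pool, messages queue at high offered rates.
# All numbers below are illustrative, not real hardware values.

def simulate(credit_pool: int, offered_per_tick: int, drained_per_tick: int, ticks: int) -> int:
    """Return the peak queue depth seen while sending at the offered rate."""
    credits = credit_pool
    queued = 0
    peak_queue = 0
    for _ in range(ticks):
        queued += offered_per_tick                  # new messages wanting to cross the bus
        sent = min(queued, credits)                 # can only send while credits remain
        queued -= sent
        credits -= sent
        credits = min(credit_pool, credits + drained_per_tick)  # receiver returns credits
        peak_queue = max(peak_queue, queued)
    return peak_queue

# Same traffic, two credit-pool sizes (e.g. a "PCIe-like" vs a "CXL-like" allocation):
for label, pool in [("larger pool", 64), ("smaller pool", 32)]:
    peak = simulate(credit_pool=pool, offered_per_tick=36, drained_per_tick=40, ticks=1000)
    print(f"{label:>12}: peak queue depth = {peak}")
```

With the larger pool the queue never builds, while the smaller pool caps how many credits can be outstanding, so messages back up exactly as described in the talk.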