Thanks, Mike. Good afternoon. As Mike said, my name is Bob Wheeler. I'm an analyst at large at Light Counting, which means that's not where I spend all of my time. I also have my own company, having spent about 20 years with the Linley Group before leaving in 2022. On my own, I've been doing a lot of work in the CXL space.
Then, in the middle of this year, I started working with Light Counting, more on the optical networking side of the world. This talk represents the culmination of those two vectors coming together. I'm mostly going to be talking about CXL and memory disaggregation and what that means from a market perspective. I'm not going to talk much about optics, except to cover the challenges of adopting optics in these applications.
So we're talking about memory disaggregation. It really is the holy grail of disaggregation, in the sense that storage and networking disaggregation have pretty much been done, in various forms, in hyperscale data centers. There are different architectures out there, including proprietary ones, for how storage disaggregation is done, but it's well established. Main memory disaggregation, on the other hand, has been very slow. If you remember Intel's Rack Scale Architecture from a decade ago, they were talking about photonics in the rack, and main memory disaggregation was on the roadmap. For various reasons at the time, including the very high power of the onboard optics, that never came to fruition. More recently, the Gen-Z Consortium formed in 2016 and had backing from the server OEMs, the data center space, and the memory vendors, but it was a very ambitious effort, something HPE referred to as memory-centric computing. We all know it basically never got past what were essentially proof-of-concept prototype products. The good news is that, compared with a lot of past efforts, CXL is moving to market very quickly.
The reason for that is that CXL rides on top of the ubiquitous PCI Express physical layer. Reusing the PCIe physical layer has been key to the rapid adoption of CXL, and by adoption I mean on the host in particular. As I think Ron mentioned, what you get with CXL is coherency. The good news here is that you're not adding to processor pin counts, because you already have the PCIe interfaces, and we've already seen very wide PCIe interfaces on modern processors. Cards and modules can use the exact same slots and form factors as PCIe. So if you're a server vendor or ODM building a server with the latest Epyc or Xeon processor, you're essentially getting CXL for free, and your end customer can populate a CXL module in what was a PCIe slot. It should, in theory, just work. Having said that, CXL 1.1 is baby steps, not a fully featured spec; we need to wait for CXL 2.0 and even CXL 3.0 to get some important enabling features. CXL Type 3 devices are the devices that provide cache-coherent memory expansion, and that's where most of the action around CXL is. The protocol related to that is called CXL.mem. CXL.mem uses flits, so instead of the traditional variable-length packets over PCI Express, you've got fixed-length transactions. The goal in the consortium was to add essentially a QPI hop's worth of latency, the same as a socket-to-socket interconnect; in reality, the simplest CXL topologies add about 100 nanoseconds of unloaded access latency.
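To put those latency numbers in perspective, here is a minimal back-of-the-envelope sketch. The ~100 ns added hop is the figure from the talk; the local DRAM latency and the switch penalty are illustrative assumptions, not measured values.

```python
# Rough, illustrative latency arithmetic for simple CXL topologies.
# The ~100 ns added hop is the figure from the talk; the other numbers
# are assumptions for illustration only.

LOCAL_DRAM_NS = 100      # assumed unloaded local DDR access latency
CXL_HOP_NS = 100         # added latency for a direct-attached CXL expander
CXL_SWITCH_NS = 70       # assumed extra penalty if a CXL switch is in the path

direct = LOCAL_DRAM_NS + CXL_HOP_NS
switched = LOCAL_DRAM_NS + CXL_HOP_NS + CXL_SWITCH_NS

print(f"local DRAM:          ~{LOCAL_DRAM_NS} ns")
print(f"direct CXL expander: ~{direct} ns (~{direct / LOCAL_DRAM_NS:.1f}x local)")
print(f"switched CXL pool:   ~{switched} ns (~{switched / LOCAL_DRAM_NS:.1f}x local)")
```

Under these assumptions, even the simplest expander roughly doubles unloaded access latency, which is the point picked up in the pooling challenges below.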
Here's the good news. This is my own forecast for the host units that will ship with CXL; if you've seen this on the web before, this version is updated through 2027. The bottom line is that essentially all servers based on Epyc, Xeon, and presumably even the Arm-based processors that people are using for captive deployments and are now developing next-generation products around will be CXL-enabled by the 2026 timeframe. There are going to be a lot of sockets out there. The challenge is how you actually penetrate that.
I want to talk about the near-term use cases. Mike talked up front about three to five years, and that's what I'm talking about here, not something beyond five years. I'm trying to bring us back to some near-term use cases where there's actual demand and you can make a business case for why a customer would want to deploy this. In the single-host expansion case, the vast majority of uses will basically just fit in a server chassis. You've got enough PCIe lanes on modern processors to build a server chassis that holds, say, eight CXL memory expanders. You can get terabytes of memory inside your chassis, and that's enough to satisfy the vast majority of use cases. People talk about SAP HANA; there are some applications that require large memory footprints, but the actual volumes for the really, really large memory footprints are quite small. The other use case is memory pooling, and that's what gets people more excited, because the idea here is to recover stranded memory. It's kind of like the Berkeley use case, or the NERSC use case, but in this case we're talking about hyperscalers and cloud computing. If you've been following this space, you will know that Microsoft published a paper, I think around March of last year, on how much memory they could fairly easily recover if they started pooling memory across a relatively small number of sockets. The sweet spot is about 16 sockets before you get to diminishing returns. Again, with a very simple implementation, Microsoft showed they could recover about 10% of their stranded memory. Since memory is about half the cost of a hyperscale server, that can be a big deal; the numbers add up.

In terms of how you actually do that, there are several physical topologies. One option is you build a chassis that looks like Meta's Grand Teton. Unfortunately, Meta's Grand Teton was shown in a plexiglass box, so we couldn't actually look at the guts of the system. Inside, there is essentially a cabled backplane, whatever the branding of the latest generation of cables they use, connecting the different components within Grand Teton. You could do something very much like that if you were trying to disaggregate your main memory and keep it within a chassis; we're talking about a 7U or 8U chassis, something like that. The other option, which is more what we all think about in terms of disaggregation, is one where you have compute nodes in your rack and a separate memory appliance, at which point you have to have cables to connect that memory appliance to the hosts. That's where it becomes interesting in terms of PCIe cabling and potentially optics. By the way, I'm showing a multi-headed memory expander here; I'll note that none of these actually exist beyond two host ports today. The other way to do this is to add a switch to the topology and then have separate memory expanders below the switch. There are different ways to deploy this.
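To make the pooling economics above concrete, here is a minimal sketch using the two figures from the talk: memory at roughly half of hyperscale server cost, and roughly 10% of memory recoverable by pooling across about 16 sockets. The absolute server cost is a hypothetical placeholder, so only the ratios mean anything.

```python
# Back-of-the-envelope pooling economics. MEMORY_SHARE, STRANDED_RECOVERY, and
# POOL_SIZE come from the talk; SERVER_COST_USD is a hypothetical placeholder.

SERVER_COST_USD = 20_000      # hypothetical all-in hyperscale server cost
MEMORY_SHARE = 0.5            # memory is roughly half the cost of the server
STRANDED_RECOVERY = 0.10      # ~10% of memory recoverable via pooling (Microsoft paper)
POOL_SIZE = 16                # sweet spot before diminishing returns

memory_cost_per_server = SERVER_COST_USD * MEMORY_SHARE
recovered_per_server = memory_cost_per_server * STRANDED_RECOVERY
recovered_per_pool = recovered_per_server * POOL_SIZE

print(f"memory cost per server:             ${memory_cost_per_server:,.0f}")
print(f"recovered value per server:         ${recovered_per_server:,.0f}")
print(f"recovered value per 16-socket pool: ${recovered_per_pool:,.0f}")
# Whatever the pooling hardware adds per pool (expander ICs, chassis, power,
# cabling, maybe a switch) has to come in well under that last figure.
```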
Challenges in the memory pooling use case: latency is really the key metric here. I talked about 100 nanoseconds of additional latency; even the simplest topologies can essentially double your memory access latency. Software has to become tier-aware, and I won't get into the NUMA details, but basically, if your software is not tier-aware, you're going to see a significant performance impact. The good news is there has been some work done here. Meta developed transparent page placement, or TPP, for Linux, and it's been upstreamed. Meta, I think it was actually a year ago here, showed data across four different workloads that they had modeled at the time, not using actual CXL but modeling it with multiple sockets, and showed that they could really minimize the performance impact on their well-known workloads. Now, I'll point out that public cloud providers don't necessarily know what the workloads are, because they're their customers' workloads. More work needs to be done in making all the software, either at the OS level or the application level, tier-aware.

Another problem, which people tend to minimize, is that the whole point of recovering stranded memory is to improve total cost of ownership, and the fact of the matter is CXL adds component cost. If you're building an appliance, there's a chassis, there's a power supply, there are fans, there are memory expander ICs. If it's an external appliance, there's cabling and connectors. Right from the start you're adding cost, eating into the amount of cost you're recovering in terms of stranded memory. When you look at optics, the problem is the math. When you do the math to figure out how much DRAM you're recovering and what that DRAM costs, the math doesn't support traditional optics like AOCs.
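Continuing the sketch above, here is where that math bites for optics: compare the per-host value recovered against the per-host cost added. The expander/chassis and copper line items are illustrative assumptions; the $400 number is the VCSEL AOC price quoted later in the talk.

```python
# Per-host value recovered vs. per-host cost added. The expander/chassis and
# copper numbers are illustrative assumptions; the AOC price is the ~$400
# 400-gig VCSEL AOC figure quoted in the talk.

recovered_per_host = 1_000       # ~10% of ~$10k of DRAM, from the sketch above

expander_and_chassis = 500       # assumed: expander IC plus share of appliance cost
passive_copper_cable = 100       # assumed: passive copper cable and connectors
aoc_optics = 400                 # ~$400 per 400-gig VCSEL AOC

net_with_copper = recovered_per_host - (expander_and_chassis + passive_copper_cable)
net_with_optics = recovered_per_host - (expander_and_chassis + aoc_optics)

print(f"net recovery per host, passive copper: ${net_with_copper}")
print(f"net recovery per host, AOC optics:     ${net_with_optics}")
# The AOC alone consumes a large slice of the recovered value, which is why
# the math doesn't support traditional optics at today's prices.
```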
I'm not going to spend a lot of time here, but I'll point out that the longer-term view on CXL is that it could become a true fabric. There's a long way to go to get there. Of course, there's interest in a standardized alternative to NVLink; most people in this room will be familiar with NVLink, which is NVIDIA-proprietary. For getting away from proprietary GPU-to-GPU interconnects, CXL could be very attractive, and a lot of features have been added in CXL 3.x to enable that. One downside I would point out is that CXL is tied to the PCIe physical layer, which today is two generations behind NVLink. Everybody's talking about PCIe Gen 7, and that will reach parity with what NVIDIA is shipping today. One of the challenges here is to keep the industry standard a little closer to what's happening at the leading edge.
I'm going to open a can of worms here, because there isn't really time to explain all the assumptions behind these numbers; I just want to give folks a sense of the scale of the market, because CXL use cases are going to take time to develop. In blue is my own forecast for CXL expander ports. These are x4 ports, meaning four lanes; that's just how I've normalized the forecast. It's a combination of pooled and shared memory expanders over time, in port units. Then I took the aggregate of Light Counting's forecast for 400-gig-and-above Ethernet transceivers over the same timeframe. You can see, just in order-of-magnitude terms, that the opportunity in the CXL memory pooling and sharing use cases is not going to be on the same order, from a volume perspective, as traditional Ethernet optics. On the one hand, we need lower cost; on the other hand, the volume isn't there to drive the cost down the way we have with traditional Ethernet optics.
I'll just leave you with a few comments on optics, and then I think we're supposed to have a break. Optics are obviously very appealing in terms of addressing some of the challenges around copper: the bulk, which Ron mentioned, the mass, the bend radius. You're talking about 16-lane cables. I would encourage everybody to see the TE cabling in the experience center if you haven't; they are demonstrating four meters. If you have retimers close to the CDFP connectors, four meters seems to be achievable for Gen 6. So for short reaches, optics are up against essentially passive copper, and the issue is cost. Cost is the biggest barrier to adoption for optics. For a VCSEL-based 400-gig AOC, we're talking on the order of $400 right now. The use cases I talked about for memory pooling just won't support that kind of pricing for optics; it's a non-starter at that point. The GPU use cases are still really speculative. Part of the problem, frankly, is NVIDIA's choice to stay proprietary, because that really limits the market opportunity unless you're specifically addressing just the NVIDIA part of the ecosystem. The good news here is that GPUs are expensive resources; you pay a lot for these things, so to the extent you can improve GPU utilization, that has a lot of value. The economics look a lot better if you can show, "Hey, I can make much better use of these really expensive resources." The other thing is that the performance considerations get harder and harder as you scale up these cluster sizes. As people have pointed out, at some point three meters or even five meters doesn't cut it anymore, and optics start to become more attractive. That's it. I'm happy to take questions during the break, but I don't want to hold people back from getting a break in.
Can I ask a quick question?
Sure.
This is great stuff. Can you back up one slide on the projection?
Yeah.
First of all, don't show this to any VCs, because you're going to cut off the gravy train for all these startups if you ever show that. I know you said you had lots of assumptions behind this, but can you at least give the assumption on the optics cost for the Ethernet optics versus this optic?
Just to be clear, the blue portion is not optical. I'm not assuming any optical penetration here. This is simply the number of CXL expander ports.
For sure, don't show this to VCs. You're going to just kill the funding.
I'm having that discussion with somebody in the space.
Bob, you talked about lowering the acquisition cost with all of the components, but you didn't say anything about the operational costs. I've been led to believe that memory that's attached via CXL uses substantially more power than memory plugged into the backplane. Could you comment on that?
Really, the only additional power is the power of the expander. It's the same old story as with memory buffer chips; a lot of old servers used memory buffers, and it was always more power and more cost. Same deal here: you're adding some power and some cost. The one twist is that you could support LPDDR, for example, on CXL and lower power that way. One of the things people have talked about is the ability to repurpose memory, to reuse memory that's already out in the field, and potentially to use LPDDR. By decoupling the memory from the processor's controller, you get more freedom in memory choices.
That memory reuse is kind of bogus, I think.
I won't disagree with you there.
Thank you.
Thanks.