Hello. We're almost at lunchtime. Don't worry, my presentation is 20 minutes, just like the rest of them. Hopefully it's educational. So, Michael Ocampo, Senior Product Manager of Ecosystems as well as the Cloud-Scale Interop Lab at Astera Labs. My topic for today is breaking through the memory wall.
So what we'll do in this agenda is cover what the memory wall is and what we're doing to break through it. We'll talk about a few use cases where we've seen success in breaking the memory wall, and other areas where I think we as a community can work together to chip away at the other aspects of the memory wall, particularly on the GPU side. We'll also look at a couple of different system designs, accommodating double-width add-in cards as well as shared infrastructure. Throughout this OCP Summit, a lot of people have talked about DC-MHS, so we'll talk about that design as well and see how CXL can be used there. And then we'll also talk about some of the CXL collaboration that's happening now. This morning you could tell there's a lot of synergy around CXL, and obviously this is CXL Forum, so that energy is all around us, which is great.
So what is the memory wall? You've probably seen a very similar chart before; this is just a slightly different perspective. Looking at compute performance over the last 20 years, the red line at the top, going up to the H100, is roughly the state-of-the-art technology right now, and the green line is essentially memory bandwidth. That disparity is the wall we're trying to break, and I think we've done that. Previous attempts have been challenged: not enough memory bandwidth and capacity, and they didn't scale efficiently. It just wasn't compelling enough for the market to say, I want to standardize on this. The CPU vendors have to support it; at the end of the day, there's got to be buy-in from the community to get this thing working. Also latency: people look at the performance, and if it doesn't meet the SLA, then it's really not compelling, especially compared to local memory. And not deployable at scale: in a previous life we did a lot of system deployment, and if your BIOS isn't right, your BMC isn't right, and your software stack isn't ready to provision this at scale, then you're just hoping it's going to work. If those preliminary requirements aren't met, the software stack definitely won't be ready for the application stack to really take advantage of this new technology.
So what are we doing to break through the memory wall? Here's our approach: significantly increasing memory bandwidth and capacity. If you look on the right, we're actually doing hardware interleaving on 5th Gen Intel Xeon Scalable processors. We're using two CXL memory controllers, each with two memory channels, so we're interleaving across 12 memory channels in total. This was a significant boost in performance, and I have a couple of use cases we'll talk through. We even reduced latency by 25%, so it's lower latency too; pretty compelling there. And of course, to really do this at scale, the hyperscalers and cloud companies want to use things they already have. They already have purchasing agreements for DIMMs, and they want to use DIMMs, so let them. That optimizes the supply chain and helps them control costs. With that, it's also got to be plug and play, so it's not just the hardware but also the software stack that has to be there; all the drivers have to be there. And I'll talk about even Kubernetes and how all those abstractions need to be there to scale out elastically.
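As a concrete illustration of that plug-and-play point (this is my own sketch, not Astera Labs tooling): on Linux, CXL-attached memory that has been onlined typically shows up as a CPU-less NUMA node, so an orchestration layer can discover expansion memory straight from sysfs before deciding where to place workloads.

```python
# Minimal sketch, assuming a Linux host: walk the NUMA nodes in sysfs and flag
# nodes that have memory but no CPUs, which is how onlined CXL expansion memory
# usually appears. This is a discovery heuristic, not a definitive CXL detector.
from pathlib import Path

NODE_ROOT = Path("/sys/devices/system/node")

def mem_total_kb(node: Path) -> int:
    # meminfo lines look like: "Node 1 MemTotal:  263950848 kB"
    for line in (node / "meminfo").read_text().splitlines():
        if "MemTotal" in line:
            return int(line.split()[3])
    return 0

for node in sorted(NODE_ROOT.glob("node[0-9]*")):
    cpus = (node / "cpulist").read_text().strip()
    total_gib = mem_total_kb(node) // (1024 * 1024)
    kind = "CPU-less (likely CXL/expansion)" if not cpus else "local"
    print(f"{node.name}: cpus=[{cpus or 'none'}] mem={total_gib} GiB ({kind})")
```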
So where we've seen success for sure is databases. On the left-hand side you're looking at e-commerce, business intelligence, data warehousing. There are a couple of different workloads, categorized in two main areas: OLTP and OLAP, that is, online transaction processing and online analytical processing. In other words, you're looking at what is happening now or what has happened over time. Some of these workloads look at a huge amount of data, so the more of it you can hold in memory, the better performance you'll get versus using spindle drives or even NVMe drives. Another opportunity we see for CXL is AI inferencing. We've actually worked with VMware and other software companies, and even done our own Redis demonstration, to show that caching with CXL can significantly boost performance. Popular streaming companies use Redis for recommendation engines. If you're looking for a movie, how does it recommend things so quickly? Well, it already knows your profile; these things are cached. Same thing with semantic caching: for large language models, instead of pinging your GPU all the time to recommend or generate something, you can use embedding tables and semantically cache that information. That's why vector databases have become very popular in the cloud, and we see that as a major opportunity for CXL to show significant performance gains for applications we really use every day.
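To make the semantic-cache idea concrete, here's a rough sketch. The embed() function is a placeholder for a real embedding model, and the key/value store would sit in DRAM or CXL-attached memory in a real deployment (here it's just Python lists); the point is only the lookup pattern, not a production design.

```python
# Hedged sketch of a semantic cache: serve a cached response when a new prompt is
# close enough in embedding space, so the expensive model (and the GPU behind it)
# is only hit on a cache miss.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a real deployment would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.keys: list[np.ndarray] = []   # embeddings (conceptually in DRAM/CXL memory)
        self.values: list[str] = []        # cached responses

    def get(self, prompt: str) -> str | None:
        if not self.keys:
            return None
        q = embed(prompt)
        sims = np.stack(self.keys) @ q     # cosine similarity (vectors are unit length)
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self.keys.append(embed(prompt))
        self.values.append(response)

cache = SemanticCache()
cache.put("recommend a sci-fi movie", "Try 'Arrival'.")
print(cache.get("recommend a sci-fi movie"))   # cache hit
print(cache.get("what's the weather today"))   # likely miss -> None
```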
So, the results. On the left, we've actually shown this in a previous exhibit at Flash Memory Summit, and we won an award for it. To explain it really quickly: the green line is DRAM, and the blue line is DRAM plus CXL. We're able to improve transactions per second by 150%, basically simulating 1,000 clients over time, and it only increases CPU utilization by about 15%. This was using software tiering, thanks to MemVerge Memory Machine. On the right-hand side are the OLAP test results, where we're looking at lots of different queries. This is TPC-H, a pretty standard benchmark that a lot of folks like Oracle or SAP run. Here we did 12-way interleaving, with 512 gigabytes of memory and 256 gigabytes of CXL-attached memory, and we're able to cut query times in half, which is pretty significant considering some query times can be an hour and a half. For a database admin, that means getting to go home early.
So we want to take it further. We want to break this memory wall and just move it out of the way. The diagram on the left is basically two machines, without CXL. We can do 24 DIMMs per system, so to get to 48 DIMMs you've got to buy another system: another power supply, fans, a backplane, parts you may not need if your application is memory-bound. So that's high cost, high power, high utilization. On the right-hand side, and we're showing this in our booth, A11, is a server that can accommodate eight double-width cards with our CXL memory controller. So the improvement and its significance are pretty obvious: going from 24 DIMMs in a single system to 56 DIMMs is 2.33x the memory capacity and a 1.66x memory bandwidth improvement. So, definitely lower TCO.
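For what it's worth, here's the back-of-the-envelope math behind those ratios. The DIMMs-per-card and bandwidth-per-card figures are my own assumptions for illustration, not numbers from the talk.

```python
# Rough capacity/bandwidth math under stated assumptions: the baseline server has
# 24 DIMM slots (one DIMM per channel), and each double-width CXL card carries
# 4 DIMMs and roughly the bandwidth of 2 native DDR5 channels (assumptions).
BASE_DIMMS = 24
CARDS = 8
DIMMS_PER_CARD = 4           # assumption
CHANNEL_EQUIV_PER_CARD = 2   # assumption

dimms = BASE_DIMMS + CARDS * DIMMS_PER_CARD
capacity_ratio = dimms / BASE_DIMMS
bandwidth_ratio = (BASE_DIMMS + CARDS * CHANNEL_EQUIV_PER_CARD) / BASE_DIMMS

print(f"{dimms} DIMMs, {capacity_ratio:.2f}x capacity, {bandwidth_ratio:.2f}x bandwidth")
# -> 56 DIMMs, 2.33x capacity, 1.67x bandwidth, which lines up with the figures above
```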
But not every system can support a double-width card, so we've been looking at the OCP community and at different system designs. DC-MHS is a pretty popular one. In a previous life as a system architect, I always sought out these kinds of designs and have an affinity for them, because they have shared power, shared cooling, shared infrastructure, and good TCO. So it's good to see the community aligning on shared elements within this type of architecture. One of the sub-specifications is a pluggable multipurpose module, the PMM, based on SNIA SFF-TA-1034. This is interesting because it supports a coplanar design, and if you go into the CMS experience zone, Amphenol actually has this connector. A number of people have bought in on this idea, and it's very easy to basically plug in this module. I've done my best to show a little mock-up drawing on the bottom right: you have your HPM, which is your host processor module, connected to some kind of backplane or midplane, and that's connected to a host interface board, and that host interface board carries these PMMs. It's not available today, but it's a concept a lot of the community has really bought into, and I think within a year you'll start to see these things. Why would they do that? For all the reasons I mentioned: standard DIMMs, but also high-power connectors, so 200 watts on this connector and an additional 400 watts through an auxiliary connector. Why is this compelling? Again, it's all these things: TCO, as well as hot-plug support and enterprise features.
But of course that's not without challenges: signal integrity, link bifurcation configuration, latency and performance, as well as DIMM interoperability, are all things you have to think about. So as you look at this design concept, if you're looking into designing something like this, you should take into account that signal integrity for Gen 5 is not an easy hump to get over. Usually retimers are used, and that's something Astera Labs offers.
So this is just what it kind of looks like: you drop these retimers onto the board, and boom, your eye diagram looks perfect, which is what you need if you want stable performance every time you boot up your system. The last thing you want is to wait 10 or 30 minutes for your DIMMs to come up; that's not a good sign. And these DIMMs are eventually going to fail. That's why the community is behind the concept of DIMMs, and of having them accessible in the front, so you don't have to turn off your system just to service one DIMM.
So, same thing, different diagram, to make it super simple to follow. What are we enabling? With our Leo chip we can do CEM add-in card connectivity, which is your typical memory expansion and supports real-time applications; with our hardware interleaving, that's an incredible value proposition. In the middle is short-reach CXL-attached memory: one retimer, maybe a backplane and a midplane, and then add CXL. This is enabling just a bunch of memory, which is kind of the holy grail everyone's after. And if that's not enough, if you want even more memory, you can go over a PCIe cable, which is another thing we're showing in our booth, A11. You can go there and see a three-meter cable with a CXL controller at the end of it, working today. It's not vaporware.
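As a small software-side aside (my sketch, not something from the talk): the upstream Linux CXL driver stack exposes memory devices under /sys/bus/cxl/devices, so you can sanity-check what the kernel sees behind an add-in card or a cabled controller. Attribute names vary by kernel version, so treat this as best-effort.

```python
# Hedged sketch: list CXL memory devices (memN) from sysfs and report their
# volatile capacity if the attribute is present on this kernel.
from pathlib import Path

CXL_BUS = Path("/sys/bus/cxl/devices")

if not CXL_BUS.exists():
    print("No CXL bus in sysfs - CXL driver not loaded or no CXL devices present")
else:
    for dev in sorted(CXL_BUS.iterdir()):
        if not dev.name.startswith("mem"):   # only look at memory devices (memN)
            continue
        ram_size = dev / "ram" / "size"
        size = ram_size.read_text().strip() if ram_size.exists() else "unknown"
        print(f"{dev.name}: volatile capacity = {size}")
```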
So what happens to latency, you ask? Maybe you're asking in your head and just didn't vocalize it. Well, minimal impact, depending on the application. On the left-hand side you've got Leo directly attached to the host, then one retimer, which is our Aries Smart Retimer, and then two back-to-back retimers with CXL. Your latency impact is less than 10%. We won't really know whether this harms your application until we work with you, optimize it, and see what happens, but from our perspective, looking at database applications, anything that's not super latency-sensitive is going to work. Usually, if you have an application that wants more bandwidth, this is a pretty suitable option. So it's unlocking new architectures.
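If you want to reason about what that sub-10% figure might mean for an interleaved configuration, here's a quick weighted-average estimate. Every number in it is an assumption stated for illustration, not measured Astera Labs data.

```python
# Blended-latency sanity check for local DRAM interleaved with CXL-attached memory.
# All latencies below are illustrative assumptions; plug in your own measurements.
LOCAL_NS = 100           # assumed local DDR5 idle latency
CXL_NS = 250             # assumed CXL-attached idle latency (controller + link)
RETIMER_NS = 15          # assumed added round-trip latency per retimer hop

def blended_latency(local_ways: int, cxl_ways: int, retimers: int = 0) -> float:
    cxl = CXL_NS + retimers * RETIMER_NS
    total = local_ways + cxl_ways
    return (local_ways * LOCAL_NS + cxl_ways * cxl) / total

# e.g. 8 local channels interleaved with 4 CXL channels, one retimer in the path
print(f"{blended_latency(8, 4, retimers=1):.0f} ns average")   # ~155 ns with these assumptions
```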
And so that brings me to the collaborations we're doing across the ecosystem. There's a lot of interoperability work with all the major CPU vendors, memory vendors, and OS vendors. I mentioned VMware earlier; the Linux community is involved, even Windows. So a lot of folks are involved. And why is that so critical? Again, it comes down to those enterprise features, the ability to scale and deploy confidently, and reducing time to market. That's why we want to be your partner to really bring this forward, and we want to prove it to you with the performance metrics we shared earlier. Please come to our booth and check it out. There's a lot of work being done on hardware interleaving as well as software interleaving; you've seen MemVerge mentioned several times today with their Memory Machine and Memory Viewer. That's important: if you're a system administrator, you need to know what's warm, what's cold, how you manage your page policies, and how much granularity you need to make smarter decisions about memory usage, especially when you have different tiers. And I know Greg Price, for example, is working on different interleave weights, and those are all super important as you deploy this at scale. Once this is at cloud scale, fleet management becomes absolutely critical: CXL 2.0, RAS features, telemetry, Redfish software integration, orchestration. These are things you can't ignore. So you have to work across the aisle, even with your competitors, and say, hey, let's solve this problem together. This is why it's called OCP Summit, right? Let's work together and really bring this forward. Shout-out to Supermicro with their single pane of glass, Super Cloud Composer. Good friends with Kevin Culp, who's a big advocate for Redfish compliance, and I know there's work going on in Redfish for CXL support. As a community, I think it's good to want these things so we can really scale out.
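On the interleave-weights work mentioned above, here's roughly what tuning those weights can look like on a recent kernel. The node numbering and the 3:1 split are assumptions for illustration; availability of the sysfs knobs depends on your kernel version and config, and writing them needs root.

```python
# Sketch of setting per-node weights for Linux weighted interleave. The sysfs
# layout below is the upstream interface as I understand it (roughly 6.9-era
# kernels onward); treat it as best-effort, not guaranteed.
from pathlib import Path

WI_DIR = Path("/sys/kernel/mm/mempolicy/weighted_interleave")

# Assumed layout: node0/node1 are local DRAM, node2/node3 are CXL-backed,
# and we want roughly 3:1 of allocations to land on local memory.
weights = {"node0": 3, "node1": 3, "node2": 1, "node3": 1}

if not WI_DIR.exists():
    print("Kernel does not expose weighted-interleave sysfs knobs")
else:
    for node, weight in weights.items():
        knob = WI_DIR / node
        if knob.exists():
            knob.write_text(f"{weight}\n")
            print(f"set {node} weight to {weight}")
```

Processes then opt in through the weighted-interleave memory policy; as I understand it, newer numactl releases expose an option for this, so the kernel spreads allocations across DRAM and CXL nodes in that ratio.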
And so that brings me to the call to action. Get involved: come to the DC-MHS meetings as well as CMS. I'm pointing to a hyperlink there for the CMM proposal and also for DC-SCM, which is the management portion and will include CXL as well. And yeah, the Linux community is very involved with CXL innovations. There's also our website; please reach out to us through there or directly through me.