How many of you have ever used a supercomputer? Okay, a few. Okay, that's good. Well, what I'm going to talk about today is sort of the history of memory-driven computing at HPE. I joined HPE in 2016 as a contractor to write a proposal for the Department of Energy PathForward program, and we won the deal, so I got a job there. And I think it was actually better to work for them at that level. But I've been there ever since, working in the CTO office on system architecture for leadership-class supercomputers. If you're familiar with the Top500 list, HPE Cray has four of the machines in the top 10: number one, number two, number five, and number six. The Supercomputing conference in November will have the new number for the El Capitan machine at Livermore, which is up and running, actually, and we expect that to take the number one slot. We also just had the inauguration of the Alps machine in Switzerland, so we expect that to make that top 10 as well. So, you know, it's a lot of really fast computers. To give you some idea of the numbers, you've probably heard about exascale: an exaflop is 10 to the 18th double-precision floating-point operations per second, so that's a lot of math. The Frontier machine is at about 1.2 exaflops, and we expect the Livermore machine to come in at about two exaflops. It's pretty impressive what they've been able to do with the Frontier machine; it's really intellectual leverage that's allowing people to do science they haven't been able to do before. But what I want to talk about here is disaggregated memory, this notion of being able to separate CPUs and memory and have global access to data structures, and how that might be used. So I think I'll just dive into it.
So first, I want to talk a little bit about the taxonomy that I use, because I think there's a lot of confusion when these terms get thrown around. When I talk about FAM, I'm talking about fabric-attached memory: memory that is disaggregated from the CPUs and attached to the network fabric. Now, in some of the systems and prototypes we have built, that fabric-attached memory does have computational capability, and so some of these things start looking more and more like compute nodes with lots and lots of memory. So it's a little bit of a misnomer, because you can have computation in memory, as you well know. Persistent memory, the way that we use the term, is memory where the results of one process survive to be used by the next process. This is very important in next-generation systems because we're really moving towards integration of workflows, especially with artificial intelligence. Simulations on these supercomputers now have an artificial intelligence component; it turns out that inference can really speed up simulations rather than just grinding out solutions, because it can help you reduce the search space. For example, with the help of AI they've now mapped all of the proteins that the human body can produce, which was previously a computationally intractable problem because of protein folding, and that has been achieved. So you're going to see more and more of these complex workflows in supercomputing workloads, and more data analytics, and I'll talk about that later too. Non-volatile memory, I think you all know what that is: memory that retains its contents if power is removed, not to be confused with persistent memory. We use persistent memory to mean that the actual data survives the process that created it. Resilience: when we talk about resilient memory, this is memory that delivers some level of service. You have a contract between the hardware and the software, and resilient memory is memory that can be returned to a known state, typically in the event of a fault. To do this, you have to do logging, and you have to know what happens in the event of a system failure of some kind so that you can return to a known place. And the write-wear problem you're probably all very familiar with from the flash world; in persistent memories like phase-change memory or memristor ReRAM, wear is a very big problem currently.
So let's go on. I really want to talk about the origins of memory-driven computing at HPE. Probably a lot of you have heard of The Machine project, and we'll talk about that in a bit, but there were some key inventions that really motivated it. One was the sort of rediscovery of memristors by HP Labs in 2008 and the fabrication of titanium dioxide memristors. If you put titanium dioxide between a couple of electrodes and apply voltage to it, the oxygen atoms migrate, and you can actually change the resistance based on the history of the voltage: how much voltage was applied and for how long. In this way it can also be used as an analog component, which is important, and we'll talk about that too. Phase-change memory you're probably all familiar with from the late, great Optane; that's based on a chalcogenide glass and is also similar to a memristor. You can apply voltage to it and change the resistance of the cell, and it is also non-volatile. And there are other types of persistent memory: magnetic RAM, which is very fast but really hasn't achieved the capacities that I would like to see, and then there's carbon nanotube RAM and some other types of memories that have been proposed, and nobody has really succeeded with them yet. I think the whole industry has suffered a bit from the fact that Intel and Micron lost a lot of money investing in Optane, so it's really inhibited investment in persistent memory. A lot of the memory companies had plans to introduce phase-change memory, and that, unfortunately, all went away because of the losses. And so there are a lot of these technologies that have been shown to work. I was personally in the Nantero lab and saw what went on there, and they had some pretty impressive things: a two-megabit nanotube RAM that they did with Fujitsu. But that was done in something like a 65-nanometer process, so it wasn't state of the art. And one of the biggest problems these folks have is that they can't get access to a state-of-the-art memory fab, because if you go talk to Samsung or Micron or TSMC or somebody, they're not going to shut down the line and let them pour their black goo into the spinning machine. That's just not going to happen. So they don't really have access to the fabs to develop the technology, but the technology does exist. And I really hope that this turns around because of things like the CHIPS Act. One of the things that we've advocated for is really to get innovators access to state-of-the-art fabs. So if there's anybody here from the Department of Commerce, I'll say it again.
So anyway, the idea behind memory-driven computing was that if you had this persistent memory, you could change the paradigm from a cluster, which is sort of a network of compute nodes that send messages to each other (the current state of the art with MPI, for example), to a paradigm where you have shared memory and a memory-semantic fabric. The idea was that you could have high-capacity persistent memory and access it from the compute nodes connected to the same memory-semantic fabric. And then it's much easier to program, because you have a shared-memory view of the world: your data structures live in this memory, they can survive across multiple processes, in other words persist, and you can do some interesting things with that. And so that is really what kicked off The Machine project.
The memory-semantic fabric was, of course, Gen-Z, which originated at HP, together with the HyperX fabric topology, which is similar to a flattened-butterfly topology. That really allows you to scale these networks to hundreds of thousands of endpoints and build effective communication structures. So that was the state of things.
HP actually prototyped this in the labs and built one of these things. It wasn't really ready for productization, despite all the Star Trek ads and stuff that HP ran at the time. It didn't make it to market because it was really designed as a research project and not as something that could be produced, and also, I think, some of the performance really wasn't there.
But it served as a basis for our PathForward work with the Department of Energy. Another thing that happened in HP Labs was our optics program; HPE has really good mid-board optics technology in the labs. So the idea that we could build a Gen-Z fabric that would scale and that was 100% optical is something that we pursued.
As part of the PathForward program, in 2017 we won an award from the Department of Energy, a fairly large one, to develop a prototype. This was really part of the exascale computing program, to develop the CORAL-2 generation of machines, in other words Frontier and El Capitan. El Capitan is now finishing its acceptance and will be in production soon. But I will show you the prototype that we built for PathForward. This is a 100% optical network, and we built a system based on AMD Rome CPUs at the time, with a Gen-Z bridge chip called Wildcat. One of the things that we learned about memory-driven computing is that it works really great for load/store of small things like semaphores and atomic operations, but if you really want to move a lot of data, you have to go to RDMA. So our Gen-Z bridge component had a fairly sophisticated RDMA engine in it to facilitate the large transactions, and we modified the Gen-Z protocol to support small messages with minimal packet overhead. So it was a really nice network, and 100% optical. The Alphastar switch that you see here, with the brass-looking copper heat spreaders on the right, is the optical switch component. So this is the compute node on the left and the switch on the right.
We also had the fabric-attached memory module, which we built with an optical interface that used one of the chiplets from the switch. So it had a built-in switch and an electrical-to-optical translation chip, and we built the FAM controllers out of FPGAs. This was the fabric-attached memory in the Department of Energy prototype. So we were humming along on this, and the CORAL-2 bid process kept chugging along, right? We were hopeful that these technologies would make it into the next generation of Department of Energy machines.
We built a chassis called Badger, and here's a piece of it, a liquid-cooled enclosure. You can see in the bottom there's a tray which has an optical shuffle in it. The 100% optical network comes into this box in the chassis, and it has something that looks like a flex circuit; it's all optical data paths in there that shuffle the incoming connections to the right places. And then we have mid-board optics from that box to the chassis connectors that go right to the switch components on the NICs on the compute node boards.
So we were about two-thirds of the way through PathForward when Cray was awarded the CORAL-2 contracts. This is an article from 2019, when the contract was announced to the world, but we really knew that Cray had been awarded the contract about a year before this. So, before the PathForward program actually finished, the other vendors were left with things to do for PathForward, but they didn't really have a future there because we weren't awarded the contract. We were hoping there might be some other customers for our Gen-Z prototype, but that did not materialize.
But we continued with our notion of memory-driven computing, and I want to talk about this picture a little bit, because it's a notional drawing of the data flow; you shouldn't look at it as the actual architecture of a machine. From the perspective of a user doing data analytics, you have external data sources, in some cases with very high amounts of data being ingested per hour. That data needs to be curated and formatted and put in some form that the analytics can use. In our idealized world, it goes into a global persistent memory, an in-memory data store that the compute nodes can access. So we have a compute cluster that allows us to run large-scale analytics, and then on the front end we have an interactive capability: something like Jupyter notebooks, where somebody can fire up their Python program and write their application against a library like NumPy, and the supercomputer emulates NumPy on the back end. They don't necessarily know anything except that their application runs much faster and on a lot more data. So what we're really trying to do is make these machines a lot easier to use, and we have had some success in that. We also began a contract with the Department of Defense, a small one, to look at this in more detail.
And then the next event was that the acquisition of Cray was announced. HPE bought Cray, and I think it was really a good thing for the Department of Energy, because we had the financial muscle and the operations capabilities, and they had the contract and the technology to field these exascale machines. So I think the combination of HPE and Cray really worked out well, and it was a pretty smooth acquisition. HPE acquired Cray; Cray acquired me, so that's how I ended up doing this. And so everybody was working to deliver first the Frontier machine at Oak Ridge. We discovered there was a tremendous amount of work to do, particularly on Slingshot, but we did field that machine and ultimately got it all working. I think the largest Slingshot installation now is the Aurora machine at Argonne, with over 70,000 endpoints, so that's probably the largest HPC network in the world right now. With our Defense Department contract, we were really working to further our optical Gen-Z work, but we had to make a shift because the company was moving to Slingshot and a different product plan, so we wanted to align our research with something that we could deliver to customers. So we merged our program with a similar program at Cray and came up with the program called Golden Ticket.
In this project we actually built a cluster, which is operational in Houston: a small compute node cluster and two FAM partitions. One was built with AMD Milan nodes with a lot of memory attached to them, and that really allowed us to get started on our FAM research early. We used DRAM for it, so it wasn't non-volatile, but that really wasn't a problem for the computational research, and the fact that it was DRAM meant that we could emulate slower memory just by slowing down accesses. That let us do some research on the effects of FAM latency, and that was quite effective. The partition below it is an Intel Ice Lake partition, and that's 10 nodes, each with 8 terabytes of the Optane Barlow Pass DIMMs. So the whole machine has 82 terabytes of DRAM and 80 terabytes of Optane; it's a pretty formidable machine that cost a couple of million dollars to build. Now it's one of the main systems in HPE Labs research. So I can show you a little bit more here, if I push the right button this time.
This is the architecture of the machine. It's Slingshot-based: we have four 64-port Slingshot switches and basically 100% bisection bandwidth. The idea is that if we had very memory-intensive operations and a lot of traffic, this fabric could support all of the compute and memory nodes running full blast.
We had a number of components in the program besides building the prototype. We developed a software stack for the memory nodes that sits on top of Linux and manages the fabric-attached memory on each of the memory nodes. We simulated extending this to over 10,000 nodes using SST, if you're familiar with that simulator from Sandia. And then we did a demonstration of the NumPy emulation I was talking about earlier, using some software called Arkouda, which originated at the DoD and has support from the New Jersey Institute of Technology and from us. That's open source, by the way; Arkouda is on GitHub if you want to take a look at it.
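To make that NumPy-emulation idea concrete, here is a minimal sketch of what the user-facing side looks like with Arkouda. It assumes an Arkouda server (arkouda_server) is already running and reachable with the default connection settings, and exact function names can vary between Arkouda releases.

```python
import arkouda as ak

ak.connect()                    # attach this Python session to the Arkouda server

# These arrays live on the server (potentially spread across many nodes and a
# lot of memory), not in the local Python process.
a = ak.randint(0, 100, 10**9)
b = ak.randint(0, 100, 10**9)

c = a + b                       # element-wise add, executed server-side
print(c.sum())                  # only the scalar result comes back to the client

ak.disconnect()
```

The point is that the front end reads like ordinary NumPy-style Python, while the heavy lifting happens on the cluster behind it.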
This is the machine in Houston; there are four racks to the right. It's in a computer room in Spring, Texas, at the corporate headquarters, called the Misfits Room. It's called the Misfits Room because it's designed for high-powered systems that are air-cooled, and so we're kind of lonely in there, because the liquid-cooled lab is where all the HPC work is going on now. So we mostly have this room to ourselves. I built this machine with a couple of technicians and one of our IT guys doing all the work. I live in Santa Cruz, so I was mostly on the phone, and this all happened during COVID, so I couldn't go there. I had a bill of materials and plans for how we were going to put this together, the guys built the machine, and I actually didn't get to see it until two years after it was built. I finally got into the room last February, and there's such a roar of fans in there that I couldn't wait to get out, but I was excited to see the machine. You would have thought I'd have had the presence of mind to remove the dolly that's sitting on the side of the machine when I took the picture, but sadly I didn't notice that until afterwards.
Anyway, I did want to acknowledge the guys who built the machine here, Daniel Moore and Binoy Arnold. Daniel was actually the one who unpacked all the boxes and screwed everything together, and he did a phenomenal job. This machine is now running in HPE Labs, and it has a fairly large user base; we probably have more people who want to use it than we have the ability to support, but it has turned out to be quite an asset to us.
One of the outcomes of this program was our OpenFAM API, which is also open source and on GitHub. It's got an API and a management stack so that you can actually run this; it will run on just about anything, on your laptop or an Ethernet cluster or whatever you want. Yes, thank you. So we did open source this, and we also have an optimized version for Slingshot. I don't have more than 10 minutes left, so I want to get to the point.
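To give a feel for the programming model, here is a toy, in-process stand-in for the allocate/put/get/atomic pattern that fabric-attached memory APIs like OpenFAM expose. The real OpenFAM API is C++ and talks to actual fabric-attached memory; the class and method names below are hypothetical and exist only to illustrate the idea of one process leaving data in a region for a later process to pick up.

```python
import struct

class ToyFAM:
    """In-process dictionary pretending to be a pool of fabric-attached memory."""
    def __init__(self):
        self.regions = {}

    def create_region(self, name, size):
        self.regions[name] = bytearray(size)   # named, sized chunk of "FAM"
        return name

    def put(self, region, offset, data):
        self.regions[region][offset:offset + len(data)] = data

    def get(self, region, offset, nbytes):
        return bytes(self.regions[region][offset:offset + nbytes])

    def fetch_add(self, region, offset, value):
        old = struct.unpack_from("<q", self.regions[region], offset)[0]
        struct.pack_into("<q", self.regions[region], offset, old + value)
        return old                             # atomic-style fetch-and-add

# A producer writes a result; a later consumer reads it back, which is the
# "persistent memory" idea from the taxonomy earlier in the talk.
fam = ToyFAM()
r = fam.create_region("results", 1 << 20)
fam.put(r, 0, b"checkpoint-42")
print(fam.get(r, 0, 13))          # b'checkpoint-42'
print(fam.fetch_add(r, 512, 1))   # returns the old counter value (0)
```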
So what I really wanted to talk about today is disaggregated memory and where things are going, and to make you aware of some of the problems that we face in leadership-class systems. In 2019, CXL happened, and Intel finally agreed to an open standard for an accelerator interface, which was great news. That was one of the big obstacles for Gen-Z: we never had a processor that spoke Gen-Z. AMD was doing CCIX, IBM was doing OpenCAPI, NVIDIA had a closed system with NVLink, and Intel also did not have an open interface until this. For CXL 1.0 and 1.1, the spec was basically whatever Sapphire Rapids could do. Now the 2.0 generation is out, and 3.1 support is coming.
One of the things that we had planned to do for our DoD system was build a node around Granite Rapids. The plan was to have very high injection bandwidth per node, and also 16 terabytes of Optane on each of the memory nodes in the system. And then Intel killed Optane, so we couldn't build this, but we did a lot of good work studying it. I'm not going to go through all of the results.
A lot of these are published papers that you can look up, but I do want to give you one example here. HPC applications do lots and lots of linear algebra; graph analytics, artificial intelligence, and a lot of other applications are mostly matrix algebra. What we did here, and I don't know if this is too easy to see, really demonstrates one of the problems with scaling these linear algebra operations: compressed sparse matrix formats, in particular CSR, require you to broadcast the input vectors to every node in the system at the beginning, so that all the nodes can figure out where their operands are. That's the problem with the sparse matrix: it's a great format for squashing the matrix down to a manageable size, but it's a terrible format for HPC because of this broadcast problem. So we played around with and developed our own format, a column-partitioned sparse matrix, which includes both the row and column indices for each of the nonzero operands. Now, if I am carving up the calculation, I can send each compute node only the pieces of the vectors that it is going to use; I don't have to broadcast the entire vector to everybody. If you look at the blue line that points to the line on the bottom, the one scaling linearly, you can see that this was very effective. But it does have the property of increasing the memory footprint of the matrix. If you look in the literature, there are a lot of alternative matrix formats for supercomputing applications, and they all have this property.
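Here is a small, self-contained illustration of that point using SciPy. This is not our actual column-partitioned format, just a toy showing that when each nonzero carries explicit row and column indices, you can partition the matrix by columns and ship each worker only its own slice of the input vector instead of broadcasting the whole thing; the cost is a row index stored per nonzero rather than CSR's compressed row pointer.

```python
import numpy as np
from scipy.sparse import random as sparse_random

n, nparts = 100_000, 4
A = sparse_random(n, n, density=1e-5, format="coo", random_state=0)
x = np.random.default_rng(1).random(n)
rows, cols, vals = A.row, A.col, A.data           # explicit (row, col, value) triples

bounds = np.linspace(0, n, nparts + 1, dtype=int)  # contiguous column slices
y = np.zeros(n)                                    # in a real run, partial y's are reduced
for p in range(nparts):
    lo, hi = bounds[p], bounds[p + 1]
    mine = (cols >= lo) & (cols < hi)              # nonzeros owned by this partition
    x_slice = x[lo:hi]                             # the only part of x this partition needs
    np.add.at(y, rows[mine], vals[mine] * x_slice[cols[mine] - lo])

assert np.allclose(y, A.tocsr() @ x)               # matches the reference SpMV
```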
Some of the things that we learned: we don't have the Gen-Z memory semantics anymore in Slingshot or Ethernet or InfiniBand, so the FAM references need to be converted to RDMA. The effect is that it hurts small-message performance, so we're doing things like message aggregation to improve small-message performance, but we don't have the memory semantics that we had in Gen-Z; maybe we'll have a future interconnect someday, like CXL, that could do that. FAM is not great memory because of the latency; you have to go over the network to get to it, so it's slow compared to local DRAM. But it's really, really fast storage, so if you have shared in-memory databases and you can put them in fabric-attached memory, that's a way to get very high performance; something like SAP HANA would be perfect for that. And the design of your data structures is critical; that is what will drive the scalability of your applications. The sparse matrix example is one of those.
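To see why aggregation matters once every FAM reference becomes an RDMA operation, here is a back-of-the-envelope latency/bandwidth model. The alpha and beta numbers are placeholders I picked for illustration, not measurements of Slingshot or any real fabric.

```python
alpha = 2e-6        # assumed per-message overhead: 2 microseconds
beta = 1 / 25e9     # assumed 25 GB/s effective bandwidth -> seconds per byte

def transfer_time(n_messages, bytes_each):
    """Simple alpha-beta cost model: each message pays a fixed latency."""
    return n_messages * (alpha + bytes_each * beta)

n, size = 100_000, 8
print(f"{transfer_time(n, size) * 1e3:.1f} ms for {n} separate 8-byte puts")
print(f"{transfer_time(1, n * size) * 1e3:.3f} ms for one aggregated transfer")
```

The per-message latency dominates tiny transfers, which is exactly what message aggregation is trying to amortize.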
I want to talk a little bit about CXL itself, with a case study on the MI300A. This is the chip used in the Livermore machine, El Capitan. It has eight stacks of HBM3, and they can get a net 5.3 terabytes per second out of that memory. It also has four x16 links of Infinity Fabric and four x16 FlexBus links that can be either PCIe, CXL, or Infinity Fabric. We actually use two of those FlexBus links to connect the MI300As all-to-all in the Livermore node, which leaves 32 lanes of FlexBus, and we use 16 of those lanes to attach a NIC to each one of these chips. So what we really have left in the node is 16 lanes that could be used for CXL. As two x8 links, that's a total of about 64 gigabytes per second we could get on CXL, which is only about 1.2% of the bandwidth of the HBM.
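As a quick sanity check on those numbers, assuming PCIe Gen 5-era CXL at roughly 4 GB/s of usable bandwidth per lane per direction:

```python
lanes = 16                      # FlexBus lanes left over for CXL on the node
cxl_gbps = lanes * 4            # ~4 GB/s per Gen5 lane -> ~64 GB/s
hbm_gbps = 5300                 # 5.3 TB/s of HBM3 on the MI300A, in GB/s
print(cxl_gbps, f"{cxl_gbps / hbm_gbps:.1%}")   # 64 GB/s, about 1.2% of the HBM
```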
So really, what we're saying is that CXL doesn't give us any significant memory bandwidth improvement on a node like this, or on nodes with lots of HBM or LPDDR channels; CXL is relatively insignificant because of the bandwidth limitations. And in the next generation of leadership-class HPC systems, the nodes don't really have memory capacity limitations either, though I realize that everybody can always use more memory. Look at the way these systems are deployed now: if I have a node with 16 DDR5 channels, like, say, Venice or Diamond Rapids in the future, the minimum I can configure that node with is 512 gigabytes if I use 32-gigabyte MRDIMMs, and I have a lot of nodes, so I don't really need CXL as a memory expansion. And the HBM stacks will be 64 gigabytes each, with 16-high stacks of 32-gigabit parts, so even the capacity problem is being mitigated. So where are we with CXL and HPC? The link bandwidth is going to improve to 64 gigatransfers per second per lane with PCIe Gen 6, but that's still not enough, because the state-of-the-art SerDes in that timeframe runs at 224 gigabits per second. So we really need to get these CXL speeds up to make it an effective path for these connections. And a single HBM4 device is going to run at two to four terabytes per second. So when you talk about raw memory capacity or bandwidth expansion, we're not there with what CXL can currently do; we really need to improve the speed of that interface to get to where we want it to be.
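The capacity arithmetic behind the 512-gigabyte and 64-gigabyte figures above works out the same way:

```python
ddr5_channels, mrdimm_gb = 16, 32
print(ddr5_channels * mrdimm_gb)        # 512 GB minimum DDR5 per node

dies_per_stack, die_gbit = 16, 32
print(dies_per_stack * die_gbit // 8)   # 64 GB per HBM stack (32 Gbit = 4 GB per die)
```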
But where might CXL memory help? I think as a first tier of storage it could be really good; we have a prototype of DAOS actually running on our Golden Ticket machine, so that's something we're looking at. But what we would really like is a non-volatile, persistent memory that has significant capacity over and above DRAM and a much lower cost per bit than DRAM. We're still waiting for that memory. Another area where this can be very useful is unconventional accelerators, like neuromorphic accelerators. You're not trying to put plain vanilla computing out there, because the compute nodes on these systems are so powerful that if I put a few ARM cores in a memory controller, yes, I might be able to offload some things, but it's minuscule compared to what the compute node can do. So I do think there are opportunities for unusual accelerators, neuromorphic accelerators with analog memory; that's one of my favorite ideas because it's low power and has real potential.
The other thing, the last thing I want to say, is that CXL has some competition, right? The industry is kind of ganging up on NVIDIA because NVIDIA has a closed system with NVLink. So UALink is being kicked around; it's a specification that looks a lot like AMD xGMI at this point, and that's being opened up, with a consortium behind it. This is an effort to have an open interface for accelerators that operates at these state-of-the-art speeds. NVLink is, what, 900 gigabytes per second and climbing, so there's a lot of bandwidth required there.
The other technology you need to watch is Ultra Ethernet. I worked for Andy Bechtolsheim for a lot of years at Sun Microsystems, and one of the things he said to me was, 'In the end, Ethernet always wins,' and I think he's right about that, for better or worse. If we can scale Ethernet and put the features in there to support more than just RoCE, and support multi-dimensional networks with adaptive routing and some of the other things we'd like to do, this is going to be a direction for future HPC systems as well.
So thank you. I realize I'm a bit over time, so I'll close with one thing: I have some eye candy here. This is an aisle at Argonne, showing the Aurora machine. It's showing two of the eight rows of systems; in all, this is 166 Cray EX cabinets. It's a massive machine, the largest machine we've ever built. Just to give you an idea of the scale, you can see this guy here; that's a person. And of course, I had to show this. All right, thank you.