There's been a lot of good stuff I've learned here at the conference. I'm not gonna call anyone out by name, but it's certainly been a pleasure to hear all the different people talk about their experiences, especially the last one. His name was Andy too, if you were in that one, and it seemed like we were both doing tiering; he kept his hair and I didn't, so I'm losing mine. So I think it's interesting to see how the different camps are following different principles. If you were in that last session, they talked about tiering as being more dynamic, and then there's static and analysis-based tiering. We're gonna talk a little bit here about the dynamic aspects; this is based on a real-world application we put together. So with that, let me kick off.
The purpose of this talk is really to recap a little on what we can learn from storage. Storage has been around doing this for many, many years, right? I've personally been involved in and touched many different aspects of storage over the years, and I'll touch on that a little bit. Memory can learn a lot from storage: scaling, high availability, recovery, RTO, RPO, for those of you who are familiar with the whole recovery objective scenario. I think everyone's glossing over that a little; the blast radius for memory, if you take out a shared memory pool device, has much more dire consequences in many ways than storage did, which was more of a passive thing than memory. And stateful versus stateless, there's all kinds of things, VMware, we won't touch on that here, but there's a lot of different things we can learn there. Topics covered here: we're gonna do a quick refresh on storage area networking, a little bit of that; it's quite a complex set of things that evolved over the years, but we can certainly point out a few things about how caching and tiering work. Then we'll go through a case study of a project that was very near and dear to me in my last company, which is now part of Smart, where we actually developed a dynamic tiering engine for storage, for Linux and Windows applications, and partnered with people like AMD and Dell to distribute it in real-world applications; there are at least about a million seats of this technology deployed. So this is real-world stuff, not a hypothetical.
And then, what are the lessons we can potentially learn going forward? A little bit of background on myself: I come from a background of compute. The Transputer, for those of you who've ever heard of it, was one of my earliest projects, working with massively parallel processing systems in sonar submarines, then going into optical fiber, FDDI, Ethernet, networking, shared storage. I really got into storage around 2000 when we started doing a lot of RAID and storage HA environments, involved with Dothill, for example, and other folks doing semiconductor-based storage devices, and eventually software-defined tiering, which is gonna be the subject here, or software-defined storage with the focus on tiering, which is what we did for the last 10 years. And what I'm involved with now is I run a group of engineers inside of Smart that are developing CXL add-in products. If you saw the E3S, that's what our team produced, and there are add-in cards with DIMMs on them now that can do memory expansion, and we're now exploring the ability to tier between those various components.
So a little bit of a recap on disaggregation and composability. For those of you who don't know, disaggregation has been around for like 20 years; it's not a new concept. Just like a lot of things in storage and memory and compute, new things are really reinventions of old things, just done better, or they now become practical realities; you can actually do them, right, because the industry was able to catch up and do it. So SANs were kind of a first attempt at disaggregation of storage when you think about it, right? The separation of the storage from the compute. And that developed into the whole thing starting in the 90s. And then the word disaggregation really came out around 2013 to 2017, when Intel started popularizing the whole disaggregated server or disaggregated rack. And I remember being at several conferences where this was kind of bleeding edge, you know, will it happen, will it not happen? We're now experiencing and seeing that happen, and CXL has been one of those great adders or enablers of that technology, I think, as we've moved forward here. Composability then started to really take shape. Disaggregation is the act of separation, separating into component parts. Composability is the ability to orchestrate or configure these components into something useful, right? And so NVMe over Fabrics was kind of the first attempt to make that a little bit more dynamic, where you could make it command-line driven from a central brain, as it were, in the system that can now allocate chunks of storage a lot easier than having to have a PhD in SANs or storage, right? You could make it a little bit more OS command-line driven. GPUs became disaggregated around 2020; you see companies like Liqid and other folks like that come onto the scene. And the first demo of memory disaggregation really starts in 2022. So that's kind of the history.
Let's recap a little bit on memory expansion types. Look at a typical motherboard; I'm gonna take you right down to the depths of the server now. You've got standard DIMMs, possibly even custom modules. We as a company were involved in the OMI standard DIMM, for example, adopted heavily by IBM. So there are various ways of adding direct-attached memory.
And then you've got the newer concepts. We've literally, this week in our lab, brought up our first eight-DIMM, half-height, half-length board that you can plug in and that allows you to easily expand memory over CXL on the PCIe bus now. So if you wanna add that half terabyte, one terabyte, up to potentially four terabytes on a card, you can now do it with, quote, plug and play. We're not quite plug and play yet, but we're getting there. The industry's getting there.
The other one, of course, which we've heard about throughout this conference, is the ability to hook up a memory box or pool box. We see that happening a lot in our company. We have Penguin Computing, who's involved on the HPC side, where we have to put together a lot of large memory model stuff now as we go forward. So we look at either one-to-one relationships with a JBOD-style expansion, or a JBOM is probably the best way to look at it, and push in a bunch of memory.
And then as we go out, we've seen, of course, CXL Fabric 3.0. A lot of people think that's where it really starts to come together for CXL in the enterprise. But personally, I see a lot of opportunity in these first two here, just the add-in memory expansion, because we are hitting a limit on how many DIMMs and how much memory you can put on a single CPU. And there are certainly applications breaking that now.
The other one: we've seen various instantiations of this diagram, and here's my own personal rendition of the memory hierarchy. I chose to draw this from the point of view that, if you're in something like a two-socket system, you've got quite a bit of latency creeping in now. And that's the whole point. I think we've heard this before: before CXL, before disaggregated memory, you just had the CPU with a bunch of memory, and that memory was one large blob of memory. Now, what you're talking about here is different tiers of memory with different latencies. And nine times out of 10, some applications may not care, but there are a lot of emerging applications that do care now about where they are running. So when a CPU comes up and just declares this is all one blob of memory, for example, it's clearly insufficient to just assume that I'm okay running down in the far right-hand corner there, where you've got a CXL expansion device going through a switch, which might be something as much as half a microsecond up to a microsecond, maybe, with contention and other things going on, versus am I running out of HBM or running out of local near DDR? You need some element of being able to map the workload to that. Hence the whole interest in transparent tiering and the ability to tier memory is really being talked about a lot.
Okay, so a quick caching refresh. I'm not gonna go over caching theory here, but just to be clear, caching and tiering are not the same thing. I just wanna be clear about that. As you look at caching, and this was a very quick drawing of it, I apologize for the simplicity of the diagram, but you basically have a primary storage tier and a cache tier. When you read, you either have a hit or a miss kind of operation going on. If I hit the cache, I'm gonna get very high speed, high performance out of that hit, that read, the red line shown on the circle there in the cache engine, versus a general read where it misses: I'm gonna pull it from the primary storage at the lower speed, but I'm making a copy in my cache so the next time I come back it's there. So pretty straightforward. A write-through operation is when you write through to the media, but you maybe also make a copy in the cache. And I'm oversimplifying caching here, I appreciate; there's a lot of complexity sometimes just in the whole caching world. And then you have the write-back operation, where you're gonna put it in the cache and then lazily write it back, which gives you the benefit of writing very fast to the cache and then it goes back. So that's caching. And typically caches have a diminishing point of return. In our own world over the last 10 years, we basically switched to fully memory-mapped or page-translation tiering, because it gave you the benefit of having dedicated access to a chunk of storage, and the same will happen for memory here, as opposed to the is-it-a-hit-or-miss kind of thing. The unpredictability of a cache was sometimes just too great for certain workloads. But in most cases it works fine; that's why we see caching so prevalent, and it continues to be. The important concepts here: copies of the data are managed in the cache, and all data eventually ends up on the primary storage. It has to; that's the nature of a cache. It's temporary storage.
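To make the hit/miss and write-back behavior described above concrete, here is a minimal Python sketch; the class and names are purely illustrative, not the product's code, and eviction is deliberately crude.

```python
# Minimal write-back cache sketch: copies live in the cache, and all data
# eventually ends up on primary storage (illustrative only).
class WriteBackCache:
    def __init__(self, primary, capacity):
        self.primary = primary          # dict standing in for primary storage
        self.cache = {}                 # block -> (data, dirty flag)
        self.capacity = capacity

    def read(self, block):
        if block in self.cache:                     # hit: fast path
            return self.cache[block][0]
        data = self.primary.get(block)              # miss: fetch from primary
        self._insert(block, data, dirty=False)      # keep a copy for next time
        return data

    def write(self, block, data):
        self._insert(block, data, dirty=True)       # write-back: flushed lazily

    def _insert(self, block, data, dirty):
        if block not in self.cache and len(self.cache) >= self.capacity:
            victim, (vdata, vdirty) = self.cache.popitem()   # crude eviction
            if vdirty:
                self.primary[victim] = vdata                 # flush dirty victim
        self.cache[block] = (data, dirty)
```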
Let's turn to transparent tiering. There are various approaches; I'm gonna skip over a bunch of stuff for Linux. You have NUMA-based tiering and load balancing, and you have the whole application-based tiering. In this case, what we chose to build was a fully transparent tiering model. That meant the application and the operating system had no idea they were actually talking to a tiered device, right? It was totally down in the lower layers that this device would make its decisions, and I'll show you a bit of the architecture on that in a second. So you've got a fast tier and a slow tier. And in general, all of that appears as one large bunch of memory, right? Or a bunch of storage. I'm gonna use storage as the example here because that's what we actually built. So you've got one terabyte in the fast tier, 10 terabytes in the slow tier, for example: you have 11 terabytes of available storage. It's no longer a copy. It's actually islands of storage that you're managing data, fractions of data, between. And then in the background, you have this background tiering engine concept, which we'll talk about in a second. So when you read, it's a mapping operation now. When you're doing transparent tiering, you're actually consulting a lookup table and you're saying, which one do I need to go to to get that data? And you want to make that lookup as low latency and as fast as you possibly can. You have a lookup, a page translation table, very much like memory does today; this is of course applied to storage. And then you have the write operation, the same thing: you look it up and you say, am I writing to the fast tier or am I writing to the slow tier? But the tiering, more importantly, is done after the fact. It's a balancing operation that happens after the fact. The reason you do it after the fact in a lot of this particular set of applications is you want the data to land where it lands and then you learn over time. Certainly, certain workloads can't tolerate that. That's the benefit of a write-back cache: you get the instantaneous benefit of writing to a cache if there's room. Whereas in tiering, you land where you land. And if you've done your predictive technology right, you're gonna end up with an island of storage you're reading and writing to that's already in that fast tier. So that's the difference between caching and tiering at the very, very high level.
Data, the important concept is data is split across fast and slow. So you can have a file that maybe has 10% on the fast tier, 90% can live on the slow tier. But the virtualization layer in that virtualization engine will make sure it still appears as one continuous block of a file or storage or memory.
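Here is a minimal sketch of that idea: one virtual address space spread across a fast and a slow tier, with every virtual page mapped to exactly one physical page and no copies, so capacity is the sum of both tiers. The names and the initial fill-fast-first layout are assumptions for illustration, not the actual engine.

```python
FAST, SLOW = 0, 1   # tier ids used throughout these sketches

class TieredVolume:
    """Illustrative page-granular transparent tier, not the product's code."""
    def __init__(self, fast_pages, slow_pages, page_size=4 * 1024 * 1024):
        self.page_size = page_size                         # e.g. 4 MiB slabs
        self.store = {FAST: [None] * fast_pages, SLOW: [None] * slow_pages}
        # Initial layout: fill the fast tier first, spill over to the slow tier.
        self.vmap = [(FAST, i) for i in range(fast_pages)] + \
                    [(SLOW, i) for i in range(slow_pages)]

    def read(self, vpage):
        tier, ppage = self.vmap[vpage]     # one table lookup on the I/O path
        return self.store[tier][ppage]

    def write(self, vpage, data):
        tier, ppage = self.vmap[vpage]     # data lands wherever it is mapped;
        self.store[tier][ppage] = data     # rebalancing happens later, in the background
```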
I'm gonna go through this one pretty quickly: a disaggregated storage refresh from a SAN perspective. I have to confess, sadly, I did live through the floppy-net era, where the only way to get data reliably between machines was to literally copy it on a floppy, which ended up being a USB key, of course, nowadays. And you still get a lot of that going on. Then you went to LAN-based drive copy. By the time I ended up in the block storage, Ciprico Dothill world, we were starting to look at file-based extent tiering. So tiering started to evolve around the early 2000s; there were various people doing it, but it was still very file-oriented and involved, really part of a backup process, and in many cases you were just copying stuff out. To the extent that people started to realize we need to stop moving data around; that's why tiering really came into existence. I really don't wanna keep moving data, I just wanna move the bits I'm using. I don't wanna have to keep pulling that big file up every time I need to use it, extract what I need, and push it back down. Here, I'm just pulling up the pieces I need. File extent tiering was the first attempt to break it down into pieces, so I could just pull up the pieces of that file I needed into the fast tier. So there have been various evolutions of that. Then eventually you pushed into JBODs and the SAN appliance. Around the late 2000s, we were starting to build tiering into the box itself, the SAN box.
So that's a little bit of a refresher; it's kind of evolved over time. The most important concept, and I think this is obviously being missed in a lot of the discussions on CXL today, is high availability. This is one of the things that SAN pretty much figured out before we got to the more scale-out architectures. With the Google-style architecture today, I can assume a complete failure of a node; therefore I have multiple copies around the place. Before that, we lived with the whole concept of high availability, meaning this unit needs to stay alive 24/7. And the only way to do that was to duplicate the power supplies, duplicate the controllers, duplicate the switches, duplicate everything, essentially. So you can see it gets pretty complex the way you end up wiring this thing up. You get access from one compute node down to the shared storage via dual porting; this is where a dual-ported device comes in. That dual-ported device, being an SSD, needs to talk to either controller A or controller B through switch A or switch B. So if one path ever goes down, you have this alternative path. This is HA, simplified probably grossly here, but that's essentially what HA is about.
So let's move on to what we were doing in tiering. With that backdrop, what I want to do here is just walk through the implementation we did over the last several years. I was personally involved in writing most of the code and the architecture for this, so I know it well. We chose an architecture for tiering that was designed for simplicity. AMD StoreMI was basically a consumer version of this that they adopted. We developed this originally for the Dells and the HPs of the world, distributing it as an alternative to caching through the channels. So we got deployed in certainly a number of data center applications, but the biggest one, where we got most of the volume, was one where we needed a bootable, very simple architecture. And what drove this simplicity and plug-and-play was consumers. You cannot go to consumers with a complex architecture and 10 hours of instructions on how to put it together. And we had noticed a lot of early tiering involved two people going into the site to get it configured and done. So we built this transparent tiering architecture back in, what, 2011 I think is when we had this first running in the lab. And we quickly came down to the key components. The first one was really page virtualization: the ability to virtualize and masquerade, or emulate, as a block device to the operating system or the applications, but then take over all the devices below you. And that can be just a simple SSD and hard drive combination, or SSD and SSD combination. So you get the page virtualization. We later added auto-discovery and classification because we found some of our customers were putting the hard drive in the fast tier, for example, and the SSD in the slow tier, unwittingly; we had no idea. So they were getting reverse tiering going on. They had no idea why their system was automatically slowing down, which was not the intent. So this is very much a bottom-up design approach, where it's like, oops, we'd better fix that one. So we added auto-discovery and classification. I think for memory it's a very similar thing. You can't just trust the NUMA tables; you really have to go in and see what real-world performance you're getting out of this tier. What is the real latency that I measured, not what I'm just trusting from an ACPI table, for example, buried somewhere in the kernel. So auto-discovery became a very important component. Page virtualization, then hot page tracking, and then ranking. We had to develop a whole scheme for tracking what the hot pages are, and the cold pages more importantly. And then the whole thing about migration was to preserve as much capacity as you could for the operating system. So instead of just reserving a whole block of fast memory, expensive SSD, for example, we decided we wanted to exchange. We came up with a whole mechanism to exchange the two areas for the hot and cold as part of that, so you're automatically displacing a cold page with a hot one. It's something that just dynamically keeps going until it balances out. And essentially, once this thing balances out, we observed something like four minutes of intense activity in some applications, and then zero for 24 hours, because the thing would then be balanced and it would be accessing the fast tier. And at that point you can pretty much step back. The beauty of not being a cache at that point is you're just doing a translation of LBAs in the case of storage, or for memory, a translation of memory accesses.
That's all you have to do and just keep tracking statistics. We'll get to that in a second.
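Before getting to the statistics, here is a minimal sketch of the auto-discovery and classification step mentioned above: measure each device rather than trusting how it was declared. It assumes the caller passes file-like device handles (for example, block devices opened unbuffered); a real classifier would use direct I/O, more samples, and write probes as well.

```python
import random
import time

def classify_tiers(devices, probes=64, block=4096):
    """Rank devices by measured random-read latency, fastest first.
    'devices' are file-like objects opened by the caller, e.g.
    open('/dev/nvme0n1', 'rb', buffering=0); names here are examples."""
    results = []
    for dev in devices:
        dev.seek(0, 2)                      # find device size
        size = dev.tell()
        start = time.perf_counter()
        for _ in range(probes):
            offset = random.randrange(0, max(size - block, 1))
            dev.seek((offset // block) * block)   # block-aligned probe read
            dev.read(block)
        avg = (time.perf_counter() - start) / probes
        results.append((avg, dev))
    results.sort(key=lambda r: r[0])        # lowest latency -> fast tier
    return results
```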
Then the important thing after that was APIs. Over the five, six years after that, we started getting better at adding in these layers. So you ended up with something that looked like this. On the left-hand side, you can see the file system all the way down to the EFI BIOS drivers. We even had to come up with a virtualized boot at the EFI layer because somebody wanted to be able to boot this as a boot drive. So you had to understand your virtualized environment and make that into a boot volume that Windows or Linux could boot. But you can see you're essentially dropping this into an OS environment in the block layers. And we chose to go 100% kernel because, again, transparency; remember what our objective for transparent tiering was? You don't want the user to know what the heck's going on, in theory. If you can do your algorithms right, you want them to just drop this in and this thing will figure life out itself. So kernel was the best way to do it. And it also gave us access to a lot of the low-level APIs needed to keep the performance. We were able to do this, by the way, and keep NVMe performance at NVMe performance. In fact, slightly higher in some cases because our queuing sometimes got a little bit better, sometimes worse, but it was within about five to 10% of what the native performance would be. So you could take a Gen 5 SSD today, for example, put it with a hard drive, and you'd see Gen 5 performance for a lot of these applications, because it would slab-relocate most of the workload up there and then you'd operate off that fast tier. So that's what the tier looks like. You present a number of virtual block devices and essentially get yourself access. And it became more important to be able to get access through RESTful JSON kind of sideband tools, to be able to see and get visibility into what's going on; we'll talk a little bit more about that. The other key components: we talked about mapping, micro-tiering, and then the policy-driven stats. One of the other important aspects of this was to develop a policy engine that the user could tune if they wanted to. You had defaults out of the box for certain applications, but what was really useful was they used to come and say, well, hang on, I'm more heavily write-oriented, or I'm more IOPS-driven versus bandwidth-driven, or I'm endurance-driven. So you could put endurance policies in there: hang on, this tier is actually a low-endurance tier versus a high-endurance tier. So all of a sudden this architecture became much more useful in intelligently mapping.
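As a rough illustration of the kind of tunable policies described here, a hypothetical set of profiles might look like the following; the knob names and weights are invented for the sketch, not the product's actual configuration schema, and something like this could be exposed over the RESTful JSON sideband interface mentioned above.

```python
# Hypothetical policy profiles (illustrative names and values only).
POLICY_PROFILES = {
    "default":     {"promote_on_read": True,  "promote_on_write": True,
                    "read_weight": 1.0, "write_weight": 1.0},
    "read_heavy":  {"promote_on_read": True,  "promote_on_write": False,
                    "read_weight": 2.0, "write_weight": 0.5},
    "write_heavy": {"promote_on_read": False, "promote_on_write": True,
                    "read_weight": 0.5, "write_weight": 2.0},
    # Endurance-aware: keep write-hot pages off a low-endurance fast tier.
    "endurance":   {"promote_on_read": True,  "promote_on_write": False,
                    "read_weight": 1.5, "write_weight": 0.2},
}
```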
So a little bit of an insight into this; sorry for the blob, hopefully you can download this presentation now, I think you can get access to all the presentations online. You can see the basic flow from left to right is the host IO. I'll see if we can get a pointer going here. The host IO, you've got the data path as it flows through, there's your mapping layer, and there are your different components of block storage. You can have an SSD and hard drive, or SSD and SSD, or a SAN device in the case of the enterprise world sitting off here, all running on the host within the kernel. This is all a kernel kind of viewpoint here. You can see what happens is, as part of the LBA command control path, you wanna keep your data path as clear as possible, unencumbered, and you wanna really just capture as much as you can of the statistics. And you'll see, in memory, there's a lot of discussion going on right now about where those statistics should be collected, because one of the things about storage was you had time to collect statistics; even SSDs are slow compared to memory, dramatically slower. In memory, you don't have time. In fact, you're interfering with the whole flow if you start to use the host itself to collect statistics about what it's doing. So there's a large debate going on about where you store this, but you generally need to store a table, an access pattern table, of what those IOs look like. How much memory am I using? How focused is it? You need to put that somewhere. And in this particular architecture which we built, that's done in RAM and then echoed out to, sorry, stored in, metadata on the drives themselves. So you have this kind of constant loop. Once you come off here, you pick off and collect the statistics, you get a page-based statistics table here in RAM, and you then go into this analyze-modify-repeat loop. In our case, we just had a two-second tick that ran in the background. And what it would do is look at the statistics and look for a rebalancing opportunity. And then, rather than trying to go and interfere with the IO process, it would actually go off and schedule that. It would say to a data movement engine, okay, it's really better that these guys here were on the fast tier and these were down on the slow tier, and that kicks off the whole exchange. Then you sit back; it's a background task. Again, I emphasize background task, because you don't want to interfere with the data flow in a high-performance situation. The downside of that is there's a latency, a time it takes to react to a data pattern. And clearly, by the way, just while we're on that topic, you end up with some scenarios which just don't make sense for this architecture at all. And in fact, we used to say that to people: if you have a hybrid architecture and you want to make it do IOPS, like an IOmeter full random sweep of the whole volume, just go buy all SSDs for that application if that's really what you intend to do. It's not a real-world application, but this is really good for tight-locality, mostly-reads kinds of applications.
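A minimal sketch of that analyze-modify-repeat loop, assuming a simple per-page access count rather than the real statistics format: every tick it finds hot slow-tier pages and cold fast-tier pages and hands the exchanges to a separate data-movement engine, staying off the I/O path. All names and the fixed batch size are assumptions.

```python
import threading
import time

FAST, SLOW = 0, 1   # tier ids, as in the earlier mapping sketch

def tiering_tick(stats, vmap, move_queue, interval=2.0, batch=8):
    """Background rebalance loop (sketch). 'stats' maps vpage -> access count,
    'vmap' maps vpage -> (tier, ppage), 'move_queue' is consumed elsewhere."""
    while True:
        time.sleep(interval)                                  # e.g. a two-second tick
        hot_slow = sorted((p for p in stats if vmap[p][0] == SLOW),
                          key=lambda p: stats[p], reverse=True)[:batch]
        cold_fast = sorted((p for p in stats if vmap[p][0] == FAST),
                           key=lambda p: stats[p])[:batch]
        for hot, cold in zip(hot_slow, cold_fast):
            if stats[hot] > stats[cold]:                      # only exchange if it helps
                move_queue.append((hot, cold))                # data movement happens later

# Run it off the I/O path, e.g.:
# threading.Thread(target=tiering_tick, args=(stats, vmap, moves), daemon=True).start()
```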
Virtual page statistics, I'll just briefly touch on this, the way this kind of works here, just to decode what this drawing is: you've got the virtual drive, which consists of a number of virtual pages that are mapped into fast tier pages and slow tier pages. These map to the physical devices; this is your virtual device. The operating system only sees this guy here. So, you know, zero, zero-P is really an indicator of I'm on tier zero, page P, in the fast tier. And this one here, for example, is currently mapped to one-X, you know, on the slow tier, or the second tier, on page X. And you can see it's a very simple mapping technique. The trick is to do that as fast as you possibly can: get in, get out, map that thing. And then in the background, you're moving stuff independent of the host. You obviously look for opportunities where the host's not busy to do that kind of stuff; you're basically trying to do this in the gaps in between if you can, ideally, or if you really have to hold off the host while you temporarily move stuff and get out of the way, it is beneficial. And that's where some of the cleverness of the algorithm starts to come in. And then just some points I did more for the handout slides than anything else: you basically have pages in various states. I won't touch on them here, but there are those that are heating up and there are those that are cooling down, and you just have to keep track of those, and there are statistics for keeping track of that. So the point of this is you can see there's a lot going on in a tiering engine behind the covers, under the hood, as it were.
The other lesson learned, by the way, and as you go to memory this gets acutely worse, is that the operating system, NUMA, and the whole allocation of processes get pretty complex. We found, when we did this just for purely NVMe, that CPU and NUMA association becomes a big deal, even for NVMe drives, when tiering. Because if you're not careful, the driver, for example an NVMe driver, is usually associated with the CPU that's closest to the PCIe attach point, because they don't want to have to keep going back and forth through a hierarchy with the driver. But your process might be running on a totally different NUMA node. So you've got issues potentially coming up where you don't really know where you're going to be assigned in a truly transparent environment. Remember, you're trying not to touch the system; you're trying to embed yourself in and be clever and hide, as it were, in the background. So it's gonna be interesting; one of the lessons we're gonna have to go through here on memory is just how much you can get away with without influencing where NUMA balancing comes in versus where you come in. And Linux, of course, is popularizing that right now with NUMA load balancing. But we found it really is a case of trying to get the best affinity you can with where the memory and the storage tables are; even our lookup tables got put on a different CPU, for example, if we just let the OS do all the allocation.
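As a best-effort sketch of that affinity concern on Linux, the snippet below reads which NUMA node an NVMe device hangs off via standard sysfs paths and pins the calling process to that node's CPUs, so lookup tables and the device stay close. The device name is an example, and this is illustrative rather than how the product did it.

```python
import os

def pin_to_device_node(block_dev="nvme0n1"):
    """Pin this process to the NUMA node of a block device (Linux, sketch)."""
    with open(f"/sys/block/{block_dev}/device/numa_node") as f:
        node = int(f.read().strip())
    if node < 0:                      # -1 means no NUMA locality reported
        return None
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpulist = f.read().strip()    # e.g. "0-15,32-47"
    cpus = set()
    for part in cpulist.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    os.sched_setaffinity(0, cpus)     # 0 = the calling process
    return node
```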
So unfortunately, there are some trade-offs there you've got to go through on where this stuff can live. The statistics table, just a quick peek into what that is, is basically a bunch of mega-region counters and virtual page counters. We had mega regions, and we had local regions for the pages, and you're tracking reads, writes, read blocks, promotes pending, total promotes on the high end, and then you're tracking the same kinds of things on the virtual page. And a virtual page might be a region of four megabytes, for example. So you take four megabytes, you keep counts and statistics on all of those, then you consult those with the tiering engine: okay, what am I looking like? What does my hotness curve, or my hot pages, look like? The other important thing, which is often glossed over, is the rigidity controls. We had to build in certain controls where we said, hang on, we know this region here is used heavily by the OS, we don't want to move it, we don't want it to keep jumping around. So you also have to be clever with the way you allocate these pages, to say: you're sticky, you're not, you're kind of sticky, you can live there for a while. So we got to the point where we even had to build in that kind of mechanism from a rigidity standpoint.
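A rough sketch of the counters just described, with an added stickiness field for the rigidity controls; the field names are illustrative guesses based on the talk, not the actual metadata layout.

```python
from dataclasses import dataclass, field

@dataclass
class PageStats:
    """Per-virtual-page counters of the kind described above (sketch)."""
    reads: int = 0
    writes: int = 0
    read_blocks: int = 0
    promotes_pending: int = 0
    total_promotes: int = 0
    stickiness: int = 0          # 0 = free to move, higher = harder to displace

    def heat(self, read_weight=1.0, write_weight=1.0):
        return read_weight * self.reads + write_weight * self.writes

@dataclass
class MegaRegionStats:
    """Coarser counters rolled up over a group of virtual pages."""
    pages: dict = field(default_factory=dict)   # vpage -> PageStats

    def heat(self):
        return sum(p.heat() for p in self.pages.values())
```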
So the policy engine, finally, was one of the other areas we focused on to make it easier to tune to specific applications: promote on reads, promote on writes. Those are the easy ones; you maybe heard the terms if you were here in the last session, the big concepts of promote and demote. Promote means you're being pushed up to the fast tier, demote means you're being demoted from the fast tier. In the case of transparent page tiering, it's a page that's being demoted. You don't know if you're dragging along multiple pages of other files, especially with storage; you don't know. So we used to call it slab relocation. I mean, people talk about cache lines, 32K cache lines. This is really a slab: you pull out a slab, you pull it down there, and yeah, you're dragging a bunch of stuff with it. But statistically, you ended up with better performance in general for many of these applications. So again, trade-offs; you've got trade-offs there. So those policies help you control that a little bit better. Pinning was a really important one for us, and intelligent pinning: the ability to learn and then go in and retrospectively pin certain pages to the fast tier or pin them to the slow tier. There are certain things you don't want. The example we used to always give was an MP3 file that's played very frequently should not be taking up all the premium resources on the top tier of an SSD, for example, versus a slower tier, because you don't need it, even though it's played frequently. So you've got to be careful of those kinds of things. So we used to have the override mechanism. We say, hey, you stay down there; you're not allowed to come up here. So pinning is a very important thing; page locking, as we're referring to it here.
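Tying the policy profiles and pinning together, here is a minimal sketch of how promotion candidates might be selected while honoring pins; it reuses the PageStats counters and policy keys from the earlier sketches and is an assumption about the mechanism, not the product's algorithm.

```python
FAST, SLOW = 0, 1   # tier ids, as before

def promotion_candidates(stats, vmap, pins, policy, limit=8):
    """Pick slow-tier pages to promote, honoring explicit pins (sketch).
    'stats' maps vpage -> PageStats, 'pins' maps vpage -> FAST or SLOW for
    pages locked to a tier, e.g. the frequently played MP3 kept on the slow tier."""
    def score(v):
        s = stats[v]
        return ((policy["read_weight"] * s.reads if policy["promote_on_read"] else 0) +
                (policy["write_weight"] * s.writes if policy["promote_on_write"] else 0))

    movable = [v for v in stats
               if vmap[v][0] == SLOW and pins.get(v) != SLOW]   # pinned-down pages never promote
    return sorted(movable, key=score, reverse=True)[:limit]
```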
So there's a lot going on. Finally, the thing that got really interesting, and I think gave people some excellent insights, was seeing what's going on. I think MemVerge have a really nice tool for this too, with visibility into where all the processes are using memory. This was our version of it back in, let's see, about 2015, when we produced the first iteration of this, where you could go in and look at the workload bursts, the long-term activity, and start logging stuff as far as three months back of how this thing's been behaving. And what we started to observe with our customers was: what's going on at 4 a.m.? Why is it doing what it's doing? And they started to uncover a lot of these background tasks going on on their systems, causing numerous issues with the tiering engine, because that was maybe a maintenance task where you need to shut off the tiering during that time. The biggest headache we had was virus scanning for a while. It would go through and start touching all these files, and you go, hang on a second, I need to ignore what you're doing. So you do need some element of handshake. Even though we're transparent, it's good to have those hints. So we had to develop the whole concept of hints, or something running in the user domain that could actually say, hang on, ignore this activity right now. So there's a lot of complexity behind this, but this tool was useful to be able to see both time-based behavior and, more importantly, behavior across the volume itself. The nice thing about mapped tiering is that you can see what's going on in this part of the drive, that part of the virtual disk, that part of the virtual disk, and see how much it's shifting over time. And if you ever play it back in real time, it gives you a nice kind of fluid motion of what's going on on the system.
And finally, I really do have to evangelize a little bit more about the future of memory in terms of the HA appliance. We talk about it; the question is whether we can live with the hyperscale kind of approach, or the cluster environment where you're replicating data. Replicating memory? We're trying to get away from moving data, right? That's one of the things we're trying to do here. So the question is, does this HA creep back in? And it's more of a question than a probability here. Do you need two CXL switches, or do you need that kind of hierarchy again? Do you need multiple controllers in there? So we're going through, kind of looking at that and trying to figure out, well, okay, you can do all the tiering and stuff behind there, you can do a nice little offload engine, you can do this stuff. But apart from all that, you've got HA as a real environment to consider. And when you start to think about tiering in that environment, complexity starts to go through the roof again, right? Because now you're tiering duplicate copies of things, or you just let them autonomously operate, which we chose to do in this case: just let them autonomously tier both of the copies and figure out life themselves.
So to wrap up, a couple of things we learned along the way. Our kernel-based VMAP, we called it, a virtual map: it was a huge table of the translation between virtual and physical. It would get messed up occasionally, and you get very angry customers when that happens. And you had to build in a lot of what I call the VMAP repair nightmare, which kept me up at night many, many times in the first iterations of these things, trying to figure out what the heck happened to this customer here. And it'd be a power loss scenario coupled with something else going on, coupled with something else. But I'm glad to say that about two or three years into this thing, we developed ways to make sure, with journaling methods for transferring data between the slow and the fast tiers and vice versa, that you could replay anything and you never lost any data, right? That was the key: do not lose data. Rule number one in any storage company, you don't survive long if you lose data, right? So that was the first thing we had to get solid. The VMAP introduced another complexity there because you're no longer one-to-one between what the operating system and the application see and what's on the devices. The next one was processor affinity. I mean, we scratched our heads sometimes, especially when AMD Threadripper came out, with multiple processors, and it had some funky mechanisms in the early years. Why are we going slower? This is supposed to improve things. And you suddenly realize that even though I was running off, quote, the fast tier, I was now going through two or three layers of cores and such, because things were getting spread out in ways that were not necessarily friendly to the I/O engine. So that's where we learned about affinity, and we obviously had to rework pieces of the tiering engine to get that parallelism and multi-threading going. Remember, NVMe was the first environment where you had multiple threads, multiple OS and application threads, talking to the same device. Before that, AHCI and SATA really were single-threaded when you look at it. So I think this was the first instance where you really lit up this big engine of multi-processor and multi-threading. So that was another thing to deal with. Now, memory is an order of magnitude more; I think it's gonna be very interesting to see how we deal with that, and that's one of the areas my team's going through and looking at now. Translation of I/O access: one of the things we're talking about is the table, where it lives; it gets pretty big. The beauty of memory is there's already a page translation table, whereas in storage we really didn't have one, so we had to invent our own. So that's gonna be interesting to see how we play. We don't wanna reinvent the wheel. There's a lot of good stuff going on in the Linux community today to solve that. But how do you add the value-add aspects of tiering, the policies, for example? That's where you can differentiate as a product guy, as a guy trying to ship a product to the market. So those are gonna be interesting to work through. Low-level media device, SSD housekeeping, and block migration for us: less of an issue with CXL memory, but we also make, as Smart, NV devices, we make SSDs, and all of that stuff eventually ends up on CXL. I think that's the general viewpoint here.
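A grossly simplified sketch of the journaling idea described above, write an intent record durably before moving anything, then a completion record afterward, so a crash mid-exchange can be replayed or rolled back on restart. It reuses the vmap/store shapes from the earlier sketches; a real implementation would stage the copy and update the map atomically.

```python
import json
import os

def journaled_exchange(journal_path, vmap, store, hot, cold):
    """Write-intent journal around a hot/cold page exchange (sketch only)."""
    entry = {"op": "exchange", "hot": hot, "cold": cold,
             "hot_loc": vmap[hot], "cold_loc": vmap[cold]}
    with open(journal_path, "a") as j:
        j.write(json.dumps({"intent": entry}) + "\n")
        j.flush(); os.fsync(j.fileno())        # intent is durable before any data moves

    ht, hp = vmap[hot]
    ct, cp = vmap[cold]
    store[ht][hp], store[ct][cp] = store[ct][cp], store[ht][hp]   # swap the data
    vmap[hot], vmap[cold] = (ct, cp), (ht, hp)                    # then swap the map

    with open(journal_path, "a") as j:
        j.write(json.dumps({"done": entry}) + "\n")               # replay stops here on recovery
        j.flush(); os.fsync(j.fileno())
```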
So you're gonna have to deal with multiple types, not just memory. You've got CXL.mem, you've got CXL.io, you've got CXL.cache, you've got all kinds of modes, as well as different kinds of media with different reactions going on there. So it's gonna be important for us to figure out how to avoid interfering with the low-level intelligence that's going on. And lastly, no one size fits all. We thought we did a great job of the tiering engine, and yet we'd get customers showing, look how bad it is over here. So you're always gonna get that; you have to shoot for the 80%, if you can, if you're developing a product. And so in the end, that's why we ended up largely in the gamer community, funnily enough, because that's where it's very predictable. You pull in a large chunk of data, then they operate out of RAM most of the time. So all they wanted was to load the next screen or load the program as fast as they could as they context-switch between the different things going on on a PC. But when we got put in, for example, I think it was, who was it? Comstream, I think they were public when they talked about it. When they were doing stuff with petabytes of data captured into a federated server environment, using it as a means to lower the cost of storage, because the whole benefit of tiering is you can put cheap storage with a small amount of expensive storage; they wanted to be able to capture that and process it. Well, they did show they got like a two to three X improvement, but occasionally you'd get certain traffic patterns which would destroy your tiering engine. So it's no one size fits all. So I think it's going to be a combination of what Andy, I think, said in the last session, where you're having to analyze your workload environment and just see if it is a candidate for tiering or not.
One little plug for the OCP CMS group. You know, there's a lot of different things going on right now, but this is where there's a fairly healthy discussion going on about composable memory systems. And they are actually working on a draft specification.
You can see more of the OCP and Smart here, just a small plug for Smart. What my team has been working on is obviously the first E3S modules, which are now here. We were demonstrating it just around the corner here in the hackathon yesterday. The one on the right is a mechanical sample, but we actually do have the real thing in house now. And we're starting to see a lot of interest, funnily enough, in the one on the right: a simple memory expansion, the ability to add in an adapter card with a bunch of DIMMs, throw it in your system, and get a nice plug-and-play way to get that up and running. And there'll be plenty more where they came from.
So, okay. So that's it. Do we have time for questions? I guess, yeah, I guess we do. We've thrown a lot of stuff at you, so any questions? Yeah.
Say again, in the upstream? No. Yeah, this implementation here was a totally self-enclosed blob, a closed blob with an open source wrapper; that's how we handled our Linux side. But as we go forward into the memory world, we're starting to take a lot closer look at what's going on there. So we haven't yet worked openly in that area, but it's certainly something we're focused on. The question, sorry, was are we doing any work in, or have we been involved in, the upstream stuff. I think there's gonna be a lot of stuff in the upstream Linux kernels; there's been a lot of good work going on there, and our plan is not to replicate but to augment that effort, and build more of the tools around it. 'Cause I think what we found in our experience was the core tiering engine itself is, I don't wanna say simple, 'cause I lost too much hair developing parts of what we had to do, but it's really the tools and management and the ability to plug into your environment that are the bits where it's gonna get interesting. So for hyperscale environments, I think they've got their vertical. For general purpose enterprise, which we were focused on, you don't know what application you're gonna be shipped into; that's what makes it a little bit more complicated. And then you have the Windows problem. We developed this same engine for Windows and Linux, so we could cross-port between the two; it was important for us to be able to go back and forth. Now going forward, I think we're gonna see a lot of the early work here on Linux and tiering.
Yeah. Okay. Yeah, so the question was, we only have so much memory to track statistics and keep track of what's going on in the system, and hence one of my comments about interference with yourself, with the application. If you're running applications out of that same memory, it becomes quite problematic. So we had two methods. For the smaller capacities, we had a paging system that could handle up to something like four petabytes of storage with about two gigabytes of RAM for statistics keeping. We kept a fairly efficient kind of block structure if we could. Then we had a super-block kind of concept where we would keep more detailed statistics where most of the activity was going. So you had to do a two-tiered system; you can't keep track of everything. In the end, we could get away with it with the first generation, but as you go towards the petabyte threshold or the hundreds of terabytes — I think our largest deployment was like 256 terabytes back in 2017, 2018, when we did the first implementation of this — that took up about a couple of hundred megabytes to just about a gigabyte of RAM. The server didn't care; that was fine. You try to go to petabytes and beyond, you go to a switching architecture with the memory. The good news about memory is there's less of it; you're talking about 32 terabyte boxes being proposed today for external boxes. That doesn't take a whole lot if the granularity of your pages is fairly large. So we ended up tuning: we had a one, two, four, eight megabyte option you could go to on your page size. The bigger you go on the page size, the smaller the tables are, because you're keeping track of a bigger chunk of the memory, but that's a trade-off. So we were moving towards: do you do a smaller chunk for the active areas and larger chunks for the inactive areas to optimize your memory? But then you have to start playing games, such as how do you keep that statistics table small by doing more of an activity curve, almost using your hotness map to define how much detail you keep.
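The page-size trade-off described here is easy to put numbers on: one tracking entry per virtual page, so doubling the page size halves the table. The bytes-per-entry figure below is an assumed value for illustration, not the product's actual record size.

```python
def stats_table_bytes(capacity_bytes, page_size_bytes, bytes_per_entry=16):
    """One statistics/map entry per virtual page (bytes_per_entry is assumed)."""
    return (capacity_bytes // page_size_bytes) * bytes_per_entry

TiB, MiB, GiB = 1024 ** 4, 1024 ** 2, 1024 ** 3
for page_mib in (1, 2, 4, 8):
    size = stats_table_bytes(256 * TiB, page_mib * MiB)
    print(f"{page_mib} MiB pages -> {size / GiB:.2f} GiB of tracking RAM")
# Prints 4.00, 2.00, 1.00, 0.50 GiB for a 256 TiB volume: doubling the page
# size halves the table, at the cost of coarser hot/cold tracking.
```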
Okay, all right, oh, hey, Jonathan. No. Not yet, not yet. Memory, we're just beginning the journey with memory. I think we've always taken the approach that we play with what's available rather than too much theory. So I think we're just starting by building it first: how much can you put in a box? It's about 32 terabytes if you look at the math today, sensibly, right, with a DDR5 remote box behind CXL, maybe 64, maybe 128. You're not talking petabytes, though, right? But I think it's gonna be interesting, because the other ceiling you're hitting is the ability of the OS to address memory; there are caps on how much Linux can address, for example. So how big you can go is gonna be dictated by the application and many things like that. But I don't really have a good feel for it yet, Jonathan, not yet, but it's something we're gonna be looking at; one of the team members is going through it right now. Good question.
Okay. I think that was it. All right, thank you so much, appreciate it. Have a good day.