Okay, so I'll go quick here because I'm the last and I'm basically holding you back from your post-evening activities, let's call it that.
So CXL is going to enable a large population of new silicon in the data centre. Over the last year or so, Meta published a few papers complaining about how, at scale, they're seeing data corruption events, they're seeing CPUs calculating answers wrong, they're seeing this non-zero probability that you're going to get something back that's sick. One of the reasons I'm going to go on a bit of a rant here is that with CXL there are a lot of new players building a lot of new silicon, and you're going to have a bad time if we don't take into account some of the data integrity discipline that was developed in storage; you'll hear this storage theme. In storage, for the last 30 years, there was a requirement of no user data corruption, and that required certain design techniques and capabilities which are never written down. It doesn't say anywhere in the CXL standard that you really should design for data integrity, but if you don't, when you deploy at scale you're going to have some unhappy customers. So, the sins of data integrity: silent data corruption, data corruption, data loss, and misdirection. Marking corrupt data is good, and it's more than just the ECC on a DRAM interface. These things are allowed to happen; it's just that you have to detect them and inform someone, such that the cloud providers can do amazing things in the software stack to mitigate these errors. And it's a journey. I know there are people out there with silicon right now who are kind of just learning this, because when you take silicon to scale and you put a million things in the field, that's going to be a problem.
So CXL: it's a little bit more than an interface to a processor; it pulls things apart. But like most connectivity standards, CXL leaves a lot of solution-specific RAS features to the architects and implementers. Some of the working groups, CMS being one of them, and the working group within CXL on memory controllers, are trying to fill in these gaps in the background, but fundamentally there's a cost/performance trade-off, especially on the reliability side: how much are you actually going to add to your solution to cover reliability? And it's interesting: as I talk to some people who are entering the CXL market, they are not familiar with the requirements for reliability.
You know, it really comes back to how we make silicon, and making silicon is a probabilistic process. We don't have a process for making chips that's guaranteed 100% of the time to produce a perfect product. If I built a subdivision full of homes that way, 10% of them could not be inhabited; I'd throw them away, and then some time later someone would show up and say, "My door sticks in my house," and I'd say, "Sorry sir, you got a bad house, you have to move out. We'll give you a new house." That's really what it is: we have all these processes, like testing for stuck-at faults and testing for transition faults, but fundamentally these are based on models. We don't test every possible stuck-at fault within the VLSI. It's this overlapping test process that eventually gets you to a product with some non-zero rate of dead parts per million, and you see that at scale: if you're at 300 dead parts per million versus 10, your customer is going to know about it.

On top of that, we have processes like ECC on memories for soft error events; soft errors are a real thing. There's a lot of radioactive rebar in the world today, mainly because of dental units back in the 70s. I can tell you there are some data centres, especially in the private cloud, where the ECC on my memories pops off weekly within a single data centre. You'll see it. The cloud guys see it. And the last thing on the list is called data integrity. This is where we add transistors to protect our data paths. It's a light protection; it may be ECC, especially for memories, which are very sensitive, but even on the data path hardware we'll add essentially parity to protect against data corruption events. Normally we would say this is designing for soft errors. If you think about it, in low earth orbit you have one satellite, and it may experience a few thousand soft error events per day. Come back to earth and put a million units in the field, and the aggregate is almost the same probability. Most people don't understand that. But to cover it, we have parity-protected data integrity.
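To put rough numbers on that scale argument, here's a back-of-the-envelope sketch; the per-unit upset rate and the fleet size are assumptions for illustration, not figures from the talk.

```python
# Back-of-the-envelope fleet soft-error arithmetic (all rates are assumed,
# illustrative numbers, not measured values).
LEO_EVENTS_PER_SAT_PER_DAY = 2_000      # "a few thousand" upsets/day for one satellite
GROUND_EVENTS_PER_UNIT_PER_DAY = 0.002  # assumed sea-level rate for a single unit
FLEET_SIZE = 1_000_000                  # a million units in the field

fleet_events_per_day = GROUND_EVENTS_PER_UNIT_PER_DAY * FLEET_SIZE
print(f"One LEO satellite:        ~{LEO_EVENTS_PER_SAT_PER_DAY} events/day")
print(f"Fleet of {FLEET_SIZE:,} units: ~{fleet_events_per_day:.0f} events/day")
# The aggregate fleet rate lands in the same ballpark as a single satellite,
# which is why parity-protected datapaths matter on the ground too.
```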
So we start with the soft error events, but we find that the same mechanisms that protect our data paths actually protect against these manufacturing faults. Even with the best test process and system-level testing, you have this non-zero manufacturing defect rate, and it's normally the DI, the soft-error-event protection, that comes in and actually saves you. I've seen encryption engines with duplication go through all the tests, run for six months with the right piece of data and the right key, and it's the duplication that finally catches the fault. We have these techniques: for soft errors, starting with memory ECC and parity-protected ECC, we run CRC on data blocks that are moving throughout the chip. T10 DIF is a protocol method for moving data within storage; it attaches tags that say what the data is, plus a CRC. You mathematically taint things like Reed-Solomon codes such that, when you do the correction, if you don't get the right taint back, you detect misdirected operations. And even in CXL, my team plus a few other companies said, "Look, with link encryption, IDE, you're going to force us to put two encryption engines per link to cover soft error events and data integrity for the encryption engine." The reason CXL.mem has a different IDE algorithm is that we basically watermarked the IDE algorithm with a minor change, and most people don't realize this: if you have a soft error event, or any DI event, within your IDE encryption engine, you will detect it as a man-in-the-middle-type attack. PCIe didn't do it; we did it for CXL. The difference is we saved a whole bunch of power because we didn't have to duplicate the engines. That was between my team, Intel, and HPE, back in CXL 2.0.
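As a minimal sketch of the T10 DIF idea, a guard CRC plus a reference tag keyed to the block address, here's a toy version. Real DIF uses an 8-byte protection-information field with a CRC-16/T10-DIF guard; this sketch substitutes zlib's CRC-32 and a plain dict for readability, and the function names are hypothetical.

```python
# Toy T10 DIF-style protection information: a guard CRC detects corruption,
# a reference tag (the LBA) detects misdirected reads/writes.
import zlib

BLOCK_SIZE = 512

def protect(block: bytes, lba: int) -> dict:
    """Attach guard (CRC) and reference tag (the LBA) to a data block."""
    assert len(block) == BLOCK_SIZE
    return {"guard": zlib.crc32(block), "ref_tag": lba}

def verify(block: bytes, pi: dict, expected_lba: int) -> None:
    """Re-check the guard and the reference tag on the read path."""
    if zlib.crc32(block) != pi["guard"]:
        raise IOError("data corruption detected (guard mismatch)")
    if pi["ref_tag"] != expected_lba:
        raise IOError("misdirected operation detected (reference tag mismatch)")

# Usage: a block written for LBA 7 but returned for a request to LBA 8
# trips the reference-tag check instead of silently returning wrong data.
data = bytes(BLOCK_SIZE)
pi = protect(data, lba=7)
verify(data, pi, expected_lba=7)      # passes
# verify(data, pi, expected_lba=8)    # would raise: misdirected operation
```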
So when you actually look at a memory controller, we start at the receive side: the CXL receive port, address mapping, a fabric that says, "Okay, let's go to the ECC engines," then to the DDR; the DRAM is the thing connected to us. You see this process where we'll say, "Okay, here's ECC." We'll terminate the FLIT; that's its own ECC. We'll start protecting things with parity, so in case one of the fields gets lost, corrupted, or misdirected, we'll be able to determine that something bad has happened, inform the user, and not do anything bad. And it's more than just ECC, because from an ECC point of view that gets us protection of the data through the DRAM, but there's still a probability of that data being misdirected because of a timing error on the DDR interface. So you start to add additional data integrity on top of that, again with marking or extended Reed-Solomon coding, so you can say, "When I read this, if it's not associated with the correct address based on the ECC, I can detect misdirected reads and writes." So you have this overlapping structure, and if you do this, things like manufacturing defects are caught. It's a very good technique, and when you look at a storage processor, it's the status quo.
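A toy illustration of that "associate the check to the address" idea: fold the target address into the check code at write time and recompute with the intended address at read time, so a misdirected access fails the check. This uses a CRC as a stand-in for the extended Reed-Solomon/ECC symbols a real controller would use; the names and widths are hypothetical.

```python
import zlib

def write_path(data: bytes, addr: int) -> int:
    """Return the check code stored alongside the data; the address is folded in."""
    return zlib.crc32(addr.to_bytes(8, "little") + data)

def read_path(data: bytes, stored_check: int, addr: int) -> bytes:
    """Recompute the check with the address we *intended* to read."""
    if zlib.crc32(addr.to_bytes(8, "little") + data) != stored_check:
        # Corrupted data and misdirected reads/writes both land here; the
        # controller can poison and report instead of returning bad data.
        raise IOError(f"DI event at address {addr:#x}")
    return data

# A timing error that returns intact data belonging to address 0x2000 for a
# read of 0x1000 is detected, because the check was folded with 0x2000.
payload = b"\x55" * 64
check = write_path(payload, 0x2000)
try:
    read_path(payload, check, 0x1000)
except IOError as err:
    print(err)
```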
Duplication is for highly complex data transformations: either we duplicate the engine or we do the reverse operation. Again, from a storage point of view, when you encrypt something the data is changed and there's no parity relationship, so you basically have to either re-check that the encryption gives you back the right value or compare two engines. There are trade-offs between latency and power here. If you're doing this around DDR, the DDR can either read or write at any one time, so you can have two engines and share the time between them. But to catch a silent data corruption in one engine, and to handle these failures, you literally have to add quite a bit of logic. These blocks are highly complicated and become really hard to test, especially for transition faults where you have all these nets switching on the same transition. To get coverage you can throw thousands and thousands of vectors at it, but to do it properly, you really should put in the two engines.
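A software sketch of the two checking styles being contrasted, for a transformation with no parity relationship between input and output. AES-GCM from the `cryptography` package stands in for a hardware encryption engine; in silicon these would be two physical engine instances rather than two library calls, and the package choice is my assumption.

```python
# Two ways to check a data transformation whose output has no parity
# relationship to its input (e.g., encryption).
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)
plaintext = b"user data block"

# Style 1: duplicate engines and compare outputs (more area/power, low latency).
ct_a = AESGCM(key).encrypt(nonce, plaintext, None)
ct_b = AESGCM(key).encrypt(nonce, plaintext, None)
if ct_a != ct_b:
    raise RuntimeError("DI event: duplicated encryption engines disagree")

# Style 2: reverse operation, decrypt and compare with the input
# (one engine reused, extra latency).
if AESGCM(key).decrypt(nonce, ct_a, None) != plaintext:
    raise RuntimeError("DI event: reverse check of encryption failed")
```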
So DI doesn't come for free. There is a cost associated with it; depending on what you're doing, it can be anywhere between 5% and 25% in both power and area. But I'll tell you, the cloud people want this capability. They're already taking a large amount of risk on how their highly optimized architectures are going to handle this new memory with new latencies and new management techniques, and then it's, okay, are you a proven provider? Are you going to have the quality? You can only test quality so far before you really need to start putting in more logic and more techniques. I would argue, and this is my thing, that we should err on the side of minimizing silent data corruption versus uncorrectable errors. With Reed-Solomon error correction around DRAM, you're going to get cases where someone says, "I have the best UE, uncorrectable error, rate," but in doing so they're taking a bit of a hit on SDC: when they do correct, they actually have a higher chance of giving you the wrong answer. At scale, I would argue that's not a trade-off you want to make. So at Rambus, when we do things like this, we will literally make sure our SDC is at zero. That's our goal: no silent data corruption. And that bleeds through to all our performance and power targets. It is hard with third-party IP; there's a lot of third-party IP that will lie to you. It's like, oh, CRC comes in and you have a parity-protected datapath coming out, and there's unprotected logic in the middle. Audit your IP. That drives me crazy.

And we're just learning how to handle some of these errors. It's not a normal thing for a memory subsystem to say, "Hey, look, I've got a DI event." I do know a lot of people right now turn off the reporting of certain errors on a directly connected DRAM; a lot of correctable events are basically ignored. But as we get to a larger scope, correctable events actually give us information on how the DRAM is potentially going to fail. So as we start talking about a larger scale for CXL, a lot of the things we've been ignoring, like correctable ECC events for DRAM, are actually going to be the thing that tells us that this DRAM, while we can correct for it now, may in a week's time or a month's time fail completely, and we really should take it offline. This is where some of the interesting innovation is coming in: we're trying to pull apart these failure mechanisms and handle them more gracefully. Because if you think about it from a hyperscale point of view, if I have terabytes of memory and I put 50,000 users on that server, and that memory goes down in a bad way, there are 50,000 people who will probably want to phone the hyperscaler's support line, and they don't have enough people on the phones to handle that number of service calls.
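To illustrate the point about not ignoring correctable events, here's a hypothetical sketch of a policy that counts correctable ECC events per DRAM region and proactively takes a region offline once a rate threshold is crossed. The threshold, window, and function names are all assumptions for illustration, not anything prescribed by the talk or by CXL.

```python
# Hypothetical predictive-offlining policy driven by correctable-error rates.
from collections import Counter
import time

CE_THRESHOLD = 32           # assumed: correctable errors per window per region
WINDOW_SECONDS = 24 * 3600  # assumed: one-day observation window

ce_counts: Counter[int] = Counter()
window_start = time.monotonic()

def on_correctable_error(region_id: int) -> None:
    """Called by the error handler instead of silently dropping the event."""
    global window_start
    if time.monotonic() - window_start > WINDOW_SECONDS:
        ce_counts.clear()
        window_start = time.monotonic()
    ce_counts[region_id] += 1
    if ce_counts[region_id] >= CE_THRESHOLD:
        offline_region(region_id)

def offline_region(region_id: int) -> None:
    # Placeholder: migrate pages away and retire the region before it fails hard.
    print(f"region {region_id}: CE rate exceeded, scheduling page offlining")
```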
So, as a final call to action: CMS, we're getting into the tiering, but they're also closing the gap on RAS, and this is going to be an ever-expanding topic. I want to see the IP vendors start to take this stuff seriously. They can learn from the storage guys; I find that IP coming from the networking side, or even some of the newer stuff, has had a time-to-market focus as opposed to a quality focus. And yeah, let's build more memory systems. Hopefully I got through that in enough time. All right, guys, have a good evening. If anyone has any questions, I'll take them. All right.
Larry, really great. Again, this is so fundamental, and I'm glad that you brought everybody's attention to this topic. So how do we bring it into OCP? As you said, IP vendors should look into it. I understand some of this is your secret sauce, but at a community level we should make this a topic that's given enough importance. It's not just AI; this is fundamental.
It shows up in... you can map it all back to a FIT rate. If we define it, a FIT rate from soft errors, the math is imprecise and subject to interpretation, but that's about the only way. I see both sides of the coin: I see reasonable FIT rate requirements, and I see really unreasonable ones, where you look at what's expected and it's like, yeah, that's space-quality stuff. I think it's going to come from better specification of FIT rates, because most people do not have a FIT rate requirement right now. And it's tied to the architecture a little bit; it's like, we're still working on this, and we don't know what our blast radius is. But really, I think that's where you guys are going to find it.
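For reference, FIT is failures in time, defined as failures per 10^9 device-hours. A quick worked example of why a per-device FIT budget matters at fleet scale, with numbers assumed purely for illustration:

```python
# FIT = failures in time = failures per 1e9 device-hours.
fit_per_device = 100      # assumed FIT budget for one device
devices = 1_000_000       # assumed fleet size
hours_per_year = 8_760

expected_failures = fit_per_device * devices * hours_per_year / 1e9
print(f"Expected failures per year across the fleet: {expected_failures:.0f}")
# ~876/year for these numbers; a handful of unprotected flip-flops adding even
# a few FIT per device shows up as real field failures at this scale.
```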
Yeah. I mean, silent data corruption is an important topic. Even in AI: at Hot Chips, the Google presentation talked about it.
Yeah. The CPU guys, I don't know what they're going to do; the CPU guys have always done things differently. But if you come from a storage background and bring it into CXL, it all makes sense. And I know that you include and appreciate that level of reliability. But yeah, it's amazing: with some people it's like, "You know about DI, don't you?" and it's, "What do you mean? It's not in the standard." But it doesn't take too many unprotected flip-flops to really affect the FIT rate.
From the beginning.
Yes. Yes. It has to be day one.
Yeah. Great. Thank you.