All right, so we talked about the big picture and how we're trying to glue things together. We've talked about how we deal with memory failures, which are hard for us to figure out. Now we're going to talk about how we connect everything together, how we plumb things. That's what the RAS API work stream is about.
So Antonio, Yogesh, and I will be presenting. Antonio is a colleague of Yogesh's at Intel. We'll go over a problem statement and a North Star. We've got a big agenda, so I'm going to skip the agenda slide so we have more time to talk about the fun stuff.
But basically, at the end of the day, one of the problems we have with RAS is that it's stinking hard, right? And that is where all of this comes from. If this were easy, I wouldn't have a job. The problem is that the features are hard for us as end users to use. We don't always understand how they work. We don't necessarily have enough people to make an investment in the code. We also lack the subject matter expertise to understand what's going on in the silicon. If I look at some of the processor manuals, I can look at the list of registers, and there are 10,000 pages of registers on some new SoCs. How in the world am I as an end user going to read those and figure out what I need to collect to understand what's going on in the machine? So what's happened in the past is people have thrown up their hands and said, "Well, I'm just going to count errors." Like I said in the previous talk, counting memory errors at Google, we've demonstrated 0.6% precision. We can do better. We have to do better to make managing large-scale fleets affordable. To do that, we need vendor expertise. We need help. We also don't always have a big enough team. We lack the teams; we don't have the people who can write all the software to glue this together. And so what we said is: can we make this easier? Can we find a way to create a driver so that when someone brings the next-generation CPU, we don't have to write a whole bunch of software? We can discover that new thing via a standard API. We can discover what RAS features it supports and start using them. We can kick the tires on RAS features with very low investment, like a Redfish call. Imagine that. Then we see how effective it is. And as we converge on what's effective, we will eliminate features that aren't useful to us, we will enable the vendors to add new features that are useful, and we'll be able to prove to them that they work. All of this configuration and work to drive things is unbearably hard for us today. And so what happens? I've talked to other CSPs and other server vendors. They each have their special RAS feature that they've invested in. They insist the silicon vendors continue to provide the special features they've invested in and love, and they don't have the bandwidth or the people to build something to try the other features. We're fixing that.
And so our North Star is that this is open and scalable. We have this driver, and we could practically upstream it. Now, the last thing we want to do is prevent innovation in silicon. The first thing we want to do is get rid of the toil in using that innovation. So that's our goal: it's open and scalable. We can have open source drivers. We have enumerated things. Everyone here, I hope, if you've done anything with memory RAS, knows that there's a feature called post-package repair. That feature is described in the JEDEC spec. It is not secret sauce, so why do I need a secret interface to invoke it? There's just no value in that. All it does is make it hard for us as end users to use the feature. Now, there are other features that the CPU vendors or other silicon vendors have put in their parts that are secret sauce. So what we want is the ability to have generic interfaces for the things that are common and well-known, but we also want space for the vendors to innovate and add new things that help us manage our fleets better. The other thing that's really important here is that this is platform agnostic. If I discover a piece of silicon that has error logs and has a list of RAS actions it can take, it shouldn't matter if it's a GPU, a CPU, a transcoding accelerator, or some other ML accelerator. Who knows what it is? What we see is a list of errors and things we can call back. We send those errors to analysis tools, and we can call back into the machine to invoke those actions with very little effort and with the help of the vendor's expertise. That simplifies things. It reduces our toil in coding, it increases the uptake on the features the vendors are providing, and it allows us to easily gather better data from the machine so the silicon vendors can improve their parts.
And so this is it. This is our solution. And with that, I'm going to hand off to Antonio.
Hey, good morning. My name is Antonio Hasbun. I work for Intel. So I'm going to talk a little bit about the solution that we're proposing and what we're working on in the work stream. Basically, what Drew described is the ideal case. How do we get there? We contributed a preliminary version that we're working on, and we're adding to it with the input of everybody in the work stream. I'm going to explain what's in that preliminary version, because we're really early in the meetings, and what discussions are taking place right now. So our solution here is platform agnostic, like Drew said. It enumerates features, and it has semantics for in-band and out-of-band mailboxes. What this means is it goes beyond having a protocol layer; it gives you the semantics. There's an opcode to trigger actions. There's an opcode to discover and to enumerate. And it enhances error logging. One thing we've seen in the past is that our error logging was more of a telemetry thing, and for some RAS features we have error storms, and we have other requirements like different queues and different priorities. So we're changing a little bit, from just a stream of errors into something more specific for RAS that meets the needs that we have in RAS. And finally, the only thing that we will not cover is synchronous failures. A synchronous failure is when your core finds an error and cannot make forward progress. At that point, you really cannot go through any other means, through a different mailbox, to go and recover. This is a well-known case, and if we tried to go through a mailbox, we'd crash the whole system because of a timing constraint. So that's the only error that's out of scope for this specification.
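To make the "semantics, not just a protocol layer" point concrete, here is a minimal sketch of what such a command set could look like. The opcode names, numbering, and payload fields below are illustrative assumptions, not values from the contributed specification.

```python
from dataclasses import dataclass
from enum import IntEnum

class RasOpcode(IntEnum):
    """Illustrative opcodes only -- the real spec defines its own numbering."""
    ENUMERATE_FEATURES = 0x01   # discover which RAS features this device exposes
    ENUMERATE_QUEUES   = 0x02   # discover the error queues and their severities
    QUERY_ERRORS       = 0x03   # read error records from a queue
    EXECUTE_ACTION     = 0x04   # trigger a RAS action (e.g. post-package repair)

@dataclass
class RasCommand:
    """A generic mailbox command: the same shape whether it travels in-band or out-of-band."""
    opcode: RasOpcode
    payload: bytes = b""

# The management agent never needs a CPU ID or register map; it just issues
# the discovery opcodes and works with whatever the device reports back.
discover = RasCommand(RasOpcode.ENUMERATE_FEATURES)
```

The point of the sketch is that discovery and action invocation are first-class verbs of the interface, rather than something each vendor encodes differently in registers.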
So here's a little bit of an illustration of how we're visualizing this. You can see on the left side -- I hope that's the left side for you -- the software view. We want to abstract out the RAS features. So in the software, in the OS or the BMC, there's a management agent -- whatever you're using to manage your fleet -- and it sees an abstraction layer. This RAS API block is a bit of a driver, a daemon. It can be a piece of software, or a piece of software and firmware, and I'm going to go into that in a second. It abstracts out the device itself -- the RAS from the device itself. There's an in-band access and an out-of-band access. We've proposed using several standards; we're not reinventing the wheel. For out-of-band access, we're talking about MCTP or having PLDM in there. What's specified is how you get to the agent. Now, the agent is what's running in the SoC or next to the SoC, and that's the abstraction layer. So the management agent in the OS or in the BMC doesn't have to know, like Drew said, a CPU ID to go and find registers. Instead, it has an opcode that enumerates RAS features. It has an opcode that enumerates queues. And it can go query errors and execute actions without knowing which CPU, GPU, or piece of silicon it is interrogating. The right side is more of a hardware view, though it's not quite as clear as it is here, so I'm going to explain a little bit. You have the OS or the BMC, but now you're seeing more of what will happen inside the SoC. You see there's an in-band mailbox and an out-of-band mailbox, and we've designed this to have two sets of error queues. When I say two sets of error queues, it's because we really need to separate in-band and out-of-band so they don't compete with each other. And each one is a set because in RAS handling we need to differentiate by severity. The way we poll, or the way we interrupt, depends on how severe the error is. So we have different error queues by severity: from a warning to a fatal error, you have different actions to take and different priorities. We separate those, and you have one copy for in-band and a different copy for out-of-band. And then you have the RAS actions. And then you see the RAS API agent, which is drawn there. Now, the RAS API agent could be implemented as firmware in a microcontroller, but that's just one option. Once we brought this to the team and started the discussions, other options came up, and we are getting those developed to see how we can incorporate them or how we can encompass those use cases. So there's a part in the work stream where we're going to debate whether a piece of this is going to be firmware. Don't think of it as necessarily embedded in the hardware: it could have some firmware layer to it, or it could even have some software layer to it. Those use cases are coming shortly into the work stream, and that's where the discussion is happening. We have a starting base, and we're building on top of that. At the bottom, you can see the IPs. Those IPs are abstracted out, so the OS or the BMC doesn't need to know the IPs. Errors flow through error queues, so you just catch your error events from your error queues. You no longer need to know that you have to go to register 352 because you're in this particular IP.
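A minimal sketch of the queue structure described here -- one set of per-severity queues for the in-band path and a separate set for the out-of-band path. The severity names and the record type are assumptions for illustration; the spec defines its own levels and record layout.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List

class Severity(IntEnum):
    """Assumed severity ladder; the spec's actual levels may differ."""
    WARNING = 0
    CORRECTABLE = 1
    UNCORRECTABLE = 2
    FATAL = 3

@dataclass
class ErrorQueue:
    severity: Severity
    records: List[dict] = field(default_factory=list)

@dataclass
class RasAgentQueues:
    """One set of per-severity queues per access path, so in-band and
    out-of-band consumers never compete for the same queue."""
    inband: Dict[Severity, ErrorQueue] = field(default_factory=dict)
    oob: Dict[Severity, ErrorQueue] = field(default_factory=dict)

    def __post_init__(self) -> None:
        for sev in Severity:
            self.inband[sev] = ErrorQueue(sev)
            self.oob[sev] = ErrorQueue(sev)
```

The design choice this illustrates: polling cadence and interrupt behavior can differ per severity and per path without one consumer starving or reordering another.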
So a little bit more on contained RAS actions. The part about abstracting the RAS actions is, for me, very important and core to this initiative. When you're doing PPR, or any RAS action, you should separate the policy from the RAS action. Policy is: why are you taking the action? Who's deciding how many errors? Like Drew said, are you counting errors, or are you finding rates, or are you doing something more complicated? That's the policy. That's abstracted out; that's the responsibility of the rest of the framework. The analyzers plus the CSP policy will tell you when to trigger an action. But the action itself -- how to do PPR -- should be a monolithic thing that the SoC vendor is responsible for. So as you can see in this picture, the management agent just sends one command with an input. For doing PPR, it doesn't matter which hardware vendor you are; you just need to say, I'm doing PPR at this address. Why do we sometimes have 16 or 20 instructions to do that? We should just have one instruction that says, I want to do PPR at this address. And if we all say that, then how that gets managed inside the SoC is up to the SoC vendor: I'll make sure I do a PPR on that address and report back the results. That's the abstraction. And when we can do this not just for memory but for all the RAS actions, it'll be much simpler to manage your fleet. Again, we move away from registers. And I just want to reemphasize why we're moving away from registers. We're scaling. We all heard the keynotes yesterday: we're scaling in heterogeneous fleets. Keeping track of registers, reading CPU IDs, having different vendors in the fleet, having different platforms in the fleet -- it's impossible to scale that out. You need to be very nimble, and the way we do that is we abstract: everybody can call an action with all the inputs it needs to take place. The other part that we have is the enhanced error logs. You can see here what we're adding to error logs. Error logs have been there forever; we're using CPER. But what do we want out of these error logs? We want clear timestamps. We want severities. We all agree on what is fatal -- when do you classify something as a failure versus fatal? What actions is the OS expected to take? That's what we want to standardize. We want to standardize having UUIDs. That's part of CPER, but every record should have a UUID so you can clear it, reference it, and cross-reference it with other errors. And the other thing we're adding is something that helps us feed these error records into ML models, which is overflow counters. Once you reach your capacity for storing errors in your error queues, you stop taking in any more errors. Because you have the error queues separated by severity, you never have to prioritize one severity over another -- they have independent error queues. So if you have to stop taking more errors because the host is not responding, it's not taking the errors, whatever, you want to know how many errors happened and in which time span they happened. That you can feed to an ML algorithm. If you overwrite the errors, then you cannot feed that into an ML algorithm; you need to know the sampling that happened. So these flags will help you, again, build for the future.
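A small sketch of the two ideas in this slide: a single-command RAS action (PPR given only an address) and an error record carrying the fields called out above. The field names, the `execute_action` helper, and the `agent` object are hypothetical illustrations, not the CPER layout or the spec's API.

```python
import uuid
from dataclasses import dataclass

@dataclass
class EnhancedErrorRecord:
    """Sketch of the fields discussed in the talk; not the actual CPER record layout."""
    record_id: uuid.UUID     # every record gets a UUID so it can be cross-referenced
    timestamp_ns: int        # a clear, unambiguous timestamp
    severity: str            # an agreed-upon severity (warning ... fatal)
    overflow_count: int      # how many records were dropped after the queue filled
    payload: bytes           # the vendor's CPER section data

def request_ppr(agent, address: int):
    """One command, one input: 'do post-package repair at this address'.
    How the SoC carries that out internally is the vendor's responsibility.
    `agent.execute_action` is a hypothetical transport call, not a spec API."""
    return agent.execute_action(action="post_package_repair", address=address)
```

The overflow counter is what preserves the sampling information: even when records are dropped, an ML model can still see how many errors occurred in the window rather than silently losing them.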
I'm just going to -- here, where are we right now with the spec? We contributed a 0.7 version, and that's just a WIP. We needed to show that this can be done. This is a very ambitious goal, but I do believe we can do it, and that's why we contributed a 0.7 version. We're working toward 1.0, and there are a lot of things coming in from different partners. NVIDIA is bringing a new option for out of band, and AMD is bringing an option for in band. We're listening to all of them, and we're going to pick the best that gets us to our North Star, and that's what we're going to do as 1.0. We are meeting at 7 a.m. every other Tuesday. So again, hardware management means meetings at very funny times. The reason we do that, by the way, is because we have people from all over: we have people from China, we have people from Europe, plus East Coast and West Coast, and I'm from Costa Rica. So we really need a time zone where we all can meet, and that's the slot. It's a CLA work stream, and you can see the companies that have joined. I think two or three more have joined since we made the slide -- we even had one last week. So that's good. Now I'll hand it over to Yogesh.
Thank you, Antonio. So I think John and Hemal did a pretty good job of showcasing the whole hardware management structure; we're going one click deeper here. What we're trying to do in the next part of our talk is to compare and contrast RAS API with the various other RAS-related efforts that we have in hardware fault management. So first of all, how does RAS API compare with hardware fault management? There's a lot of synergy between the two, because in the hardware fault management work stream we have contributed an industry-standard-based, vendor- and silicon-agnostic fault management infrastructure framework, and RAS API enables a part of that by allowing a standard connection to be established between the SoC and the fleet control system. That connection is, again, vendor and platform agnostic, so it helps the framework. It is also developed under a CLA to provide general industry agreement and strong adoption from all the major hardware vendors and equipment manufacturers.
The next one is the synergy of RAS API with the fleet memory fault management work stream -- the FMFM that we just talked about. Like you've seen, RAS API is a standard for communication between the SoC and the platform. In the diagram here, you see the red lines heading out of the SoC via the in-band or out-of-band agents. From there, the FMFM effort takes over, because from there on the interface is standardized under that effort: we have standard log formats and logging requirements, which after fleet consolidation travel over to the analyzers. Memory errors are sorted out, filtered, and they go to the hardware vendors and to the analyzers, which are again being defined under the FMFM work stream. Whatever RAS actions are generated from there -- based on the analyzer output, the CSP or fleet operator policies, tolerance to errors, and so on -- are fed back to the SoC via, again, the out-of-band or in-band agent, whatever the implementation might be. One little note: if it's CXL memory, then like we discussed a little bit ago, CXL has its own path for handling those.
So the next one is RAS API's synergy with the CSM, the Cloud Service Model project. I don't think I need to reiterate RAS API. The Cloud Service Model project is a new project that's coming from Future Technologies into hardware management, and what they are trying to achieve is much larger than just RAS; RAS is one of the things they can influence. What they're basically trying to do is solve the management-at-scale problem. They are trying to create an interface for providing data from the fleet to the fleet operator, to help manage the fleet in a better way. And they have a lot of different considerations. The first one is scale; others are latency, privacy of data, and ease of querying data. Telemetry is one of the major things they are going after. So there's no conflict, again: RAS API is the path, via the in-band or out-of-band agent, to the fleet management infrastructure. From there on, the CSM -- the bold, thick purple arrows -- takes it to hardware vendors and offline root cause analysis tools.
Finally, another new work stream that is related to this effort and also has a play in the RAS area is the GPU management work stream. GPU management is a standard for GPU RAS, error injection, power management, GPU-specific performance telemetry, and some of the other functionalities that are required. So they are creating requirements for GPU RAS features as well. They also use the same standard log format, which is CPER. In addition, they have Redfish-specific sensors and actuators from which they can get some telemetry for the GPUs. So again, there is really no overlap or conflict between the two efforts; we are pretty well aligned.
And with that, I think we have our call to action. Again, this is a CLA work stream, so if you or your company are interested in joining, please reach out to Drew or Antonio or anyone at OCP, and they can guide you on how to get that agreement signed. Then you can join the effort. We have some links here for you to see and download the CLA if you need to look at it first, and you can also join our mailing list; the link is given here. With that, I think we have a--
Hello. I'm Caleb from Astera Labs. Before I raise any problems, I want to begin by voicing some support for your work. It has the potential to solve a lot of pain that we have experienced over the years with telemetry and debug logging and everything. Question for Antonio: is it possible to pull up the RAS API structure slide? Here we are. Yeah. So I notice you're connecting in-band and out-of-band, which is great. And you also have separate queues for the RAS-related requests coming from the two different directions, which is also great. So one side can't deny service to the other side, because you've got independent queues. But have you given a lot of thought to ordering problems that arise? To give a concrete example: your system software comes in in-band and queries some device status, and in that status it sees there's some bit set somewhere that says, hey, you should go query this other log. And then you service a request that came in out-of-band that changes the device state. So then the two in-band queries return inconsistent state.
Very good question. The error logs really record events. So in band, you'll have a record that there was an error at a certain bit, and out of band you'll have the same record. That doesn't mean the error has not been corrected; it is not a status. It is a log of what happened. It's what we feed to predictive failure analysis or to our root cause analysis tools. For actions, though, if you take an action, we do have semantics on how to arbitrate between in band and out of band. A lot of this we're leveraging from the CXL CCI protocol, for in band and out of band, because we wanted something easy to implement; every hardware vendor has a CXL interface these days. So we added this part: whether in band or out of band, they have to request the RAS feature before they use it. Because, as opposed to just reading error logs, which you can share, taking two RAS actions at the same time is dangerous. So we do have that arbitration added on top in the semantics. But for error logs, an error log is not exactly a status. You could be polling for correctable errors, and the error could have happened a week ago, because you might not be doing it in real time. For certain of those correctable errors you're probably not that interested, and you're polling at a low priority. So the event will be there; you can read it. It doesn't mean that it's still there or that the error was not resolved. Status, yeah, that's different. These are event records.
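A minimal sketch of the "request the feature before you use it" arbitration described in this answer: only one path (in-band or out-of-band) may hold a given RAS action at a time. This is only an illustration of the idea; the actual arbitration semantics come from the spec's CXL CCI-based layer, and the names here are assumptions.

```python
import threading

class RasActionArbiter:
    """Illustrative arbiter: a RAS action must be requested (claimed) before use,
    so in-band and out-of-band requesters cannot run the same action concurrently."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._owners = {}   # action name -> "inband" or "oob"

    def request(self, action: str, requester: str) -> bool:
        """Try to claim an action; returns False if the other path holds it."""
        with self._lock:
            owner = self._owners.get(action)
            if owner is not None and owner != requester:
                return False
            self._owners[action] = requester
            return True

    def release(self, action: str, requester: str) -> None:
        """Release the claim once the action has completed."""
        with self._lock:
            if self._owners.get(action) == requester:
                del self._owners[action]
```

Error log reads need no such claim because records are shareable event data, which is exactly the distinction the answer draws between logs and actions.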
Can I add something? One of the biggest things we want to change is the perception of error handling. Because today you think, okay, I'm going to go read this register, read that register, read that register, and figure things out. If an OS is about to use corrupted data, you have to synchronously interrupt it. You have to stop it. Those exceptions still have to happen, so we're not talking about that. But when it comes to that OS figuring out a lot of detail -- you know, why did this happen -- well, maybe it can get an error log. Maybe it just needs to know: I tried to consume poison, this address is bad, kill this process, recover the process. But when it comes to fixing the thing that caused the poison, that's an error log. And this is the huge thing I think we need to change across the industry: when there is a failure, it is an event, and for that event you need to go to each thing that might have data for it and collect that data. We call those hermetic error logs. I said this last year -- people may not remember -- but if you remember one thing I tell you: joins are the root of all evil at scale. So what timestamp are you using? Is it an OS timestamp, a BMC timestamp? If there's a new management controller, is it that timestamp? And then do I take them all up and just pray that I can correlate something? Failures happen at light speed inside the machine, and they have repercussions throughout the system. So what we want to see is, when there's an emergency, we gather error logs from the CPU, we gather clock, power, and thermal error logs, we gather the GPU errors, and we bundle those. It is an event, and for that event we've gathered CPER logs from all the sources into a hermetic bundle that we can send up to the analyzer tools. They'll look at each CPER and go, hey, that one's mine, and we can analyze them. So we're not looking at bits, and we're solving the race conditions, because there's something gathering the data for us. So hopefully that helps.
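A minimal sketch of the hermetic bundle idea: all CPER records for one failure event travel together under a single event timestamp, so analyzers never have to join records across mismatched clocks. The class and field names are assumptions for illustration, not a defined format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CperRecord:
    source: str        # e.g. "cpu", "gpu", "clock_power_thermal" -- assumed labels
    data: bytes        # the source's CPER-formatted record

@dataclass
class HermeticErrorBundle:
    """All logs for one failure event, gathered and stamped once at collection time."""
    event_id: str
    timestamp_ns: int                          # one timestamp for the whole event
    records: List[CperRecord] = field(default_factory=list)

    def add(self, record: CperRecord) -> None:
        self.records.append(record)

    def for_source(self, source: str) -> List[CperRecord]:
        # Each analyzer picks out the records it owns ("hey, that one's mine").
        return [r for r in self.records if r.source == source]
```

Because the bundle is assembled at the machine, there is no cross-timestamp correlation ("join") left to do upstream, which is the point of calling the logs hermetic.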
Yeah. So one of my questions has already been answered. The other one is for the RAS API agent. You said it could be implemented in firmware or other places. Is there going to be any reliance on system management mode if it's implemented in firmware? And, you know, what are the impacts of that?
Okay. Yes. I just looked over and said, can I answer that? So if you go look on the Internet, you can find tutorials that teach you how to find vulnerabilities in system management mode. System management mode is a mode where we take all of the CPUs from the OS and pull them into firmware; we rendezvous them, do something, and resume. The problem with that is you kill performance. And if someone manages to corrupt the SMM handler, that handler can persist across reboots. So now an aggressor can put something in SMM, you reboot, and the SMM vulnerability is still there, acting on behalf of the aggressor. If you look in our drawings, we had SMM in some of the other presentations, and there was an X through it. And I'm not opposed to firmware first -- I actually did a lot of work with Intel at the time eMCA2 was developed to figure out how that worked. But what we've learned since then is that the performance and security vulnerabilities inherent in SMM make it too dangerous to use. So our proposal is explicitly to get rid of it.
That's great. That's what I was hoping.
Okay. Good. Good. I'm glad I don't have to justify that too much more with you.