All right, good morning everyone. This is the first talk in our hardware management track; today we have a full day of sessions. I'm Hemal Shah, architect at Broadcom and hardware management project co-lead, and we have Jeff Autor here. Go ahead. I'm Jeff Autor with HPE and the other co-chair of the hardware management track. For this first talk, we will go over the overall structure of the hardware management project. We'll talk about the subprojects and work streams, and for the rest of the day you will see much more detailed talks on the different work streams and subprojects. I'll kick it off, then hand it to Jeff, and we have one additional speaker who will join us at the end, and then we'll wrap it up. Okay, so with that, let's go ahead and get started.
As I said, we have a number of subprojects in hardware management. Over the years this has grown, and recently we have created a number of work streams that are even more streamlined, so we can do more fine-grained work there. Over the last year or so, a number of work streams were added to the project, as you can see, because the project scope has been expanding, and based on that we have updated the charter of the project. Overall, this project not only covers all of the platform management; we also work with a number of other projects to define their manageability aspects, and they leverage some of the baseline specifications the hardware management project produces.
This is a very simple charter, but it covers management of all OCP platforms, starting with hardware management. The goal is interoperability: for all the features you find on an OCP platform, there is no vendor-specific aspect you need to worry about, and this gives you a common foundation on which you can build your manageability applications and features.
Let me spend some time on this. Thanks, John, for putting this picture together. We were all discussing how to present our org structure; we had a traditional org chart, but this view made more sense. Within the project, the subprojects are shown in sky blue, and the light green boxes are the work streams. The work streams with dotted lines around them are just being formed; they are now part of the charter, and we are looking for co-leads. There are different ways to read this; we will go left to right as I am facing the slide. On the left you see component-related management: GPUs, the satellite management controller (there is a talk on that today), and all the components within devices, plus the out-of-band interface, which is where the platform interface comes into the picture. There are RunBMC and DC-SCM, the module specifications that define how you build management modules compliant with the spec. We recently had what started as silent data corruption and has now become the server component resilience group, which is defining the different aspects of platforms and how to mitigate silent data corruption. Moving up to the platform side, we have manageability profiles that span both in-band and out-of-band interfaces. There are projects, like the telco project, that have a rack management controller (RMC), which leverages the rack manageability APIs defined by the hardware management work group. That sits between platform and software: these manageability interfaces are used by an orchestration layer, and there is also an implementation called the OpenRMC device manager, which is part of the hardware management project. Recently the cloud service model became part of the hardware management project, which will allow that kind of orchestration layer to directly manage cloud services. And as you move to the side, the RAS API pretty much spans everything within the platform, at both the component level and the platform level, along with the fault management subproject, which has started working on a number of things. Within that there is also the fleet memory fault management work stream, which looks specifically at memory-related fault management. So this is the overall structure we have today; as I said, some of the areas shown dotted were just recently added to the hardware management project charter, and by next year you will start seeing more and more work from those work streams.
What we have already published so far: the profiles, where we have the baseline profile defined by this project and the server profile derived from it, which we co-own with the server project. The server project can take this baseline profile, which defines a set of baseline requirements for management, including basic inventory, monitoring, and control such as temperature and power monitoring. That is what is in the baseline, and for a specific domain one can specialize from this baseline profile to add more capabilities. So that's profiles. For hardware management modules there are two things. RunBMC, the BMC daughter card I/O spec, has been published and has gone through several minor updates; it is a fairly mature spec at this point. DC-SCM, the data centre secure control module, includes both security and manageability functionality; I have a slide where you will see a picture of this. It has had a major revision to 2.0, which is already published, and there are already implementations compliant with the specs; people have demoed them either in the experience centre or in the talks. And then OpenRMC, for rack management, where we started with a rack management profile which then became the usage guide for the north-bound profile, and there is also a companion design specification for OpenRMC. Those are already published specs.
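As a rough illustration of what the baseline profile's inventory and thermal-monitoring requirements look like from a client's point of view, here is a minimal sketch in Python using the standard Redfish REST pattern; the BMC address, credentials, and resource IDs are placeholders, not values from the profile itself:

```python
import requests

# Hypothetical BMC address and credentials, for illustration only.
BMC = "https://192.0.2.10"
AUTH = ("admin", "password")

# Inventory: basic system identification exposed through the Systems collection.
system = requests.get(f"{BMC}/redfish/v1/Systems/1", auth=AUTH, verify=False).json()
print("Model:", system.get("Model"), "Serial:", system.get("SerialNumber"))

# Monitoring: temperature readings exposed through the chassis Thermal resource.
thermal = requests.get(f"{BMC}/redfish/v1/Chassis/1/Thermal", auth=AUTH, verify=False).json()
for sensor in thermal.get("Temperatures", []):
    print(sensor.get("Name"), sensor.get("ReadingCelsius"), "C")
```

The point of the baseline profile is that a client like this can expect the same resources and properties to be present regardless of which vendor built the platform.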
So that gives you the project overview, its charter, and the documents already published. Now I'm going to get into a couple of subprojects, and after that I'll hand it to Jeff Autor, who will cover the other subprojects and work streams. Let's start with the hardware fault management subproject, which is led by a number of people. The idea behind this subproject is to focus on standardising behaviour for hardware failures: defining a key set of requirements for how these errors are managed, along with reference guidance and the API part of hardware fault management. That's what this subproject is focused on.
Specifically, there is a work stream within it focusing on fleet memory fault management and how errors are handled. What this does is allow you, across different hardware vendors, to standardise the representation of how memory errors are reported. There are APIs and connections with different modules that you can use, and on that basis you can make the whole architecture vendor-agnostic. Additionally, the telemetry information related to memory errors can also be formalised as standard content. Those are the specs this work stream is working on, and they have been talking with other projects to help with this standardisation.
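To make the idea concrete, here is a toy sketch of what normalising vendor-specific memory error reports into one common record shape could look like; the field names and the vendor log format here are invented for illustration and are not taken from the work stream's specifications:

```python
from dataclasses import dataclass

@dataclass
class MemoryErrorRecord:
    """One common, vendor-agnostic shape for a reported memory error."""
    timestamp: str
    dimm_location: str      # e.g. "CPU0_DIMM_A1"
    error_type: str         # "correctable" or "uncorrectable"
    count: int

def normalize_vendor_a(raw: dict) -> MemoryErrorRecord:
    # Hypothetical vendor-specific log entry mapped onto the common record.
    return MemoryErrorRecord(
        timestamp=raw["time"],
        dimm_location=raw["slot"],
        error_type="correctable" if raw["ce"] else "uncorrectable",
        count=raw["count"],
    )

# Fleet software then consumes MemoryErrorRecord objects regardless of
# which vendor's hardware produced the underlying report.
```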
The next subproject is hardware management module. As I mentioned, there are two things here. One is the BMC daughter card I/O specification: the picture on the left shows what the spec defines, exactly the layout, connector, and pinout for the RunBMC card. Any implementation compliant with the spec can design its own card, and you can expect it to plug into a standard connector as long as the implementation follows all the pinouts; then any BMC stack can run on it. So this gives you, at the hardware level, a standard I/O connector card for the BMC. That's RunBMC. The other is DC-SCM, the data centre secure control module, which combines the hardware root of trust and the BMC functionality on a single card. On the right you see that module: the connectivity, the connector, and all the pinouts have been standardised by the spec, and there are reference implementations compliant with the 1.0 and 2.0 versions. So that's the hardware management module subproject. Let me hand it to Jeff, who is going to cover a few other subprojects and work streams. Go ahead, Jeff.
Right. These are two more of the subprojects that have been around for a while. The OpenRMC specification defines two layers: the interface from an aggregator up to an orchestrator level, so it defines that API, and it also specifies the requirements for any device being managed by a rack manager, the individual pieces that need to be reported for a device to be a good citizen in that rack ecosystem. The device manager, which is why you see OpenRMC-DM, is an open source implementation of the OpenRMC spec that is available on the OCP GitHub. It has a set of features and is still under active development. It does inventory, temperature monitoring, and power control of the units under its purview, and it measures their power utilization. And I think fairly recently it added the ability to do firmware updates of the devices under management.
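As a rough sketch of the kind of north-bound calls a rack-level client might make against such a Redfish-style interface, the snippet below shows a power-control request and a power-utilization read; the host name, resource IDs, and credentials are placeholders, not values from the OpenRMC spec or device manager:

```python
import requests

RMC = "https://rack-manager.example"   # placeholder rack manager address
AUTH = ("admin", "password")

# Power control: request a graceful restart of one managed node.
requests.post(
    f"{RMC}/redfish/v1/Systems/node1/Actions/ComputerSystem.Reset",
    json={"ResetType": "GracefulRestart"},
    auth=AUTH, verify=False,
)

# Power utilization: read the chassis power telemetry.
power = requests.get(f"{RMC}/redfish/v1/Chassis/rack1/Power", auth=AUTH, verify=False).json()
for control in power.get("PowerControl", []):
    print("Consumed watts:", control.get("PowerConsumedWatts"))
```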
And this next one is actually a work stream, not a work group, but it is going to be an important one for us over the next year, because we really want to encourage folks to work on, submit, or give feedback on these management profiles. The idea is to take the descriptive specifications we have as an industry and provide, effectively, checklists: the prescriptive requirements an implementer needs to meet to be a good citizen in the ecosystem. With those checklists we can have a consistent set of features available through the management interfaces, so folks can write software like the Device Manager that consumes that data, can expect it to be there, and can write functions that actually work across multiple vendors. There is a start: the server profile has already been published, a rack manager profile is under development, and we just recently received submissions for a power shelf profile and an Ethernet switch profile. This work stream will curate these profiles, and then we will need to work with the other projects within OCP that actually own those devices and talk to their subject matter experts, to make sure we get the requirements right, both in terms of what devices across the industry will be able to provide and what the client software needs to have. We don't want to over-specify this, and we don't want to set such a high bar that nobody can meet it, but we do want to set some decent baselines and show some direction going forward. And all of my slides from here on will end with me saying: there's a session on this later today.
One of the newest work streams is the RAS API. RAS, reliability, availability, and serviceability, is an old acronym, but effectively this is about collecting up very device- or vendor-specific crash or error data, probably in some binary form that needs an analyzer to decode. Our job here is to collect that information, migrate it through the firmware and software stack, and get it out to the correct software on the client side that can make heads or tails of that mess, and then point the finger at whoever needs to fix whatever caused it in the first place. So it's a combination of the logging, the transport of the resulting blobs, and how to get that through the stack. It's going to be a very important part of the ecosystem, especially as we have much more distributed compute going on between DPUs and CPUs and all these modular architectures. As soon as I finish up here, there is a set of sessions in the next hour that talks through all of these fault management pieces, so you'll get plenty of information about that.
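As a loose illustration of the flow described here, collect an opaque vendor-specific blob, carry it through the stack unchanged, and hand it to whichever client-side decoder understands it, consider the sketch below; the registry, format IDs, and decoder functions are hypothetical and are not part of the RAS API work itself:

```python
# Hypothetical decoder registry: format identifier -> decoder function.
DECODERS = {}

def register_decoder(format_id):
    def wrap(fn):
        DECODERS[format_id] = fn
        return fn
    return wrap

@register_decoder("vendor-a-crashdump-v1")
def decode_vendor_a(blob: bytes) -> dict:
    # In reality this would parse a vendor-defined binary layout.
    return {"summary": f"{len(blob)} bytes of vendor A crash data"}

def handle_crash_record(format_id: str, blob: bytes) -> dict:
    """The transport layer stays format-agnostic; only the client-side
    decoder needs to understand the vendor-specific contents."""
    decoder = DECODERS.get(format_id)
    if decoder is None:
        return {"summary": "unknown format, forwarding raw blob for offline analysis"}
    return decoder(blob)
```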
Lastly, we have several new work streams that are just coming in. That's one of the nice things about having the summit: it puts a hard deadline on people getting stuff done. So we have a bunch of new work streams that have been approved through the steering committee, and they've landed here now. The newest one is the cloud service model. This is one that has been in incubation for quite a while, so I've got some catching up to do to see what all is in there; Hemal, I hope you know a little more about that. These are the charters for the four new work streams we now have in hardware management. The cloud service one has had its co-leads approved; for the rest we need to nominate and approve co-leads and then start scheduling meetings. As you can see on the slide, the cloud service model is doing things at very large scale, what we would call fleet scale, tens or hundreds of thousands of nodes, and looking at how that impacts the management stack. Another new work stream is GPU management. This is a very broad topic, and I think there is a lot of overlap, which is why we need these charters crisply defined to make sure we're not duplicating effort. So I think we'll have a lot more discussions on GPU management, and again, there are several sessions on this today. The absolute newest one, which I'm going to hand off here in a second for a quick introduction, is the satellite management controller; I won't get ahead of it, I'll let Jeff cover that in a second. And the last one is server component resilience. This is the silent data corruption work that Hemal already mentioned, and there is a talk on that today as well.
So with that, let me move on and introduce Jeff Hilland, who is part of the SMC specification work. I'll let him give you a few minutes on this, because we didn't have a full time slot available for another talk today. Jeff? Thanks. Hi, I'm Jeff Hilland. I'm one of the authors of the satellite management controller spec that you'll see coming into OCP. I'm also president of the DMTF and co-chair of the SPDM working group.
Right now, every PCIe and CXL device out there has its own API and management interface for managing its components. That makes life miserable for BMC vendors, because you've got to create a custom interface for each one, and usually it's more than one per product team: your company may have more than one product, and the interface may vary by product. The PMCI suite of specs, all the PLDM and MCTP work that the DMTF comes up with, helps a little bit, but like any descriptive spec it has a lot of information and a lot of options, and those vary from vendor to vendor as well. Plus there's a whole bunch of other stuff in there. Do you support SPDM? What level of SPDM do you support? Do you support firmware update? What are the options there? So what we felt was needed was to standardize, as an industry, on a common set of items you would need for any component inside of a system, to try to gain a little bit of interoperability. There are a lot of specs out there, though, and all of us vendors have our own too: we've got one at HPE, there's OpenBIC for servers, Google has Mini BMC for storage, Samsung has the DMC for storage, and Dell has theirs.
So we got together and decided to do this. Maybe we shouldn't have called it the satellite management controller, because apparently that caused a little bit of confusion: we're not managing satellites, we're managing components in a system, so think of it as a subsystem management controller instead. That may be a chip on your card, or it may be gates already existing in silicon inside your implementation. But really these are the components inside a server that feed into a BMC. What we've come up with is a spec that outlines all of those requirements; it goes through each of the different DSPs inside PMCI and other specs and says, yes, you must have an I2C interface, things like that. So we've produced a prescriptive spec that pins down the optional requirements for a subsystem manager for a component inside a system, and we've submitted it into OCP.
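To give a feel for what a prescriptive spec as a checklist means in practice, here is a toy sketch; the requirement names and the device capability listing are invented for illustration and are not taken from the SMC spec:

```python
# Hypothetical baseline checklist an SMC-style spec might impose on a component.
REQUIRED_CAPABILITIES = {
    "mctp_over_i2c": True,        # transport binding must be supported
    "pldm_firmware_update": True, # PLDM firmware update (PLDM Type 5)
    "spdm_attestation": True,     # device identity and measurement via SPDM
}

def check_compliance(device_caps: dict) -> list[str]:
    """Return the required capabilities the device does not advertise."""
    return [name for name, required in REQUIRED_CAPABILITIES.items()
            if required and not device_caps.get(name, False)]

# Example device advertising its capabilities (hypothetical values).
device = {"mctp_over_i2c": True, "pldm_firmware_update": True, "spdm_attestation": False}
print(check_compliance(device))   # -> ['spdm_attestation']
```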
We're hoping to kick off the work stream soon. Maybe it ends up in profiles; it does feel very much like a profile, but for the MCTP and PLDM specs instead of Redfish specs. A suite of companion Redfish profiles would go along with it quite nicely, so we expect some of those to come along as well. And there may be some other submissions to help this process along and help vendors meet the requirements of the specification, so stay tuned for those. Thank you. Thank you, Jeff.
OK, so just to summarize. As you saw, we included that introduction here because we didn't have a separate session for it, so it was good to get it into this talk. As you can see, we have all these subprojects and work streams within hardware management, and its scope has expanded significantly. We have two different kinds of profiles, as Jeff Autor and Jeff Hilland were mentioning: the Redfish-based profiles, where you can start with the baseline and create new ones, and the PMCI-style profile that Jeff Hilland was talking about; we have something like that in OCP NIC 3.0, this satellite management controller is another, and more and more specs will be like that. And there are other work streams that are defining APIs. Through all of that, you can see we cover the whole spectrum of OCP platform hardware management. So I really encourage everybody to look into what we have done so far, provide us feedback, and submit proposals. There are new work streams coming up; if you have ideas there and want to contribute, join the specific subproject, project calls, or work stream calls, and contribute. And here are some links you can look at. Jeff, do you want to say something?
No, just that we have a completely full day today with no breaks between sessions, so I'm going to ask our next speakers to start moseying up here so we can try to keep on time. But as Hemal said, there are plenty of specs and lots of work streams to get into, so please join in and help us move all of this new work forward.
Okay. Do we have time for some Q&A, or how are we doing on time? All right, so if you have any questions, we can talk about them offline. Let's get to our next speaker, Yogesh, who will be talking about the fault management subproject. Let's give a round of applause for Yogesh.