Hello. Good afternoon. My name is Matt Bowman, a hardware engineer at Meta.
And I'm Jeremy Baumgartner, also a hardware engineer at Meta.
Today we're going to talk about Grand Teton, give a system overview, and also share a bit of lessons learned and some of the gotchas we ran into as we went through building it, deploying it, and having fun.
So we'll start off, and this is it. Kind of jump right in; didn't want to do too much of a buildup. This is the Grand Teton exterior ISO view. What we plan to talk about today is both the detailed specifics around the system, the system design, the modularity, the trays, and also some of the things we encountered around SI, serviceability, and availability. So less the things you could go read in the spec, and more the value add we did during the system design and build.
I wanted to start and cover some of the motivation. So maybe a show of hands: has anyone heard about AI throughout the show today? Yesterday? Yeah, probably. We're in a similar boat. AI and ML workloads at Meta are continuing to grow in quantity and intensity; they're not going away. The whole LLM revolution, you know, we're part of that too. And what we found is we needed hardware that could keep up, so to speak. Hardware design is generally a longer process than software, and we needed something modular that lets us optimize, tweak, and adjust, so that we could get something fine-tuned, hit the performance and TCO targets, and roll with the punches, so to speak.
I'll pass it over to Jeremy now. Yeah, I'll go into a little bit of detail on Grand Teton's overall architecture. As a heads up, we have all our open source specs up on the OCP website, so everything you see here is in the spec too, with a lot more detail breaking it all down. Just an overview: this is a big 8OU chassis. You can see in the picture here, sort of exploded without the sheet metal, we have three main trays visible from the front: the CPU tray, the switch tray, and the accelerator tray. The whole chassis is ORv3 compliant and slides right into the rack. Out the back we have the fans and power boards. And the whole chassis and all the trays in it are hot swappable.
Going into a little more detail on the three main trays: we have the CPU tray, which is two-socket. We've done both an AMD and an Intel implementation, again with one spec for each; you can see those on the website. Downstream, we have PCIe Gen 5 going down to the switch tray, which then connects to the accelerator tray below, with a cable backplane connecting all three of these trays. We also have a DC-SCM, shout out to that OCP standard; we've done some slight customizations for Grand Teton, and we'll get into a little detail there too. And two front-end OCP NICs. Below that we have the switch tray, which is 2OU, with four Broadcom PCIe Gen 5 switches, eight back-end NICs, capacity for 16 SSDs, and all the downstream PCIe Gen 5 going to the accelerator tray. In the implementation we're showing here, we are compatible with NVIDIA's HGX pinout specification.
Just a little more detail, another view of the PCIe architecture here: two CPU sockets going down to the four PCIe Gen 5 switches, which fan out to the SSDs and the back-end NICs, and then to the GPU tray at the end. So it's all highly interconnected.
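To make that fan-out concrete, here is a minimal sketch (not Meta tooling) that models the topology described above as a plain Python data structure: two CPU sockets, four PCIe Gen 5 switches in the switch tray, and the back-end NICs, SSD bays, and accelerator tray hanging off those switches. The per-switch counts are an even-split assumption for illustration; the OCP spec is the authoritative source.

```python
# Illustrative sketch of the Grand Teton PCIe Gen 5 fan-out described in the talk.
# Counts per switch are assumed to be an even split of the totals mentioned
# (8 back-end NICs, capacity for 16 SSDs); check the OCP spec for the exact mapping.

TOPOLOGY = {
    "cpu_tray": {
        "sockets": 2,                      # AMD or Intel, one spec for each
        "dc_scm": 1,                       # DC-SCM with Grand Teton customizations
        "frontend_ocp_nics": 2,
    },
    "switch_tray": {
        "pcie_gen5_switches": [
            {"backend_nics": 2, "ssd_bays": 4, "downstream": "accelerator_tray"}
            for _ in range(4)              # four Broadcom Gen 5 switches
        ],
    },
    "accelerator_tray": {
        "baseboard": "NVIDIA HGX-compatible pinout",
    },
}

def summarize(topology: dict) -> str:
    """Roll up the per-switch counts to sanity-check the totals from the talk."""
    switches = topology["switch_tray"]["pcie_gen5_switches"]
    nics = sum(s["backend_nics"] for s in switches)
    ssds = sum(s["ssd_bays"] for s in switches)
    return f"{len(switches)} switches, {nics} back-end NICs, {ssds} SSD bays"

if __name__ == "__main__":
    print(summarize(TOPOLOGY))   # -> 4 switches, 8 back-end NICs, 16 SSD bays
```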
Cool, thank you. So Jeremy just gave us great detail on the specifics of how the system goes together. Tying it to some of the earlier comments about modularity and flexibility, this is just a great example I love, and it's a huge shout out to Meta's thermal and mechanical teams for designing with this mindfulness. Because of the different combinations of trays you can put in, what we actually have to deal with is varying airflow impedance through the system. If you didn't have this tuned, airflow, much like current, would just take the path of least resistance, bypass the parts that you really need to cool, and things wouldn't work. If you were to go to our system, pull out all the trays, peek in, and take a picture, that's exactly what this is. It's showing examples of the different perforation (perf) patterns you use depending on the configuration or flavor of the system. And we do it both to make the system work, to have a valid thermal kit, and also to let us optimize the airflow. We don't want to overcool things, especially in our DCs, where airflow itself is a commodity and you might be paying too much for the fans and all of that other stuff. I love this as a very practical example: we knew we were going to have to adjust and roll with the punches, so it was designed so that changing these perf patterns isn't hugely painful where you have to rip out a whole bunch of parts. They're meant to be swapped, rearranged, and adjusted.
And this was another one I found perfect. Probably two or three months ago we were at one of our data centers, just walking into the room on the way to the data hall, and I saw this and thought, I need that in my OCP presentation. It's this huge sign on the way into the data hall. Most people associate Facebook with move fast, break things. In our DCs and infrastructure, we take it a step further, with a few more caveats: move fast safely, and do it with high availability. I put the picture here because it ties that in and makes it feel more real. It really is something we focus on and believe in; it's the first thing you see walking into the data hall. And I'll pass it back, because we have some examples of how we're doing this on Grand Teton.
Yeah. So these are really simple, but really high impact for us and the DC techs; they make everyone's lives easier. This is what we call the back saver button. It's basically an AC cycle button we put on the front of the DC-SCM, and it power cycles the whole system, with no software or firmware in between. One of the biggest remediation actions we take in the data centers is just power cycling the system. And if we can't reach it remotely and you're there in person, well, how do you do that? You take the server out and put it back in. This thing weighs 160 kilograms, 350 pounds, so this button is a whole lot easier. By having solid copper from the button to the hot swap enable pin on the back, it's equivalent to power cycling the entire server by pulling it out of the rack and putting it back in. So it was a nice win there.
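For context on the remote side of that remediation: when the BMC is reachable, a power cycle is typically a one-line out-of-band command, and the button covers the case where it isn't. The sketch below is a generic illustration using the standard ipmitool CLI, not Meta's actual remediation tooling; the hostname and credentials are placeholders.

```python
# Generic sketch of an out-of-band power cycle via ipmitool (placeholder host/creds).
# The back saver button exists precisely for the case where this path is unreachable.
import subprocess

def remote_power_cycle(bmc_host: str, user: str, password: str) -> bool:
    """Attempt an out-of-band chassis power cycle; return True on success."""
    cmd = [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host, "-U", user, "-P", password,
        "chassis", "power", "cycle",
    ]
    try:
        subprocess.run(cmd, check=True, timeout=30)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError):
        return False

if __name__ == "__main__":
    if not remote_power_cycle("bmc.example.net", "admin", "changeme"):
        print("BMC unreachable: dispatch a tech to press the front-panel AC cycle button.")
```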
Another really simple, high impact addition we made to the system are these LEDs. They are power LEDs tied directly to the power rails for each hot swap stage, sort of going down the line. The most evident ones are on the SCM, and we have a couple on the motherboard as well. Again, if you're there in person and you were unable to diagnose it remotely, you can instead look at these LEDs and make a high confidence decision about how to fix the system as fast as possible. These power boards, if you have to replace one, take a little longer than average compared to the easily swappable modules we have in the front. So again, another nice win there.
So Grand Teton is PCIe Gen 5 throughout all the main interconnects to the GPUs and the NICs. This is actually our first AI/ML system using Gen 5, so there were some learnings on our part and things we took away, and I wanted to share a few of them because I found them very interesting. Up on the screen you can see the picture; this was a snapshot of the layout. You have the breakout regions, typically within the BGA fields near the connectors, where some geometry needs to change, say the spacing or the width of a diff pair. Before, you could kind of get away with slop: it's so short, don't care, just shrink it down, get it routed and connected. What ended up happening is our EVT builds of these Grand Tetons passed. We looked good. We did all the 4x4, 5x5 testing, bit error rate; it all passed on EVT. We did not have what we're calling these neckdown regions defined as a controlled impedance in our PCB stackup. And when we got our first batch of DVT boards, we were actually seeing failures. It was like, oh man, what's going on? Through TDR testing and some investigation, we realized it was this region of impedance near the connectors, underneath the BGA, that was basically uncontrolled at the PCB fab, and we use multiple vendors, so it was just causing havoc. So what we ended up doing was saying that region needs to be a controlled impedance. You maybe can't quite hit the 85 ohms that you need, but at least have it controlled, with a test coupon on the board that's validated, so that when you get the board you know it's not going to be some surprise. We found that very important, so we wanted to share it here. Yeah, that was a fun, interesting learning.
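To give a feel for why an uncontrolled neckdown region matters at Gen 5, here is a back-of-the-envelope sketch using the standard reflection-coefficient formula, gamma = (Z2 - Z1) / (Z2 + Z1). The 85 ohm target comes from the talk; the per-fab "uncontrolled" impedance values are hypothetical, just to show how fab-to-fab variation in an unspecified region turns into reflections.

```python
# Back-of-the-envelope reflection at an impedance discontinuity.
# 85 ohms is the differential target mentioned in the talk; the neckdown values are hypothetical.

def reflection_coefficient(z_line: float, z_region: float) -> float:
    """Fraction of the incident wave reflected at the boundary into the region."""
    return (z_region - z_line) / (z_region + z_line)

TARGET_OHMS = 85.0
for fab, neckdown_ohms in {"fab A": 80.0, "fab B": 65.0, "fab C": 55.0}.items():
    gamma = reflection_coefficient(TARGET_OHMS, neckdown_ohms)
    print(f"{fab}: neckdown {neckdown_ohms:.0f} ohm -> {abs(gamma) * 100:.1f}% reflected")

# Controlling (and coupon-testing) the region does not make it hit 85 ohms,
# but it bounds the spread so the reflection is a known, simulated quantity
# instead of a surprise that only shows up on some DVT boards.
```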
So another one that I won't say is unique to Grand Teton, but was certainly exacerbated by our modularity, trays, cable backplanes, and things that mate: you're not always guaranteed a perfect mate. I'd say it's unlikely that you push a tray in and it ends up all the way, perfectly inserted. So that simulation model of an ideal connector mate probably isn't what you're actually seeing in reality; there will be some connector demate. And we actually knew about that part; we took it into account in our simulations. It's one of those fun cross-functional things, this bartering over how much SI margin you need versus the mechanical design: what tolerance do you take, what can you expect, what is the connector gather going to look like? It's something multiple teams really have to look at together. What we didn't necessarily have nailed down as well is the messier reality in the bottom picture: connector demate is not always perfectly orthogonal. In fact, it's probably going to have some twist, some bow, some skew; it's going to do something weird. So when you're spec'ing it and everyone on the different teams is talking about it, you say, okay, we can have 1.5 millimeters of demate. Where do you mean that? Is that the worst case, the best case, the middle? It's just one of those important areas where you need to align: when you're talking about demate, where is that measured and what do you mean by it? And go to that degree of detail, and do it in simulation, because at faster speeds you get less and less of this extra margin to play with, and this can come up and bite you.
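As a toy illustration of why "where do you mean 1.5 mm of demate?" matters, here is a sketch comparing a worst-case roll-up against an RSS (root-sum-square) roll-up of hypothetical demate contributors. The contributor names and millimeter values are illustrative, not Grand Teton data; the point is that the number handed to SI changes a lot depending on whether you quote the statistical or the absolute worst case, and on which corner of a skewed, non-orthogonal mate you measure.

```python
# Hypothetical demate tolerance stack: worst-case vs. RSS roll-up of contributors.
# Contributor names and millimeter values are illustrative, not Grand Teton data.
import math

contributors_mm = {
    "tray latch / hard-stop position": 0.6,
    "cable backplane float": 0.4,
    "connector gather / wipe loss": 0.3,
    "chassis and PCB tolerances": 0.3,
    "tray twist or bow at one corner": 0.4,   # the non-orthogonal part of the picture
}

worst_case = sum(contributors_mm.values())
rss = math.sqrt(sum(v * v for v in contributors_mm.values()))

print(f"worst-case demate: {worst_case:.2f} mm")   # ~2.00 mm
print(f"RSS demate:        {rss:.2f} mm")          # ~0.93 mm

# If mechanical quotes the RSS number and SI simulates the worst case (or vice versa),
# the teams think they agree on "1.5 mm of demate" while actually margining
# against different conditions.
```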
And in addition to simulation, it needs to be validated in the lab. This was another picture I had to put in the presentation, because I liked the juxtaposition: we have this state-of-the-art AI system, and we just use zip ties to really make sure it's fully seated and mated. Then you run all your validation and go through it knowing it was fully mated. It's pretty rewarding to get that done.
All right, that brings us to our final slide here, our call to action. As we indicated earlier, these specs are contributed, so you can go onto the OCP contribution website; they're there today, right now. There's a little QR code for anyone with a phone that will bring you right there, and you can pick them up. This is all part of the OCP Server work group. We've talked and presented in it, and I'd encourage anyone interested to go look at the recordings and participate. On the show floor today, there is a Grand Teton you can see. You can also take a virtual tour of one at the Meta booth, or go to metainfrahardware.com, where there are different flavors of the tour, not just Grand Teton but a bunch of other Meta infrastructure you can look at.
And we did set this up so we have time for questions; we're happy to take them.
That's a good one. We ship full racks. At the scale we're at, full racks are what ship to the DC; that's what we order and plan around.
Thanks. Referring to the modular design, what components would you have to swap out to convert the system to take a different form factor GPU, a PCIe or OAM module? You want to take that one?
Just to repeat the question, you're asking what modules get swapped out if we put in different GPUs?
Yeah, if you want to adapt the system for PCIe or OAM. Right. So that 4U GPU tray at the bottom, that whole chassis in this instance was an NVIDIA HGX. You could also consider something like a UBB-style standard. You need to match the pinout at the back of the cable backplane here, and if you're able to switch at that level, it becomes an easier swap. If you want to do a more rigorous redesign, you might have to change a little more of the chassis and how it's cabled.
Anything else? All right. Thank you very much.