Hello. Good afternoon. My name is Matt Bowman, a hardware engineer at Meta.
And I'm Jeremy Baumgartner, also a hardware engineer at Meta.
Today we're going to talk about Grand Teton, give a system overview, and also share a bit of lessons learned and some of the gotchas we ran into as we went through building it, deploying it, and having fun.
So we'll start off, and this is it. Kind of jump right in; didn't want to do too much of a buildup. This is the Grand Teton exterior ISO view. What we plan to talk about today is both the detailed specifics around the system, the system design, the modularity, the trays, and also some of the things we encountered around SI, serviceability, and availability. So less the things you could go read in the spec, and more the value add we did during the system design and build.
I wanted to start and cover some of the motivation. So maybe a show of hands: has anyone heard about AI throughout the show today? Yesterday? Yeah, probably. We're in a similar boat. AI and ML workloads at Meta are continuing to grow in quantity and intensity; they're not going away. The whole LLM revolution, you know, we're part of that too. And what we found is we needed hardware that could keep up, so to speak. Hardware design is generally a longer process than software, and we needed something modular that lets us optimize, tweak, and adjust, so that we could get something fine-tuned, hit the performance and TCO targets, and roll with the punches, so to speak.
I'll pass it over to Jeremy now. Yeah, I'll go into a little bit of detail on Grand Teton's overall architecture. As a heads up, we have all our open source specs up on the OCP website, so everything you see here is in the spec too, with a lot more detail breaking it all down. Just an overview: this is a big 8OU chassis. You can see in the picture here, sort of exploded without the sheet metal, we have three main trays visible from the front: the CPU tray, the switch tray, and the accelerator tray. The whole chassis is ORv3 compliant and slides right into the rack. Out the back we have the fans and power boards. And the whole chassis and all the trays in it are hot swappable.
Going into a little more detail on the three main trays: we have the CPU tray, which is two-socket. We've done both an AMD and an Intel implementation, again with one spec for each; you can see those on the website. Downstream, we have PCIe Gen 5 going down to the switch tray, which then connects to the accelerator tray below, with a cable backplane connecting all three of these trays. We also have a DC-SCM, shout out to that OCP standard; we've done some slight customizations for Grand Teton, and we'll get into a little detail there too. And two front-end OCP NICs. Below that we have the switch tray, which is 2OU, with four Broadcom PCIe Gen 5 switches, eight back-end NICs, capacity for 16 SSDs, and all the downstream PCIe Gen 5 going to the accelerator tray. In the implementation we're showing here, we are compatible with NVIDIA's HGX pinout specification.
Just a little more detail, another view of the PCIe architecture here: two CPU sockets going down to the four PCIe Gen 5 switches, which fan out to the SSDs and the back-end NICs, and then to the GPU tray at the end. So it's all highly interconnected.
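To make that fan-out concrete, here is a minimal sketch (not Meta tooling) that models the topology described above as a plain Python data structure: two CPU sockets, four PCIe Gen 5 switches in the switch tray, and the back-end NICs, SSD bays, and accelerator tray hanging off those switches. The per-switch counts are an even-split assumption for illustration; the OCP spec is the authoritative source.

```python
# Illustrative sketch of the Grand Teton PCIe Gen 5 fan-out described in the talk.
# Counts per switch are assumed to be an even split of the totals mentioned
# (8 back-end NICs, capacity for 16 SSDs); check the OCP spec for the exact mapping.

TOPOLOGY = {
    "cpu_tray": {
        "sockets": 2,                      # AMD or Intel, one spec for each
        "dc_scm": 1,                       # DC-SCM with Grand Teton customizations
        "frontend_ocp_nics": 2,
    },
    "switch_tray": {
        "pcie_gen5_switches": [
            {"backend_nics": 2, "ssd_bays": 4, "downstream": "accelerator_tray"}
            for _ in range(4)              # four Broadcom Gen 5 switches
        ],
    },
    "accelerator_tray": {
        "baseboard": "NVIDIA HGX-compatible pinout",
    },
}

def summarize(topology: dict) -> str:
    """Roll up the per-switch counts to sanity-check the totals from the talk."""
    switches = topology["switch_tray"]["pcie_gen5_switches"]
    nics = sum(s["backend_nics"] for s in switches)
    ssds = sum(s["ssd_bays"] for s in switches)
    return f"{len(switches)} switches, {nics} back-end NICs, {ssds} SSD bays"

if __name__ == "__main__":
    print(summarize(TOPOLOGY))   # -> 4 switches, 8 back-end NICs, 16 SSD bays
```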
Cool, thank you. So Jeremy just gave us great detail on the specifics of how the system goes together. Tying it to some of the earlier comments about modularity and flexibility, this is just a great example I love, and it's a huge shout out to Meta's thermal and mechanical teams for designing with this mindfulness. Because of the different combinations of trays you can put in, what we actually have to deal with is varying airflow impedance through the system. If you didn't have this tuned, airflow, much like current, would just take the path of least resistance, bypass the parts that you really need to cool, and things wouldn't work. If you were to go to our system, pull out all the trays, peek in, and take a picture, that's exactly what this is. It's showing examples of the different perforation (perf) patterns you use depending on the configuration or flavor of the system. And we do it both to make the system work, to have a valid thermal kit, and also to let us optimize the airflow. We don't want to overcool things, especially in our DCs, where airflow itself is a commodity and you might be paying too much for the fans and all of that other stuff. I love this as a very practical example: we knew we were going to have to adjust and roll with the punches, so it was designed so that changing these perf patterns isn't hugely painful where you have to rip out a whole bunch of parts. They're meant to be swapped, rearranged, and adjusted.
And this was another one I found perfect. Probably two or three months ago we were at one of our data centers, just walking into the room on the way to the data hall, and I saw this and thought, I need that in my OCP presentation. It's this huge sign on the way into the data hall. Most people associate Facebook with move fast, break things. In our DCs and infrastructure, we take it a step further, with a few more caveats: move fast safely, and do it with high availability. I put the picture here because it ties that in and makes it feel more real. It really is something we focus on and believe in; it's the first thing you see walking into the data hall. And I'll pass it back, because we have some examples of how we're doing this on Grand Teton.
Yeah. So these are really simple, but really high impact for us and the DC techs; they make everyone's lives easier. This is what we call the back saver button. It's basically an AC cycle button we put on the front of the DC-SCM, and it power cycles the whole system, with no software or firmware in between. One of the biggest remediation actions we take in the data centers is just power cycling the system. And if we can't reach it remotely and you're there in person, well, how do you do that? You take the server out and put it back in. This thing weighs 160 kilograms, 350 pounds, so this button is a whole lot easier. By having solid copper from the button to the hot swap enable pin on the back, it's equivalent to power cycling the entire server by pulling it out of the rack and putting it back in. So it was a nice win there.
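For context on the remote side of that remediation: when the BMC is reachable, a power cycle is typically a one-line out-of-band command, and the button covers the case where it isn't. The sketch below is a generic illustration using the standard ipmitool CLI, not Meta's actual remediation tooling; the hostname and credentials are placeholders.

```python
# Generic sketch of an out-of-band power cycle via ipmitool (placeholder host/creds).
# The back saver button exists precisely for the case where this path is unreachable.
import subprocess

def remote_power_cycle(bmc_host: str, user: str, password: str) -> bool:
    """Attempt an out-of-band chassis power cycle; return True on success."""
    cmd = [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host, "-U", user, "-P", password,
        "chassis", "power", "cycle",
    ]
    try:
        subprocess.run(cmd, check=True, timeout=30)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError):
        return False

if __name__ == "__main__":
    if not remote_power_cycle("bmc.example.net", "admin", "changeme"):
        print("BMC unreachable: dispatch a tech to press the front-panel AC cycle button.")
```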
Another really simple, high impact addition we made to the system are these LEDs. They are power LEDs tied directly to the power rails for each hot swap stage, sort of going down the line. The most evident ones are on the SCM, and we have a couple on the motherboard as well. Again, if you're there in person and you were unable to diagnose it remotely, you can instead look at these LEDs and make a high confidence decision about how to fix the system as fast as possible. These power boards, if you have to replace one, take a little longer than average compared to the easily swappable modules we have in the front. So again, another nice win there.
So Grand Teton is PCIe Gen 5 throughout all the main interconnects to the GPUs and the NICs. This is actually our first AI/ML system using Gen 5, so there were some learnings on our part and things we took away, and I wanted to share a few of them because I found them very interesting. Up on the screen you can see the picture; this was a snapshot of the layout. You have the breakout regions, typically within the BGA fields near the connectors, where some geometry needs to change, say the spacing or the width of a diff pair. Before, you could kind of get away with slop: it's so short, don't care, just shrink it down, get it routed and connected. What ended up happening is our EVT builds of these Grand Tetons passed. We looked good. We did all the 4x4, 5x5 testing, bit error rate; it all passed on EVT. We did not have what we're calling these neckdown regions defined as a controlled impedance in our PCB stackup. And when we got our first batch of DVT boards, we were actually seeing failures. It was like, oh man, what's going on? Through TDR testing and some investigation, we realized it was this region of impedance near the connectors, underneath the BGA, that was basically uncontrolled at the PCB fab, and we use multiple vendors, so it was just causing havoc. So what we ended up doing was saying that region needs to be a controlled impedance. You maybe can't quite hit the 85 ohms that you need, but at least have it controlled, with a test coupon on the board that's validated, so that when you get the board you know it's not going to be some surprise. We found that very important, so we wanted to share it here. Yeah, that was a fun, interesting learning.
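To give a feel for why an uncontrolled neckdown region matters at Gen 5, here is a back-of-the-envelope sketch using the standard reflection-coefficient formula, gamma = (Z2 - Z1) / (Z2 + Z1). The 85 ohm target comes from the talk; the per-fab "uncontrolled" impedance values are hypothetical, just to show how fab-to-fab variation in an unspecified region turns into reflections.

```python
# Back-of-the-envelope reflection at an impedance discontinuity.
# 85 ohms is the differential target mentioned in the talk; the neckdown values are hypothetical.

def reflection_coefficient(z_line: float, z_region: float) -> float:
    """Fraction of the incident wave reflected at the boundary into the region."""
    return (z_region - z_line) / (z_region + z_line)

TARGET_OHMS = 85.0
for fab, neckdown_ohms in {"fab A": 80.0, "fab B": 65.0, "fab C": 55.0}.items():
    gamma = reflection_coefficient(TARGET_OHMS, neckdown_ohms)
    print(f"{fab}: neckdown {neckdown_ohms:.0f} ohm -> {abs(gamma) * 100:.1f}% reflected")

# Controlling (and coupon-testing) the region does not make it hit 85 ohms,
# but it bounds the spread so the reflection is a known, simulated quantity
# instead of a surprise that only shows up on some DVT boards.
```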
So another one that I won't say is unique to Grand Teton, but was certainly exacerbated by our modularity, trays, cable backplanes, and things that mate: you're not always guaranteed a perfect mate. I'd say it's unlikely that you push a tray in and it ends up all the way, perfectly inserted. So that simulation model of an ideal connector mate probably isn't what you're actually seeing in reality; there will be some connector demate. And we actually knew about that part; we took it into account in our simulations. It's one of those fun cross-functional things, this bartering over how much SI margin you need versus the mechanical design: what tolerance do you take, what can you expect, what is the connector gather going to look like? It's something multiple teams really have to look at together. What we didn't necessarily have nailed down as well is the messier reality in the bottom picture: connector demate is not always perfectly orthogonal. In fact, it's probably going to have some twist, some bow, some skew; it's going to do something weird. So when you're spec'ing it and everyone on the different teams is talking about it, you say, okay, we can have 1.5 millimeters of demate. Where do you mean that? Is that the worst case, the best case, the middle? It's just one of those important areas where you need to align: when you're talking about demate, where is that measured and what do you mean by it? And go to that degree of detail, and do it in simulation, because at faster speeds you get less and less of this extra margin to play with, and this can come up and bite you.
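As a toy illustration of why "where do you mean 1.5 mm of demate?" matters, here is a sketch comparing a worst-case roll-up against an RSS (root-sum-square) roll-up of hypothetical demate contributors. The contributor names and millimeter values are illustrative, not Grand Teton data; the point is that the number handed to SI changes a lot depending on whether you quote the statistical or the absolute worst case, and on which corner of a skewed, non-orthogonal mate you measure.

```python
# Hypothetical demate tolerance stack: worst-case vs. RSS roll-up of contributors.
# Contributor names and millimeter values are illustrative, not Grand Teton data.
import math

contributors_mm = {
    "tray latch / hard-stop position": 0.6,
    "cable backplane float": 0.4,
    "connector gather / wipe loss": 0.3,
    "chassis and PCB tolerances": 0.3,
    "tray twist or bow at one corner": 0.4,   # the non-orthogonal part of the picture
}

worst_case = sum(contributors_mm.values())
rss = math.sqrt(sum(v * v for v in contributors_mm.values()))

print(f"worst-case demate: {worst_case:.2f} mm")   # ~2.00 mm
print(f"RSS demate:        {rss:.2f} mm")          # ~0.93 mm

# If mechanical quotes the RSS number and SI simulates the worst case (or vice versa),
# the teams think they agree on "1.5 mm of demate" while actually margining
# against different conditions.
```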
And in addition to simulation, it needs to be validated in the lab. This was another picture I had to put in the presentation, because I liked the juxtaposition: we have this state-of-the-art AI system, and we just use zip ties to really make sure it's fully seated and mated. Then you run all your validation and go through it knowing it was fully mated. It's pretty rewarding to get that done.
All right, that brings us to our final slide here, our call to action. As we indicated earlier, these specs are contributed, so you can go onto the OCP contribution website; they're there today, right now. There's a little QR code for anyone with a phone that will bring you right there, and you can pick them up. This is all part of the OCP Server work group. We've talked and presented in it, and I'd encourage anyone interested to go look at the recordings and participate. On the show floor today, there is a Grand Teton you can see. You can also take a virtual tour of one at the Meta booth, or go to metainfrahardware.com, where there are different flavors of the tour, not just Grand Teton but a bunch of other Meta infrastructure you can look at.
And we did set this up so we have time for questions; we're happy to take them.
That's a good one. We ship full racks. At the scale we're at, full racks are what ship to the DC; that's what we order and plan around.
Thanks. Referring to the modular design, what components would you have to swap out to convert the system to take a different form factor GPU, a PCIe or OAM module? You want to take that one?
Just to repeat the question, you're asking what modules get swapped out if we put in different GPUs?
Yeah, if you want to adapt the system for PCIe or OAM. Right. So that 4U GPU tray at the bottom, that whole chassis in this instance was an NVIDIA HGX. You could also consider something like a UBB-style standard. You need to match the pinout at the back of the cable backplane here, and if you're able to switch at that level, it becomes an easier swap. If you want to do a more rigorous redesign, you might have to change a little more of the chassis and how it's cabled.
Anything else? All right. Thank you very much.