All right. Hi everybody. I'm Christopher Blackburn. This is my colleague Brian Costello. We're principal technologists at TE Connectivity on our system architecture team. And we're here to talk today about disaggregated and composable architectures and the importance of extended PCIe cabling.
To give you a little bit of background on composability and disaggregation: one of the things we've heard throughout the day at the summit is the increasing power it takes to run this compute infrastructure. If we look at server utilization, it's staggeringly low, about 10%. So the theme with composability and disaggregation is to do more with what we have rather than adding more servers. If you take a look at the top graphic, there's a variety of workloads in the data center, and each traditional server has a fixed, finite amount of resources, whether it's memory, acceleration, storage, and so on. Composability adds the ability to configure the hardware according to the workload on the fly, so if you have an application that requires more memory or more acceleration, you have access to it. These architectures are really built on PCIe and CXL fabrics, and we're going to talk about some of the applications and the connectivity that makes them happen. As for applications, you heard speakers before me talk about pooled and shared memory; GPUs and AI accelerators, obviously a big topic for this summit; and storage, whether it's drives or flash. There's CXL switching, which we just heard a question about, and we'll talk a little bit about rack-level disaggregation. And there's a newer concept of disaggregating the network interface card from the server. You also see these acronyms, JBOM, JBOG, "just a bunch of" memory or GPUs, which is really pooling all of these devices together into one aggregate box.
To take a look at a couple of server designs: these are what we call sandboxes, designs we create to showcase the connectivity across a variety of applications. They're not necessarily real boxes, but they take you through where the external PCIe connectivity lies. On the left there's a CXL switch with a cabled I/O port, and on the other side you'll see some belly-to-belly or even multi-board CXL switches. If you come by the Composable Memory Systems Experience Center, you'll see some boxes there; I think H3C has a really nice CXL switch to take a look at. Off to the right, and tying into the last talk we heard about memory modules, there's a variety of device card form factors, whether they're new cards like TA1034 or existing cards like PCIe add-in cards, OCP NICs, et cetera. So there are front-faceplate pluggable memory modules and also PCIe retimer cards, and we'll talk about those throughout the talk. The endpoints of these devices usually have a retimer card, and that definitely impacts the type of connectivity that we'll see.
So I'm going to pass it over to Brian. Thanks, Chris. I want to talk a little bit about standardization. There's a lot of effort going on right now to standardize external PCIe connectivity. PCI-SIG has three workgroups all working toward that. The EWG, the electrical workgroup, has historically owned the base spec, but it has expanded into internal and external PCIe connectivity. The cable workgroup has historically worked on Mini-SAS HD and has worked through the different generations up to Gen 5, which it's working on now. The optical workgroup is newly formed and is working on standardizing optics and figuring out how we can make optics work for PCIe connectivity. There are two paths there: one where you really need to modify the base specification, and another where we're looking at optics technologies and the form factors that can be used to get there. Within SNIA, the SFF organization has SFF-TA-1032, which standardizes the mechanicals of the external connectivity, and INF 1003, which was formerly an Ethernet standard but has been utilized for PCIe as well. OCP has an extended external connectivity workstream right now under the server group, and that group is taking a larger, more holistic approach, looking at passive connectivity, active optics, active copper, and even co-packaged optics. There are supporting standards involved as well: CMIS, the common management interface that's being used for the external modules and cables, and SFF-8024, which defines the identifier codes that tell the host system what's plugged in.
CDFP is a connector that has been around for a little while. It was selected by the PCIe workgroup for Gen 5, Gen 6, and beyond. To give a little history: CDFP was originally developed as an Ethernet standard. There was an MSA, the 400-gig Ethernet MSA, and it was built in two different form factors, style one and style two, style two being the larger. In the upper right we have an optical module that TE produced a while back for Ethernet in style one, and below it is a style-two optical module that TE produced, again for Ethernet. That solution was then utilized by some companies for PCIe connectivity at Gen 3 and Gen 4. As the industry moved toward Gen 5 and higher speeds, improvements were made to both the connector and the cable, and that became the basis for what was chosen for the PCIe external cable. The specification is listed there, CopperLink, which is a trademark the PCI-SIG came up with to differentiate it. That specification is at version 0.9 now, and we expect to fully release the spec by the end of the year. There's also a mechanical specification partnered with it, the SFF-TA-1032 I mentioned earlier, which really just defines the mechanicals of the CDFP connector.
To give you a little more description: CDFP, again, is an I/O connector that's been around a while. One of the nice benefits is the die-cast cage that's bolted down; it's a very robust I/O interface. We have three different sizes, a x16, a x8, and a x4, so it's really just scaling that interface down for the smaller lane counts. The cables we're shipping today are x16s, and we're working toward filling out the family with the x8 and the x4. You can see what they look like: it's a dual-paddle-card design, so it's space efficient in the system. We'll talk about that next.
When we compare against other form factors, CDFP, which in its latest incarnation was really designed around PCIe, has all of the sideband signals specified by PCI-SIG. And because it was designed for PCIe, it's an 85-ohm-based system. The other standards out there were really developed around Ethernet, so they're not optimized for PCIe the way CDFP is. The last thing I want to mention is power capability. In the spec we added a 12-volt power pin to enable higher power for the active devices that might be needed, optics as well as active copper, and with it we can get about 23 watts of power. When we calculate what's out there, a retimer at about one watt per lane plus overhead comes to probably 18 or 19 watts. So all of the interfaces really have enough power to do what they're doing; in that respect they're roughly on par.
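As a rough sanity check on that, here is a minimal sketch of the power arithmetic; the per-lane figure and the overhead are the ballpark estimates quoted in the talk, not spec values:

```python
# Rough power-budget check for an active x16 module (sketch only; the 1 W/lane
# retimer figure and the overhead are estimates from the talk, not spec values).
LANES = 16
RETIMER_W_PER_LANE = 1.0   # quoted "about one watt per lane"
OVERHEAD_W = 2.5           # assumed management/regulator overhead
AVAILABLE_W = 23.0         # enabled by the 12 V pin described above

module_power_w = LANES * RETIMER_W_PER_LANE + OVERHEAD_W
print(f"Estimated module power: {module_power_w:.1f} W")               # ~18-19 W
print(f"Margin against 23 W:    {AVAILABLE_W - module_power_w:.1f} W")
```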
Next I want to talk a little bit about electrical performance. In the standard test topology shown in the upper right, we really just test the connectors and the cable; the test boards and everything leading up to them are de-embedded, so the insertion-loss data shown there is just the cable and connectors. We're showing 30-gauge cables, including one- and two-meter lengths, and the two-meter is just above the limit line. I think there has been some perception out there that the maximum you can do is one meter. That's simply not true. We can definitely get to two meters within the realm of the PCIe specification.
However, that's only one part of an actual PCIe channel topology, and that's really where the budget comes from: each of the individual components adds up to make the whole budget.
But as you start to architect systems, things can be a little different from what the spec would have you believe. In the upper left is a graphic of what I just described: a channel from chip to chip, in this case a root complex to a non-root complex. There's a host board, maybe an add-in card with a connector, the cable and connectors, and the device side, and all of that has to fit within the budget. That's really where the conservative nature of the PCIe spec comes from. In the real world, though, in actual architected, purpose-built systems, that's not the only way to do it; it's more of a worst case. Consider a more typical use case where we need more reach. In that case we might have an add-in card with retimers, as shown in the graphic at the lower left, and the retimers reset our budgets for noise, loss, and jitter. The chart on the right is an insertion-loss plot of five different cables; we're adding a three-meter and a four-meter cable. In this case the data is not de-embedded, so it's closer to that earlier graphic: because the test board is included, it's similar to having a retimer located relatively close to the interconnect. And what we're showing is that even a four-meter cable does not blow through the budget. At 29 dB we still have 3 dB or more of margin to the limit, and that's actually the Gen 6 limit. So we have a lot of budget there, again, if your system is architected for that specific purpose.
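A minimal sketch of that margin arithmetic: the only measured value is the roughly 29 dB quoted for the four-meter cable, and the 32 dB limit is inferred from the "3 dB or more of margin" statement, so treat it as an assumption rather than a spec citation.

```python
# Loss-margin check for the retimed, non-de-embedded measurement described above.
# measured_4m_db comes from the talk; assumed_limit_db is inferred, not from the spec text.
measured_4m_db = 29.0     # 4 m, 30 AWG cable plus connectors and test boards
assumed_limit_db = 32.0   # inferred Gen 6 retimer-to-retimer budget (assumption)

margin_db = assumed_limit_db - measured_4m_db
print(f"Remaining margin: {margin_db:.1f} dB")   # ~3 dB, matching the talk
```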
To that end, we're demonstrating today out in the Experience Center with a 30-gauge, four-meter cable, a Gen 5 cable with Gen 5 connectors, running Gen 6 data with Marvell SerDes. It's running at a bit error rate of 10⁻¹⁰, and as of yesterday it had been running for about 12 hours, so well better than the PCIe Gen 6 requirement of 10⁻⁶. So your mileage will vary: if you architect your system with this in mind, you can get a lot more for your application out of a strictly passive copper cable. As we heard Andy Bechtolsheim mention earlier, there's about a 10x cost penalty for optics over passive copper, and when you go linear you're maybe saving 30% or so. So people are going to use passive copper as long as they can, and we can enable that with CDFP.
Next, Chris is going to talk a little bit about form factors. All right, thanks, Brian. I brought a four-meter cable up on stage. There were questions earlier about the rack: this cable is coiled up here, and it's coiled up in the image Brian showed, but it's a long cable, right? So when we think about rack-scale designs, and even going from rack to rack, a passive cable is going to get us those disaggregated architectures in the rack. Brian talked about retimer cards being at the endpoint, and one of the most important things about an I/O connector in this space is how it fits within the standard device form factors. On this slide we show a few of the PCIe add-in cards, which are typical for retimer cards. Across the bottom you'll see a couple of options: on a full-height card you can fit four x8 connectors or two x16 connectors. We do have some customers that are interested in bifurcating from a x16 to two x8s, and other customers that want the bifurcation to happen on the card itself, so you have straight x8 links. If you move to the right and look at the low-profile add-in cards, that's two x8s or one x16. On the next slide we'll show some graphics comparing these form factors.
Brian showed a comparison table of the Ethernet networking form factors, and one of the lines there was XY area on the board. The important thing to look at is not just the XY area but also the aspect ratio. One of the first things you'll notice, starting with the x16: OSFP-XD compared to CDFP, both x16, but the XD is extremely long. When we compare that to the length of, say, an OCP NIC, if we had a retimer on a NIC or a half-length add-in card, it doesn't leave you much space for your silicon device or any other components. To take that comparison a little further, look at the right-hand side at the low-profile add-in cards. What we did here was set the same maximum routing distance from the pad on the connector to the first ball on the retimer device for both XD and CDFP. What you see is that the depth of the XD connector pushes the retimer far back on the card. When you then need to route your traces to the card edge, to the CEM connector, you have a complex trace route to snake back through, and you pick up about a one-inch trace increase. So for a retimer card or an endpoint device like this, the shallow depth of CDFP is very attractive. We're also balancing height. There are strict height requirements for PCIe add-in cards, OCP NICs, and other device form factors, and there's a marker on the slide for the PCIe add-in-card height opening. OSFP-XD and OSFP are too tall to fit into those devices. If you have a custom endpoint, perhaps that doesn't matter much, but if you're using standard devices, that height is a big concern. There are riding heat sink options for these connectors, but then we have to ask ourselves: we have Ethernet networking in these form factors in the data center, and to make them work for PCIe really means a custom, non-standard solution. There are discussions in the industry about shortening them, cabling the sidebands, and moving to riding heat sinks, so that really becomes a new form factor. And do we want the same form factor doing two different things in our racks and our data centers, with the chance of misplugging? CDFP is now a dedicated interface for PCIe, and it checks all the boxes we talked about earlier for these applications.
And you can come by the TE booth, come check out some CDFP. We have the Composable Memory Systems Experience Center. There's a partnership with Molex, where TE and Molex offer the same CDFP connectors and cables, so there's a variety of samples out there. And then as Brian showed, the demo with Marvell running a 4-meter passive cable, enabling that rack scale disaggregation through PCIe and CXL. So we appreciate everybody's time. That concludes our talk. Maybe time for a few questions.
Yeah, did your demo for 4-meter include the 9 dB loss of a CPU and 4 dB loss of an endpoint?
So the 29 dB is ball-to-ball; it just includes the trace loss, the connectors, and the cable.
So all the simulations that you're showing do not include the extra 13 dB loss of the endpoint and the CPU?
Correct. So it depends on whether you're using a root complex, non-root complex, and what the architecture is, right?
Well, you need the root complex port and an endpoint device added to that simulation to get what the actual distance really should be.
Sure.
In the design we showed, we're putting retimers in the channel, so the budget resets. What we're measuring is retimer to retimer; we're not looking at host root complex to non-root complex in that one graphic. The point is that if you're architecting a system where you need more reach, you can get more reach with retimers in the channel.
Right, but the retimer is sitting on a separate card, so it still has a 4 dB loss on each end.
So if you put a retimer to the root complex, there's a loss there, but that gets...
No, on the other end, where you showed the connector and the retimer on a card vertically, so you have that additional loss on the lower solution.
Yeah, so we're showing, in this case, the retimers on both ends.
Yeah, so that's additional 8 dB loss for the retimer on both ends. Right?
In the retimer?
Well, the actual retimer, the board, the PCB, and the connector.
Yeah, and, well...
The receiver of the retimer on both ends is gonna hit you.
You're going to have loss there, and that's what we're showing in our demo: even with all of that, this works, again with retimers. But to your point, if you didn't have the retimers in the channel, you'd have to accommodate a lot more loss.
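A minimal sketch of the loss accounting in this exchange: the 9 dB and 4 dB figures are the questioner's, the 29 dB is the ball-to-ball measurement quoted earlier, and the end-to-end reference budget is an assumed value for illustration only.

```python
# Sketch of the two accounting views debated above (illustrative, not spec text).
cable_ball_to_ball_db = 29.0   # measured: traces, connectors, and 4 m cable
root_complex_db = 9.0          # questioner's figure for the CPU/root complex side
endpoint_db = 4.0              # questioner's figure for the endpoint side
assumed_e2e_budget_db = 36.0   # assumed un-retimed end-to-end reference budget

# Without retimers, every contribution counts against one end-to-end budget.
unretimed_db = cable_ball_to_ball_db + root_complex_db + endpoint_db
print(f"Un-retimed channel: {unretimed_db:.0f} dB vs ~{assumed_e2e_budget_db:.0f} dB budget")

# With retimers at both ends of the cable, the budget resets at each retimer,
# so the cable segment is judged on its own, as in the 29 dB measurement.
print(f"Retimed cable segment: {cable_ball_to_ball_db:.0f} dB")
```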
Any other questions? Go ahead.
You mentioned that CDFP was designed for 400-gig Ethernet. What was the reason for that? And do you anticipate that 400-gig or 800-gig Ethernet would adopt CDFP in the future?
As I mentioned, it was originally an Ethernet standard, and then the design was improved and changed for PCIe. I don't think that improved design would be usable for Ethernet; you would use the original design if you had to do Ethernet, but that standard isn't really used for Ethernet anymore. So if there were an application for an improved CDFP that wanted to run Ethernet-type signals, we would have to look at that. But as it is today, the original was really designed around that 25-gig-per-lane interface.