So in the interest of time, let me just jump into the outline of this talk. This has been discussed several times in different forums, presentations, and also the booth demos. So for this particular talk at OCP, we want to focus on the technical aspects, especially the system and optical module integration and evaluation. First, I will briefly touch on optical interconnect scaling challenges and opportunities, and also where LPO stands compared to other options. Next, we are going to talk about the evaluation metrics for LPO, which are the metrics we are using as a guiding principle and a framework to approach the technology investigation. Then we are going to talk about the system calibration, as well as the 51.2T switch data and evaluation. And we will summarize the talk with a call to action.
All right. So as we all know from the past one and a half days or so, we are all convinced that optical interconnects are very important and something we should focus on, especially for future AI cluster advancement. We also know this is not without challenges. On the right-hand side, the table summarizes our thoughts from a few months back, at the beginning of the year, comparing different solutions, namely the retimed pluggables, the LPO in the center, and also the CPO. The colors represent our assessment at the time, which we presented to the industry. I won't go into the details here, but if you are interested, I'm happy to take it offline with you. For the purpose of this talk, though, I want to draw your attention to the LPO. Based on the progress made in the industry, we are seeing a lot of convincing results that link performance is making significant progress. But as you can see, in the center column there is a flashing red item, which is link accountability for LPO. And that is the particular barrier for an end customer like us to adopting it at a wider scale. So this will be the focus for this particular talk.
All right, let's jump in. Here is a very simplified view of the physical implementation for different types of optical solutions. On the top is the retimed pluggable. We are all very familiar with this pluggable type; it has been the driving force for our data centers and across many of our hyperscaler peers as well. In this case, by definition there is a well-defined test point that allows interoperability and a multi-vendor open ecosystem. With this implementation, innovation is allowed within the accountability area, which is the gray box in each case. This is very powerful, because as long as the test point requirements are still being met, each side can innovate independently, so that the whole ecosystem can take advantage and accelerate. On the other end of the spectrum, at the bottom, is the CPO. The industry, including Meta, has been investing in this technology. Of course, it has its own challenges in a different way, but in terms of link accountability, the CPO can be made interoperable with the retimed optics. In the center, there are two scenarios for the LPOs. We are almost convinced by the demonstrations and results shown by industry partners that, end to end, with closed control of the entire system, LPO could work. As long as you have this end-to-end control and everything is owned by a single vendor or vendor set, this could work. But in order for a hyperscaler like us to deploy the LPO solution in a meaningful way, we really want to see an interoperable open ecosystem. So again, this is the focus. And I will pass the mic to Qing to talk about more technical details.
Thank you, Janet. Hi, everyone. My name is Qing Wang. I'm an optical engineer at Meta. So the big picture of LPO is that, with the recent development of SerDes capability, especially in the switch ASIC, there is a trend to move from analog SerDes to DSP-based SerDes, and that makes LPO possible. It's a complicated problem, because traditionally the DSP in the optics kind of separates the electrical domain from the optical domain. It's divide and conquer; each section is taken care of separately. With LPO, the host SerDes needs to take care of the whole link. It's a complicated problem, but that doesn't mean there's no solution. To simplify and accelerate the development of the LPO ecosystem, we are proposing this evaluation matrix for LPO. First of all, we limit the problem to 100G per lane for now. That means 2x400G, 8x100G, and 1x800G modes only. We are deprioritizing backward compatibility with legacy optics, like 50G per lane or 25G per lane. Yes, we need to have interoperation; you just saw a problem with interoperation a moment ago. But what does interoperation mean in this case? For LPO, we require interoperability between different LPO vendors and also between LPO and the retimed module, to form a healthy ecosystem. On host SerDes tuning, we want to avoid, to the maximum extent, handshaking between the optics and the switch. That means the switch calibration should be independent of the optics, and the optics calibration should be independent of the switch. There are some nuances in this criterion. For example, there could be per-port, per-switch-MPN, and per-SerDes calibrations. And there could also be some common settings provided by the switch on a per-module, per-switch-MPN, and per-port basis, like the RX equalization we had before. On the end-to-end BER, a pre-FEC BER could be our target, with a preliminary setting of 3E-6. For the fiber optic link budget, this is about FR4-lite: we define our use case to be 500 meters with insertion loss less than 2.5 dB. The test points TP1 to TP4 have been widely discussed. For TP2 and TP3, our thought is to use the Meta 2x400G FR4-lite spec, which is similar to the IEEE FR4 spec with some relaxation. The TP1, TP1a, TP4, and TP4a specs are still TBD; we have some thoughts and a draft, and we're going to work on polishing them, so right now they are TBD. On 200G per lane, we would like to see a feasibility path, because we don't want the investment and effort into 100G per lane to be wasted. Finally, diagnostics: as some speakers said before, it's very important to take serviceability and diagnostics to heart when we operate millions of optics in the data centers, especially for the AI/ML use case, where a lossless network is assumed.
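As a compact way to keep track of the criteria above, here is a minimal sketch of the evaluation matrix as a machine-readable summary in Python; the field names and structure are illustrative assumptions, not taken from any formal spec.

```python
# Illustrative summary of the LPO evaluation criteria described above.
# Field names and structure are hypothetical, not from a formal spec.
LPO_EVAL_CRITERIA = {
    "lane_rate_gbps": 100,                        # 100G per lane only, for now
    "port_modes": ["2x400G", "8x100G", "1x800G"],
    "interoperability": ["LPO <-> LPO (multi-vendor)", "LPO <-> retimed"],
    "host_tuning": "independent of optics (per port / per switch MPN / per SerDes)",
    "pre_fec_ber_target": 3e-6,                   # preliminary end-to-end target
    "fiber_reach_m": 500,                         # FR4-lite use case
    "max_fiber_insertion_loss_db": 2.5,
    "tp2_tp3_spec": "Meta 2x400G FR4-lite (IEEE FR4 with some relaxation)",
    "tp1_tp1a_tp4_tp4a_spec": "TBD (draft in progress)",
    "future": "feasibility path to 200G per lane",
}

print(LPO_EVAL_CRITERIA["pre_fec_ber_target"])
```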
Now, what's the problem? In terms of impairments, at a high level there are three kinds: jitter, ISI, and noise or distortions. Let's look at them one by one. For jitter, the module is not a source of jitter and should be transparent to host jitter. For ISI, previously the DSP in the module would take care of the different spans separately. Now, without a retimer or DSP in the module, the host SerDes needs to take care of the whole link, which is the two C2M links plus the module E/O and O/E; this requires more precise host TxFIR adjustment. On noise and distortions, the host and module noise sources are additive, and there is no regeneration during the E/O and O/E conversions. There is also only limited tuning of PAM4 eye linearity.
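To illustrate why the host TxFIR problem gets harder without a DSP in the module, here is a minimal Python sketch that cascades placeholder impulse responses for the two C2M channels and the module E/O and O/E; all tap values are made up for illustration, not taken from the talk.

```python
import numpy as np

# With no DSP in the module, the host SerDes sees the cascade of both C2M
# channels plus the module E/O and O/E responses. All taps are placeholders.
h_c2m_tx = np.array([0.05, 0.80, 0.12, 0.03])   # host-to-module C2M
h_eo     = np.array([0.10, 0.85, 0.05])          # module E/O (no retiming)
h_oe     = np.array([0.08, 0.88, 0.04])          # module O/E
h_c2m_rx = np.array([0.04, 0.82, 0.10, 0.04])    # module-to-host C2M

# The end-to-end linear channel the host equalizers must cope with:
h_total = np.convolve(np.convolve(h_c2m_tx, h_eo),
                      np.convolve(h_oe, h_c2m_rx))
print("end-to-end ISI taps:", np.round(h_total, 3))
```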
So, like we talked about, ideally the system calibration should be system-focused. That means the host setting should be independent of the LPO module. In this talk, we are proposing to determine the host setting based on the TP1a metrics, and in a moment we'll talk about what that means. This mainly addresses the ISI of the TX channel on the host side only. On the module side, module calibration and control should be adaptive. On the RX side, everyone is familiar with the idea that AGC should be running. But on the TX side, we think AGC is also required, which is very different from the retimed module. And there is a certain margin we need to reserve to compensate for different corner cases, like changes over temperature or over life.
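As a toy illustration of the TX-side AGC idea, here is a minimal Python sketch of a gain control loop holding a target swing as the host input drifts; the loop structure, step size, and numbers are illustrative assumptions, not the module's actual control algorithm.

```python
# Toy sketch: the module driver adjusts its gain so the output swing stays
# at a target even as the host input swing drifts (over temperature or life).
def agc_step(gain, measured_swing_mv, target_swing_mv, step=0.02):
    """One iteration of a simple gain control loop (illustrative only)."""
    if measured_swing_mv < target_swing_mv:
        return gain * (1 + step)
    if measured_swing_mv > target_swing_mv:
        return gain * (1 - step)
    return gain

gain = 1.0
host_swing_mv = 300.0        # assume the host input swing drifted low
target_mv = 400.0
for _ in range(100):
    gain = agc_step(gain, host_swing_mv * gain, target_mv)
print(f"converged gain ~ {gain:.2f}")   # roughly 1.33 for these numbers
```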
Now we're going to showcase a methodology to calibrate the system and the LPO so that they can form a healthy link. In this case, we start by setting the TxFIR to some initial setting that lets us measure the eye and the TECQ on the DCA. We send this signal to the DCA, which is treated as a reference RX. The RxFFE tap values are automatically adapted in the DCA by a repeatable and consistent LMS method. After we get those RxFFE tap values, we do a convolution between the RxFFE taps and the FIR_init to get the final taps. If needed, this TxFIR_final can also be scaled to adjust the TP1a swing; all the tap values, including the main tap, are scaled similarly.
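Here is a minimal Python sketch of the tap-convolution step just described, with placeholder tap values; in practice FIR_init would be programmed into the host SerDes and the RxFFE taps would come from the DCA's LMS adaptation acting as the reference receiver.

```python
import numpy as np

# Placeholder tap values, for illustration only.
fir_init  = np.array([-0.05, 0.85, -0.10])     # initial host TxFIR
rxffe_dca = np.array([-0.08, 1.00, -0.12])     # RxFFE taps adapted by the DCA

# Final TxFIR = convolution of the initial FIR with the reference-RX FFE.
fir_final = np.convolve(fir_init, rxffe_dca)

# Optionally rescale so the TP1a swing matches a target, keeping the tap
# ratios (i.e. the equalization shape) unchanged.
swing_scale = 0.9                               # example scaling factor
fir_final = fir_final * swing_scale
print("TxFIR_final:", np.round(fir_final, 3))
```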
Now, following this methodology, we calibrated TP1a for all 64 x 800G ports of this 51.2T switch, with each port running 8 x 100G per lane. As you can see, the result is very encouraging. The TECQ across all ports is less than 1.1 dB, and the Vpp across all ports is between 230 millivolts and 500 millivolts. In the latest draft spec, if I remember correctly, the maximum limit for the swing is actually 500 millivolts; the lower limit is TBD. On the right, you can see the TxFIR pre-tap and post-tap weights, from which you can indirectly get an idea of the host insertion loss.
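As a small illustration of the kind of per-port acceptance check these numbers imply, here is a Python sketch that screens 64 ports against the limits quoted above; the port data is randomly generated for the example, not measured.

```python
import random

# Limits quoted above; port data below is synthetic, for illustration only.
tecq_limit_db = 1.1
vpp_min_mv, vpp_max_mv = 230.0, 500.0

ports = [{"tecq_db": random.uniform(0.5, 1.05),
          "vpp_mv": random.uniform(240, 495)} for _ in range(64)]

passed = [p for p in ports
          if p["tecq_db"] < tecq_limit_db and vpp_min_mv <= p["vpp_mv"] <= vpp_max_mv]
print(f"{len(passed)}/64 ports within TP1a limits")
```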
Now, on the LPO side, how do we do the LPO calibration? Basically, the point here is that the same peaking should be applied to the modules in all ports, and the same swing gain should be applied to the modules in all ports as well. To support this, the host swing should be consistent across all ports, which we achieve by scaling all the TxFIRs. For example, for a shorter trace we would scale the taps down so that the TX swing is uniform across all the ports. The AGC in the LPO driver then improves the tolerance to the remaining swing variations. RX TIA tuning is relatively straightforward because it's already implemented in the retimed module.
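Here is a toy Python sketch of the per-port swing normalization just described: each port's TxFIR is scaled so every port presents the same swing to the module regardless of trace length. The port names, taps, and swing numbers are placeholders, not values from the talk.

```python
import numpy as np

# Scale each port's TxFIR so all ports present the same swing to the module.
target_vpp_mv = 400.0   # illustrative common target

ports = {
    "short_trace_port": {"fir": np.array([-0.04, 0.95, -0.08]), "measured_vpp_mv": 480.0},
    "long_trace_port":  {"fir": np.array([-0.10, 0.80, -0.15]), "measured_vpp_mv": 400.0},
}

for name, p in ports.items():
    scale = target_vpp_mv / p["measured_vpp_mv"]   # < 1 scales the short trace down
    p["fir"] = p["fir"] * scale
    print(name, "scale =", round(scale, 2), "new taps =", np.round(p["fir"], 3))
```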
Now, this is the TP2 result following the TP1a calibration we just talked about. Two groups of data are presented here, representing two module designs. As you can see, the TP1a swing is consistent across all the ports, which guarantees a uniform ER through all the ports, and the RLM is pretty good as well. The TDECQ can marginally meet the retimed TP2 spec, and the ER strongly depends on the driver and the host swing, which is understandable. With some further optimization, you can imagine the TDECQ meeting the spec; at least it's promising.
So we don't have too much concern about the end-to-end BER; I think this has been shown in various talks. The question is whether we can achieve error-free operation in all the corner cases, not just the average or median cases but also the three-sigma scenarios. Here we show different interoperation scenarios, retimed module with LPO and LPO with LPO, and also the DR and FR use cases. In terms of sensitivity margin, we have about 3 to 5 dB. But we also noticed in our development that the BER floor has some dependence on crosstalk, so this is something we need to take a closer look at.
Now, this shows the correlation between TDECQ and TECQ for the LPO. As you can see, they are well correlated, which is understandable. We didn't show it here, but TECQ and the retimed TP2 are uncorrelated, which is also understandable. On the right side, it shows there is some tradeoff between ER and TDECQ, which gives us some leverage on how to optimize the module design so that it can meet these two specs at the same time.
Finally, the call to action. For LPO, as we just mentioned, because of the lack of a retimer or DSP in the module, the electrical domain and the optical domain are tightly coupled. So the switch host and the module vendors should align on a calibration methodology that optimizes interface accountability, manufacturability, and integration. Here, we proposed an approach based on TP1a for the host calibration, together with adaptive module settings. Thank you.
All right. Thank you very much. Just a clarification question. Some of your charts were showing data for design 1 and design 2. Are those different switch designs or transceiver designs?
Good question. That's different module designs.
Module designs.
Yes.
Okay. And were they based on different technologies or what were...?
Similar technologies, but optimization of the subcomponents.
I see. I see. All right. We only have time for two quick questions, and then we have to move on. So please go ahead.
Thank you. Thank you for the excellent talk. One quick question: will Meta's application of this LPO technology be mainly for a small percentage of the network, certain niche applications like some bookended, lower-loss links, or is Meta looking to deploy this at a much larger scale as a backbone of the network? Thank you.
Thank you. That's a very good question. There is a balance between larger-scale deployment and limited use cases, and it's not surprising that each has certain advantages and disadvantages. What we're thinking right now is to start to penetrate this space through some MVP definition, which means we are willing to relax some of the requirements we just talked about to accelerate and speed up the ecosystem development.
All right. One more question.
Can you tell us something about the TDECQ data? Did you see any correlation between the TDECQ, or the TDECQ-like data, and the overall bit error rate? Was there correlation or anti-correlation, or were there situations where the correlation changed? If you could comment on that, that would be helpful.
Thank you. That's a very good question. I think traditionally there has been a lot of discussion about the correlation between TDECQ and BER, even for the retimed module, and we all know it's probably not the best correlation through all the corner cases. But assuming every vendor and every design is done in a sensible way, we still think of TDECQ as a useful metric. It doesn't mean it's the only metric, but it's still useful as a reference.
Did you, for example, find that usually the best TDECQ corresponded to the best bit error rate, or...
Yeah. Yeah, thanks. Like I said, it's probably not correlated in all cases, but trend-wise they can still be correlated.