Let's get started. We're here to talk about thermal characterization of high power pluggable optical modules. I'll start off with a quick round of introductions and then we'll jump into the focus areas for today's presentation. I'm Hasan Ali, Associate New Product Development Manager at Molex, leading the development of next generation high-speed I/O solutions, primarily 112 gig, 224 gig, and PCIe Gen 6, with a focus on SMT connectors, cables, and bypass solutions. Joining me is Joe Jacques, Senior Technical Leader, Mechanical Engineering, Cisco. Joe is an industry veteran with more than three decades of experience working on the physical design of networking and communications equipment. Recently he's been working on the thermal architecture of fixed configuration routers. Those of you who are involved in the MSA standards will know that Joe is a prominent voice in thermal management discussions of next generation form factors like OSFP-XD, QSFP-DD 224 gig, and OSFP 224 gig.
With that, we'll go into the focus areas around which we structured today's presentation. We'll start with the power trends we are seeing in the industry from a pluggable optics perspective. This data is based on surveys done in different standards bodies. Then we'll look at the current approach used in the industry to thermally characterize these optical modules. We'll link that discussion to the power trends and highlight the limitations of the current approach and how it conflicts with the goal of making future systems and architectures more power efficient. Then Joe will step in and present the new approach we are proposing today. Joe will also cover the implementation of this proposed approach and how it solves the challenges we see with today's approach. We will also show some application examples that demonstrate how the new approach can help us build more power efficient systems. And we'll conclude with a strong call to action on how we foresee extending this discussion beyond industry standards and presentations into an active working group where we can start implementing this in pursuit of, again, power efficiency and system efficiency.
So taking an overall look at the power trends for pluggable optics in the industry, this is a survey that was done in the OSFP-XD MSA. OSFP-XD is 16 lanes running at 100 gig, so a total of 1.6 terabit. Based on that survey, the majority opinion is that at the highest level, 1.6 terabit modules are expected to be around 35 to 40 watts. Now the good thing is that when the bandwidth doubles, the power doesn't double. But that still does not remove the challenges we'll be highlighting with the current way of characterizing these modules. Just to put it in perspective, five years ago when I joined Molex as a thermal engineering intern, the biggest challenge for that summer was how to cool a 12.5 watt QSFP-DD module. Fast forward to 2023, and the form factor hasn't changed. The way we thermally characterize those modules hasn't changed. But we are now challenged to cool 35 to 40 watts. What that highlights is that we need new ways to thermally characterize our modules. Otherwise, we will be limiting the viability of future designs in the next generation systems and the next generation modules.
Now, how exactly do we characterize the modules today? The current approach uses the optical module's case temperature as a proxy for the module's DOM reported value. By that I mean the system takes the case temperature as its input for determining the thermal health of that module. For example, if a module is running at 95 degrees C case temperature, the system takes that as a signal to dial the fans up so the module is cooled better, if the I/O is the limiting factor. Typically the upper limit is 70 degrees C, sometimes pushed to 75 degrees C, and for industrial and extended applications the maximum is generally 85 degrees C. So if it were this straightforward, where is the confusion? There are different approaches to where the monitor point sits on the case, and there is no standardization across the industry. I'm quite confident that if we surveyed the thermal engineers in this room right now, there would be a huge standard deviation in where we put the monitor point. Among the more popular locations are the nose temperature, marked as location A, the mean top back shell temperature, the maximum back shell temperature, and the hotspot at the center of the pedestal. So there's no standardization on where on the case we should probe the module to understand what the case temperature is. And that is only a limitation in how we get the data. It says nothing about the margin and the issues we see even if we do standardize on the monitor point.
Now that we have a background on the current approach, I want to highlight some of its limitations. The challenge is that when we use the module case temperature as the proxy for the DOM temperature, with these high power modules, which as we saw on the earlier slide are expected to go as high as 40 watts, we're leaving a significant amount of margin on the table. As a case study, take optical module A. At the case temperature above the DSP, we only have a margin of 2.4 degrees C. But when we do a deep dive inside the module and look at the internal components, the component with the least margin still has 8.6 degrees C available. That means we have 6.2 degrees C of unused margin. So in reality, if the system follows the traditional approach of using the case temperature, the fans run at a higher speed than they need to. That is why the case temperature approach results in inefficiencies, at a time when efficiency is one of the fundamental goals for future systems. That's the view from a system efficiency and power efficiency perspective.
I also want to highlight the challenge of placing the monitor point at that location without affecting the thermal performance. We know that a large group in the industry puts the thermocouple right in the area where the heat sink makes intimate contact with the module. But as we know, that degrades the metal-to-metal contact between the module and the heat sink. An approach like that doesn't really let us characterize those modules, because during the process of characterization we are hurting the very contact between the heat sink and the module. Those, at a high level, are some of the limitations of the current approach.
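The unused-margin arithmetic from the optical module A case study can be written out as a short Python sketch. The two margin values are the ones quoted in the talk. The function and variable names are illustrative assumptions, not part of any standard or tool.

```python
# Margins quoted in the talk for "optical module A", in degrees C.
CASE_MARGIN_C = 2.4          # margin at the case temperature above the DSP
MIN_INTERNAL_MARGIN_C = 8.6  # smallest margin among the internal components


def unused_margin(case_margin_c: float, min_internal_margin_c: float) -> float:
    """Margin left on the table when fan control follows the case temperature.

    The case-based margin is what the system acts on; the smallest internal
    component margin is what the module could actually tolerate.
    """
    return min_internal_margin_c - case_margin_c


if __name__ == "__main__":
    delta = unused_margin(CASE_MARGIN_C, MIN_INTERNAL_MARGIN_C)
    print(f"Unused margin: {delta:.1f} degrees C")
```

A positive result means the fans are being driven harder than the internal components require.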
And then I'll hand it over to Joe to talk about what the new approach is and how we plan on solving this problem.
Thank you, Hassan. Good morning, everybody. I'm very excited to be here to share some of our thoughts on how we will overcome the obstacles presented by the current standards and methodologies. The very first thing on the agenda is to drop the idea that DOM is a stand-in for the case temperature of the module. Instead, we'd like the DOM output to represent the thermal health of the module with respect to its long-term reliability limits. For example, we know that in the current methodology, 75C for a standard module is the trigger point for a high temperature alarm. It also happens to be the trigger point in the system fan control algorithms for going to higher fan speeds. So this is a critical definition. To maintain backwards compatibility with the system algorithms, we want the DOM to report 75C when you're at a zero margin position for your internal components. So the formula is 75 degrees minus the minimum margin amongst the polled components. Here we have a table with the same data set that Hassan presented. In this case we see that the current approach for DOM reporting would yield an output of 73 degrees. Now if we take the new methodology where we poll the margins, we see that the laser has the least margin, about 9 degrees. And so the DOM output would be 66 degrees. So this is a vast difference in how you would assess how your module is doing in a system.
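The reporting formula described above, 75 degrees minus the minimum margin amongst the polled components, can be sketched in Python as follows. The 75 degree zero-margin point and the laser's roughly 9 degree margin come from the talk. The `dsp` entry, the function names, and the simple threshold check are hypothetical illustrations, not the speakers' implementation or any MSA-defined interface.

```python
from typing import Mapping

# DOM value that corresponds to zero internal margin. The talk sets it at 75 C
# so existing alarm and fan-control trigger points keep working.
ZERO_MARGIN_DOM_C = 75.0


def dom_reported_temp(component_margins_c: Mapping[str, float]) -> float:
    """Return 75 C minus the smallest margin among the polled components."""
    if not component_margins_c:
        raise ValueError("at least one component margin is required")
    return ZERO_MARGIN_DOM_C - min(component_margins_c.values())


def fan_speed_up_needed(dom_c: float) -> bool:
    """Hypothetical system-side check: step fans up at the 75 C trigger."""
    return dom_c >= ZERO_MARGIN_DOM_C


if __name__ == "__main__":
    margins = {
        "laser": 9.0,  # approximate value quoted in the talk
        "dsp": 12.0,   # hypothetical value, for illustration only
    }
    dom = dom_reported_temp(margins)
    print(dom, fan_speed_up_needed(dom))
```

With the laser as the limiting component, the sketch reports 66 degrees, matching the figure quoted in the talk, and the fan trigger is not reached.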
There are actually a lot of end users for this information. Besides system engineers who need to design the architecture, we have component engineers who need to qualify the modules. So we need to address the way the MSAs define this case temperature. What we want to do here is flip the MSA. Instead of imposing a limit of 75C on the case temperature and leaving the supplier in an ambiguous situation where we don't know where to measure it every time, we're going to specify the monitor point and let the supplier select the case temperature that guarantees the minimum component margins. In selecting that location, there are a couple of parameters to consider. One is accessibility for a 36-gauge thermocouple wire. The other is that we don't want to interfere with things like the heat sink interface, the EMI, or the latching. As a result, we can come up with a good location that we think would be common to most modules. That might not be universal, so we would allow some flexibility for the supplier to pick a different location. In addition, because we need to do component qualification, the supplier would also need to report the offset from the DOM reporting value to that monitor point location. Now here's where things get interesting. The current MSAs specify case temperature as a range, which for standard would be 0 to plus 70. What we would like to suggest is that we replace T case with T ambient, with T ambient being a local condition. This is analogous to electrical components, where we have commercial at 0 to 70 and industrial at minus 40 to plus 85.
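As a rough Python sketch of the two ideas above, the snippet below checks a local ambient temperature against the commercial and industrial ranges quoted in the talk, and applies a supplier-reported offset to a DOM value to estimate the monitor point temperature. The sign convention (monitor point equals DOM plus offset), the 4 degree offset, and all names are assumptions made for illustration. The talk does not define them.

```python
# Ambient ranges quoted in the talk, in degrees C.
AMBIENT_GRADES_C = {
    "commercial": (0.0, 70.0),
    "industrial": (-40.0, 85.0),
}


def ambient_in_grade(local_ambient_c: float, grade: str) -> bool:
    """True if the local ambient temperature falls inside the grade's range."""
    low, high = AMBIENT_GRADES_C[grade]
    return low <= local_ambient_c <= high


def monitor_point_temp(dom_c: float, supplier_offset_c: float) -> float:
    """Estimate the monitor-point temperature from the DOM value.

    Assumes the supplier offset is defined as (monitor point - DOM);
    the actual convention would have to come from the specification.
    """
    return dom_c + supplier_offset_c


if __name__ == "__main__":
    print(ambient_in_grade(55.0, "commercial"))
    # 4.0 is a hypothetical supplier offset, not a value from the talk.
    print(monitor_point_temp(66.0, 4.0))
```

A component engineer could compare a thermocouple reading at the monitor point against this estimate during qualification.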
All right, so here's where we did a simulation on three different types of modules. Here we bumped up the power to 40 watts, and the results are that you can see by the prior standard that you would be failing across the board. But the new proposals are a more accurate representation of those internal components, and you can see that we're passing. So this is a great example of how this methodology would work. On the image, you see a little blue dot. That is our proposed location for the new monitor point.
All right, so what are the benefits? We don't want to leave any margin on the table. We don't want to overdesign. We don't want to drive to bigger heat sinks and bigger fans. We want increased system efficiency. We want to make things easier and standardize our process and our methods for qualifying modules. More generally, it enables thermal engineers to treat these optics just like any other electrical component: ASIC, FPGA, CPU, SSD. And lastly, as I said, we're not forced into overdesigning some of these features for the modules.
OK, so call to action. We've actually already kicked this off. We have a white paper that was released within the last month. This is from the working committee for the QSFP-DD 1600. This white paper has some discussion of the new methodology for DOM reporting. We also have to address the Common Management Interface Specification. There's one by OIF. But basically, we're getting the ball rolling. This is going to be driven by necessity as we get above 30 watts. So that's it.
All right, thank you very much. Any questions to the speakers? Do you think air cooling of transceivers will be sufficient for the next decade, or should we start thinking of liquid cooling as well?
So if you look at this white paper, you'll see that we've got a real solid roadmap out to 40 watts and possibly out to 50 watts. So that's, I think, where maybe you transition from air to liquid.
And at the Molex booth, we've got a solution called the Dropdown Heat Sink that allows us to achieve that. It enables the use of a thermal interface material (TIM) in pluggable I/O applications, so it extends the life of air cooling to beyond 35 watts.