OK, well, I want to talk about the differences, or rather the important metrics, for optics in the age of AI. We're having a little problem synchronizing here; let me try this one more time in a different way. Well, the limits of computing are reached when the laptop doesn't synchronize with the display. In any event, the AI world is very distinct from the traditional Ethernet optics that we're familiar with. I brought an example of an OSFP module, which today is eight lanes of 100 gig, a total of 800 gigabits. Going forward, this will be a 1.6T module, and there's another version called OSFP-XD, which can extend this all the way to 3.2 terabits. But compared to the bandwidth requirements in the AI world, this is unfortunately very little. And if there were a way for me to get to my slides, I could make these points much better. OK, let's try this one last time; I may have to just relaunch this application. I don't know why this happened. OK, final attempt here.
So, as was widely discussed yesterday, AI clusters are growing by leaps and bounds, at numbers like 10x per year, and we were talking about how we get to the next 100x. If you assume for a second the same metrics, that the next generation of chips is 10 times faster and clusters are 10 times bigger, then the aggregate bandwidth for that network is 100 times larger than today, keeping the ratio of network fabric to compute throughput constant. And obviously the reliability, power, and cost of the optics that interconnect these things are absolutely crucial. To put this in context, aggregate bandwidth 100 times that of the networks used for current clusters would be 1 exabit per second. I know this is a new number; we used to talk about petabits per second, but we are definitely now in the exabit era. This is the equivalent of 1 million of these little 800-gigabit modules that I just highlighted. And obviously this is just one cluster; there are many such clusters in planning or various stages of development.
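To make the arithmetic explicit, here is a quick back-of-envelope check of those numbers, a sketch that assumes only the 10x/10x scaling and the 800 Gb/s module capacity mentioned above:

# Back-of-envelope scaling check for the numbers in the talk
# (assumed figures, not measured data).
chip_speedup = 10            # next-generation chips, relative to today
cluster_scaleup = 10         # next-generation cluster size, relative to today
bandwidth_growth = chip_speedup * cluster_scaleup   # 100x aggregate bandwidth

aggregate_bw_bits = 1e18     # 1 exabit per second, the 100x-larger network
module_bw_bits = 800e9       # one OSFP module today: 8 lanes x 100 Gb/s

modules_needed = aggregate_bw_bits / module_bw_bits
print(f"{bandwidth_growth}x bandwidth growth")
print(f"{modules_needed:,.0f} 800G modules per 1 Eb/s cluster")  # ~1.25 million, i.e. on the order of a million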
So let me start with reliability. Current optics modules, like this one, on a good day have a failure rate of 200 FIT, that is, failures in time, or failures per billion device-hours, which translates to an MTBF of about 5 million hours. The dominant failure mode, as we all know, is the laser, typically accounting for 90% of all failures. If you have a million of these at 200 FIT, one optic would fail every five hours. Clearly that is not practical, and reliability needs to improve substantially to even have a workable system here. Now, some of these connections may be copper cables, which, while they're much shorter, are much more reliable than optics; they basically don't fail. But nevertheless, one could argue that we need at least two, if not three, orders of magnitude improvement in optics reliability to have an acceptable scenario here.
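A minimal sketch of that failure-rate arithmetic, assuming the 200 FIT figure and the million-module count from above:

# Failure-rate arithmetic from the talk (200 FIT is the assumed figure).
# 1 FIT = 1 failure per 1e9 device-hours.
fit_per_module = 200
mtbf_hours = 1e9 / fit_per_module                 # ~5 million hours per module

modules = 1_000_000
cluster_failures_per_hour = modules * fit_per_module / 1e9
hours_between_failures = 1 / cluster_failures_per_hour    # ~5 hours

print(f"MTBF per module: {mtbf_hours:,.0f} hours")
print(f"One optic fails roughly every {hours_between_failures:.1f} hours across the cluster")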
On the cost side, the forward-projected cost of these 800 gig modules is about $0.50 a gigabit, and you need two optics per link. So that's $1 per gigabit, or roughly $800 per link. At the scale of a million optics, that's $800 million, a lot of money. Copper cables are on the order of $0.05 per gigabit per side, so that's something like $80 a cable. Optics has a long way to go in cost reduction to get competitive with copper. And of all things, even the fiber infrastructure that connects to these optics is at least as expensive as the copper cable, so that needs to be cost-reduced as well.
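A rough sketch of that cost arithmetic, assuming the $0.50-per-gigabit optics price and $0.05-per-gigabit copper price quoted above; the million-link count is an illustrative assumption:

# Cost arithmetic sketch using the per-gigabit figures from the talk.
optic_cost_per_gb = 0.50       # USD per gigabit, forward-projected 800G module
link_gb = 800                  # gigabits per link
ends_per_link = 2              # one module (or cable end) at each side

cost_per_optical_link = optic_cost_per_gb * link_gb * ends_per_link   # ~$800
links = 1_000_000              # assumed cluster scale for illustration
total_optics_cost = cost_per_optical_link * links                     # ~$800M

copper_cost_per_gb = 0.05      # USD per gigabit per side for a copper cable
copper_cable_cost = copper_cost_per_gb * link_gb * ends_per_link      # ~$80

print(f"${cost_per_optical_link:.0f} per optical link, ${total_optics_cost/1e6:.0f}M total")
print(f"${copper_cable_cost:.0f} per copper cable")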
On the power front, the DSP-based optics that we're all familiar with today, the latest 800 gig versions, consume between 10 and 16 picojoules per bit. Linear pluggable optics, which remove the DSP, can cut this roughly in half, to about 5 to 8 picojoules per bit. Saving 10 picojoules per bit at 1 exabit per second is 10 megawatts of power saved, so yes, it does add up. And you also get lower cost, lower temperature, and lower failure rates without the DSP. So we do expect significant adoption of LPO optics for AI clusters starting next year. And again, the pluggable form factor will eventually go from 800 gig to 1600 gig to 3200 gig. However, we need much bigger improvements in both the power and density of these optics.
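The power saving works out as a simple energy-per-bit times throughput product; a minimal sketch using the figures above:

# Power saving = energy saved per bit x aggregate bit rate (figures from the talk).
energy_saved_per_bit = 10e-12     # joules per bit (10 pJ/bit saved by dropping the DSP)
aggregate_rate = 1e18             # bits per second (1 Eb/s cluster fabric)

power_saved_watts = energy_saved_per_bit * aggregate_rate
print(f"Power saved: {power_saved_watts/1e6:.0f} MW")   # 10 MW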
So, beyond linear or co-packaged optics, we fundamentally need to address the reliability, cost, and power challenges ahead of us. And reliability also means testability and serviceability; in other words, if the optics or the laser fails, there's a way to replace it. Solving reliability fundamentally means fewer and more reliable lasers, most likely a comb laser that generates many colors and thereby reduces the number of lasers required. Reducing power means minimizing both the electrical and the optical power required to convert from the electrical signal to optical photons, which really means lower-voltage-swing interfaces and low-voltage-swing modulation technologies. And reducing cost means higher integration and fundamentally fewer fibers, most likely with a narrowband WDM approach, something like 16 to 32 lambdas per fiber.
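As a rough illustration of why more lambdas per fiber reduce the fiber count, here is a small sketch for a 1 Eb/s fabric; the 200 Gb/s-per-lambda line rate is an assumption for illustration, not a figure from the talk:

# Fiber-count sketch for a narrowband WDM approach (per-lambda rate is assumed).
aggregate_rate_gb = 1e9           # 1 Eb/s expressed in Gb/s
rate_per_lambda_gb = 200          # assumed line rate per wavelength (Gb/s)

for lambdas_per_fiber in (1, 16, 32):
    fiber_rate_gb = lambdas_per_fiber * rate_per_lambda_gb
    fibers = aggregate_rate_gb / fiber_rate_gb
    print(f"{lambdas_per_fiber:>2} lambdas/fiber -> {fibers:,.0f} fibers for 1 Eb/s")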
And the issue here, to be totally practical, is that there is a lot of investment, time, and effort being spent on developing new optics technologies by the companies we're going to hear from today. Many of them are startups; some of them are bigger. But the context, the product spec, the exact thing we want the industry to build, has to be defined, and I think this is where OCP can be very helpful. Collapsing this big space of possibilities into something very specific, whether it's a chiplet, the choice of electrical interface and signaling, the optical signaling, the bit error rate targets, the channel spacing, and so on, is ultra important so that the companies making significant investments here can focus their efforts. There are many important trade-offs that need to be evaluated, ranging from the speed per channel, slow and wide or fast and perhaps higher power, to how to get to the high bandwidth numbers. The good news is that, compared to several years ago, the implementation technologies that allow this sort of clean-sheet approach to designing the best optics have matured significantly. So it is, I would say, more predictable now to make these choices than it was a couple of years ago, when all of this was at a much earlier stage. So reliability, scalability, cost, and power are the ultimate challenge for AI optics, and you will hear from a lot of interesting companies in this session here today. Thank you very much.