Welcome, everybody. Let's get this show on the road. I would like to welcome our panelists today. We have with us Jayaprakash; we're just going to call him JP for the day. JP leads the advanced package and system development efforts at D-Matrix. He's also leading PoC and interop work groups at OCP ODSA. Prior to D-Matrix, he was with Cisco's Compute Server Business Unit. JP has a master's from the Indian Institute of Science in Bangalore and a PhD from the Interuniversity Microelectronics Centre (imec) in Belgium. Please help me in welcoming JP. We also have with us today Majid Foodeei. He's director of standards at Kandou, a global fabless semiconductor company with its headquarters in Switzerland. Please help me welcome Majid to our panel session. And we have Bill with us, who is now Subu for the day. Thank you, Bill. I couldn't possibly list all of the wonderful accolades.
I'll just mention that I'm here for Professor Iyer. In one sense, he is a university professor interested in developing future generations of students, and I am interested in the roadmap, in thinking about how those future generations of students will follow up and build the industry according to the roadmap.
All right, great. And please, a very warm welcome and thanks to Bill for stepping in for us. So, I'm Tom Hackenberg, and I'm a principal analyst from Yole Group, more specifically Yole Intelligence. Our services are for market analysis, but we also offer teardowns, consulting, and many other services for your market intelligence needs. We go from the system level all the way down to advanced packaging and teardowns. So please, if you have any questions in that regard, speak to me afterwards.
I'm here to set the stage for the actual economic value that we're looking at for the chiplet industry, and then I'm going to turn it over to the much smarter people than me. This particular slide gives you an idea of the market size for several different markets that we feel are prime targets for a chiplet economy. First off is the low-hanging fruit, where we see the main players like AMD and Intel already deeply involved in their proprietary solutions and also offering their expertise to the open compute platforms to help develop the chiplet economy. As you can see here, we're looking at around 10 million units of servers in 2018, driving up to 20 million units. That's about a 25% growth of the market, but within that we're looking at about a 41% growth, we think, out through 2028 for chiplet-based solutions. So servers are pretty much completely converting over to a chiplet economy already.

Starting out from the servers, those companies are also key to developing processors for the PC industry, which is a much larger industry, even though the per-chip revenue is not as high; we're looking at an order of magnitude more chips going into PCs. And while this may be a slower start, the growth rate is actually higher: we're looking at about a 57% growth rate for chiplet-based PC compute processor and GPU solutions. Within that, we're looking at AMD, Intel, and even Apple, whose current M1 Ultra we consider a chiplet-based solution.

Even more interesting, probably, is the opportunity value we see in moving chiplets out into the much larger embedded solution space, such as consumer systems. Mainly there we're looking at smartphone technologies, where we have around 2 billion units of opportunity for chiplet-based solutions. This is, again, much slower to start, because they are not yet feeling the crunch that the data centers are for the expansion of memory, the expansion of accelerator services, and many other things. But an SoC has quite a bit of demand for splitting up the processes into features that can be much better optimized in different process nodes. And finally, one of the more interesting and, we see, rapid-growth industries is the automotive industry. While it's much smaller than, say, consumer devices, the growth rate for this industry is phenomenal. Features like advanced driver assistance systems, advanced telematics systems, the connected car, and electrification are all demanding more processors and more processing performance. And as those systems get larger and more complex, the advantages of chiplets will certainly present themselves. With that, I'd like to turn this over now to Majid, so he can give us a little bit more.
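As a quick aside on the arithmetic, here is a minimal sketch, purely illustrative and not Yole's methodology, of how differently a roughly 7% unit CAGR for servers overall and a 41% CAGR for chiplet-based designs compound over the forecast window. The unit counts and rates are the ones quoted above; the function names are my own:

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by a start and end value."""
    return (end / start) ** (1 / years) - 1

def multiple(rate: float, years: int) -> float:
    """Growth multiple after compounding a fixed annual rate."""
    return (1 + rate) ** years

# Server units roughly doubling from ~10M to ~20M over 2018-2028:
print(f"overall server unit CAGR: {cagr(10e6, 20e6, 10):.1%}")  # ~7.2%/yr

# A 41% CAGR for chiplet-based units compounds far faster:
print(f"chiplet multiple over 7 years: {multiple(0.41, 7):.1f}x")  # ~11x
```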
Thank you, Tom, and welcome. So I'm with Kandou, as mentioned, and I'm here to share the experience of developing a real product, one you can actually see demonstrated in the Experience Center, that is chiplet-based and follows on the promise of a chiplet economy.
What we've done at Kandou is decide on this journey of taking our chiplet IP, which dates back to 2016 and was used by our customers in wide-scale deployments, into our own product. We built a PCIe Gen5 CXL retimer based on chiplets, and we have a demonstration of it in the demo center: an end-to-end data center use case from CPU and root complex through retimer and switch to an endpoint NIC from NVIDIA. You can actually see that it's a real product, and it's ready for deployment. But why did we do that? The promise of chiplets. Instead of waiting for the standards, because we had our own IP, we decided to go that path for now and leave room for adopting standards in the future. The idea is really to be able to do fewer tape-outs and produce many products, and you'll see this in action. This allows us a faster time to market and brings the chiplet economy to smaller companies and startups, beyond the large adoption Tom mentioned that's already going on with the bigger players.
This is really the story of bringing chiplets to products. On the left-hand side, what we decided to do is an x4 PCIe/CXL die with our own chiplet-core die-to-die interface. This was actually part of the OIF standardization, so we managed to bring it into the other standards as well. And here now, we are able to bring two terabits of aggregate die-to-die bandwidth into a chip. In the middle, you see an x16, so we can replicate these dies, two of them or four of them, to produce variations of the product. And beyond that, the roadmap allows us in the future to mix and match with other dies.
This is a big journey, and what we want to share with you are the challenges in totally rethinking how you architect a chip that's chiplet-based. There was a lot of talk earlier about testing, but there is also: how do you partition; how do you make sure the topology of the dies is right; how do you make sure the distribution of the control, the finite state machines, and so on allows you to do things the way you architected them; and how, in terms of management, a master-slave arrangement inside, or the security aspects that were mentioned in earlier talks, are ensured? Without addressing these challenges and making sure you don't compromise on critical requirements of the end product, in this case, for example, ultra-low latency, you cannot really deliver on the promises we mentioned: that you can do fewer tape-outs and still make products that can sell in the market and be competitive.
This means rethinking the whole product development cycle: portfolio definition, derivation of the products you foresee in your roadmap, and how you make sure this thinking will actually sustain in the market, because you have to be very agile. Flexibility was mentioned during the Bunch of Wires session; this is something we also think is very important as we develop these products. As I mentioned on the previous slide, there are architecture-level challenges in terms of reference clocks, synchronization, and the skewing of the various clocks across the dies; this had to be dealt with during the architecture phase. Then as you go to development, it is really a co-design of die, product, package, and so on, and you have to take in a lot of these details in a brand-new way of thinking; that's chiplet development. Going forward into validation, and there was a lot of test discussion here too, that's something we also went through: how do you test the die and the product? If you have any issues, how do you trace them back and improve, whether in your test chip or in the final product? What considerations in the specification do you have to take into account? And finally, in production, how do you handle your supply chain, wafer sort, and the other aspects of full production? So in summary: really rethinking the way products are done. It's much easier for the larger companies; we bring it to our scale to really enable and showcase the chiplet economy. And with that, thank you.
Thank you, Majid. Let's welcome JP to the presentation.
Thank you. So, I hope you can hear me now. Good morning. My name is Jayaprakash Balachandran; I go by JP. I'm with D-Matrix. D-Matrix is a four-year-old startup working on high-performance, low-power inference accelerators for generative AI applications. In fact, D-Matrix is one of the first few companies to adopt an open chiplet approach. So why did we adopt open chiplets, and what difference did it make to our products, is what I'm going to talk about. And then I'll also talk about what we are demonstrating in the OCP Experience Center.
So with that, let's get started. The heart of generative AI is really the transformer models. Transformer models have revolutionized generative AI, and they're scaling at exponential rates. The graph here on the left shows how they scale, and you may see several versions of this chart at this conference. The important thing to understand is that the weights, the parameters, are scaling at a rate of about 240x every two years. And at the same time, the compute part, the inference compute, is scaling at a staggering 750x every two years. The compute is really multiplication; all a large language model in generative AI really does is matrix multiplication and addition. And you need a lot of these matrix multipliers on the silicon. But if you look at what we can implement in silicon, as given by Moore's law, that only scales about 2x every two years. That's a very dismal number, 2x every two years, compared to the 750x every two years we want. So silicon scaling is not really helping us. Chiplets are the only way to address this kind of problem. There are other ways to help, like algorithms, but the key aspect of the implementation is that you want a lot of matrix multipliers on the silicon, and the silicon die is limited by the reticle size. How do you break the bottleneck of the reticle size? That's where chiplets come in. And the other challenge is really not just the compute: for generative applications, memory bandwidth and memory capacity are also big bottlenecks. You need high-bandwidth memory, and you need high capacity as well. Most solutions use HBM and DDR5 and other things, and even that bandwidth is not sufficient. You can see in the chart on the right that there's a big gap between what compute can do and where memory performance lags behind. So how do you address these problems? Let's talk about that.
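To make the gap concrete, here is a minimal sketch using only the growth rates quoted above; the six-year horizon and the variable names are my own illustrative choices:

```python
# Growth rates quoted in the talk, each per two-year period.
compute_demand_per_2yr = 750   # inference compute demand: ~750x / 2 years
moores_law_per_2yr = 2         # single-die transistor scaling: ~2x / 2 years

years = 6
periods = years / 2

demand = compute_demand_per_2yr ** periods   # what the models ask for
supply = moores_law_per_2yr ** periods       # what one die can deliver

print(f"over {years} years, demand grows ~{demand:.1e}x "
      f"while a single die gives only {supply:.0f}x")
# -> demand grows ~4.2e8x versus 8x from one die; the remaining factor
#    has to come from architecture, i.e. aggregating many chiplets.
```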
So what we have here, the first-generation product, is an eight-chiplet inference accelerator on a PCIe card, and it uses the Bunch of Wires interface, the BoW interface. The main reason we chose the BoW interface is that it enables large-scale integration of chiplets on organic substrates; there is no interposer reticle-size limitation here. It is also open and interoperable, and it works on low-cost organic substrates. With this, the eight chiplets look like one large monolithic piece of silicon, so we break the reticle-limit bottleneck with this approach. What it brings us is that we can now scale the SRAM capacity. One of the bottlenecks we have is really memory bandwidth, and we can break that bottleneck by aggregating SRAM across multiple chiplets. In fact, we have two gigabytes of SRAM, and more importantly, the memory bandwidth now shoots up to 150 terabytes per second. 150 terabytes per second; I repeat, because if you compare that with the leading GPUs today, they offer somewhere around 5 to 6 terabytes per second. That's about 30x, a big factor of improvement in memory bandwidth. We also implement compute within the memory, what we call digital in-memory compute, in the chiplets, which avoids data movement and brings a lot of performance efficiency and low power. And we also aggregate DRAM capacity connected to each of these chiplets, about 256 GB of DRAM, and if you compare that with leading-edge products available in the market, like GPUs, which offer less than 200 gigabytes, this is actually higher. Also importantly, the chiplet approach brings best-in-class TCO benefits; TCO per token is really best in class. The other advantage of chiplets, in our approach, is that the chiplets scale with the model size. You can see a mushrooming of models being developed today for LLM applications, and the chiplet approach fits well there: you scale the number of chiplets based on the model size.
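As a rough sanity check on those bandwidth numbers, here is a minimal sketch; only the totals (eight chiplets, 150 TB/s, roughly 5 TB/s for a leading GPU) come from the talk, and the even per-chiplet split is my own simplifying assumption:

```python
n_chiplets = 8
total_sram_bw_tbps = 150.0   # aggregate SRAM bandwidth quoted, TB/s
gpu_hbm_bw_tbps = 5.0        # leading GPU HBM bandwidth, TB/s (approx.)

# Assuming bandwidth is spread evenly across the chiplets:
per_chiplet = total_sram_bw_tbps / n_chiplets
print(f"~{per_chiplet:.1f} TB/s of SRAM bandwidth per chiplet")     # ~18.8
print(f"~{total_sram_bw_tbps / gpu_hbm_bw_tbps:.0f}x a GPU's HBM")  # ~30x
```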
So that's the product we built. What we are showcasing at the Experience Center today is the BoW (Bunch of Wires) D2D interface running at 16 gigabits per second. What you can see here is a Jayhawk chip; the Jayhawk chip implements this BoW interface, and it also implements the digital in-memory compute, if you're interested in seeing that. This part of the demo shows the eye diagrams of the D2D interfaces: on the left side you have an aggregated eye of all the channels put together, and on the right side you see the eye of each channel plotted separately. The eye diagrams show a very open eye with a healthy margin, so data communication between the chiplets is practically error-free. The interface is also energy efficient: we could achieve something on the order of less than 0.5 picojoules per bit. And it has excellent beachfront bandwidth. This helps aggregate the performance, the compute, the multipliers, and the memory across the chiplets so they look like one monolithic piece of silicon, and it allows us to achieve best-in-class performance. Thank you.
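For a feel of what 0.5 pJ/bit means in practice, here is a minimal sketch; the energy-per-bit figure and the 16 Gb/s lane rate come from the talk, while the 64-lane link width is an assumed example, not a D-Matrix specification:

```python
energy_pj_per_bit = 0.5   # D2D link energy quoted in the talk, pJ/bit
lane_rate_gbps = 16       # per-lane data rate quoted in the talk, Gb/s
lanes = 64                # assumed lane count, for illustration only

# power (W) = energy per bit (J) * bits per second
per_lane_mw = energy_pj_per_bit * 1e-12 * lane_rate_gbps * 1e9 * 1e3
total_w = per_lane_mw * lanes / 1e3

print(f"{per_lane_mw:.0f} mW per 16 Gb/s lane")      # 8 mW
print(f"{total_w:.2f} W for a {lanes}-lane link")    # ~0.51 W
```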
Thank you, JP. So at this point, we're going to--
I'll just make a statement. I understand why Bobby wants me to speak on behalf of Professor Iyer. Professor Iyer, in his new position, is going to make packaging happen. When we talk about chiplets, for chiplets to happen, we need to turn chiplets into hardware. Hardware requires packaging. Packaging technology is going to be an important part of this chain of innovations that turns the AI we talked about, and the other developments we talked about, into hardware and software implementations. So advanced packaging is very, very important for making everything we discussed a reality. And we also have to remember that not everything needs to be high-speed and high-performance. Think about the rest of the packaging industry: wire bond. 87% of the packages that come out are done with wire bond. They are cheap. They are versatile. We have to think about materials, and about the whole equipment community that supports everything we do. So let us think about how we can make packaging useful, applicable, and productive for this chiplet economy.
Hang on to the microphone, because I'm going to take advantage of you having it here for a little bit. I'd like to welcome anybody in the audience to come up to the microphone and ask questions of our panelists. While you're doing that, I'm going to throw one additional question out for Bill. Bill, can you tell us whether you think our current education systems, globally or even regionally, are in a position right now to keep up with the packaging demands, and with the data center compute and memory demands? Are we seeing growth in our educational system to enable the next generation of engineers to facilitate chiplet technologies? Do you have an opinion on that for us?
I think this is a tremendously important question. If we went to different schools and asked, do you have a packaging course? Of course they would say no, right? But they do, because the packaging community draws disciplines from many, many different areas. Think about chemists: chemists go into the industry, and people from all the materials fields go into the industry. And also physicists. If we think about the original eight people who formed Fairchild Semiconductor, who were they? I think only two of them were engineers. There's the physical chemist, Gordon Moore, and a physicist, Robert Noyce. So what we want to be able to do is bring out the multidisciplinary areas of packaging and bring these people, and this knowledge, together into our packaging industry.
Thank you very much. So we have our first question from the audience. If you could speak into the microphone. And if you want to direct it specifically, feel free, but let's try to open it up to the whole panel.
Yeah, sure. My question is for JP. I think this is an excellent one. The question I have is, chiplet-strategy-wise, instead of inferencing, if I would like to do training on a GPU-based system, then memory-hierarchy-wise you have a lot of options now: HBM, HBM plus DDR, CXL. What's the best hierarchy to get the best throughput at the end of the day, deploying the chiplet economy?
Yeah, great question. The question is about training: what's the right memory hierarchy, right? We primarily focus on inference, but for training, I'll give an idea. The important thing in inference workloads is latency, how quickly you want to get your answers, whereas training doesn't have that constraint; you can train for two months. It's more batch-oriented, as we call it, and latency is not really that critical. So HBMs, which are widely used today, will continue to be used for training; that's what we think. And since training is not latency-critical, you can assume CXL, or HBMs connected to CXL, for a larger memory pool. Because it's not latency-critical, you can tolerate these kinds of hierarchies. Whereas inference is very latency-sensitive, so you want very high-bandwidth memory. That's the trade-off between the two.
Thank you. Sure. So we are running out of time. I have one more audience member; if you have a quick question, let's go ahead and take that, and then we're done taking questions.
Yeah, this one is also for JP. Your views, please: the high bandwidth and throughput that you mentioned for inference as well as training, could that potentially be served with the CXL-based disaggregated memories in clouds that are being talked about?
Yeah, absolutely. We need access to high-speed memories for all these LLM workloads, and I see a potential for disaggregated memories as well, but more so for training, because latency is not critical when you're going off-chip. You want a very tight integration of memory for the inference workloads. But I see potential for both of those options, disaggregated as well as on-chip memories.
Thank you very much to our distinguished panelists, and thank you to the audience for coming and listening to us today. We hope you have a very enjoyable experience at the OCP Summit. Make sure you get a chance to go through the Experience Center and talk with all of the extremely brilliant people there, and see actual chiplet technologies in their infancy, so you can all tell your kids and grandkids that you were part of that explosion.
And come to the packaging booth.