Good afternoon, everyone. This is Glenn from TetraMem, and today I will talk about analog in-memory computing for AI. Our brain is an analog computer, and it consumes only about 25 to 50 watts; our work is really inspired by that structure and by analog computing. I hope you enjoy the talk.
I think everyone here is familiar with AI and the AI chip market. AI is expected to drive tremendous GDP growth over the next two decades, and because of that the AI chip market is also growing at a tremendous pace, roughly a 30 percent annual growth rate.
Recently, OpenAI's Sora generated a lot of interest from the public, but on the other side it also shows how demanding AI computing is becoming. Generating just one minute of AI content with Sora takes close to an hour of computation. Now think about short videos today: roughly 800 million people watch TikTok for about two hours every day. If that kind of content were generated entirely by AI, by one analysis we would need something like 80 million additional GPUs just to produce the video. That is a tremendous amount of computing, and a tremendous amount of power.
Many of you have probably already seen this slide from Lisa Su's keynote at ISSCC last year. We are on a trend where, if we do nothing, a single data center will soon need something like 500 megawatts. A nuclear plant generates on the order of a gigawatt, so each data center would essentially need its own dedicated nuclear plant, which is almost impossible. That is why we have to take energy efficiency seriously, given such high demand.
That is why people have started to talk about so-called in-memory computing. The traditional von Neumann digital system separates the processor from the memory, and it has served us well for decades through CPUs, GPUs, and ASICs. But because the processor and memory are separate, we have very limited on-chip memory, which is SRAM. An SRAM cell uses six transistors just to store a zero or a one, which takes a lot of silicon real estate, and that is why on-chip SRAM is so limited. For CPUs we talk about kilobytes to megabytes of cache, and even a high-end GPU has only on the order of a few hundred megabytes. The only chip whose on-chip SRAM reaches the tens-of-gigabytes range is the wafer-scale engine made by Cerebras; the WSE-3 has 44 gigabytes, but that approach is extremely expensive. So we have to fall back on DRAM, and once we use DRAM we pay a lot for data transfer: moving from on-chip SRAM to off-chip DRAM takes us from picojoules to nanojoules in energy and from picoseconds to nanoseconds in latency. Even with HBM, it still costs a tremendous amount of energy and time.

That is why, back in 2019, people already started talking about the in-memory computing architecture. This is a non-von Neumann architecture: the memory cells are used not only for storage but also to carry out the computation themselves, through physical laws. Because we minimize the movement of data between the processor and the memory, and because the crossbar-array architecture avoids generating a lot of intermediate data, the power consumption is very low. At the same time, compared with CPUs and GPUs, which have a limited number of cores, in-memory computing with a crossbar array gives us massive parallelism, and therefore high throughput. Last but not least, the computation is done through physical laws, Ohm's law and Kirchhoff's current law, which gives very low latency: we need only about a tenth of a clock cycle to complete one step of vector-matrix multiplication. So this architecture has tremendous advantages.
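To make the physics concrete, here is a minimal sketch of that analog step, assuming an ideal crossbar with made-up sizes and conductance values. It is an illustration of the principle, not TetraMem's implementation:

```python
# Weights stored as conductances G, inputs applied as row voltages V.
# Each cell contributes I = G * V (Ohm's law) and the column currents sum
# automatically (Kirchhoff's current law), so one analog step performs the
# whole multiply-accumulate. Sizes and conductance window are assumptions.
import numpy as np

def crossbar_vmm(voltages, conductances):
    """Ideal column currents: I_j = sum_i V_i * G_ij (the physics does the MAC)."""
    return voltages @ conductances

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(4, 3))   # conductances in siemens (assumed window)
V = np.array([0.10, 0.25, 0.00, 0.30])     # input voltages in volts
print(crossbar_vmm(V, G))                  # output currents in amperes
```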
The key is that to enable in-memory computing we need a special memory device: one that serves as memory but is also capable of doing computation. Back in 2013, our co-founder Joshua Yang already gave a very good summary of how the requirements for computing differ from those for memory. On one side, some requirements are relaxed, for example the per-cell cost of this special device; on the other side, we really need a large number of well-controlled levels per cell, which is the key, along with retention, endurance, and so on. If we apply these criteria to all the memory devices developed so far, that covers the volatile ones, SRAM and DRAM, and the non-volatile ones, from traditional flash to the emerging devices.
Unfortunately, each of these memories took decades to develop, and each has its own use, but when we apply them to in-memory computing they turn out to be very hard to use. Among the volatile memories, DRAM is a 1T1C cell, which is not good for in-memory computing because reading it destroys the contents, and SRAM is extremely expensive, so its limited capacity forces us to deal with I/O and so on. On the non-volatile side, NAND flash has scaling issues and charge-related issues, in particular retention problems when used for multi-level storage, and MRAM and PCRAM come with their own problems as well.
That is why at TetraMem we start very fundamentally at the device level. First of all, we built the device itself: a multi-level RRAM, which people also call a memristor, tuned for computation.
If you look at the device we published in Nature last year, it has all the attributes desired for computation. For instance, we achieved 11 bits per cell, the highest single-cell memory density ever demonstrated. The device also has very good retention and endurance, at memory grade, along with very good uniformity, linearity, and controllability.
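As a rough illustration of what 11 bits per cell implies, the sketch below counts the 2048 conductance states and the spacing between them; the conductance window and noise figure are assumptions for illustration, not measured device numbers:

```python
# 11 bits per cell means 2**11 = 2048 distinguishable conductance states in one
# device. For those states to be resolvable, write/read noise must stay well
# below the level spacing. All numbers here are assumed, not from the paper.
import numpy as np

BITS = 11
G_MIN, G_MAX = 1e-6, 1e-4                       # assumed conductance window (S)
levels = np.linspace(G_MIN, G_MAX, 2**BITS)     # 2048 target states
spacing = levels[1] - levels[0]

noise_sigma = spacing / 6                       # toy noise figure for comparison
print(f"{2**BITS} levels, spacing {spacing:.2e} S, assumed noise sigma {noise_sigma:.2e} S")
```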
And this device is not only suitable for the technology nodes we have already developed; it also has huge potential for future scaling. Because of the nature of the conducting filament itself, this multi-level RRAM device can scale below 3 nanometers in the future, and to increase memory density further we can also go to 3D stacking. All of these results have been published in Nature-level publications.
And this year, our chip results were also published in Science.
In that publication, we use analog computing to achieve arbitrarily high, software-level precision for many high-performance computations. Our method is unlike the traditional bit-slicing approach, which does not really exploit the advantages of the analog device and brings a number of other disadvantages. Instead, we use what we call a true analog approach: with only a few crossbar devices and a little redundancy, we can carry out very high-precision computation.
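For intuition, here is an illustrative contrast between the two approaches, using assumed bit widths and array sizes rather than TetraMem's exact scheme:

```python
# Bit slicing spreads an 8-bit weight over four 2-bit cells and recombines the
# partial dot-products with digital shift-and-add, whereas a multi-level cell
# can hold the whole weight and needs a single analog VMM. Sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(0, 256, size=(8, 4))            # 8-bit unsigned weights
x = rng.integers(0, 16, size=8)                  # input vector

# Bit slicing: four 2-bit slices per weight, one crossbar (or column group) each.
slices = [(W >> (2 * s)) & 0b11 for s in range(4)]
partials = [x @ s_mat for s_mat in slices]       # four analog VMMs
y_sliced = sum(p << (2 * s) for s, p in enumerate(partials))  # digital shift-and-add

# "True analog": one multi-level cell per weight, one VMM.
y_analog = x @ W

assert np.array_equal(y_sliced, y_analog)
print(y_analog)
```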
We demonstrated this with several examples. We built a Poisson equation solver, and we also used the approach to solve very complicated problems such as MHD, which is very sensitive to the error at every step. As the results show, the solution matches software-level precision. Moreover, the number of iteration cycles can be reduced simply by using larger matrices on the large crossbars our memristor arrays provide: because each cell is a compact 1T1R structure, we can have a great many of them. And the proposed solution is more than an order of magnitude more energy efficient than the digital solution, as the results here show. So we can now solve even HPC problems using analog computing.
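As a generic sketch of how analog VMMs plug into an HPC solver, the example below runs conjugate gradient on a 1-D Poisson system, with every matrix-vector product routed through a stand-in for the crossbar. The matrix, size, and tolerance are assumptions for illustration; this is not the exact algorithm of the Science paper:

```python
# Conjugate-gradient solve of A x = b for a discrete 1-D Laplacian, where each
# matrix-vector product would be one analog VMM on real hardware.
import numpy as np

n = 64
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1-D Poisson matrix
b = np.ones(n)

def crossbar_matvec(M, v):
    # Stand-in for one analog VMM (Ohm's law + Kirchhoff's current law).
    return M @ v

x = np.zeros(n)
r = b - crossbar_matvec(A, x)
p = r.copy()
rs = r @ r
for _ in range(n):
    Ap = crossbar_matvec(A, p)
    alpha = rs / (p @ Ap)
    x += alpha * p
    r -= alpha * Ap
    rs_new = r @ r
    if np.sqrt(rs_new) < 1e-12:
        break
    p = r + (rs_new / rs) * p
    rs = rs_new

print(np.linalg.norm(A @ x - b))   # residual close to machine precision
```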
To summarize this style of in-memory computing: we use the memristor cell, a multi-level RRAM, and its conductance (or resistance) represents the neural-network weights. The computation is then done through Ohm's law and Kirchhoff's current law. The voltages come in as inputs and drive currents through the resistive cells, and once we sum those currents, one step of the vector-matrix multiply-accumulate is finished. Compared with digital, this is much more efficient and takes much less time.
At TetraMem we have developed not only the device and the circuitry but also the software side: we have our own software stack, including the compiler, the SDK, and so on. Starting from one pre-trained model, we do quantization, optimization, and localization, and with this method we can deploy millions of chips without retraining for every single die. The unique advantages are near-zero boot time, because the cells are non-volatile memory; very low power, with orders-of-magnitude improvement compared with digital; and very low latency, because one step of the VMM takes only a very limited number of clock cycles, where a digital design would take on the order of 1,000 clock cycles.
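Here is a hedged sketch of the "quantize once, deploy everywhere" idea: a pretrained weight matrix is quantized to integer codes, and each signed code is mapped onto a pair of conductances (a common differential-pair convention, assumed here rather than TetraMem's exact mapping), so the same targets can be programmed onto every die with no per-chip retraining:

```python
# Quantize pretrained weights to signed codes, then map each code to (G+, G-)
# target conductances that are read differentially on the crossbar.
# Bit width and conductance window are assumptions for illustration.
import numpy as np

BITS = 8                                     # assumed deployment precision
G_MIN, G_MAX = 1e-6, 1e-4                    # assumed conductance window (S)

def quantize(w, bits=BITS):
    """Uniform symmetric quantization of weights to signed integer codes."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale).astype(int), scale

def to_conductance_pair(codes, bits=BITS):
    """Signed code -> (G_plus, G_minus); positive part on one column, negative on the other."""
    step = (G_MAX - G_MIN) / (2 ** (bits - 1) - 1)
    g_pos = G_MIN + np.clip(codes, 0, None) * step
    g_neg = G_MIN + np.clip(-codes, 0, None) * step
    return g_pos, g_neg

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))              # stand-in for pretrained weights
codes, scale = quantize(W)
Gp, Gn = to_conductance_pair(codes)
# On hardware the output is read differentially, I_out proportional to V @ (Gp - Gn);
# multiplying by `scale` (and a fixed volts-to-weights factor) digitally recovers
# the original weight range.
print(codes, scale)
```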
For applications, our strategy is to start with edge inference first. We work with sensor companies to enable sensor-fusion AI computation at the edge. Later, as we move to more advanced technology nodes, we will work with several potential partners to bring this to the data center as well as to HPC. The first step there will be a chiplet: an in-memory-computing-enabled MPU chiplet that complements HBM in the data center.
At this moment we have the edge chip; if you go to our website you can see it, the MX-100. It is based on a 65-nanometer process, but even at 65 nanometers we have demonstrated more than 20 TOPS per watt efficiency. On it we have demonstrated small neural networks. MNIST, for example, is just a toy model we use to debug our SDK and compiler, but we also have practical tiny machine-learning models: an eyeball-tracking model trained by our company, a visual wake-up model for person detection, and, just last week, we brought up keyword recognition.
Currently we are working on 22 nanometer, and if you look at our plan we will move to even more advanced technology nodes: this year we are starting on 12 nanometer, and beyond that we are working with partners toward 5 nanometer, 3 nanometer, and so on. From the 12-nanometer node onward, our target for on-chip storage capacity is about 7 gigabytes, and in the future we envision more than 200 gigabytes of 8-bit storage. The goal is to reach more than 300 TOPS per watt efficiency. Imagine a GPU running at that efficiency: instead of drawing 500 or 700 watts, you would need only a couple of watts. That would greatly reduce the carbon footprint and energy consumption of data centers. At the same time, we continue to work on our software so that we can provide efficient machine learning and a runtime compiler. Our goal is to enable high-performance computing as well as AIGC, at both the edge and in the data center, in the future. That's all for my presentation. Any questions?