Good afternoon, everyone. This is Glenn from TetraMem, and today I will talk about analog in-memory computing for AI. Our brain is an analog computer, and it consumes only about 25 to 50 watts; our work is really inspired by that structure and by analog computing. I hope you enjoy the talk.
I think everyone here is familiar with AI and the AI chip market. AI is expected to drive tremendous GDP growth over the next two decades, and because of that the AI chip market is also growing at a tremendous pace, roughly a 30 percent annual growth rate.
Recently, OpenAI's Sora generated a lot of interest from the public, but on the other side it also shows how demanding AI computing is becoming. Generating just one minute of AI content with Sora takes close to an hour of computation. Now think about short videos today: roughly 800 million people watch TikTok for about two hours every day. If that kind of content were generated entirely by AI, by one analysis we would need something like 80 million additional GPUs just to produce the video. That is a tremendous amount of computing, and a tremendous amount of power.
Many of you have probably already seen this slide from Lisa Su's keynote at ISSCC last year. We are on a trend where, if we do nothing, a single data center will soon need something like 500 megawatts. A nuclear plant generates on the order of a gigawatt, so each data center would essentially need its own dedicated nuclear plant, which is almost impossible. That is why we have to take energy efficiency seriously, given such high demand.
That is why people have started to talk about so-called in-memory computing. The traditional von Neumann digital system separates the processor from the memory, and it has served us well for decades through CPUs, GPUs, and ASICs. But because the processor and memory are separate, we have very limited on-chip memory, which is SRAM. An SRAM cell uses six transistors just to store a zero or a one, which takes a lot of silicon real estate, and that is why on-chip SRAM is so limited. For CPUs we talk about kilobytes to megabytes of cache, and even a high-end GPU has only on the order of a few hundred megabytes. The only chip whose on-chip SRAM reaches the tens-of-gigabytes range is the wafer-scale engine made by Cerebras; the WSE-3 has 44 gigabytes, but that approach is extremely expensive. So we have to fall back on DRAM, and once we use DRAM we pay a lot for data transfer: moving from on-chip SRAM to off-chip DRAM takes us from picojoules to nanojoules in energy and from picoseconds to nanoseconds in latency. Even with HBM, it still costs a tremendous amount of energy and time.

That is why, back in 2019, people already started talking about the in-memory computing architecture. This is a non-von Neumann architecture: the memory cells are used not only for storage but also to carry out the computation themselves, through physical laws. Because we minimize the movement of data between the processor and the memory, and because the crossbar-array architecture avoids generating a lot of intermediate data, the power consumption is very low. At the same time, compared with CPUs and GPUs, which have a limited number of cores, in-memory computing with a crossbar array gives us massive parallelism, and therefore high throughput. Last but not least, the computation is done through physical laws, Ohm's law and Kirchhoff's current law, which gives very low latency: we need only about a tenth of a clock cycle to complete one step of vector-matrix multiplication. So this architecture has tremendous advantages.
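To make the physics concrete, here is a minimal sketch of that analog step, assuming an ideal crossbar with made-up sizes and conductance values. It is an illustration of the principle, not TetraMem's implementation:

```python
# Weights stored as conductances G, inputs applied as row voltages V.
# Each cell contributes I = G * V (Ohm's law) and the column currents sum
# automatically (Kirchhoff's current law), so one analog step performs the
# whole multiply-accumulate. Sizes and conductance window are assumptions.
import numpy as np

def crossbar_vmm(voltages, conductances):
    """Ideal column currents: I_j = sum_i V_i * G_ij (the physics does the MAC)."""
    return voltages @ conductances

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(4, 3))   # conductances in siemens (assumed window)
V = np.array([0.10, 0.25, 0.00, 0.30])     # input voltages in volts
print(crossbar_vmm(V, G))                  # output currents in amperes
```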
The key is that to enable in-memory computing we need a special memory device: one that serves as memory but is also capable of doing computation. Back in 2013, our co-founder Joshua Yang already gave a very good summary of how the requirements for computing differ from those for memory. On one side, some requirements are relaxed, for example the per-cell cost of this special device; on the other side, we really need a large number of well-controlled levels per cell, which is the key, along with retention, endurance, and so on. If we apply these criteria to all the memory devices developed so far, that covers the volatile ones, SRAM and DRAM, and the non-volatile ones, from traditional flash to the emerging devices.
Unfortunately, each of these memories took decades to develop, and each has its own use, but when we apply them to in-memory computing they turn out to be very hard to use. Among the volatile memories, DRAM is a 1T1C cell, which is not good for in-memory computing because reading it destroys the contents, and SRAM is extremely expensive, so its limited capacity forces us to deal with I/O and so on. On the non-volatile side, NAND flash has scaling issues and charge-related issues, in particular retention problems when used for multi-level storage, and MRAM and PCRAM come with their own problems as well.
That is why at TetraMem we start very fundamentally at the device level. First of all, we built the device itself: a multi-level RRAM, which people also call a memristor, tuned for computation.
If you look at the device we published in Nature last year, it has all the attributes desired for computation. For instance, we achieved 11 bits per cell, the highest single-cell memory density ever demonstrated. The device also has very good retention and endurance, at memory grade, along with very good uniformity, linearity, and controllability.
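As a rough illustration of what 11 bits per cell implies, the sketch below counts the 2048 conductance states and the spacing between them; the conductance window and noise figure are assumptions for illustration, not measured device numbers:

```python
# 11 bits per cell means 2**11 = 2048 distinguishable conductance states in one
# device. For those states to be resolvable, write/read noise must stay well
# below the level spacing. All numbers here are assumed, not from the paper.
import numpy as np

BITS = 11
G_MIN, G_MAX = 1e-6, 1e-4                       # assumed conductance window (S)
levels = np.linspace(G_MIN, G_MAX, 2**BITS)     # 2048 target states
spacing = levels[1] - levels[0]

noise_sigma = spacing / 6                       # toy noise figure for comparison
print(f"{2**BITS} levels, spacing {spacing:.2e} S, assumed noise sigma {noise_sigma:.2e} S")
```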
And this device is not only suitable for the technology nodes we have already developed; it also has huge potential for future scaling. Because of the nature of the conducting filament itself, this multi-level RRAM device can scale below 3 nanometers in the future, and to increase memory density further we can also go to 3D stacking. All of these results have been published in Nature-level publications.
And this year, our chip results were also published in Science.
In that publication, we use analog computing to achieve arbitrarily high, software-level precision for many high-performance computations. Our method is unlike the traditional bit-slicing approach, which does not really exploit the advantages of the analog device and brings a number of other disadvantages. Instead, we use what we call a true analog approach: with only a few crossbar devices and a little redundancy, we can carry out very high-precision computation.
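For intuition, here is an illustrative contrast between the two approaches, using assumed bit widths and array sizes rather than TetraMem's exact scheme:

```python
# Bit slicing spreads an 8-bit weight over four 2-bit cells and recombines the
# partial dot-products with digital shift-and-add, whereas a multi-level cell
# can hold the whole weight and needs a single analog VMM. Sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(0, 256, size=(8, 4))            # 8-bit unsigned weights
x = rng.integers(0, 16, size=8)                  # input vector

# Bit slicing: four 2-bit slices per weight, one crossbar (or column group) each.
slices = [(W >> (2 * s)) & 0b11 for s in range(4)]
partials = [x @ s_mat for s_mat in slices]       # four analog VMMs
y_sliced = sum(p << (2 * s) for s, p in enumerate(partials))  # digital shift-and-add

# "True analog": one multi-level cell per weight, one VMM.
y_analog = x @ W

assert np.array_equal(y_sliced, y_analog)
print(y_analog)
```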
We demonstrated this with several examples. We built a Poisson equation solver, and we also used the approach to solve very complicated problems such as MHD, which is very sensitive to the error at every step. As the results show, the solution matches software-level precision. Moreover, the number of iteration cycles can be reduced simply by using larger matrices on the large crossbars our memristor arrays provide: because each cell is a compact 1T1R structure, we can have a great many of them. And the proposed solution is more than an order of magnitude more energy efficient than the digital solution, as the results here show. So we can now solve even HPC problems using analog computing.
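As a generic sketch of how analog VMMs plug into an HPC solver, the example below runs conjugate gradient on a 1-D Poisson system, with every matrix-vector product routed through a stand-in for the crossbar. The matrix, size, and tolerance are assumptions for illustration; this is not the exact algorithm of the Science paper:

```python
# Conjugate-gradient solve of A x = b for a discrete 1-D Laplacian, where each
# matrix-vector product would be one analog VMM on real hardware.
import numpy as np

n = 64
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1-D Poisson matrix
b = np.ones(n)

def crossbar_matvec(M, v):
    # Stand-in for one analog VMM (Ohm's law + Kirchhoff's current law).
    return M @ v

x = np.zeros(n)
r = b - crossbar_matvec(A, x)
p = r.copy()
rs = r @ r
for _ in range(n):
    Ap = crossbar_matvec(A, p)
    alpha = rs / (p @ Ap)
    x += alpha * p
    r -= alpha * Ap
    rs_new = r @ r
    if np.sqrt(rs_new) < 1e-12:
        break
    p = r + (rs_new / rs) * p
    rs = rs_new

print(np.linalg.norm(A @ x - b))   # residual close to machine precision
```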
To summarize this style of in-memory computing: we use the memristor cell, a multi-level RRAM, and its conductance (or resistance) represents the neural-network weights. The computation is then done through Ohm's law and Kirchhoff's current law. The voltages come in as inputs and drive currents through the resistive cells, and once we sum those currents, one step of the vector-matrix multiply-accumulate is finished. Compared with digital, this is much more efficient and takes much less time.
At TetraMem we have developed not only the device and the circuitry but also the software side: we have our own software stack, including the compiler, the SDK, and so on. Starting from one pre-trained model, we do quantization, optimization, and localization, and with this method we can deploy millions of chips without retraining for every single die. The unique advantages are near-zero boot time, because the cells are non-volatile memory; very low power, with orders-of-magnitude improvement compared with digital; and very low latency, because one step of the VMM takes only a very limited number of clock cycles, where a digital design would take on the order of 1,000 clock cycles.
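Here is a hedged sketch of the "quantize once, deploy everywhere" idea: a pretrained weight matrix is quantized to integer codes, and each signed code is mapped onto a pair of conductances (a common differential-pair convention, assumed here rather than TetraMem's exact mapping), so the same targets can be programmed onto every die with no per-chip retraining:

```python
# Quantize pretrained weights to signed codes, then map each code to (G+, G-)
# target conductances that are read differentially on the crossbar.
# Bit width and conductance window are assumptions for illustration.
import numpy as np

BITS = 8                                     # assumed deployment precision
G_MIN, G_MAX = 1e-6, 1e-4                    # assumed conductance window (S)

def quantize(w, bits=BITS):
    """Uniform symmetric quantization of weights to signed integer codes."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale).astype(int), scale

def to_conductance_pair(codes, bits=BITS):
    """Signed code -> (G_plus, G_minus); positive part on one column, negative on the other."""
    step = (G_MAX - G_MIN) / (2 ** (bits - 1) - 1)
    g_pos = G_MIN + np.clip(codes, 0, None) * step
    g_neg = G_MIN + np.clip(-codes, 0, None) * step
    return g_pos, g_neg

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))              # stand-in for pretrained weights
codes, scale = quantize(W)
Gp, Gn = to_conductance_pair(codes)
# On hardware the output is read differentially, I_out proportional to V @ (Gp - Gn);
# multiplying by `scale` (and a fixed volts-to-weights factor) digitally recovers
# the original weight range.
print(codes, scale)
```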
For applications, our strategy is to start with edge inference first. We work with sensor companies to enable sensor-fusion AI computation at the edge. Later, as we move to more advanced technology nodes, we will work with several potential partners to bring this to the data center as well as to HPC. The first step there will be a chiplet: an in-memory-computing-enabled MPU chiplet that complements HBM in the data center.
At this moment we have the edge chip; if you go to our website you can see it, the MX-100. It is based on a 65-nanometer process, but even at 65 nanometers we have demonstrated more than 20 TOPS per watt efficiency. On it we have demonstrated small neural networks. MNIST, for example, is just a toy model we use to debug our SDK and compiler, but we also have practical tiny machine-learning models: an eyeball-tracking model trained by our company, a visual wake-up model for person detection, and, just last week, we brought up keyword recognition.
Currently we are working on 22 nanometer, and if you look at our plan we will move to even more advanced technology nodes: this year we are starting on 12 nanometer, and beyond that we are working with partners toward 5 nanometer, 3 nanometer, and so on. From the 12-nanometer node onward, our target for on-chip storage capacity is about 7 gigabytes, and in the future we envision more than 200 gigabytes of 8-bit storage. The goal is to reach more than 300 TOPS per watt efficiency. Imagine a GPU running at that efficiency: instead of drawing 500 or 700 watts, you would need only a couple of watts. That would greatly reduce the carbon footprint and energy consumption of data centers. At the same time, we continue to work on our software so that we can provide efficient machine learning and a runtime compiler. Our goal is to enable high-performance computing as well as AIGC, at both the edge and in the data center, in the future. That's all for my presentation. Any questions?