Hi, everybody. Welcome. Thank you for joining. Since it is kind of bright up here, I really can't see how many people we have. But thank you for coming. My name is Scott Stetzer. I'm here on behalf of the Linux Foundation's open source project for software-defined Flash called Software-Enabled Flash. So how many people here are developers? Who writes software for storage and for Flash? I've got one hand, two hands. Excellent. Two.
Okay. It's going to be a very fast presentation, I think. So we're going to talk a little bit about what the Software-Enabled Flash project is. It's an open source project. We'll cover some of the unique software components and how it's focused on letting developers use those capabilities to deploy Flash in exciting and unique new ways. And since it's an open source project, of course, there is a plug for how to get involved.
Okay. So the next evolution of Flash. So we've been using Flash for a long time, primarily in the form of SSDs. So what's the next evolution for where we go with Flash from here? Well, let's make it software-defined. And software-defined can bring a lot of advantages to deploying Flash in your own applications and your own environments. Some of the capabilities that software-defined brings to you, the developer, the two developers we have in the room, anyway, the two brave developers that raised their hands, I should say: the ability to do fine-grained data placement, which also allows you to do absolute workload isolation all the way down to the Flash die layer. And with the capacity of drives today, that's probably not fine-grained enough on its own, so you can also get down to a further distinction of software QoS domains for even finer-grained workload isolation. It allows you to work on write amplification reduction and get it down to that holy grail of one, and to be in complete control of all of your latency outcomes all the way down to the individual command level, every single command, read or write or otherwise. How can we accomplish that? We can do this through advanced queuing methodologies under software control, define die time I/O prioritization as a policy or on a per-command basis under software control, and allow you to completely rewrite the FTL and the software stack to customize the protocols to your requirements and your needs. And all of this is available through an open-source software API and a software development kit. Source code is included with the project. Of course, it's open source.
So what is the Software-Enabled Flash project?
This is a Linux Foundation project brought to you by the Linux Foundation. It's an open-source governance process completely managed by the Linux Foundation. It's specified and set up as an umbrella project so we can have multiple implementations of this project. And as an open-source project, it allows multiple vendors, competing vendors, to collaborate to bring those solutions and their hardware to you to be able to deploy in your environments. So a completely neutral place for everybody to work together for the benefit of the industry.
Of course, the Linux Foundation has a couple of things that they talk about. The antitrust policy is probably the most important one. This is where we get to work together without running into difficulties. So I'm going to leave that there for a second.
I am not going to read this to you. And then let's talk a little bit about this. So there are two aspects to the Software-Enabled Flash project. Obviously, software needs hardware to run on. So there's a hardware component and then there's a software component.
So let's talk really briefly about the hardware. The hardware exists today. We have a proof-of-concept working sample that you can currently go take a look at running in the KIOXIA booth out on the show floor. Up and running, running the SDK, running workloads, and showing all of the capabilities that we just talked about in terms of data placement and workload isolation and latency management. The hardware, as you can see, is a standard controller: DMA logic, the PCIe interface, the controller logic, optional DRAM. This is important, optional DRAM. You can put as much DRAM on the device as you want, or you can equally not install any DRAM on your drive and completely use host DRAM, or some mix thereof. You'll notice the attached Flash doesn't identify whose Flash it is. It's an open source project. It's vendor agnostic. We can use anybody's Flash. The vendor-configurable parts of this, of course: what Flash technology you're going to use, whose Flash are you going to use, what kind of Flash are you going to use? Is it QLC? Is it TLC? Is it third generation, fourth generation, generation 32? It doesn't matter, because the API is set up to abstract all of those differences and keep that programming interface consistent from generation to generation, from vendor to vendor, from type to type. Of course, everybody's got a different requirement for how they use DRAM, what form factor they need, how the power limits are set. This is all stuff that can be delivered by the hardware vendor or the SSD vendor that will be delivering those products to you.
So some of the features that exist in the hardware: of course, that advanced queuing control. This gives you the ability to control your latencies all the way down to an individual command level. The Flash abstraction and management I just mentioned allows you to move from one variant to the other, from one Flash type to another, from one generation to the other. One of the TCO benefits this is designed to provide is that as each new Flash generation is introduced, the API abstracts it, so you can pick up the newest, next, best Flash generation from each vendor. You don't have to change your code. You just implement the new drive. There's some low-level hardware partitioning and isolation capability. This allows you to specify which Flash die you want to assign to a particular isolation group. So you can use the entire drive and all the Flash as one particular zone of Flash storage, or you can break it down and say, "I want individual Flash dies as individual zones." There's another feature that's built into the hardware for copy offload. So the idea here, of course, is that all Flash drives need to do some garbage collection. This copy offload feature is something built into the drive hardware itself, so that you can send a map of all of the blocks, the superblocks to be collected, and the drive will make the data movement for you. It unloads the host from having to do garbage collection. That is launched by you, the software developer. So your ability to control when and how that garbage collection is going to occur is in your hands. Garbage collection will not interfere with your I/O until you tell it there's enough time to go do garbage collection.
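To make the copy-offload idea concrete, here is a minimal C sketch of what a host-initiated garbage-collection pass could look like. The structure and function names (`sef_superblock_map`, `sef_copy_offload`) are hypothetical placeholders chosen for illustration, not the actual SEF API; the stub simply prints what a real drive would do internally.

```c
/*
 * Hypothetical sketch of host-initiated garbage collection via copy offload.
 * All sef_* names are illustrative placeholders, not the real SEF API.
 */
#include <stdint.h>
#include <stdio.h>

#define SEF_BITMAP_WORDS 8

struct sef_superblock_map {
    uint32_t superblock_id;                  /* victim superblock to collect */
    uint64_t valid_bitmap[SEF_BITMAP_WORDS]; /* 1 bit per block still valid  */
};

/* Stub: the real copy-offload engine moves the flagged blocks inside the
 * drive, with no data crossing the bus and no host DRAM or CPU involved.  */
static int sef_copy_offload(const struct sef_superblock_map *src,
                            uint32_t dst_superblock)
{
    printf("collecting superblock %u into superblock %u\n",
           src->superblock_id, dst_superblock);
    return 0;
}

int main(void)
{
    /* The host decides *when* this runs; until then, I/O is undisturbed. */
    struct sef_superblock_map victim = {
        .superblock_id = 12,
        .valid_bitmap  = { 0xF0F0F0F0F0F0F0F0ULL }, /* half the blocks live */
    };
    return sef_copy_offload(&victim, 30);
}
```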
So for the software side of things, this is where the rubber really meets the road in terms of the project.
So it's an open source project. The API, of course, is up and out in open source today. There are a number of tools that are going to come with the SDK. The SDK will be out in November. First and foremost is the CLI, the command line interface. This is the device orchestration and management tool, and it's completely scriptable. As a matter of fact, a Python interpreter is included inside it, so you can write scripts in Python to deploy this within your environments. This allows you to set up and define the isolation of the Flash die, set up QoS domains, and define the policies for reads and writes and the weighting of those reads and writes. All of this is managed through the CLI. And again, I probably mentioned this once before, but because it's an open source project, that CLI comes as open source in the SDK. The FIO test tool: I think everybody here that works with storage and tests storage knows what FIO is. This is the tool that everybody uses to run reads and writes onto devices and test performance. We've ported FIO to the SEF native interface, the software-defined Flash interface, so you can actually pick up the hardware and the SDK and start testing right away without having to write a single line of code. There's a complete set of reference virtual device drivers. Again, these are set up, they work, they run. We've got FDP. We've got the FDPM and the FDPG versions, with complete source code for the drivers. We've got a ZNS module. We've got a standard block, SSD block mode module, all available as part of the SDK. And of course, the reference Flash translation layer is supplied, so this is the meat of the project that shows you how to implement all the command sets and how to use all the functionality built into the software-defined Flash methodologies.
In terms of the open source project, again, open source is free for everybody to use. We've picked up and selected the BSD 3-Clause license specifically. This is the most permissive license that exists out in the open source community that we've been able to find. Basically, it says pick up the source code, use it as you see fit, modify it as you see fit for your own requirements. Of course, as an open source project, we do like to see people contribute back to the project. So if you come up with a really exciting new way to deploy this Flash and you want to share it with everybody, please share that information back with the project. Everything comes as C language based. There are 32-bit and 64-bit code bases supporting multiple CPU architectures. Think big-endian and little-endian; both are supported. It runs on modern Linux kernels, of course. Event-driven callbacks, asynchronous, thread-safe, lockless operations, completely modular. Everything is set up to demonstrate how you might implement this in your own code, but again, you can pick up the code snippets and use them in your own projects. So full source code is available. All of the tools that I mentioned, full source code is available. The device driver is there, and we're using the new kernel-based io_uring interface out of the Linux project.
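As a rough illustration of the event-driven, asynchronous, callback-based style described here, the following self-contained C sketch shows the general shape of submitting an I/O and receiving a completion callback. The `sef_*` names and the callback signature are assumptions made for this example; a real implementation would complete the I/O from an io_uring completion path rather than inline.

```c
/*
 * Hypothetical sketch of the asynchronous, callback-driven I/O style
 * described for the SDK. The sef_* names are illustrative placeholders,
 * not the actual Software-Enabled Flash API.
 */
#include <stdio.h>
#include <stdint.h>

struct sef_io;                                        /* forward declaration */
typedef void (*sef_complete_fn)(struct sef_io *io, int status);

struct sef_io {
    uint64_t        lba;          /* logical address being written           */
    const void     *buf;          /* data buffer                             */
    size_t          len;          /* length in bytes                         */
    sef_complete_fn on_complete;  /* invoked from the completion path        */
    void           *user_ctx;     /* caller's private context                */
};

/* Stub submit: a real implementation would queue to the device (e.g. via
 * io_uring) and invoke on_complete later, not inline like this.           */
static int sef_submit_write(struct sef_io *io)
{
    io->on_complete(io, 0);       /* pretend the write completed OK */
    return 0;
}

static void write_done(struct sef_io *io, int status)
{
    printf("write of %zu bytes at LBA %llu finished, status %d\n",
           io->len, (unsigned long long)io->lba, status);
}

int main(void)
{
    static const char payload[] = "hello flash";
    struct sef_io io = {
        .lba = 42, .buf = payload, .len = sizeof payload,
        .on_complete = write_done, .user_ctx = NULL,
    };
    return sef_submit_write(&io);
}
```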
So how does this work for you, the developer? Do I get a new show of hands? How many people here are actually developers? One, two. Okay, two. This is primarily for you guys, right?
So the ability to sit down and completely isolate the I/O through software isolation right on top. So it's an application-driven, host-driven, software-driven approach to deploying Flash. Data placement is completely under your control as well. You define where and how you want this data stored, and it goes right where you've told it to go. Write amplification control, WAF control. Anybody from my developers, are you guys working with FDP at all? Yes. Okay. A little bit. So again, the source code samples bring in FDP. Both flavors of FDP are possible, and source code shows how that's implemented. Once you've got it up and running, if you want to tinker with it, if you want to change it, it's completely available to you through the source code modules. Then there's the ability to manage those latency outcomes that I talked about, all the way down to an individual command level. So you can set up a very simple policy that says, "I want read weights and write weights at a certain policy level," a certain percentage that specifies how you want reads and writes to happen on your particular quality of service domain. You can have multiple quality of service domains, and they can have different priorities. Then, once you've got that established and it's running smoothly, if a priority I/O comes in, or a priority workload that needs to take precedence, you can go in on that particular workload, those particular commands, and say, "This has to have the ultimate priority for my writes. They have to happen first," and they will happen first. Okay. And again, all of the housekeeping operations, things that might impact your latencies, are accelerated and they're under your software control. And finally, I know I've said this, but repeating once again, the software-defined protocols that come with the SDK: block mode, standard Flash block mode, flexible data placement, zoned namespace, and we're looking at some other capabilities as we move forward.
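Here is a minimal, hedged C sketch of what host-controlled data placement with a per-command priority override might look like from application code. The request structure and the `sef_write` call are hypothetical placeholders, not the project's actual API; the point is simply that the application names the QoS domain, the placement ID, and the priority for each write.

```c
/*
 * Hypothetical sketch of host-controlled data placement: the application
 * chooses the QoS domain and placement ID (stream) for each write, and can
 * flag an individual command as high priority. Names are illustrative only.
 */
#include <stdint.h>
#include <stddef.h>

enum sef_prio { SEF_PRIO_NORMAL, SEF_PRIO_URGENT };

struct sef_write_req {
    uint16_t qos_domain;     /* which software QoS domain receives the data */
    uint16_t placement_id;   /* stream/placement ID -> superblock isolation */
    enum sef_prio priority;  /* per-command override of the weighted policy */
    const void *buf;
    size_t len;
};

/* Stub standing in for the real write path. */
static int sef_write(const struct sef_write_req *req) { (void)req; return 0; }

int store_log_entry(const void *entry, size_t len)
{
    /* Steer log data to its own placement ID so it never shares a
     * superblock with, say, database pages in another domain.        */
    struct sef_write_req req = {
        .qos_domain   = 1,
        .placement_id = 7,
        .priority     = SEF_PRIO_URGENT, /* must land ahead of policy traffic */
        .buf = entry, .len = len,
    };
    return sef_write(&req);
}
```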
Okay. So, on the hardware side, capabilities, virtual device driver. So, there are two aspects to this. When you bring up the CLI and you say, "I want to set up my device," you set up first the hardware with what we coin a virtual device. So, this is where you say, "I want die separated into these zones." It gives you that complete physical separation of the data and it's user configurable by you when you deploy the drives. Right? And it completely supports multiple types of flash, TLC, QLC, and either TLC or QLC can then be established and say, "Listen, I want a portion of that drive." Even though it's QLC media, four bits per cell, you can say, "Listen, I want to run this in a pseudo SLC mode." Right? So, instead of 16 voltage levels, you get two voltage levels. And that PSLC mode runs significantly faster, about 10X faster than QLC on our baseline testing. So, you could have your cake and eat it too.Now, the nice thing is, when you specify this, we have to have a certain portion of the drive that works in its native mode. For example, if it's QLC, there's a certain portion that needs to be QLC, but you can have anywhere from about one half percent of the drive defined as pseudo SLC up to about 99% of the drive defined as pseudo SLC. Okay? So, a tremendous feature that's built into the hardware. On the software side, this is where we layer on top of that physical virtual device, the physical flash die as you've defined it, and put another layer of software control that we call a quality of service domain. This is where you can do workload isolation. Think of your streams or placement IDs. Right? You can do separation of data by superblock. Once the workload goes in and says, "Listen, I need a certain number of blocks," those superblocks are assigned to that stream, that workload, or those placement IDs, and they're not shared with anybody else. So, you've got no intermixing of data at the superblock level. You can isolate garbage collection within those quality of service domains. So, imagine taking one 32-terabyte drive, split it into four pieces, and you can garbage collect in each of the four zones separately from each other. Imagine not having garbage collection on the drive interfere with the workload in another specified quality of service domain. All of this adds up to reducing that write amplification factor.So, the other things you can do with quality of service domains, you can overprovision separately in the quality of service domains. You can assign security encryption separately in each quality of service domain so each user gets their own encryption methodologies.
Okay, so how does this work? So, a standard 32-die flash drive. You go in and you layer in your virtual devices. This is defining how you want those flash die to be split up. Then once you have that set up, you can go in and create those quality of service domains. You can have one using that whole virtual device, or you can have many quality of service domains depending on how you need to isolate your users' workloads and their data. So, remember this capability. We'll be going through a couple of demos, and this capability will be shown to you as to how we've set up some of the demos. Okay?
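The following self-contained C sketch mirrors that setup flow: carve the 32 dies into physically isolated virtual devices, then create a QoS domain with its own read/write policy on each. The configuration structures and the `create_*` calls are illustrative stand-ins for what would normally be done through the CLI or the SDK, not the real interfaces.

```c
/*
 * Hypothetical sketch of the setup flow: carve a 32-die device into virtual
 * devices (physical isolation), then layer QoS domains on top (software
 * isolation). All names are illustrative placeholders.
 */
#include <stdint.h>
#include <stdio.h>

struct vdev_config {
    uint8_t first_die, num_dies;   /* which flash dies belong to this vdev */
    int     pseudo_slc;            /* 1 = run this region in pSLC mode     */
};

struct qos_config {
    uint8_t vdev_id;               /* virtual device this domain sits on   */
    uint8_t read_weight;           /* relative die-time weight for reads   */
    uint8_t write_weight;          /* relative die-time weight for writes  */
};

/* Stubs standing in for the real orchestration calls. */
static int create_virtual_device(uint8_t id, const struct vdev_config *c)
{
    printf("vdev %u: dies %u..%u pSLC=%d\n", (unsigned)id,
           (unsigned)c->first_die,
           (unsigned)(c->first_die + c->num_dies - 1), c->pseudo_slc);
    return 0;
}
static int create_qos_domain(uint8_t id, const struct qos_config *c)
{
    printf("qosd %u on vdev %u: read/write weights %u/%u\n", (unsigned)id,
           (unsigned)c->vdev_id, (unsigned)c->read_weight,
           (unsigned)c->write_weight);
    return 0;
}

int main(void)
{
    /* Two physically isolated regions of a 32-die drive. */
    struct vdev_config a = { .first_die = 0,  .num_dies = 16, .pseudo_slc = 1 };
    struct vdev_config b = { .first_die = 16, .num_dies = 16, .pseudo_slc = 0 };
    create_virtual_device(0, &a);
    create_virtual_device(1, &b);

    /* One QoS domain per virtual device, each with its own policy. */
    struct qos_config qa = { .vdev_id = 0, .read_weight = 50, .write_weight = 50 };
    struct qos_config qb = { .vdev_id = 1, .read_weight = 90, .write_weight = 10 };
    create_qos_domain(0, &qa);
    create_qos_domain(1, &qb);
    return 0;
}
```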
The advanced queuing capabilities that I was talking about before, this is how you control that latency within the device, right? You can have massively parallel I/O queues going on, right? You can separate the read path from the write path, and they won't crash into each other. This goes along with that ability to set those read and write priorities as a weighted level in terms of a policy. You can have the hardware enforce those I/O prioritizations, those policies, and it's all under application control and scheduled by that application or you, the developer, the programmer. The most powerful capability: so these are three different queuing methodologies built in, a priority queue, a simple round-robin queue, and die time weighted fair queuing. This one's the most powerful, because you can now go in and not only set those read and write percentage policies as to how much emphasis or priority to put on reads versus writes, but you can also go in with that die time weighted fair queuing and specify individual commands to be prioritized above others.
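A small sketch of the die time weighted fair queuing policy under application control, assuming a hypothetical `sef_set_die_time_weights` call (not the real API): the weights start balanced and are changed on the fly, which is exactly what the queuing demo later in the talk shows.

```c
/*
 * Hypothetical sketch of die time weighted fair queuing under application
 * control: set a read/write weight policy, then change it on the fly while
 * I/O continues. The sef_* names are placeholders, not the real API.
 */
#include <stdint.h>
#include <stdio.h>

struct sef_qosd { uint8_t read_weight, write_weight; };   /* toy state */

/* Stub: a real call would reprogram the hardware scheduler for this domain. */
static int sef_set_die_time_weights(struct sef_qosd *d, unsigned rd, unsigned wr)
{
    d->read_weight  = (uint8_t)rd;
    d->write_weight = (uint8_t)wr;
    printf("policy now %u%% read / %u%% write die time\n", rd, wr);
    return 0;
}

int main(void)
{
    struct sef_qosd domain = {0};

    sef_set_die_time_weights(&domain, 50, 50);  /* balanced baseline        */
    sef_set_die_time_weights(&domain, 10, 90);  /* favor writes during load */
    sef_set_die_time_weights(&domain, 90, 10);  /* flip to favor reads      */
    return 0;
}
```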
There's also some background controls. Again, managing garbage collection is one of the biggest hits to latency management within a drive. So garbage collection is a big issue. Garbage collection happens within a QoS domain instead of across the entire drive. And again, remember, these are those software-defined zones. So you can set garbage collection to happen in one zone. And again, you're in control of when you launch that garbage collection. So you can hold it off for as long as you want, but garbage collection happening within one QoS domain won't affect somebody else's QoS domain at all. This is driven by the internal hardware-driven copy offload engine. So you get a bitmap of the blocks that are in those zones, and you can do that garbage collection based on the bitmap. The drive and the controller of the drive will take care of moving that data for you. So think not only garbage collection, but there are some other functions this could be very useful for. So database compaction is a very useful feature using this copy offload methodology. Range queries and range copies are also things this could be used for. And again, it doesn't move any data across the bus, doesn't use any host DRAM, and no CPU cycles.
So the software-defined protocols. So what's the value proposition here? You can set up for all your different applications: buy one type of device, use the Software-Enabled Flash API, deploy the different drivers on those devices, FDP, ZNS, block mode, and when you need to come in and repartition some of those drives, bring in a completely new protocol and a new application and deploy that within the existing infrastructure that you have defined. It simplifies sourcing, optimizes the Flash interface, and simplifies your purchasing plan: you buy one drive, and you can deploy it any way that you need, completely under your control.
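To illustrate the "one drive, several protocols" value proposition, here is a toy C sketch that binds each virtual device to one of the reference drivers named above. The `bind_protocol` call and the enum are hypothetical; in practice the reference FDP, ZNS, and block drivers from the SDK would be attached to the corresponding virtual devices and QoS domains.

```c
/*
 * Hypothetical sketch of the "one drive, several protocols" idea: each
 * virtual device is bound to one of the reference drivers shipped with
 * the SDK. The names and the binding call are illustrative placeholders.
 */
#include <stdio.h>
#include <stdint.h>

enum sef_protocol { SEF_PROTO_BLOCK, SEF_PROTO_FDP, SEF_PROTO_ZNS };

/* Stub: a real deployment would load the matching reference driver. */
static int bind_protocol(uint8_t vdev_id, enum sef_protocol p)
{
    static const char *names[] = { "block", "FDP", "ZNS" };
    printf("virtual device %u -> %s driver\n", (unsigned)vdev_id, names[p]);
    return 0;
}

int main(void)
{
    bind_protocol(0, SEF_PROTO_FDP);    /* e.g. a web app's working set   */
    bind_protocol(1, SEF_PROTO_ZNS);    /* e.g. a log-structured database */
    bind_protocol(2, SEF_PROTO_BLOCK);  /* e.g. general read-mostly data  */
    return 0;
}
```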
The reference drivers that come with the software development kit, again, no host code changes are needed in order to implement them and use them. We've implemented them using QEMU as a virtual machine host. You can run the block driver, the FDP driver, the ZNS driver, hook up virtual machines to these, run different workloads, see how the isolation works, completely out of the box, completely ready to run.
So a little show and tell.
I hope I'm doing okay on time here. I know the other guy ran late. So remember what I said about how the virtual devices and the QoS domains are set up. So the first demo that we're going to talk about is how to isolate drives. So I've set up -- I've taken a drive and I've set up two virtual devices, virtual device A and virtual device B. I've layered a QoS domain on top of that, a single QoS domain, so I'm not splitting up the QoS domains. And I'm going to run an FIO workload, the identical workload, in both of those separate virtual devices. And we'll show you how that isolation actually works.
So what we have is a very simple graph, virtual device A on the graph and virtual device B. And when we kick this off, you'll see the two workloads, read I/O and write I/O. Right? Reads in blue, red for writes. They're both running identical workloads.
If I go turn off the I/O on virtual device A, you don't see any impact to the I/O on virtual device B.
If I go turn it on again, again, no impact on virtual device B. Again, because of that physical isolation of the flash drive in virtual devices.
If I reverse the process, I can do the same thing. I can turn off and then turn back on virtual device B, and you'll see no impact to the I/Os on virtual device A.
Complete physical isolation of your users' workloads or your applications' workloads without any impact to the I/O or the performance. Okay?
Application-controlled queuing. This is another very useful feature that we brought up.
In this case, now we're going to create a virtual device across the entire drive, and we're going to set up a quality of service domain, and we're going to use that die time weighted fair queuing capability that I showed you. Again, running an FIO workload. And what we're going to do is set a standard 50/50 balance of weights, reads and writes pretty equal. Then we're going to start changing the read and write weights on the fly as part of the demo.
So you can see the setup here. Again, reads in blue, writes in red. The die time balance is set at 50/50.
And we kick off the workload. And we're running a pretty equal set of weights and percentages. So the reads and writes are handled and managed equally.
Then I go in and I give more priority to writes. And you can see instantly the write I/O capability goes up while reads are pushed down a little bit. I can increase that even further up to the 90/10 level, I think, at this point.
And then I'm going to take it and completely flip it. And I'll give more priority to read I/O. And you can see the reads are happening at a much higher ratio than the writes are. And I've done this in real time, on the fly, while I/O is occurring. Can I get a hallelujah from my software developers? Can I repeat that, actually? I heard you say that's a pretty good achievement. Yeah. Great. Yeah. Yeah, I'm sorry. The folks in the back warned me if the audience says anything, I need to repeat it so the cameras can catch it.
All right. Let's go on to the last show and tell bit. Software-defined protocols. So this is software-defined flash, which means you, the software developer, are in control of what you're going to do with it. You can define what protocols you want. So this next demo is going to talk not only about how to isolate things and how to run workloads on things, but how you can do this running different protocols. So I'm going to take one drive. I don't imagine anybody's going to do this with a single drive, but it's a great demonstration of the capability of using software-defined flash in a very unique and interesting way.
So I'm going to take one drive. Sorry, I should move forward here. I'm going to take a single SEF unit, a single drive. I'm going to create three virtual devices, three completely isolated, hardware-differentiated zones, and layer on top of that, again, a QoS domain each so I can expose the drive to the host. Then I'm going to run three different workloads using three different protocols. So the first zone is going to run FDP on a virtual device. The second zone is going to run ZNS. And the third zone is going to run a standard block mode protocol. And then within each of those zones, I can run different workloads, for example, like a web app or a database app or a general read I/O path. So you can see FDP, ZNS, and block.
Okay. So what I'm showing you now in the graph is going to be a heat map. So there's 32 die here. You can see there's a red and a blue spot on each die for the reads and writes. And there are going to be bar charts showing the relative level of I/O as we're reading and writing.
Okay. As the workloads kick off -- there we go. As the workloads kick off, there's a little bit of a run-up to get this rolling, but you can see mixed read and write on the first zone, write only in the middle zone, and read only on the right zone. So we can show you how that workload isolation works. Now, you'll notice that it's completely isolated. So there's unused flash die between each of these zones, again, showing you isolation at the die level. Also, what you see is I'm going in and I'm turning off the workloads and then turning them back on again, showing how that impacts or has little or no impact to the actual I/O that's going on. So one drive completely isolated down to the hardware die level, flash die level, different workloads, different protocols, and different mixes of reads, writes. Pretty powerful demonstration, I think. Can I get another hallelujah from my software developers? Thank you, sir.
Okay. Software-defined, software-enabled flash, it is an open-source project. It needs you. So some of the good news is the hardware exists. If you're interested in it, you can go look at the hardware in the KIOXIA booth. It's right out here on the floor. If you're interested in getting a drive to play with, talk to the folks in the KIOXIA booth, because we're giving out loaners. We're not charging for the POC samples. 32 terabytes of QLC flash, loaned hardware. So if you're interested in working with us, come and talk to somebody in the KIOXIA booth. Okay? In terms of the OSS project, software-enabled flash needs you and you, the software developers, to come in and join the project. I'm looking for software developers to come in and show us new and unique and exciting different ways to deploy flash that we haven't dreamed of so far, simply because we've been locked into the legacy paradigm of using flash like hard drives. It's about time we broke that paradigm and moved forward.
So the legal notices, of course: this is part of the Linux Foundation, they actually have Linux, they're the Linux Foundation. I thought we had that slide, but I guess we don't have it. My apologies. Are there any questions?
Okay. So the question was, where did we start from? Did you start from something like Open Channel? The answer is no, we did not start from Open Channel. We started as a project within KIOXIA, which at the time was Toshiba Memory. We're a flash vendor. So we started from the perspective and the thought processes of what it takes to deploy flash, rather than what it takes to deploy a controller. So what you'll find is the software-defined focus of this is what is the best way to deploy flash and make it easy for you to use. And one of the aspects of the project is we've invited multiple ecosystem partners, and some of those are not only other flash vendors but also different controller vendors. And every controller vendor is bringing their own unique capabilities. Different controllers have different capabilities. They're all bringing their unique capabilities into deploying their controllers as a device that can be used or utilized by the API.
A very interesting talk. Thank you. You mentioned that you could use any flash supplier's product. It seems that they usually keep quite a bit of their capabilities secret, though, being able to adjust voltage thresholds, for example, inside heroic recovery, that kind of thing. How does that fit with your open model?
Actually, it fits very well. What you'll find is KIOXIA has the same concerns, right? Each flash vendor's ability and methodologies for programming that flash, the voltage levels, the timings, how to get the best health out of that flash, is different from vendor to vendor. The idea of the API is: how do you abstract those differences from you, the upper-layer developer, so you don't have to worry about them? And it allows those flash vendors to keep their IP within their own structure. So, the firmware and what goes into deploying the flash is not exposed in open source. But the API abstracts those differences so that you don't have to worry about them.
So, you still need to have the relationship with the supplier to be able to use those even if you can't see them?
I don't know that you need a relationship with the supplier. You need the supplier to develop and deliver you the drive that works with the API. But once the API is enabled, your software works through the API. So, you could, in essence, pick up vendor A's product, develop on it, and then plug in vendor B without having to make any code changes or understand what the relative differences are in the flash underneath.
And that's if you're making a--your own SSD that conforms to this or if you're already purchasing an SSD that conforms to this?
You need the SSD hardware that conforms to the API.
And that SSD manufacturer is the one that has coordinated all of the--
That is correct.
Okay. Now, I got it. Thank you.
Yes. Yeah. As a matter of fact, we have a number of folks that are interested and have been talking to us about putting one together. Okay? Any more questions? Okay, everybody. Thank you very much.