A lot of the work has been done over the last few years by some people in the front row and some people in other rows. So I'd like to thank all of those folks for their contributions.
So today, we're going to go over a quick background of what the NVMP, the Non-Volatile Memory Programming Technical Work Group, is. We'll talk about some of the interface concepts, a little bit of the requirements and extensions for RDMA, or what we'll someday call remote persistent memory. And we'll cover the current and future technical work group focus, where we're actively seeking volunteers who would like to participate and who have knowledge of how RDMA works and how remote operations, flushing, fencing, all of those things happen.
So our goal really is to accelerate the availability of software that uses persistent memory. Most applications out there today understand that there's memory. They don't understand how fast it is; they just know that it's there, that they can load data from storage into memory, pull it into a CPU cache, and execute on it. The whole goal of the technical work group is to present guidelines on how applications should think about these things and give you a little bit of architecture for what it means to have tiered memory. The hardware includes SSDs and persistent memory; the software spans applications and operating systems. Back in the beginning, there was a lot of work around the operating system piece: Microsoft, Linux in all its flavors, some other operating systems, and how they would interact with persistent memory, how direct access worked, things like that. And the programming model really describes the application's behavior and how you want to think about changing your application to use persistent memory. Some really good examples that I've seen here at the conference today start with caching. A lot of people use persistent memory as a caching agent: they'll pull things straight from hard drives or a storage array somewhere over a network, drop them into persistent memory, and then bring them in from there. The model also allows the APIs to align with the operating system, so when you're developing your application, you have to think about what operating system you want to put it on and what the capabilities of that operating system are. And then, as we progressed through time, it went from things that happen in the local box to, hey, there's some cool stuff that we can do remotely with persistent memory as well. If anybody's heard the term triple replication, or heard about all the different things you can do with remote direct memory access, there are some persistency things that you want to have happen there as well. Tom will go through a lot of that today.
So the programming model version 1.2 was published back in June of 2017. If you think about all the things it tries to cover, there are discussions around atomicity, thin provisioning, and management. How do you use memory-mapped files for persistent memory? What does your application want to do? Do you want just basic gets and puts? Do you want it to look like a file system? Do you want it to look like an SSD? I've seen people develop storage applications that operate using persistent memory as well, so you can use it for I/Os, just as plain storage space. And it's a programming model, not an API. So it's really the guidelines, like those lines in the road, just guidelines. Nobody caught that. My best friend works for the CHP, so I like to tell that joke.
There are a couple of different programming model modes. On the left-hand side, you'll see the block modes that have evolved over the last couple of decades. You get into files and blocks; your media type is a disk drive. You've got a storage driver or a storage application that will take data from an application's memory, package it up into a 4K block, and ship it off to either a hard drive, back in the old days, or today a flash drive, an Optane SSD, et cetera. But you have an application that manages a lot of those things, right? A storage application or a storage driver. The difference is, when you're teaching your application to use memory, you're really just teaching it that, hey, there are two tiers of memory there. So when you boot up, there's an ACPI table, the HMAT table from the ACPI spec, and it'll show that there are two tiers, and that one tier is persistent. So you can do things in DRAM, which is not persistent; everybody remembers what happens when you turn off the box. And then you can teach your application that there's persistent memory there as well, using nvm.pm.file or nvm.pm.volume. The nice thing is, if you have to reboot a server that might have a lot of log files or a big database, you don't have to flush all those things out to your storage array or your local storage. All of that is just there. So when your application comes back up, poof, all the data that was in memory is there. It gets memory-mapped back into the application, and the application can immediately resume what it was doing. So there are a lot of nice time-saving things there.
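As a rough illustration of that nvm.pm.file idea, here is a minimal sketch using plain POSIX calls. The path under /mnt/pmem and the file name are assumptions for the example; on a persistent-memory-aware (DAX) file system the sync step corresponds to the model's optimized flush, with msync() as the portable fallback.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN 4096

int main(void)
{
    int fd = open("/mnt/pmem/example.dat", O_CREAT | O_RDWR, 0644);
    if (fd < 0 || ftruncate(fd, LEN) != 0)
        return 1;

    /* Map the file directly into the address space: no read()/write() I/O. */
    char *pmem = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pmem == MAP_FAILED)
        return 1;

    /* Stores are just CPU instructions into the mapping... */
    strcpy(pmem, "state that survives a reboot");

    /* ...and sync makes them persistent. */
    msync(pmem, LEN, MS_SYNC);

    munmap(pmem, LEN);
    close(fd);
    return 0;
}
```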
There's a Remote Access for High Availability white paper that we've been working on for quite some time. I don't want to say it goes into high availability ad nauseam; I think it's more of an overall coverage and some general guidelines for the different things you want to think about. There's a lot of talk about, has anybody done plumbing in a house? Flushing and fencing and a lot of construction terms. There are some really interesting things in there about atomicity, the difference between the types of flushes you can do, and why you want them. There's a big difference in how you push something out of a cache: do you do it asynchronously or synchronously? And there are a lot of recommendations and guidelines in that paper about how you want to think about developing your application as you proceed with telling it there's actual persistent memory to use.
Pretty simple block diagram. So you have persistent memory, and in the kernel you now have nvm.pm.file if you've updated the operating system kernel with the right libraries. These are all open-source libraries that are available from numerous places; I know the Intel one is pmdk.io, where you can download all of these libraries to update your applications with. And it creates a persistent-memory-aware file system. So now your application can look down and see, OK, here's a persistent-memory-aware file system, here are the things I can do with it. This has all probably been in existence for the last decade, and it has really come to light in the last five years. Now I think the new thing we want to talk about is what happens when I want to do this remotely. So there are two RDMA NICs here, and the magic that happens between those two things is where the technical work group is starting to move. There are a lot of different things you want to think about. What is the difference between persistence, visibility, durability, and all of those things? How do you let an application know that all of those things are available? How does the application make sure that things are visible? How do you make sure that things are durable once they become visible? And do you want to use all of these techniques within your application for redundancy?
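For reference, this is roughly what the open-source libraries do for you; a minimal sketch using PMDK's libpmem (from pmdk.io), with an illustrative file path and error handling trimmed. Link with -lpmem.

```c
#include <libpmem.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Create and map a 4 KiB file; libpmem reports whether the mapping is
     * real persistent memory (DAX) or just a page-cache-backed file. */
    char *addr = pmem_map_file("/mnt/pmem/log.dat", 4096,
                               PMEM_FILE_CREATE, 0644,
                               &mapped_len, &is_pmem);
    if (addr == NULL)
        return 1;

    strcpy(addr, "hello, persistent memory");

    if (is_pmem)
        pmem_persist(addr, mapped_len);   /* user-space cache flush + drain */
    else
        pmem_msync(addr, mapped_len);     /* fall back to msync() */

    pmem_unmap(addr, mapped_len);
    return 0;
}
```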
So with that, I would like to bring up Mr. Tom Talpey. And he's going to walk you through the interface and some of the things that we're working on with RDMA.
So hi, Tom Talpey, co-chair. The background of the NVM programming interface is kind of important before we talk about remote behavior, and I just want to catch up a tiny bit here before diving in. The basic premise of the NVM programming interface that SNIA has been working on is map and sync. You have a non-volatile memory device of some kind, a byte-addressable device, which is mapped into the address space of a process. That allows you to do loads and stores on it as if it were memory. You're not doing I/O operations; you're literally doing loads and stores, synchronous CPU instructions that do not block. And then a second operation called sync allows you to guarantee that your stores are made persistent. These are implemented in CPUs with simple instructions: there's a flush, a clflush- or clflushopt-style instruction on Intel, and then there's an sfence, which waits for those flushes to drain out. So that is map and sync in the local case. Well, in the remote case, it's a little more complicated. You have remote addresses which are accessible to you, but only via an RDMA adapter; they're not mapped any longer. And so there are some subtle differences in the programming interface that come in. Semantically, it operates the same: you're still dirtying some local memory. But the sync operation, the flush and sync operation, becomes much more asynchronous, if you will. It's pushed over the network, and the implications of that are really kind of interesting. So flushing still applies. Mapping kind of applies: think of mapping as a first-level location for the data that you're going to flush, not the final location. So we've been led to a few interesting areas that perhaps we didn't expect to be quite so important. An asynchronous flush, in particular, is the reason for this.
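To make those two CPU-level halves concrete, here's a small sketch using x86 intrinsics; it assumes a CPU with CLFLUSHOPT (compile with -mclflushopt), and the helper names are just for illustration. Other platforms would use a different instruction sequence.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

/* The "flush" half: push every cache line covering [addr, addr+len) toward
 * memory. These flushes are queued, not yet guaranteed complete. */
static void flush_range(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    for (; p < (uintptr_t)addr + len; p += CACHELINE)
        _mm_clflushopt((const void *)p);
}

/* The "drain" half: wait for the flushes issued above to complete. */
static void drain(void)
{
    _mm_sfence();
}

/* A store becomes persistent only after flush_range() of its cache lines
 * followed by drain(). */
```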
So there are a couple of key remotable NVM programming interfaces. Most of the interface is pretty much the same: the basic setup, the management, and the manipulation of address spaces and devices are not too different in the remote case. But these two interfaces, flush, and flush and verify, change radically under the covers for remote. And under discussion, as I mentioned, we have new operations called async flush and async drain that I'll get into a little bit. A second thing that comes into play with respect to remote flush, and particularly async flush, is ordering. If I do an async flush, what guarantees do I have that things that happen prior and things that happen post all occur? The ordering becomes really interesting, actually; it's the asynchronous behavior, I guess. Other NVM programming model methods are typically remotable via upper layers: we can open devices and map devices and connect to devices. All of those things are outside the scope of the NVM programming model.
A very simple remote access ladder diagram shows the call on the left from the application, called optimized flush, which in the remote case triggers a number of RDMA writes. Those are adapter movements of data from local memory on the initiator to remote memory on the responder, the target system. They are bundled up as one or more RDMA operations between the two NICs. The very center of the diagram shows the logical location of the wire in the remote scenario. And then over on the right, actual PCI writes are issued by the local NIC on the remote side. So we start with reads of local memory, pushed through the NIC to another NIC, which then performs writes of remote memory. Following that is a flush, and that flush is a new operation that I'll talk about in a sec. That flush then lassos, you see that little oval line, all those writes and makes sure they've all completed prior to actually pushing the data to the persistent memory domain on the remote node, and then responds so that the initiator knows the data has been made safe. Very much like the flush and drain operations locally, but implemented radically differently remotely.
There's a problem, though, in that optimized flush is, if you will, a synchronous operation. When we called optimized flush, it went all the way down to the bottom before it returned. That's not too bad locally, but it's a long time remotely, because you've waited for the wire, you've waited for a response, and you've waited for a lot of processing. Plus, it's very inefficient, because you weren't using the wire at all until you got to the optimized flush. So the NVMP TWG realized that an asynchronous flavor of flush was critically important. We always viewed it as kind of a local quirk: you could do it synchronously, you could do it asynchronously; we didn't get into that level of detail in the non-volatile memory programming TWG. We felt that was a local implementation decision. You can't avoid it in the remote case. In order to use the network properly, in order to keep the data flowing well and efficiently, you need an asynchronous flavor. And so, unpublished, we're still working on this, we've worked on a new paradigm we call asynchronous flush. It separates those two phases, flush and drain. Here, you'll see that flush and drain both occurred as a result of a single call called flush; maybe that was a little bit of an oversight in our early planning. But we've separated it: we now have a flush and a drain, in just a little more detail in a sec. The idea is that it allows early scheduling of writes without blocking. So it makes more efficient use of the network, it allows the NIC to execute in parallel with the CPU and the application, and you can use it in both libraries and applications; it becomes the key to building an efficient library. It also allows for more efficient use in the actual flushing operation. With less data remaining to flush, both locally, because it's already been shipped over the network, and remotely, because it's already been written even though it hasn't been finally flushed, there's less wait latency. So by overlapping this processing, you reduce the overall latency of the operation. That's critically important here, because the whole reason to use remote persistent memory is latency. Who cares if it's stored on one device or another? It's how quickly it can be stored and how quickly that guarantee can be returned. Async flush introduces a whole lot of interesting error cases. What if it breaks? How do you know what went and what didn't? How do you know what was damaged in flight? It's very difficult, and so the error cases require a lot of thought. There's a big area of uncertainty in the error cases. I'm not saying that we can wave our magic wand in the interface specification and make these problems go away. But we can light them up; we can put a bright spotlight on them so that the application developer knows what to expect. And that's what the Remote Access for HA document was all about.
So here's just a quick picture of the asynchronous flush. This is not yet published; you won't find it in the current version, but you will find it if you come to our TWG meetings, which we hold roughly every other week, and at face-to-face opportunities and the like. But here you are. You'll notice at the top some stores followed by an async flush call instead of an optimized flush call. And you'll notice that the async flush call begins a little bit of network operation, but afterwards the application continues to perform some stores. It's just dirtying local memory; it's not pushing it across the network. Then it does a flush. That flush continues to push through dirty data. And the application proceeds to call drain, saying, not only did I flush, I want you to tell me when I'm done. So you see multiple application calls on the left now, instead of just a single call that did everything. And the choice of each one of those stores, flushes, and drains is the application's to make. So the application is now enlightened with respect to the persistent memory behavior. But if you look over on the right, it's the exact same traffic on the network, and it's the exact same traffic to the persistent memory device. The same thing has happened; we've just enriched the local interface to this mechanism. So we're trying to dig in on what that means.
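Here's an illustrative-only sketch of that split. Since the async flush and drain interfaces are not yet published, async_flush(), drain(), write_record(), and RECORD_SIZE are hypothetical names standing in for whatever the final interface defines.

```c
#include <stddef.h>

#define RECORD_SIZE 256                          /* hypothetical record size     */

void async_flush(const void *addr, size_t len);  /* schedule flush, do not block */
void drain(void);                                /* wait for scheduled flushes   */
void write_record(char *dst, int id);            /* app-specific store sequence  */

void log_two_records(char *pmem_log)
{
    /* Dirty the first record and schedule it early, so the NIC (or CPU)
     * can start pushing data while the application keeps working. */
    write_record(pmem_log, 1);
    async_flush(pmem_log, RECORD_SIZE);

    /* Keep dirtying memory; no blocking, no pipeline bubble. */
    write_record(pmem_log + RECORD_SIZE, 2);
    async_flush(pmem_log + RECORD_SIZE, RECORD_SIZE);

    /* Only now wait: drain() returns when everything scheduled above has
     * reached the persistence domain (locally, or remotely over RDMA). */
    drain();
}
```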
And as we've traveled down this road, we've realized there are additional NVM programming concepts that we wish to bring up to the community. These are application-visible concepts. The first one, and maybe the most interesting, is consumers of visibility versus consumers of persistence. Visibility and persistence are closely related, but they're totally separate concepts. Visibility means that you've written data to some domain, and that others in that domain can see the data that you've written; it's become consistent across the platform in some sense. When the domain has a network in the middle of it, visibility becomes a really big concept. I didn't even know my packet was sent, much less that the other guy can see the data. So consumers of visibility and consumers of persistence, we believe, are application classifications; they're concepts that some applications will want. A network shared memory application, for instance, may not care so much about persistence; it may care more about visibility. Whereas a storage application, absolutely: persistence, persistence, persistence. And they're very different. There are a couple of other factors coming in that I'll get into on the next slide. Here's an example of how different the semantics are. A lot of people think they can do a compare-and-swap on an address that sits in persistent memory, and they get a persistent lock. I just did a compare-and-swap, I got the answer to the compare-and-swap, and it was synced to persistent memory. Therefore everybody obeys my new state: I set the lock bit, it told me I was the first one to set it, OK, I'm good, and everybody else sees it. No. Unfortunately, desperately, no. The achievement of visibility and the achievement of persistence are two very different things. The motherboard doesn't even guarantee this. It might be in cache, but it may not have been placed in memory yet. And depending on the consistency guarantees of the platform, the other consumers way over on the other side of the motherboard may not know that your compare-and-swap is done. And they definitely aren't going to sit there and wait for your possible persistence operation to complete before they look at it, not without some sort of upper-layer application behavior. So we try to light this up really brightly for a lot of application developers who might think otherwise. It's easy to think otherwise, but it's not true, and it's critically important that you understand that.

Another thing that's risen to the surface is, how do we know it's done? How do we know, rather, that the data is good? In the existing NVM programming model, we have pretty much a platform-best-effort kind of methodology. If the platform detects an error, it'll raise a non-maskable interrupt of some kind and say, oh, parity error; oh, data loss; something like that. And then the machine may machine-check, may crash; the application may get a signal. But really, the locality of that error and the timing of that error are very, very vague. And then beyond that is the validity of the data. It's been written to the thing, and you expect memory to work well. But this is persistent memory. It's not only stored in some volatile buffer somewhere; it's actually been burned into something way down at the bottom there, and that burning process may or may not have achieved what you wanted. So how do we know that it's good? We want an explicit integrity check. We want to define what that means. We don't want to implement it; we don't want to describe everything about how you do it. But we do want to describe how the application might go about verifying this. Today, we do it with big, heavy software stacks: file systems with integrity protection on blocks and things in the file system, and something reads it back; or there's all kinds of exotic error-checking behavior. None of that is present on these motherboards. So we want to think through what the basic concepts are and document them in our programming interface.

And finally, the scope of the flush. Let's say we're doing a bunch of writes, and then we do a flush. Well, what about all the other writes that happened on that platform? Does my flush wait for everybody else's writes? Does my flush wait for this guy's write? Does my flush have a problem with writes that I do afterward? And can I scope that out? Can I say, flush this, but don't flush that? No, you actually can't. You have no control over that. Most hardware today will arbitrarily put things into the persistence domain without being told to. Flush guarantees that it's there, but it may have already happened: the cache may have simply pushed the data down, which is what caches do with dirty data. So we want to describe what it means to have a scope of flush. The interface today gives a base and a bounds. It says, make these 27 bytes persistent, and the interface comes back and says, OK. But it may have made a whole lot more bytes persistent. And this is especially important with an RDMA adapter, which really isn't going to look at 27 bytes for you; it's already packetized them and sent them over with a bunch of other data. Lots of guarantees don't stay as solid in this case. So describing the scope of the flush and modeling these on application expectations is something that's important.
So I'm going to move down a layer, from the NVM programming interface into the protocols we use to implement the remote flavor of this programming interface.
And of course, we all know them; we've seen them year after year. They're called RDMA protocols, Remote Direct Memory Access, and there are a bunch of them. There's InfiniBand, there's RoCE, there's iWARP, there are a bunch of proprietary ones, and there are a bunch with various flavors of behavior that all match this remote direct memory access paradigm. A number of years ago, I and a few others proposed an operation called RDMA flush, and it's made a tremendous amount of progress. There is broad agreement across the InfiniBand Trade Association, the IETF, and other standards organizations on the basic semantics of RDMA flush. The RDMA protocols are in the process of being extended to support this guarantee of remote persistence. RDMA in the past was typically about visibility: about moving the data into memory so that it was visible to your peer. RDMA never got down to the point of actually saying where the data resided and when it was actually pushed into a memory cell, because it didn't used to matter whether it was in cache or memory. The CPU saw the same picture, and all the other NICs on the platform saw the same picture either way. Now it really matters. Persistence is an extra step on top of visibility, so the RDMA protocols need an extension to support it, and there are some really interesting ordering implications. Additionally, the platforms need support. RDMA protocols sit on a PCI bus, and the PCI bus has no concept of persistence either. In fact, the PCI bus write guarantee is incredibly weak. It's called a posted write, and posted means fire and forget. There's no acknowledgment to a write; you simply say, please do the write. It goes into a queue somewhere, and it might happen immediately, it might happen next week; you really don't know. You can't force the write to occur. Well, you can in certain ways, sort of as a side effect of a read, but there's no explicit flush operation on PCI Express. So we envision a possible PCI Express extension. All these standardization efforts are progressing slowly but surely. I'm embarrassed to say that there is still no standard out there. We thought it would happen last year, maybe spring; here it is, the fall, and the goal right now in the IBTA is the end of the calendar year. I really, really hope it's there, because there's full consensus. But it's kind of hard to finalize these protocols and to say they're done, they're at the right level, because people are going to start to build a lot of hardware. And once that hardware is out there and starts speaking these protocols, it's going to be out there for a long time. So a lot of people want to make sure that this is at least good enough for what we want it to do right away. So we're proceeding with care, I guess, is the most charitable way I can put it. I'm frustrated, but hey.
So here's the new flush. It's a new transport operation in the RDMA protocol. The existing RDMA operations remain unchanged: RDMA write, RDMA read, send, receive, all those things remain the same; no change whatsoever to the existing protocol. Flush executes very much like an RDMA read. It's queued, it's flow controlled, it's acknowledged. An RDMA read, of course, returns the data; flush just says, yeah, it's done. But it's queued, and it resembles existing operations, so it's not a radical change to the protocol nor to the implementations in these little highly optimized adapters. The requester specifies a byte range to be made durable. The RDMA flush operation is basically a single packet that says, please flush these 27 bytes, just as we would have called optimized flush or async flush in our own NVM programming API. That said, the responder may have flushed much more. It may have already flushed, because time leads to flush. But it may actually not have scoreboarded all these writes, and it may be forced to simply say, I'm going to flush everything I have that's dirty before I respond. That's an implementation decision; there are better ways and worse ways to do it, but it's not something that's explicit in the protocol. The protocol does not allow you to narrow this down and say, I don't want to flush something else or whatever. It's simply a basic guarantee that the bytes you requested to be flushed have been flushed.

The scope considerations are the same at the RDMA level; in fact, they're even harder at the RDMA level than they were at the interface level. The three that are being considered in the IBTA right now are per connection, per region, or per region range, a subset of a region. A region here is the RDMA region, a memory registration; it's basically a mapping for a bunch of discontiguous physical memory, and so the region can be quite large. The range is obviously specified down to the byte level. Most adapters today are expected to do per-connection behavior: when you call flush, all the writes that appeared on that queue pair prior to the flush will be pushed, and then the flush will be executed. That's the simplest, easiest, most efficient way for the adapter to do it. However, it's possible that it could scoreboard writes on a per-region or even on a per-range basis in the future. So the expectation among the RDMA adapter community is that maybe we'll get there, or maybe in some special cases we'll be able to provide this better behavior, and they feel that it will lead to a more efficient implementation down the road. So the RDMA protocol that's currently being envisioned will have this scope capability. It's actually not in the protocol; it's in the requirements of the implementation. But the requirements are a little bit squishy, and there's no way in the protocol for the remote peer to figure it out. The remote peer has to take a very conservative approach unless it has better information.

A second thing that is being considered, and this is kind of new, is the selectivity consideration. This is where visibility and persistence come into play. Visibility in an RDMA adapter is actually quite simple. What it means is: don't buffer your writes. Visibility means push the writes to the PCI domain. The PCI root ports in most platforms are in the consistency domain, so if the PCI root port has seen the write, you've made your data visible. But a lot of adapters will actually buffer data a little bit, right? Not a lot; they're not buffered adapters, that's why they're fast. But they may have a few cache lines or so of write buffer. So push to visibility means flush those write buffers; get them on the PCI bus right now. Basically, that's the RDMA read semantic, but it's now explicit, not a side effect. And applications can code to the explicit behavior and get a guarantee that they did what they did. If all they do is an RDMA read, there's no guarantee it reached visibility across the whole platform. But if you call flush and the adapter responds to it, you know that it's visible across the entire platform, and, by the way, to all the other RDMA adapters on the platform. So that's critical. And then the second stage is persistence. Persistence in the IBTA model today implies visibility: you always get visibility when you get persistence. And persistence means not only did I push it toward the domain, I made sure that it made it all the way into the domain, past the safe point. Be it an ADR (asynchronous DRAM refresh) domain, a battery-backed domain, or it's burned into a cell or whatever, it went not only to the PCI root but into the memory controller. So persistence is kind of two pushes: it always gets to visibility, and then it gets farther, into persistence. And those are explicit bits in the new protocol. By talking about it in the NVM programming TWG, we may be able to give guidance to the RDMA community on how important these two modes are.
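To make the ladder concrete, here's a sketch of writes followed by the proposed flush. The write posting uses the standard libibverbs API; post_flush() is a hypothetical stand-in for the new operation, since the flush verb was not yet standardized at the time of this talk.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Post an RDMA WRITE of a local buffer to a registered remote range. */
static int post_write(struct ibv_qp *qp, struct ibv_mr *mr,
                      void *laddr, uint32_t len,
                      uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)laddr, .length = len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .opcode = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,      /* completion when the write is done */
        .sg_list = &sge, .num_sge = 1,
        .wr.rdma = { .remote_addr = remote_addr, .rkey = rkey },
    };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);
}

/* Hypothetical stand-in: "make the bytes at (rkey, remote_addr, len)
 * persistent, then respond." Not a real verbs call; it models the
 * RDMA flush described above. */
static int post_flush(struct ibv_qp *qp, uint64_t remote_addr,
                      uint32_t len, uint32_t rkey);

static void replicate(struct ibv_qp *qp, struct ibv_mr *mr,
                      void *buf, uint32_t len,
                      uint64_t raddr, uint32_t rkey)
{
    post_write(qp, mr, buf, len, raddr, rkey);  /* data lands in remote memory   */
    post_flush(qp, raddr, len, rkey);           /* lasso the writes: push them to
                                                   the persistence domain, then
                                                   respond to the initiator      */
}
```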
The idea of visibility, by the way, is that it's more efficient; it can be done faster. And for an application that doesn't need persistence, visibility may be preferable. So here's just a quick picture of the RDMA flush. I swiped this slide from Idan Burstein from Mellanox, who presented this at the Persistent Memory Summit back in January, eight months ago, which was located right here in this building, right? Yeah. I'm pointing that way, but it was right around here somewhere. Anyway, the RDMA flush is that big orange box on the right, and you can see that it comes after a bunch of writes. In his picture there are also writes after it, and those writes go right through the box: they were issued after the flush without waiting for the flush, so they flowed straight to memory. And there are a couple of really interesting things there that you may want to look into; I don't have time to talk about them today. But basically, preserving the RDMA operation model with high performance and low latency is the key, along with expressing these two types of selectivity, as they call it in the protocol.
This is the statement from the draft RDMA spec about persistency and visibility. Flush type persistency shall ensure the placement of the preceding data accesses in a memory that persists the data across a power cycle, and respond only when that guarantee is made. Visibility ensures the placement of the preceding data accesses into the memory domain for visibility, for reading, only for reading; it makes no guarantee of persistency. So that's the formal draft definition in the spec right now. And like I said before, in the IBTA model, in the RDMA model, it's fair to say persistency always provides visibility.
Flush scope: the memory region range. This is in the protocol message. Flush provides what's called a triplet, a handle, a virtual address, and a length, within the QP. That's the narrowest possible scope. Memory region means everything within that handle, and all means everything I wrote previously on the queue pair. All basically maps best to the NVM programming TWG's model. So we're good, right? If stronger guarantees are possible, great, but the minimum guarantee is easily met. The implementation of flush scope is a question for the provider implementation. It's not specified in the protocol, and it is not known to the application that issued the I/O. This is an implementation detail; do not make an assumption here.
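As a conceptual sketch only (not the wire format, and with illustrative field and enum names), this is roughly what a flush request can express: a scope, a selectivity type, and the handle/address/length triplet for the narrowest case.

```c
#include <stdint.h>

enum flush_scope {
    FLUSH_SCOPE_ALL,          /* everything previously written on this QP    */
    FLUSH_SCOPE_REGION,       /* everything within the registered region     */
    FLUSH_SCOPE_REGION_RANGE, /* only the byte range named by the triplet    */
};

enum flush_type {
    FLUSH_TYPE_VISIBILITY,    /* placement for reading; no persistence claim */
    FLUSH_TYPE_PERSISTENCE,   /* survives a power cycle; implies visibility  */
};

struct flush_request {
    enum flush_scope scope;
    enum flush_type  type;
    uint32_t handle;          /* memory registration handle (rkey)           */
    uint64_t va;              /* virtual address within the region           */
    uint32_t length;          /* byte count to be made durable or visible    */
};
```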
There's one other operation that's really important, and that is a transactional write. We want to be able to support a two-phase commit using persistence, right? This is basically a database or log-writer style model. We want to drop the data, we want to make the data safe and ensure that's happened, and then we want to drop a pointer or a flag that says, hey, there's data there. A log writer will do this with a big buffer and then a log pointer, and then another big buffer, and then a log pointer, and it'll do write-flush, write-flush, write-flush for each one of those stages. We don't have in RDMA any sort of ordering for writes. Writes are posted operations: whenever you send them, they're eligible to be written, so they might land in memory at any time, which would be really bad for the log writer. Your alternative is to do a send, which signals software, right? I'm sending a message to my peer process; that's going to do it. Well, that's all great, but that's latency: that's an interrupt, that's a work request, that's software processing, that's more stuff. What we want is a highly efficient, low-latency stream of these atomic transactional operations. And so our solution is an atomic write operation. This was another thing that I proposed years ago, and I'm proud to see it in this spec. It's a separate operation on the wire, which is actually quite interesting. The atomic write doesn't have to go to PMEM, by the way; it can go to ordinary memory, and this opens up some interesting questions for RDMA applications even without PMEM. But it atomically updates eight bytes at an eight-byte alignment, and what that means is that it can operate without exotic PCIe atomics, which is kind of important. It's non-posted and queued, but most important, at the very end, it's ordered after the flush. So you can pipeline it. You can say write, flush, atomic write, and if something breaks, boom, the pipeline's broken too, which is really important.
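Here's a sketch of that log-writer pattern: write the record, flush it, then atomically publish an 8-byte, 8-byte-aligned log-head pointer ordered after the flush. post_write() and post_flush() are the helpers sketched earlier, and post_atomic_write() is a hypothetical stand-in for the proposed operation; none of this is a published API.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

int post_write(struct ibv_qp *qp, struct ibv_mr *mr, void *laddr,
               uint32_t len, uint64_t remote_addr, uint32_t rkey);
int post_flush(struct ibv_qp *qp, uint64_t remote_addr, uint32_t len,
               uint32_t rkey);
int post_atomic_write(struct ibv_qp *qp, uint64_t remote_addr,
                      uint64_t value, uint32_t rkey);   /* hypothetical */

void append_log_record(struct ibv_qp *qp, struct ibv_mr *mr,
                       void *record, uint32_t record_len,
                       uint64_t record_raddr, uint64_t log_head_raddr,
                       uint32_t rkey)
{
    /* 1. Push the log record into remote persistent memory. */
    post_write(qp, mr, record, record_len, record_raddr, rkey);

    /* 2. Make it durable. The flush is queued and flow controlled; the
     *    initiator does not stall here. */
    post_flush(qp, record_raddr, record_len, rkey);

    /* 3. Ordered after the flush: atomically publish the new log head
     *    (8 bytes, 8-byte aligned). A completed atomic write therefore
     *    implies everything before it was flushed. */
    post_atomic_write(qp, log_head_raddr, record_raddr + record_len, rkey);
}
```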
And the last bit of protocol is the PCI bus. An efficient RDMA flush will require some sort of PCIe extension or some sort of platform-specific support. We would expect a PCIe extension, because that's the most portable, universal way to do it: everybody would have a similar capability, and a driver wouldn't have to know how to push a magic PCI config register to make a flush happen or whatever. The adapter would simply send a PCIe flush. The PCIe SIG is reportedly, they're a very internal organization, I'm not a member, reportedly considering a flush semantic. I'm aware of a couple of different proposals that have hit the PCIe SIG. But it would enable that platform independence, which I believe is critically important. Second, back in 2017 there was an Atomic Ops Engineering Change Notice. It's basically an informative, non-normative, non-required extension that allows for atomic operations on the PCI bus, and a number of PCI root ports are beginning to support it; the Intel Purley platforms and later support it. So the Atomic Ops ECN may provide additional guarantees for this atomic write operation. It would be a safer way to do it than the unaligned writes that we anticipate. And I just want to mention that it's not a do-or-die thing if the PCI protocol is not extended; there are out-of-band solutions possible.
All right, I'm going to spin through some workloads of these things, and then I'm going to wrap up.
I'm going to accelerate a little, because I've got 10 slides to go in just 10 minutes. So there are a few example remote persistent memory workloads. These are what we keep in mind in the NVM program-- oh, I guess I should wrap up one more thing about the RDMA protocols. The idea of the NVM programming TWG is both to feed requirements to the RDMA community and to take back whatever the heck they decide. Our goal is that we've done some deep thinking, I believe, about the behavior of persistent memory applications, thinking that drives the requirements that RDMA adapters should darn well listen to. So we've told them what we think they ought to do. And they've gone ahead, and they've not done everything we wanted them to do, but that's fine. Some of those decisions were made for very good reasons: they're hard to do, they'd radically change the RDMA architecture, we can't build that in two years, that kind of question. But that pushing of requirements is something the TWG has already done. Now that the RDMA protocol community is responding, the TWG is kind of honor-bound to take back the result and think through what it means. So we're entering that phase. These requirements came, however, from these original remote persistent memory workloads, and I'll show three of them here. One is high availability, basically replication. You can use it for resilience, recovery, RAID-like scale-out across multiple devices, that kind of thing, but replication is basically what's going on: you can have multiple copies, and you can use a network to spray those copies around. Second, transactions, where atomicity becomes really, really important, failure atomicity in particular. When it works, that's great; what happens when it breaks? How far back in time do I have to go to recover, that kind of question. And finally, network shared memory, including the Pub/Sub, Publish/Subscribe, model. That's the visibility model that we feel is important, and there are a number of network shared memory applications that will nonetheless operate to persistence as well as just visibility. The goal of all these workloads is to maintain the ultra-low latency, the single-microsecond latencies, of the fabric and the underlying medium, while retaining compatibility with our programming model. We want to implement these applications in a similar way whether they're operating locally or remotely.
And that comes out a little interestingly. So basic replication, in which we do write, write, write, flush, write, write, flush, write, flush, and so on. We're not overwriting; there's no ordering dependency except for the basic replication. We want them all to be flushed properly. We don't want completions at the data sink, we don't want interrupts and overhead at the data sink, and we don't want any pipeline bubbles. We don't want to have to stop the pipeline and wait.
And that's pretty much what we already had with optimized flush and RDMA writes. All we have to do is add the flush. And the flush does not stall the entire queue; it only stalls other flushes or other operations that interact with flush, and it flows quite nicely. So the only thing we need for this basic replication workload is the RDMA flush. I've already talked about it, so I won't talk about it anymore.
However, when we get to the log writer, now we need that second extension: we need the atomic write. We have a transactional behavior. We want to do that write and flush, and then we want to follow it with some sort of flag that is persistently stored along with the data, that says not only is there a log record here, it's the most recent log record, and I'm about to update it again in the next microsecond. But I want a transactional log, an ordered log, and I want a validity pointer that follows the head of that chain. And here, latency is incredibly critical. Databases and log-based file systems live or die on the latency of that log-writer operation, so we want to make sure that it is done well.
So there are a bunch of protocol implications, et cetera; I won't get into them. But here we have not only the write and flush that we saw previously, where we see the writes followed by a flush, we also see a second operation following the flush that waits, that little pink-red stop sign, for that flush to complete before it performs its write into the memory domain. That second write could itself be flushed; I didn't show that here. But the point is, two responses come back, one from the flush and one from the atomic write. And if the atomic write succeeded, we know that everything prior to it succeeded. So this is a really powerful operation. The initiator never had to stall: it could post write, flush, atomic write, and then just wait for the reply of that atomic write to retire the log record. It could even start another log record afterward; it could just keep pipelining writes. Now, some applications may not tolerate that kind of behavior, may not like that kind of behavior. But it's optional; they can do it or not, as they like. So there's no required pipeline bubble in this picture, which is really important, and that's achieved by the ordering of the RDMA protocol operations.
What if that log writer is paranoid? What if that log writer doesn't trust the hardware on the remote side to have accurately stored that data? Well, what are you going to do? Are you going to send a message to the remote side and ask the CPU to read the data and give you a checksum? That would be pretty expensive; that would really ruin your latency story. So this is an additional extension to the RDMA protocol that I proposed, and it's currently in discussion. But it's basically a verify.
And I don't have time to go into the detail, but here's a picture of it, where you can see writes followed by a flush, and then a verify. The verify stops and waits for the flush. The flush completion goes back to the peer, but at that point the verify begins to execute in some sort of local engine. Maybe the NIC does it, maybe the memory subsystem does it; something on the platform does it and replies. If it matches, it's a green light; everybody's happy. If it doesn't match, then one of two behaviors occurs: either it breaks the connection or it responds with an error. I guess that didn't come through on our little animation here, but the verify complete could be an error.
And so by having two different flavors, it supports two different types of workload, which I think I mentioned way down at the bottom here. One would be that log writer, the paranoid log writer, who wants to stop immediately if something broke. Another might be an erase scrub that says, I want to find all the errors in that region. There, you don't want the connection to break; you just want to know that there was an error, because you're going to keep looking for more. So it supports a couple of different types of workload, which I think is pretty powerful. Anyway, this gets to the feedback from the RDMA community back to the SNIA non-volatile memory programming TWG. The implication for the programming model is that RDMA, the use of RDMA at all, strengthens the need for async flush. RDMA additionally makes errors increasingly imprecise. For one thing, the RDMA connection just breaks; it doesn't tell you what address failed. That's true locally, but it's even more true remotely because of the disconnection in time: the network takes microseconds to send, and there may be lots of other stuff in the pipeline in that time. So it's very tricky to stay in sync with the remote in this model. We also don't know, if we have an atomic write, do we need to bubble up the completion? Or is that just swallowed by the library? We don't know yet. Do we need to express asynchronous verify? Do we need to have a verify fail and describe its imprecision? These are sort of meta questions. We don't know yet whether we need to address them as the TWG, but as the dialogue matures, we'll know more and more about that.
And so our next steps are to continue the specification work. We've shared the spec with something called the OFIWG, the OpenFabrics Interfaces Working Group, which is kind of like an alternative verbs, an alternative RDMA API. They work closely with other implementations in open-source communities and have been giving us really good feedback. Standard specifications and actual implementations, that would be nice; they're beginning. Come to the talk that Mathew George and I are going to give just after this one, over in, I forget which room, I think it's Winchester, about SMB3 push mode, which will use this, an actual implementation on Windows, by the way. And so as these implementations mature, we'll have a lot to talk about.
And finally, other sorts of stuff. We're currently working on v2.0. We finished 1.2; we were going to call the next one 1.3, but we realized there's some pretty fundamental stuff in it, so we've decided to do it as 2.0. It'll have async flush, and it'll incorporate implementation learnings about these other flush behaviors. We have something called deep flush that some people hate and some people think we need; we have to answer that question. We're going to continue the remote access for high availability work. We're going to work on the scope question and try to get scope into our spec, and flush on fail, and other interesting stuff. So that's it.
Yes. If all you wanted to do is Pub/Sub and you were willing to accept the, if you will, vagueness of the RDMA guarantees, I believe you could do it with what's there today. People have done it: my colleagues at Microsoft wrote something called FaRM, Fast Remote Memory, that uses this over RDMA adapters today. You absolutely could. I think the important case is that Pub/Sub model, like an object store, where you have some sort of persistent storage behind it, and Pub/Sub may not be your only goal: the actual data, the persistence of the data, is part of it. And there, I would strongly recommend you don't hack it with an RDMA protocol today, because it will either not work when you have an ARM processor on the other side, for instance, or it will work today and fail tomorrow.
Yeah, so: what's the performance story of differing implementations on the target, the PMEM side? The answer is, no, the SNIA NVM programming TWG does not do any of that type of work. I personally have done some of it. You could come to our talk in 10 minutes, and I could answer that question more meaningfully.
Yeah, it's the difference between nanoseconds and microseconds, though. I mean, if you're looking at peer-to-peer transactions to a flash drive versus an RDMA NIC doing a DMA directly into memory, that's your difference.
Yeah, kind of. Don't forget you have the speed of light, right? That network operates at no better than the speed of light, so we're talking microseconds. I mean, it depends on how long your network is, but there's SerDes, there's getting on the wire, off the wire, signaling on the wire; all these things add up to microseconds.
Yeah. I'm personally very new to persistent memory, but one of the things that's been sort of bothering me as I went through this is the processor caches that exist, right? Now, if you memory-map your persistent memory, are those caches no longer write caches? And do the flushes now need to actually push the contents of those processor caches?
What are the implications on processor cache implementations, and do they really work the way you want? The answer is: older processors, no, they did not work properly, and they had to go to write-through to give any kind of guarantee. They had, well, they kind of had a flush operation; it was a terrible flush operation that locked the whole CPU for long periods of time while it flushed. Modern Intel processors have explicit support for this. They have explicit cache flush operations, some of which are highly efficient, and they have decomposed the flush and drain operations in important ways to allow applications to code to this cache coherency, or PMEM persistency on top of the cache, in accurate ways. This is--
Oh, absolutely. Andy and Alan here, it's their day job to teach programmers how to do that. There are transparent versions. When we talked earlier about nvm.block, that API is basically a RAM disk: it puts PMEM on the bottom, but it provides a disk-like API on top. That's the transparency story for PMEM today. Just do a block API; it'll behave like a disk. You won't get byte addressability, you'll have sector addressability, but it will be completely transparent, and your application will just work.