Good morning, everybody. I'm Jim Hull. I'm here with my colleague Betty Dall in the front row. We're here from Hewlett Packard Enterprise to tell you about our proposal for a new Linux Gen-Z subsystem.
Here's what we're going to talk about today.
Starting with an introduction to Gen-Z, because my assumption is that many of you don't even know what Gen-Z is or how it works or anything, so we're going to do a really brief discussion of what you need to know so you can understand what the rest of our presentation is about. So Gen-Z is an open new interconnect protocol. First of all, it's a consortium with broad industry support. There are over 70 members in the consortium right now, ranging from system designers to memory device designers to switch companies to software people. But mostly hardware. And there are a few of us thinking about software here. It's a whole family of specifications. There's a core spec, which gives you the basics of the protocol and how control space works and a few things we'll talk about later. There are physical specifications, mechanical specifications that cover form factors for putting these things into boxes. There are connectors, and a software and management spec as well.

Its most important property, probably, is that Gen-Z is a memory semantic fabric. And by memory semantic, I mean that you can have devices out on the other side of the fabric, and on your CPU, under the control of a Linux OS, you can do an mmap of a region of memory out there and do direct loads and stores to it. So you don't have to just do RDMA messaging or Ethernet packets or anything; you can do loads and stores to those devices out there. And Gen-Z can scale anywhere from 2 to 256 million components on that fabric, which is a pretty big number.

Gen-Z is a PHY-independent protocol in the sense that there's a PHY independence layer and you can run it against any number of PHYs, depending on what kind of latency, bandwidth and reach you need. The three PHYs that are specified right now include a PCIe PHY at 32 gigatransfers per second and two different 802.3 PHYs at 25 and 50 gigabits. The reason there are these different PHYs is that PCIe PHYs can go about this far and copper Ethernet PHYs can go about this far. And if you want to go further than that, like across a row of data center boxes or an entire data center, then you probably need an optical PHY, and there will be some of those specified in Gen-Z as well.

Gen-Z can support a completely unmodified OS by hiding all of the complication of the fabric management and making the devices appear like PCI devices in firmware. But that's not what we're here to talk about. We're here to talk about having Linux be a full player. And we'll talk a little bit later about why we think that's necessary and why just hiding it in firmware is not a good idea.
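To make "memory semantic" concrete, here is a minimal user-space sketch, assuming a hypothetical character device (/dev/genz_region0 is made up for illustration) that exposes one Gen-Z data-space region. The point is only that once the region is mapped, access is plain loads and stores rather than message passing.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical device node exposing one Gen-Z data-space region. */
    int fd = open("/dev/genz_region0", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    size_t len = 1 << 20;                       /* 1 MiB window */
    volatile uint64_t *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    p[0] = 0xdeadbeef;        /* a store that becomes a Gen-Z write packet */
    uint64_t v = p[0];        /* a load that becomes a Gen-Z read packet   */
    printf("read back %#lx\n", (unsigned long)v);

    munmap((void *)p, len);
    close(fd);
    return 0;
}
```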
So, in this picture, we have two different example fabrics. The one on the left is a pretty basic fabric: two machines, each with a CPU and memory connected over some coherent native interconnect between that CPU and a bridge, which is the name Gen-Z gives to the device that connects from a CPU out onto the Gen-Z fabric. And then two media components in each of those servers. So there are six Gen-Z components in all in that fabric on the left. Each component can have one or more interfaces. It's a point-to-point connection. If you want to fan out, you have to have a switch, which is that sort of octagonal thing in the middle there. Switches can be either standalone or integrated with pretty much any other component type, if you want to have switching in that component.

On the right-hand side is a far more complicated fabric. It's representative of what you might use in an HPC kind of environment. This is a two-dimensional HyperX, which means that each switch in the fabric is connected directly to all of the other switches in both its row and its column. Which leads to one of the prime features of a Gen-Z fabric, which is that you can have multi-path. You can have software set up the routing to go between any number of those switches, you know, directly, two hops to get there, or multiple hops along the way for redundancy or bandwidth improvements. I mentioned that there has to be management software. In general, there will be multiple OS instances running on the nodes in the fabric.
None of those individual OS instances can assume that it owns the entire fabric or all the components that it might find out there. Furthermore, you don't have to assign complete components to any given OS instance. You can divide up those components. For example, a large media device can be carved up into pieces, and each of those we call a resource. Those can be individually assigned to particular OS instances or shared, which is one of the main ideas here: you don't have to have a resource assigned to just one OS; it can be used by multiple ones simultaneously. To make that work, the fabric manager has to have some idea about which resources should be assigned to which OS instance. So we have this thing in the management subgroup called the grand plan. If you do a Google search for grand plan, the first thing that comes up is something from Wikipedia talking about the Sith in the Star Wars universe. That's why I think it really is a grand plan in that sense exactly. Let's see. Fabric management can be done either in band, meaning that the fabric management traffic is going over the Gen-Z fabric itself, or out of band, which would mean you have some sort of set of connections between those devices, like Ethernet. Either one can be supported. One of the main functions of this Gen-Z management software is to set up the routing. Like I said before, you can have a multitude of routes, and you have to decide which routes are good ones, which ones should be enabled, and which ones should be denied because you don't want those two components to talk to each other at all. And then, because there's this fabric manager sitting out there and it's the only one who knows which resources should be assigned to an OS, there has to be some communication mechanism between a local management service running on each and every node and that fabric manager, so the node can say, hey, which of these things that are out there that you're managing am I supposed to see? And that local management service will talk to that fabric manager using a DMTF Redfish interface to learn which resources are its.
We're going to drop down one little level of detail lower now and give you some very basic Gen-Z concepts. There are three basic component roles. Requesters are the things that initiate packets in order to get service from some other entity out on the fabric, which is known as a responder, which executes that packet and then sends back an acknowledgement if it needs to. That acknowledgement happens both for reads and writes, so even writes are acknowledged. This is basically a reliable protocol. If there's some error in the transmission on any one of those links, the hardware will retry up to some programmed limit to try to make that transaction happen. But it could, of course, fail if that happens too many times, if the link is really dead, for example. And then there are switches, whose role is just to route packets from ingress interfaces to egress interfaces, and they have a big set of tables in each switch component that decide which routing paths are enabled and which ones are not.

Every component on the fabric has a 28-bit global component ID, or GCID. It's assigned by management software. The first 16 bits of that are called the subnet ID, which is optional, and then there's a required 12-bit component ID. So if you want to build a small fabric, you don't have to have the full 28 bits; you can just do 12 of those.

Every component on the fabric also has two separate address spaces. There's the data address space, which is up to 2^64 bytes in size on each and every component. And next to it is a control address space, totally separate, 2^52 bytes in size maximum, where management software will program various parameters into the component. A really important thing to understand is that by default packets are completely unordered on Gen-Z, which is very different from PCIe, which has a well-known ordering model. They are unordered by assumption, and that's for a couple of reasons. One, we showed in the previous slides that multi-path can happen, so every packet that comes from my requester might follow a different path to that component, and they may arrive out of order. Furthermore, there's the hardware retry mechanism, which can cause just one packet to fail while others succeed; that one is then retried, and so that causes out-of-order delivery as well. And finally, another big software-visible difference is that coherence in this fabric is usually going to be done with software, and that's because the hardware coherence mechanisms that we use on components really can't scale to the size of fabrics we're talking about here. You would be spending all of your time doing snooping, and even directory-based things don't scale that far. So coherence in general will be software managed.
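As a quick illustration of that GCID layout, here is a small hedged sketch, just helpers I made up rather than anything from the spec or the subsystem code, packing a 16-bit subnet ID (SID) above a 12-bit component ID (CID) to form the 28-bit GCID:

```c
#include <stdint.h>

#define GENZ_CID_BITS   12
#define GENZ_SID_BITS   16
#define GENZ_CID_MASK   ((1u << GENZ_CID_BITS) - 1)   /* 0x00000fff */
#define GENZ_SID_MASK   ((1u << GENZ_SID_BITS) - 1)   /* 0x0000ffff */

/* Build a 28-bit GCID from an (optional) subnet ID and a component ID. */
static inline uint32_t genz_gcid(uint16_t sid, uint16_t cid)
{
    return ((uint32_t)(sid & GENZ_SID_MASK) << GENZ_CID_BITS) |
           (cid & GENZ_CID_MASK);
}

static inline uint16_t genz_gcid_sid(uint32_t gcid)
{
    return (gcid >> GENZ_CID_BITS) & GENZ_SID_MASK;
}

static inline uint16_t genz_gcid_cid(uint32_t gcid)
{
    return gcid & GENZ_CID_MASK;
}
```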
Here's a picture of what control space looks like on each component. Every control space starts at zero, and there's a required structure at address zero called the core structure, so you start there. Inside that core structure there will be a bunch of fields describing various things, including pointers to other structures, which describe more things about the component. Those pointers can create links, linked lists; the interface structure here, for example, the first interface, number zero, is pointed to by the core structure, and then it points to one and on to two and so on. And there's a whole tree of defined links and a known mechanism to follow all those pointers and find all that stuff. There are really two kinds of things in control space. One is structures, which have a fixed header at the front of them and therefore can be self-describing. There are also tables, which are not structures. They don't have that fixed header, and therefore you have to have a special algorithm that goes and looks at other fields in other structures to figure out how big that thing is and what it is.
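Here is a rough sketch of what following those pointers might look like. The header fields, the link-field position, and the read callback are all assumptions made for illustration; the real layouts come from the core spec.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical fixed header carried by every control-space *structure*
 * (tables have no such header and need type-specific size logic). */
struct genz_ctl_hdr {
    uint16_t type;      /* which structure this is         */
    uint16_t vers;      /* structure version               */
    uint32_t size;      /* size of this structure in bytes */
};

/* Hypothetical accessor supplied by the bridge driver: read 'len' bytes
 * of control space at 'offset' into 'buf'. */
typedef int (*ctl_read_t)(void *ctx, uint64_t offset, void *buf, size_t len);

/* Follow a chain of structures starting at 'offset' until the link field
 * (assumed here to sit right after the header) is zero. */
static int walk_chain(ctl_read_t read, void *ctx, uint64_t offset)
{
    while (offset) {
        struct genz_ctl_hdr hdr;
        uint64_t next;

        if (read(ctx, offset, &hdr, sizeof(hdr)))
            return -1;
        /* ... decode hdr.type / hdr.size and hand the structure to its parser ... */

        if (read(ctx, offset + sizeof(hdr), &next, sizeof(next)))
            return -1;
        offset = next;               /* zero terminates the chain */
    }
    return 0;
}
```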
I mentioned that bridges are the Gen-Z device that connects the CPU into the fabric. Here's a block diagram of an HPE bridge. This bridge block diagram is a marginally fictionalized version of a bridge that HPE has built and reported on at Hot Chips a couple of weeks ago. So if you want to find out more about that bridge, you can look up that presentation. In the middle of this diagram is the CPU, which has, of course, MMUs and often IOMMUs these days. So these are standard CPUs with their local memory. And then they connect over some interconnect to the bridge.

If you're doing the load/store mechanism, then you'll start by executing a load or store instruction on the CPU, which goes through the standard MMU, creating a physical address that comes out into the bridge. That physical address simply doesn't have enough data in it to resolve into a Gen-Z address because, as I mentioned, every component might have a full 64-bit address space of its own, and physical addresses on CPUs are just not that big these days. Furthermore, you need other data to fill into the Gen-Z packet, like what the global destination is and this thing called an R-Key, which is part of the access control mechanism. Therefore there's an extra layer of translation in the path called the requester ZMMU, where that physical address is looked up and turned into all those additional parameters before it goes out onto the Gen-Z fabric. And then, once a component has been addressed, the packet will be routed to the correct destination, presumably; if not, the responder will throw it away. But assuming it arrives, the Z address will be looked up in the responder ZMMU, along with the R-Key, which is compared against the R-Key stored in that ZMMU to make sure it matches, and again, if it doesn't, the packet will be thrown away. That will look up a virtual address and PASID, which will be forwarded to an IOMMU before flowing into system memory, assuming your platform has an IOMMU.

Because load/store access probably can't get you direct access to all of the fancy data and functionality that Gen-Z has, it's often a good idea in your bridge to have what's called a data mover, which is just a name for a fancy DMA engine. It gives you access to packets that you can't generate with load/store. It can also provide you the option to do RDMA if you want to do that. Similarly, on the receive side, you can have a receive data mover, which can receive messages from Gen-Z that are encapsulated Ethernet packets, for example, and send those off into a queue structure in the normal kind of DMA way, because they're not packets that have direct addresses; instead, they're more context-based. And finally, the control space has to be directly accessible both from the local CPU and, if you're doing in-band management, from the fabric as well. So the bridge will take control packets, which are different from data packets, and route them not through the responder block but to the control space block directly. Again, this is just an example. You don't have to build your bridge like this, but the Gen-Z subsystem needs to be able to manage these resources in bridges.
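To summarize what the requester ZMMU contributes, here is a purely illustrative structure capturing the parameters just described. Field names and widths are mine, not the hardware layout: a CPU physical address hits an entry like this and the bridge emits a packet carrying the destination GCID, the (up to 64-bit) Z address on that component, and the R-Key checked at the responder.

```c
#include <stdint.h>
#include <stdbool.h>

struct genz_req_zmmu_entry {
    uint64_t cpu_phys_base;   /* start of the local physical window        */
    uint64_t length;          /* window size                               */
    uint32_t dest_gcid;       /* 28-bit destination global component ID    */
    uint64_t z_addr_base;     /* base address in the responder data space  */
    uint32_t rkey;            /* access key compared by the responder ZMMU */
    bool     writable;
};
```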
A little more about ZMMUs. The assumption in the spec is that they are OS managed, so the OS has the direct ability to write the translations into the ZMMU, which means that any OS can generate a packet that's destined to any particular device out there. So then you might ask, well, how do you deal with access control and security, and that's a whole other talk that we're not going to do here. I already covered most of the requester ZMMU items on this bullet in the diagram description. I did say that the responder ZMMU is data space only, not control space. The Gen-Z spec defines two different kinds of ZMMU structures. The first is called a page table based one, which is structured much like a CPU's MMU, with multiple levels of in-memory page tables and caching of those elements into a TLB in the ZMMU, very much like a CPU or IOMMU structure. But there's also another kind called a page grid, which is on-chip only, no tables in memory, and it has very limited resources, so we need to have code in the subsystem to handle both of those kinds of ZMMUs.
All right. That covers pretty much the introduction to Gen-Z; I wanted to get you all up to speed as much as possible in the short time we have. Let's move on now to talk about the kernel subsystem itself.
So why do we want to do a kernel subsystem? Well, first we want to enable native device drivers to control I/O devices or accelerators that are out there on the Gen-Z fabric. That enables full access to all the advanced Gen-Z features, the whole list of them here on the slide, which we are not going to cover today due to lack of time. It also enables the sharing, like I mentioned before. If you do it in firmware, then pretty much a resource is assigned by firmware and then presented to an OS as if it were a local device. Well, that OS instance is, of course, going to assume that it has full and exclusive access to that device. So if you want to do sharing, you can't do it the firmware way. You have to have OS-visible knowledge of the sharing that's going on. Furthermore, we have in our design the idea that we're going to put the fabric manager and those local management services that I mentioned in user space, and the Gen-Z subsystem will be the mechanism by which those user space processes are given access to those resources. And why are we doing this now? Well, because hardware is showing up essentially now.
Here are the things we had in mind while doing the design that we have. First, since this Gen-Z subsystem wants to expose native devices, it needs to be a bus subsystem in the Linux kernel sense. And we have existing examples of bus subsystems like PCI and USB and Greybus. So we want to be like those where we can. That way driver writers who are used to doing drivers for those bus subsystems will not be too freaked out by some kind of odd design that we've done. The next one is maybe the most important of all, which is that we want policy to be in user space and just the mechanism in the kernel, to the extent possible. The previous speaker was talking all about these odd heuristics in the memory management system for page reclaim. We don't want to have things like that get in the way of making this work. So just let user space do it. We're going to use existing kernel services where that makes sense. And last but not least, we have to deal with the fact that if you read the core spec in Gen-Z, nearly every feature in there is optional. So we have to somehow deal with that level of complexity, where we have to be able to make sure that we can build an interoperable system out of these components even though they may have chosen slightly different feature sets.
So here is our block diagram of what we are proposing to build. The subsystem itself is in kernel space, not at the bottom; it's those two green boxes. The key on the right says the green things are new. So that's the new stuff. It will be connected to the bus and DMA subsystems in the kernel, as you might expect. We'll talk about Netlink, the hotplug infrastructure and the /sys filesystem in a minute. Yes, Terry?
When you say new, is that existing new or to be built?
New in the subsystem, or new user space components using the subsystem. So it's code, new code that we're writing now. That's the green stuff. The yellow stuff is already in the kernel. And the blue I haven't talked about yet, but I will now. So we need to have interfaces both down to bridge device drivers, which we'll be talking about a little bit more later; that's at the bottom. Each vendor supplies a bridge device driver that corresponds to their bridge device. And then there will be a set of upward-facing native device drivers that provide various services, like block device services or memory device services or Ethernet services or RDMA services, to user space. And then in user space itself, there are two main components being described here. The first is the local management services block on the right. Not the far right, but just next to that. We call that LLaMaS because it's the Linux local management service. You stick a couple of a's in there and you get a cool name, LLaMaS. And then there's a fabric manager, which we're calling Zephyr. Zephyr because, besides the definition that has to do with wind, there's one that has to do with fabric. So a fabric manager named Zephyr. And we'll talk more about those in a little while.
But first let's talk about the kernel piece of this. One thing I want to make clear is that this is very definitely a work in progress. We are not done by any means. We have some code that implements some of this stuff. But we're at a good place where, if there are glaring deficiencies that you see or things that we're doing wrong, let us know now. You'll see a set of questions here in a little bit that we have for the community to answer. And hopefully we'll get some of those answers today or at the end of the talk or out in the hallway track. Okay.
So the first aspect of getting the subsystem operational is basically that we assume a bridge device will be discovered on its native bus, say it's connected via PCI or some variant like CXL or CCIX or OpenCAPI or whatever. So that bridge device will be discovered in the normal way using those existing subsystems. And then when it's got its device ready and initialized, it will make a Gen-Z register bridge call, which is the notification to the Gen-Z subsystem that this isn't just some ordinary PCI device, but it wants to talk Gen-Z. That will happen usually during the probe function of that native bridge driver. When the subsystem finds such a bridge, it will be presented with two things. One is the native device for that device on its native bus, as well as a pointer to a structure which has a bunch of function pointers in it. Those function pointers will include callbacks into the bridge driver to let it return bridge information or perform control space reads or writes or maps, data space reads or writes, or control write message, which is a packet type I haven't really talked about, but it's a way to reach management entities out on the fabric. And then, of course, there's an unregister that corresponds.
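A rough sketch of that registration interface, with hypothetical names (the actual prototypes in the linux-genz tree may differ), might look like this:

```c
#include <linux/pci.h>

/* Callbacks the native bridge driver hands to the Gen-Z subsystem.
 * Names and signatures are illustrative only. */
struct genz_bridge_ops {
    int (*bridge_info)(struct device *dev, void *info);
    int (*control_read)(struct device *dev, u64 offset, void *buf, size_t len);
    int (*control_write)(struct device *dev, u64 offset, const void *buf, size_t len);
    int (*control_write_msg)(struct device *dev, const void *msg, size_t len);
    int (*data_read)(struct device *dev, u64 zaddr, void *buf, size_t len);
    int (*data_write)(struct device *dev, u64 zaddr, const void *buf, size_t len);
};

int genz_register_bridge(struct device *native_dev,
                         const struct genz_bridge_ops *ops);
void genz_unregister_bridge(struct device *native_dev);

/* Typically called from the native bus probe routine, e.g. a PCI probe: */
static const struct genz_bridge_ops example_bridge_ops;  /* filled in by the vendor driver */

static int example_bridge_probe(struct pci_dev *pdev,
                                const struct pci_device_id *id)
{
    /* ... normal PCI bring-up of the bridge hardware ... */
    return genz_register_bridge(&pdev->dev, &example_bridge_ops);
}
```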
Up in that upper blue block in the block diagram, for native device registration there will be a Gen-Z register driver function, very much like the PCI version. In fact, I think it has an identical parameter interface. The main difference between device driver registration for Gen-Z versus PCI is that in the PCI world you have vendor and device IDs, and in Gen-Z all the IDs are UUIDs instead. So the matching will be by UUID. And again, there will be a structure for Gen-Z driver registration, and then there will be PCI-like probe, remove, suspend and resume kinds of function pointers. And again, there's an unregister.
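And a correspondingly hedged sketch of the upward-facing driver registration, again with made-up names, where matching is by UUID rather than vendor/device ID:

```c
#include <linux/uuid.h>

struct genz_dev;                       /* device handed to probe()        */

struct genz_device_id {
    uuid_t uuid;                       /* component/service UUID to match */
};

struct genz_driver {
    const char *name;
    const struct genz_device_id *id_table;
    int  (*probe)(struct genz_dev *zdev, const struct genz_device_id *id);
    void (*remove)(struct genz_dev *zdev);
    int  (*suspend)(struct genz_dev *zdev);
    int  (*resume)(struct genz_dev *zdev);
};

int genz_register_driver(struct genz_driver *drv);
void genz_unregister_driver(struct genz_driver *drv);
```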
As I mentioned before, ZMMU and IOMMU management is pretty fundamental to the way Gen-Z works. So we want to centralize control and management of those so that we don't have to have every driver doing the same thing. The subsystem will know about ZMMUs; it will have calls that allow mapping control space or data space ZMMU entries, control space in particular so that we can implement the interface to user space, which we'll talk about in a minute. As I said, we're a work in progress here, so we don't know exactly what this API looks like. But to the extent possible, we're going to try to hide the difference between page grid and page table based ZMMUs. I'm still not convinced that we can do that, but that's the goal. Because Gen-Z can connect to an IOMMU in the system, and because PASIDs appear in the ZMMUs themselves, we're very interested in having some kind of common set of calls that allow us to manage PASIDs. So you can see our first question to the community here in blue, which is: should there be, or can there be, an interface for managing PASIDs? There was a talk earlier, and some hallway conversations we've had since, which I think have convinced us that that is the direction the kernel is heading. So, good. The second question to the community here is about huge pages. It's our understanding that huge pages for device memory are not well supported in the kernel today. And there's a whole host of reasons on the slide here for why we think Gen-Z really could benefit from that. First off, as I mentioned before, there's a huge number of components possible, and each of them can have a huge data space, and if you're trying to map all of those with 4k pages, you're going to be sad. So big pages help solve that problem, or at least mitigate it. Especially in the page grid case, since there are so few PTEs in the device, they tend to have a huge range of page sizes available. The bridge I mentioned before supports everything from 4k to 256 terabyte pages. So we'd like to be able to take advantage of that in the ZMMU. And then the third question on this slide is, again, about IOMMUs, because we've seen patches posted over the last year or so about shared virtual addressing and making a common interface to IOMMUs, and again, we could very much take advantage of that in our subsystem. We can have a bridge, for example, that connects via CXL, and if CXL is implemented by Intel and AMD and Arm CPUs, we could use that same bridge, but the IOMMUs in those platforms are different. We'd like to have a set of common calls that we can make to manage those IOMMUs from the subsystem.
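Since this API is still in flux, here is only a shape sketch, where every name is an assumption, of what a centralized set of ZMMU calls that hides the page-grid versus page-table distinction (and carries the PASID along) might look like:

```c
#include <linux/types.h>

struct genz_bridge;
struct genz_zmmu_map;

/* Map a window of a remote component's data space through the requester
 * ZMMU; the subsystem would pick page-grid or page-table entries, and a
 * page size (huge pages where the hardware offers them), internally. */
struct genz_zmmu_map *genz_req_zmmu_map(struct genz_bridge *br,
                                        u32 dest_gcid, u64 z_addr,
                                        u64 len, u32 rkey, u32 pasid);
void genz_req_zmmu_unmap(struct genz_zmmu_map *map);

/* Expose a window of local memory to the fabric via the responder ZMMU. */
struct genz_zmmu_map *genz_rsp_zmmu_map(struct genz_bridge *br,
                                        u64 local_addr, u64 len,
                                        u32 rkey, u32 pasid);
void genz_rsp_zmmu_unmap(struct genz_zmmu_map *map);
```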
I mentioned data movers earlier in the block diagram for a bridge. We're a bit torn here. Kernel drivers like block or emulated Ethernet NIC drivers would greatly benefit from having a generic data mover interface, so that we could write the code once and call into the subsystem, which could then have interfaces to the underlying bridge driver. That would also be useful for being able to generate packet types in Gen-Z that are hard to do with plain load/store (over something like CXL), or the write message, or some of the more exotic ones like buffer and pattern requests. On the other hand, RDMA drivers, which in the end want to expose the queues and the data mover hardware directly to user space, are going to have to have user space drivers that hide the differences between those queue mechanisms, and so they are not particularly interested in having a common data mover interface. So the question to the community again is, do we think we should work on that or not?
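For discussion's sake, a hedged and entirely hypothetical sketch of what such a generic data-mover interface could look like, if the community decided it was worth building:

```c
#include <linux/types.h>

struct genz_bridge;

struct genz_dm_request {
    u32  dest_gcid;            /* destination component                 */
    u64  z_addr;               /* destination data-space address        */
    u64  local_addr;           /* local DMA address                     */
    u64  len;
    u32  opcode;               /* e.g. write, write-message, buffer ops */
    void (*done)(struct genz_dm_request *req, int status);
};

/* Bridge drivers would supply these; block or emulated-Ethernet drivers
 * would call them through the subsystem instead of open-coding per-bridge
 * queue handling. */
struct genz_dm_ops {
    int (*submit)(struct genz_bridge *br, struct genz_dm_request *req);
    int (*cancel)(struct genz_bridge *br, struct genz_dm_request *req);
};
```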
Interrupts and unsolicited event packets are kind of different in the Gen-Z space. I'll describe unsolicited event packets in a minute, but interrupts themselves are very different. Unlike in PCI, where there's a very nicely architected MSI-X interrupt structure, so you can have common code in the kernel that knows how to manage all of those things and common code that interoperates with the underlying interrupt chip and similar structures in the kernel, that's not how it works in Gen-Z. Every device can have interrupts, but there's no common mechanism for describing them or programming them, so it has to be done on a per-driver basis. Not unlike what was described by Intel in the SR-IOV talk yesterday. So maybe we can leverage something from what they're doing. I don't know yet. Interrupts can come from different places in Gen-Z. There are packets that you send from one OS instance to another, or from any component to any other component, that can carry an interrupt. So that's one source. Interrupts can come from the bridge when the data mover has a completion queue entry that's done, or when some incoming packet arrives at the receive data mover and wants to generate an interrupt. And then there are these things called UEPs, unsolicited event packets, which are the Gen-Z mechanism for some component in the fabric to signal fabric state changes like links going up and down, or hot removal of components, or errors. So we need to have mechanisms to pass those interrupts up into user space if there are user space managers that are handling them. Our proposal is that those UEPs become local interrupts on the targeted bridge component. Those are handled by the subsystem and then forwarded to user space by some mechanism, perhaps Netlink, which we'll talk about here some more in a minute. Okay.
That's kind of the end of our kernel subsystem part of this.
I want to talk about the user space pieces and what the kernel subsystem is presenting to user space to make user space management components work better.
So as I hinted at earlier, Gen-Z discovery is rather different from the way, say, PCI does it, which is all in the kernel: you explore the PCI hierarchy and you assume all your devices are local and owned by the OS. Here, every node running an OS instance on the Gen-Z fabric needs to run a copy of LLaMaS, the local management services process. LLaMaS is going to use Redfish, as I mentioned before, to go and talk to the fabric manager and find out which resources are owned by this OS instance. When it has one of those resources, it's going to make a Netlink call into the kernel Gen-Z subsystem and say, add this component. That's going to cause the subsystem to create new entries in /sys/devices under the path that you see here, and you'll see more of that on a further slide. That causes the resource to appear under the subnet ID and component ID for a given fabric. Once the subsystem creates those /sys devices, then through the usual udev mechanism, that will cause a search for a driver that can bind to the particular UUID that was added as part of that add component command, and we'll get a driver bound to it. A fabric manager node is completely different. The fabric manager is the thing that needs to go out and explore the entire fabric and try to figure out what's actually there. And if you have a grand plan, as I mentioned before, does the grand plan match up with what we actually discovered out there? So it needs to discover those interfaces, it needs to find switches and bridges and media controllers and all the things that are out there. The mechanism we're proposing it use to do that is, again, Netlink, an add fabric component command, which again will cause new tree entries to be added under /sys/bus/genz. And once those sysfs entries are there, then the fabric manager can open the files that it finds in those trees in order to get direct read/write access to the control space structures, like I mentioned in the introductory slides.
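As a sketch of the shape of that Netlink interface, here are hypothetical generic-netlink commands and attributes; the real names in the subsystem and in LLaMaS's Alpaka layer may well differ:

```c
#include <linux/genetlink.h>

/* Commands LLaMaS or a fabric manager would send into the subsystem,
 * plus a kernel-to-user-space event.  Illustrative names only. */
enum {
    GENZ_CMD_UNSPEC,
    GENZ_CMD_ADD_COMPONENT,        /* LLaMaS: resource assigned to this OS */
    GENZ_CMD_REMOVE_COMPONENT,
    GENZ_CMD_ADD_FABRIC_COMPONENT, /* fabric manager: discovered component */
    GENZ_CMD_UEP,                  /* kernel -> user space fabric event    */
};

/* Attributes carried with an add-component request. */
enum {
    GENZ_ATTR_UNSPEC,
    GENZ_ATTR_FABRIC_NUM,
    GENZ_ATTR_GCID,                /* subnet ID + component ID             */
    GENZ_ATTR_CCLASS,              /* component class                      */
    GENZ_ATTR_UUID,                /* used for driver matching via udev    */
    GENZ_ATTR_RESOURCE_LIST,       /* nested: control/data space resources */
};
```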
Yep.
So does this mean if your boot device is out somewhere in the Gen-Z fabric, you will need to bring up LLaMaS in order to boot?
So boot is an interesting thing. Just like in a local machine PCI environment today, in order to boot from some PCI device that's on your machine, you need to have UEFI drivers, assuming you're using a UEFI environment, that correspond to that device. That will be true here in a Gen-Z fabric as well. UEFI will have to be modified; it's going to have a new subsystem.
Once the kernel and initramfs have been loaded, we're going to need to have LLaMaS in the initramfs?
Yes, you're going to need to have LLaMaS in the initramfs. Yes, that is the implication of this design.
All right. Let's see where we were. I think we were about on the last bullet here, which is another question. So Netlink seemed to us to be a pretty good communication mechanism, both to inform the kernel of add and delete component resource commands, because it has this structure that lets you audit the kind of data that's going through, and because it's a bidirectional communication mechanism, so we can send UEPs and other interrupt events back to user space using Netlink as well. So the question to the community is, is that a good choice? We had a hallway conversation yesterday with, what was the name, Jason, kind of Mr. RDMA, who suggested that they chose ioctl for its performance benefit over Netlink. I don't know that this has such stringent performance goals as RDMA does, but maybe that's something to think about instead of Netlink. I don't know. Although I don't know how to use ioctl in kind of a resource-based mechanism.
So here is a very high level, very simplified version of what you might see in /sys/devices on a managed node. Remember that simple six-component topology that I showed; in this example we have the assumption that just a single one of those media components, the one that has been assigned subnet ID zero and component ID two, has been assigned to this node. So LLaMaS has run at this point and done its add command. It's told us that this CID exists. And in /sys you'll see a handful of properties like the component class, the FRU UUID, which I didn't tell you about, so you don't know what that is, and the GCID. And then two memory resources. So we assume that two regions of that memory component number two have been assigned to this OS, and you'll see control and data space regions corresponding to those memory regions as the resources here. There's a symlink from bridge zero that points to its actual native device. In this case we're assuming, on the right-hand side, that this bridge device is connected by PCI. So the first part of that hierarchy is a completely standard PCI representation of a device in /sys/devices. And then attached to that bridge device we'll create a Gen-Z hierarchy, and in there we'll have sysfs directories and binary attribute files corresponding to the control space structures that are visible to that local bridge.
In contrast, on the fabric manager, of course it's running LLaMaS as well, because it's locally managed and has Gen-Z devices, so under /sys/devices on the fabric manager you'll see a hierarchy not unlike the one on the previous slide. But the unique stuff for /sys on the fabric manager is all below /sys/bus/genz, under a fabric zero hierarchy. Of course, if you have multiple bridges you might be connected to multiple fabrics, so that's why there's this fabric zero, fabric one, fabric n in the path here. Again, there are the subnet ID and CID subdirectories. And then for each of the devices out in the fabric that have been discovered, you'll see the control space structures that will allow Zephyr to come in, open some of those binary attribute files, and do reads and writes to cause control space changes or read parameters out of the devices and do the fabric management that it needs to do. And the next blue question for the community here is, does this hierarchy look like something sane to you? Is it consistent with Linux's intended usage of sysfs? One thing that worries me a bit is that, because in the limit you could have 256 million components out on a fabric, that's a whole pile of sysfs files and directories, like more than has probably ever been in sysfs before. So are we going to run into some kind of limitation in the sysfs subsystem just by having so much stuff? Now, realistically, of course, you probably won't have a single fabric manager managing a fabric that big. You'll do some kind of federated thing and divide it up, and there's a whole bunch of stuff in the system and management framework for Gen-Z that describes how you can do all of that. So maybe it's not as bad as I make it out, but in the limit, it seems like a lot of sysfs stuff. Okay.
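Just to give a feel for it, here is a purely hypothetical rendering of that fabric-manager view; the directory and attribute names are invented for illustration, and the real layout is whatever ends up on the slide and in the code:

```
/sys/bus/genz/fabric0/
└── 0000/                    # subnet ID
    └── 002/                 # component ID
        ├── gcid             # illustrative attribute names
        ├── cclass
        └── control/         # binary attribute files the fabric manager
            ├── core         #   opens for control-space reads and writes
            └── interface0
```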
That just about ends our talk.
This is a place where we would normally in a talk have you asking me questions, but before we get to that, for those who want to look at this afterwards, we've summarized that set of kernel questions that were in blue on the earlier slides, so you don't have to go searching for them if you don't want to. You have them all right here.
And finally, I have some references. The consortium web page will let you get access to all the specifications that I mentioned. They're all public. There are, of course, newer versions coming out that aren't public yet, but they'll be out there soon. Our code is on GitHub, linux-genz. At least an early version of the code. We're working hard to get it a little more sane, so don't look at it today. Give us a couple of days and we'll have more of this actually up there. LLaMaS has two GitHub repos: LLaMaS itself is one repo, but it's using a home-grown Netlink interface called Alpaka as well. The thing which you do not see on this list is Zephyr, and that's because we haven't started to work on that. But it will show up here as well. All right. And that's all I have. Anybody have questions?
Yeah, so I see Python 3 and I'm a bit worried. Some environments have very little memory, like, for example, kdump, the crash kernel: you basically reserve just a bit of memory, and you want it to be really small because you basically never use that memory. So what if we boot from a Gen-Z attached disk, the kernel crashed, and we need to dump our memory to the disk? You mentioned that you still have to have LLaMaS in it, and I'm just worried that it will take too much space.
Good point.
So the question is, is it possible to implement, like, a very minimalistic, I don't know, LLaMaS discovery, so you can place it in the kernel? Because if you make it very complex, like, I don't know, it will basically die. No one will be using it.
Understood. So first, I think you should consider this first implementation of LLaMaS to be more of a prototype than anything else. We're doing it in Python because it's easy. It doesn't have to be in Python. You can do it in C or C++ or Go or whatever you like, Ruby. And second, for something like a crash kernel, if you are willing to pre-configure things, for example in a static file, there's nothing that says that LLaMaS actually has to go out over the network and talk to the actual fabric manager. It could have a local configuration file that is kind of a proxy for that.
Any more questions?
Perhaps a little heretical, but given you're building a system that has access to devices all over, would you be looking at asking why bother with some existing technologies? The thing that struck me directly was RDMA, which is to talk to that device way over there through a direct connection, or what looks like a direct connection. Would you be looking at saying, we don't need RDMA anymore, so why bother asking the question about RDMA on the slides? Hence, heretical question.
Yeah, I know. Kind of a philosophical thing. If you were to take Gen-Z to its logical extreme, where we're trying to provide direct load/store access to all those things out there, then I would say yes, RDMA is not necessary in that world. However, there is a huge body of HPC codes, in particular, that are based on MPI and Libfabric and RDMA underneath. We don't want to throw all that away. Gen-Z in some sense isn't better than any one of those technologies that you might mention. It's not better than PCI necessarily. It's not better than InfiniBand necessarily. It's not better than CXL necessarily. But it does do a lot of stuff. We want to bring in, and not leave behind, legacy codes. So I think RDMA is an important use case, and I think it needs to live on top of the Gen-Z subsystem for a long time.
Any other questions?