Good morning, everybody. I'm Jim Hull. I'm here with my colleague Betty Dall in the front row. We're here from Hewlett Packard Enterprise to tell you about our proposal for a new Linux Gen-Z subsystem.
Here's what we're going to talk about today.
Starting with an introduction to Gen-Z, because my assumption is that many of you don't even know what Gen-Z is or how it works or anything, so we're going to do a really brief discussion of what you need to know so you can understand what the rest of our presentation is about. So Gen-Z is an open new interconnect protocol. First of all, it's a consortium with broad industry support. There are over 70 members in the consortium right now, ranging from system designers to memory device designers to switch companies to software people. But mostly hardware. And there are a few of us thinking about software here. It's a whole family of specifications. There's a core spec, which gives you the basics of the protocol and how control space works and a few things we'll talk about later. There are physical specifications, mechanical specifications that cover form factors for putting these things into boxes. There are connectors, and a software and management spec as well.

Its most important property, probably, is that Gen-Z is a memory semantic fabric. And by memory semantic, I mean that you can have devices out on the other side of the fabric, and on your CPU, under the control of a Linux OS, you can do an mmap of a region of memory out there and do direct loads and stores to it. So you don't have to just do RDMA messaging or Ethernet packets or anything; you can do loads and stores to those devices out there. And Gen-Z can scale anywhere from 2 to 256 million components on that fabric, which is a pretty big number.

Gen-Z is a PHY-independent protocol in the sense that there's a PHY independence layer and you can run it against any number of PHYs, depending on what kind of latency, bandwidth and reach you need. The three PHYs that are specified right now include a PCIe PHY at 32 gigatransfers per second and two different 802.3 PHYs at 25 and 50 gigabits. The reason there are these different PHYs is that PCIe PHYs can go about this far and copper Ethernet PHYs can go about this far. And if you want to go further than that, like across a row of data center boxes or an entire data center, then you probably need an optical PHY, and there will be some of those specified in Gen-Z as well.

Gen-Z can support a completely unmodified OS by hiding all of the complication of the fabric management and making the devices appear like PCI devices in firmware. But that's not what we're here to talk about. We're here to talk about having Linux be a full player. And we'll talk a little bit later about why we think that's necessary and why just hiding it in firmware is not a good idea.
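To make "memory semantic" concrete, here is a minimal user-space sketch, assuming a hypothetical character device (/dev/genz_region0 is made up for illustration) that exposes one Gen-Z data-space region. The point is only that once the region is mapped, access is plain loads and stores rather than message passing.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical device node exposing one Gen-Z data-space region. */
    int fd = open("/dev/genz_region0", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    size_t len = 1 << 20;                       /* 1 MiB window */
    volatile uint64_t *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    p[0] = 0xdeadbeef;        /* a store that becomes a Gen-Z write packet */
    uint64_t v = p[0];        /* a load that becomes a Gen-Z read packet   */
    printf("read back %#lx\n", (unsigned long)v);

    munmap((void *)p, len);
    close(fd);
    return 0;
}
```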
So, in this picture, we have two different example fabrics. The one on the left is a pretty basic fabric: two machines, each with a CPU and memory connected over some coherent native interconnect between that CPU and a bridge, which is the name Gen-Z gives to the device that connects from a CPU out onto the Gen-Z fabric. And then two media components in each of those servers. So there are six Gen-Z components in all in that fabric on the left. Each component can have one or more interfaces. It's a point-to-point connection. If you want to fan out, you have to have a switch, which is that sort of octagonal thing in the middle there. Switches can be either standalone or integrated with pretty much any other component type, if you want to have switching in that component.

On the right-hand side is a far more complicated fabric. It's representative of what you might use in an HPC kind of environment. This is a two-dimensional HyperX, which means that each switch in the fabric is connected directly to all of the other switches in both its row and its column. Which leads to one of the prime features of a Gen-Z fabric, which is that you can have multi-path. You can have software set up the routing to go between any number of those switches, you know, directly, two hops to get there, or multiple hops along the way for redundancy or bandwidth improvements. I mentioned that there has to be management software. In general, there will be multiple OS instances running on the nodes in the fabric.
None of those individual OS instances can assume that it owns the entire fabric or all the components that it might find out there. Furthermore, you don't have to assign complete components to any given OS instance. You can divide up those components. For example, a large media device can be carved up into pieces, and each of those we call a resource. Those can be individually assigned to particular OS instances or shared, which is one of the main ideas here: you don't have to have a resource assigned to just one OS; it can be used by multiple ones simultaneously. To make that work, the fabric manager has to have some idea about which resources should be assigned to which OS instance. So we have this thing in the management subgroup called the grand plan. If you do a Google search for grand plan, the first thing that comes up is something from Wikipedia talking about the Sith in the Star Wars universe. That's why I think it really is a grand plan in that sense exactly. Let's see. Fabric management can be done either in band, meaning that the fabric management traffic is going over the Gen-Z fabric itself, or out of band, which would mean you have some sort of set of connections between those devices, like Ethernet. Either one can be supported. One of the main functions of this Gen-Z management software is to set up the routing. Like I said before, you can have a multitude of routes, and you have to decide which routes are good ones, which ones should be enabled, and which ones should be denied because you don't want those two components to talk to each other at all. And then, because there's this fabric manager sitting out there and it's the only one who knows which resources should be assigned to an OS, there has to be some communication mechanism between a local management service running on each and every node and that fabric manager, so the node can say, hey, which of these things that are out there that you're managing am I supposed to see? And that local management service will talk to that fabric manager using a DMTF Redfish interface to learn which resources are its.
We're going to drop down one little level of detail lower now and give you some very basic Gen-Z concepts. There are three basic component roles. Requesters are the things that initiate packets in order to get service from some other entity out on the fabric, which is known as a responder, which executes that packet and then sends back an acknowledgement if it needs to. That acknowledgement happens both for reads and writes, so even writes are acknowledged. This is basically a reliable protocol. If there's some error in the transmission on any one of those links, the hardware will retry up to some programmed limit to try to make that transaction happen. But it could, of course, fail if that happens too many times, if the link is really dead, for example. And then there are switches, whose role is just to route packets from ingress interfaces to egress interfaces, and they have a big set of tables in each switch component that decide which routing paths are enabled and which ones are not.

Every component on the fabric has a 28-bit global component ID, or GCID. It's assigned by management software. The first 16 bits of that are called the subnet ID, which is optional, and then there's a required 12-bit component ID. So if you want to build a small fabric, you don't have to have the full 28 bits; you can just do 12 of those.

Every component on the fabric also has two separate address spaces. There's the data address space, which is up to 2^64 bytes in size on each and every component. And next to it is a control address space, totally separate, 2^52 bytes in size maximum, where management software will program various parameters into the component. A really important thing to understand is that by default packets are completely unordered on Gen-Z, which is very different from PCIe, which has a well-known ordering model. They are unordered by assumption, and that's for a couple of reasons. One, we showed in the previous slides that multi-path can happen, so every packet that comes from my requester might follow a different path to that component, and they may arrive out of order. Furthermore, there's the hardware retry mechanism, which can cause just one packet to fail while others succeed; that one is then retried, and so that causes out-of-order delivery as well. And finally, another big software-visible difference is that coherence in this fabric is usually going to be done with software, and that's because the hardware coherence mechanisms that we use on components really can't scale to the size of fabrics we're talking about here. You would be spending all of your time doing snooping, and even directory-based things don't scale that far. So coherence in general will be software managed.
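As a quick illustration of that GCID layout, here is a small hedged sketch, just helpers I made up rather than anything from the spec or the subsystem code, packing a 16-bit subnet ID (SID) above a 12-bit component ID (CID) to form the 28-bit GCID:

```c
#include <stdint.h>

#define GENZ_CID_BITS   12
#define GENZ_SID_BITS   16
#define GENZ_CID_MASK   ((1u << GENZ_CID_BITS) - 1)   /* 0x00000fff */
#define GENZ_SID_MASK   ((1u << GENZ_SID_BITS) - 1)   /* 0x0000ffff */

/* Build a 28-bit GCID from an (optional) subnet ID and a component ID. */
static inline uint32_t genz_gcid(uint16_t sid, uint16_t cid)
{
    return ((uint32_t)(sid & GENZ_SID_MASK) << GENZ_CID_BITS) |
           (cid & GENZ_CID_MASK);
}

static inline uint16_t genz_gcid_sid(uint32_t gcid)
{
    return (gcid >> GENZ_CID_BITS) & GENZ_SID_MASK;
}

static inline uint16_t genz_gcid_cid(uint32_t gcid)
{
    return gcid & GENZ_CID_MASK;
}
```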
Here's a picture of what control space looks like on each component. Every control space starts at zero, and there's a required structure at address zero called the core structure, so you start there. Inside that core structure there will be a bunch of fields describing various things, including pointers to other structures, which describe more things about the component. Those pointers can create links, linked lists; the interface structure here, for example, the first interface, number zero, is pointed to by the core structure, and then it points to one and on to two and so on. And there's a whole tree of defined links and a known mechanism to follow all those pointers and find all that stuff. There are really two kinds of things in control space. One is structures, which have a fixed header at the front of them and therefore can be self-describing. There are also tables, which are not structures. They don't have that fixed header, and therefore you have to have a special algorithm that goes and looks at other fields in other structures to figure out how big that thing is and what it is.
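Here is a rough sketch of what following those pointers might look like. The header fields, the link-field position, and the read callback are all assumptions made for illustration; the real layouts come from the core spec.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical fixed header carried by every control-space *structure*
 * (tables have no such header and need type-specific size logic). */
struct genz_ctl_hdr {
    uint16_t type;      /* which structure this is         */
    uint16_t vers;      /* structure version               */
    uint32_t size;      /* size of this structure in bytes */
};

/* Hypothetical accessor supplied by the bridge driver: read 'len' bytes
 * of control space at 'offset' into 'buf'. */
typedef int (*ctl_read_t)(void *ctx, uint64_t offset, void *buf, size_t len);

/* Follow a chain of structures starting at 'offset' until the link field
 * (assumed here to sit right after the header) is zero. */
static int walk_chain(ctl_read_t read, void *ctx, uint64_t offset)
{
    while (offset) {
        struct genz_ctl_hdr hdr;
        uint64_t next;

        if (read(ctx, offset, &hdr, sizeof(hdr)))
            return -1;
        /* ... decode hdr.type / hdr.size and hand the structure to its parser ... */

        if (read(ctx, offset + sizeof(hdr), &next, sizeof(next)))
            return -1;
        offset = next;               /* zero terminates the chain */
    }
    return 0;
}
```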
I mentioned that bridges are the Gen-Z device that connects the CPU into the fabric. Here's a block diagram of an HPE bridge. This bridge block diagram is a marginally fictionalized version of a bridge that HPE has built and reported on at Hot Chips a couple of weeks ago. So if you want to find out more about that bridge, you can look up that presentation. In the middle of this diagram is the CPU, which has, of course, MMUs and often IOMMUs these days. So these are standard CPUs with their local memory. And then they connect over some interconnect to the bridge.

If you're doing the load/store mechanism, then you'll start by executing a load or store instruction on the CPU, which goes through the standard MMU, creating a physical address that comes out into the bridge. That physical address simply doesn't have enough data in it to resolve into a Gen-Z address because, as I mentioned, every component might have a full 64-bit address space of its own, and physical addresses on CPUs are just not that big these days. Furthermore, you need other data to fill into the Gen-Z packet, like what the global destination is and this thing called an R-Key, which is part of the access control mechanism. Therefore there's an extra layer of translation in the path called the requester ZMMU, where that physical address is looked up and turned into all those additional parameters before it goes out onto the Gen-Z fabric. And then, once a component has been addressed, the packet will be routed to the correct destination, presumably; if not, the responder will throw it away. But assuming it arrives, the Z address will be looked up in the responder ZMMU, along with the R-Key, which is compared against the R-Key stored in that ZMMU to make sure it matches, and again, if it doesn't, the packet will be thrown away. That will look up a virtual address and PASID, which will be forwarded to an IOMMU before flowing into system memory, assuming your platform has an IOMMU.

Because load/store access probably can't get you direct access to all of the fancy data and functionality that Gen-Z has, it's often a good idea in your bridge to have what's called a data mover, which is just a name for a fancy DMA engine. It gives you access to packets that you can't generate with load/store. It can also provide you the option to do RDMA if you want to do that. Similarly, on the receive side, you can have a receive data mover, which can receive messages from Gen-Z that are encapsulated Ethernet packets, for example, and send those off into a queue structure in the normal kind of DMA way, because they're not packets that have direct addresses; instead, they're more context-based. And finally, the control space has to be directly accessible both from the local CPU and, if you're doing in-band management, from the fabric as well. So the bridge will take control packets, which are different from data packets, and route them not through the responder block but to the control space block directly. Again, this is just an example. You don't have to build your bridge like this, but the Gen-Z subsystem needs to be able to manage these resources in bridges.
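To summarize what the requester ZMMU contributes, here is a purely illustrative structure capturing the parameters just described. Field names and widths are mine, not the hardware layout: a CPU physical address hits an entry like this and the bridge emits a packet carrying the destination GCID, the (up to 64-bit) Z address on that component, and the R-Key checked at the responder.

```c
#include <stdint.h>
#include <stdbool.h>

struct genz_req_zmmu_entry {
    uint64_t cpu_phys_base;   /* start of the local physical window        */
    uint64_t length;          /* window size                               */
    uint32_t dest_gcid;       /* 28-bit destination global component ID    */
    uint64_t z_addr_base;     /* base address in the responder data space  */
    uint32_t rkey;            /* access key compared by the responder ZMMU */
    bool     writable;
};
```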
A little more about ZMMUs. The assumption in the spec is that they are OS managed, so the OS has the direct ability to write the translations into the ZMMU, which means that any OS can generate a packet that's destined to any particular device out there. So then you might ask, well, how do you deal with access control and security, and that's a whole other talk that we're not going to do here. I already covered most of the requester ZMMU items on this bullet in the diagram description. I did say that the responder ZMMU is data space only, not control space. The Gen-Z spec defines two different kinds of ZMMU structures. The first is called a page table based one, which is structured much like a CPU's MMU, with multiple levels of in-memory page tables and caching of those elements into a TLB in the ZMMU, very much like a CPU or IOMMU structure. But there's also another kind called a page grid, which is on-chip only, no tables in memory, and it has very limited resources, so we need to have code in the subsystem to handle both of those kinds of ZMMUs.
All right. That covers pretty much the introduction to Gen-Z; I wanted to get you all up to speed as much as possible in the short time we have. Let's move on now to talk about the kernel subsystem itself.
So why do we want to do a kernel subsystem? Well, first we want to enable native device drivers to control I/O devices or accelerators that are out there on the Gen-Z fabric. That enables full access to all the advanced Gen-Z features, the whole list of them here on the slide, which we are not going to cover today due to lack of time. It also enables the sharing, like I mentioned before. If you do it in firmware, then pretty much a resource is assigned by firmware and then presented to an OS as if it were a local device. Well, that OS instance is, of course, going to assume that it has full and exclusive access to that device. So if you want to do sharing, you can't do it the firmware way. You have to have OS-visible knowledge of the sharing that's going on. Furthermore, we have in our design the idea that we're going to put the fabric manager and those local management services that I mentioned in user space, and the Gen-Z subsystem will be the mechanism by which those user space processes are given access to those resources. And why are we doing this now? Well, because hardware is showing up essentially now.
Here are the things we had in mind while doing the design that we have. First, since this Gen-Z subsystem wants to expose native devices, it needs to be a bus subsystem in the Linux kernel sense. And we have existing examples of bus subsystems like PCI and USB and Greybus. So we want to be like those where we can. That way driver writers who are used to doing drivers for those bus subsystems will not be too freaked out by some kind of odd design that we've done. The next one is maybe the most important of all, which is that we want policy to be in user space and just the mechanism in the kernel, to the extent possible. The previous speaker was talking all about these odd heuristics in the memory management system for page reclaim. We don't want to have things like that get in the way of making this work. So just let user space do it. We're going to use existing kernel services where that makes sense. And last but not least, we have to deal with the fact that if you read the core spec in Gen-Z, nearly every feature in there is optional. So we have to somehow deal with that level of complexity, where we have to be able to make sure that we can build an interoperable system out of these components even though they may have chosen slightly different feature sets.
So here is our block diagram of what we are proposing to build. The subsystem itself is in kernel space, not at the bottom; it's those two green boxes. The key on the right says the green things are new. So that's the new stuff. It will be connected to the bus and DMA subsystems in the kernel, as you might expect. We'll talk about Netlink, the hotplug infrastructure and the /sys filesystem in a minute. Yes, Terry?
When you say new, is that existing new or to be built?
New in the subsystem, or new user space components using the subsystem. So it's code, new code that we're writing now. That's the green stuff. The yellow stuff is already in the kernel. And the blue I haven't talked about yet, but I will now. So we need to have interfaces both down to bridge device drivers, which we'll be talking about a little bit more later; that's at the bottom. Each vendor supplies a bridge device driver that corresponds to their bridge device. And then there will be a set of upward-facing native device drivers that provide various services, like block device services or memory device services or Ethernet services or RDMA services, to user space. And then in user space itself, there are two main components being described here. The first is the local management services block on the right. Not the far right, but just next to that. We call that LLaMaS because it's the Linux local management service. You stick a couple of a's in there and you get a cool name, LLaMaS. And then there's a fabric manager, which we're calling Zephyr. Zephyr because, besides the definition that has to do with wind, there's one that has to do with fabric. So a fabric manager named Zephyr. And we'll talk more about those in a little while.
But first let's talk about the kernel piece of this. One thing I want to make clear is that this is very definitely a work in progress. We are not done by any means. We have some code that implements some of this stuff. But we're at a good place where, if there are glaring deficiencies that you see or things that we're doing wrong, let us know now. You'll see a set of questions here in a little bit that we have for the community to answer. And hopefully we'll get some of those answers today or at the end of the talk or out in the hallway track. Okay.
So the first aspect of getting the subsystem operational is basically that we assume a bridge device will be discovered on its native bus, say it's connected via PCI or some variant like CXL or CCIX or OpenCAPI or whatever. So that bridge device will be discovered in the normal way using those existing subsystems. And then when it's got its device ready and initialized, it will make a Gen-Z register bridge call, which is the notification to the Gen-Z subsystem that this isn't just some ordinary PCI device, but it wants to talk Gen-Z. That will happen usually during the probe function of that native bridge driver. When the subsystem finds such a bridge, it will be presented with two things. One is the native device for that device on its native bus, as well as a pointer to a structure which has a bunch of function pointers in it. Those function pointers will include callbacks into the bridge driver to let it return bridge information or perform control space reads or writes or maps, data space reads or writes, or control write message, which is a packet type I haven't really talked about, but it's a way to reach management entities out on the fabric. And then, of course, there's an unregister that corresponds.
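A rough sketch of that registration interface, with hypothetical names (the actual prototypes in the linux-genz tree may differ), might look like this:

```c
#include <linux/pci.h>

/* Callbacks the native bridge driver hands to the Gen-Z subsystem.
 * Names and signatures are illustrative only. */
struct genz_bridge_ops {
    int (*bridge_info)(struct device *dev, void *info);
    int (*control_read)(struct device *dev, u64 offset, void *buf, size_t len);
    int (*control_write)(struct device *dev, u64 offset, const void *buf, size_t len);
    int (*control_write_msg)(struct device *dev, const void *msg, size_t len);
    int (*data_read)(struct device *dev, u64 zaddr, void *buf, size_t len);
    int (*data_write)(struct device *dev, u64 zaddr, const void *buf, size_t len);
};

int genz_register_bridge(struct device *native_dev,
                         const struct genz_bridge_ops *ops);
void genz_unregister_bridge(struct device *native_dev);

/* Typically called from the native bus probe routine, e.g. a PCI probe: */
static const struct genz_bridge_ops example_bridge_ops;  /* filled in by the vendor driver */

static int example_bridge_probe(struct pci_dev *pdev,
                                const struct pci_device_id *id)
{
    /* ... normal PCI bring-up of the bridge hardware ... */
    return genz_register_bridge(&pdev->dev, &example_bridge_ops);
}
```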
Up in that upper blue block in the block diagram, for native device registration there will be a Gen-Z register driver function, very much like the PCI version. In fact, I think it has an identical parameter interface. The main difference between device driver registration for Gen-Z versus PCI is that in the PCI world you have vendor and device IDs, and in Gen-Z all the IDs are UUIDs instead. So the matching will be by UUID. And again, there will be a structure for Gen-Z driver registration, and then there will be PCI-like probe, remove, suspend and resume kinds of function pointers. And again, there's an unregister.
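And a correspondingly hedged sketch of the upward-facing driver registration, again with made-up names, where matching is by UUID rather than vendor/device ID:

```c
#include <linux/uuid.h>

struct genz_dev;                       /* device handed to probe()        */

struct genz_device_id {
    uuid_t uuid;                       /* component/service UUID to match */
};

struct genz_driver {
    const char *name;
    const struct genz_device_id *id_table;
    int  (*probe)(struct genz_dev *zdev, const struct genz_device_id *id);
    void (*remove)(struct genz_dev *zdev);
    int  (*suspend)(struct genz_dev *zdev);
    int  (*resume)(struct genz_dev *zdev);
};

int genz_register_driver(struct genz_driver *drv);
void genz_unregister_driver(struct genz_driver *drv);
```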
As I mentioned before, ZMMU and IOMMU management is pretty fundamental to the way Gen-Z works. So we want to centralize control and management of those so that we don't have to have every driver doing the same thing. The subsystem will know about ZMMUs; it will have calls that allow mapping control space or data space ZMMU entries, control space in particular so that we can implement the interface to user space, which we'll talk about in a minute. As I said, we're a work in progress here, so we don't know exactly what this API looks like. But to the extent possible, we're going to try to hide the difference between page grid and page table based ZMMUs. I'm still not convinced that we can do that, but that's the goal. Because Gen-Z can connect to an IOMMU in the system, and because PASIDs appear in the ZMMUs themselves, we're very interested in having some kind of common set of calls that allow us to manage PASIDs. So you can see our first question to the community here in blue, which is: should there be, or can there be, an interface for managing PASIDs? There was a talk earlier, and some hallway conversations we've had since, which I think have convinced us that that is the direction the kernel is heading. So, good. The second question to the community here is about huge pages. It's our understanding that huge pages for device memory are not well supported in the kernel today. And there's a whole host of reasons on the slide here for why we think Gen-Z really could benefit from that. First off, as I mentioned before, there's a huge number of components possible, and each of them can have a huge data space, and if you're trying to map all of those with 4k pages, you're going to be sad. So big pages help solve that problem, or at least mitigate it. Especially in the page grid case, since there are so few PTEs in the device, they tend to have a huge range of page sizes available. The bridge I mentioned before supports everything from 4k to 256 terabyte pages. So we'd like to be able to take advantage of that in the ZMMU. And then the third question on this slide is, again, about IOMMUs, because we've seen patches posted over the last year or so about shared virtual addressing and making a common interface to IOMMUs, and again, we could very much take advantage of that in our subsystem. We can have a bridge, for example, that connects via CXL, and if CXL is implemented by Intel and AMD and Arm CPUs, we could use that same bridge, but the IOMMUs in those platforms are different. We'd like to have a set of common calls that we can make to manage those IOMMUs from the subsystem.
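Since this API is still in flux, here is only a shape sketch, where every name is an assumption, of what a centralized set of ZMMU calls that hides the page-grid versus page-table distinction (and carries the PASID along) might look like:

```c
#include <linux/types.h>

struct genz_bridge;
struct genz_zmmu_map;

/* Map a window of a remote component's data space through the requester
 * ZMMU; the subsystem would pick page-grid or page-table entries, and a
 * page size (huge pages where the hardware offers them), internally. */
struct genz_zmmu_map *genz_req_zmmu_map(struct genz_bridge *br,
                                        u32 dest_gcid, u64 z_addr,
                                        u64 len, u32 rkey, u32 pasid);
void genz_req_zmmu_unmap(struct genz_zmmu_map *map);

/* Expose a window of local memory to the fabric via the responder ZMMU. */
struct genz_zmmu_map *genz_rsp_zmmu_map(struct genz_bridge *br,
                                        u64 local_addr, u64 len,
                                        u32 rkey, u32 pasid);
void genz_rsp_zmmu_unmap(struct genz_zmmu_map *map);
```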
I mentioned data movers earlier in the block diagram for a bridge. We're a bit torn here. Kernel drivers like block or emulated Ethernet NIC drivers would greatly benefit from having a generic data mover interface, so that we could write the code once and call into the subsystem, which could then have interfaces to the underlying bridge driver. That would also be useful for being able to generate packet types in Gen-Z that are hard to do with plain load/store (over something like CXL), or the write message, or some of the more exotic ones like buffer and pattern requests. On the other hand, RDMA drivers, which in the end want to expose the queues and the data mover hardware directly to user space, are going to have to have user space drivers that hide the differences between those queue mechanisms, and so they are not particularly interested in having a common data mover interface. So the question to the community again is, do we think we should work on that or not?
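For discussion's sake, a hedged and entirely hypothetical sketch of what such a generic data-mover interface could look like, if the community decided it was worth building:

```c
#include <linux/types.h>

struct genz_bridge;

struct genz_dm_request {
    u32  dest_gcid;            /* destination component                 */
    u64  z_addr;               /* destination data-space address        */
    u64  local_addr;           /* local DMA address                     */
    u64  len;
    u32  opcode;               /* e.g. write, write-message, buffer ops */
    void (*done)(struct genz_dm_request *req, int status);
};

/* Bridge drivers would supply these; block or emulated-Ethernet drivers
 * would call them through the subsystem instead of open-coding per-bridge
 * queue handling. */
struct genz_dm_ops {
    int (*submit)(struct genz_bridge *br, struct genz_dm_request *req);
    int (*cancel)(struct genz_bridge *br, struct genz_dm_request *req);
};
```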
Interrupts and unsolicited event packets are kind of different in the Gen-Z space. I'll describe unsolicited event packets in a minute, but interrupts themselves are very different. Unlike in PCI, where there's a very nicely architected MSI-X interrupt structure, so you can have common code in the kernel that knows how to manage all of those things and common code that interoperates with the underlying interrupt chip and similar structures in the kernel, that's not how it works in Gen-Z. Every device can have interrupts, but there's no common mechanism for describing them or programming them, so it has to be done on a per-driver basis. Not unlike what was described by Intel in the SR-IOV talk yesterday. So maybe we can leverage something from what they're doing. I don't know yet. Interrupts can come from different places in Gen-Z. There are packets that you send from one OS instance to another, or from any component to any other component, that can carry an interrupt. So that's one source. Interrupts can come from the bridge when the data mover has a completion queue entry that's done, or when some incoming packet arrives at the receive data mover and wants to generate an interrupt. And then there are these things called UEPs, unsolicited event packets, which are the Gen-Z mechanism for some component in the fabric to signal fabric state changes like links going up and down, or hot removal of components, or errors. So we need to have mechanisms to pass those interrupts up into user space if there are user space managers that are handling them. Our proposal is that those UEPs become local interrupts on the targeted bridge component. Those are handled by the subsystem and then forwarded to user space by some mechanism, perhaps Netlink, which we'll talk about here some more in a minute. Okay.
That's kind of the end of our kernel subsystem part of this.
I want to talk about the user space pieces and what the kernel subsystem is presenting to user space to make user space management components work better.
So as I hinted at earlier, Gen-Z discovery is rather different from the way, say, PCI does it, which is all in the kernel: you explore the PCI hierarchy and you assume all your devices are local and owned by the OS. Here, every node running an OS instance on the Gen-Z fabric needs to run a copy of LLaMaS, the local management services process. LLaMaS is going to use Redfish, as I mentioned before, to go and talk to the fabric manager and find out which resources are owned by this OS instance. When it has one of those resources, it's going to make a Netlink call into the kernel Gen-Z subsystem and say, add this component. That's going to cause the subsystem to create new entries in /sys/devices under the path that you see here, and you'll see more of that on a further slide. That causes the resource to appear under the subnet ID and component ID for a given fabric. Once the subsystem creates those /sys devices, then through the usual udev mechanism, that will cause a search for a driver that can bind to the particular UUID that was added as part of that add component command, and we'll get a driver bound to it. A fabric manager node is completely different. The fabric manager is the thing that needs to go out and explore the entire fabric and try to figure out what's actually there. And if you have a grand plan, as I mentioned before, does the grand plan match up with what we actually discovered out there? So it needs to discover those interfaces, it needs to find switches and bridges and media controllers and all the things that are out there. The mechanism we're proposing it use to do that is, again, Netlink, an add fabric component command, which again will cause new tree entries to be added under /sys/bus/genz. And once those sysfs entries are there, then the fabric manager can open the files that it finds in those trees in order to get direct read/write access to the control space structures, like I mentioned in the introductory slides.
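As a sketch of the shape of that Netlink interface, here are hypothetical generic-netlink commands and attributes; the real names in the subsystem and in LLaMaS's Alpaka layer may well differ:

```c
#include <linux/genetlink.h>

/* Commands LLaMaS or a fabric manager would send into the subsystem,
 * plus a kernel-to-user-space event.  Illustrative names only. */
enum {
    GENZ_CMD_UNSPEC,
    GENZ_CMD_ADD_COMPONENT,        /* LLaMaS: resource assigned to this OS */
    GENZ_CMD_REMOVE_COMPONENT,
    GENZ_CMD_ADD_FABRIC_COMPONENT, /* fabric manager: discovered component */
    GENZ_CMD_UEP,                  /* kernel -> user space fabric event    */
};

/* Attributes carried with an add-component request. */
enum {
    GENZ_ATTR_UNSPEC,
    GENZ_ATTR_FABRIC_NUM,
    GENZ_ATTR_GCID,                /* subnet ID + component ID             */
    GENZ_ATTR_CCLASS,              /* component class                      */
    GENZ_ATTR_UUID,                /* used for driver matching via udev    */
    GENZ_ATTR_RESOURCE_LIST,       /* nested: control/data space resources */
};
```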
Yep.
So does this mean if your boot device is out somewhere in the Gen-Z fabric, you will need to bring up LLaMaS in order to boot?
So boot is an interesting thing. Just like in a local machine PCI environment today, in order to boot from some PCI device that's on your machine, you need to have UEFI drivers, assuming you're using a UEFI environment, that correspond to that device. That will be true here in a Gen-Z fabric as well. UEFI will have to be modified; it's going to have a new subsystem.
Once the kernel and initramfs have been loaded, we're going to need to have LLaMaS in the initramfs?
Yes, you're going to need to have LLaMaS in the initramfs. Yes, that is the implication of this design.
All right. Let's see where we were. I think we were about on the last bullet here, which is another question. So Netlink seemed to us to be a pretty good communication mechanism, both to inform the kernel of add and delete component resource commands, because it has this structure that lets you audit the kind of data that's going through, and because it's a bidirectional communication mechanism, so we can send UEPs and other interrupt events back to user space using Netlink as well. So the question to the community is, is that a good choice? We had a hallway conversation yesterday with, what was the name, Jason, kind of Mr. RDMA, who suggested that they chose ioctl for its performance benefit over Netlink. I don't know that this has such stringent performance goals as RDMA does, but maybe that's something to think about instead of Netlink. I don't know. Although I don't know how to use ioctl in kind of a resource-based mechanism.
So here is a very high level, very simplified version of what you might see in /sys/devices on a managed node. Remember that simple six-component topology that I showed; in this example we have the assumption that just a single one of those media components, the one that has been assigned subnet ID zero and component ID two, has been assigned to this node. So LLaMaS has run at this point and done its add command. It's told us that this CID exists. And in /sys you'll see a handful of properties like the component class, the FRU UUID, which I didn't tell you about, so you don't know what that is, and the GCID. And then two memory resources. So we assume that two regions of that memory component number two have been assigned to this OS, and you'll see control and data space regions corresponding to those memory regions as the resources here. There's a symlink from bridge zero that points to its actual native device. In this case we're assuming, on the right-hand side, that this bridge device is connected by PCI. So the first part of that hierarchy is a completely standard PCI representation of a device in /sys/devices. And then attached to that bridge device we'll create a Gen-Z hierarchy, and in there we'll have sysfs directories and binary attribute files corresponding to the control space structures that are visible to that local bridge.
In contrast, on the fabric manager, of course it's running LLaMaS as well, because it's locally managed and has Gen-Z devices, so under /sys/devices on the fabric manager you'll see a hierarchy not unlike the one on the previous slide. But the unique stuff for /sys on the fabric manager is all below /sys/bus/genz, under a fabric zero hierarchy. Of course, if you have multiple bridges you might be connected to multiple fabrics, so that's why there's this fabric zero, fabric one, fabric n in the path here. Again, there are the subnet ID and CID subdirectories. And then for each of the devices out in the fabric that have been discovered, you'll see the control space structures that will allow Zephyr to come in, open some of those binary attribute files, and do reads and writes to cause control space changes or read parameters out of the devices and do the fabric management that it needs to do. And the next blue question for the community here is, does this hierarchy look like something sane to you? Is it consistent with Linux's intended usage of sysfs? One thing that worries me a bit is that, because in the limit you could have 256 million components out on a fabric, that's a whole pile of sysfs files and directories, like more than has probably ever been in sysfs before. So are we going to run into some kind of limitation in the sysfs subsystem just by having so much stuff? Now, realistically, of course, you probably won't have a single fabric manager managing a fabric that big. You'll do some kind of federated thing and divide it up, and there's a whole bunch of stuff in the system and management framework for Gen-Z that describes how you can do all of that. So maybe it's not as bad as I make it out, but in the limit, it seems like a lot of sysfs stuff. Okay.
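Just to give a feel for it, here is a purely hypothetical rendering of that fabric-manager view; the directory and attribute names are invented for illustration, and the real layout is whatever ends up on the slide and in the code:

```
/sys/bus/genz/fabric0/
└── 0000/                    # subnet ID
    └── 002/                 # component ID
        ├── gcid             # illustrative attribute names
        ├── cclass
        └── control/         # binary attribute files the fabric manager
            ├── core         #   opens for control-space reads and writes
            └── interface0
```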
That just about ends our talk.
This is a place where we would normally in a talk have you asking me questions, but before we get to that, for those who want to look at this afterwards, we've summarized that set of kernel questions that were in blue on the earlier slides, so you don't have to go searching for them if you don't want to. You have them all right here.
And finally, I have some references. The consortium web page will let you get access to all the specifications that I mentioned. They're all public. There are, of course, newer versions coming out that aren't public yet, but they'll be out there soon. Our code is on GitHub, linux-genz. At least an early version of the code. We're working hard to get it a little more sane, so don't look at it today. Give us a couple of days and we'll have more of this actually up there. LLaMaS has two GitHub repos: LLaMaS itself is one repo, but it's using a home-grown Netlink interface called Alpaka as well. The thing which you do not see on this list is Zephyr, and that's because we haven't started to work on that. But it will show up here as well. All right. And that's all I have. Anybody have questions?
Yeah, so I see Python 3 and I'm a bit worried. Some environments have very little memory, like, for example, kdump, the crash kernel: you basically reserve just a bit of memory, and you want it to be really small because you basically never use that memory. So what if we boot from a Gen-Z attached disk, the kernel crashed, and we need to dump our memory to the disk? You mentioned that you still have to have LLaMaS in it, and I'm just worried that it will take too much space.
Good point.
So the question is, is it possible to implement, like, a very minimalistic, I don't know, LLaMaS discovery, so you can place it in the kernel? Because if you make it very complex, like, I don't know, it will basically die. No one will be using it.
Understood. So first, I think you should consider this first implementation of LLaMaS to be more of a prototype than anything else. We're doing it in Python because it's easy. It doesn't have to be in Python. You can do it in C or C++ or Go or whatever you like, Ruby. And second, for something like a crash kernel, if you are willing to pre-configure things, for example in a static file, there's nothing that says that LLaMaS actually has to go out over the network and talk to the actual fabric manager. It could have a local configuration file that is kind of a proxy for that.
Any more questions?
Perhaps a little heretical, but given you're building a system that has access to devices all over, would you be looking at asking why bother with some existing technologies? The thing that struck me directly was RDMA, which is to talk to that device way over there through a direct connection, or what looks like a direct connection. Would you be looking at saying, we don't need RDMA anymore, so why bother asking the question about RDMA on the slides? Hence, heretical question.
Yeah, I know. Kind of a philosophical thing. If you were to take Gen-Z to its logical extreme, where we're trying to provide direct load/store access to all those things out there, then I would say yes, RDMA is not necessary in that world. However, there is a huge body of HPC codes, in particular, that are based on MPI and Libfabric and RDMA underneath. We don't want to throw all that away. Gen-Z in some sense isn't better than any one of those technologies that you might mention. It's not better than PCI necessarily. It's not better than InfiniBand necessarily. It's not better than CXL necessarily. But it does do a lot of stuff. We want to bring in, and not leave behind, legacy codes. So I think RDMA is an important use case, and I think it needs to live on top of the Gen-Z subsystem for a long time.
Any other questions?