Greetings,
I have been exploring the possibility space for asynchronous kernel APIs in Coyotos, and it seems like the picture has crystalised enough in my mind to start sharing my thoughts on this. I'm sure there are imperfections, considerations I've missed, or opportunities for simplification that you see; I'd love to know. I hope that something of this kind could make it into seL4 eventually, too.
There are two motivating factors for my interest here. The primary one is liveness and ergonomics. In Coyotos as it stands, you cannot keep executing while waiting for an IPC reply - at least, not unless you start another Process to wait for each message. I am used to E, and I want to design systems in its vein: I can post a message and continue on with my work, perhaps handling messages from other clients, while I wait for the reply. The other motivating factor is performance. The fastest context switch is the one you avoid taking, and while the seL4 core platform has a mechanism for 2-party asynchronous messaging, I would really like all IPC to work this way.
I want to state up front that io_uring is not enough. We capability nerds have been spoiled by protocols such as CapTP and Cap'n Proto. If we have a load of messages directed at one server, we expect all of those messages to be delivered in a single go. We also expect that communicating with more than one server, and communicating with system-implemented objects, is seamless. The 3-party case, where the message target, argument, and sender may all be implemented by different endpoints, should just work, asynchronously.
* Differences
The design space here is different to CapTP, however:
Capabilities are represented by abstract values stored in the application's address space. They are protected by the system, and so do not need to be abstracted as numerical values on the wire. Indeed, there is no wire; the kernel can transfer capabilities safely and without delay.
Coyotos applications must be able to be robust against any behaviour of their caller. They must be able to reject messages, and must not be forced to hold memory on behalf of a caller. They must also be able to avoid missing messages, even if this forces clients to block or be ignored.
I think I've found a way to have our cake and eat it too. There are four primary data structures involved, although these are not analogous to the four tables of CapTP. We'll look at the data structures and then come back and expand upon them.
I apologise that the following is really dry and a lot of text. Maybe I should come up with a better way to illustrate, though I haven't been able to figure out how.
* The Command Ring
One data structure that is fairly easy to describe is the ring onto which system calls are placed. This is a circular ring buffer that the application writes to and the kernel reads from. Both the buffer and the write pointer are mapped read-write into the address space of the application; the read pointer is mapped read-only. The size of the buffer and the page(s) implementing it are fields of the Endpoint.
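To make the sharing arrangement concrete, here is a minimal sketch of the ring header in C. Every name here is an assumption for illustration - none of this is existing Coyotos API. Both pointers are treated as free-running counters, reduced modulo the ring size on access.

```c
#include <stdint.h>

/* Hypothetical layout of the command ring header.  The application
 * advances write_ptr (mapped read-write to it); the kernel advances
 * read_ptr (mapped read-only to the application).  The command slots
 * follow this header in the shared page(s). */
typedef struct cmd_ring {
    uint32_t write_ptr;
    uint32_t read_ptr;
    uint32_t size;       /* capacity in slots; a field of the Endpoint */
} cmd_ring_t;

/* Slots the application may still fill before the ring is full.
 * With free-running counters, (write_ptr - read_ptr) is the number
 * of slots currently in flight, even across wraparound. */
static inline uint32_t
cmd_ring_free(const cmd_ring_t *r)
{
    return r->size - (r->write_ptr - r->read_ptr);
}
```

The split mapping is what makes this safe: the application can only ever claim to have written more, and the kernel can only ever claim to have consumed more.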
One of the commands accepted on this buffer is a variation of the InvokeCap system call which implements only the Send part of the protocol, with some basic compression, eliding unused fields according to ipw0 (or its equivalent). The fields of this system call are designed so that the receiving endpoint, which may be listening synchronously or asynchronously, will receive exactly the same message; indeed, from the outside there is no way to distinguish the two. Three flags are available for this system call. Non Blocking discards the message if it cannot be delivered. Notify Sent places an event on the Response Ring once the send has been processed (handy for cleaning up e.g. indirect strings). Finally, there is a variation of Reply Cap: rather than providing a caploc_t for sndCap[0], a ProtectedPayload is given which will match an entry in the Match Reply table.
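A possible encoding of that command, purely as a strawman - the flag names, bit positions, and header fields are all my invention, not any existing ABI:

```c
#include <stdint.h>

/* Illustrative flag bits for the asynchronous Send command. */
enum async_send_flags {
    ASF_NONBLOCKING = 1u << 0,  /* discard if undeliverable             */
    ASF_NOTIFY_SENT = 1u << 1,  /* post an event to the Response Ring
                                   once the send has been processed     */
    ASF_REPLY_MATCH = 1u << 2,  /* sndCap[0] is given as a
                                   ProtectedPayload that matches a
                                   Match Reply table entry              */
};

/* A compressed command header: only the data words and capability
 * slots actually used follow the header, per the counts here. */
typedef struct async_send_hdr {
    uint8_t  opcode;     /* command discriminator                  */
    uint8_t  flags;      /* combination of async_send_flags        */
    uint8_t  n_data;     /* data words present (ipw0-style elision)*/
    uint8_t  n_caps;     /* capability slots present               */
    uint32_t target;     /* caploc of the Entry cap to invoke      */
    uint64_t reply_pp;   /* valid only if ASF_REPLY_MATCH is set   */
} async_send_hdr_t;
```

The point of the counts in the header is that an entry occupies only as much ring space as the message actually needs.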
Another command implements the receive half of InvokeCap. Specifically, it sets up an entry in the Match Reply table. These entries will be described in detail in that section.
Commands can make use of capabilities that have not yet been received, as if they were promises; such a promise is named by the sender's protected payload, the message number, and the sender's reply cap.
Further commands are supported for manipulating (and in particular, removing entries from) the Match Reply table, the Resolution table, and the Pending Invocations table, as well as for updating the read pointer of the Response Ring and any buffers used for receiving strings or capabilities. The CopyCap system call can also be placed into the command ring.
* The Match Reply Table
There may be more than one Entry capability used to send messages to a given Endpoint. These may be distinguished by having different Protected Payloads, a hidden field carried within the capability. An Endpoint can be configured to ignore messages that don't match a particular Protected Payload, and to increment the Payload on each successful send and each successful receive. This is how one-shot reply capabilities are implemented when communicating with a server in the existing, synchronous Coyotos.
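The one-shot property falls out of the increment: once the payload is bumped, every older Entry capability stops matching. A toy model of that check, with invented names:

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of Protected Payload matching on an Endpoint, as used for
 * one-shot reply capabilities in synchronous Coyotos.  Field names
 * are illustrative only. */
typedef struct endpoint_pp {
    uint64_t payload;        /* currently expected ProtectedPayload    */
    bool     match_payload;  /* ignore messages with a different PP?   */
    bool     bump_on_use;    /* increment the PP on each successful
                                use, revoking older Entry caps         */
} endpoint_pp_t;

/* Returns true if a message carrying `pp` is accepted.  On success
 * the payload may be bumped, so a stale reply cap no longer matches. */
static bool
pp_accept(endpoint_pp_t *ep, uint64_t pp)
{
    if (ep->match_payload && pp != ep->payload)
        return false;
    if (ep->bump_on_use)
        ep->payload++;
    return true;
}
```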
In an asynchronous model, we need a way to describe where incoming data goes. We also need to ensure reliable receipt of replies. The Match Reply Table allows us to configure the endpoint for these conditions.
The Match Reply Table describes what should happen on receipt of a message with a given Protected Payload. An incoming message is matched against each of the records in this table (in what order?) to determine under what conditions it can be received, or if the caller is requested to block. Each record describes:
** How many messages is this record allowed to match? This number is decremented on each successful match.
** What protected payload is expected? (do we need another element for the server to identify this match, or is the protected payload enough? do we need to associate user data with matches to this request?)
** How many data words and capabilities are accepted for this message?
** Does this message accept an indirect string? How long can it be?
** If capabilities or strings are accepted, are they placed in some static location, or into a ring buffer? Which one?
** What space in the receive buffer may not be consumed by this message? For example, server messages must not be received if they would not leave enough space for replies we are waiting for.
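The bullets above can be condensed into a record layout plus an admission check. This is a sketch under my own assumptions - field names, widths, and the exact predicate are all illustrative (in particular, the match budget would be decremented by the system on a successful match, which the read-only check below does not do):

```c
#include <stdint.h>
#include <stdbool.h>

/* One record of the Match Reply Table. */
typedef struct match_reply_rec {
    uint32_t match_budget;   /* messages this record may still match;
                                decremented on each successful match   */
    uint64_t expected_pp;    /* ProtectedPayload this record matches   */
    uint8_t  max_data_words; /* data words accepted                    */
    uint8_t  max_caps;       /* capabilities accepted                  */
    uint32_t max_string;     /* max indirect string length; 0 = none   */
    bool     caps_to_ring;   /* caps/strings go to a ring buffer
                                rather than a static location          */
    uint32_t dest;           /* which static location or ring          */
    uint32_t reserve;        /* response-ring space this message must
                                leave free, e.g. for awaited replies   */
} match_reply_rec_t;

/* Would an incoming message be admitted by this record, given the
 * space currently free in the response ring? */
static bool
match_admits(const match_reply_rec_t *m, uint64_t pp,
             uint8_t data_words, uint8_t caps,
             uint32_t str_len, uint32_t ring_free, uint32_t msg_size)
{
    return m->match_budget > 0
        && pp == m->expected_pp
        && data_words <= m->max_data_words
        && caps <= m->max_caps
        && str_len <= m->max_string
        && ring_free >= msg_size + m->reserve;
}
```

The `reserve` field is what encodes the last bullet: a server message is refused while admitting it would squeeze out space already promised to replies.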
This table is not mapped writable by the application; rather, it is managed by commands placed on the Command Ring. Whether there is any value in the application being able to read it is not yet clear.
It's entirely possible that different records could indicate different Response Rings. It's not clear to me yet how useful this would be, and what protocol would be required to set this up and map it into the application's address space.
* The Response Ring
Messages that are received by the endpoint, matched by the Match Reply table, and deliverable without blocking are placed into the Response Ring. This is a ring that the system writes to and the application reads and consumes. The read pointer for the ring is mapped writable; the ring itself might not be (but that's up for debate). The application can read the write pointer for the buffer to determine where to stop processing messages.
The primary message type on the response ring corresponds to messages received by the endpoint. These messages are compressed in the same fashion as Sends in the Command Ring, taking up only the space required to describe the message: data parameters, references to capabilities sent, and any associated string. Since capabilities received in a server fashion should be emitted into a ring buffer, they may need to be copied elsewhere if they are to be stored for later.
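An application-side consumer loop might look like the following. The one-byte length prefix is an assumed entry format, not anything Coyotos specifies; the important part is the pointer discipline - snapshot the system's write pointer, walk up to it, then publish the new read pointer so the kernel can reuse the space:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical Response Ring header; the system advances write_ptr,
 * the application advances read_ptr. */
typedef struct resp_ring {
    volatile uint32_t write_ptr;
    volatile uint32_t read_ptr;
    uint32_t size;
    uint8_t *data;               /* the mapped ring buffer */
} resp_ring_t;

static void
drain_responses(resp_ring_t *r,
                void (*handle)(const uint8_t *msg, uint32_t len))
{
    uint32_t rd = r->read_ptr;
    const uint32_t wr = r->write_ptr;   /* snapshot; system may add more */
    while (rd != wr) {
        uint32_t off = rd % r->size;
        uint32_t len = r->data[off];    /* assumed length-prefixed entry */
        if (len == 0)
            break;                      /* corrupt entry; bail out       */
        if (handle)
            handle(&r->data[off], len);
        rd += len;
    }
    r->read_ptr = rd;                   /* frees the space for the kernel */
}
```

A real implementation would need the appropriate memory barriers between reading the write pointer and reading entry contents; they are omitted here for brevity.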
Capability fields in the Response Ring may be promises. The system allows promises here under one condition only: when the promise is intended to be fulfilled in the return value of an earlier message. When the system processes a matching send, it updates any promises that would be resolved, removing the need to hold capabilities just in case a sender uses them later.
* The Resolution Table
This table maps promises to their resolved values. When a sent or received message resolves a promise that someone is relying on, its value is recorded in this table. This is a shorthand for updating all of the dependent values, which happens at a later point in time.
* The Pending Invocations Table
The function of promises in this model is more limited than in CapTP. Effectively, promises cross an address space boundary in one case only: when the endpoint the promise is sent to is the one that resolves it. The only messages sent are those that are ready to be processed by the receiver - either because none of the parameters are promises, or because the promise refers to a message the receiver is expected to provide via a reply cap.
The intention of this distinction is that servers do not need to allocate space on behalf of clients who may never fulfil a needed promise, and do not need to use guards to ensure that their arguments refer to concrete values. However, this does not reduce the expressive power of the system: the pending invocations table picks up the slack.
An attempt to send a message containing (or directed to) one or more unresolved promises is instead placed into the pending invocations table. Such entries are linked into an ordered list to be consulted when the system is preparing to handle commands.
* Blocking?
Given that we have only described fixed-size structures so far, there's always the possibility that we can't send another message. The command ring could be full; the receiver's response ring may be full, or we may be sending to a synchronous endpoint that is not currently being listened to. In this case, a process can make a synchronous InvokeCap system call to the endpoint to be notified when there are messages that can be processed. This is a nice variation on the yield system call, in that it indicates the condition of interest. The exact nature of this method is to be determined.
Applications need to be able to deal with the fact that there is a limited number of messages that can be waiting to be processed. I don't think this will be a big deal in practice; most systems place fairly low limits on the number of things a thread can wait for, relative to what we could support.
On a 64-bit platform, for example, if the two rings were one 4096-byte page each, and every message used all of the data, capability, and string fields - around 22 bytes per compressed entry - each page could hold 4096 / 22 = 186 messages; far more than select(), with much more utility.
* The System Perspective
How does the system process asynchronous messages?
It seems sensible to process the Command Ring before running its associated process, but other schedules are possible; it might even be reasonable to run asynchronous endpoint processing on a different core. For now we'll avoid specifying when and how this processing is scheduled, except to say that it is probably more important to process commands than to execute the program, as this frees up room in the command ring for the program to use.
Processing starts by consulting the resolution table. If there are promises with values recorded in this table, they should be processed first. Entries referred to only by the response ring can be left alone, but the pending invocations table must be walked: if a pending invocation refers to a promise that has now been resolved, update the invocation to refer to the real value, and if the invocation then depends on no promises, send the message and clear the entry from the table. Once a promise has no remaining dependents (notably, there may be dependents in the response ring), delete it from the resolution table.
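The core of that pass - resolving a promise and releasing whatever invocations it was blocking - can be modelled in a few lines. This is a toy with invented structures (a bitmask of promise ids and a count of unresolved dependencies), not kernel code:

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy model of a pending-invocations entry: it counts the promises
 * it still awaits, identified by bits in promise_mask. */
typedef struct pending_inv {
    bool     live;
    uint32_t unresolved;      /* promises this invocation still awaits */
    uint32_t promise_mask;    /* which promise ids it depends on       */
} pending_inv_t;

/* Resolve promise `id` across the table; an invocation whose last
 * promise resolves becomes sendable, and its entry is cleared.
 * Returns how many invocations became sendable. */
static int
resolve_promise(pending_inv_t tbl[], int n, uint32_t id)
{
    int sendable = 0;
    for (int i = 0; i < n; i++) {
        if (tbl[i].live && (tbl[i].promise_mask & (1u << id))) {
            tbl[i].promise_mask &= ~(1u << id);
            if (--tbl[i].unresolved == 0) {
                tbl[i].live = false;   /* message sent, entry cleared */
                sendable++;
            }
        }
    }
    return sendable;
}
```

Resolving one promise can thus cascade: each invocation it releases may in turn resolve further promises when its message is sent.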
Once the resolution table and pending invocation table have been processed, the command ring can be processed.
Sending a message proceeds by preparing the invokee and pages containing parameter data. A lock is taken on the target endpoint, or in the case of an asynchronous target, the endpoint's receive lock is taken.
If the send is blocking and the target endpoint is not ready to receive - perhaps because it is a synchronous endpoint whose process is not in the receiving state, or is in a closed wait on another endpoint, or because its response ring has no space - the message is placed into the pending invocations table and parked on the given endpoint. This places it into a queue of processes and asynchronous endpoints waiting to send to that endpoint.
If the send is nonblocking and the target is not ready to receive, discard the message.
Otherwise, the message can be processed. If the target endpoint is synchronous, the message is delivered like any other synchronous IPC: by filling out the processor registers, copying any string and caps, and marking the process as running. Otherwise, the message is compared against the receiver's Match Reply Table; if the conditions match and there is sufficient space in the response ring and in the buffers indicated by the match reply record, the message is delivered and the lock on the target endpoint is released. Once this is done, the read pointer on the sender's command ring is updated.
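The three-way outcome described in the last few paragraphs is just this decision, condensed (a sketch of the control flow only, not kernel code):

```c
#include <stdbool.h>

typedef enum { DELIVER, PARK, DISCARD } send_outcome_t;

/* A ready target gets the message; otherwise a blocking send is
 * parked on the endpoint's queue (via the pending invocations
 * table) and a non-blocking send is simply dropped. */
static send_outcome_t
classify_send(bool target_ready, bool blocking)
{
    if (target_ready)
        return DELIVER;   /* fill registers, or match against the table */
    return blocking ? PARK : DISCARD;
}
```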
The system can keep doing this until the sender's command ring read pointer reaches its submit pointer. An implementation may choose how to handle a message that crosses the submit pointer, but a reliable system should detect that condition and either fault the process or send a message indicating the problem. While a fault sounds like the more sensible response, we do not know what instruction the process might be executing - it may even be running on a different CPU - so it would be a unique position for the process fault handler to find itself in. Perhaps that is still preferable.
* Some other use cases
The one thing I miss having in Coyotos is a way to receive a list - that is to say, "send me all of the block devices mapped into that address space" - and to receive messages until all of them have been described. A similar situation is wanting to subscribe to a stream that includes data and capabilities. I feel that the functionality described here is sufficient for every use case I have come up with, at least within bounded space. However, if there's something I've missed, I'd like to hear that too.
Thank you for reading my rambly thoughts on this one.
--
William Leslie
Q: What is your boss's password?
A: "Authentication", clearly
Notice:
Likely much of this email is, by the nature of copyright, covered under copyright law. You absolutely MAY reproduce any part of it in accordance with the copyright law of the nation you are reading this in. Any attempt to DENY YOU THOSE RIGHTS would be illegal without prior contractual agreement.