Greetings,
I have been exploring the possibility space for asynchronous kernel APIs in Coyotos, and it seems like the picture has crystalised enough in my mind to start sharing my thoughts on this. I'm sure there are imperfections, considerations I've missed, or opportunities for simplification that you see; I'd love to know. I hope that something of this kind could make it into seL4 eventually, too.
There are two motivating factors for my interest here. The primary one is liveness and ergonomics. In Coyotos as it stands, you cannot keep executing while waiting for an IPC reply - at least, not unless you start another Process to wait for each message. I am used to E, and I want to design systems in its vein: I can post a message and continue on with my work, perhaps handling messages from other clients, while I wait for the reply. The other motivating factor is performance. The fastest context switch is the one you avoid taking, and while the seL4 core platform has a mechanism for 2-party asynchronous messaging, I would really like all IPC to work this way.
I want to state up front that io_uring is not enough. We capability nerds have been spoiled by protocols such as CapTP and Cap'n Proto. If we have a load of messages directed at one server, we expect all of those messages to be delivered in a single go. We also expect that communicating with more than one server, and communicating with system-implemented objects, is seamless. The 3-party case, where the message target, argument, and sender may all be implemented by different endpoints, should just work, asynchronously.
* Differences
The design space here is different to CapTP, however:
Capabilities are represented by abstract values stored in the application's address space. They are protected by the system, and so do not need to be abstracted as numerical values on the wire. Indeed, there is no wire; the kernel can transfer capabilities safely and without delay.
Coyotos applications must be able to be robust against any behaviour of their caller. They must be able to reject messages, and must not be forced to hold memory on behalf of a caller. They must also be able to avoid missing messages, even if this forces clients to block or be ignored.
I think I've found a way to have our cake and eat it too. There are four primary data structures involved, although these are not analogous to the four tables of CapTP. We'll look at the data structures and then come back and expand upon them.
I apologise that the following is really dry and a lot of text. Maybe I should come up with a better way to illustrate, though I haven't been able to figure out how.
* The Command Ring
One data structure that is fairly easy to describe is the ring onto which system calls are placed. This is a circular ring buffer that the application writes to and the kernel reads from. Both the buffer and the write pointer are mapped read-write into the address space of the application; the read pointer is mapped read-only. The size of the buffer and the page(s) implementing it are fields of the Endpoint.
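To make the sharing arrangement concrete, here is a minimal sketch of the ring header in C. Every name here is an assumption for illustration - none of this is existing Coyotos API. Both pointers are treated as free-running counters, reduced modulo the ring size on access.

```c
#include <stdint.h>

/* Hypothetical layout of the command ring header.  The application
 * advances write_ptr (mapped read-write to it); the kernel advances
 * read_ptr (mapped read-only to the application).  The command slots
 * follow this header in the shared page(s). */
typedef struct cmd_ring {
    uint32_t write_ptr;
    uint32_t read_ptr;
    uint32_t size;       /* capacity in slots; a field of the Endpoint */
} cmd_ring_t;

/* Slots the application may still fill before the ring is full.
 * With free-running counters, (write_ptr - read_ptr) is the number
 * of slots currently in flight, even across wraparound. */
static inline uint32_t
cmd_ring_free(const cmd_ring_t *r)
{
    return r->size - (r->write_ptr - r->read_ptr);
}
```

The split mapping is what makes this safe: the application can only ever claim to have written more, and the kernel can only ever claim to have consumed more.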
One of the commands accepted on this buffer is a variation of the InvokeCap system call which implements only the Send part of the protocol, with some basic compression, eliding unused fields according to ipw0 (or its equivalent). The fields of this system call are designed so that the receiving endpoint, which may be listening synchronously or asynchronously, will receive exactly the same message; indeed, from the outside there is no way to distinguish the two. Three flags are available for this system call. Non Blocking discards the message if it cannot be delivered. Notify Sent places an event on the Response Ring once the send has been processed (handy for cleaning up e.g. indirect strings). Finally, there is a variation of Reply Cap: rather than providing a caploc_t for sndCap[0], a ProtectedPayload is given which will match an entry in the Match Reply table.
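A possible encoding of that command, purely as a strawman - the flag names, bit positions, and header fields are all my invention, not any existing ABI:

```c
#include <stdint.h>

/* Illustrative flag bits for the asynchronous Send command. */
enum async_send_flags {
    ASF_NONBLOCKING = 1u << 0,  /* discard if undeliverable             */
    ASF_NOTIFY_SENT = 1u << 1,  /* post an event to the Response Ring
                                   once the send has been processed     */
    ASF_REPLY_MATCH = 1u << 2,  /* sndCap[0] is given as a
                                   ProtectedPayload that matches a
                                   Match Reply table entry              */
};

/* A compressed command header: only the data words and capability
 * slots actually used follow the header, per the counts here. */
typedef struct async_send_hdr {
    uint8_t  opcode;     /* command discriminator                  */
    uint8_t  flags;      /* combination of async_send_flags        */
    uint8_t  n_data;     /* data words present (ipw0-style elision)*/
    uint8_t  n_caps;     /* capability slots present               */
    uint32_t target;     /* caploc of the Entry cap to invoke      */
    uint64_t reply_pp;   /* valid only if ASF_REPLY_MATCH is set   */
} async_send_hdr_t;
```

The point of the counts in the header is that an entry occupies only as much ring space as the message actually needs.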
Another command implements the receive half of InvokeCap. Specifically, it sets up an entry in the Match Reply table. These entries will be described in detail in that section.
Commands can make use of capabilities that have not yet been received, as if they were promises; such a promise is named by the sender's protected payload, the message number, and the sender's reply cap.
Further commands are supported for manipulating (and in particular, removing entries from) the Match Reply table, the Resolution table, and the Pending Invocations table, as well as for updating the read pointer of the Response Ring and any buffers used for receiving strings or capabilities. The CopyCap system call can also be placed into the command ring.
* The Match Reply Table
There may be more than one Entry capability used to send messages to a given Endpoint. These may be distinguished by having different Protected Payloads, a hidden field carried within the capability. An Endpoint can be configured to ignore messages that don't match a particular Protected Payload, and to increment the Payload on each successful send and each successful receive. This is how one-shot reply capabilities are implemented when communicating with a server in the existing, synchronous Coyotos.
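The one-shot property falls out of the increment: once the payload is bumped, every older Entry capability stops matching. A toy model of that check, with invented names:

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of Protected Payload matching on an Endpoint, as used for
 * one-shot reply capabilities in synchronous Coyotos.  Field names
 * are illustrative only. */
typedef struct endpoint_pp {
    uint64_t payload;        /* currently expected ProtectedPayload    */
    bool     match_payload;  /* ignore messages with a different PP?   */
    bool     bump_on_use;    /* increment the PP on each successful
                                use, revoking older Entry caps         */
} endpoint_pp_t;

/* Returns true if a message carrying `pp` is accepted.  On success
 * the payload may be bumped, so a stale reply cap no longer matches. */
static bool
pp_accept(endpoint_pp_t *ep, uint64_t pp)
{
    if (ep->match_payload && pp != ep->payload)
        return false;
    if (ep->bump_on_use)
        ep->payload++;
    return true;
}
```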
In an asynchronous model, we need a way to describe where incoming data goes. We also need to ensure reliable receipt of replies. The Match Reply Table allows us to configure the endpoint for these conditions.
The Match Reply Table describes what should happen on receipt of a message with a given Protected Payload. An incoming message is matched against each of the records in this table (in what order?) to determine under what conditions it can be received, or if the caller is requested to block. Each record describes:
** How many messages is this record allowed to match? This number is decremented on each successful match.
** What protected payload is expected? (do we need another element for the server to identify this match, or is the protected payload enough? do we need to associate user data with matches to this request?)
** How many data words and capabilities are accepted for this message?
** Does this message accept an indirect string? How long can it be?
** If capabilities or strings are accepted, are they placed in some static location, or into a ring buffer? Which one?
** What space in the receive buffer may not be consumed by this message? For example, server messages must not be received if they would not leave enough space for replies we are waiting for.
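The bullets above can be condensed into a record layout plus an admission check. This is a sketch under my own assumptions - field names, widths, and the exact predicate are all illustrative (in particular, the match budget would be decremented by the system on a successful match, which the read-only check below does not do):

```c
#include <stdint.h>
#include <stdbool.h>

/* One record of the Match Reply Table. */
typedef struct match_reply_rec {
    uint32_t match_budget;   /* messages this record may still match;
                                decremented on each successful match   */
    uint64_t expected_pp;    /* ProtectedPayload this record matches   */
    uint8_t  max_data_words; /* data words accepted                    */
    uint8_t  max_caps;       /* capabilities accepted                  */
    uint32_t max_string;     /* max indirect string length; 0 = none   */
    bool     caps_to_ring;   /* caps/strings go to a ring buffer
                                rather than a static location          */
    uint32_t dest;           /* which static location or ring          */
    uint32_t reserve;        /* response-ring space this message must
                                leave free, e.g. for awaited replies   */
} match_reply_rec_t;

/* Would an incoming message be admitted by this record, given the
 * space currently free in the response ring? */
static bool
match_admits(const match_reply_rec_t *m, uint64_t pp,
             uint8_t data_words, uint8_t caps,
             uint32_t str_len, uint32_t ring_free, uint32_t msg_size)
{
    return m->match_budget > 0
        && pp == m->expected_pp
        && data_words <= m->max_data_words
        && caps <= m->max_caps
        && str_len <= m->max_string
        && ring_free >= msg_size + m->reserve;
}
```

The `reserve` field is what encodes the last bullet: a server message is refused while admitting it would squeeze out space already promised to replies.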
This table is not mapped writable by the application; rather, it is managed by commands placed on the Command Ring. Whether there is any value in the application being able to read it is not yet clear.
It's entirely possible that different records could indicate different Response Rings. It's not clear to me yet how useful this would be, and what protocol would be required to set this up and map it into the application's address space.
* The Response Ring
Messages that are received by the endpoint, matched by the Match Reply table, and deliverable without blocking are placed into the Response Ring. This is a ring that the system writes to and the application reads and consumes. The read pointer for the ring is mapped writable; the ring itself might not be (but that's up for debate). The application can read the write pointer for the buffer to determine where to stop processing messages.
The primary message type on the response ring corresponds to messages received by the endpoint. These messages are compressed in the same fashion as Sends in the Command Ring, taking up only the space required to describe the message: data parameters, references to capabilities sent, and any associated string. Since capabilities received in a server fashion should be emitted into a ring buffer, they may need to be copied elsewhere if they are to be stored for later.
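An application-side consumer loop might look like the following. The one-byte length prefix is an assumed entry format, not anything Coyotos specifies; the important part is the pointer discipline - snapshot the system's write pointer, walk up to it, then publish the new read pointer so the kernel can reuse the space:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical Response Ring header; the system advances write_ptr,
 * the application advances read_ptr. */
typedef struct resp_ring {
    volatile uint32_t write_ptr;
    volatile uint32_t read_ptr;
    uint32_t size;
    uint8_t *data;               /* the mapped ring buffer */
} resp_ring_t;

static void
drain_responses(resp_ring_t *r,
                void (*handle)(const uint8_t *msg, uint32_t len))
{
    uint32_t rd = r->read_ptr;
    const uint32_t wr = r->write_ptr;   /* snapshot; system may add more */
    while (rd != wr) {
        uint32_t off = rd % r->size;
        uint32_t len = r->data[off];    /* assumed length-prefixed entry */
        if (len == 0)
            break;                      /* corrupt entry; bail out       */
        if (handle)
            handle(&r->data[off], len);
        rd += len;
    }
    r->read_ptr = rd;                   /* frees the space for the kernel */
}
```

A real implementation would need the appropriate memory barriers between reading the write pointer and reading entry contents; they are omitted here for brevity.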
Capability fields in the Response Ring may be promises. The system allows promises here under one condition only: when the promise is intended to be fulfilled in the return value of an earlier message. When the system processes a matching send, it updates any promises that would be resolved, removing the need to hold capabilities just in case a sender uses them later.
* The Resolution Table
This table maps promises to their resolved values. When a sent or received message resolves a promise that someone is relying on, its value is recorded in this table. This is a shorthand for updating all of the dependent values, which happens at a later point in time.
* The Pending Invocations Table
The function of promises in this model is more limited than in CapTP. Effectively, promises cross an address space boundary in one case only: when the endpoint the promise is sent to is the one that resolves it. The only messages sent are those that are ready to be processed by the receiver - either because none of the parameters are promises, or because the promise refers to a message the receiver is expected to provide via a reply cap.
The intention of this distinction is that servers do not need to allocate space on behalf of clients who may never fulfil a needed promise, and do not need to use guards to ensure that their arguments refer to concrete values. However, this does not reduce the expressive power of the system: the pending invocations table picks up the slack.
An attempt to send a message containing (or directed to) one or more unresolved promises is instead placed into the pending invocations table. Such entries are linked into an ordered list to be consulted when the system is preparing to handle commands.
* Blocking?
Given that we have only described fixed-size structures so far, there's always the possibility that we can't send another message. The command ring could be full; the receiver's response ring may be full, or we may be sending to a synchronous endpoint that is not currently being listened to. In this case, a process can make a synchronous InvokeCap system call to the endpoint to be notified when there are messages that can be processed. This is a nice variation on the yield system call, in that it indicates the condition of interest. The exact nature of this method is to be determined.
Applications need to be able to deal with the fact that there is a limited number of messages that can be waiting to be processed. I don't think this will be a big deal in practice; most systems place fairly low limits on the number of things a thread can wait for, relative to what we could support.
On a 64-bit platform, for example, if the two rings were one 4096-byte page each, and every message used all of the data, capability, and string fields - around 22 bytes per compressed entry - each page could hold 4096 / 22 = 186 messages; far more than select(), with much more utility.
* The System Perspective
How does the system process asynchronous messages?
It seems sensible to process the Command Ring before running its associated process, but other schedules are possible; it might even be reasonable to run asynchronous endpoint processing on a different core. For now we'll avoid specifying when and how this processing is scheduled, except to say that it is probably more important to process commands than to execute the program, as this frees up room in the command ring for the program to use.
Processing starts by consulting the resolution table. If there are promises with values recorded in this table, they should be processed first. Entries referred to only by the response ring can be left alone, but the pending invocations table must be walked: if a pending invocation refers to a promise that has now been resolved, update the invocation to refer to the real value, and if the invocation then depends on no promises, send the message and clear the entry from the table. Once a promise has no remaining dependents (notably, there may be dependents in the response ring), delete it from the resolution table.
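The core of that pass - resolving a promise and releasing whatever invocations it was blocking - can be modelled in a few lines. This is a toy with invented structures (a bitmask of promise ids and a count of unresolved dependencies), not kernel code:

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy model of a pending-invocations entry: it counts the promises
 * it still awaits, identified by bits in promise_mask. */
typedef struct pending_inv {
    bool     live;
    uint32_t unresolved;      /* promises this invocation still awaits */
    uint32_t promise_mask;    /* which promise ids it depends on       */
} pending_inv_t;

/* Resolve promise `id` across the table; an invocation whose last
 * promise resolves becomes sendable, and its entry is cleared.
 * Returns how many invocations became sendable. */
static int
resolve_promise(pending_inv_t tbl[], int n, uint32_t id)
{
    int sendable = 0;
    for (int i = 0; i < n; i++) {
        if (tbl[i].live && (tbl[i].promise_mask & (1u << id))) {
            tbl[i].promise_mask &= ~(1u << id);
            if (--tbl[i].unresolved == 0) {
                tbl[i].live = false;   /* message sent, entry cleared */
                sendable++;
            }
        }
    }
    return sendable;
}
```

Resolving one promise can thus cascade: each invocation it releases may in turn resolve further promises when its message is sent.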
Once the resolution table and pending invocation table have been processed, the command ring can be processed.
Sending a message proceeds by preparing the invokee and pages containing parameter data. A lock is taken on the target endpoint, or in the case of an asynchronous target, the endpoint's receive lock is taken.
If the send is blocking and the target endpoint is not ready to receive - perhaps because it is a synchronous endpoint whose process is not in the receiving state, or is in a closed wait on another endpoint, or because its response ring has no space - the message is placed into the pending invocations table and parked on the given endpoint. This places it into a queue of processes and asynchronous endpoints waiting to send to that endpoint.
If the send is nonblocking and the target is not ready to receive, discard the message.
Otherwise, the message can be processed. If the target endpoint is synchronous, the message is delivered like any other synchronous IPC: by filling out the processor registers, copying any string and caps, and marking the process as running. Otherwise, the message is compared against the receiver's Match Reply Table; if the conditions match and there is sufficient space in the response ring and in the buffers indicated by the match reply record, the message is delivered and the lock on the target endpoint is released. Once this is done, the read pointer on the sender's command ring is updated.
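The three-way outcome described in the last few paragraphs is just this decision, condensed (a sketch of the control flow only, not kernel code):

```c
#include <stdbool.h>

typedef enum { DELIVER, PARK, DISCARD } send_outcome_t;

/* A ready target gets the message; otherwise a blocking send is
 * parked on the endpoint's queue (via the pending invocations
 * table) and a non-blocking send is simply dropped. */
static send_outcome_t
classify_send(bool target_ready, bool blocking)
{
    if (target_ready)
        return DELIVER;   /* fill registers, or match against the table */
    return blocking ? PARK : DISCARD;
}
```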
The system can keep doing this until the sender's command ring read pointer reaches its submit pointer. An implementation may choose how to handle a message that crosses the submit pointer, but a reliable system should detect that condition and either fault the process or send a message indicating the problem. While a fault sounds like the more sensible response, we do not know what instruction the process might be executing - it may even be running on a different CPU - so it would be a unique position for the process fault handler to find itself in. Perhaps that is still preferable.
* Some other use cases
The one thing I miss having in Coyotos is a way to receive a list - that is to say, "send me all of the block devices mapped into that address space" - and to receive messages until all of them have been described. A similar situation is wanting to subscribe to a stream that includes data and capabilities. I feel that the functionality described here is sufficient for every use case I have come up with, at least within bounded space. However, if there's something I've missed, I'd like to hear that too.
Thank you for reading my rambly thoughts on this one.
--
William Leslie
Q: What is your boss's password?
A: "Authentication", clearly
Notice:
Likely much of this email is, by the nature of copyright, covered under copyright law. You absolutely MAY reproduce any part of it in accordance with the copyright law of the nation you are reading this in. Any attempt to DENY YOU THOSE RIGHTS would be illegal without prior contractual agreement.