Proxying vs. Scheduling, Real-time vs. components


Jonathan S. Shapiro

Oct 25, 2025, 7:19:32 PM
to cap-talk
I had a mildly interesting idea today while writing up scheduling challenges in Coyotos. The 30,000 foot problem is that communication across schedule boundaries is really hard to do without violating scheduling reservation contracts.

Imagine a situation where some real-time component needs to talk to (e.g.) a display compositor, whose schedule probably has a lower level of service than its real-time caller. It seems to me that there are basically three ways to deal with this:
  1. Bounded buffering, where display operations will be dropped if the buffer isn't drained in a timely way, or
  2. "Donating" a process pre-configured with a real-time schedule for use by the shared service, or
  3. "Lending" the real-time schedule at IPC time by sending a schedule capability for the service to use directly.
The first two introduce significant recovery complexities if the buffer is revoked (invalid reference in shared server after revocation), though we managed to sort out the memory revocation problem in the EROS high-speed networking work.

If the Process is revoked, the shared service needs a way to notice, and then to clean up after a thread of control that vanished in mid-execution. Which is much harder than recovering from a revoked buffer.

Lending a schedule capability seems cleaner, but then we have the problem that the service might "leak" that capability - which upon use would violate the reservation requirements of the client.


It eventually occurred to me that a mutually trusted intermediary might work. The service client provides the schedule capability, the service provides the storage for a new process using this capability (and is therefore protected from revocation), and a fault handler is installed in such a way that if the real-time schedule is revoked a more pedestrian "backup" schedule can be inserted to allow cleanup before the process is destroyed. Following which, a Coyotos notice (a single bit in-band signal that gets delivered at next receive-wait) can be used to tell the process that it's time to retire.
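
Spelled out, the intermediary's job would look something like the sketch below. Every name here is invented for illustration; none of it is the actual Coyotos API.

    /* Hypothetical sketch of the mutually trusted intermediary.  All of
     * these calls are invented stand-ins for capability invocations. */
    typedef unsigned long Cap;        /* stand-in for a capability slot */

    typedef struct {
      Cap worker;                     /* process built from the service's storage */
      Cap backupSchedule;             /* pedestrian schedule, used only for cleanup */
    } Proxy;

    enum { NOTICE_RETIRE = 1 };

    /* Invented wrappers: */
    Cap  process_create(Cap bank, Cap codeSpace);
    Cap  schedule_lowPriority(void);
    Cap  self_entryCap(void);
    void process_setSchedule(Cap proc, Cap sched);
    void process_setFaultHandler(Cap proc, Cap handlerEntry);
    void notice_send(Cap proc, unsigned notice);

    /* The client supplies the real-time schedule capability; the service
     * supplies the space bank (storage) that pays for the worker. */
    Proxy proxy_create(Cap clientSchedule, Cap serviceBank, Cap serviceCode)
    {
      Proxy p;
      p.worker         = process_create(serviceBank, serviceCode);
      p.backupSchedule = schedule_lowPriority();
      process_setSchedule(p.worker, clientSchedule);
      process_setFaultHandler(p.worker, self_entryCap()); /* we take the faults */
      return p;
    }

    /* Fault-handler path: the real-time schedule has been revoked. */
    void on_schedule_revoked(Proxy *p)
    {
      process_setSchedule(p->worker, p->backupSchedule);  /* allow cleanup */
      notice_send(p->worker, NOTICE_RETIRE);    /* delivered at next receive-wait */
      /* destroying the worker's storage comes afterwards */
    }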

It feels heavy, but like the proverbial dancing bear, it's amazing that it dances at all.

Anyway, I'm curious what people think about this.


Jonathan

Raoul Duke

Oct 25, 2025, 7:24:51 PM
to cap-...@googlegroups.com
Can't just drop things like draw instructions, because you'd get invalid output, i.e. glitchy, unusable display results, no?

Jonathan S. Shapiro

Oct 25, 2025, 9:48:05 PM
to cap-...@googlegroups.com
On Sat, Oct 25, 2025 at 4:24 PM Raoul Duke <rao...@gmail.com> wrote:
Can't just drop things like draw instructions, because you'd get invalid output, i.e. glitchy, unusable display results, no?

If the choice is between dropping the draw instructions and giving the CAT scan patient cancer because you blocked when you should have been turning off the high-power X-rays, you definitely drop the draw instructions.

The reason hard real time scheduling is called "hard" isn't just because the schedules are non-negotiable. It's because the consequences of missing the schedules are both unrecoverable and catastrophic. If it doesn't meet that test, it probably shouldn't be hard real-time.
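
Just to make the bounded-buffer option concrete: it's nothing fancier than a drop-on-full ring between the real-time producer and the best-effort compositor. Purely illustrative; memory barriers and the sharing mechanics are omitted.

    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SLOTS 256u                    /* power of two */

    typedef struct { uint32_t op, x, y, arg; } DrawCmd;

    typedef struct {
      DrawCmd  slot[RING_SLOTS];
      uint32_t head;                           /* written by producer only */
      uint32_t tail;                           /* written by consumer only */
    } DrawRing;

    /* Real-time side: O(1), never waits on the compositor. */
    bool ring_push(DrawRing *r, DrawCmd c)
    {
      if (r->head - r->tail == RING_SLOTS)
        return false;                          /* full: drop, don't block */
      r->slot[r->head % RING_SLOTS] = c;
      r->head++;
      return true;
    }

    /* Compositor side: drain whatever has arrived, whenever it gets time. */
    bool ring_pop(DrawRing *r, DrawCmd *out)
    {
      if (r->tail == r->head)
        return false;                          /* empty */
      *out = r->slot[r->tail % RING_SLOTS];
      r->tail++;
      return true;
    }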


Cheers!


Jonathan

Raoul Duke

Oct 25, 2025, 10:57:45 PM
to cap-...@googlegroups.com
I was not trying to say the realtime part should suffer. Just that it appears to me to be kind of by definition an intractable impedance mismatch in some situations. 


Jonathan S. Shapiro

Oct 25, 2025, 11:07:45 PM
to cap-...@googlegroups.com
On Sat, Oct 25, 2025 at 7:57 PM Raoul Duke <rao...@gmail.com> wrote:
I was not trying to say the realtime part should suffer. Just that it appears to me to be kind of by definition an intractable impedance mismatch in some situations.

Yes. That's the point. Real-time subsystems often have to communicate with non-real-time subsystems. At that boundary, you can do pretty well with a best-effort boundary solution. But there's always going to be some traffic that exceeds what you planned for, or some event that throws off your planning. At that point you have to have a solution that doesn't block the critical subsystem.

There are definitely ways to do better than the buffering solution I sketched, but they add up to a complex solution whose development cost probably isn't justified.


Jonathan

Raoul Duke

Oct 25, 2025, 11:26:49 PM
to cap-...@googlegroups.com
Right. I feel like a table that breaks down the cases would be good, and it presumably already exists in several textbooks, so I'm not saying anything new of course. Distinguish between what is not possible, what is OK as best-effort, and what must never get behind. A display that is driven by temporally applying draw commands could drop things in some use cases (e.g. a video game), buffer and apply everything in others (if there are just occasional spikes of updates, so it is known that it can recover), or else unfortunately exhibit unacceptable behavior (if the display should never be wrong or glitchy or a lie).

William ML Leslie

Oct 26, 2025, 6:30:09 AM
to cap-...@googlegroups.com
So you are creating a process specifically for use by this client?  That definitely makes the problem easier to manage.

There's no interface to get notified of a schedule being revoked yet, but there's no reason you couldn't make one.

If the callee doesn't have a schedule configured by default, it should inherit the caller's schedule until it responds.  Are there situations where you don't want to use that mechanism, or where that mechanism isn't sufficient for you?

--
William ML Leslie

Jonathan S. Shapiro

Oct 26, 2025, 9:17:13 PM
to cap-...@googlegroups.com
On Sun, Oct 26, 2025 at 3:30 AM William ML Leslie <william.l...@gmail.com> wrote:
On Sun, 26 Oct 2025 at 09:19, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
I had a mildly interesting idea today while writing up scheduling challenges in Coyotos. The 30,000 foot problem is that communication across schedule boundaries is really hard to do without violating scheduling reservation contracts.

Imagine a situation where some real-time component needs to talk to (e.g.) a display compositor, whose schedule probably has a lower level of service than its real-time caller. It seems to me that there are basically three ways to deal with this:
  1. Bounded buffering, where display operations will be dropped if the buffer isn't drained in a timely way, or
  2. "Donating" a process pre-configured with a real-time schedule for use by the shared service, or
  3. "Lending" the real-time schedule at IPC time by sending a schedule capability for the service to use directly.
The first two introduce significant recovery complexities if the buffer is revoked (invalid reference in shared server after revocation), though we managed to sort out the memory revocation problem in the EROS high-speed networking work.

If the Process is revoked, the shared service needs a way to notice, and then to clean up after a thread of control that vanished in mid-execution. Which is much harder than recovering from a revoked buffer.

Lending a schedule capability seems cleaner, but then we have the problem that the service might "leak" that capability - which upon use would violate the reservation requirements of the client.


It eventually occurred to me that a mutually trusted intermediary might work. The service client provides the schedule capability, the service provides the storage for a new process using this capability (and is therefore protected from revocation), and a fault handler is installed in such a way that if the real-time schedule is revoked a more pedestrian "backup" schedule can be inserted to allow cleanup before the process is destroyed. Following which, a Coyotos notice (a single bit in-band signal that gets delivered at next receive-wait) can be used to tell the process that it's time to retire.

It feels heavy, but like the proverbial dancing bear, it's amazing that it dances at all.

So you are creating a process specifically for use by this client?  That definitely makes the problem easier to manage.

More or less. I was actually thinking of it as a thread, running a schedule supplied by the client, that executes the code of (in this case) the compositor. This probably means adding a restriction on the Process capability that prevents fetching from the capability slot so that the capability itself cannot be leaked. I definitely haven't thought this all the way through yet.
 
There's no interface to get notified of a schedule being revoked yet, but there's no reason you couldn't make one.

I think it would turn into a fault code that gets delivered to the per-process fault handler. There's an already defined NoSchedule fault that would work for this.
 
If the callee doesn't have a schedule configured by default, it should inherit the caller's schedule until it responds.  Are there situations where you don't want to use that mechanism, or where that mechanism isn't sufficient for you?

It can't inherit the caller's schedule in the literal sense; that would leak the schedule capability. It is possible to allow the recipient to run on the remainder of the caller's slice, but (a) it's technically an authority leak, and (b) you can't do that on reserved schedules without making a heck of a mess unless both sides run on the same reserve. 


Jonathan

William ML Leslie

Oct 27, 2025, 4:05:20 AM
to cap-...@googlegroups.com
On Mon, 27 Oct 2025 at 11:17, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
On Sun, Oct 26, 2025 at 3:30 AM William ML Leslie <william.l...@gmail.com> wrote:
There's no interface to get notified of a schedule being revoked yet, but there's no reason you couldn't make one.

I think it would turn into a fault code that gets delivered to the per-process fault handler. There's an already defined NoSchedule fault that would work for this.

Does it get delivered to the handlers of all of the processes using this schedule statically? Perhaps more critically: when? Do you want to walk over all of these during revocation?

 
If the callee doesn't have a schedule configured by default, it should inherit the caller's schedule until it responds.  Are there situations where you don't want to use that mechanism, or where that mechanism isn't sufficient for you?

It can't inherit the caller's schedule in the literal sense; that would leak the schedule capability. It is possible to allow the recipient to run on the remainder of the caller's slice, but (a) it's technically an authority leak, and (b) you can't do that on reserved schedules without making a heck of a mess unless both sides run on the same reserve. 

You can inherit a schedule without setting the schedule field of the process (I mean, this is the current behaviour, since schedules aren't implemented in 0.6.2).  Notice that schedules have processes they will cause to run, and processes also have a schedule - that is, there are references that go each way, and they mean different things.  This is how I thought this worked:

* a process that has a schedule configured (i.e. has a valid schedule in its schedule slot) always runs on that schedule.  When receiving a message, the process is placed on that schedule's active list and control is not immediately transferred to that process.

* a schedule has (one, or a stack of, or a fifo queue of?) processes.  When the schedule is invoked, either by scheduler activation, interrupt, or return from interrupt, a process is selected from the schedule and given time.

Having a high-priority process invoke some common service works not just by donating the remaining slice: it causes the schedule to point to the invokee as its first process to run.  When the process returns, the schedule now points back at the caller.  Nothing gains access to the schedule capability; instead, the schedule internally refers to the process it will run next.  I haven't thought about how this is exposed to the checkpoint facility yet.
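
In struct form, the shape I have in mind is roughly the following; the field names are invented for the sake of discussion, not taken from the sources:

    /* Not the real Coyotos structures; just the two reference directions. */
    struct Process;

    struct Schedule {
      struct Process *runNext;     /* the process this schedule will cause to
                                      run (one, a stack, or a FIFO of them)  */
      /* reservation parameters: period, budget, priority, ... */
    };

    struct Process {
      struct Schedule *schedule;   /* the schedule slot: what this process
                                      normally runs on */
      struct Process  *queueLink;  /* link in that schedule's list */
      /* ... the rest of the process state ... */
    };

    /* On invocation, the *schedule* is redirected at the invokee; the
     * schedule capability itself is never handed out. */
    void on_invoke(struct Schedule *s, struct Process *callee)
    {
      s->runNext = callee;         /* callee gets time on the caller's schedule */
    }

    void on_return(struct Schedule *s, struct Process *caller)
    {
      s->runNext = caller;         /* the schedule points back at the caller */
    }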

I have been designing schedules loosely as a stack in my own fork, mostly because I've been enjoying some Midori lately, but this comes with a weird quirk in that if you don't obey strict stack discipline in the messages you send on behalf of clients, some of your callees won't get time.  It feels a little like rust's async/await: if you're used to how lovely E is, it's really jank to have to await something in order for it to make progress.

--
William ML Leslie

Jonathan S. Shapiro

Oct 27, 2025, 4:08:39 PM
to cap-...@googlegroups.com
On Mon, Oct 27, 2025 at 1:05 AM William ML Leslie <william.l...@gmail.com> wrote:
Does [NoSchedule] get delivered to the handlers of all of the processes using this schedule statically? Perhaps more critically: when? Do you want to walk over all of these during revocation?

It does not. When a process attempts to run on a revoked schedule, the kernel will discover that its schedule capability has become the null capability, which causes the fault to be issued for that process. If a replacement schedule is installed before a particular process attempts to run, there's no reason to issue a fault for it. 

Taking your next point slightly out of order:

Notice that schedules have processes they will cause to run, and processes also have a schedule - that is, there are references that go each way, and they mean different things.

That reference relationship may not exist in either direction if the process hasn't been dispatched yet. For processes that have been dispatched, I suppose you could walk the ready queue for the schedule, but that wouldn't get you the processes that are currently on-CPU. You'd have to issue a broadcast IPI to kick all of the currently executing processes off of their respective cores so they were all on the appropriate ready queue, do your thing, and then let them resume if the current reserve hasn't run out.

Coyotos got rid of the doubly linked list structure for finding capabilities because the cache behavior was terrible. It has been replaced by an incremental background tracing mechanism.

You mentioned scheduler activations. I was looking at the code yesterday, and it looks like we backed scheduler activations out when we hit the ARM issue. It had been about 13 years since I last looked at that code, and it seems that I was mixed up about where it stood.

This is how I thought this worked:

* a process that has a schedule configured (i.e. has a valid schedule in its schedule slot) always runs on that schedule.  When receiving a message, the process is placed on that schedule's active list and control is not immediately transferred to that process.

This is not what currently happens. In EROS, we took the view that the logical flow of control passes from invoker to invokee in an IPC, and we allow the invokee to run on the currently active slice (the one from the invoker) until (a) the invokee blocks, or (b) the slice ends. What is true is that when a process comes off of a stall queue it goes on the ready queue associated with the schedule named by its scheduling capability.

This was never a good answer; it was just less bad than any other answer. Following the logical control flow from invoker to invokee is fine when the two processes are running on the same schedule, but it leaks computons if the invokee runs on a lower schedule.

We don't have hard real-time schedules in EROS, but in most hard real-time schemes we would have to take the view that the logical thread changes at the schedule boundary (not at the IPC boundary). You can send messages between two processes in the same schedule without giving up your slice, but hard real-time scheduling regimes (e.g. ARINC-653 and friends) require strict temporal isolation, which means that you cannot "give away" unused computons.
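
To make the contrast concrete, here is the boundary decision in sketch form. The names are mine, not the kernel's:

    struct Schedule;
    struct Process { struct Schedule *schedule; };
    struct CPU     { struct Process *current; };

    void readyqueue_insert(struct Schedule *, struct Process *);  /* invented */

    /* EROS-style: logical control flow follows the IPC, so the invokee
     * keeps running on whatever slice is live right now.  Fine within a
     * schedule; leaks computons across schedules. */
    void dispatch_follow_ipc(struct CPU *cpu, struct Process *invokee)
    {
      cpu->current = invokee;                  /* same slice keeps ticking */
    }

    /* Strict temporal isolation (ARINC-653-ish): the logical thread ends
     * at the schedule boundary. */
    void dispatch_isolated(struct CPU *cpu, struct Process *caller,
                           struct Process *invokee)
    {
      if (invokee->schedule == caller->schedule) {
        cpu->current = invokee;                /* same schedule: keep the slice */
      } else {
        readyqueue_insert(invokee->schedule, invokee);
        cpu->current = NULL;                   /* unused computons are not donated */
      }
    }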

In Coyotos there's an added twist: in a communication from Peter to Paul, the two might not be assigned to the same processor performance class. It's not clear whether we want to revise Paul's processor class in this situation, but at the moment we cross the process boundary all of the state we need in order to resume Paul is sitting in the cpuFor(Peter) cache.

In the movie The Devil's Own, Brad Pitt comes out with: "If you're not confused, you don't understand what's going on." I'm pretty sure he was talking about scheduling. :-)
 
* a schedule has (one, or a stack of, or a fifo queue of?) processes.  When the schedule is invoked, either by scheduler activation, interrupt, or return from interrupt, a process is selected from the schedule and given time.

Within a schedule, it's a FIFO ready queue. Schedules aren't invoked by any of these events. What happens is that the appropriate receiving process becomes ready, or in the case of end-of-interrupt, a process reverts to receiving. In either case, the scheduler operates according to its normal behavior.


Jonathan

William ML Leslie

Oct 28, 2025, 12:37:43 AM
to cap-...@googlegroups.com
On Tue, 28 Oct 2025 at 06:08, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
On Mon, Oct 27, 2025 at 1:05 AM William ML Leslie <william.l...@gmail.com> wrote:
Does [NoSchedule] get delivered to the handlers of all of the processes using this schedule statically? Perhaps more critically: when? Do you want to walk over all of these during revocation?

It does not. When a process attempts to run on a revoked schedule, the kernel will discover that its schedule capability has become the null capability, which causes the fault to be issued for that process. If a replacement schedule is installed before a particular process attempts to run, there's no reason to issue a fault for it. 

I wonder how schedules tie into the selection of a process after a timer interrupt.  I've been imagining that a timer interrupt, a process yielding, or a process blocking on an object load, etc., results in selecting the next schedule and picking the first process off of it.  In that model, I don't know how a process could be selected to run if it's not on a valid schedule.  I guess it could happen if the current process blocks on receive?  Is that the only case?
 

Taking your next point slightly out of order:

Notice that schedules have processes they will cause to run, and processes also have a schedule - that is, there are references that go each way, and they mean different things.

That reference relationship may not exist in either direction if the process hasn't been dispatched yet. For processes that have been dispatched, I suppose you could walk the ready queue for the schedule, but that wouldn't get you the processes that are currently on-CPU. You'd have to issue a broadcast IPI to kick all of the currently executing processes off of their respective cores so they were all on the appropriate ready queue, do your thing, and then let them resume if the current reserve hasn't run out.

The image format as it stands doesn't store schedules, but you presumably do want to configure your schedules as part of the image.  Once we do that, processes that haven't been dispatched will also have both relations set up.


Coyotos got rid of the doubly linked list structure for finding capabilities because the cache behavior was terrible. It has been replaced by an incremental background tracing mechanism.

You mentioned scheduler activations. I was looking at the code yesterday, and it looks like we backed scheduler activations out when we hit the ARM issue. It had been about 13 years since I last looked at that code, and it seems that I was mixed up about where it stood.

I only imagine a `donate` method on a schedule object so that you can do user-level scheduling if you like.  The checkpointing question is about how queues are represented to the checkpointer.  My understanding from the CapROS and Coyotos sources is, they aren't.  This includes the wait queue on processes.  The practical impact is that all processes need to be in-memory, even if they haven't had anything to do in weeks.  If they aren't in memory but are waiting on a process to receive, there's no existing way to know to wake them.
 

This is how I thought this worked:

* a process that has a schedule configured (i.e. has a valid schedule in its schedule slot) always runs on that schedule.  When receiving a message, the process is placed on that schedule's active list and control is not immediately transferred to that process.

This is not what currently happens. In EROS, we took the view that the logical flow of control passes from invoker to invokee in an IPC, and we allow the invokee to run on the currently active slice (the one from the invoker) until (a) the invokee blocks, or (b) the slice ends. What is true is that when a process comes off of a stall queue it goes on the ready queue associated with the schedule named by its scheduling capability.

This was never a good answer; it was just less bad than any other answer. Following the logical control flow from invoker to invokee is fine when the two processes are running on the same schedule, but it leaks computons if the invokee runs on a lower schedule.

We don't have hard real-time schedules in EROS, but in most hard real-time schemes we would have to take the view that the logical thread changes at the schedule boundary (not at the IPC boundary). You can send messages between two processes in the same schedule without giving up your slice, but hard real-time scheduling regimes (e.g. ARINC-653 and friends) require strict temporal isolation, which means that you cannot "give away" unused computons.

I think there's consensus at this point that Coyotos (and seL4) Call should really be treated like a call to a library function: the callee can just refuse to return, so if you don't trust the invokee to honour their temporal part of this contract, you can't make any promises about timeliness no matter how scheduling is implemented.  If it's the case that you need to trust the timing behaviours of your callee, why not trust them to borrow your schedule?  At least trust that you can put those processes on your schedule until they reply to you.
 

In Coyotos there's an added twist: in a communication from Peter to Paul, the two might not be assigned to the same processor performance class. It's not clear whether we want to revise Paul's processor class in this situation, but at the moment we cross the process boundary all of the state we need in order to resume Paul is sitting in the cpuFor(Peter) cache.

100%.  We've been assuming, I think, that we don't want the kernel making those sorts of decisions.  The last version you published doesn't pin Processes to CPUs (or Schedules to CPUs, because schedules aren't implemented); instead, it's one global ready queue.  In my fork I have Processes pinned, but the APIs both at runtime and in the image aren't there to configure that.
 
In the movie The Devil's Own, Brad Pitt comes out with: "If you're not confused, you don't understand what's going on." I'm pretty sure he was talking about scheduling. :-)

No doubt, but I'm extra confused because not only do I have "the way the current implementation does it", "how seL4 does it both MCS and Original Recipe", and "things we talked about 20 years ago" in my head, I also have "how a terrifying, undocumented branch in my own fork does it".  So I'll definitely take some time to page all of that back in.
 
 
* a schedule has (one, or a stack of, or a fifo queue of?) processes.  When the schedule is invoked, either by scheduler activation, interrupt, or return from interrupt, a process is selected from the schedule and given time.

Within a schedule, it's a FIFO ready queue. Schedules aren't invoked by any of these events. What happens is that the appropriate receiving process becomes ready, or in the case of end-of-interrupt, a process reverts to receiving. In either case, the scheduler operates according to its normal behavior.

I'm not sure what its normal behaviour is!  We talked about it a lot back in 2008 or so but did we nail anything down?
 
--
William ML Leslie

Jonathan S. Shapiro

Oct 28, 2025, 3:37:39 AM
to cap-...@googlegroups.com
On Mon, Oct 27, 2025 at 9:37 PM William ML Leslie <william.l...@gmail.com> wrote:
I wonder how schedules tie into the selection of a process after a timer interrupt.  I've been imagining that a timer interrupt, a process yielding, or a process blocking on an object load, etc., results in selecting the next schedule and picking the first process off of it.  In that model, I don't know how a process could be selected to run if it's not on a valid schedule.  I guess it could happen if the current process blocks on receive?  Is that the only case?

Answers:
  • A process that has voluntarily given up the CPU (yielded) moves to the back of the ready queue for its schedule. There will probably also be a yield variant for real-time schedules that yields the remainder of the current scheduling period.
  • If you are blocked on an object load, you're on the stall queue for the object being loaded. Once loaded, that stall queue is awakened and the member processes get put back on their respective ready queues.
  • A process on a CPU has a slice. When the slice times out or the process goes to a receiving state, it is ineligible to run until something returns it to the running state.
  • When a CPU has no active process, the process selection logic is used to find something to run (rough sketch of that logic after this list).
  • A process on an invalid or malformed schedule will by courtesy run long enough to send a NoSchedule fault if (and only if) a fault handler can be found for that process.
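
Putting those together, the selection logic comes out looking roughly like the sketch below. Every name is invented; this is not the real code:

    struct Schedule;
    struct Process { struct Schedule *schedule; };
    struct CPU     { unsigned long sliceRemaining; };

    enum FaultCode { FC_NoSchedule = 1 };

    /* Invented helpers standing in for the real machinery: */
    struct Process *readyqueue_dequeue(struct CPU *);
    struct Process *idle_process(struct CPU *);
    int             schedule_is_valid(struct Schedule *);
    int             has_fault_handler(struct Process *);
    void            deliver_fault(struct Process *, enum FaultCode);
    unsigned long   slice_for(struct Schedule *);

    struct Process *select_next(struct CPU *cpu)
    {
      for (;;) {
        struct Process *p = readyqueue_dequeue(cpu);
        if (p == NULL)
          return idle_process(cpu);            /* nothing runnable */

        if (!schedule_is_valid(p->schedule)) {
          /* Revoked or malformed schedule: courtesy-deliver NoSchedule if
           * (and only if) a fault handler can be found, then keep looking. */
          if (has_fault_handler(p))
            deliver_fault(p, FC_NoSchedule);
          continue;
        }

        cpu->sliceRemaining = slice_for(p->schedule);
        return p;                              /* runs until it blocks, yields,
                                                  or the slice ends */
      }
    }
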
The image format as it stands doesn't store schedules, but you presumably do want to configure your schedules as part of the image.  Once we do that, processes that haven't been dispatched will also have both relations set up.

  • The general purpose priority schedules are statically defined, so they don't need to be in the image.
  • Real-time schedules have presumably blown their deadlines on restart; I need to work through a protocol to inform those processes that this has happened.
  • The kernel scheduler entries were computed by an application that loaded them into the kernel tables. That process can reload them at restart.
The actual challenge here is to get all of the RT processes back to a stable resumption state w.r.t. their real-time schedules.
 
I only imagine a `donate` method on a schedule object so that you can do user-level scheduling if you like.

There is no donate method because multi-level scheduling doesn't work. People have tried and tried without success. If you want a user-defined schedule, the way to get it is to negotiate with the scheduler, obtain a capability for the schedule you want, and install that in a process.
 
The checkpointing question is about how queues are represented to the checkpointer.  My understanding from the CapROS and Coyotos sources is, they aren't.

They aren't. Any process on a stall queue is logically running and will re-attempt the operation that put it on the stall queue.
Wait queues and stall queues are the same thing.
 
The practical impact is that all processes need to be in-memory, even if they haven't had anything to do in weeks.

There is a design for a decongester that converts long-blocked processes into restart capabilities. The issue has never come up forcefully enough to merit implementing the decongester. Once a restart key exists, the process goes into a closed receive state. At that point the process can be paged out.
 
  If they aren't in memory but are waiting on a process to receive, there's no existing way to know to wake them.

You have the directionality of the wait mixed up. You don't wait on a process to receive. You wait to receive. If you only want to receive from a particular process, you give it a different endpoint and filter. The pointers and capabilities involved always point in the forward direction.
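
Schematically, and with invented names rather than the actual invocation layout:

    typedef unsigned long Cap;                         /* stand-in for a capability */

    /* Invented wrappers: */
    Cap  endpoint_create(Cap bank);                    /* fresh endpoint */
    Cap  endpoint_entryCap(Cap ep, unsigned long id);  /* what the sender gets */
    void receive_on(Cap ep, unsigned long matchId);    /* blocks until a send through
                                                          a matching entry cap arrives */

    /* "Receive only from that particular process" means: give that process its
     * own entry capability, then wait to receive on the endpoint behind it.
     * The wait points forward at the endpoint, never backward at the sender. */
    void expect_reply_via(Cap bank, Cap *giveToPeer)
    {
      Cap replyEp = endpoint_create(bank);
      *giveToPeer = endpoint_entryCap(replyEp, 42);    /* goes out in the request */
      /* ... send the request, then: ... */
      receive_on(replyEp, 42);                         /* the closed receive */
    }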
 
 I think there's consensus at this point that Coyotos (and seL4) Call should really be treated like a call to a library function: the callee can just refuse to return, so if you don't trust the invokee to honour their temporal part of this contract, you can't make any promises about timeliness no matter how scheduling is implemented.  If it's the case that you need to trust the timing behaviours of your callee, why not trust them to borrow your schedule?  At least trust that you can put those processes on your schedule until they reply to you.

Since there is no such thing as a Coyotos "call", there may be less of a consensus than you imagine. The only operation in Coyotos is send-and-receive. Coyotos no longer distinguishes the available and waiting states that existed in EROS.

A closed receive is implemented with an endpoint filter. The moral equivalent of an EROS resume capability is an entry capability to an endpoint that serves as the resume endpoint.

I'm definitely not going to bake a call/return assumption into the IPC mechanism.
 
In Coyotos there's an added twist: in a communication from Peter to Paul, the two might not be assigned to the same processor performance class. It's not clear whether we want to revise Paul's processor class in this situation, but at the moment we cross the process boundary all of the state we need in order to resume Paul is sitting in the cpuFor(Peter) cache.

100%.  We've been assuming, I think, that we don't want the kernel making those sorts of decisions.  The last version you published doesn't pin Processes to CPUs (or Schedules to CPUs, because schedules aren't implemented); instead, it's one global ready queue.  In my fork I have Processes pinned, but the APIs both at runtime and in the image aren't there to configure that.

Since the placement decision is performance critical, I think the kernel has to make the placement decision. I need to understand more about the important hard real time scheduling methods to understand what pinning is important. The mechanism, I think, would be to have some schedules that draw on a specific CPU or possibly a specific CPU cluster/class. Though for "computons per period" schedules you could probably dispatch to a wide range of CPUs just fine if the deadline lets you.

You've clearly spent some time thinking about this. I'd love to see where your thinking is at!
 
 In the movie The Devil's Own, Brad Pitt comes out with: "If you're not confused, you don't understand what's going on." I'm pretty sure he was talking about scheduling. :-)

No doubt, but I'm extra confused because not only do I have "the way the current implementation does it", "how seL4 does it both MCS and Original Recipe", and "things we talked about 20 years ago" in my head, I also have "how a terrifying, undocumented branch in my own fork does it".  So I'll definitely take some time to page all of that back in.

That should be good for extra confusion, which ought to qualify you as a domain expert. :-)
  
* a schedule has (one, or a stack of, or a fifo queue of?) processes.  When the schedule is invoked, either by scheduler activation, interrupt, or return from interrupt, a process is selected from the schedule and given time.

Within a schedule, it's a FIFO ready queue. Schedules aren't invoked by any of these events. What happens is that the appropriate receiving process becomes ready, or in the case of end-of-interrupt, a process reverts to receiving. In either case, the scheduler operates according to its normal behavior.

I'm not sure what its normal behaviour is!  We talked about it a lot back in 2008 or so but did we nail anything down?

I think we mostly nailed down that we need to nail this down.

I suspect we're going to end up with a table-driven scheduler that is ordered by either priority or deadline. That schedule gets computed by an application-level scheduler when new scheduling contracts are created.
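
I'm imagining the table entries carrying something like the following; the field set is entirely speculative at this point:

    #include <stdint.h>

    /* Speculative shape of a kernel scheduling-table entry. */
    typedef struct {
      uint64_t period_us;    /* reservation period */
      uint64_t budget_us;    /* computons (CPU time) allowed per period */
      uint64_t deadline_us;  /* relative deadline; 0 means "same as period" */
      uint8_t  priority;     /* used when the table is priority-ordered */
      uint8_t  cpuClass;     /* CPU class/cluster allowed to satisfy it */
    } SchedEntry;

    /* The application-level scheduler does the admission control, computes the
     * table (ordered by priority or deadline), and loads it into the kernel;
     * the kernel just walks it. */
    int scheduler_loadTable(const SchedEntry *table, unsigned count);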


Jonathan

William ML Leslie

Oct 28, 2025, 5:31:36 AM
to cap-...@googlegroups.com
On Tue, 28 Oct 2025 at 17:37, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
On Mon, Oct 27, 2025 at 9:37 PM William ML Leslie <william.l...@gmail.com> wrote:
I wonder how schedules tie into the selection of a process after a timer interrupt.  I've been imagining that a timer interrupt, a process yielding, or a process blocking on an object load, etc., results in selecting the next schedule and picking the first process off of it.  In that model, I don't know how a process could be selected to run if it's not on a valid schedule.  I guess it could happen if the current process blocks on receive?  Is that the only case?

Answers:
  • A process that has voluntarily given up the CPU (yielded) moves to the back of the ready queue for its schedule. There will probably also be a yield variant for real-time schedules that yields the remainder of the current scheduling period.
  • If you are blocked on an object load, you're on the stall queue for the object being loaded. Once loaded, that stall queue is awakened and the member processes get put back on their respective ready queues.
  • A process on a CPU has a slice. When the slice times out or the process goes to a receiving state, it is ineligible to run until something returns it to the running state.
  • When a CPU has no active process, the process selection logic is used to find something to run.
  • A process on an invalid or malformed schedule will by courtesy run long enough to send a NoSchedule fault if (and only if) a fault handler can be found for that process.
It's good to know that these things end up with these processes no longer being on any run queue.  The reason I ask is, with the introduction of Schedules, you also presumably want to do something specific with the process selection logic.  I pictured the selection logic as first choosing a Schedule, and then selecting a process from that schedule.  The problem as I had seen it was that a rescinded schedule would never be selected, so the process selection logic wouldn't see it.  Maybe I'm off in the rough.
 
The image format as it stands doesn't store schedules, but you presumably do want to configure your schedules as part of the image.  Once we do that, processes that haven't been dispatched will also have both relations set up.

  • The general purpose priority schedules are statically defined, so they don't need to be in the image.
  • Real-time schedules have presumably blown their deadlines on restart; I need to work through a protocol to inform those processes that this has happened.
  • The kernel scheduler entries were computed by an application that loaded them into the kernel tables. That process can reload them at restart.
The actual challenge here is to get all of the RT processes back to a stable resumption state w.r.t. their real-time schedules.

You want to have a fixed (small) number of buckets in the image?  So e.g. all processes at priority 20 get equivalent slices?
 
  If they aren't in memory but are waiting on a process to receive, there's no existing way to know to wake them.

You have the directionality of the wait mixed up. You don't wait on a process to receive. You wait to receive. If you only want to receive from a particular process, you give it a different endpoint and filter. The pointers and capabilities involved always point in the forward direction.

I'm waiting to send to a process; I'm waiting on that process to receive.  I'm on its stall queue.  If I'm checkpointed, and I lose my space in the process cache before the target goes into receiving state, I never get placed onto a run queue, because the CapROS and Coyotos stall queue can only contain processes that are resident, and the mechanism that puts me back onto the runnable queue is the target emptying its stall queue when it attempts to receive.

This is how it's currently implemented.  The only reason we can't hit this problem is because processes are always in memory in the current implementations.
 
 
 I think there's consensus at this point that Coyotos (and seL4) Call should really be treated like a call to a library function: the callee can just refuse to return, so if you don't trust the invokee to honour their temporal part of this contract, you can't make any promises about timeliness no matter how scheduling is implemented.  If it's the case that you need to trust the timing behaviours of your callee, why not trust them to borrow your schedule?  At least trust that you can put those processes on your schedule until they reply to you.

Since there is no such thing as a Coyotos "call", there may be less of a consensus than you imagine. The only operation in Coyotos is send-and-receive. Coyotos no longer distinguishes the available and waiting states that existed in EROS.

A closed receive is implemented with an endpoint filter. The moral equivalent of an EROS resume capability is an entry capability to an endpoint that serves as the resume endpoint.

I'm definitely not going to bake a call/return assumption into the IPC mechanism.

The send portion is either blocking or it is non-blocking.  If it is non-blocking, we don't have this problem, we just have to deal with the fact that the service we are trying to use might not have received our message.  I don't know if you can build a useful system using only non-blocking send.

If it is blocking, it's a Call by another name.  You either wait until you get a reply, or you shouldn't expect a reply.  I suppose you could instead use a notification and/or a write to a shared buffer as your reply.  You are still at the mercy of the invokee, who may never perform a receive.
 
Since the placement decision is performance critical, I think the kernel has to make the placement decision. I need to understand more about the important hard real time scheduling methods to understand what pinning is important. The mechanism, I think, would be to have some schedules that draw on a specific CPU or possibly a specific CPU cluster/class. Though for "computons per period" schedules you could probably dispatch to a wide range of CPUs just fine if the deadline lets you.

You've clearly spent some time thinking about this. I'd love to see where your thinking is at!

The case I was dealing with is not so sophisticated.  Some pcie devices (notably virtio, some network hardware) may expose more than one command/response buffer, so that you can fill them from different cores.  In order to keep data as local as possible, I spun off a process on each cpu that managed one buffer each.  The plan was that application processes could use the driver process closest to them to avoid needing an IPI in most cases.  There are quite a few programs that want to be able to control the locality of their threads so that they can maximise their throughput, so I figured I might as well start supporting that now, even though I was only implementing some very simple programs when I paused work on those drivers.
 
 
 In the movie The Devil's Own, Brad Pitt comes out with: "If you're not confused, you don't understand what's going on." I'm pretty sure he was talking about scheduling. :-)

No doubt, but I'm extra confused because not only do I have "the way the current implementation does it", "how seL4 does it both MCS and Original Recipe", and "things we talked about 20 years ago" in my head, I also have "how a terrifying, undocumented branch in my own fork does it".  So I'll definitely take some time to page all of that back in.

That should be good for extra confusion, which ought to qualify you as a domain expert. :-)

If I get any more muddled, I'll apply to be a sales consultant at CrowdStrike.
 
Within a schedule, it's a FIFO ready queue. Schedules aren't invoked by any of these events. What happens is that the appropriate receiving process becomes ready, or in the case of end-of-interrupt, a process reverts to receiving. In either case, the scheduler operates according to its normal behavior.

I'm not sure what its normal behaviour is!  We talked about it a lot back in 2008 or so but did we nail anything down?

I think we mostly nailed down that we need to nail this down.

I suspect we're going to end up with a table-driven scheduler that is ordered by either priority or deadline. That schedule gets computed by an application-level scheduler when new scheduling contracts are created.

The real finesse will be in keeping the kernel API sufficiently expressive but the implementation clean.  I struggled with this, but you might be able to see the way.

To get the expressiveness right, we might need to come back to your original statement: "communication across schedule boundaries is really hard to do without violating scheduling reservation contracts."

--
William ML Leslie

Jonathan S. Shapiro

Oct 29, 2025, 5:44:32 PM
to cap-...@googlegroups.com
On Tue, Oct 28, 2025 at 2:31 AM William ML Leslie <william.l...@gmail.com> wrote:

It's good to know that these things end up with these processes no longer being on any run queue.  The reason I ask is, with the introduction of Schedules, you also presumably want to do something specific with the process selection logic.  I pictured the selection logic as first choosing a Schedule, and then selecting a process from that schedule.  The problem as I had seen it was that a rescinded schedule would never be selected, so the process selection logic wouldn't see it.  Maybe I'm off in the rough.

In abstract, a rescinded schedule should behave much like a rescinded page, in which case what you describe is the "principled" solution.

In practice, letting them run at extremely low priority long enough to deliver their own death notices is harmless (those are idle cycles anyway), and saves us the trouble of introducing a more complicated solution.
 
The image format as it stands doesn't store schedules, but you presumably do want to configure your schedules as part of the image.  Once we do that, processes that haven't been dispatched will also have both relations set up.

  • The general purpose priority schedules are statically defined, so they don't need to be in the image.
  • Real-time schedules have presumably blown their deadlines on restart; I need to work through a protocol to inform those processes that this has happened.
  • The kernel scheduler entries were computed by an application that loaded them into the kernel tables. That process can reload them at restart.
The actual challenge here is to get all of the RT processes back to a stable resumption state w.r.t. their real-time schedules.

You want to have a fixed (small) number of buckets in the image?  So e.g. all processes at priority 20 get equivalent slices?

The bottom tier of schedules is simple priorities. Empirically, it's not useful to have a crazy number of them, so there's no real schedule data structure to save. You need to save the capabilities, but those are stored in the process structure.

The only thing we actually need to know about processes in the checkpoint area is which ones were running. That can be done with a single bit. The schedule will get fixed up when the schedule capabilities are prepared, so all we need to do is to get them to a point where that preparation will actually happen.

Sticking them on the highest priority ready queue (a priority only used for restart) is a fine way to make that happen.
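
Roughly, with invented names:

    #include <stdbool.h>

    #define PRIO_RESTART 255u            /* priority used only during restart */

    struct Process {
      bool wasRunning;                   /* the single bit saved at checkpoint */
      /* the schedule capability lives in the process structure and gets
       * re-prepared lazily, so nothing else needs saving here */
    };

    void readyqueue_insert_at(unsigned prio, struct Process *);   /* invented */

    void restart_requeue(struct Process *procs, unsigned count)
    {
      for (unsigned i = 0; i < count; i++) {
        if (procs[i].wasRunning)
          /* Anywhere that guarantees the schedule capability will get prepared
           * is fine; preparation installs the real schedule. */
          readyqueue_insert_at(PRIO_RESTART, &procs[i]);
      }
    }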
 
I'm waiting to send to a process; I'm waiting on that process to receive.

In that case you are on a stall queue associated with the target process. Processes that are on a stall queue at checkpoint time are in the running state. By definition, that means they are in memory.

This is how it's currently implemented.  The only reason we can't hit this problem is because processes are always in memory in the current implementations.

Yes. Though the decongestion design would change that.
 
The send portion is either blocking or it is non-blocking.  If it is non-blocking, we don't have this problem, we just have to deal with the fact that the service we are trying to use might not have received our message.  I don't know if you can build a useful system using only non-blocking send.

At present, there is no such thing as a non-blocking send. There is an AppNotice mechanism that can send something similar to a signal. Sending an AppNotice is non-blocking, but receipt is not preemptive.

This muddle is why I was looking at scheduler activations.
 
If it is blocking, it's a Call by another name.  You either wait until you get a reply, or you shouldn't expect a reply.

I get what you're saying, and you're mostly right. My quibble is that when people talk about an IPC "call", they tend to mean that the receiver must reply to the call before doing something else. In systems that don't implement that expectation, naming the mechanism "call" creates a surprising amount of confusion when you try to explain IPC.

The case I was dealing with is not so sophisticated.  Some pcie devices (notably virtio, some network hardware) may expose more than one command/response buffer, so that you can fill them from different cores.

Yes. Sigh. Right now that requires multiple processes and notices, and if you need any kind of buffer performance you really want all of the receiving processes to be on the same CPU that is going to be doing the work. If you're moving data, the cache locality is probably more important than the IPI.

Needing multiple processes is an annoying complication. This is exactly the sort of thing where a scheduler activation system works really well.
 
The real finesse will be in keeping the kernel API sufficiently expressive but the implementation clean.  I struggled with this, but you might be able to see the way.

Maybe I should just join you at crowdstrike...


Jonathan

William ML Leslie

Nov 2, 2025, 2:38:26 AM
to cap-...@googlegroups.com
On Thu, 30 Oct 2025 at 07:44, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
On Tue, Oct 28, 2025 at 2:31 AM William ML Leslie <william.l...@gmail.com> wrote:

It's good to know that these things end up with these processes no longer being on any run queue.  The reason I ask is, with the introduction of Schedules, you also presumably want to do something specific with the process selection logic.  I pictured the selection logic as first choosing a Schedule, and then selecting a process from that schedule.  The problem as I had seen it was that a rescinded schedule would never be selected, so the process selection logic wouldn't see it.  Maybe I'm off in the rough.

In abstract, a rescinded schedule should behave much like a rescinded page, in which case what you describe is the "principled" solution.

In practice, letting them run at extremely low priority long enough to deliver their own death notices is harmless (those are idle cycles anyway), and saves us the trouble of introducing a more complicated solution.

Right, cool.  So these processes are on a low-priority ready queue so can be found easily.

 
You want to have a fixed (small) number of buckets in the image?  So e.g. all processes at priority 20 get equivalent slices?

The bottom tier of schedules is simple priorities. Empirically, it's not useful to have a crazy number of them, so there's no real schedule data structure to save. You need to save the capabilities, but those are stored in the process structure.

The only thing we actually need to know about processes in the checkpoint area is which ones were running. That can be done with a single bit. The schedule will get fixed up when the schedule capabilities are prepared, so all we need to do is to get them to a point where that preparation will actually happen.

Sticking them on the highest priority ready queue (a priority only used for restart) is a fine way to make that happen.

Sounds good.  Do you plan to use something like a KeyKOS meter to give high-priority processes bounded slices?  How is priority related to which interrupts remain enabled while running?  Not necessarily questions for right now, but they aren't answered in the existing documentation.
 
 
I'm waiting to send to a process; I'm waiting on that process to receive.

In that case you are on a stall queue associated with the target process. Processes that are on a stall queue at checkpoint time are in the running state. By definition, that means they are in memory.

Currently, when a process goes into receiving mode, the kernel wakes all processes waiting to send to that target.  This has some upsides - higher priority processes have their message processed sooner - and some downsides, such as a significant amount of churn in the stall and ready queues that can mean individual processes might not make progress.  I don't think there are semantic problems with message ordering, but I could be wrong.

Oh - and all of them are woken up, even if the receiver expects an endpoint match.
 
 
The send portion is either blocking or it is non-blocking.  If it is non-blocking, we don't have this problem, we just have to deal with the fact that the service we are trying to use might not have received our message.  I don't know if you can build a useful system using only non-blocking send.

At present, there is no such thing as a non-blocking send. There is an AppNotice mechanism that can send something similar to a signal. Sending an AppNotice is non-blocking, but receipt is not preemptive.

Non-blocking send is how the server's reply operation is implemented in Coyotos, which keeps the sender safe from a client who is not currently receiving.

If the target isn't receiving, or if there is a receiver memory fault during IPC, the send is ignored and the sender continues with the receive phase, if present; otherwise it continues execution.
 
This muddle is why I was looking at scheduler activations.

FWIW I would love to know how you define scheduler activations.  Not sure if I've asked you this before.  I have been reading this as "attempting to send to a process (or some equivalent interaction with it) causes it to run immediately".  Do you mean that a specific function is run immediately?  Who provides the stack?
 
 
If it is blocking, it's a Call by another name.  You either wait until you get a reply, or you shouldn't expect a reply.

I get what you're saying, and you're mostly right. My quibble is that when people talk about an IPC "call", they tend to mean that the receiver must reply to the call before doing something else. In systems that don't implement that expectation, naming the mechanism "call" creates a surprising amount of confusion when you try to explain IPC.

Ah, I see.  Gernot is picky about using "protected procedure call" over "inter-process communication" for related reasons.
 

The case I was dealing with is not so sophisticated.  Some pcie devices (notably virtio, some network hardware) may expose more than one command/response buffer, so that you can fill them from different cores.

Yes. Sigh. Right now that requires multiple processes and notices, and if you need any kind of buffer performance you really want all of the receiving processes to be on the same CPU that is going to be doing the work. If you're moving data, the cache locality is probably more important than the IPI.

Needing multiple processes is an annoying complication. This is exactly the sort of thing where a scheduler activation system works really well.

I get that we're dealing with legacy CPU architectures, so I can tolerate a bunch of per-cpu Processes for performance-critical drivers.

If we eliminated Processes from this sort of set-up, I like the idea that we might be able to zero a page of stack, pop that into the Mapping for that process on that CPU, and run the result - or, clone the top-level mapping.  I don't know if that would perform any better on real hardware.  I'm really jealous of Midori, WASM and the B5500 here.

You played with scatter/gather in the early days of Coyotos, I wonder if we'll revisit that once we get some real SMP applications.

--
William ML Leslie

Jonathan S. Shapiro

Nov 3, 2025, 12:40:02 AM
to cap-...@googlegroups.com
On Sun, Nov 2, 2025 at 12:38 AM William ML Leslie <william.l...@gmail.com> wrote:
In practice, letting them run at extremely low priority long enough to deliver their own death notices is harmless (those are idle cycles anyway), and saves us the trouble of introducing a more complicated solution.

Right, cool.  So these processes are on a low-priority ready queue so can be found easily.

Not with certainty:
  • If they successfully deliver a death notice (a fault message), they are no longer on any ready queue. At that point it becomes the fault handler's problem to deal with.
  • If no fault handler is present (no entry capability in the handler slot), the death notice is delivered to the void capability, which does nothing. After this the broken process is no longer on any ready queue.
In short, if you lose track of your process it isn't the kernel's problem to keep track of it. If you really need to get rid of it you can destroy the appropriate space bank.


Sounds good.  Do you plan to use something like a KeyKOS meter to give high-priority processes bounded slices?  How is priority related to which interrupts remain enabled while running?  Not necessarily questions for right now, but they aren't answered in the existing documentation.

I want to apologize before I answer. I've been tied up in a couple of unrelated things, so this has been getting [small] fractional attention for the last week.

I was not a fan of the meter hierarchy. It's one of two places in the KeyKOS kernel where an n log(n) algorithm can run, and the value of n can be quite large. The other is the checkpoint log.

For simple priority scheduling, there's no reason to have meters at all. The question I've been asking myself these last couple of weeks is whether something like a meter is useful for recording real-time schedules. My intuition, at this point, is that real-time schedules are going to be table driven, and the question then becomes "Is the table big enough?" Some kind of schedule data structure relieves that question a bit, but then leaves us needing to wrestle with what happens when a schedule is paged out.

But then I'm left with a second intuition that I'm a bit suspicious of: in a general-purpose system:
  1. the number of useful table entries is probably bounded by a relatively small constant, and
  2. [confident] some scheduling application computed them, and therefore knows what they are, and
  3. [confident] on restart, real-time schedule contracts have been violated, so we want the schedules to be re-built in any case.
If the first point is true, then there's no compelling reason to re-introduce a durable scheduling object to replace meters.
 
I'm waiting to send to a process; I'm waiting on that process to receive.

In that case you are on a stall queue associated with the target process. Processes that are on a stall queue at checkpoint time are in the running state. By definition, that means they are in memory.

Currently, when a process goes into receiving mode, the kernel wakes all processes waiting to send to that target.  This has some upsides - higher priority processes have their message processed sooner - and some downsides, such as a significant amount of churn in the stall and ready queues that can mean individual processes might not make progress.  I don't think there are semantic problems with message ordering, but I could be wrong.

Yes. But there is another issue to consider. Suppose we wake up only the highest-priority sender. When this sender is re-started, there is no guarantee that it will re-issue the send. If it does not, and the other senders are not awakened, then they will not be awakened until some completely unrelated process sends a message to the receiver, which eventually causes it to re-enter the receiving state.

If the sender queue is short, it's a non-issue. If the sender queue is long, there's a larger design problem.

That said, this issue is one of the reasons I was pushing on the scheduler activations idea. It doesn't solve anything if the receiver doesn't cooperate, and sometimes there are really good reasons for the receiver not to accept and queue additional in-bound messages. One of which is the generic multi-level scheduling problem. But if the receiver is designed to handle high concurrency, it helps.

Assuming the receiver does accept and queue messages, the next question is whether we want the receiver to handle them according to sender priority. The problem with this is that we simply don't have a unified model for scheduling in the presence of real-time, and if we did, the receiver may not be able to satisfy any particular desired ordering.

In short: cooperative multi-level scheduling remains an unsolved problem.
 
Oh - and all of them are woken up, even if the receiver expects an endpoint match.

Yes. And I suppose we could filter, but there's still that pesky issue that the sender may not re-issue the send at all, in which case the filter is not relevant and we want to awaken the sender in any case... 

The send portion is either blocking or it is non-blocking.  If it is non-blocking, we don't have this problem, we just have to deal with the fact that the service we are trying to use might not have received our message.  I don't know if you can build a useful system using only non-blocking send.

At present, there is no such thing as a non-blocking send. There is an AppNotice mechanism that can send something similar to a signal. Sending an AppNotice is non-blocking, but receipt is not preemptive.

Non-blocking send is how the server's reply operation is implemented in Coyotos, which keeps the sender safe from a client who is not currently receiving.

That is not (quite) correct. The replying process has an entry capability that is supposed to serve as a reply endpoint, but:
  • this does not guarantee that the receiver (who, at this point, is probably the caller awaiting a return) is still in a receiving state
  • this does not guarantee that the receiver's receive block matches the intended cohort (might not be an issue - I need to dig back into the code)
  • this does not guarantee that the receiver still exists.
But in any of these cases, the receiver has broken the contract. The defense is that the replying process replies with the "will not wait" bit set. If the receiver has defected, the message will be lost, the replying process will continue, and the receiver may remain in the receiving state forever with a filter that cannot (ever) be matched.

Which all sounds reasonable until the receiver is paused for debugging. There are workarounds for that, but nobody would confuse them for being pretty.
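For anyone following along, here is roughly what the reply side looks like in C. The flag and call names below are invented for illustration; this is the shape of the defense, not the actual Coyotos invocation API:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t cap_t;          /* stand-in for an entry capability */

typedef struct {
  cap_t    reply_cap;             /* entry cap the caller handed us */
  uint32_t flags;                 /* invocation control flags       */
  void    *data;
  size_t   len;
} msg_t;

#define FLAG_WILL_NOT_WAIT 0x1    /* invented name for the no-block bit */

/* Assumed (not real) syscall wrapper: attempt the send exactly once. */
extern bool ipc_send(const msg_t *m);

static void reply_to_client(cap_t reply_cap, void *data, size_t len) {
  msg_t m = {
    .reply_cap = reply_cap,
    .flags     = FLAG_WILL_NOT_WAIT,   /* never block on a defecting caller */
    .data      = data,
    .len       = len,
  };
  /* If the caller is no longer receiving (or no longer exists), the message
   * is simply dropped and the server moves on to the next request. */
  (void)ipc_send(&m);
}

The server makes exactly one delivery attempt and moves on; a defecting caller hurts only itself.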


If the target isn't receiving, or if there is a receiver memory fault during IPC, the send is ignored and the sender continues with the receive phase if present otherwise it continues execution.

Not quite. In the return case, the receiver is expected to have set up their entry memory region[s] in advance and we can argue that they have defected if there is a memory fault. But even if that has occurred, we should probably try to issue a fault notice to the receiver's fault handler advising that the receiver committed seppuku through gross stupidity. Debugging, again, being an entertaining exception.

If the sender is willing to wait, faults on the receiver side should be processed in the normal way and the send should then be re-attempted.

There is a hidden requirement on the messaging protocol here: you need to be able to plan ahead for a reply to an invocation you made, but you can't necessarily plan for an inbound request that somebody else might make to you.

  
This muddle is why I was looking at scheduler activations.

FWIW I would love to know how you define scheduler activations.  Not sure if I've asked you this before.  I have been reading this as "attempting to send to a process (or some equivalent interaction with it) causes it to run immediately".  Do you mean that a specific function is run immediately?  Who provides the stack?

Let me take this up in a separate email thread.

 
I get what you're saying, and you're mostly right. My quibble is that when people talk about an IPC "call", they tend to mean that the receiver must reply to the call before doing something else.

They do. Which completely breaks receive-side multiplexing or re-ordering. Multiplexing is (e.g.) absolutely required for responsive window systems.

Ah, I see.  Gernot is picky about using "protected procedure call" over "inter-process communication" for related reasons.

Yes. Though I think the "protected" part is orthogonal to the "procedure call" part. There are use-cases for both patterns, and the underlying mechanism doesn't need to dictate. Perhaps the right question is how the participants might express their expectations about the contract and how the underlying mechanism might report defections.
 
You played with scatter/gather in the early days of Coyotos, I wonder if we'll revisit that once we get some real SMP applications.

The spec for receiving regions still has that capability. It's uncomfortably "orthogonal addressing" from my point of view. Very reminiscent of VAX or 68000. But I haven't found a better approach.


Jonathan

William ML Leslie

unread,
Nov 3, 2025, 1:45:49 AM (2 days ago) Nov 3
to cap-...@googlegroups.com
On Mon, 3 Nov 2025 at 15:40, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
On Sun, Nov 2, 2025 at 12:38 AM William ML Leslie <william.l...@gmail.com> wrote: 

Currently, when a process goes into receiving mode, the kernel wakes all processes waiting to send to that target.  This has some upsides - higher priority processes have their message processed sooner - and some downsides, such as a significant amount of churn in the stall and ready queues that can mean individual processes might not make progress.  I don't think there are semantic problems with message ordering, but I could be wrong.

Yes. But there is another issue to consider. Suppose we wake up only the highest-priority sender. When this sender is re-started, there is no guarantee that it will re-issue the send. If it does not, and the other senders are not awakened, then they will not be awakened until some completely unrelated process sends a message to the receiver, which eventually causes it to re-enter the receiving state.

My hack for this was that if a process is in the receiving state, and the stall queue is not empty, I place it on the ready queue.  If a process is selected from the ready queue and it's in open wait, I peel off the first process in the stall queue and schedule it.  It is a bit more complex than that; I moved the stall queues to the endpoints and built an rbtree of those.  I don't know that I recommend this.

 
Non-blocking send is how the server's reply operation is implemented in Coyotos, which keeps the sender safe from a client who is not currently receiving.

That is not (quite) correct. The replying process has an entry capability that is supposed to serve as a reply endpoint, but:
  • this does not guarantee that the receiver (who, at this point, is probably the caller awaiting a return) is still in a receiving state
  • this does not guarantee that the receiver's receive block matches the intended cohort (might not be an issue - I need to dig back into the code)
  • this does not guarantee that the receiver still exists.
But in any of these cases, the receiver has broken the contract. The defense is that the replying process replies with the "will not wait" bit set. If the receiver has defected, the message will be lost, the replying process will continue, and the receiver may remain in the receiving state forever with a filter that cannot (ever) be matched.

Which all sounds reasonable until the receiver is paused for debugging. There are workarounds for that, but nobody would confuse them for being pretty.

Can't we just pretend that receive is atomic? :)

 
If the target isn't receiving, or if there is a receiver memory fault during IPC, the send is ignored and the sender continues with the receive phase if present otherwise it continues execution.

Not quite. In the return case, the receiver is expected to have set up their entry memory region[s] in advance and we can argue that they have defected if there is a memory fault. But even if that has occurred, we should probably try to issue a fault notice to the receiver's fault handler advising that the receiver committed seppuku through gross stupidity. Debugging, again, being an entertaining exception.

Delivering the fault sounds fine; but we can still choose not to block the sender to run a fault handler.  The receiver loses their message, but at least they get to know why.
 
 
You played with scatter/gather in the early days of Coyotos, I wonder if we'll revisit that once we get some real SMP applications.

The spec for receiving regions still has that capability. It's uncomfortably "orthogonal addressing" from my point of view. Very reminiscent of VAX or 68000. But I haven't found a better approach.

I ripped out indirect strings in my port (which was probably not a good idea) but the balance is I now have capabilities for slots in gpts.  You can name one during receive to say "install the sent capability here" and it will opaquely map a sent memory object for you.  I think about doing more like that.

--
William ML Leslie

Jonathan S. Shapiro

unread,
Nov 3, 2025, 2:03:15 AM (2 days ago) Nov 3
to cap-talk
William wrote:
 
FWIW I would love to know how you define scheduler activations.  Not sure if I've asked you this before.

This deserves a separate discussion. Reference:

Anderson et al., "Scheduler Activations: Effective Kernel-Level Support for User-Level Management of Parallelism", ACM Transactions on Computer Systems, 10(1), Feb. 1992

I'll sketch it below, but the Anderson paper is brilliant. It's the only credible start I have seen on unifying kernel-mode and user-mode scheduling. I strongly recommend it, if only to put structure on the problem.

There is a strong conceptual connection between scheduler activations, Marlow's introduction of async/await in Haskell, and Don Syme's subsequent introduction of async/await into F#, but on the basis of a cursory look at the cited references this appears to be a case of independent reinvention. Notably: implementation of async/await does not rely on kernel-implemented activations.

Recap of the Paper

The Anderson paper is concerned with managing the N:M mapping problem between user-level threads and kernel-level threads. For purposes of recap, let's assume N:1 user:kernel threads. The general idea is that we want to tell the application process when (a) its current user-level thread has blocked and it should choose another user-mode thread to run, or (b) the containing user-level process has been newly dispatched. In some ways, this work anticipates async/await (introduced in 2007), but it is strictly more general.

In a conventional scheduler, control resumes at the saved PC whenever the user-level process is resumed. Anderson's proposal is that control should instead resume at a dedicated user-level scheduler PC, and that the user-level scheduler should receive (as an argument) the user-level control block that is being resumed. A working assumption is that the kernel knows how to find the user-level control block, and can save the preempted PC into the control block; there are several possible implementations. Add a little bit of information about why the previous user-level control block was preempted (blocked? slice timed out?) and you have the basics of a cooperative interface between the kernel-level scheduler and the application-level scheduler.

The contract here is small and interesting. The kernel's job is to say what was preempted/blocked, including register state. The application's job is to decide what user-level thread is resumed, which may involve queueing the user-mode thread that was preempted. By resuming to the user-level scheduler entry point, the user-level process has an opportunity to choose what user-level thread it will resume.
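To put some C on that contract (all names are hypothetical, taken neither from Anderson's paper nor from Coyotos, and the ready queue is deliberately simple-minded):

#include <stddef.h>
#include <stdint.h>

typedef enum {
  ACT_PREEMPTED,   /* time slice expired; register state below is resumable */
  ACT_BLOCKED,     /* the thread blocked in the kernel                      */
  ACT_UNBLOCKED    /* a previously blocked thread became runnable again     */
} act_reason_t;

typedef struct uthread {
  uintptr_t pc;           /* saved user-level PC, filled in by the kernel */
  uintptr_t sp;           /* saved user-level SP                          */
  struct uthread *next;   /* run-queue linkage, managed in user mode      */
} uthread_t;

static uthread_t *ready_head;   /* trivial FIFO ready queue */

static void enqueue_ready(uthread_t *t) {
  uthread_t **p = &ready_head;
  t->next = NULL;
  while (*p)
    p = &(*p)->next;
  *p = t;
}

static uthread_t *pick_next(void) {
  uthread_t *t = ready_head;
  if (t)
    ready_head = t->next;
  return t;
}

/* Architecture-specific: load registers from *t and jump to t->pc. */
extern void resume_thread(uthread_t *t);

/* The kernel never resumes at the saved PC. It resumes *here*, handing the
 * user-level scheduler the interrupted thread and the reason it stopped,
 * and the user-level scheduler decides what runs next. */
void activation_entry(uthread_t *interrupted, act_reason_t why) {
  if (why == ACT_PREEMPTED || why == ACT_UNBLOCKED)
    enqueue_ready(interrupted);    /* still (or newly) runnable */
  /* ACT_BLOCKED: the kernel hands it back later via ACT_UNBLOCKED. */

  uthread_t *next = pick_next();
  if (next)
    resume_thread(next);
  /* else: nothing runnable; a real runtime would tell the kernel to idle. */
}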

Anderson does not attempt to deal with mixed kernel-level priorities in a single process.

In Anderson's version, the user-mode scheduler is serially re-entrant. While user-mode scheduling decisions are being made, further activations will not be injected.

Aside: Wasabi used scheduler activations in production as a way to implement high-performance storage subsystems.

Extension to IPC:

Proceeding from the sketch above, it isn't hard to imagine extending the scheduler activation concept to IPC receive blocks. I have a memory that L4 explored something similar. Here's a sketch:
  • Processes register a receive block with the kernel. This block describes where the next in-bound message should go. The receive block additionally specifies the address of the next receive block, or null if no further receive block has been defined.
  • If the current receive block pointer is null, the process is not receiving and the sender blocks (there are user/kernel handshake implications on the receive side).
  • When an in-coming message arrives, it is delivered according to the current receive block, and the current receive block pointer is advanced to the next receive block.
  • If a message has been delivered, the kernel resumes at a "message has been delivered" entry point, which (similar to scheduler activations) decides what to run next in the user-level application.
Unlike scheduler activations, there is a handshake needed between the application and the kernel to update the current receive block address. The receive block pointer itself can be updated with CAS, but a kernel transition is required to wake up blocked senders.
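A minimal C sketch of the receive block chain, to pin down the shape. Every name here is hypothetical, the kernel side is elided, and the chain update is written single-threaded for clarity rather than with the CAS the real thing would want:

#include <stddef.h>
#include <stdint.h>

#define RCV_CAP_SLOTS 4

/* Hypothetical receive block: the process publishes a chain of these and
 * the kernel consumes the head each time an inbound message arrives. */
typedef struct recv_block {
  void      *data;                     /* where payload bytes get copied      */
  size_t     data_limit;               /* size of that buffer                 */
  uintptr_t  cap_slots[RCV_CAP_SLOTS]; /* where received capabilities land    */
  uint64_t   matched_ep;               /* filled in by the kernel on delivery */
  struct recv_block *next;             /* next block; NULL => "not receiving" */
} recv_block_t;

/* Per-process pointer the kernel reads: the current receive block. */
static recv_block_t *current_rcv;

/* Assumed (not real) syscall: tell the kernel we went from "not receiving"
 * to "receiving" so it can wake senders parked on our stall queue. */
extern void sys_notify_receiving(void);

/* User side: append a fresh block. The pointer update could be done with
 * CAS, but waking already-blocked senders still needs a kernel transition. */
static void post_receive_block(recv_block_t *rb) {
  rb->next = NULL;
  if (current_rcv == NULL) {
    current_rcv = rb;            /* we were not receiving; now we are */
    sys_notify_receiving();
  } else {
    recv_block_t *tail = current_rcv;
    while (tail->next)
      tail = tail->next;
    tail->next = rb;             /* just extend the chain */
  }
}

On delivery, the kernel copies into the head block, advances the current pointer to the next block, and resumes at the "message has been delivered" entry point described above.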

When the receiver is activated, many choices are possible: the received message blocks can be stuck in a queue, the current user-mode thread can be preempted, or some other policy can apply. The main points are:
  • Message receipt is non-blocking provided the receiver has an active and well-formed receive block at the time of receipt.
  • Message receipt does not necessarily imply preemption of the active user-mode thread. The application-level scheduler gets to decide whether to continue the current user-level thread (presumably queueing the message) or preempt it in favor of a "receiving" user-level thread.
  • In abstract, the application-level scheduler could dynamically scale the receive buffer pool in response to load.
Open issues:
  • Should all process resumptions be handled as if an in-bound resumption message has been sent?
  • How should "foreign" processes behave if this is implemented? By virtue of being foreign they cannot receive normal IPC messages in any case. The question is how to support the emulated system call interface?
    • If we assume foreign applications are recompiled, we can bake support into the foreign application runtime, but
    • If binary compatibility for foreign APIs is a goal we need to decide what to do and how to do it.
  • None of this solves the problem of preserving scheduling priority across a call. Anderson's proposed mechanism doesn't attempt to address this. Which is probably a good call.

This is a radical departure from GNOSIS/KeyKOS/EROS, and it is (at best) 3/4 baked. But if it can be worked through sensibly it seems to offer new and interesting concurrency options.




Jonathan

Jonathan S. Shapiro

unread,
Nov 3, 2025, 2:53:32 AM (2 days ago) Nov 3
to cap-...@googlegroups.com
On Sun, Nov 2, 2025 at 10:45 PM William ML Leslie <william.l...@gmail.com> wrote:
On Mon, 3 Nov 2025 at 15:40, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Yes. But there is another issue to consider. Suppose we wake up only the highest-priority sender. When this sender is re-started, there is no guarantee that it will re-issue the send. If it does not, and the other senders are not awakened, then they will not be awakened until some completely unrelated process sends a message to the receiver, which eventually causes it to re-enter the receiving state.

My hack for this was that if a process is in the receiving state, and the stall queue is not empty, I place it on the ready queue.  If a process is selected from the ready queue and it's in open wait, I peel off the first process in the stall queue and schedule it.

The problem remains that the awakened invoker may not re-attempt. This seems to leave two options:
  • Wake all senders and let the scheduler sort it out. At the risk of blasphemy, this might be characterized as the "God will know his own" approach to resumption.
  • Wake a preferred sender and stick the rest on a "worry" queue of some sort with a timeout or a recovery mechanism (danger here).
The first approach is simple but optimistic. The second disrespects the scheduler, but at some number of senders that is true of both policies. This is the sort of thing that led Phil Karlton and Jim Gettys (of X10/X11 fame) to adopt "mechanism, not policy" as something like a design religion. Unfortunately, you can't always avoid making choices.
 
  It is a bit more complex than that; I moved the stall queues to the endpoints and built an rbtree of those.  I don't know that I recommend this.

What guarantees that these stall queues are ever awakened? Given multiple endpoints targeting a common process, how does this avoid priority inversion?

I can see a story that says "A is invoking C, let's see if some other (more deserving) process B is invoking C and decide whether to context switch to B rather than allowing A to send."  As with politics, any problem that starts with "We wish to benefit the more deserving" turns out to be very problematic. In this case, it guarantees priority inversion.

For reasonable queue lengths, waking all blocked senders seems in practice to be the best of a poor set of options. The root problem is that in the presence of real-time scheduling there is no stable ordering of dispatch precedence. The core distinction between a stall queue and a ready queue is that the stall queue does not embed any policy about order of dispatch.
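It's also the trivially simple policy to implement, which counts for something. A sketch, with invented names (this is not the actual stall queue code):

#include <stddef.h>

struct process;

/* Hypothetical stall queue: a bare list of processes waiting to send. It
 * deliberately carries no dispatch-order policy of its own. */
struct stall_queue {
  struct process *head;
};

struct process {
  struct process *next_stalled;        /* linkage while stalled           */
  struct stall_queue stalled_senders;  /* processes waiting to send to us */
};

extern void make_ready(struct process *p);   /* hand back to the scheduler */

/* Called when 'receiver' enters the receiving state: everything waiting on
 * its stall queue goes back to the scheduler. Each awakened sender retries
 * (or doesn't) under its own schedule; ordering is the scheduler's problem. */
static void wake_all_senders(struct process *receiver) {
  struct process *p = receiver->stalled_senders.head;
  receiver->stalled_senders.head = NULL;
  while (p != NULL) {
    struct process *next = p->next_stalled;
    p->next_stalled = NULL;
    make_ready(p);
    p = next;
  }
}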

I can imagine approaches that wake the full stall queue only if the selected sender defects. At some point we'll undoubtedly find a motivation to implement that for some reason, but until then simple seems good. Partly because specifying defects, or even "might have defected", seems hard.
 
Non-blocking send is how the server's reply operation is implemented in Coyotos, which keeps the sender safe from a client who is not currently receiving.

That is not (quite) correct. The replying process has an entry capability that is supposed to serve as a reply endpoint, but:
  • this does not guarantee that the receiver (who, at this point, is probably the caller awaiting a return) is still in a receiving state
  • this does not guarantee that the receiver's receive block matches the intended cohort (might not be an issue - I need to dig back into the code)
  • this does not guarantee that the receiver still exists.
But in any of these cases, the receiver has broken the contract. The defense is that the replying process replies with the "will not wait" bit set. If the receiver has defected, the message will be lost, the replying process will continue, and the receiver may remain in the receiving state forever with a filter that cannot (ever) be matched.

Which all sounds reasonable until the receiver is paused for debugging. There are workarounds for that, but nobody would confuse them for being pretty.

Can't we just pretend that receive is atomic? :)

As long as the key word is "pretend", sure! OS architects have as much right to a rich fantasy life as anyone else engaged in fantastical constructions! :-)
 
If the target isn't receiving, or if there is a receiver memory fault during IPC, the send is ignored and the sender continues with the receive phase if present otherwise it continues execution.

Not quite. In the return case, the receiver is expected to have set up their entry memory region[s] in advance and we can argue that they have defected if there is a memory fault. But even if that has occurred, we should probably try to issue a fault notice to the receiver's fault handler advising that the receiver committed seppuku through gross stupidity. Debugging, again, being an entertaining exception.

Delivering the fault sounds fine; but we can still choose not to block the sender to run a fault handler.  The receiver loses their message, but at least they get to know why.

Agreed. I was trying (poorly) to say this, with the caveat that attempting to deliver the fault code at some very low "I'm completely screwed but attempting to tell somebody" priority seems useful. The sender will pay for one attempt to deliver a message to the fault handler, but will not otherwise be blocked. By virtue of happening at "I'm completely screwed" priority, subsequent delivery attempts are de facto non-blocking provided schedule perimeters are respected.
 
You played with scatter/gather in the early days of Coyotos, I wonder if we'll revisit that once we get some real SMP applications.

The spec for receiving regions still has that capability. It's uncomfortably "orthogonal addressing" from my point of view. Very reminiscent of VAX or 68000. But I haven't found a better approach.

I ripped out indirect strings in my port (which was probably not a good idea) but the balance is I now have capabilities for slots in gpts.  You can name one during receive to say "install the sent capability here" and it will opaquely map a sent memory object for you.  I think about doing more like that.

That's an interesting idea, and I want to let that rattle around a bit.

My first concern is that the receiver is letting the sender transmit an arbitrary capability into a slot that spans a region of the receiver's address space. There are security issues. Assume the sender is hostile and the receiver needs to be robust. How does the receiver know/respond if the receiver makes a reference to that part of the address space and the sender has destroyed the memory region? Have a look at the high-performance networking work we did in EROS.

Accepting a memory mapping update seems to be bi-modal. You either trust the sender completely or not at all. The problem if you do not trust the sender is that defection manifests as an invalid memory reference by the receiver, and there is no sane recovery model at the high-level programming language level. This is more or less the important conclusion of the EROS networking paper.

I'm not saying your approach is a bad idea. The GNOSIS family leans in the direction of "I gave you a clear spec, if you do something well-defined and hopelessly stupid it's on you." EROS leans somewhat in the direction of "but the docs should tell you how much it's going to hurt if you bend over in this particular way and you don't supply your own lube." But both systems seem to hold that "fixing stupid is (at best) best effort."


Jonathan

Bakul Shah

unread,
Nov 3, 2025, 3:18:21 AM (2 days ago) Nov 3
to cap-...@googlegroups.com, cap-talk


On Nov 2, 2025, at 11:03 PM, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:


William wrote:
 
FWIW I would love to know how you define scheduler activations.  Not sure if I've asked you this before.

This deserves a separate discussion. Reference:

Anderson et al., "Scheduler Activations: Effective Kernel-Level Support for User-Level Management of Parallelism", ACM Transactions on Computer Systems, 10(1), Feb. 1992

I'll sketch it below, but the Anderson paper is brilliant. It's the only credible start I have seen on unifying kernel-mode and user-mode scheduling. I strongly recommend it, if only to put structure on the problem.

A link to this paper: https://web.eecs.umich.edu/~mosharaf/Readings/Scheduler-Activations.pdf
See also the wikipedia page for SA. Note that both NetBSD & FreeBSD ripped out support for SA (M user threads multiplexed over N kernel threads) in favor of 1:1 when they implemented SMP support.


Jonathan S. Shapiro

unread,
Nov 3, 2025, 3:30:14 AM (2 days ago) Nov 3
to cap-...@googlegroups.com
On Mon, Nov 3, 2025 at 12:18 AM 'Bakul Shah' via cap-talk <cap-...@googlegroups.com> wrote:
Thanks very much! I actually have that very link up in a browser window, and forgot to provide the link. Appreciate the help!
 
See also the wikipedia page for SA. Note that both NetBSD & FreeBSD ripped out support for SA (M user threads multiplexed over N kernel threads) in favor of 1:1 when they implemented SMP support.

Yes. I don't know the NetBSD decision process, and I'm now prompted to ask. My guess is that hierarchical scheduling was problematic and the progressive introduction of more cores relieved some of the motivating pressure. Adding to that, the fact that kernel scheduling priorities don't cross the user/supervisor boundary gracefully may have made SA less attractive.

Using the approach for message queueing may be more helpful, but the proof is in the experimenting.


Jonathan

Jonathan S. Shapiro

unread,
Nov 3, 2025, 4:05:38 AM (2 days ago) Nov 3
to cap-talk
It may be helpful to summarize.
  1. There is no known mechanism that co-manages scheduling across the user/supervisor boundary.
  2. To the extent that this is true, and where schedule preservation is important, we may need to explore application approaches where the application itself is non-blocking and the participating [kernel] threads run on distinct schedules. I have no idea how to build such applications robustly.
  3. [Mostly] non-blocking message delivery seems like a place to consider collaborative user/kernel preemption.
  4. Alternatively, it may turn out that multiple kernel processes listening on a common endpoint are simpler.
Provisionally, I think that [3] and [4] are not mutually exclusive. User-preemptive message receipt preserves the single-threaded concurrency style of async/await, which is undeniably simpler to code than true concurrency. Parallel waiting receivers enable true hardware-level concurrency, at the cost of accepting the full complexity of true concurrency.

Async/await breaks down when schedulable order and dependency order are not aligned. When this happens, the non-preemptive nature of async/await combined with long-running operations gets in the way. In exchange, it eliminates the need to write programs that deal with full-on non-blocking concurrency.

Jonathan

William ML Leslie

unread,
Nov 3, 2025, 5:50:42 AM (2 days ago) Nov 3
to cap-...@googlegroups.com
On Mon, 3 Nov 2025 at 17:53, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
On Sun, Nov 2, 2025 at 10:45 PM William ML Leslie <william.l...@gmail.com> wrote:
On Mon, 3 Nov 2025 at 15:40, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Yes. But there is another issue to consider. Suppose we wake up only the highest-priority sender. When this sender is re-started, there is no guarantee that it will re-issue the send. If it does not, and the other senders are not awakened, then they will not be awakened until some completely unrelated process sends a message to the receiver, which eventually causes it to re-enter the receiving state.

My hack for this was that if a process is in the receiving state, and the stall queue is not empty, I place it on the ready queue.  If a process is selected from the ready queue and it's in open wait, I peel off the first process in the stall queue and schedule it.

The problem remains that the awakened invoker may not re-attempt. This seems to leave two options:

The hopeful receiver is still on the ready queue, and if it is selected again, it will wake the next potential sender in sequence.
  • Wake all senders and let the scheduler sort it out. At the risk of blasphemy, this might be characterized as the "God will know his own" approach to resumption.
  • Wake a preferred sender and stick the rest on a "worry" queue of some sort with a timeout or a recovery mechanism (danger here).
The first approach is simple but optimistic. The second disrespects the scheduler, but at some number of senders that is true of both policies. This is the sort of thing that led Phil Karlton and Jim Gettys (of X10/X11 fame) to adopt "mechanism, not policy" as something like a design religion. Unfortunately, you can't always avoid making choices.
 
  It is a bit more complex than that; I moved the stall queues to the endpoints and built an rbtree of those.  I don't know that I recommend this.

What guarantees that these stall queues are ever awakened? Given multiple endpoints targeting a common process, how does this avoid priority inversion?

I was going for guaranteed in-order delivery, because I like the progress story.  If a server wants to implement multiple priorities, it can use distinct endpoints and attempt a non-blocking wait with the high-priority endpoint id first, before falling back to open wait.

I can see a story that says "A is invoking C, let's see if some other (more deserving) process B is invoking C and decide whether to context switch to B rather than allowing A to send."  As with politics, any problem that starts with "We wish to benefit the more deserving" turns out to be very problematic. In this case, it guarantees priority inversion.

For reasonable queue lengths, waking all blocked senders seems in practice to be the best of a poor set of options. The root problem is that in the presence of real-time scheduling there is no stable ordering of dispatch precedence. The core distinction between a stall queue and a ready queue is that the stall queue does not embed any policy about order of dispatch.

I can imagine approaches that wake the full stall queue only if the selected sender defects. At some point we'll undoubtedly find a motivation to implement that for some reason, but until then simple seems good. Partly because specifying defects, or even "might have defected", seems hard.

I'd love to dig into the pathologies of senders playing "down low, too slow" if you still think this is an issue.

 
Can't we just pretend that receive is atomic? :)

As long as the key word is "pretend", sure! OS architects have as much right to a rich fantasy life as anyone else engaged in fantastical constructions! :-)

Quoth the spec:
From the sender perspective, message transmission is (nearly) atomic. From the receiver perspective, message transfer occurs asynchronously. Arrival is signalled by a message completion event delivered to the receiver's activation handler.
 
 
If the target isn't receiving, or if there is a receiver memory fault during IPC, the send is ignored and the sender continues with the receive phase if present otherwise it continues execution.

Not quite. In the return case, the receiver is expected to have set up their entry memory region[s] in advance and we can argue that they have defected if there is a memory fault. But even if that has occurred, we should probably try to issue a fault notice to the receiver's fault handler advising that the receiver committed seppuku through gross stupidity. Debugging, again, being an entertaining exception.

Delivering the fault sounds fine; but we can still choose not to block the sender to run a fault handler.  The receiver loses their message, but at least they get to know why.

Agreed. I was trying (poorly) to say this, with the caveat that attempting to deliver the fault code at some very low "I'm completely screwed but attempting to tell somebody" priority seems useful. The sender will pay for one attempt to deliver a message to the fault handler, but will not otherwise be blocked. By virtue of happening at "I'm completely screwed" priority, subsequent delivery attempts are de facto non-blocking provided schedule perimeters are respected.

I recall having a moment at some point where I panicked thinking "what if the fault handler isn't ready to receive?!" before realising it's pretty easy to arrange things so that the handler is always receiving if the process is running.

 
You played with scatter/gather in the early days of Coyotos, I wonder if we'll revisit that once we get some real SMP applications.

The spec for receiving regions still has that capability. It's uncomfortably "orthogonal addressing" from my point of view. Very reminiscent of VAX or 68000. But I haven't found a better approach.

I ripped out indirect strings in my port (which was probably not a good idea) but the balance is I now have capabilities for slots in gpts.  You can name one during receive to say "install the sent capability here" and it will opaquely map a sent memory object for you.  I think about doing more like that.

That's an interesting idea, and I want to let that rattle around a bit.

My first concern is that the receiver is letting the sender transmit an arbitrary capability into a slot that spans a region of the receiver's address space. There are security issues. Assume the sender is hostile and the receiver needs to be robust. How does the receiver know/respond if the receiver makes a reference to that part of the address space and the sender has destroyed the memory region? Have a look at the high-performance networking work we did in EROS.

I have gone back and forth on this quite a bit, starting with trying to understand window capabilities.  The impression I got from what implementation of window capabilities exist, and what little the spec mentions, is that they were kind of superseded by the way the ElfSpace copies its background space.  Further, for the sort of uses I thought windowing might be useful - such as read-only shared maps of a file - I couldn't see how to make them work.  An ever so slightly related point is that if you want to have a GPT do something special - say, you want a memory handler - the GPT can only translate 3 bits.  I sometimes wonder if I should fix that, somehow.

So, I wonder if having another object specifically for the sake of including a memory handler or a remappable window might be useful.  Something that has methods for shifting the window over the underlying memory object, mapping a fallback page or object if access fails, and has some way to query whether the map is still active.

There's the related problem that if I grant some memory to a process, I probably also want a way to revoke that access.  I have some support for this in two unrelated objects I should really write about publicly.

--
William ML Leslie

Jonathan S. Shapiro

unread,
Nov 3, 2025, 1:07:54 PM (2 days ago) Nov 3
to cap-...@googlegroups.com
On Mon, Nov 3, 2025 at 2:50 AM William ML Leslie <william.l...@gmail.com> wrote:
On Mon, 3 Nov 2025 at 17:53, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:

The problem remains that the awakened invoker may not re-attempt. This seems to leave two options:

The hopeful receiver is still on the ready queue, and if it is selected again, it will wake the next potential sender in sequence.

A receiver is in the receiving state. It is not on the ready queue.

What guarantees that these stall queues are ever awakened? Given multiple endpoints targeting a common process, how does this avoid priority inversion?

I was going for guaranteed in-order delivery, because I like the progress story.  If a server wants to implement multiple priorities, it can use distinct endpoints and attempt a non-blocking wait with the high-priority endpoint id first, before falling back to open wait.

Since you can't guarantee retry on restart, there would need to be a way to detect that the sender didn't retry so that you could move on to the next sender. We've never found a clean way to do that.
 
I can imagine approaches that wake the full stall queue only if the selected sender defects. At some point we'll undoubtedly find a motivation to implement that for some reason, but until then simple seems good. Partly because specifying defects, or even "might have defected", seems hard.

I'd love to dig into the pathologies of senders playing "down low, too slow" if you still think this is an issue.

 
Can't we just pretend that receive is atomic? :)

As long as the key word is "pretend", sure! OS architects have as much right to a rich fantasy life as anyone else engaged in fantastical constructions! :-)

Quoth the spec:
From the sender perspective, message transmission is (nearly) atomic. From the receiver perspective, message transfer occurs asynchronously. Arrival is signalled by a message completion event delivered to the receiver's activation handler.

This is taken from a version that had (or at least described) scheduler activations. Can I ask where you are seeing this? I'm not convinced that I have the most recent version of the Coyotos repository, and I'd like to check if you may have a newer version.
 
I recall having a moment at some point where I panicked thinking "what if the fault handler isn't ready to receive?!" before realising it's pretty easy to arrange things so that the handler is always receiving if the process is running.

Not necessarily. One handler might be handling faults for hundreds of processes, which means it might be running when a particular process attempts to send it a fault. The sender simply ends up on a stall queue. It wakes up and retries when the handler re-enters the receive state.
 
My first concern is that the receiver is letting the sender transmit an arbitrary capability into a slot that spans a region of the receiver's address space. There are security issues. Assume the sender is hostile and the receiver needs to be robust. How does the receiver know/respond if the receiver makes a reference to that part of the address space and the sender has destroyed the memory region? Have a look at the high-performance networking work we did in EROS.

I have gone back and forth on this quite a bit, starting with trying to understand window capabilities.  The impression I got from what implementation of window capabilities exist, and what little the spec mentions, is that they were kind of superseded by the way the ElfSpace copies its background space.

Not superseded. Just a different use case. What ElfSpace is doing is copy-on-write for mutable pages. That's not a windowing problem.
 
Further, for the sort of uses I thought windowing might be useful - such as read-only shared maps of a file - I couldn't see how to make them work.

I suppose you could use them that way, but usually it's easier to just insert the read-only subspace directly into a containing space.
 
  An ever so slightly related point is that if you want to have a GPT do something special - say, you want a memory handler - the GPT can only translate 3 bits.  I sometimes wonder if I should fix that, somehow.

Don't. It's intentional. The most comparable construct in EROS/KeyKOS was red segments, which allowed more slots to be populated. The resulting translation logic was kind of a mess, because it didn't translate an exact number of bits. The decision to make GPTs translate an exact number of bits in this case was an intentional simplification.
 
So, I wonder if having another object specifically for the sake of including a memory handler or a remappable window might be useful.  Something that has methods for shifting the window over the underlying memory object, mapping a fallback page or object if access fails, and has some way to query whether the map is still active.

All of that can be done now. It doesn't motivate adding a different object.

I probably need to write up how background windows work better! To be honest, I never used a window (outside of testing) in all of the years of working on EROS. The case in which you really need one is pretty obscure.
 
There's the related problem that if I grant some memory to a process, I probably also want a way to revoke that access.  I have some support for this in two unrelated objects I should really write about publicly.

That's a common sharing pattern. The way to implement it is to wrap the space in a GPT that doesn't consume any translation bits.
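In sketch form, with invented names standing in for the real GPT interface (this is just the shape of the pattern, not Coyotos IDL):

#include <stdint.h>

typedef uintptr_t cap_t;   /* stand-in for a capability register/slot */

/* Assumed helpers, not real API: */
extern cap_t gpt_create(cap_t bank);                        /* new GPT from a space bank   */
extern void  gpt_set_translated_bits(cap_t gpt, unsigned n); /* bits translated by this GPT */
extern void  gpt_set_slot(cap_t gpt, unsigned slot, cap_t c);
extern cap_t cap_null(void);

/* Grant: the wrapper translates zero bits, so it is address-transparent;
 * the grantee maps 'wrapper' exactly as it would map 'space'. */
static cap_t grant_revocable(cap_t bank, cap_t space) {
  cap_t wrapper = gpt_create(bank);
  gpt_set_translated_bits(wrapper, 0);  /* consume no translation bits       */
  gpt_set_slot(wrapper, 0, space);      /* slot 0 forwards to the real space */
  return wrapper;                       /* hand *this* to the other process  */
}

/* Revoke: the grantee still holds the wrapper, but it now leads nowhere. */
static void revoke(cap_t wrapper) {
  gpt_set_slot(wrapper, 0, cap_null());
}

Because the wrapper consumes no translation bits, the grantee sees the same address layout it would have seen with the space itself; revocation is then just clobbering (or destroying) the wrapper.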


Jonathan 

Gernot Heiser

unread,
Nov 3, 2025, 11:01:18 PM (2 days ago) Nov 3
to cap-...@googlegroups.com
On 3 Nov 2025, at 18:03, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Extension to IPC:

Proceeding from the sketch above, it isn't hard to imagine extending the scheduler activation concept to IPC receive blocks. I have a memory that L4 explored something similar. Here's a sketch:
  • Processes register a receive block with the kernel. This block describes where the next in-bound message should go. The receive block additionally specifies the address of the next receive block, or null if no further receive block has been defined.
  • If the current receive block pointer is null, the process is not receiving and the sender blocks (there are user/kernel handshake implications on the receive side).
  • When an in-coming message arrives, it is delivered according to the current receive block, and the current receive block pointer is advanced to the next receive block.
  • If a message has been delivered, the kernel resumes at a "message has been delivered" entry point, which (similar to scheduler activations) decides what to run next in the user-level application.
Unlike scheduler activations, there is a handshake needed between the application and the kernel to update the current receive block address. The receive block pointer itself can be updated with CAS, but a kernel transition is required to wake up blocked senders.

[…]

This is a radical departure from GNOSIS/KeyKOS/EROS, and it is (at best) 3/4 baked. But if it can be worked through sensibly it seems to offer new and interesting concurrency options.

why?

Why should the kernel deal with user-level threads?

You’re right that Jochen proposed something somewhat similar for original L4 many years back [Liedtke & Wenske, HotOS’01], as a way of making intra-AS IPC faster. In hindsight, it was solving a problem created by the then L4 model forcing multithreaded designs on programs running on a single core.

With seL4 we routinely use a coroutine library to multiplex a single kernel-scheduled thread, and don’t see a need for adding more complexity to the kernel.

Gernot

William ML Leslie

unread,
Nov 4, 2025, 3:48:53 AM (21 hours ago) Nov 4
to cap-...@googlegroups.com
On Tue, 4 Nov 2025 at 04:07, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
On Mon, Nov 3, 2025 at 2:50 AM William ML Leslie <william.l...@gmail.com> wrote:
On Mon, 3 Nov 2025 at 17:53, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:

The problem remains that the awakened invoker may not re-attempt. This seems to leave two options:

The hopeful receiver is still on the ready queue, and if it is selected again, it will wake the next potential sender in sequence.

A receiver is in the receiving state. It is not on the ready queue.

I put things on the ready queue sometimes.  Vibe scheduling, if you like.

I can't usually put sending processes on the ready queue, because they would lose their place in the stall queue, but that isn't a problem for Coyotos since their position in the queue doesn't matter.  I use the once-vestigial `ipcPeer` field to indicate which sender can skip the queue, if you were wondering how I maintain message order after selecting and scheduling a sender.
 

What guarantees that these stall queues are ever awakened? Given multiple endpoints targeting a common process, how does this avoid priority inversion?

I was going for guaranteed in-order delivery, because I like the progress story.  If a server wants to implement multiple priorities, it can use distinct endpoints and attempt a non-blocking wait with the high-priority endpoint id first, before falling back to open wait.

Since you can't guarantee retry on restart, there would need to be a way to detect that the sender didn't retry so that you could move on to the next sender. We've never found a clean way to do that.

I haven't got all of the possibilities covered yet, but I don't know if there's value in my going into detail since I don't think that technique will apply as cleanly to Coyotos.
 
 
Quoth the spec:
From the sender perspective, message transmission is (nearly) atomic. From the receiver perspective, message transfer occurs asynchronously. Arrival is signalled by a message completion event delivered to the receiver's activation handler.

This is taken from a version that had (or at least described) scheduler activations. Can I ask where you are seeing this? I'm not convinced that I have the most recent version of the Coyotos repository, and I'd like to check if you may have a newer version.


It looked the same as the mercurial version I had before that.
 
 
I recall having a moment at some point where I panicked thinking "what if the fault handler isn't ready to receive?!" before realising it's pretty easy to arrange things so that the handler is always receiving if the process is running.

Not necessarily. One handler might be handling faults for hundreds of processes, which means it might be running when a particular process attempts to send it a fault. The sender simply ends up on a stall queue. It wakes up and retries when the handler re-enters the receive state.

Ah.  You are correct.  I missed that proc_do_upcall prepares the target with willingToBlock = true.
 
 
My first concern is that the receiver is letting the sender transmit an arbitrary capability into a slot that spans a region of the receiver's address space. There are security issues. Assume the sender is hostile and the receiver needs to be robust. How does the receiver know/respond if the receiver makes a reference to that part of the address space and the sender has destroyed the memory region? Have a look at the high-performance networking work we did in EROS.

I have gone back and forth on this quite a bit, starting with trying to understand window capabilities.  The impression I got from what implementation of window capabilities exist, and what little the spec mentions, is that they were kind of superseded by the way the ElfSpace copies its background space.

Not superseded. Just a different use case. What ElfSpace is doing is copy-on-write for mutable pages. That's not a windowing problem.
 
Further, for the sort of uses I thought windowing might be useful - such as read-only shared maps of a file - I couldn't see how to make them work.

I suppose you could use them that way, but usually it's easier to just insert the read-only subspace directly into a containing space.

If the offset into the file isn't a multiple of the size of the region I'm mapping it into, I have to build a whole new set of GPTs.  I'd wondered if a window into the file might remove the need for that.
 
 
  An ever so slightly related point is that if you want to have a GPT do something special - say, you want a memory handler - the GPT can only translate 3 bits.  I sometimes wonder if I should fix that, somehow.

Don't. It's intentional. The most comparable construct in EROS/KeyKOS was red segments, which allowed more slots to be populated. The resulting translation logic was kind of a mess, because it didn't translate an exact number of bits. The decision to make GPTs translate an exact number of bits in this case was an intentional simplification.

I'd sooner give up on keeping the size of an ExGPT at 256 bytes.  No specific plans, though.
 
 
So, I wonder if having another object specifically for the sake of including a memory handler or a remappable window might be useful.  Something that has methods for shifting the window over the underlying memory object, mapping a fallback page or object if access fails, and has some way to query whether the map is still active.

All of that can be done now. It doesn't motivate adding a different object.

I probably need to write up how background windows work better! To be honest, I never used a window (outside of testing) in all of the years of working on EROS. The case in which you really need one is pretty obscure.

Please :)
 
 
There's the related problem that if I grant some memory to a process, I probably also want a way to revoke that access.  I have some support for this in two unrelated objects I should really write about publically.

That's a common sharing pattern. The way to implement it is to wrap the space in a GPT that doesn't consume any translation bits.

I'm doing that for mapping writable files, but isn't it a bit involved to use this technique for all IPC that requires an indirect region?  I mean, this is the mechanism for replacing indirect strings, having to allocate then rescind on every such IPC.

--
William ML Leslie

William ML Leslie

unread,
Nov 4, 2025, 5:21:56 AM (19 hours ago) Nov 4
to cap-...@googlegroups.com
On Tue, 4 Nov 2025 at 18:48, William ML Leslie <william.l...@gmail.com> wrote:
On Tue, 4 Nov 2025 at 04:07, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
On Mon, Nov 3, 2025 at 2:50 AM William ML Leslie <william.l...@gmail.com> wrote:
On Mon, 3 Nov 2025 at 17:53, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
What guarantees that these stall queues are ever awakened? Given multiple endpoints targeting a common process, how does this avoid priority inversion?

I was going for guaranteed in-order delivery, because I like the progress story.  If a server wants to implement multiple priorities, it can use distinct endpoints and attempt a non-blocking wait with the high-priority endpoint id first, before falling back to open wait.

Since you can't guarantee retry on restart, there would need to be a way to detect that the sender didn't retry so that you could move on to the next sender. We've never found a clean way to do that.

I haven't got all of the possibilities covered yet, but I don't know if there's value in my going into detail since I don't think that technique will apply as cleanly to Coyotos.

Sorry, that's flippant of me.

The cheap way to do it is every time the receiver is scheduled it schedules another waiting process; so if the sender doesn't retry in time, it will have competition.

The more sophisticated way to do it is to have a separate state for "ready to send", storing all of the detail required for the send to happen without going back into userspace.  To do this, you need to make sure you've got no more marshalling to do, hence the removal of indirect strings.  I feel like I'm breaking some unwritten invariant but that's nothing new.

I do enjoy bouncing design around with you, often more than implementation.

--
William ML Leslie

William ML Leslie

unread,
Nov 4, 2025, 7:40:51 AM (17 hours ago) Nov 4
to cap-...@googlegroups.com
On Mon, 3 Nov 2025 at 17:03, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
William wrote:
 
FWIW I would love to know how you define scheduler activations.  Not sure if I've asked you this before.

This deserves a separate discussion. Reference:

Anderson et al., "Scheduler Activations: Effective Kernel-Level Support for User-Level Management of Parallelism", ACM Transactions on Computer Systems, 10(1), Feb. 1992

I'll sketch it below, but the Anderson paper is brilliant. It's the only credible start I have seen on unifying kernel-mode and user-mode scheduling. I strongly recommend it, if only to put structure on the problem.

Thank you, that brings a lot of clarity.

 
Extension to IPC:

Proceeding from the sketch above, it isn't hard to imagine extending the scheduler activation concept to IPC receive blocks. I have a memory that L4 explored something similar. Here's a sketch:
  • Processes register a receive block with the kernel. This block describes where the next in-bound message should go. The receive block additionally specifies the address of the next receive block, or null if no further receive block has been defined.
  • If the current receive block pointer is null, the process is not receiving and the sender blocks (there are user/kernel handshake implications on the receive side).
  • When an in-coming message arrives, it is delivered according to the current receive block, and the current receive block pointer is advanced to the next receive block.
  • If a message has been delivered, the kernel resumes at a "message has been delivered" entry point, which (similar to scheduler activations) decides what to run next in the user-level application.
Unlike scheduler activations, there is a handshake needed between the application and the kernel to update the current receive block address. The receive block pointer itself can be updated with CAS, but a kernel transition is required to wake up blocked senders.

When the receiver is activated, many choices are possible: the received message blocks can be stuck in a queue, the current user-mode thread can be preempted, or some other policy can apply. The main points are:
  • Message receipt is non-blocking provided the receiver has an active and well-formed receive block at the time of receipt.
  • Message receipt does not necessarily imply preemption of the active user-mode thread. The application-level scheduler gets to decide whether to continue the current user-level thread (presumably queueing the message) or preempt it in favor of a "receiving" user-level thread.
  • In abstract, the application-level scheduler could dynamically scale the receive buffer pool in response to load.
Open issues:
  • Should all process resumptions be handled as if an in-bound resumption message has been sent?
  • How should "foreign" processes behave if this is implemented? By virtue of being foreign they cannot receive normal IPC messages in any case. The question is how to support the emulated system call interface?
    • If we assume foreign applications are recompiled, we can bake support into the foreign application runtime, but
    • If binary compatibility for foreign APIs is a goal we need to decide what to do and how to do it.
  • None of this solves the problem of preserving scheduling priority across a call. Anderson's proposed mechanism doesn't attempt to address this. Which is probably a good call.

This is a radical departure from GNOSIS/KeyKOS/EROS, and it is (at best) 3/4 baked. But if it can be worked through sensibly it seems to offer new and interesting concurrency options.

I see this making non-blocking receive possible.  Not sure about non-blocking send, yet.

What stack do you run the handler on?  To explore: how far is this from having a separate process to handle all of your events for you?  If you've just got the one user thread, at least.

--
William ML Leslie

Jonathan S. Shapiro

unread,
Nov 4, 2025, 1:24:03 PM (11 hours ago) Nov 4
to cap-...@googlegroups.com
On Mon, Nov 3, 2025 at 8:01 PM Gernot Heiser <gernot...@gmail.com> wrote:
On 3 Nov 2025, at 18:03, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:

Extension to IPC:

This is a radical departure from GNOSIS/KeyKOS/EROS, and it is (at best) 3/4 baked. But if it can be worked through sensibly it seems to offer new and interesting concurrency options.
why?

Why should the kernel deal with user-level threads?

Lazy process switching is indeed the paper I was remembering.

Thanks, Gernot, I've been thinking about this completely wrong. Your comment reminds me that activations are not needed at all for the IPC path to make a user-level scheduling decision. Since it's an in-band resumption, that can be handled by application-level library code with no change. Activations are only useful/necessary if we want the application to know when it resumes from preemption, because that's an out-of-band resumption. Actually, this is quite a relief, because activations do not play nicely with foreign process traps (emulated APIs).

I'm now thinking that the only reason to even consider having the kernel IPC code touch an activation block would involve some argument about handling wakeup in a consistent way at the user/kernel API boundary. I don't think that's compelling unless it comes with other benefits.


I think this topic can now be filed under "Being tired and distracted makes Jonathan stupid". :-)


Jonathan

Jonathan S. Shapiro

unread,
Nov 4, 2025, 1:31:44 PM (11 hours ago) Nov 4
to cap-...@googlegroups.com
If receive is non-blocking, then send is non-blocking provided nobody page faults - which isn't a kernel problem.

But yes, this is what got me started on this path. As Gernot pointed out today, there are better ways to handle application-level scheduling for this scenario. I'd add that we already have the needed mechanisms to have the receiving kernel thread be different from the kernel thread that actually does the work in the receiver.
 
What stack do you run the handler on?  To explore: how far is this from having a separate process to handle all of your events for you?  If you've just got the one user thread, at least.

It probably would use a tiny stack in the activation block. You could almost run it on the receiving thread stack, but if you got really unlucky that stack might fault and then you wouldn't be able to receive until a fault handler cleaned you up.

Gernot's point completely severs IPC from activations, so I'm going to set the activations idea aside. I'm definitely still spinning back up in this space, and happy to take this one off of my plate.


Jonathan 