You can't just drop things like draw instructions, because you'd get invalid output, i.e. glitchy, unusable display results, no?
I was not trying to say the real-time part should suffer. Just that, in some situations, it appears to me to be almost by definition an intractable impedance mismatch.
On Sun, 26 Oct 2025 at 09:19, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:

I had a mildly interesting idea today while writing up scheduling challenges in Coyotos. The 30,000 foot problem is that communication across schedule boundaries is really hard to do without violating scheduling reservation contracts.

Imagine a situation where some real-time component needs to talk to (e.g.) a display compositor, whose schedule probably has a lower level of service than its real-time caller. It seems to me that there are basically three ways to deal with this:
- Bounded buffering, where display operations will be dropped if the buffer isn't drained in a timely way, or
- "Donating" a process pre-configured with a real-time schedule for use by the shared service, or
- "Lending" the real-time schedule at IPC time by sending a schedule capability for the service to use directly.
The first two introduce significant recovery complexities if the buffer is revoked (invalid reference in the shared server after revocation), though we managed to sort out the memory revocation problem in the EROS high-speed networking work. If the Process is revoked, the shared service needs a way to notice, and then to clean up after a thread of control that vanished in mid-execution, which is much harder than recovering from a revoked buffer.

Lending a schedule capability seems cleaner, but then we have the problem that the service might "leak" that capability - which upon use would violate the reservation requirements of the client.

It eventually occurred to me that a mutually trusted intermediary might work. The service client provides the schedule capability, the service provides the storage for a new process using this capability (and is therefore protected from revocation), and a fault handler is installed in such a way that if the real-time schedule is revoked, a more pedestrian "backup" schedule can be inserted to allow cleanup before the process is destroyed. Following which, a Coyotos notice (a single-bit in-band signal that gets delivered at the next receive-wait) can be used to tell the process that it's time to retire.

It feels heavy, but like the proverbial dancing bear, it's amazing that it dances at all.

So you are creating a process specifically for use by this client? That definitely makes the problem easier to manage.
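To make the fallback part of that concrete, here is its shape in code - a minimal model with hypothetical types and names throughout, not the Coyotos API:

```c
/* Minimal model of the trusted-intermediary idea. All names here are
 * hypothetical; only the protocol shape is taken from the text above. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool revoked;             /* set when the client revokes the capability */
} sched_cap_t;

typedef struct {
    sched_cap_t *sched;       /* client-provided real-time schedule */
    sched_cap_t *backup;      /* pedestrian cleanup schedule */
    bool retire_notice;       /* single-bit notice, seen at next receive-wait */
} worker_t;

/* Fault handler installed by the intermediary: when the real-time
 * schedule is revoked, fall back to the backup schedule and post the
 * notice telling the worker it's time to retire. */
static void on_schedule_revoked(worker_t *w) {
    w->sched = w->backup;
    w->retire_notice = true;
}

int main(void) {
    sched_cap_t rt = { false }, backup = { false };
    worker_t w = { &rt, &backup, false };

    rt.revoked = true;                 /* client revokes its reservation */
    if (w.sched->revoked)
        on_schedule_revoked(&w);       /* kernel-raised fault, handled here */

    printf("retiring: %d\n", w.retire_notice);   /* prints: retiring: 1 */
    return 0;
}
```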
There's no interface to get notified of a schedule being revoked yet, but there's no reason you couldn't make one.
If the callee doesn't have a schedule configured by default, it should inherit the caller's schedule until it responds. Are there situations where you don't want to use that mechanism, or where that mechanism isn't sufficient for you?
On Sun, Oct 26, 2025 at 3:30 AM William ML Leslie <william.l...@gmail.com> wrote:

There's no interface to get notified of a schedule being revoked yet, but there's no reason you couldn't make one.

I think it would turn into a fault code that gets delivered to the per-process fault handler. There's an already defined NoSchedule fault that would work for this.
If the callee doesn't have a schedule configured by default, it should inherit the caller's schedule until it responds. Are there situations where you don't want to use that mechanism, or where that mechanism isn't sufficient for you?

It can't inherit the caller's schedule in the literal sense; that would leak the schedule capability. It is possible to allow the recipient to run on the remainder of the caller's slice, but (a) it's technically an authority leak, and (b) you can't do that on reserved schedules without making a heck of a mess unless both sides run on the same reserve.
Does [NoSchedule] get delivered to the handlers of all of the processes using this schedule statically? Perhaps more critically: when? Do you want to walk over all of these during revocation?
Notice that schedules have processes they will cause to run, and processes also have a schedule - that is, there are references that go each way, and they mean different things.
This is how I thought this worked:

- a process that has a schedule configured (i.e. has a valid schedule in its schedule slot) always runs on that schedule. When receiving a message, the process is placed on that schedule's active list and control is not immediately transferred to that process.
- a schedule has (one, or a stack of, or a FIFO queue of?) processes. When the schedule is invoked, either by scheduler activation, interrupt, or return from interrupt, a process is selected from the schedule and given time.
On Mon, Oct 27, 2025 at 1:05 AM William ML Leslie <william.l...@gmail.com> wrote:

Does [NoSchedule] get delivered to the handlers of all of the processes using this schedule statically? Perhaps more critically: when? Do you want to walk over all of these during revocation?

It does not. When a process attempts to run on a revoked schedule, the kernel will discover that its schedule capability has become the null capability, which causes the fault to be issued for that process. If a replacement schedule is installed before a particular process attempts to run, there's no reason to issue a fault for it.
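In code, the lazy check might look something like this - hypothetical types, not the actual Coyotos kernel code:

```c
/* Lazy revocation detection at dispatch time. Nothing walks the set of
 * capability holders when a schedule is revoked; each process discovers
 * the null capability only when it next attempts to run. */
#include <stdbool.h>
#include <stddef.h>

typedef struct { int id; } schedule_t;

typedef struct {
    schedule_t *sched_cap;    /* becomes NULL when the capability is revoked */
    bool fault_pending;       /* NoSchedule fault to deliver to the handler */
} process_t;

/* Called only when 'p' is actually picked to run. */
static bool may_dispatch(process_t *p) {
    if (p->sched_cap == NULL) {
        p->fault_pending = true;   /* deliver NoSchedule to its fault handler */
        return false;
    }
    /* If a replacement schedule was installed before this point,
     * no fault is ever issued for this process. */
    return true;
}
```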
Taking your next point slightly out of order:

Notice that schedules have processes they will cause to run, and processes also have a schedule - that is, there are references that go each way, and they mean different things.

That reference relationship may not exist in either direction if the process hasn't been dispatched yet. For processes that have been dispatched, I suppose you could walk the ready queue for the schedule, but that wouldn't get you the processes that are currently on-CPU. You'd have to issue a broadcast IPI to kick all of the currently executing processes off of their respective cores so they were all on the appropriate ready queue, do your thing, and then let them resume if the current reserve hasn't run out.
Coyotos got rid of the doubly linked list structure for finding capabilities because the cache behavior was terrible. It has been replaced by an incremental background tracing mechanism.

You mentioned scheduler activations. I was looking at the code yesterday, and it looks like we backed scheduler activations out when we hit the ARM issue. It had been about 13 years since I last looked at that code, and it seems that I was mixed up about where it stood.
This is how I thought this worked: a process that has a schedule configured (i.e. has a valid schedule in its schedule slot) always runs on that schedule. When receiving a message, the process is placed on that schedule's active list and control is not immediately transferred to that process.

This is not what currently happens. In EROS, we took the view that the logical flow of control passes from invoker to invokee in an IPC, and we allow the invokee to run on the currently active slice (the one from the invoker) until (a) the invokee blocks, or (b) the slice ends. What is true is that when a process comes off of a stall queue, it goes on the ready queue associated with the schedule named by its scheduling capability.

This was never a good answer; it was less bad than any other answer. Following the logical control flow from invoker to invokee is fine when the two processes are running on the same schedule, but it leaks computons if the invokee runs on a lower schedule.

We don't have hard real-time schedules in EROS, but in most hard real-time schemes we would have to take the view that the logical thread changes at the schedule boundary (not at the IPC boundary). You can send messages between two processes in the same schedule without giving up your slice, but hard real-time scheduling regimes (e.g. ARINC-653 and friends) require strict temporal isolation, which means that you cannot "give away" unused computons.
In Coyotos there's an added twist: in a communication from Peter to Paul, the two might not be assigned to the same processor performance class. It's not clear whether we want to revise Paul's processor class in this situation, but at the moment we cross the process boundary all of the state we need in order to resume Paul is sitting in the cpuFor(Peter) cache.
In the movie The Devil's Own, Brad Pitt comes out with: "If you're not confused, you don't understand what's going on." I'm pretty sure he was talking about scheduling. :-)
- a schedule has (one, or a stack of, or a FIFO queue of?) processes. When the schedule is invoked, either by scheduler activation, interrupt, or return from interrupt, a process is selected from the schedule and given time.

Within a schedule, it's a FIFO ready queue. Schedules aren't invoked by any of these events. What happens is that the appropriate receiving process becomes ready, or in the case of end-of-interrupt, a process reverts to receiving. In either case, the scheduler operates according to its normal behavior.
I wonder how schedules tie into the selection of a process after a timer interrupt. I've been imagining that a timer interrupt, a process yielding, blocking on an object load, etc. results in selecting the next schedule and picking the first process off of it. In that model, I don't know how a process could be selected to run if it's not on a valid schedule. I guess it could happen if the current process blocks on receive? Is that the only case?
The image format as it stands doesn't store schedules, but you presumably do want to configure your schedules as part of the image. Once we do that, processes that haven't been dispatched will also have both relations set up.
I only imagine a `donate` method on a schedule object so that you can do user-level scheduling if you like.
The checkpointing question is about how queues are represented to the checkpointer. My understanding from the CapROS and Coyotos sources is that they aren't.
The practical impact is that all processes need to be in-memory, even if they haven't had anything to do in weeks.
If they aren't in memory but are waiting on a process to receive, there's no existing way to know to wake them.
I think there's consensus at this point that Coyotos (and seL4) Call should really be treated like a call to a library function: the callee can just refuse to return, so if you don't trust the invokee to honour their temporal part of this contract, you can't make any promises about timeliness no matter how scheduling is implemented. If it's the case that you need to trust the timing behaviours of your callee, why not trust them to borrow your schedule? At least trust that you can put those processes on your schedule until they reply to you.
In Coyotos there's an added twist: in a communication from Peter to Paul, the two might not be assigned to the same processor performance class. It's not clear whether we want to revise Paul's processor class in this situation, but at the moment we cross the process boundary all of the state we need in order to resume Paul is sitting in the cpuFor(Peter) cache.

100%. We've been assuming, I think, that we don't want the kernel making those sorts of decisions. The last version you published doesn't pin Processes to CPUs (or Schedules to CPUs, because schedules aren't implemented); instead, it's one global ready queue. In my fork I have Processes pinned, but the APIs both at runtime and in the image aren't there to configure that.
In the movie The Devil's Own, Brad Pitt comes out with: "If you're not confused, you don't understand what's going on." I'm pretty sure he was talking about scheduling. :-)

No doubt, but I'm extra confused because not only do I have "the way the current implementation does it", "how seL4 does it, both MCS and Original Recipe", and "things we talked about 20 years ago" in my head, I also have "how a terrifying, undocumented branch in my own fork does it". So I'll definitely take some time to page all of that back in.
- a schedule has (one, or a stack of, or a FIFO queue of?) processes. When the schedule is invoked, either by scheduler activation, interrupt, or return from interrupt, a process is selected from the schedule and given time.

Within a schedule, it's a FIFO ready queue. Schedules aren't invoked by any of these events. What happens is that the appropriate receiving process becomes ready, or in the case of end-of-interrupt, a process reverts to receiving. In either case, the scheduler operates according to its normal behavior.

I'm not sure what its normal behaviour is! We talked about it a lot back in 2008 or so, but did we nail anything down?
On Mon, Oct 27, 2025 at 9:37 PM William ML Leslie <william.l...@gmail.com> wrote:

I wonder how schedules tie into the selection of a process after a timer interrupt. I've been imagining that a timer interrupt, a process yielding, blocking on an object load, etc. results in selecting the next schedule and picking the first process off of it. In that model, I don't know how a process could be selected to run if it's not on a valid schedule. I guess it could happen if the current process blocks on receive? Is that the only case?

Answers:
- A process that has voluntarily given up the CPU (yielded) moves to the back of the ready queue for its schedule. There will probably also be a yield variant for real-time schedules that yields the remainder of the current scheduling period.
- If you are blocked on an object load, you're on the stall queue for the object being loaded. Once loaded, that stall queue is awakened and the member processes get put back on their respective ready queues.
- A process on a CPU has a slice. When the slice times out or the process goes to a receiving state, it is ineligible to run until something returns it to the running state.
- When a CPU has no active process, the process selection logic is used to find something to run.
- A process on an invalid or malformed schedule will by courtesy run long enough to send a NoSchedule fault if (and only if) a fault handler can be found for that process.
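Pulling those rules together, the selection logic might look roughly like this - hypothetical structures, not the actual kernel's. Stalled and receiving processes are simply absent from every ready queue:

```c
/* FIFO ready queues, one per schedule, scanned in precedence order. */
#include <stddef.h>

typedef struct process process_t;
struct process {
    process_t *next;                 /* FIFO link within a ready queue */
};

typedef struct {
    process_t *head, *tail;          /* FIFO ready queue for one schedule */
} schedule_t;

/* Yield: the process goes to the back of its schedule's ready queue. */
static void yield(schedule_t *s, process_t *p) {
    p->next = NULL;
    if (s->tail) s->tail->next = p; else s->head = p;
    s->tail = p;
}

/* When a CPU has no active process: take the first process from the
 * highest-precedence non-empty schedule, or idle if there is none. */
static process_t *select_next(schedule_t *scheds[], size_t nscheds) {
    for (size_t i = 0; i < nscheds; i++) {
        process_t *p = scheds[i]->head;
        if (p) {
            scheds[i]->head = p->next;
            if (scheds[i]->head == NULL) scheds[i]->tail = NULL;
            return p;
        }
    }
    return NULL;                     /* nothing ready: idle */
}
```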
The image format as it stands doesn't store schedules, but you presumably do want to configure your schedules as part of the image. Once we do that, processes that haven't been dispatched will also have both relations set up.
- The general purpose priority schedules are statically defined, so they don't need to be in the image.
- Real-time schedules have presumably blown their deadlines on restart; I need to work through a protocol to inform those processes that this has happened.
- The kernel scheduler entries were computed by an application that loaded them into the kernel tables. That process can reload them at restart.
The actual challenge here is to get all of the RT processes back to a stable resumption state w.r.t. their real-time schedules.
If they aren't in memory but are waiting on a process to receive, there's no existing way to know to wake them.

You have the directionality of the wait mixed up. You don't wait on a process to receive. You wait to receive. If you only want to receive from a particular process, you give it a different endpoint and filter. The pointers and capabilities involved always point in the forward direction.
I think there's consensus at this point that Coyotos (and seL4) Call should really be treated like a call to a library function: the callee can just refuse to return, so if you don't trust the invokee to honour their temporal part of this contract, you can't make any promises about timeliness no matter how scheduling is implemented. If it's the case that you need to trust the timing behaviours of your callee, why not trust them to borrow your schedule? At least trust that you can put those processes on your schedule until they reply to you.

Since there is no such thing as a Coyotos "call", there may be less of a consensus than you imagine. The only operation in Coyotos is send-and-receive. Coyotos no longer distinguishes the available and waiting states that existed in EROS.

A closed receive is implemented with an endpoint filter. The moral equivalent of an EROS resume capability is an entry capability to an endpoint that serves as the resume endpoint.

I'm definitely not going to bake a call/return assumption into the IPC mechanism.
Since the placement decision is performance critical, I think the kernel has to make the placement decision. I need to understand more about the important hard real-time scheduling methods to understand what pinning is important. The mechanism, I think, would be to have some schedules that draw on a specific CPU or possibly a specific CPU cluster/class. Though for "computons per period" schedules you could probably dispatch to a wide range of CPUs just fine if the deadline lets you.

You've clearly spent some time thinking about this. I'd love to see where your thinking is at!
In the movie The Devil's Own, Brad Pitt comes out with: "If you're not confused, you don't understand what's going on." I'm pretty sure he was talking about scheduling. :-)

No doubt, but I'm extra confused because not only do I have "the way the current implementation does it", "how seL4 does it, both MCS and Original Recipe", and "things we talked about 20 years ago" in my head, I also have "how a terrifying, undocumented branch in my own fork does it". So I'll definitely take some time to page all of that back in.

That should be good for extra confusion, which ought to qualify you as a domain expert. :-)
Within a schedule, it's a FIFO ready queue. Schedules aren't invoked by any of these events. What happens is that the appropriate receiving process becomes ready, or in the case of end-of-interrupt, a process reverts to receiving. In either case, the scheduler operates according to its normal behavior.

I'm not sure what its normal behaviour is! We talked about it a lot back in 2008 or so, but did we nail anything down?

I think we mostly nailed down that we need to nail this down.

I suspect we're going to end up with a table-driven scheduler that is ordered by either priority or deadline. That schedule gets computed by an application-level scheduler when new scheduling contracts are created.
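For the table-driven approach, the kernel-side structure could be as small as this - a sketch with hypothetical names; the interesting work lives in the application-level scheduler that computes the table:

```c
/* A dispatch table ordered by priority or deadline, recomputed and
 * reloaded by an application-level scheduler whenever a scheduling
 * contract is created or destroyed. */
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t key;         /* priority or absolute deadline, per policy */
    uint32_t sched_id;    /* which schedule this entry dispatches */
} sched_entry_t;

static int by_key(const void *a, const void *b) {
    const sched_entry_t *x = a, *y = b;
    return (x->key > y->key) - (x->key < y->key);
}

/* Application side: order the table before loading it into the kernel. */
static void compute_table(sched_entry_t *table, size_t n) {
    qsort(table, n, sizeof *table, by_key);
}

/* Kernel side: dispatch is then "walk the table in order and take the
 * first entry whose schedule has a ready process". */
```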
It's good to know that these things end up with these processes no longer being on any run queue. The reason I ask is, with the introduction of Schedules, you also presumably want to do something specific with the process selection logic. I pictured the selection logic as first choosing a Schedule, and then selecting a process from that schedule. The problem as I had seen it was that a rescinded schedule would never be selected, so the process selection logic wouldn't see it. Maybe I'm off in the rough.
The actual challenge here is to get all of the RT processes back to a stable resumption state w.r.t. their real-time schedules.

You want to have a fixed (small) number of buckets in the image? So e.g. all processes at priority 20 get equivalent slices?
I'm waiting to send to a process; I'm waiting on that process to receive.
This is how it's currently implemented. The only reason we can't hit this problem is because processes are always in memory in the current implementations.
The send portion is either blocking or it is non-blocking. If it is non-blocking, we don't have this problem, we just have to deal with the fact that the service we are trying to use might not have received our message. I don't know if you can build a useful system using only non-blocking send.
If it is blocking, it's a Call by another name. You either wait until you get a reply, or you shouldn't expect a reply.
The case I was dealing with is not so sophisticated. Some PCIe devices (notably virtio, some network hardware) may expose more than one command/response buffer, so that you can fill them from different cores.
The real finesse will be in keeping the kernel API sufficiently expressive but the implementation clean. I struggled with this, but you might be able to see the way.
On Tue, Oct 28, 2025 at 2:31 AM William ML Leslie <william.l...@gmail.com> wrote:

It's good to know that these things end up with these processes no longer being on any run queue. The reason I ask is, with the introduction of Schedules, you also presumably want to do something specific with the process selection logic. I pictured the selection logic as first choosing a Schedule, and then selecting a process from that schedule. The problem as I had seen it was that a rescinded schedule would never be selected, so the process selection logic wouldn't see it. Maybe I'm off in the rough.

In the abstract, a rescinded schedule should behave much like a rescinded page, in which case what you describe is the "principled" solution.

In practice, letting them run at extremely low priority long enough to deliver their own death notices is harmless (those are idle cycles anyway), and saves us the trouble of introducing a more complicated solution.
You want to have a fixed (small) number of buckets in the image? So e.g. all processes at priority 20 get equivalent slices?

The bottom tier of schedules is simple priorities. Empirically, it's not useful to have a crazy number of them, so there's no real schedule data structure to save. You need to save the capabilities, but those are stored in the process structure.

The only thing we actually need to know about processes in the checkpoint area is which ones were running. That can be done with a single bit. The schedule will get fixed up when the schedule capabilities are prepared, so all we need to do is to get them to a point where that preparation will actually happen.

Sticking them on the highest-priority ready queue (a priority only used for restart) is a fine way to make that happen.
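That restart path is small enough to sketch - hypothetical structures, not CapROS/Coyotos code:

```c
/* Restart: one checkpointed "was running" bit per process, and a
 * restart-only top-priority ready queue. Each process's real schedule
 * gets fixed up when its schedule capability is next prepared. */
#include <stdbool.h>
#include <stddef.h>

typedef struct process process_t;
struct process {
    process_t *next;
    bool was_running;        /* the single bit saved at checkpoint */
};

static process_t *restart_queue;   /* highest priority, restart only */

static void restart_enqueue(process_t *p) {
    if (p->was_running) {
        p->next = restart_queue;   /* order doesn't matter here */
        restart_queue = p;
    }
}
```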
I'm waiting to send to a process; I'm waiting on that process to receive.

In that case you are on a stall queue associated with the target process. Processes that are on a stall queue at checkpoint time are in the running state. By definition, that means they are in memory.
The send portion is either blocking or it is non-blocking. If it is non-blocking, we don't have this problem, we just have to deal with the fact that the service we are trying to use might not have received our message. I don't know if you can build a useful system using only non-blocking send.

At present, there is no such thing as a non-blocking send. There is an AppNotice mechanism that can send something similar to a signal. Sending an AppNotice is non-blocking, but receipt is not preemptive.
This muddle is why I was looking at scheduler activations.
If it is blocking, it's a Call by another name. You either wait until you get a reply, or you shouldn't expect a reply.

I get what you're saying, and you're mostly right. My quibble is that when people talk about an IPC "call", they tend to mean that the receiver must reply to the call before doing something else. In systems that don't implement that expectation, naming the mechanism "call" creates a surprising amount of confusion when you try to explain IPC.
The case I was dealing with is not so sophisticated. Some PCIe devices (notably virtio, some network hardware) may expose more than one command/response buffer, so that you can fill them from different cores.

Yes. Sigh. Right now that requires multiple processes and notices, and if you need any kind of buffer performance you really want all of the receiving processes to be on the same CPU that is going to be doing the work. If you're moving data, the cache locality is probably more important than the IPI.

Needing multiple processes is an annoying complication. This is exactly the sort of thing where a scheduler activation system works really well.
In practice, letting them run at extremely low priority long enough to deliver their own death notices is harmless (those are idle cycles anyway), and saves us the trouble of introducing a more complicated solution.

Right, cool. So these processes are on a low-priority ready queue, so they can be found easily.
Sounds good. Do you plan to use something like a KeyKOS meter to give high-priority processes bounded slices? How is priority related to which interrupts remain enabled while running? Not necessarily questions for right now, but they aren't answered in the existing documentation.
I'm waiting to send to a process; I'm waiting on that process to receive.

In that case you are on a stall queue associated with the target process. Processes that are on a stall queue at checkpoint time are in the running state. By definition, that means they are in memory.

Currently, when a process goes into receiving mode, the kernel wakes all processes waiting to send to that target. This has some upsides - higher-priority processes have their message processed sooner - and some downsides, such as a significant amount of churn in the stall and ready queues that can mean individual processes might not make progress. I don't think there are semantic problems with message ordering, but I could be wrong.
Oh - and all of them are woken up, even if the receiver expects an endpoint match.
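That wake-all behaviour in miniature - hypothetical types; each awakened sender simply retries its send when it is next dispatched:

```c
/* When the target enters the receiving state, every stalled sender is
 * moved back to the ready queue, whether or not its endpoint would
 * match the receiver's filter. */
#include <stddef.h>

typedef struct proc proc_t;
struct proc {
    proc_t *next;            /* stall-queue / ready-queue link */
};

static proc_t *ready_head;   /* simplification: one global ready list */

static void make_ready(proc_t *p) {
    p->next = ready_head;
    ready_head = p;
}

static void wake_all_senders(proc_t **stall_queue) {
    while (*stall_queue) {
        proc_t *p = *stall_queue;
        *stall_queue = p->next;
        make_ready(p);       /* sender retries its send when it runs */
    }
}
```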
The send portion is either blocking or it is non-blocking. If it is non-blocking, we don't have this problem, we just have to deal with the fact that the service we are trying to use might not have received our message. I don't know if you can build a useful system using only non-blocking send.

At present, there is no such thing as a non-blocking send. There is an AppNotice mechanism that can send something similar to a signal. Sending an AppNotice is non-blocking, but receipt is not preemptive.

Non-blocking send is how the server's reply operation is implemented in Coyotos, which keeps the sender safe from a client who is not currently receiving.
If the target isn't receiving, or if there is a receiver memory fault during IPC, the send is ignored and the sender continues with the receive phase, if present; otherwise it continues execution.
This muddle is why I was looking at scheduler activations.

FWIW I would love to know how you define scheduler activations. Not sure if I've asked you this before. I have been reading this as "attempting to send to a process (or some equivalent interaction with it) causes it to run immediately". Do you mean that a specific function is run immediately? Who provides the stack?
I get what you're saying, and you're mostly right. My quibble is that when people talk about an IPC "call", they tend to mean that the receiver must reply to the call before doing something else.
Ah, I see. Gernot is picky about using "protected procedure call" over "inter-process communication" for related reasons.
You played with scatter/gather in the early days of Coyotos, I wonder if we'll revisit that once we get some real SMP applications.
On Sun, Nov 2, 2025 at 12:38 AM William ML Leslie <william.l...@gmail.com> wrote:

Currently, when a process goes into receiving mode, the kernel wakes all processes waiting to send to that target. This has some upsides - higher-priority processes have their message processed sooner - and some downsides, such as a significant amount of churn in the stall and ready queues that can mean individual processes might not make progress. I don't think there are semantic problems with message ordering, but I could be wrong.

Yes. But there is another issue to consider. Suppose we wake up only the highest-priority sender. When this sender is re-started, there is no guarantee that it will re-issue the send. If it does not, and the other senders are not awakened, then they will not be awakened until some completely unrelated process sends a message to the receiver, which eventually causes it to re-enter the receiving state.
Non-blocking send is how the server's reply operation is implemented in Coyotos, which keeps the sender safe from a client who is not currently receiving.

That is not (quite) correct. The replying process has an entry capability that is supposed to serve as a reply endpoint, but:
- this does not guarantee that the receiver (who, at this point, is probably the caller awaiting a return) is still in a receiving state
- this does not guarantee that the receiver's receive block matches the intended cohort (might not be an issue - I need to dig back into the code)
- this does not guarantee that the receiver still exists.
But in any of these cases, the receiver has broken the contract. The defense is that the replying process replies with the "will not wait" bit set. If the receiver has defected, the message will be lost, the replying process will continue, and the receiver may remain in the receiving state forever with a filter that cannot (ever) be matched.

Which all sounds reasonable until the receiver is paused for debugging. There are workarounds for that, but nobody would confuse them for being pretty.
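The control flow of that defense fits in a few lines - hypothetical names; only the decision logic is taken from the description above:

```c
/* Reply with the "will not wait" bit: a defecting caller just loses
 * the message, and the replier never blocks. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool exists;             /* has the caller been destroyed? */
    bool receiving;          /* is it parked in the receive state? */
    bool filter_matches;     /* does its receive filter match this reply? */
} caller_t;

/* Returns true if the reply was delivered; false means the caller
 * broke the contract and the reply was dropped. */
static bool reply_will_not_wait(caller_t *c) {
    if (c == NULL || !c->exists || !c->receiving || !c->filter_matches)
        return false;        /* defection: drop the reply, continue */
    /* ... copy the message into the caller's receive area ... */
    return true;
}
```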
If the target isn't receiving, or if there is a receiver memory fault during IPC, the send is ignored and the sender continues with the receive phase, if present; otherwise it continues execution.

Not quite. In the return case, the receiver is expected to have set up their entry memory region[s] in advance, and we can argue that they have defected if there is a memory fault. But even if that has occurred, we should probably try to issue a fault notice to the receiver's fault handler advising that the receiver committed seppuku through gross stupidity. Debugging, again, being an entertaining exception.
You played with scatter/gather in the early days of Coyotos, I wonder if we'll revisit that once we get some real SMP applications.

The spec for receiving regions still has that capability. It's uncomfortably "orthogonal addressing" from my point of view. Very reminiscent of VAX or 68000. But I haven't found a better approach.
FWIW I would love to know how you define scheduler activations. Not sure if I've asked you this before.
Anderson et al., "Scheduler Activations: Effective Kernel-Level Support for User-Level Management of Parallelism", ACM Transactions on Computer Systems, 10(1), Feb. 1992
On Mon, 3 Nov 2025 at 15:40, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:

Yes. But there is another issue to consider. Suppose we wake up only the highest-priority sender. When this sender is re-started, there is no guarantee that it will re-issue the send. If it does not, and the other senders are not awakened, then they will not be awakened until some completely unrelated process sends a message to the receiver, which eventually causes it to re-enter the receiving state.

My hack for this was that if a process is in the receiving state, and the stall queue is not empty, I place it on the ready queue. If a process is selected from the ready queue and it's in open wait, I peel off the first process in the stall queue and schedule it.
It is a bit more complex than that; I moved the stall queues to the endpoints and built an rbtree of those. I don't know that I recommend this.
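The hack itself is compact - hypothetical types, ignoring the rbtree-of-endpoints part:

```c
/* A receiver in open wait with a non-empty stall queue is itself kept
 * on the ready queue; when the selector picks it, the first stalled
 * sender is peeled off and dispatched in its place. */
#include <stdbool.h>
#include <stddef.h>

typedef struct proc proc_t;
struct proc {
    proc_t *stall_head;      /* senders waiting on this receiver */
    proc_t *next;            /* link within a stall queue */
    bool    open_wait;       /* receiving with no endpoint filter */
};

/* Called when the selector pulls 'p' off the ready queue. */
static proc_t *on_selected(proc_t *p) {
    if (p->open_wait && p->stall_head != NULL) {
        proc_t *sender = p->stall_head;   /* peel off the first sender */
        p->stall_head = sender->next;
        return sender;                    /* run the sender instead */
    }
    return p;
}
```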
But in any of these cases, the receiver has broken the contract. The defense is that the replying process replies with the "will not wait" bit set. If the receiver has defected, the message will be lost, the replying process will continue, and the receiver may remain in the receiving state forever with a filter that cannot (ever) be matched. Which all sounds reasonable until the receiver is paused for debugging. There are workarounds for that, but nobody would confuse them for being pretty.

Can't we just pretend that receive is atomic? :)
If the target isn't receiving, or if there is a receiver memory fault during IPC, the send is ignored and the sender continues with the receive phase, if present; otherwise it continues execution.

Not quite. In the return case, the receiver is expected to have set up their entry memory region[s] in advance, and we can argue that they have defected if there is a memory fault. But even if that has occurred, we should probably try to issue a fault notice to the receiver's fault handler advising that the receiver committed seppuku through gross stupidity. Debugging, again, being an entertaining exception.

Delivering the fault sounds fine; but we can still choose not to block the sender to run a fault handler. The receiver loses their message, but at least they get to know why.
You played with scatter/gather in the early days of Coyotos, I wonder if we'll revisit that once we get some real SMP applications.

The spec for receiving regions still has that capability. It's uncomfortably "orthogonal addressing" from my point of view. Very reminiscent of VAX or 68000. But I haven't found a better approach.

I ripped out indirect strings in my port (which was probably not a good idea), but the balance is I now have capabilities for slots in GPTs. You can name one during receive to say "install the sent capability here" and it will opaquely map a sent memory object for you. I think about doing more like that.
William wrote:

FWIW I would love to know how you define scheduler activations. Not sure if I've asked you this before.

This deserves a separate discussion. Reference:

Anderson et al., "Scheduler Activations: Effective Kernel-Level Support for User-Level Management of Parallelism", ACM Transactions on Computer Systems, 10(1), Feb. 1992

I'll sketch it below, but the Anderson paper is brilliant. It's the only credible start I have seen on unifying kernel-mode and user-mode scheduling. I strongly recommend it, if only to put structure on the problem.
A link to this paper: https://web.eecs.umich.edu/~mosharaf/Readings/Scheduler-Activations.pdf
See also the Wikipedia page for SA. Note that both NetBSD & FreeBSD ripped out support for SA (M user threads multiplexed over N kernel threads) in favor of 1:1 when they implemented SMP support.
On Sun, Nov 2, 2025 at 10:45 PM William ML Leslie <william.l...@gmail.com> wrote:

On Mon, 3 Nov 2025 at 15:40, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:

Yes. But there is another issue to consider. Suppose we wake up only the highest-priority sender. When this sender is re-started, there is no guarantee that it will re-issue the send. If it does not, and the other senders are not awakened, then they will not be awakened until some completely unrelated process sends a message to the receiver, which eventually causes it to re-enter the receiving state.

My hack for this was that if a process is in the receiving state, and the stall queue is not empty, I place it on the ready queue. If a process is selected from the ready queue and it's in open wait, I peel off the first process in the stall queue and schedule it.

The problem remains that the awakened invoker may not re-attempt. This seems to leave two options:
- Wake all senders and let the scheduler sort it out. At the risk of blasphemy, this might be characterized as the "God will know his own" approach to resumption.
- Wake a preferred sender and stick the rest on a "worry" queue of some sort with a timeout or a recovery mechanism (danger here).

The first approach is simple but optimistic. The second disrespects the scheduler, but at some number of senders that is true of both policies. This is the sort of thing that led Phil Karlton and Jim Gettys (of X10/X11 fame) to adopt "mechanism, not policy" as something like a design religion. Unfortunately, you can't always avoid making choices.
It is a bit more complex than that; I moved the stall queues to the endpoints and built an rbtree of those. I don't know that I recommend this.

What guarantees that these stall queues are ever awakened? Given multiple endpoints targeting a common process, how does this avoid priority inversion?
I can see a story that says "A is invoking C; let's see if some other (more deserving) process B is invoking C, and decide whether to context switch to B rather than allowing A to send." As with politics, any problem that starts with "We wish to benefit the more deserving" turns out to be very problematic. In this case, it guarantees priority inversion.

For reasonable queue lengths, waking all blocked senders seems in practice to be the best of a poor set of options. The root problem is that in the presence of real-time scheduling there is no stable ordering of dispatch precedence. The core distinction between a stall queue and a ready queue is that the stall queue does not embed any policy about order of dispatch.

I can imagine approaches that wake the full stall queue only if the selected sender defects. At some point we'll undoubtedly find a motivation to implement that, but until then simple seems good. Partly because specifying defection, or even "might have defected", seems hard.
Can't we just pretend that receive is atomic? :)

As long as the key word is "pretend", sure! OS architects have as much right to a rich fantasy life as anyone else engaged in fantastical constructions! :-)
If the target isn't receiving, or if there is a receiver memory fault during IPC, the send is ignored and the sender continues with the receive phase, if present; otherwise it continues execution.

Not quite. In the return case, the receiver is expected to have set up their entry memory region[s] in advance, and we can argue that they have defected if there is a memory fault. But even if that has occurred, we should probably try to issue a fault notice to the receiver's fault handler advising that the receiver committed seppuku through gross stupidity. Debugging, again, being an entertaining exception.

Delivering the fault sounds fine; but we can still choose not to block the sender to run a fault handler. The receiver loses their message, but at least they get to know why.

Agreed. I was trying (poorly) to say this, with the caveat that attempting to deliver the fault code at some very low "I'm completely screwed but attempting to tell somebody" priority seems useful. The sender will pay for one attempt to deliver a message to the fault handler, but will not otherwise be blocked. By virtue of happening at "I'm completely screwed" priority, subsequent delivery attempts are de facto non-blocking provided schedule parameters are respected.
You played with scatter/gather in the early days of Coyotos, I wonder if we'll revisit that once we get some real SMP applications.

The spec for receiving regions still has that capability. It's uncomfortably "orthogonal addressing" from my point of view. Very reminiscent of VAX or 68000. But I haven't found a better approach.

I ripped out indirect strings in my port (which was probably not a good idea), but the balance is I now have capabilities for slots in GPTs. You can name one during receive to say "install the sent capability here" and it will opaquely map a sent memory object for you. I think about doing more like that.

That's an interesting idea, and I want to let that rattle around a bit.

My first concern is that the receiver is letting the sender transmit an arbitrary capability into a slot that spans a region of the receiver's address space. There are security issues. Assume the sender is hostile and the receiver needs to be robust. How does the receiver know/respond if the receiver makes a reference to that part of the address space and the sender has destroyed the memory region? Have a look at the high-performance networking work we did in EROS.
On Mon, 3 Nov 2025 at 17:53, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:

The problem remains that the awakened invoker may not re-attempt. This seems to leave two options:

The hopeful receiver is still on the ready queue, and if it is selected again, it will wake the next potential sender in sequence.
What guarantees that these stall queues are ever awakened? Given multiple endpoints targeting a common process, how does this avoid priority inversion?

I was going for guaranteed in-order delivery, because I like the progress story. If a server wants to implement multiple priorities, it can use distinct endpoints and attempt a non-blocking wait with the high-priority endpoint id first, before falling back to open wait.
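A server loop in that style - the two receive primitives here are assumptions sketched for shape, not a real IPC binding:

```c
/* Two-priority service over two endpoints: drain the high-priority
 * endpoint with a non-blocking wait, then fall back to open wait. */
#include <stdbool.h>

typedef struct { int payload; } msg_t;

enum { EP_HIGH = 1 };

/* Assumed primitives (hypothetical): */
bool try_receive(int endpoint_id, msg_t *out);  /* non-blocking */
void open_wait(msg_t *out);                     /* blocking, any endpoint */

void handle(msg_t *m);

void server_loop(void) {
    msg_t m;
    for (;;) {
        if (try_receive(EP_HIGH, &m)) {  /* high priority first */
            handle(&m);
            continue;
        }
        open_wait(&m);                   /* then in-order open wait */
        handle(&m);
    }
}
```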
I can imagine approaches that wake the full stall queue only if the selected sender defects. At some point we'll undoubtedly find a motivation to implement that, but until then simple seems good. Partly because specifying defection, or even "might have defected", seems hard.

I'd love to dig into the pathologies of senders playing "down low, too slow" if you still think this is an issue.

Can't we just pretend that receive is atomic? :)

As long as the key word is "pretend", sure! OS architects have as much right to a rich fantasy life as anyone else engaged in fantastical constructions! :-)

Quoth the spec:

From the sender perspective, message transmission is (nearly) atomic. From the receiver perspective, message transfer occurs asynchronously. Arrival is signalled by a message completion event delivered to the receiver's activation handler.
I recall having a moment at some point where I panicked thinking "what if the fault handler isn't ready to receive?!" before realising it's pretty easy to arrange things so that the handler is always receiving if the process is running.
My first concern is that the receiver is letting the sender transmit an arbitrary capability into a slot that spans a region of the receiver's address space. There are security issues. Assume the sender is hostile and the receiver needs to be robust. How does the receiver know/respond if the receiver makes a reference to that part of the address space and the sender has destroyed the memory region? Have a look at the high-performance networking work we did in EROS.

I have gone back and forth on this quite a bit, starting with trying to understand window capabilities. The impression I got from what implementation of window capabilities exists, and what little the spec mentions, is that they were kind of superseded by the way the ElfSpace copies its background space.
Further, for the sort of uses I thought windowing might be useful - such as read-only shared maps of a file - I couldn't see how to make them work.
An ever so slightly related point is that if you want to have a GPT do something special - say, you want a memory handler - the GPT can only translate 3 bits. I sometimes wonder if I should fix that, somehow.
So, I wonder if having another object specifically for the sake of including a memory handler or a remappable window might be useful. Something that has methods for shifting the window over the underlying memory object, mapping a fallback page or object if access fails, and has some way to query whether the map is still active.
There's the related problem that if I grant some memory to a process, I probably also want a way to revoke that access. I have some support for this in two unrelated objects I should really write about publicly.
On Mon, Nov 3, 2025 at 2:50 AM William ML Leslie <william.l...@gmail.com> wrote:

On Mon, 3 Nov 2025 at 17:53, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:

The problem remains that the awakened invoker may not re-attempt. This seems to leave two options:

The hopeful receiver is still on the ready queue, and if it is selected again, it will wake the next potential sender in sequence.

A receiver is in the receiving state. It is not on the ready queue.
What guarantees that these stall queues are ever awakened? Given multiple endpoints targeting a common process, how does this avoid priority inversion?

I was going for guaranteed in-order delivery, because I like the progress story. If a server wants to implement multiple priorities, it can use distinct endpoints and attempt a non-blocking wait with the high-priority endpoint id first, before falling back to open wait.

Since you can't guarantee retry on restart, there would need to be a way to detect that the sender didn't retry so that you could move on to the next sender. We've never found a clean way to do that.
Quoth the spec:

From the sender perspective, message transmission is (nearly) atomic. From the receiver perspective, message transfer occurs asynchronously. Arrival is signalled by a message completion event delivered to the receiver's activation handler.

This is taken from a version that had (or at least described) scheduler activations. Can I ask where you are seeing this? I'm not convinced that I have the most recent version of the Coyotos repository, and I'd like to check if you may have a newer version.
I recall having a moment at some point where I panicked thinking "what if the fault handler isn't ready to receive?!" before realising it's pretty easy to arrange things so that the handler is always receiving if the process is running.

Not necessarily. One handler might be handling faults for hundreds of processes, which means it might be running when a particular process attempts to send it a fault. The sender simply ends up on a stall queue. It wakes up and retries when the handler re-enters the receive state.
My first concern is that the receiver is letting the sender transmit an arbitrary capability into a slot that spans a region of the receiver's address space. There are security issues. Assume the sender is hostile and the receiver needs to be robust. How does the receiver know/respond if the receiver makes a reference to that part of the address space and the sender has destroyed the memory region? Have a look at the high-performance networking work we did in EROS.

I have gone back and forth on this quite a bit, starting with trying to understand window capabilities. The impression I got from what implementation of window capabilities exists, and what little the spec mentions, is that they were kind of superseded by the way the ElfSpace copies its background space.

Not superseded. Just a different use case. What ElfSpace is doing is copy-on-write for mutable pages. That's not a windowing problem.

Further, for the sort of uses I thought windowing might be useful - such as read-only shared maps of a file - I couldn't see how to make them work.

I suppose you could use them that way, but usually it's easier to just insert the read-only subspace directly into a containing space.
An ever so slightly related point is that if you want to have a GPT do something special - say, you want a memory handler - the GPT can only translate 3 bits. I sometimes wonder if I should fix that, somehow.

Don't. It's intentional. The most comparable construct in EROS/KeyKOS was red segments, which allowed more slots to be populated. The resulting translation logic was kind of a mess, because it didn't translate an exact number of bits. The decision to make GPTs translate an exact number of bits in this case was an intentional simplification.
So, I wonder if having another object specifically for the sake of including a memory handler or a remappable window might be useful. Something that has methods for shifting the window over the underlying memory object, mapping a fallback page or object if access fails, and has some way to query whether the map is still active.

All of that can be done now. It doesn't motivate adding a different object.

I probably need to write up how background windows work better! To be honest, I never used a window (outside of testing) in all of the years of working on EROS. The case in which you really need one is pretty obscure.
There's the related problem that if I grant some memory to a process, I probably also want a way to revoke that access. I have some support for this in two unrelated objects I should really write about publicly.

That's a common sharing pattern. The way to implement it is to wrap the space in a GPT that doesn't consume any translation bits.
On Tue, 4 Nov 2025 at 04:07, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:

On Mon, Nov 3, 2025 at 2:50 AM William ML Leslie <william.l...@gmail.com> wrote:

On Mon, 3 Nov 2025 at 17:53, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:

What guarantees that these stall queues are ever awakened? Given multiple endpoints targeting a common process, how does this avoid priority inversion?

I was going for guaranteed in-order delivery, because I like the progress story. If a server wants to implement multiple priorities, it can use distinct endpoints and attempt a non-blocking wait with the high-priority endpoint id first, before falling back to open wait.

Since you can't guarantee retry on restart, there would need to be a way to detect that the sender didn't retry so that you could move on to the next sender. We've never found a clean way to do that.

I haven't got all of the possibilities covered yet, but I don't know if there's value in my going into detail, since I don't think that technique will apply as cleanly to Coyotos.
Extension to IPC:

Proceeding from the sketch above, it isn't hard to imagine extending the scheduler activation concept to IPC receive blocks. I have a memory that L4 explored something similar. Here's a sketch:
- Processes register a receive block with the kernel. This block describes where the next inbound message should go. The receive block additionally specifies the address of the next receive block, or null if no further receive block has been defined.
- If the current receive block pointer is null, the process is not receiving and the sender blocks (there are user/kernel handshake implications on the receive side).
- When an incoming message arrives, it is delivered according to the current receive block, and the current receive block pointer is advanced to the next receive block.
- If a message has been delivered, the kernel resumes at a "message has been delivered" entry point, which (similar to scheduler activations) decides what to run next in the user-level application.
Unlike scheduler activations, there is a handshake needed between the application and the kernel to update the current receive block address. The receive block pointer itself can be updated with CAS, but a kernel transition is required to wake up blocked senders.

When the receiver is activated, many choices are possible: the received message blocks can be stuck in a queue, the current user-mode thread can be preempted, or other choices can be made. The main points are:
- Message receipt is non-blocking provided the receiver has an active and well-formed receive block at the time of receipt.
- Message receipt does not necessarily imply preemption of the active user-mode thread. The application-level scheduler gets to decide whether to continue the current user-level thread (presumably queueing the message) or preempt it in favor of a "receiving" user-level thread.
- In abstract, the application-level scheduler could dynamically scale the receive buffer pool in response to load.
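The receive block chain itself is small - a sketch of a hypothetical layout, showing the kernel-side CAS consumption:

```c
/* Receive blocks: the kernel claims the chain head for each arriving
 * message; the application appends new blocks and CASes the pointer. */
#include <stdatomic.h>
#include <stddef.h>

typedef struct recv_block recv_block_t;
struct recv_block {
    void         *buf;        /* where the next inbound message goes */
    size_t        buf_len;
    recv_block_t *next;       /* next receive block, or NULL */
};

/* Current receive block; NULL means the process is not receiving, so
 * senders block (and a kernel transition is needed to wake them once
 * a new block is registered). */
static _Atomic(recv_block_t *) current_rb;

/* Kernel side: claim the current block for an arriving message and
 * advance the pointer to the next block. Returns NULL if the sender
 * must block. */
static recv_block_t *claim_for_delivery(void) {
    recv_block_t *rb = atomic_load(&current_rb);
    while (rb != NULL &&
           !atomic_compare_exchange_weak(&current_rb, &rb, rb->next))
        ;                     /* rb was reloaded; retry */
    return rb;
}
```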
Open issues:
- Should all process resumptions be handled as if an in-bound resumption message has been sent?
- How should "foreign" processes behave if this is implemented? By virtue of being foreign, they cannot receive normal IPC messages in any case. The question is how to support the emulated system call interface.
- If we assume foreign applications are recompiled, we can bake support into the foreign application runtime, but
- If binary compatibility for foreign APIs is a goal we need to decide what to do and how to do it.
- None of this solves the problem of preserving scheduling priority across a call. Anderson's proposed mechanism doesn't attempt to address this. Which is probably a good call.
This is a radical departure from GNOSIS/KeyKOS/EROS, and it is (at best) 3/4 baked. But if it can be worked through sensibly it seems to offer new and interesting concurrency options.
On 3 Nov 2025, at 18:03, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Extension to IPC:
This is a radical departure from GNOSIS/KeyKOS/EROS, and it is (at best) 3/4 baked. But if it can be worked through sensibly it seems to offer new and interesting concurrency options.
Why? Why should the kernel deal with user-level threads?
What stack do you run the handler on? To explore: how far is this from having a separate process to handle all of your events for you? If you've just got the one user thread, at least.