Questions regarding KeyKOS, EROS, CapROS

35 views
Skip to first unread message

Bakul Shah

unread,
May 25, 2026, 12:40:05 PM (10 days ago) May 25
to cap-talk
I understand that on KeyKOS there could be multiple schedulers ("meter keepers"). Was this also the case for EROS and CapROS? If not, what were the reasons for the change? AIUI KeyKOS was a uniprocessor OS, while EROS, CapROS were not so scheduling would be more complicated. When a thread becomes runnable but another thread is running on the core it ran on previously, do you kick out the current running thread or run the newly runnable thread on another core?

Technically even the scheduler could only schedule threads for which it holds capabilities so that futher complicates things!

Were such things explored? Was there much (or any) practical experience with EROS/CapROS to get some real data on their design decisions?

Such questions come to mind if one were to design a secure computing substrate on a multicore processor with *no* protection rings or supervisor/hypervisor state.

Thanks for any insight!
If this is not the right group, apologies. Would love to know of a better place to ask such questions.

Bakul

Jonathan S. Shapiro

unread,
May 25, 2026, 10:32:21 PM (9 days ago) May 25
to cap-...@googlegroups.com
Hey, Bakul. I actually had to go back and refresh myself on meters.

Before getting into keepers, it's useful to understand the meter hierarchy. The prime meter is kernel-implemented and has an infinite number of "ticks". Meters below the prime meter have a finite number of ticks. For every [useful] meter, there is a straight line descent of ancestor meters rooted at the prime meter. As a process (in KeyKOS: a domain) executes, it reduces the number of ticks on all of the meters between the process and the prime meter. When a given meter goes to zero, its keeper is invoked. The keeper can set the tick supply on any meter that it manages.

If you consider a meter M having child meters M1, M2, M3, M4, there is a problem of selection. The idea in KeyKOS was that the tick counts in M1..M4 tacitly specified a proportional subdivision of the ticks consumed from M. Various scheduling papers in the 1990s looked for fair ways to implement this without much success. KeyKOS did not have any notion of a start time guarantee; execution starts were assumed to be fungible. In real systems, most processes are ineligible for execution most of the time, so KeyKOS got away with it, but the scheme becomes fragile in the presence of multiple CPUs or real-time requirements. You can imagine a distinct prime meter for each CPU, or for each class of CPUs, but that doesn't address cache affinity in process placement. There's really no provision in the scheme for start time or deadline guarantees. Finally, the meter hierarchy didn't have a bound, so there are ways that clever abusers can interfere with the scheduling system.

Even when I first started on EROS in the 1990s, it was obvious that meters needed to be replaced by something else. Norm Hardy reluctantly agreed. But we never got to it in EROS. During the late 1990s there was an uptick in interest in scheduling mechanisms, and especially real-time scheduling (both hard and soft). I ended up deciding on a two-level scheme, the upper level being table driven and the lower level being traditionally priority driven. The idea was that a single, central scheduler would manage the table-driven portion, and we'd sort the rest out from there. It still hasn't been implemented, so no, we don't have any timing. There may or may not be a sketch of an implementation in the existing Coyotos code, but it isn't anything like complete.

A further unpleasant challenge is that there is no single approach to scheduling that works for all applications. Vehicles, for example, have a particular set of hard scheduling requirements that can't be implemented by more flexible real-time regimes. The car's scheduler wastes a fair number of resources, but it's guaranteed that you'll get what you need when you need it. In a car, that's important.

Not sure how much of that is helpful, but at least it's a start.


Yes, this is one of the hard problems in secure multicore - and more so when the processors are heterogeneous (e.g. big/little). I'm not convinced that removing user/supervisor is a good idea, but one could make a case for it with a CHERI-like processor. I'm not sure how hypervisors would be done on that. I've been waiting for some of the tariff pressure to come off to look at some ideas along those lines.

At one point I realized that the entire EROS or Coyotos kernel can pretty well be reduced to silicon, which would be one way to eliminate a supervisor. Over the years, I've come to the view that it's not a great plan. There isn't enough parallelism in a kernel implementation to warrant reducing the entirety to hardware, and the landscape for writing secure, well-specified, and verified code has improved very dramatically over the last 20 years. We understand the U/S approach extremely well, and we need to be able to deal with bugs and changes, so a software-defined kernel seems like a better way to go.

That said, some U/S architectures are a heck of a lot better than others, and in some cases they recurse well enough that no truly separate H mode is required.


Jonathan

Mark S. Miller

unread,
May 25, 2026, 11:50:08 PM (9 days ago) May 25
to cap-...@googlegroups.com, Gernot
It is worth learning from seL4's approach to mixed criticality. cc'ing Gernot.


--
You received this message because you are subscribed to the Google Groups "cap-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cap-talk+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/cap-talk/CAAP%3D3QPCs%3DaJ%3DUj%3DLv0V0GYR40_P2TbQ1Ry4AvMH%2BGcZagZ%2BBw%40mail.gmail.com.


--
  Cheers,
  --MarkM

Jonathan S. Shapiro

unread,
May 26, 2026, 12:48:47 PM (9 days ago) May 26
to cap-...@googlegroups.com, Gernot
On Mon, May 25, 2026 at 8:50 PM Mark S. Miller <eri...@gmail.com> wrote:
It is worth learning from seL4's approach to mixed criticality. cc'ing Gernot.

Agreed!

Gernot Heiser

unread,
May 26, 2026, 10:08:32 PM (8 days ago) May 26
to cap-...@googlegroups.com
Ok, thanks for the tag.

seL4 also has a two-level scheduling approach:
1. the top-level “domain scheduler” is a table-driven, strict time-partitioning scheduler
2. the second-level real-time scheduler uses scheduling contexts (SC) that are cap-controlled objects that provide the right to use a CPU.

(1) is purely an enabler of information-flow confidentiality proofs (which nevertheless map onto few real-world scenarios). It is IMHO the worst design decision made in seL4, where verification convenience trumped good design. We finally started working on a way to do meaningful confidentiality verification without it. In terms of functionality, it is completely redundant, we (TS@UNSW) don’t use it for any of our designs, and I’ll ignore it for the rest of the discussion. So I’ll pretend (2) is all there is.

A thread has two scheduling parameterns: a priority, p, and an optional SC – it is not runnable without an SC.

An SC primarily specifies a period, T, and a budget, C, on a specific core, where C≤T. The maximum utilisation this thread can have is U=C/T.

The scheduler (which is per-core) will at any time run the highest-prio thread that has budget, using round-robin within a prio.

The execution time diminishes the budget until it is depleted, at which time the thread is no longer schedulable, until the budget gets replenished once the current period has elapsed. There is a lot of detail how things work with preemptions, blocking etc (and some of these details are presently under review). 

The model implements the “sproadic server” model from the real-time systems theory, it is mean to support hard RT systems, including mixed-criticality systems (MCS). 

In particular, it allows having a high-prio (less critical) thread that preempts a lower prio critical thread while still guaranteeing that the critical one meets its deadlines. Simple example: a control loop that executes at a period of 10s of ms, and an ethernet driver that needs to respond within a few µs to avoid losing packets. Rate-monotonic prio assignment gives the driver high and the control low priority. The system supports this safely, eg for the driver p=100, T=10µs, C=3µs, and for the control p=50, T=10ms, C=5ms. The driver can preempt the control, but not for more than 30% of the time, leaving sufficient time for the control to do its job.

How does a thread get an SC?

An SC can be bound to a TCB, in which case the thread can always execute as long as it has budget.

A thread without an SC (“passive server”) can obtained an SC when it is invoked through an endpoint by a protected procedure call (PPC, unfortunately still referred to as IPC,  a term I’d like to stamp out) from a thread that does have an SC (else it wouldn’t be able to invoke the endpoint). In that case, the passive server executes on the client’s “borrowed” SC until it returns from the PPC, at which time the SC reverts back to the caller. This works transitively. And good design practice is to have the server running at a higher prio than its clients (which incidentally implements the immediate priority ceiling protocol, IPCP, from RT theory). It means that a shared server charges its execution time to the client on whose behalf it executes.

The other way in which a thread can receive an SC is when it’s blocked on a Notification (binary semaphore-like async operation). A Notification can be active, meaning it has an SC bound to its TCB. When such an active Notification is signalled, a thread blocked on a wait on this Notification gets to execute on the Notification’s SC, until it blocks (or the budget depletes).

There are a bunch of issues and corner cases in this model, and I wouldn’t want to guarantee that the current implementation always does the right thing. We’re currently in the process of formalising the scheduling model and ensuring that it does have a sound theory (which may lead to subtle implementation changes).

Finally there’s the question of who controls time, i.e. can initialise SCs. (Anyone holding a cap to Untyped memory can *create* SC objects, but they don’t have a budget when created.)

There’s a per-core scheduling control cap (SCC) that authorises the holder to assign budget and period to an SC.

Note that, other than initialising SC’s on authority of the SCC, there’s (currently) no other way to assign budgets (i.e. initialise SCs). In particular, we don’t have a mechanism for the holder of an SC cap to donate a fraction of its budget to a new SC. I had been thinking for a while that this form of delegation might be useful, but haven’t seen a convincing use case that wouldn’t be served by other means, but I’m keeping my mind open FTTB. One of the issues here is the non-fungibility of time (as opposed to space being largely fungible), and the inevitable loss (die to context switching overheads) from splitting a budget that complicates things. Again, this is something where we may arrive with a more definite answer as part of the on-going (recently started) work on formalising the model.

Hope this helps.

Gernot

Bakul Shah

unread,
May 28, 2026, 12:20:25 AM (7 days ago) May 28
to cap-...@googlegroups.com
Many thanks for a very detailed response! Some comments inline.

On May 25, 2026, at 7:32 PM, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:

Hey, Bakul. I actually had to go back and refresh myself on meters.

Before getting into keepers, it's useful to understand the meter hierarchy. The prime meter is kernel-implemented and has an infinite number of "ticks". Meters below the prime meter have a finite number of ticks. For every [useful] meter, there is a straight line descent of ancestor meters rooted at the prime meter. As a process (in KeyKOS: a domain) executes, it reduces the number of ticks on all of the meters between the process and the prime meter. When a given meter goes to zero, its keeper is invoked. The keeper can set the tick supply on any meter that it manages.

If you consider a meter M having child meters M1, M2, M3, M4, there is a problem of selection. The idea in KeyKOS was that the tick counts in M1..M4 tacitly specified a proportional subdivision of the ticks consumed from M. Various scheduling papers in the 1990s looked for fair ways to implement this without much success. KeyKOS did not have any notion of a start time guarantee; execution starts were assumed to be fungible. In real systems, most processes are ineligible for execution most of the time, so KeyKOS got away with it, but the scheme becomes fragile in the presence of multiple CPUs or real-time requirements. You can imagine a distinct prime meter for each CPU, or for each class of CPUs, but that doesn't address cache affinity in process placement. There's really no provision in the scheme for start time or deadline guarantees. Finally, the meter hierarchy didn't have a bound, so there are ways that clever abusers can interfere with the scheduling system.

Thanks these details.

A further unpleasant challenge is that there is no single approach to scheduling that works for all applications. Vehicles, for example, have a particular set of hard scheduling requirements that can't be implemented by more flexible real-time regimes. The car's scheduler wastes a fair number of resources, but it's guaranteed that you'll get what you need when you need it. In a car, that's important.

Perhaps one needs a "pluggable" scheduler.... The idea would be to implement a scheduler API and then add implementations with different algorithms.

Not sure how much of that is helpful, but at least it's a start.


Yes, this is one of the hard problems in secure multicore - and more so when the processors are heterogeneous (e.g. big/little). I'm not convinced that removing user/supervisor is a good idea, but one could make a case for it with a CHERI-like processor. I'm not sure how hypervisors would be done on that. I've been waiting for some of the tariff pressure to come off to look at some ideas along those lines.

At one point I realized that the entire EROS or Coyotos kernel can pretty well be reduced to silicon, which would be one way to eliminate a supervisor. Over the years, I've come to the view that it's not a great plan. There isn't enough parallelism in a kernel implementation to warrant reducing the entirety to hardware, and the landscape for writing secure, well-specified, and verified code has improved very dramatically over the last 20 years. We understand the U/S approach extremely well, and we need to be able to deal with bugs and changes, so a software-defined kernel seems like a better way to go.

I still find this area quite interesting (in the same way that plan9 got rid of the idea of superuser). But I need to flesh out ideas & designs to find if this is doable. [This was the original reason for my questions] 

I haven't looked at the CHERI work in detail but my impression was that this should be possible.

Thanks,
Bakul

That said, some U/S architectures are a heck of a lot better than others, and in some cases they recurse well enough that no truly separate H mode is required.


Jonathan

Bakul Shah

unread,
May 28, 2026, 12:47:46 AM (7 days ago) May 28
to cap-...@googlegroups.com
Many thanks for these details. Are there any published papers &/or statistics (on how well it has worked etc)?

Is MCS is primarily used for real time (but mixed criticality) scheduling or also mixed mode? My interest is more in a mix of real time + non-real time processes. Not unlike how a network router/switch would handle high priority vs isochronous vs best effort traffic! wasted CPU time is similar to wasted network bandwidth. Fairness is also an issue for non-realtime. Jitter can be an issue for realtime. But of course scheduling is more complex as unlike network flows, threads/processes interact.

Thanks Mark, for cc'ing Gernot.

Cheers,
Bakul


Gernot Heiser

unread,
May 28, 2026, 1:47:13 AM (7 days ago) May 28
to cap-...@googlegroups.com
MCS is pretty much what you call “mixed mode”.

Note that “fairness” is pretty much the opposite of real-time: RTS aren’t fair by definition.

However, any slack time left after the RT components are ensured of their timeliness can be used for best-effort components. (Incl when there are no RT components, only best-effort).

seL4’s MCS mechanisms can be used to build (fairly arbitrary) scheduling policies at user level. A standard approach is to have a scheduler thread (that might be time triggered) which determines what to run next, based on its policy, and donates the rest of its current budget to that thread.

There’s a paper describing the model and its use: https://trustworthy.systems/publications/abstracts/Lyons_MAH_18.abstract
It demonstrates (among others) implementing EDF scheduling in usermode on top of the priority scheduler with less overhead thank doing it in-kernel in Linux.

We haven’t done an experience paper (yet).

Gernot

Reply all
Reply to author
Forward
0 new messages