Ok, thanks for the tag.
seL4 also has a two-level scheduling approach:
1. the top-level “domain scheduler” is a table-driven, strict time-partitioning scheduler
2. the second-level real-time scheduler uses scheduling contexts (SC) that are cap-controlled objects that provide the right to use a CPU.
(1) is purely an enabler of information-flow confidentiality proofs (which nevertheless map onto few real-world scenarios). It is IMHO the worst design decision made in seL4, where verification convenience trumped good design. We finally started working on a way to do meaningful confidentiality verification without it. In terms of functionality, it is completely redundant, we (TS@UNSW) don’t use it for any of our designs, and I’ll ignore it for the rest of the discussion. So I’ll pretend (2) is all there is.
A thread has two scheduling parameterns: a priority, p, and an optional SC – it is not runnable without an SC.
An SC primarily specifies a period, T, and a budget, C, on a specific core, where C≤T. The maximum utilisation this thread can have is U=C/T.
The scheduler (which is per-core) will at any time run the highest-prio thread that has budget, using round-robin within a prio.
The execution time diminishes the budget until it is depleted, at which time the thread is no longer schedulable, until the budget gets replenished once the current period has elapsed. There is a lot of detail how things work with preemptions, blocking etc (and some of these details are presently under review).
The model implements the “sproadic server” model from the real-time systems theory, it is mean to support hard RT systems, including mixed-criticality systems (MCS).
In particular, it allows having a high-prio (less critical) thread that preempts a lower prio critical thread while still guaranteeing that the critical one meets its deadlines. Simple example: a control loop that executes at a period of 10s of ms, and an ethernet driver that needs to respond within a few µs to avoid losing packets. Rate-monotonic prio assignment gives the driver high and the control low priority. The system supports this safely, eg for the driver p=100, T=10µs, C=3µs, and for the control p=50, T=10ms, C=5ms. The driver can preempt the control, but not for more than 30% of the time, leaving sufficient time for the control to do its job.
How does a thread get an SC?
An SC can be bound to a TCB, in which case the thread can always execute as long as it has budget.
A thread without an SC (“passive server”) can obtained an SC when it is invoked through an endpoint by a protected procedure call (PPC, unfortunately still referred to as IPC, a term I’d like to stamp out) from a thread that does have an SC (else it wouldn’t be able to invoke the endpoint). In that case, the passive server executes on the client’s “borrowed” SC until it returns from the PPC, at which time the SC reverts back to the caller. This works transitively. And good design practice is to have the server running at a higher prio than its clients (which incidentally implements the immediate priority ceiling protocol, IPCP, from RT theory). It means that a shared server charges its execution time to the client on whose behalf it executes.
The other way in which a thread can receive an SC is when it’s blocked on a Notification (binary semaphore-like async operation). A Notification can be active, meaning it has an SC bound to its TCB. When such an active Notification is signalled, a thread blocked on a wait on this Notification gets to execute on the Notification’s SC, until it blocks (or the budget depletes).
There are a bunch of issues and corner cases in this model, and I wouldn’t want to guarantee that the current implementation always does the right thing. We’re currently in the process of formalising the scheduling model and ensuring that it does have a sound theory (which may lead to subtle implementation changes).
Finally there’s the question of who controls time, i.e. can initialise SCs. (Anyone holding a cap to Untyped memory can *create* SC objects, but they don’t have a budget when created.)
There’s a per-core scheduling control cap (SCC) that authorises the holder to assign budget and period to an SC.
Note that, other than initialising SC’s on authority of the SCC, there’s (currently) no other way to assign budgets (i.e. initialise SCs). In particular, we don’t have a mechanism for the holder of an SC cap to donate a fraction of its budget to a new SC. I had been thinking for a while that this form of delegation might be useful, but haven’t seen a convincing use case that wouldn’t be served by other means, but I’m keeping my mind open FTTB. One of the issues here is the non-fungibility of time (as opposed to space being largely fungible), and the inevitable loss (die to context switching overheads) from splitting a budget that complicates things. Again, this is something where we may arrive with a more definite answer as part of the on-going (recently started) work on formalising the model.
Hope this helps.
Gernot