Conceptually, there is some elegance in the approach, since everything is just either a Key, a Node/KeyList or a Page...
There were five KeyKOS ideas that ultimately did not survive into Coyotos:
- Meters were replaced by scheduling capabilities (meters can't express real-time requirements).
- Everything is a node or a page (too restrictive, high complexity).
- The memory hierarchy using nodes as segments (restrictive, space inefficient, and high complexity).
- Kernel calls as the atomic units of operation (does not admit a serious hardware-concurrent implementation).
- The rule that all invocations are synchronous and blocking.
The parenthesized parts are my opinion. Let me take these in turn.
Meters
Meters were abandoned for two reasons. The first is that they aren't a good model for hard or soft real-time systems; they simply don't express what you want. We initially replaced them with simple priorities in EROS to get going. We intended to implement a different scheduling model, but EROS was replaced by Coyotos before the replacement was started.
The second is that we had an explicit goal to change the time bound requirements for the kernel. In KeyKOS, worst-case operations are supposed to have a time bound of O(log n), where n is the number of objects touched and n always has a small constant bound. The notable exception to this is the meter tree, where a meter flush that happens at the top of the meter tree can (in theory) touch every meter below it. While the height of the meter tree is bounded, the width of the meter tree is not bounded. In consequence, there is no constant bound on the number of meters that may be touched. In Coyotos, the goal was to have an O(1) time bound for all kernel operations. More precisely: O(n) for constant-bounded n.
One of our goals at Johns Hopkins was to start proving properties about these systems, and one of the properties we wanted to prove was the constant time bound property.
Given this, meters had to go.
Nodes and Pages
In KeyKOS and EROS, everything is a node or a page. Segments and domains/processes are "skins" on a node. The skin you are looking through and the operations you can perform are determined by the capability type.
At the disk level, KeyKOS disk "ranges" are either node ranges or page ranges, and the type cannot be changed without a dump and restore (and then only with difficulty). The checkpoint area isn't quite as strict; there are data pages and page-sized "node pots." This made checkpoints complex. One of the big differences at the storage layer between KeyKOS and EROS was that the EROS checkpoint area is a circular log. The evolution is discussed here.
An important difference is that EROS ranges can be allocated dynamically. Since different systems can require very different proportions, the ability to allocate ranges as you need them is very helpful. To facilitate this, EROS uses a lookup table to map from object id to disk range, while KeyKOS uses an encoded disk address as the object ID. This is one of the small differences in the on-disk capability formats. This level of indirection permits on-the-fly storage rearrangement by abusing the range replication logic. It also simplifies construction of the initial system image slightly.
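To make the indirection concrete, here is a minimal sketch of mapping an object ID to a disk range through a table rather than encoding the disk address in the OID. The names and layout are hypothetical, not the actual EROS data structures:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

typedef uint64_t oid_t;

/* One allocatable disk range.  In EROS, ranges can be added while the
 * system is running; in KeyKOS the OID itself encoded the disk address,
 * so this indirection did not exist. */
typedef struct ObRange {
  oid_t    startOid;     /* first object ID covered by this range */
  oid_t    endOid;       /* one past the last covered object ID */
  uint64_t startSector;  /* where the range begins on disk */
  bool     isNodeRange;  /* node range vs. page range */
} ObRange;

static ObRange rangeTable[64];   /* populated at mount time */
static size_t  nRanges = 0;

/* Map an object ID to its containing range, or NULL if unallocated. */
static ObRange *lookup_range(oid_t oid) {
  for (size_t i = 0; i < nRanges; i++) {
    if (oid >= rangeTable[i].startOid && oid < rangeTable[i].endOid)
      return &rangeTable[i];
  }
  return NULL;
}
```

Because capabilities store only the OID, a range can be migrated or replicated and the table updated without rewriting any stored capability.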
The "only nodes and pages" design created a bunch of complex challenges for process state. There is a published description of how this works
here. It's an early paper, and definitely not one of my better efforts, but it's a start. As I mentioned earlier, the prepare/deprepare logic was the origin of the only bug I ever found in KeyKOS. The complexity of making everything a node in KeyKOS touched
everything.
As Norm described it to me, one reason for this restriction was that proliferating object types would mean proliferating the number of disk range types, which would increase the likelihood that you'd end up with an unfortunate (and constraining) division of space on the disk level. The EROS version does not share this problem, and in Coyotos we eventually made process objects and endpoint objects first-class. This allowed us to remove node objects entirely. That removed a lot of ugly and complicated code, which was helpful, because it simplified the implementation of a multiprocessor kernel by a lot.
Since we no longer needed it for processes, Coyotos is able to replace the Node object with the GPT object. GPTs require fewer objects to describe address spaces that have multiple regions. They also allow us to cleanly support multiple page sizes (which nodes cannot do in KeyKOS or EROS).
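For flavor, here is a rough sketch of what a GPT-like translation object looks like. The field names and fan-out are illustrative rather than the actual Coyotos declarations; the essential point is the per-GPT span field, which is what lets a small number of GPTs describe multi-region address spaces and mixed page sizes:

```c
#include <stdint.h>

#define GPT_SLOTS 16            /* illustrative fan-out */

struct capability;              /* names a page or another GPT */

typedef struct GPT {
  uint8_t  l2v;                 /* log2(bytes spanned by each slot) */
  struct capability *slot[GPT_SLOTS];
} GPT;

/* Select the slot that translates virtual address 'va' at this level.
 * Because each GPT carries its own l2v, adjacent levels can cover very
 * different spans, which is how multiple page sizes fall out naturally. */
static inline unsigned gpt_slot_for(const GPT *g, uint64_t va) {
  return (unsigned)((va >> g->l2v) & (GPT_SLOTS - 1));
}
```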
If you decide to examine Coyotos, your first priority should be to resurrect CapIDL, because you will need that to generate the documentation tree. The "base" portion of the documentation tree documents all of the kernel interfaces in a standard, web-browsable way.
Memory Hierarchy
This was mostly the GPT change, but also a change to the invocation mechanism that allows capabilities to be used from the memory tree. In Coyotos, there is a single, unified address space. All leaf objects in the memory tree are pages. Pages are "tagged" at allocation time as either capability pages or data pages. This allowed us to use the same memory walk logic for both address spaces, and it also allows us to create a single address space that contains "capability regions" and "data regions."
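A minimal sketch of the allocation-time tagging idea, under my reading of the design (these are not the real Coyotos definitions): the page type is fixed when the page is allocated, the same walk logic serves both kinds of region, and capability stores are honored only on capability pages, which is what keeps capability bits from being forged by ordinary data writes.

```c
#include <stdbool.h>

typedef enum { pt_DataPage, pt_CapPage } PageType;   /* fixed at allocation */

typedef struct Page {
  PageType type;
  void    *frame;      /* backing physical frame */
} Page;

/* The memory walk is identical for both region kinds; only the final
 * load/store check differs. */
static inline bool page_accepts_cap_store(const Page *pg) {
  return pg->type == pt_CapPage;
}
```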
System Calls as Atomic Units of Operations
This was the hardest decision, and it had huge consequences for the system specification, the kernel design, and the kernel implementation.
The Coyotos design was a joint effort between myself and Jonathan Adams (JWA). One of the very valuable things that JWA brought to the effort was a detailed understanding of how multithreading was done in Solaris, and the various things Solaris had gotten wrong and had to fix over the years. Norm's story for multithreading in KeyKOS basically required a "big kernel lock." This was also the first approach in Linux, and it severely hampered performance.
I believe that Norm liked the "big kernel lock" idea for two reasons:
- It is simple, and
- It preserves the story that the current state of the system can always be described inductively as a sequence of transforming operations (capability invocations) that have been applied to an initial state. You may not know the sequence, but knowing that it exists enables a bunch of verification proofs about the correctness of the system state.
So far as I know, KeyKOS and EROS are the only kernels for which this claim can be made. It's one of the reasons that the verification proof [Shapiro and Weber 2000] and the later mechanized verification proof (Scott Doerrie's dissertation) are possible. According to Jochen, L4 did not have this property. I do not know if seL4 does, or if it instead adopts something similar to the approach adopted in Coyotos (described below). Gerwin Klein's TOCS paper probably sheds light on this. Given the hardware they were working on, I would expect seL4 to consider the same multithreading concerns that we did. Note that Gerwin's team didn't fully understand how our initial system image was constructed; a correctness verification there was certainly possible.
Verification is one issue, but performance is another. The system calls in KeyKOS, EROS, and Coyotos have much shorter paths, but the degree of multithreading is much higher, and the cache coherency traffic alone for the big kernel lock approach places a severe limit on overall system performance. Given the percentage of time spent in the kernel in these systems, you really don't want to make the kernel be effectively single threaded. It would mean that operations on completely unrelated objects (each of which is probably local to the acting hardware cache) have to be done sequentially. Even on Linux, which spends much less of its time (proportionally) in the kernel (and therefore should have much less contention), the impact was severe.
The alternative is to relax the requirement that system calls are atomic units of operation. All serious multithreading kernels do this. JWA played a critical role in architecting and implementing what follows, and Eric Northup provided a bunch of tutoring on hardware concurrency to a certain doddering former graduate advisor. :-)
In Coyotos, the concurrency contract is:
- Every kernel operation consists of a setup phase (in which all necessary objects are locked) followed by an atomic "action" phase (but see below re data read/write).
- Durable preconditions established in the setup phase may be invalidated by a competing hardware thread before the action phase begins. This is checked before the action phase, and a thread that depends on an invalidated precondition will abandon (and restart) its operation from scratch.
- Once the action phase is begun, you're committed (so to speak). The action phase must either complete or it must induce a hard system reboot due to an unrecoverable error (in our case, most likely an ECC memory fault). Action phases are O(1). Once a thread enters its action phase, a competing, higher-priority thread will wait for the action phase to complete rather than breaking a lock held by the acting thread.
- It is explicitly permitted for memory walks to occur where process A traverses a GPT during its setup phase, process B modifies that GPT, and process A then modifies a target object lower in the memory tree being walked (that is: after the path that it walked has been modified behind it). The GPT slot read is atomic, but the GPT path traversal is not. The slot-level read and write operations are atomic, but these do not create durable preconditions. Getting the interaction correct between the GPT tree and the hardware mapping tree (for some architectures) and the soft-TLB (for others) was very entertaining.
  Note that most traversals are done in the hardware page tree, which is a partial cache of the GPT state, so this particular form of weak consistency is (in practice) a requirement.
- If simultaneous readers and writers exist for data page[s], it is guaranteed that some byte string will be returned to the reader, but there is no guarantee about which of the concurrently written bytes will be seen by the reader. This is partly because you can't make the hardware take locks during a load instruction, and partly because many hardware implementations do not implement hardware-enforced consistency across caches.
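As a rough illustration (hypothetical function names, not the actual Coyotos entry path), the contract above boils down to a setup/commit pattern along these lines:

```c
#include <stdbool.h>

typedef enum { SETUP_OK, SETUP_CONFLICT } SetupResult;

extern SetupResult lock_needed_objects(void);        /* setup: lock and validate */
extern bool        preconditions_still_hold(void);   /* re-check durable preconditions */
extern void        do_action_phase(void);            /* O(1); must run to completion */
extern void        abandon_and_restart(void);        /* resets the kernel stack, re-enters */

void invoke_capability(void) {
  /* Setup phase: acquire locks and establish preconditions.  Losing a
   * lock conflict to a higher-priority thread abandons the whole
   * operation; nothing live is on the stack yet. */
  if (lock_needed_objects() != SETUP_OK)
    abandon_and_restart();

  /* A competing thread may have broken a durable precondition while we
   * were still setting up; if so, start over from scratch. */
  if (!preconditions_still_hold())
    abandon_and_restart();

  /* Action phase: committed.  Competing threads wait for this to finish
   * rather than breaking locks held by the acting thread. */
  do_action_phase();
}
```

The important property is that nothing live sits on the kernel stack at either abandonment point, which is what makes the stack-reset restart described further below so cheap.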
So each Coyotos system call consists of a well-defined series of [mostly] atomic operations, but the state of the system cannot necessarily be reconstructed by re-applying some sequence of system calls.
From a verification perspective, data read/write operations have no security semantics, so that overlap doesn't matter. It also turns out that the memory walk overlap (which is initially discomforting) is safe from both a verification and a concurrency perspective.
One of the clever things we did in Coyotos (shap said, dislocating his shoulder while patting himself on the back) was to exploit the mostly-atomic nature of Coyotos to implement locking differently. Coyotos consistency is implemented by a mix of locks (for both durable and non-durable preconditions) and atomic operations (for non-durable preconditions). Every process has a current transaction ID that is incremented on each call, and this ID is embedded in the lock word. If the lock's transaction ID and the owning process's transaction ID do not match, the lock is not held. This means that locks are "gang released" simply by incrementing the per-process transaction ID. I've never seen any other system that works this way. See the comments in kerninc/mutex.h.
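Here is a sketch of the trick with illustrative names. The real code embeds the transaction ID in the lock word itself (see kerninc/mutex.h) and uses atomic operations to grab a contended lock safely; this sketch splits the fields apart for readability and leaves the contention handling out:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct Process {
  uint32_t curTransaction;      /* bumped on every kernel entry */
} Process;

typedef struct mutex {
  Process *owner;               /* last process to grab the lock */
  uint32_t ownerTransaction;    /* owner's transaction ID at grab time */
} mutex;

/* The lock is held only if the recorded transaction ID still matches the
 * owner's current one. */
static inline bool mutex_is_held(const mutex *m) {
  return m->owner && m->ownerTransaction == m->owner->curTransaction;
}

static inline void mutex_grab(mutex *m, Process *p) {
  m->owner = p;
  m->ownerTransaction = p->curTransaction;
}

/* "Gang release": every lock this process holds goes stale at once.  No
 * per-lock release, and no walking back up the call stack to find them. */
static inline void mutex_release_all(Process *p) {
  p->curTransaction++;
}
```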
A related trick is used for transient mappings of things like page tables, in which mandatory TLB invalidations are exploited to lazily invalidate transient mappings. Because the cost of TLB invalidates is very high, this turns out to have a significant performance benefit. See hal/transmap.h.
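A sketch of the idea as I read it, with illustrative names and a made-up virtual address window (the real interface is in hal/transmap.h): releasing a transient mapping just marks the slot stale, and a targeted TLB invalidate is paid only if that slot is reused before some mandatory full flush has already cleaned it up.

```c
#include <stdint.h>

#define TRANSMAP_SLOTS 64
#define TRANSMAP_BASE  0xffff800000000000ull   /* hypothetical VA window */
#define PAGE_SIZE      4096

static uint64_t slots_in_use;   /* bitmap: slot currently mapped */
static uint64_t slots_stale;    /* bitmap: released, TLB entry may still be live */

extern void write_transmap_pte(unsigned slot, uint64_t pa);  /* hypothetical */
extern void invalidate_slot_tlb(unsigned slot);              /* hypothetical */

void *transmap_map(uint64_t pa) {
  for (unsigned s = 0; s < TRANSMAP_SLOTS; s++) {
    if (slots_in_use & (1ull << s))
      continue;
    if (slots_stale & (1ull << s)) {
      invalidate_slot_tlb(s);            /* pay for the flush only on reuse */
      slots_stale &= ~(1ull << s);
    }
    slots_in_use |= (1ull << s);
    write_transmap_pte(s, pa);
    return (void *)(uintptr_t)(TRANSMAP_BASE + (uint64_t)s * PAGE_SIZE);
  }
  return (void *)0;                      /* no free slot; real code recycles one */
}

void transmap_release(unsigned slot) {
  slots_in_use &= ~(1ull << slot);
  slots_stale  |=  (1ull << slot);       /* defer the TLB invalidate */
}

/* Called on paths that flush the whole TLB anyway (e.g. address space
 * switch): every deferred invalidation becomes free. */
void transmap_advise_full_flush(void) {
  slots_stale = 0;
}
```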
So far as I recall, the only explicit lock release in the Coyotos kernel occurs in the GPT walk path (where locks are non-durable). The big reason for making these locks non-durable is that we don't want to impose serialization on kernel operations when multiple threads share a common address space or address sub-space. It also matches the fact that hardware-implemented walks of page tables that cache the GPT state are proceeding concurrently, and reduces the number of required IPI operations significantly.
Anyway, if you are scratching your head looking for the lock releases and/or the page unmaps, that's why they aren't there.
This locking implementation is at least an order of magnitude faster than the usual approach where locks need to be explicitly released. In conventional kernels, abandoning or re-trying a system call requires returning up the stack through all of the currently outstanding procedure calls. This is necessary for two reasons:
- Most kernels do not cleanly distinguish between setup and operation phases. In such kernels, there are intermediate state changes that need to be undone when a system call fails, errors, needs to be restarted, or succeeds.
- In most kernels, the knowledge of lock locations is recorded in stack-local variables. The containing stack frames need to be re-visited in order to release locks.
Note that the stack is generally "cold" in the cache by this point, because conventional kernel stacks can get pretty big.
In Coyotos, a system call may be abandoned for one of three reasons:
- A lock conflict during the setup phase in which the requesting thread has lower priority and yields. At this point, no "live" state exists on the stack, so there is no need for conventional procedure call returns to clean up.
- A "broken durable precondition" is found, which means that a higher-priority thread stole your lock. This can only occur prior to the action phase, so once again, there is no "live" state on the stack.
- The action phase has been completed, in which case there is no "live" state on the stack.
In all three cases, the kernel stack is abandoned in place by assembly-level code that resets the stack pointer. In the first two cases, the kernel jumps back to the instruction that follows the inbound argument save. The difference in the third case is that the currently active system call advances from the "send" phase to the "receive" phase. The implementation isn't even done with a longjmp(). It's an inline-compiled jump instruction to an assembly-level code sequence that bashes the kernel stack pointer directly.
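For concreteness, a sketch of what that abandonment could look like on x86-64, with hypothetical symbol names (the real code is per-architecture assembly in the Coyotos tree):

```c
#include <stdnoreturn.h>

extern char kernel_stack_top[];        /* per-CPU kernel stack top */
extern void resume_after_save(void);   /* label just past the inbound argument save */

/* Abandon the current kernel stack in place: reset the stack pointer and
 * jump back to the restart point.  No unwinding, no longjmp(), and no
 * destructors; this is only safe because nothing live is ever left on
 * the stack at an abandonment point. */
noreturn void abandon_and_restart(void) {
  __asm__ volatile(
      "movq %0, %%rsp\n\t"
      "jmp  resume_after_save"
      : : "r"(kernel_stack_top) : "memory");
  __builtin_unreachable();
}
```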
This is one of the reasons that Coyotos was implemented in C rather than C++. In EROS, a lot of effort was needed to ensure that no object requiring a destructor ever ended up on the stack. The ultimate death of C++ for us happened when it was no longer possible to turn off exception handling - the usual implementations of exceptions have a large cost in stack size, code size, and cache temperature. Given what I've just said about the abandonment strategy, it should be clear that (by design) we didn't need an exception handling mechanism.
And no, we don't kick the process all the way back to user mode in this case. That user->kernel transition is damned expensive! That particular optimization was something we did very early in EROS. Linux later adopted something similar in selected cases. I have wondered whether that idea may have indirectly made its way from our work to theirs. Some very sharp people have worked on that part of the Linux kernel.