Coyotos: Revised capability layout


Jonathan S. Shapiro

Mar 6, 2026, 12:19:46 AM
to cap-talk
I very much appreciate the discussions about 64-bit capability layout, and in particular the move from pointers to cursors. That should make a number of things simpler.

William suggested increasing the in-memory capability size, which really is an interesting idea. I've spent much of today looking at it from various angles, which was a very useful refresh on the problem space. Having done so, my response is no, I don't think we should do that.

There are four main requirements for the capability structure and its content:
  1. Capability size must be a power of two.
  2. Certain things need to be in the payload. They don't all fit, so we need to make some trade-offs (see the new layout below).
  3. We don't have the real memory budget for a 32-byte in-memory capability format. Therefore the payload organization must allow for swizzling.
  4. We want to be able to do a "blind" capability copy; merely copying a capability should not involve any long-running or complicated algorithms.
Expanding these in turn:

Powers of Two

If the capability size is not a power of two:
  • There are a range of negative caching impacts because some capabilities will straddle cache lines. In some cases, critical fields will straddle cache lines, breaking the ability to do atomic operations on those fields.
  • Capabilities do not fit naturally into a capability page.
  • Tagged memory is so much more complex when the size is not a power of two as to be impractical.
  • Some things that need to reside at natural alignments do not. Worse, they may do so in even numbered capabilities but not in odd numbered capabilities.
Payload

Ideally, we would like the unswizzled capability layout to hold:
  • A naturally aligned 64-bit OID, or (see below) a 56 bit OID with 8 bits used for other purposes.
  • A naturally aligned 32-bit allocation count.
  • A naturally aligned 32-bit protected payload
  • 11 bits of flag fields comprising 5 capability type bits, 1 bit indicating whether the capability is swizzled, and 5 bits describing permission restrictions. The current layout has 6 capability type bits, but that's the result of a rearrangement I did that wasn't well thought out.
Obviously this isn't going to fit in a 16 byte representation, so we have to choose whether to shrink the allocation count, the protected payload, the OID, or some combination of these. I went through several options and ended up with the following layout (which is not the current layout):

byte
offset  fields
0:   allocCount: 21 or subType: 21, restrictions: 5, type: 5, swizzled=0: 1
4:   protPayload:32 or l2g+match for memory caps
8:   pad: 2, OID: 62 or offset: 62 [window and background]

On 64-bit platforms, this supports up to 16 ZiB (zebibytes) of attached storage and 8 TiB of DRAM (see below) using the following OTEntry rewrite:

8:   pad: 1, ObjectHeader Index: 31, OTEntry Index: 32 when swizzled bit is 1

There are other ways we could arrange things, but we want the fields in the first 8 bytes to remain present when the capability is swizzled. For systems with very large DRAMs, it would not be unreasonable to use a 32-byte in-memory format to allow for wider indexes. Ironically this would reduce the OTEntry objects to a single byte, because the only part we would need in that variant would be the flags.

On a bright note, this is a pointer-free data structure, so keeping Rust happy will be much easier.

Note that I've reduced the type field from 6 bits to 5 and introduced a subType field in order to increase the allocCount field. KeyKOS followed the GNOSIS practice of having a primary key type "Misc" with all of the random kernel capabilities bundled under that, and included the range capability in the primary group because it needed all the bits it could get to describe the range.

In Coyotos, that approach would use only 9 primary key types; everything else would be a subtype of "Misc". Now that I've done the math, I might change the type field to 4 bits and expand either the restrictions field or the allocCount.

Swizzling: Key Ring Not Possible for 64 bit

The reason for swizzling is that using an OID to look up an ObjectHeader involves a hash table search, which tends to involve cold cache lines. The KeyKOS solution (for 32-bit systems) was to replace three words of the capability with pointers: one to the ObjectHeader, and two to form a doubly linked list of capabilities rooted at the ObjectHeader. KeyKOS referred to that linked list as the "key ring." We found in EROS that the performance of this was pretty bad. More recent processors are wildly more aggressive about speculation, so the performance might be different today.

The key ring won't work for Coyotos, because we can't fit 24 bytes worth of pointers into a 16 byte capability format no matter how cleverly we pack them. For high performance computing machines that have ridiculous amounts of memory, it would make sense to move to a 32 byte in-memory capability format, though there's a challenge with capability pages if we do that. But that can be done by changing the kernel without changing the store. With care, it's not a semantic change.

Which, by the way, is why I didn't adopt 56 bit OIDs. We could definitely use more allocation count bits, and down-sizing the OID would buy them for us, so I may yet do that and split the allocation count field into two parts. But I'm trying to hold that as a last resort.

Swizzling: OTEntries

The OTEntry approach works for both word sizes. Right now the structure definition is 16 bytes when it only needs to be 8, and I will be fixing that. The reason the OID is 62 bits in the format above is that the two bit pad area replaces the three "flags" bits in the OTEntry structure, reducing it to 8 bytes.

To refresh and revise:
  • When the capability is swizzled, the ObjectHeader Index holds a type-specific array index into the per-type array. A 31 bit field supports up to 8 TiB of main memory.
  • We need the OTEntry index field to be one bit larger than the ObjectHeader index to support an oversupply of OTEntries.
A swizzled capability is valid if cap.OTEIndex == typeObjectHeaders[cap.OHdrIndex].OTEIndex.

Before somebody asks, yes, I considered renaming the two structures O'Header and O'TEntry in recognition of Xanadu's Irish subsidiary. Ah, the fun Mark would have had with those puns.

Progress

Eric Northup and I had an extended discussion about cleaning progress for OTEntries and whether it scales to 64 bits. The answer is yes.

Here are the operations that allocate an OTEntry:
  1. Bringing an object in from the store when its first capability is swizzled for use, and
  2. Moving an object from the "free" generation back to a live generation when its OTEntry has already been marked free. Conceptually this is a fast-path object load.
  3. Destroying and re-allocating an object within the same checkpoint generation, because this bumps the allocation count. Given the reduced allocation count size we'll be doing more work to wear-level objects, which will make this rare.
The question is: how do we make sure that OTEntry scan completes before free OTEntries are exhausted?

OTEntry states are { destroyed, scanning, removing, valid }. Free entries are recognized by virtue of the fact that they are on the free list.

Independent of the scanner, the ageing system contributes. We maintain 8 generations, and the ageing system periodically marks the OTEntries for everything in generation 6 and 7 as "removing." This whacks any associated mapping state. The idea of marking gen 6 is to give live objects a chance to stay live before we actually start removing them. It incidentally ensures that an object in gen 7 definitely has no mapping state associated with it.

Conceptually, the scan proceeds as follows:
  1. Make a pass over the OTEntry objects, rewriting all valid entries to scanning.
  2. Make a pass over all objects that contain capabilities, validating their caps against their OTEntry records:
    • If the OTEntry says "destroyed", rewrite the capability as null.
    • If the OTEntry says "removing", unswizzle the capability.
    • If the OTEntry says "scanning", there is a swizzled capability in a live object. Rewrite the OTEntry to "valid".
  3. Make a final pass over the OTEntry records moving any record that is not "valid" to the OTEntry free list.
This can be done incrementally. For each operation that allocates an OTEntry, scan one object. Since we have 1.5 OTEntry records for each ObjectHeader, this is sufficient to keep up. But it's sufficient for a different reason as well: more than half of the objects in memory are data pages. These cannot contain capabilities, so a 1:1 scan ends up giving us a better than 2:1 rate advantage. Yes, I'm ignoring capability pages because they are a statistically insignificant part of the in-memory object set. We'll need to review the rate requirement if we ever have tagged pages.



Jonathan

Ben Laurie

Mar 6, 2026, 4:36:27 AM
to cap-...@googlegroups.com
You really should look more closely at CHERI, since you seem to be slowly reinventing most of its features (only without h/w support). CHERI capabilities, btw, are twice the address size (for 32 and 64 bit, I don't think we could manage that for 16 bit) and contain: permissions, base, end, cursor, object ID.


--
You received this message because you are subscribed to the Google Groups "cap-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cap-talk+u...@googlegroups.com.

William ML Leslie

Mar 6, 2026, 5:48:41 AM
to cap-...@googlegroups.com
On Fri, 6 Mar 2026 at 15:19, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Conceptually, the scan proceeds as follows:
  1. Make a pass over the OTEntry objects, rewriting all valid entries to scanning.
  2. Make a pass over all objects that contain capabilities, validating their caps against their OTEntry records:
    • If the OTEntry says "destroyed", rewrite the capability as null.
    • If the OTEntry says "removing", unswizzle the capability.
    • If the OTEntry says "scanning", there is a swizzled capability in a live object. Rewrite the OTEntry to "valid".
  3. Make a final pass over the OTEntry records moving any record that is not "valid" to the OTEntry free list.
This can be done incrementally.

I have two capabilities to my object.  When the first is scanned, there's a swizzled capability in a live object and the OTEntry is rewritten to "valid".  I use the second capability to delete the object.  At the end of the pass, the OTEntry is marked as destroyed, so the entry is freed.  Later, an object and an OTEntry are allocated.  Nothing here prevents the first capability from indicating the new object and its matching table entry.

I guess entries could be both destroyed (please rewrite to null) and valid (may have swizzled referrers, don't deallocate).

Removing feels like less of a hazard because any change to the object in that state will necessarily mark it valid, and unswizzling is always safe.

--
William ML Leslie
A tool for making incorrect guesses and generating large volumes of plausible-looking nonsense.  Who is this very useful tool for?

William ML Leslie

Mar 6, 2026, 6:12:08 AM
to cap-...@googlegroups.com
On Fri, 6 Mar 2026 at 15:19, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Independent of the scanner, the ageing system contributes. We maintain 8 generations, and the ageing system periodically marks the OTEntries for everything in generation 6 and 7 as "removing." This whacks any associated mapping state. The idea of marking gen 6 is to give live objects a chance to stay live before we actually start removing them. It incidentally ensures that an object in gen 7 definitely has no mapping state associated with it.

Progress may be difficult if we have trouble maintaining appropriate sizes for each generation.  I mean, progression through the gc algorithm can presumably move things out of 6 and 7.  How do we ensure we don't exhaust them?
 

Jonathan S. Shapiro

Mar 6, 2026, 12:08:17 PM
to cap-...@googlegroups.com
On Fri, Mar 6, 2026 at 1:36 AM 'Ben Laurie' via cap-talk <cap-...@googlegroups.com> wrote:
You really should look more closely at CHERI, since you seem to be slowly reinventing most of its features (only without h/w support). CHERI capabilities, btw, are twice the address size (for 32 and 64 bit, I don't think we could manage that for 16 bit) and contain: permissions, base, end, cursor, object ID.

Ben:

Definitely in the queue - I've been getting interrupted a lot these last two weeks.

This line of work predates CHERI by almost 40 years, so "reinventing" might have it backwards. :-) The things we're talking about here are very minor refinements to deal with 64-bit and a possible Rust port. The leaders of the CHERI project were well aware of the EROS work.


Jonathan

Jonathan S. Shapiro

Mar 6, 2026, 12:30:27 PM
to cap-...@googlegroups.com
On Fri, Mar 6, 2026 at 2:48 AM William ML Leslie <william.l...@gmail.com> wrote:
On Fri, 6 Mar 2026 at 15:19, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Conceptually, the scan proceeds as follows:
  1. Make a pass over the OTEntry objects, rewriting all valid entries to scanning.
  2. Make a pass over all objects that contain capabilities, validating their caps against their OTEntry records:
    • If the OTEntry says "destroyed", rewrite the capability as null.
    • If the OTEntry says "removing", unswizzle the capability.
    • If the OTEntry says "scanning", there is a swizzled capability in a live object. Rewrite the OTEntry to "valid".
  3. Make a final pass over the OTEntry records moving any record that is not "valid" to the OTEntry free list.
This can be done incrementally.

I have two capabilities to my object.  When the first is scanned, there's a swizzled capability in a live object and the OTEntry is rewritten to "valid".  I use the second capability to delete the object.  At the end of the pass, the OTEntry is marked as destroyed, so the entry is freed.  Later, an object and an OTEntry are allocated.  Nothing here prevents the first capability from indicating the new object and its matching table entry.

Before an OTEntry can be marked destroyed, the OTEntry index in the target ObjectHeader needs to be set to a canary value signalling an invalid OTEntry. The moment that is done, all outstanding capabilities to that object become invalid regardless of the OTEntry marking. The purpose of the destroyed marker in the OTEntry is to signal that the capability, when unswizzled, should be rewritten to Null.

But you are correct that there is an order of operations race, and the first capability will not be unswizzled correctly in the situation you describe. Nice catch!

Offhand, I think we need a generation marker bit in the OTEntry flags. This will reduce the in-capability OID size by one bit, but I don't think that's a problem.

This is conceptually similar to what happens in checkpoint when we need to keep the old object version around until it is written down. But in this situation copy-on-write does not help us.
 
I guess entries could be both destroyed (please rewrirte to null) and valid (may have swizzled referrers, don't deallocate).

The two states are mutually exclusive. Once destroyed there is no path to valid.
 
Removing feels like less of a hazard because any change to the object in that state will necessarily mark it valid, and unswizzling is always safe.

If an object becomes a candidate for removal, its OTEntry index gets whacked and all outstanding capabilities are temporarily invalid. But in this case, the effect of validating the capability is to restore the OTEntry index in the target object, which makes it live again and escalates it into a younger ageing generation. This also marks the OTEntry as active.

The current coyotos code base is in-memory only, so there isn't any object ageing right now. Which is why I missed this in my first attempt. :-)


Jonathan

Jonathan S. Shapiro

Mar 6, 2026, 1:02:00 PM
to cap-...@googlegroups.com
On Fri, Mar 6, 2026 at 3:12 AM William ML Leslie <william.l...@gmail.com> wrote:
On Fri, 6 Mar 2026 at 15:19, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Independent of the scanner, the ageing system contributes. We maintain 8 generations, and the ageing system periodically marks the OTEntries for everything in generation 6 and 7 as "removing." This whacks any associated mapping state. The idea of marking gen 6 is to give live objects a chance to stay live before we actually start removing them. It incidentally ensures that an object in gen 7 definitely has no mapping state associated with it.

Progress may be difficult if we have trouble maintaining appropriate sizes for each generation.  I mean, progression through the gc algorithm can presumably move things out of 6 and 7.  How do we ensure we don't exhaust them?

It's a valid point and a valid question! I'd have to go back to the EROS code to look at an actual implementation, but here's how this is intended to work:
  • Generation 6 is a canary pass. The reason we mark objects for removal in this generation is so that objects we actually still need will get revived. This also invalidates existing mappings. The goal is to make revival from generation 7 very rare.
  • Generation 7 is objects we actually expect to remove, because they have made it past the canary phase. It is still possible for a revival to happen, but it is rare.
Revival has the effect of moving the object to an intermediate generation.

At the start of the ageing pass, every object is moved to the next generation (generation 6 is merged into generation 7). Ageing is triggered by a low threshold of reclaimable objects. If generation 7 was too small, the trigger will simply be hit again more quickly.

If we are running out of free object frames, then either we have a broken algorithm or the system overall is working harder than it is provisioned for. You can fix the algorithm. If you must, you can decide which work gets starved - that's what the scheduler is for. You can consider residency quotas (which are a different form of scheduling). But you can't extract 10 lbs of computation out of a 5 lb system configuration.


Jonathan 

William ML Leslie

Mar 6, 2026, 5:15:42 PM
to cap-...@googlegroups.com
On Sat, 7 Mar 2026 at 03:30, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
On Fri, Mar 6, 2026 at 2:48 AM William ML Leslie <william.l...@gmail.com> wrote:
On Fri, 6 Mar 2026 at 15:19, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Conceptually, the scan proceeds as follows:
  1. Make a pass over the OTEntry objects, rewriting all valid entries to scanning.
  2. Make a pass over all objects that contain capabilities, validating their caps against their OTEntry records:
    • If the OTEntry says "destroyed", rewrite the capability as null.
    • If the OTEntry says "removing", unswizzle the capability.
    • If the OTEntry says "scanning", there is a swizzled capability in a live object. Rewrite the OTEntry to "valid".
  3. Make a final pass over the OTEntry records moving any record that is not "valid" to the OTEntry free list.
This can be done incrementally.

I have two capabilities to my object.  When the first is scanned, there's a swizzled capability in a live object and the OTEntry is rewritten to "valid".  I use the second capability to delete the object.  At the end of the pass, the OTEntry is marked as destroyed, so the entry is freed.  Later, an object and an OTEntry are allocated.  Nothing here prevents the first capability from indicating the new object and its matching table entry.

Before an OTEntry can be marked destroyed, the OTEntry index in the target ObjectHeader needs to be set to a canary value signalling an invalid OTEntry. The moment that is done, all outstanding capabilities to that object become invalid regardless of the OTEntry marking. The purpose of the destroyed marker in the OTEntry is to signal that the capability, when unswizzled, should be rewritten to Null.

But you are correct that there is an order of operations race, and the first capability will not be unswizzled correctly in the situation you describe. Nice catch!

The trouble is that if we are moving OTEntry records that are "destroyed" into the free list, the entry and the object frame could be re-associated by the time we go to use the swizzled capability again.  It's a use-after-free.

Destroy can clobber the otEntryIndex on the object, but the Scan can't free the OTEntry itself until you can be sure there are no swizzled caps referring to it, otherwise there is a (slim) chance that an unrelated allocation causes cap.otIndex == cap.target->otIndex again.

Jonathan S. Shapiro

Mar 6, 2026, 11:19:21 PM
to cap-...@googlegroups.com
OK. Let's try this again. What got me messed up is that the version in the current Coyotos code has both more flag bits than it needs and names them very poorly.

Forget removing. We don't need that in the flag bits. The flags we actually need are:
  • active (bit is 0) or free (bit is 1)
  • valid (bit is 0) or destroyed (bit is 1). This field is not altered by the OTEntry GC, but may be altered by other operations.
At the start of the GC pass, all OTEntry objects are marked [provisionally] free. If they are still free at the end, they go on the free list.
During capability scan, swizzled capabilities:
  • Are unswizzled if they live in an old object (generation 6 or 7). No change is made to the OTEntry flags.
  • Are unswizzled if the OTEntry is marked destroyed. No change is made to the OTEntry flags.
  • Remain swizzled in all other cases. The OTEntry is set to active.
At the end of the mark pass, all free OTEntry objects are actually placed on the free list.

Concurrent with this:
  • Objects may be destroyed. Their OTEntry migrates to active+destroyed.
  • Objects may be loaded. Their initial OTEntry records come off the free list, and are marked active+valid.
  • Capabilities may be swizzled (or re-swizzled). Their OTEntry migrates to active+<unchanged>
  • Objects may be cleaned, in which case capabilities to valid objects remain in valid form and capabilities to destroyed objects become Null. There is no change to the OTEntry state.

So basically, anything that swizzles a capability or loads an object or preserves a swizzled capability un-changed results in a live OTEntry.

It is possible for a capability swizzle to mark an OTEntry live while the "mark all OTEntry objects free" pass is in progress. That's okay, because it happens before the capability scan begins.


Jonathan

Ben Laurie

Mar 8, 2026, 1:58:14 PM
to cap-...@googlegroups.com
Speaking as one of them, I quite agree: but that's not what I meant. I meant that a lot of the detailed questions you've been thinking about have been extensively worked on, implemented at scale, and tested in the process of CHERI's implementation, and it is worth learning from that experience.

I would also strongly suggest that Coyotos would benefit hugely from being implemented on top of CHERI.

BTW, there are substantial microarchitectural benefits to be gained from using CHERI with a single address space.

 


