I very much appreciate the discussions about 64-bit capability layout, and in particular the move from pointers to cursors. That should make a number of things simpler.
William suggested increasing the in-memory capability size, which really is an interesting idea. I've spent much of today looking at it from various angles, which was a very useful refresh on the problem space. Having done so, my response is no, I don't think we should do that.
There are four main requirements for the capability structure and its content:
- Capability size must be a power of two.
- Certain things need to be in the payload. They don't all fit, so we need to make some trade-offs (see the new layout below).
- We don't have the real memory budget for a 32-byte in-memory capability format. Therefore the payload organization must allow for swizzling.
- We want to be able to do a "blind" capability copy; merely copying a capability should not involve any long-running or complicated algorithms.
Expanding these in turn:
Powers of Two
If the capability size is not a power of two:
- There is a range of negative caching impacts, because some capabilities will straddle cache lines. In some cases, critical fields will straddle cache lines, breaking our ability to do atomic operations on those fields.
- Capabilities do not fit naturally into a capability page.
- Tagged memory is so much more complex when the size is not a power of two as to be impractical.
- Some fields that need to reside at natural alignments do not. Worse, a field may be naturally aligned in even-numbered capabilities but not in odd-numbered ones.
Payload
Ideally, we would like the unswizzled capability layout to hold:
- A naturally aligned 64-bit OID, or (see below) a 56-bit OID with 8 bits used for other purposes.
- A naturally aligned 32-bit allocation count.
- A naturally aligned 32-bit protected payload.
- 11 bits of flag fields, comprising 5 capability type bits, 1 bit indicating whether the capability is swizzled, and 5 bits describing permission restrictions. The current layout has 6 capability type bits, but that is the result of a rearrangement I did that wasn't well thought out.
Obviously this isn't going to fit in a 16-byte representation, so we have to choose whether to shrink the allocation count, the protected payload, the OID, or some combination of these. I went through several options and ended up with the following layout (which is not the current layout):
0: allocCount: 21 or subType: 21, restrictions: 5, type: 5, swizzled (= 0): 1
4: protPayload: 32 or l2g+match for memory caps
8: pad: 2, OID: 62 or offset: 62 [window and background]
On 64-bit platforms, this supports up to 16 zebibytes of attached storage and 8 TiB of DRAM (see below) using the following rewrite of the last word:
8: pad: 1, ObjectHeader index: 31, OTEntry index: 32 (when the swizzled bit is 1)
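A minimal sketch of this layout in C may make the packing easier to see. The field names are illustrative, and a real kernel would presumably use explicit shifts and masks rather than bitfields, since bitfield packing is compiler-dependent:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the proposed 16-byte capability layout.
   The union selects between the unswizzled (OID) form and the
   swizzled (index) form of the last 8 bytes. */
typedef struct Capability {
  uint32_t swizzled     : 1;   /* 0 = unswizzled (OID form)   */
  uint32_t type         : 5;   /* primary capability type     */
  uint32_t restrictions : 5;   /* permission restriction bits */
  uint32_t allocCount   : 21;  /* or subType for "Misc" caps  */

  uint32_t protPayload;        /* or l2g+match for memory caps */

  union {
    uint64_t oid;              /* low 62 bits: OID or offset  */
    struct {
      uint64_t oteIndex  : 32; /* OTEntry index               */
      uint64_t ohdrIndex : 31; /* per-type ObjectHeader index */
      uint64_t pad       : 1;
    } sw;
  } u;
} Capability;
```

Note that the first 8 bytes are identical in both forms, which is the property called out below: swizzling only rewrites the second half.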
There are other ways we could arrange things, but we want the fields in the first 8 bytes to remain present when the capability is swizzled. For systems with very large DRAMs, it would not be unreasonable to use a 32-byte in-memory format to allow for wider indexes. Ironically, this would reduce the OTEntry objects to a single byte, because the only part we would need in that variant is the flags.
On a bright note, this is a pointer-free data structure, so keeping Rust happy will be much easier.
Note that I've reduced the type field from 6 bits to 5 and introduced a subType field in order to increase the allocCount field. KeyKOS followed the GNOSIS practice of having a primary key type "Misc" with all of the random kernel capabilities bundled under that, and included the range capability in the primary group because it needed all the bits it could get to describe the range.
In Coyotos, that approach would use only 9 primary key types; everything else would be a subtype of "Misc". Now that I've done the math, I might change the type field to 4 bits and expand either the restrictions field or the allocCount.
Swizzling: Key Ring Not Possible for 64-bit
The reason for swizzling is that using an OID to look up an ObjectHeader involves a hash table search, which tends to involve cold cache lines. The KeyKOS solution (for 32-bit systems) was to replace three words of the capability with pointers: one to the ObjectHeader, and two to form a doubly linked list of capabilities rooted at the ObjectHeader. KeyKOS referred to that linked list as the "key ring." We found in EROS that the performance of this was pretty bad. More recent processors are wildly more aggressive about speculation, so the performance might be different today.
The key ring won't work for Coyotos, because we can't fit 24 bytes worth of pointers into a 16-byte capability format no matter how cleverly we pack them. For high-performance computing machines that have ridiculous amounts of memory, it would make sense to move to a 32-byte in-memory capability format, though there's a challenge with capability pages if we do that. But that can be done by changing the kernel without changing the store. With care, it's not a semantic change.
Which, by the way, is why I didn't adopt 56-bit OIDs. We could definitely use more allocation count bits, and down-sizing the OID would buy them for us, so I may yet do that and split the allocation count field into two parts. But I'm trying to hold that as a last resort.
Swizzling: OTEntries
The OTEntry approach works for both word sizes. Right now the structure definition is 16 bytes when it only needs to be 8, and I will be fixing that. The reason the OID is 62 bits in the format above is that the two-bit pad area replaces the three "flags" bits in the OTEntry structure, reducing it to 8 bytes.
To refresh and revise:
- When the capability is swizzled, the ObjectHeader index holds a type-specific index into the per-type ObjectHeader array. A 31-bit field supports up to 8 TiB of main memory.
- The OTEntry index field needs to be one bit larger so that we can oversupply OTEntries relative to ObjectHeaders.
A swizzled capability is valid if cap.OTEIndex == typeObjectHeaders[cap.OHdrIndex].OTEIndex.
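That check can be sketched in a few lines of C. This is a toy version under assumed names (the structure contents here are illustrative, not the real definitions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct ObjectHeader {
  uint32_t oteIndex;   /* OTEntry currently bound to this object */
} ObjectHeader;

typedef struct SwizzledCap {
  uint32_t ohdrIndex;  /* index into the per-type ObjectHeader array */
  uint32_t oteIndex;   /* OTEntry index cached when swizzled         */
} SwizzledCap;

/* A swizzled capability remains valid only while its cached OTEntry
   index still matches the one recorded in its ObjectHeader. */
static bool
cap_is_valid(const SwizzledCap *cap, const ObjectHeader *typeObjectHeaders)
{
  return cap->oteIndex == typeObjectHeaders[cap->ohdrIndex].oteIndex;
}
```

The attraction of this scheme is that invalidation is O(1): rewriting the ObjectHeader's OTEntry index invalidates every outstanding swizzled capability at once, without walking a key ring.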
Before somebody asks, yes, I considered renaming the two structures O'Header and O'TEntry in recognition of Xanadu's Irish subsidiary. Ah, the fun Mark would have had with those puns.
Progress
Eric Northup and I had an extended discussion about cleaning progress for OTEntries and whether it scales to 64 bits. The answer is yes.
Here are the operations that allocate an OTEntry:
- Bringing an object in from the store when its first capability is swizzled for use.
- Moving an object from the "free" generation back to a live generation when its OTEntry has already been marked free. Conceptually this is a fast-path object load.
- Destroying and re-allocating an object within the same checkpoint generation, because this bumps the allocation count. Given the reduced allocation count size, we'll be doing more work to wear-level objects, which will make this rare.
The question is: how do we make sure that OTEntry scan completes before free OTEntries are exhausted?
OTEntry states are { destroyed, scanning, removing, valid }. Free entries are recognized by virtue of the fact that they are on the free list.
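Conveniently, four states fit in exactly two bits, which is consistent with the reduced 8-byte OTEntry described earlier (62-bit OID plus two remaining bits). A hypothetical sketch, with invented names and enum values:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: the four OTEntry states encode in two bits. */
typedef enum {
  OTE_DESTROYED = 0,
  OTE_SCANNING  = 1,
  OTE_REMOVING  = 2,
  OTE_VALID     = 3
} OteState;

/* Hypothetical reduced 8-byte OTEntry: a 62-bit OID leaves exactly
   two bits for the state field. Free entries carry no state bit;
   they are recognized by membership on the free list. */
typedef struct OTEntry {
  uint64_t oid   : 62;
  uint64_t state : 2;
} OTEntry;
```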
Independent of the scanner, the ageing system contributes. We maintain 8 generations, and the ageing system periodically marks the OTEntries for everything in generations 6 and 7 as "removing." This whacks any associated mapping state. The idea of marking generation 6 is to give live objects a chance to stay live before we actually start removing them. It incidentally ensures that an object in generation 7 definitely has no mapping state associated with it.
Conceptually, the scan proceeds as follows:
- Make a pass over the OTEntry objects, rewriting all valid entries to scanning.
- Make a pass over all objects that contain capabilities, validating their caps against their OTEntry records:
- If the OTEntry says "destroyed", rewrite the capability as null.
- If the OTEntry says "removing", unswizzle the capability.
- If the OTEntry says "scanning", there is a swizzled capability in a live object. Rewrite the OTEntry to "valid".
- Make a final pass over the OTEntry records moving any record that is not "valid" to the OTEntry free list.
This can be done incrementally. For each operation that allocates an OTEntry, scan one object. Since we have 1.5 OTEntry records for each ObjectHeader, this is sufficient to keep up. But it's sufficient for a different reason as well: more than half of the objects in memory are data pages. These cannot contain capabilities, so a 1:1 scan ends up giving us a better than 2:1 rate advantage. Yes, I'm ignoring capability pages because they are a statistically insignificant part of the in-memory object set. We'll need to review the rate requirement if we ever have tagged pages.
Jonathan