Coyotos update/question: pointers in cyclic graphs vs. safe Rust


Jonathan S. Shapiro

Feb 19, 2026, 6:03:30 PM
to cap-talk
As I said in my note from an hour ago, I've been digging pretty thoroughly into some of the low-level Coyotos data structures and how they get translated into Rust. I've hit an issue that I don't know how to deal with, so I want to lay it out and see what advice people may have.

Let me preface this by saying that I've now been working with the borrow checker long enough to pretty much have my head around it. I definitely see benefits for concurrent applications. I also see the benefits of the linear typing that emerges as an (intended) side effect. That said, I now feel a fair bit of confidence that the Rust borrow checker is categorically incompatible with the way Coyotos guards concurrency, and I don't see a path from where Coyotos is to a performant rewrite in safe Rust.

Before getting into the crucial example, a quick reminder: the Coyotos kernel doesn't ever want to deallocate memory in the sense of free(). Literally all of the references being thrown around are &'static T, so borrowing (or, more precisely, dropping) doesn't actually make any sense for them for memory release purposes. The remaining use is to implement the &mut T vs &T exclusion for concurrency purposes, which, as I'm about to describe, is also a problem for the Coyotos kernel.

To illustrate, let's talk about how Coyotos builds [explicitly guarded] cyclic reference graphs pervasively. The most obvious example of this is the troika of Capability, ObjectHeader, and OTEntry:

    -----> OTEntry <--------
   /          |             \
Cap           |               Cap2, ... CapN
   \          v             /
    ----->ObjectHeader<-----

The purpose of the OTEntry is to serve as a canary when target objects are removed from memory. It allows us to unswizzle or invalidate capabilities that didn't get the memo about the object's removal and GC them back to unswizzled form later. Note that there are no back-pointers from objects to their capabilities. That's a change from EROS and KeyKOS, and it took a lot of re-work (and was comprehensively dictated by cache impacts).

There are both invariants and mutexes that keep this troika concurrency-safe. The first is that the Cap necessarily lives inside an object that we already hold a mutex over. The other has to do with the checks required to extract the ObjectHeader*, which can only be done by calling (in Rust terms) a method on the capability that performs the following checks:
  1. Capability is swizzled and valid (according to OTEntry). Lock the target object for the duration of the system call, fetch the ObjectHeader* from the capability, and proceed.
  2. Capability is swizzled but not valid (stale - would eventually be GC'd).
    1. If the OTEntry says target was destroyed, unswizzle to void capability and re-start the system call.
    2. Otherwise unswizzle the capability to its valid-but-unswizzled form and re-start the system call.
  3. The capability is unswizzled. Look up the target object by (type, OID). Sub-cases:
    1. Not found in memory. Initiate object fetch, queue on completion, and abandon current system call.
    2. Allocation counts do not match. Rewrite capability as void and restart system call.
    3. Allocation counts match. Rewrite capability in swizzled form and proceed.
In Rust terms, the sequence above is (effectively) the implementation of a borrow on the object named by a capability.
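The three-way dispatch above can be sketched as a small Rust state machine. To be clear, this is a toy illustration, not the Coyotos code: every name here (Cap, OtEntry, Prepared, prepare) is invented, and the locking, void-capability rewrite, and (type, OID) lookup are elided.

```rust
// Hypothetical sketch of the capability-preparation checks; all names
// are invented for illustration and do not come from the Coyotos sources.

#[derive(Clone, Copy, PartialEq, Debug)]
enum Prepared {
    Proceed,        // case 1 or 3.3: lock target, fetch the ObjectHeader*
    RestartSyscall, // cases 2.1, 2.2, 3.2: capability rewritten, retry
    AbandonSyscall, // case 3.1: object fetch initiated, caller queued
}

struct OtEntry { destroyed: bool }

enum Cap {
    Swizzled { valid: bool, ot: OtEntry },
    Unswizzled { in_memory: bool, alloc_count_matches: bool },
}

fn prepare(cap: &mut Cap) -> Prepared {
    match cap {
        // 1. Swizzled and valid: lock the target and proceed.
        Cap::Swizzled { valid: true, .. } => Prepared::Proceed,
        // 2. Swizzled but stale: unswizzle (to void if destroyed) and restart.
        Cap::Swizzled { valid: false, ot } => {
            let destroyed = ot.destroyed;
            *cap = Cap::Unswizzled {
                in_memory: !destroyed,
                alloc_count_matches: !destroyed,
            };
            Prepared::RestartSyscall
        }
        // 3.1 Not in memory: initiate fetch, abandon this attempt.
        Cap::Unswizzled { in_memory: false, .. } => Prepared::AbandonSyscall,
        // 3.2 Allocation counts differ: rewrite as void, restart.
        Cap::Unswizzled { alloc_count_matches: false, .. } => Prepared::RestartSyscall,
        // 3.3 Counts match: rewrite in swizzled form and proceed.
        Cap::Unswizzled { .. } => {
            *cap = Cap::Swizzled { valid: true, ot: OtEntry { destroyed: false } };
            Prepared::Proceed
        }
    }
}
```

The point of the sketch is only that the whole sequence is, observably, "attempt a borrow, or rewrite the capability and bail out".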

Side notes:
  1. Changing a capability from swizzled to unswizzled (or the other way) requires marking its containing object dirty, which can trigger a system call re-start as well. That's OK. Actually, it's exactly what we want in this circumstance. Doesn't matter that the borrow hasn't (yet) succeeded, because we'll abandon the control flow if we can't mark the container dirty.
  2. At the end of this operation, we haven't established that it's OK to write to the target object. That requires a further step that exists mainly to ensure the target object is marked dirty before we write it.
  3. "Lock" means "to the currently executing CPU and system call." It's completely OK to do that more than once on the same object.

The OTEntry references here could hypothetically be replaced by cursors, but the offset computation introduced to turn those cursors into array entry references would actually be noticeable at the margin. Computing that offset is a dependent and depended-on computation. Every cycle we lose in the kernel is multiplied everywhere.

Unfortunately, the ObjectHeader references can't be turned into cursors. ObjectHeaders appear at the "front" of each durable object type (exception: pages, where they are a side table). Each durable object type lives in a type specific array. Sometimes more than one, because system physical memory isn't actually contiguous on real machines, and that can't be cleaned up in the virtual map. This means that translating a cursor into a reference is both type-dependent and complicated. And not really necessary from a safety perspective, since the type field of the capability serves as a type tag telling us what the containing object type is for this particular ObjectHeader structure and we already have the concurrency guard logic that I've described above. There are a (very) few places where we use this tag to justify a type up-cast to the containing type.
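The tag-justified up-cast might look something like the following sketch. It assumes (hypothetically) repr(C) objects with the header as their first field; ObType, the fields shown, Process, and as_process are all invented names, not the Coyotos definitions.

```rust
// Illustrative only: a tag-checked up-cast from a header reference to
// the containing object, relying on the header sitting at offset 0.

#[derive(Clone, Copy, PartialEq)]
enum ObType { Process, Gpt }

#[repr(C)]
struct ObjectHeader { ty: ObType }

#[repr(C)]
struct Process {
    hdr: ObjectHeader, // header at the "front" of the durable object
    fault_code: u32,
}

fn as_process(hdr: &ObjectHeader) -> Option<&Process> {
    if hdr.ty == ObType::Process {
        // SAFETY: relies on the system-wide invariant that every header
        // tagged Process really is the first field of a Process.
        Some(unsafe { &*(hdr as *const ObjectHeader as *const Process) })
    } else {
        None
    }
}
```

The unsafe block is exactly the up-cast the text describes: the type tag, not the borrow checker, is what justifies it.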

Increasing the size of these references is not an option. Adding a word to every capability because of a fat reference could very well turn into terabytes of added storage requirement on the disk.


I can see a case where we have the prepare() call return a target object &T reference (at least in the case where it returns). Taking mutable reference to the target requires some work that the Rust compiler doesn't know about, because we have to mark the target object dirty along the way. I'm actually not sure if it's possible to intercede in the middle of a reference's mutability upgrade in Rust. Haven't looked yet.

Note that we can have multiple capabilities referencing the same object, and in perverse cases we can end up with two or more being passed during the same system call. This is not an error.

Note also that we have two logically distinct kinds of mutations going on to the target object. One is for the purpose of changing the object's metadata (to mark it dirty). That operation requires a "fence" on the object, but we do not want it to invalidate outstanding references for read purposes. The other (which doesn't always arise) is for modifying the object's data. The Rust reference taking mechanism doesn't recognize that these are logically distinct.


At the moment, I'm not seeing a Rust path through this example without resorting to raw pointers. Not because the pointers are invalid, but because the borrow checker's concurrency guards are so thoroughly in opposition to the Coyotos concurrency guards and object graph. The ObjectHeader* pointers appear so pervasively that I think this amounts to abandoning the baked-in Rust concurrency guards for all cross-CPU purposes.


I'm totally willing to keep going for a bit to find out what the impact really is. Even if it turns out to be a negative result, it's an interesting negative result.


What do others see here in terms of "how to convert to Rust" that I am failing to see?


Jonathan

William ML Leslie

Feb 19, 2026, 9:26:51 PM
to cap-...@googlegroups.com
On Fri, 20 Feb 2026 at 09:03, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Unfortunately, the ObjectHeader references can't be turned into cursors. ObjectHeaders appear at the "front" of each durable object type (exception: pages, where they are a side table). Each durable object type lives in a type specific array. Sometimes more than one, because system physical memory isn't actually contiguous on real machines, and that can't be cleaned up in the virtual map.


Is this still true on 64-bit systems?  I get that it can be a pain in the neck to embed the assumption that the object tables are contiguous into the system, especially with hot-swappable RAM and CXL.
 
--
William ML Leslie
A tool for making incorrect guesses and generating large volumes of plausible-looking nonsense.  Who is this very useful tool for?

Kevin Reid

Feb 19, 2026, 10:19:42 PM
to cap-...@googlegroups.com
On Thu, Feb 19, 2026 at 3:03 PM Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Note also that we have two logically distinct kinds of mutations going on to the target object. One is for the purpose of changing the object's metadata (to mark it dirty). That operation requires a "fence" on the object, but we do not want it to invalidate outstanding references for read purposes. The other (which doesn't always arise) is for modifying the object's data. The Rust reference taking mechanism doesn't recognize that these are logically distinct.

Distinct in what sense? Do you mean that modifying the metadata “doesn’t count” in some sense? In that case, consider whether you can store the metadata in atomic types or (if atomic types cannot express what you need) UnsafeCell, which indicates that the contents of the cell may be modified (unsafely) even when an & reference to the cell exists, and leaves it up to you to avoid data races.
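As a minimal sketch of the atomic option (names invented, not from Coyotos): the dirty bit lives in an AtomicBool, so it can be flipped through a shared reference without invalidating other outstanding & borrows.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical header with its metadata in an atomic field.
struct ObjectHeader {
    dirty: AtomicBool,
    // ... object data would follow, guarded separately ...
}

fn mark_dirty(hdr: &ObjectHeader) {
    // Mutation through &: fine for atomics, and it does not conflict
    // with other shared borrows of the same header.
    hdr.dirty.store(true, Ordering::Release);
}
```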

At the moment, I'm not seeing a Rust path through this example without resorting to raw pointers. Not because the pointers are invalid, but because the borrow checker's concurrency guards are so thoroughly in opposition to the Coyotos concurrency guards and object graph. The ObjectHeader* pointers appear so pervasively that I think this amounts to abandoning the baked-in Rust concurrency guards for all cross-CPU purposes.

I would say that, in general, "writing a garbage collector in Rust" typically requires at least some raw pointers and unsafe code — outside of “toy virtual machine” cases where the memory you’re garbage collecting is an explicitly accessed array, of course. For example, in the Rust standard library, Arc provides memory management that the borrow checker does not understand, using raw pointers internally, and yet also cooperates with the borrow checker.
 
What do others see here in terms of "how to convert to Rust" that I am failing to see?

It’s hard to say more without more concrete details — actual data structures and function signatures that could be improved.

Jonathan S. Shapiro

Feb 19, 2026, 10:57:58 PM
to cap-...@googlegroups.com
On Thu, Feb 19, 2026 at 6:26 PM William ML Leslie <william.l...@gmail.com> wrote:
On Fri, 20 Feb 2026 at 09:03, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Unfortunately, the ObjectHeader references can't be turned into cursors. ObjectHeaders appear at the "front" of each durable object type (exception: pages, where they are a side table). Each durable object type lives in a type specific array. Sometimes more than one, because system physical memory isn't actually contiguous on real machines, and that can't be cleaned up in the virtual map.


Is this still true on 64-bit systems?  I get that it can be a pain in the neck to embed the assumption that the object tables are contiguous into the system, especially with hot-swappable RAM and CXL.

Sadly yes, it's still true. Regardless, a lot of deeply embedded systems will continue to be 32-bit for quite some time. There's no compelling reason to make a smart light switch 64-bit.


Jonathan

Jonathan S. Shapiro

Feb 19, 2026, 11:20:22 PM
to cap-...@googlegroups.com
On Thu, Feb 19, 2026 at 7:19 PM Kevin Reid <kpr...@switchb.org> wrote:
On Thu, Feb 19, 2026 at 3:03 PM Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Note also that we have two logically distinct kinds of mutations going on to the target object. One is for the purpose of changing the object's metadata (to mark it dirty). That operation requires a "fence" on the object, but we do not want it to invalidate outstanding references for read purposes. The other (which doesn't always arise) is for modifying the object's data. The Rust reference taking mechanism doesn't recognize that these are logically distinct.

Distinct in what sense? Do you mean that modifying the metadata “doesn’t count” in some sense? In that case, consider whether you can store the metadata in atomic types or (if atomic types cannot express what you need) UnsafeCell, which indicates that the contents of the cell may be modified (unsafely) even when an & reference to the cell exists, and leaves it up to you to avoid data races.

That's a very useful thing to point out. Thank you. Obviously I'm wandering around in a part of Rust that's not the "standard problem", so thank you!

Also: thank you for the question, because it shows I said something that conveyed a badly wrong impression. It's not that the metadata update to "dirty" doesn't "count". It's that being dirty does not, in and of itself, mean that anything has yet been mutated.

But it definitely counts! If the target object has not been marked dirty, it is an ERROR to have an outstanding mutable reference to that object. By which I mean:

let ro_ref = &target_object;
let mut mut_ref = &mut ro_ref;

is a static error, because the mutable ref cannot safely be taken until we know the target object has been marked dirty. Not "it's immutable to code outside the same crate", but "taking the &mut reference is just not okay at all until the target object is dirty". This may be a use case for a read-only wrapper - I need to go look at that.

Mind you, we didn't actually have that pointer-level permission guard in Coyotos. It's just that &mut T as implemented in Rust is a form of privilege escalation. The fact that the mut_ref binding above is possible tells us that &T does not mean & readonly T.
 
At the moment, I'm not seeing a Rust path through this example without resorting to raw pointers. Not because the pointers are invalid, but because the borrow checker's concurrency guards are so thoroughly in opposition to the Coyotos concurrency guards and object graph. The ObjectHeader* pointers appear so pervasively that I think this amounts to abandoning the baked-in Rust concurrency guards for all cross-CPU purposes.

I would say that, in general, "writing a garbage collector in Rust" typically requires at least some raw pointers and unsafe code — outside of “toy virtual machine” cases where the memory you’re garbage collecting is an explicitly accessed array, of course. For example, in the Rust standard library, Arc provides memory management that the borrow checker does not understand, using raw pointers internally, and yet also cooperates with the borrow checker.

Because it's all pre-allocated arrays with a free list, the garbage collector won't actually be the problem. The big mess is going to be with the capability troika that I discussed.
 
 
What do others see here in terms of "how to convert to Rust" that I am failing to see?

It’s hard to say more without more concrete details — actual data structures and function signatures that could be improved.

Well, that's certainly fair. Unfortunately every last one of those data structures contains capabilities first.


Jonathan 

William ML Leslie

Feb 20, 2026, 2:28:47 AM
to cap-...@googlegroups.com
On Fri, 20 Feb 2026 at 09:03, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
The OTEntry references here could hypothetically be replaced by cursors, but the offset computation introduced to turn those cursors into array entry references would actually be noticeable at the margin. Computing that offset is a dependent and depended-on computation. Every cycle we lose in the kernel is multiplied everywhere.

I should measure this, but in the spirit of "the processor is hurtling head-long into its next cache miss" I think we have plenty of time waiting for dependent pointer lookups.  In the IPC path, for example, we have the chain from the entry cap to the endpoint, to the process it targets, to the object table entry on that process.  A little address calculation in there might be completely free.

Matt Rice

Feb 20, 2026, 11:45:14 AM
to cap-...@googlegroups.com
On Thu, Feb 19, 2026 at 11:03 PM Jonathan S. Shapiro
<jonathan....@gmail.com> wrote:
>
> Before getting into the crucial example, a quick reminder: the Coyotos kernel doesn't ever want to deallocate memory in the sense of free().
>

One of the things this reminds me of are the various arenas which
don't deallocate until the arena is deallocated
which seems slightly more similar to the coyotos usage than the usual
scope based per allocation dropping mechanism. But as Kevin said, it's
pretty hard without a pesky level of detail.

Kevin Reid

Feb 20, 2026, 11:46:06 AM
to cap-...@googlegroups.com
On Thu, Feb 19, 2026 at 8:20 PM Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Also: thank you for the question, because it shows I said something that conveyed a badly wrong impression. It's not that the metadata update to "dirty" doesn't "count". It's that being dirty does not, in and of itself, mean that anything has yet been mutated.

But it definitely counts! If the target object has not been marked dirty, it is an ERROR to have an outstanding mutable reference to that object. By which I mean:

let ro_ref = &target_object;
let mut mut_ref = &mut ro_ref;

is a static error, because the mutable ref cannot safely be taken until we know the target object has been marked dirty. Not "it's immutable to code outside the same crate", but "taking the &mut reference is just not okay at all until the target object is dirty". This may be a use case for a read-only wrapper - I need to go look at that.

Mind you, we didn't actually have that pointer-level permission guard in Coyotos. It's just that &mut T as implemented in Rust is a form of privilege escalation. The fact that the mut_ref binding above is possible tells us that &T does not mean & readonly T.

The Rust code you have quoted does not do what you say it does. It doesn’t compile, but supposing we fix the mut annotations so it does, and add type annotations for illustration, we get:
 
let mut ro_ref: &T = &target_object;
let mut_ref: &mut &T = &mut ro_ref;

That is, mut_ref does not allow mutation of *ro_ref. Rather, it allows mutation of ro_ref, i.e. changing ro_ref to point to something else.

In Rust, creating an &mut T that points to the same T as a &T is prohibited by safe Rust and undefined behavior in unsafe Rust (with one possible exception around UnsafeCell which is not yet settled). The privilege escalation you describe does not exist.
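Concretely (a tiny illustration; reaim is a made-up name):

```rust
// The second reference lets you re-aim ro_ref, not mutate the i32 it
// originally pointed at.
fn reaim() -> (i32, i32) {
    let target_object: i32 = 1;
    let other: i32 = 2;
    let mut ro_ref: &i32 = &target_object;
    let mut_ref: &mut &i32 = &mut ro_ref;
    *mut_ref = &other; // changes which object ro_ref points to
    (*ro_ref, target_object) // target_object itself was never written
}
```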

Jonathan S. Shapiro

Feb 24, 2026, 4:33:56 PM
to cap-...@googlegroups.com
On Thu, Feb 19, 2026 at 11:28 PM William ML Leslie <william.l...@gmail.com> wrote:
On Fri, 20 Feb 2026 at 09:03, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
The OTEntry references here could hypothetically be replaced by cursors, but the offset computation introduced to turn those cursors into array entry references would actually be noticeable at the margin. Computing that offset is a dependent and depended-on computation. Every cycle we lose in the kernel is multiplied everywhere.

I should measure this, but in the spirit of "the processor is hurtling head-long into its next cache miss" I think we have plenty of time waiting for dependent pointer lookups.  In the IPC path, for example, we have the chain from the entry cap to the endpoint, to the process it targets, to the object table entry on that process.  A little address calculation in there might be completely free.

It might, and it wouldn't be that big an effort to find out. 

Jonathan S. Shapiro

Feb 24, 2026, 4:41:46 PM
to cap-...@googlegroups.com
Another way to think about what Coyotos does is that it does an initial allocation out of a global arena that is never deallocated.

To say that there is no allocate or free in Coyotos is true if we are comparing to malloc() and free(), but the truth is that what we actually have is typed heaps. Free frames in those heaps get allocated on object page-in and freed by the ager and the snapshot mechanism. The benefit we're really getting is from the typed heap, which is something that was noted in the literature a long time ago. We can have pointers to stale object frames, but at least the pointer type and the target object type remain in sync. And even "empty" object frames exist in a "well formed" state.

But there are all the added complications of capability preparation, background ageing, locking, and pageout that a conventional allocator never deals with. It's closely enough adjacent to typed heaps that we get a lot of the benefits, but it's not really the same thing.
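A toy model of that typed-heap-with-free-list shape, with all names invented and none of the preparation, ageing, locking, or pageout machinery:

```rust
// Frames are created once and never handed back to a general allocator;
// an "empty" frame is still a well-formed T.
struct TypedHeap<T> {
    frames: Vec<T>,   // stands in for the statically sized frame array
    free: Vec<usize>, // free list of frame indices
}

impl<T: Default> TypedHeap<T> {
    fn new(n: usize) -> Self {
        TypedHeap {
            frames: (0..n).map(|_| T::default()).collect(),
            free: (0..n).rev().collect(),
        }
    }
    // "Allocation" just pops a frame index; the frame already exists.
    fn alloc(&mut self) -> Option<usize> {
        self.free.pop()
    }
    // "Free" resets the frame to its well-formed empty state.
    fn release(&mut self, i: usize) {
        self.frames[i] = T::default();
        self.free.push(i);
    }
}
```

Pointer type and target type stay in sync by construction, which is the typed-heap benefit described above.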


Jonathan

Jonathan S. Shapiro

Feb 24, 2026, 4:56:22 PM
to cap-...@googlegroups.com
On Fri, Feb 20, 2026 at 8:46 AM Kevin Reid <kpr...@switchb.org> wrote:
In Rust, creating an &mut T that points to the same T as a &T is prohibited by safe Rust and undefined behavior in unsafe Rust (with one possible exception around UnsafeCell which is not yet settled). The privilege escalation you describe does not exist.

Thanks for correcting me on this.

I'm a ways from getting to it, but the big concern I have is that a lot of the prepare() code makes mutations that are not logical changes. Examples include swizzling and pinning. The trick Rust uses where a lock returns a wrapper object that delegates transparently to the real object and therefore can expose the same API is a really nice pattern, and I was thinking about doing something similar for BitC at one point.

But the idea that the lock is released as a consequence of the guard object going out of scope may be a problem. We work like crazy to avoid unwinding the stack because it is mostly cold cache, so those unwinds would depend on somehow knowing when to release when you reach the "bottom" of the call stack and drop down to assembly code. But the cost of keeping a table of guards that need unwinding is really high - that was our original implementation for PrepLock release in EROS before it occurred to me to just bump the transaction counter.

I suppose one could hack up the Rust runtime to use a modified mutex structure that works the way the Coyotos mutex works. I just haven't gotten deep enough into the bowels of Rust to understand how many parts of the compiler stack think they know something about the internals of a mutex. It's one of those things that's going to either be really easy or really hard.

I also worry about the various "cache" updates. Like pages get put into the TransMap so we can write them, but the TransMap address isn't straightforwardly associated with the ObjectHeader that we pinned.

One step at a time, I guess. At least in Coyotos we don't have a separate domain/process cache and we don't have all of the awkward dances around what flavor a node is at any given moment.


Jonathan

Kevin Reid

Feb 24, 2026, 6:40:25 PM
to cap-...@googlegroups.com
On Tue, Feb 24, 2026 at 1:56 PM Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
But the idea that the lock is released as a consequence of the guard object going out of scope may be a problem. We work like crazy to avoid unwinding the stack because it is mostly cold cache, so those unwinds would depend on somehow knowing when to release when you reach the "bottom" of the call stack and drop down to assembly code. But the cost of keeping a table of guards that need unwinding is really high - that was our original implementation for PrepLock release in EROS before it occurred to me to just bump the transaction counter.

I suppose one could hack up the Rust runtime to use a modified mutex structure that works the way the Coyotos mutex works. I just haven't gotten deep enough into the bowels of Rust to understand how many parts of the compiler stack think they know something about the internals of a mutex. It's one of those things that's going to either be really easy or really hard.

There is no “Rust runtime” except in the sense of the code that manages specific things like the global allocator and panicking. std::sync::Mutex is not integrated with the compiler at all. You can freely use any mutex implementation you want and Rust-the-language does not mind. If you want to explore uses of mutexes that don't follow the exact std::sync::MutexGuard pattern, I recommend checking out lock_api. Its main goal is to provide a similar Mutex<T> type built on pluggable implementations, but you can also directly use any of the many third-party implementations of its RawMutex types yourself in order to lock and unlock without using RAII-style guards.

The basic primitive behind either Mutex, and anything else which provides mutation behind &, is UnsafeCell (which is special to the compiler); a Mutex<T> is a pair of a raw mutex and an UnsafeCell<T> containing the data, with an API that always checks the raw mutex before granting access to the UnsafeCell. When you use UnsafeCell, you can guard access to the contents of the UnsafeCell using whatever synchronization algorithm you like, including none at all if you have some kind of access pattern that is statically checked in a way the borrow checker can’t prove (for example, Cell just contains an UnsafeCell, and is safe solely via what methods and traits it doesn't have).
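A minimal single-threaded sketch of that raw-mutex-plus-UnsafeCell pairing, with explicit unlock calls instead of a Drop-based guard. RawLocked is a made-up name, and the spin flag stands in for whatever a real kernel lock would record (owning CPU, syscall, etc.):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

struct RawLocked<T> {
    locked: AtomicBool,
    data: UnsafeCell<T>,
}

// The raw flag, not the borrow checker, is what makes this Sync.
unsafe impl<T: Send> Sync for RawLocked<T> {}

impl<T> RawLocked<T> {
    fn new(v: T) -> Self {
        RawLocked { locked: AtomicBool::new(false), data: UnsafeCell::new(v) }
    }
    fn lock(&self) {
        while self.locked.swap(true, Ordering::Acquire) {
            std::hint::spin_loop();
        }
    }
    // Release is an explicit call, not a guard going out of scope.
    fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
    /// SAFETY: caller must hold the lock for the duration of the access.
    unsafe fn get(&self) -> &mut T {
        &mut *self.data.get()
    }
}
```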

Jonathan S. Shapiro

Feb 28, 2026, 5:34:54 PM
to cap-...@googlegroups.com
On Thu, Feb 19, 2026 at 6:26 PM William ML Leslie <william.l...@gmail.com> wrote:
On Fri, 20 Feb 2026 at 09:03, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Unfortunately, the ObjectHeader references can't be turned into cursors. 

Is this still true on 64-bit systems?  I get that it can be a pain in the neck to embed the assumption that the object tables are contiguous into the system, especially with hot-swappable RAM and CXL.

I think so, because a 64 bit reference won't fit in the current capability structure and I'm not thrilled about the disk impact of increasing the size of a capability. The current 64-bit OID values get us to plenty of pages, but the limit of a 32-bit swizzled capability reference to those pages is more restrictive. Not for pages, because we only map those transiently, but for other objects.

If you actually manage to use a large physical memory, I'm guessing you'll end up at something like 8 pages per GPT. If so, then 128GB of physical memory needs (128G/16/4096) * 300 bytes per GPT ~= 630MB for GPT space within the kernel virtual space. So by using 32-bit offsets from KVA we're probably still OK to 256GB of physical ram. GPT virtual space is contiguous, so we can stretch that to 32TB physical ram using an index into that array in the prepared capability. But past that we would need to extend kernel virtual space beyond 4GB or come up with a way to map GPTs transiently. Transient GPTs seem tricky because of the mutual dependency tracking between GPTs and page tables.

Part of the answer is that I need to implement large pages, which we need in any case.

The root problem is the byte ratio of GPTs to pages mapped, which is currently at 300 : 65536. And I haven't considered the depend table space yet. So it would be really nice to shrink the capability size rather than grow it, but I'm not seeing a path to get four bytes out of the capability format right now. Oh. Hmm. Maybe I do, but it would need a complete re-think of how OIDs work on the store, and also a re-think of the OTEntry trick.

One thing is certain: I *definitely* don't want to be doing anything like mark/sweep over 4G of kernel virtual space. Norm and I had a sketch for how to do it, but...

It's been a concern from the beginning that KeyKOS required (proportionally) a lot more metadata bytes per page than UNIX does.

Anyway, not a problem for today. Assuming I could run Coyotos on it, I'd be okay on my personal laptop for another 256x increase in DRAM. As long as OpenAI and Anthropic continue to exist, nobody's going to be able to buy DRAM, and if they did they couldn't pay for the power to run it anyway. :-)


Jonathan

Jonathan S. Shapiro

Feb 28, 2026, 5:59:16 PM
to cap-...@googlegroups.com
On Thu, Feb 19, 2026 at 11:28 PM William ML Leslie <william.l...@gmail.com> wrote:
I should measure this, but in the spirit of "the processor is hurtling head-long into its next cache miss" I think we have plenty of time waiting for dependent pointer lookups.

I think you may be tacitly assuming we have an OoO multi-issue machine? Some embedded stuff doesn't.

The problem is that for page space those dependent pointer lookups also produce cache misses, because it becomes necessary to look up the ObjectHeader vector base address associated with that region. Oh. Duh. No it doesn't. It's like the joke goes: "Yes, it is obvious." The fact that the physical page regions aren't contiguous doesn't mean that the corresponding ObjectHeaders can't live in a contiguous vector in kernel virtual space. I have to go look, but I think you could even pre-compute that size with a small change to the current code. There's a better way to do this, though.


Just so it's captured: rather than allocating the object vectors sequentially in contiguous KVA space, assume at compile time that you will reserve enough KVA space for the largest useful vector of each type, which is determined by the max amount of physical memory the kernel is able to support given the various data structure sizes and their ratios of occurrence rather than the amount of physical memory present on that machine. Only map backing physical pages behind the entries you actually use.

This turns all of the cursor conversions into scaled index + constant calculations, eliminating the base pointer lookups. Unfortunately the size isn't usually a power of two, so 3-5 cycles, but that probably beats a likely cache miss.
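In code, the conversion that survives is just the following (a sketch; the base address and the 300-byte entry size are illustrative numbers, not the real layout):

```rust
// With each per-type vector at a compile-time-fixed KVA base, turning a
// cursor (index) into an address is one multiply and one add: no load
// of a per-region base pointer, hence no extra cache miss.

const GPT_VEC_BASE: usize = 0xC000_0000; // hypothetical reserved KVA base
const GPT_BYTES: usize = 300;            // per-entry size; not a power of two

fn gpt_addr(index: usize) -> usize {
    GPT_VEC_BASE + index * GPT_BYTES // scaled index + constant
}
```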


One joy of the KeyKOS family is that there's almost always a profoundly disgusting way to optimize out a problem you have not yet established that you actually have. :-)



Jonathan

Jonathan S. Shapiro

Feb 28, 2026, 6:33:49 PM
to cap-...@googlegroups.com
On Tue, Feb 24, 2026 at 3:40 PM Kevin Reid <kpr...@switchb.org> wrote:

There is no “Rust runtime” except in the sense of the code that manages specific things like the global allocator and panicking. std::sync::Mutex is not integrated with the compiler at all. You can freely use any mutex implementation you want and Rust-the-language does not mind. If you want to explore uses of mutexes that don't follow the exact std::sync::MutexGuard pattern, I recommend checking out lock_api. Its main goal is to provide a similar Mutex<T> type built on pluggable implementations, but you can also directly use any of the many third-party implementations of its RawMutex types yourself in order to lock and unlock without using RAII-style guards.

The basic primitive behind either Mutex, and anything else which provides mutation behind &, is UnsafeCell (which is special to the compiler); a Mutex<T> is a pair of a raw mutex and an UnsafeCell<T> containing the data, with an API that always checks the raw mutex before granting access to the UnsafeCell.

I'd have to go look at some things, but I suspect we will want the mutex to be a field of the object it guards, and not at the front of it. Which would seem to put the mutex inside the UnsafeCell containing the object that it guards. Given that mutex accesses are atomic, I imagine we can make something unsavory like that work, but it seems like a "hold your nose" sort of design pattern in the eyes of Rustaceans :-)

From a lock dominance point of view, the current ObjectHeader probably has several interior UnsafeCells in it. From a locking perspective, the most interesting one is probably the object hash table links. On that one, we take a lock on the bucket. Holding the bucket lock guarantees that you have exclusive dominion over all of the link pointers in the bucket, but the memory words containing the link pointers are interior within their objects.

Jonathan

Mark S. Miller

unread,
Feb 28, 2026, 8:04:28 PM (13 days ago) Feb 28
to cap-...@googlegroups.com
"OoO" ?

--
You received this message because you are subscribed to the Google Groups "cap-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cap-talk+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/cap-talk/CAAP%3D3QPMkB1WY_PT-eft__pkhVvVBiW%3DmAEsAQL%2BwieyVLR7Ew%40mail.gmail.com.


--
  Cheers,
  --MarkM

Jonathan S. Shapiro

unread,
Feb 28, 2026, 8:51:29 PM (13 days ago) Feb 28
to cap-...@googlegroups.com
OoO = "out of order". These days usually implemented with a register rename microarchitecture that issues two or more instructions per cycle. Lets the machine queue up instructions like crazy, executing each of them dataflow style as soon as the inputs for that particular instruction become available, and roll everything back to the right place if it discovers that the naive in-order instruction would have thrown an exception or something else that involves backing out of a wrong turn.

IMO it's an exceptionally elegant way to approach running a semantically sequential computation in a parallel way. On the first HaL machine we saw pending instruction queues that were 40 instructions deep per functional unit. Which means the machine overall was successfully speculating hundreds of instructions ahead.

Are you familiar with how register rename works? I imagine you know about static single assignment. Register rename implements dynamic single assignment.
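The "dynamic single assignment" idea can be illustrated with a toy rename table (a sketch, nothing like real rename hardware): every write to an architectural register allocates a fresh physical register, so each physical register is assigned exactly once and every later read sees exactly the version it depends on.

```rust
use std::collections::HashMap;

// Toy register-rename table: architectural name -> current physical
// register. All names and structure here are illustrative only.
struct RenameTable {
    map: HashMap<&'static str, usize>,
    next_phys: usize, // next free physical register
}

impl RenameTable {
    fn new() -> Self {
        RenameTable { map: HashMap::new(), next_phys: 0 }
    }

    // A write gets a brand-new physical register: single assignment,
    // performed dynamically as instructions are decoded.
    fn write(&mut self, arch: &'static str) -> usize {
        let p = self.next_phys;
        self.next_phys += 1;
        self.map.insert(arch, p);
        p
    }

    // A read sees whichever physical register last renamed this name,
    // so it depends on exactly one producer instruction.
    fn read(&self, arch: &'static str) -> usize {
        self.map[arch]
    }
}
```

Two successive writes to the same architectural register land in two different physical registers, which is what lets instructions between them execute dataflow-style without false dependencies.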


Jonathan

Kevin Reid

unread,
Feb 28, 2026, 11:39:26 PM (13 days ago) Feb 28
to cap-...@googlegroups.com
On Sat, Feb 28, 2026 at 3:33 PM Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
On Tue, Feb 24, 2026 at 3:40 PM Kevin Reid <kpr...@switchb.org> wrote:
There is no “Rust runtime” except in the sense of the code that manages specific things like the global allocator and panicking. std::sync::Mutex is not integrated with the compiler at all. You can freely use any mutex implementation you want and Rust-the-language does not mind. If you want to explore uses of mutexes that don't follow the exact std::sync::MutexGuard pattern, I recommend checking out lock_api. Its main goal is to provide a similar Mutex<T> type built on pluggable implementations, but you can also directly use any of the many third-party implementations of its RawMutex types yourself in order to lock and unlock without using RAII-style guards.

The basic primitive behind either Mutex, and anything else which provides mutation behind &, is UnsafeCell (which is special to the compiler); a Mutex<T> is a pair of a raw mutex and an UnsafeCell<T> containing the data, with an API that always checks the raw mutex before granting access to the UnsafeCell.

I'd have to go look at some things, but I suspect we will want the mutex to be a field of the object it guards, also not at the front of it. Which would seem to put the mutex inside the UnsafeCell containing the object that it guards. Given that mutex accesses are atomic, I imagine we can make something unsavory like that work, but it seems like a "hold your nose" sort of design pattern in the eyes of Rustaceans :-)

Not at all! You call UnsafeCell::get(), thereby obtaining a raw pointer, and then it is totally up to you to synchronize accesses to any particular part of the contents however you like, including atomically accessing some other part. As long as you don’t create an &mut or & to some part of the contents that you’re also mutating from elsewhere, you’re good.
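A hypothetical sketch of this pattern, with invented names: the lock word is an interior field of the object it guards, the whole object lives in one UnsafeCell, and all access goes through the raw pointer from UnsafeCell::get(), never through an &mut to the whole object.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

struct Inner {
    count: u64,        // plain data, guarded by `lock`
    lock: AtomicBool,  // the embedded lock word, deliberately not at the front
}

struct Guarded(UnsafeCell<Inner>);

// Sound only because bump() never forms an &mut Inner and all access
// to `count` is serialized by `lock`.
unsafe impl Sync for Guarded {}

impl Guarded {
    fn bump(&self) -> u64 {
        let p = self.0.get(); // raw pointer into the cell; no &mut created
        // A shared reference to just the atomic field is fine: atomics
        // have interior mutability, so it never aliases an &mut.
        let lock = unsafe { &(*p).lock };
        while lock
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
        // Holding the lock, touch the plain field through the raw
        // pointer, still without forming an &mut Inner.
        let v = unsafe {
            (*p).count += 1;
            (*p).count
        };
        lock.store(false, Ordering::Release);
        v
    }
}
```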

William ML Leslie

unread,
Mar 1, 2026, 12:22:30 AM (13 days ago) Mar 1
to cap-...@googlegroups.com
On Sun, 1 Mar 2026 at 08:34, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
On Thu, Feb 19, 2026 at 6:26 PM William ML Leslie <william.l...@gmail.com> wrote:
On Fri, 20 Feb 2026 at 09:03, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Unfortunately, the ObjectHeader references can't be turned into cursors. 

Is this still true on 64-bit systems?  I get that it can be a pain in the neck to embed the assumption that the object tables are contiguous into the system, especially with hot-swappable RAM and CXL.

I think so, because a 64-bit reference won't fit in the current capability structure and I'm not thrilled about the disk impact of increasing the size of a capability.

Oh dear.  We talk about everything, Shap, I am horrified that we haven't discussed this yet.  It's not like 64-bit was a faint concept distantly forming on the horizon when I joined the list in 2007.  In the capability, we've got a 32-bit pointer to an OTE and a 32-bit pointer to the object, in addition to the 32-bit protected payload, the version number, and the object type.
 
The current 64-bit OID values get us to plenty of pages, but the limit of a 32-bit swizzled capability reference to those pages is more restrictive. Not for pages, because we only map those transiently, but for other objects.

The 32-bit reference is to the page object, which is an object header plus a physical address, and not to the page contents.  Mapping the object headers transiently would be tricky, but it could be done.
 
If you actually manage to use a large physical memory, I'm guessing you'll end up at something like 8 pages per GPT. If so, then 128GB of physical memory needs (128G/16/4096) * 300 bytes per GPT ~= 630MB for GPT space within the kernel virtual space. So by using 32-bit offsets from KVA we're probably still OK to 256GB of physical ram. GPT virtual space is contiguous, so we can stretch that to 32TB physical ram using an index into that array in the prepared capability. But past that we would need to extend kernel virtual space beyond 4GB or come up with a way to map GPTs transiently.

+1 for more kernel virtual space on 64-bit systems.

FWIW there is 256MiB of kernel virtual space on 32-bit.  The rest of the 4GiB is the current process.
 
Transient GPTs seem tricky because of the mutual dependency tracking between GPTs and page tables.

Part of the answer is that I need to implement large pages, which we need in any case.

Yes.  I have some hacks, but they don't work with the transmap or when huge pages get aged out.  Thinking I'll do different oid ranges for regions of different sizes, like 6.2 did for device ranges.
 
The root problem is the byte ratio of GPTs to pages mapped, which is currently at 300 : 65536. And I haven't considered the depend table space yet. So it would be really nice to shrink the capability size rather than grow it, but I'm not seeing a path to get four bytes out of the capability format right now. Oh. Hmm. Maybe I do, but it would need a complete re-think of how OIDs work on the store, and also a re-think of the OTEntry trick.

Asymptotic 0.5% is not so scary, but we also need to account for the RevMap and Depends infrastructure, both not persistent, and on old crusty CPU architectures you also need the hardware mapping structures.  All I can hope is that not all GPTs need to be resident once we have the hardware Mappings built.
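For reference, the GPT-space estimate quoted above can be re-derived under its stated assumptions: 4 KiB pages, 16 pages mapped per GPT, and ~300 bytes of kernel metadata per GPT.

```rust
// Back-of-envelope GPT metadata space for a given physical RAM size.
// The constants mirror the assumptions in the estimate above.
fn gpt_space_bytes(phys_ram_bytes: u64) -> u64 {
    const PAGE_BYTES: u64 = 4096;
    const PAGES_PER_GPT: u64 = 16;
    const BYTES_PER_GPT: u64 = 300;
    (phys_ram_bytes / (PAGES_PER_GPT * PAGE_BYTES)) * BYTES_PER_GPT
}
```

For 128 GiB of physical RAM this comes to 629,145,600 bytes, i.e. the "~630MB" figure in the quoted paragraph.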

Jonathan S. Shapiro

unread,
Mar 1, 2026, 1:23:54 PM (12 days ago) Mar 1
to cap-...@googlegroups.com
On Sat, Feb 28, 2026 at 9:22 PM William ML Leslie <william.l...@gmail.com> wrote:
On Sun, 1 Mar 2026 at 08:34, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
 
If you actually manage to use a large physical memory, I'm guessing you'll end up at something like 8 pages per GPT. If so, then 128GB of physical memory needs (128G/16/4096) * 300 bytes per GPT ~= 630MB for GPT space within the kernel virtual space. So by using 32-bit offsets from KVA we're probably still OK to 256GB of physical ram. GPT virtual space is contiguous, so we can stretch that to 32TB physical ram using an index into that array in the prepared capability. But past that we would need to extend kernel virtual space beyond 4GB or come up with a way to map GPTs transiently.

+1 for more kernel virtual space on 64-bit systems.

At current DRAM provisioning levels (e.g. my laptop has 12 GB), we seem to be within sight of exhausting a 4GB kernel virtual region. The real problem is that this is true because we need that much physical RAM for metadata. So I think we need to think about how to thin the current data structures. The corresponding kernel space overheads in Linux aren't free, but the Coyotos overheads are quite a bit higher.

Though here's something amusing. That 128GB of physical RAM only needs 1GB of DRAM for tag bits at 16 byte intervals. Much less if the OS cooperates, because most pages don't hold mixed data and capabilities. Current FPGAs have that much.
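The tag-bit figure checks out the same way: one tag bit per 16-byte granule over the whole of physical DRAM.

```rust
// Bytes of tag storage needed for a given amount of tagged DRAM,
// at one tag bit per 16-byte granule.
fn tag_storage_bytes(dram_bytes: u64) -> u64 {
    let granules = dram_bytes / 16; // one tag bit per granule
    granules / 8                    // bits -> bytes
}
```

For 128 GiB of tagged DRAM this is exactly 1 GiB of tag storage.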

FWIW there is 256MiB of kernel virtual space on 32-bit.  The rest of the 4GiB is the current process.

The reason I held to 256MiB was that this was compatible with then-existing application-level address space assumptions for Linux applications. I figured binary compatibility might be helpful if we ever got around to building a Linux subsystem. On 64-bit systems that same argument gives us the upper half. If we manage to arrive at a memory overhead rate of 1:1, I'll be perversely impressed.
 
 Yes.  I have some hacks, but they don't work with the transmap or when huge pages get aged out.  Thinking I'll do different oid ranges for regions of different sizes, like 6.2 did for device ranges.

OID ranges are already scoped by object type, which helps. The concern with more region sizes is that the more of them you add the more complicated they get to manage, and complexity in the space bank or the persistence subsystem is not your friend.

I've actually thought a bit about large pages, wondering if they require a new disk-level object. There are existing kernels that synthesize them opportunistically by shuffling 4K pages around to where you can build a large page TLB entry for them. Maybe before we think about larger regions we should work through the problem that's already in front of us.
 
The root problem is the byte ratio of GPTs to pages mapped, which is currently at 300 : 65536. And I haven't considered the depend table space yet. So it would be really nice to shrink the capability size rather than grow it, but I'm not seeing a path to get four bytes out of the capability format right now. Oh. Hmm. Maybe I do, but it would need a complete re-think of how OIDs work on the store, and also a re-think of the OTEntry trick.

Asymptotic 0.5% is not so scary, but we also need to account for the RevMap and Depends infrastructure, both not persistent, and on old crusty CPU architectures you also need the hardware mapping structures.  All I can hope is that not all GPTs need to be resident once we have the hardware Mappings built.

New crusty CPU architectures also have hardware mapping tables. :-)

If you take the GPTs out of residence you no longer know what's in them, so I think you have to whack the caches that depend on their state. Though if you do take them out, you pretty well know they aren't changing, so maybe that's worth re-thinking. Strictly speaking it's not the residence of the GPT that invalidates the cached information, but the modification of the GPT that does so.

Nice call. We seem to be accumulating some interesting architectural evolution questions here.


Jonathan

Mark S. Miller

unread,
Mar 1, 2026, 2:59:24 PM (12 days ago) Mar 1
to cap-...@googlegroups.com
Very nice explanation, thanks!

The c-list tables in the captp family of protocols (E's captp, Midori, Cap'n Proto, ocapn) as used for promise pipelining are doing dynamic single assignment. Not to do speculation but to compensate for the huge latency barrier between remote machines. But FIFO rather than OoO.





--
  Cheers,
  --MarkM

William ML Leslie

unread,
Mar 1, 2026, 5:43:39 PM (12 days ago) Mar 1
to cap-...@googlegroups.com
On Mon, 2 Mar 2026 at 04:23, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
At current DRAM provisioning levels (e.g. my laptop has 12 GB), we seem to be within sight of exhausting a 4GB kernel virtual region. The real problem is that this is true because we need that much physical RAM for metadata. So I think we need to think about how to thin the current data structures. The corresponding kernel space overheads in Linux aren't free, but the Coyotos overheads are quite a bit higher.

Oh?  Every now and then I worry about the size of our object tables, but I usually find I've taken a number of bits and treated it like a number of bytes at some point.  I always find that Norm, you, Adams and others made the right choices.  I think there might be better choices for hash-table structure but most of the metadata size I worry about is pathological cases outside of the object tables, especially parts of address spaces getting shared between processes on CPUs with a lot of threads.
 
The reason I held to 256MiB was that this was compatible with then-existing application-level address space assumptions for Linux applications. I figured binary compatibility might be helpful if we ever got around to building a Linux subsystem. On 64-bit systems that same argument gives us the upper half. If we manage to arrive at a memory overhead rate of 1:1, I'll be perversely impressed.

It's convenient that lookup of the invoker's soft registers including all the SrcCap is very easy because they are just sitting there mapped into the kernel's address space.  It's also key to how the sender's indirect string is handled - you only need the transmap when receiving.  I'd be happy to give this up, of course, but I remember being surprised by it - initially a bit horrified thinking about Meltdown, but then appreciating how effective your implementation was.
 
New crusty CPU architectures also have hardware mapping tables. :-)

RISC-V: Boring enough to succeed (TM :-)

If you take the GPTs out of residence you no longer know what's in them, so I think you have to whack the caches that depend on their state. Though if you do take them out, you pretty well know they aren't changing, so maybe that's worth re-thinking. Strictly speaking it's not the residence of the GPT that invalidates the cached information, but the modification of the GPT that does so.

Right now if you remove a reference to a GPT from another, we walk the entire affected tree looking for memory handlers and wake up everything waiting on them.  I don't know that this is strictly necessary, but the bigger problem with that is its time complexity, so I think it was always something we wanted to address.

> Nice call. We seem to be accumulating some interesting architectural evolution questions here.

The reference implementation is something you built with zero help and a looming deadline, and yet you can build real systems on it.  I always assumed there were corners we were going to want to revisit.

--

Jonathan S. Shapiro

unread,
Mar 2, 2026, 1:54:50 AM (11 days ago) Mar 2
to cap-...@googlegroups.com
On Sun, Mar 1, 2026 at 2:43 PM William ML Leslie <william.l...@gmail.com> wrote:
On Mon, 2 Mar 2026 at 04:23, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
 
New crusty CPU architectures also have hardware mapping tables. :-)

RISC-V: Boring enough to succeed (TM :-)

If I bought him a glass of wine first, Dave Patterson would get a healthy laugh out of that.
 
If you take the GPTs out of residence you no longer know what's in them, so I think you have to whack the caches that depend on their state. Though if you do take them out, you pretty well know they aren't changing, so maybe that's worth re-thinking. Strictly speaking it's not the residence of the GPT that invalidates the cached information, but the modification of the GPT that does so.

Right now if you remove a reference to a GPT from another, we walk the entire affected tree looking for memory handlers and wake up everything waiting on them.  I don't know that this is strictly necessary, but the bigger problem with that is its time complexity, so I think it was always something we wanted to address.

Hmm. Can you identify where that happens? I'd like to refresh myself on that code and figure out why it is doing that. There are odd correctness conditions all over the place, but that sounds like a bug.

> Nice call. We seem to be accumulating some interesting architectural evolution questions here.

The reference implementation is something you built with zero help and a looming deadline, and yet you can build real systems on it.  I always assumed there were corners we were going to want to revisit.

I actually had quite a lot of help from Jonathan Adams, who is superb.


Jonathan

William ML Leslie

unread,
Mar 2, 2026, 2:53:16 AM (11 days ago) Mar 2
to cap-...@googlegroups.com
On Mon, 2 Mar 2026 at 16:54, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
If you take the GPTs out of residence you no longer know what's in them, so I think you have to whack the caches that depend on their state. Though if you do take them out, you pretty well know they aren't changing, so maybe that's worth re-thinking. Strictly speaking it's not the residence of the GPT that invalidates the cached information, but the modification of the GPT that does so.

Right now if you remove a reference to a GPT from another, we walk the entire affected tree looking for memory handlers and wake up everything waiting on them.  I don't know that this is strictly necessary, but the bigger problem with that is its time complexity, so I think it was always something we wanted to address.

Hmm. Can you identify where that happens? I'd like to refresh myself on that code and figure out why it is doing that. There are odd correctness conditions all over the place, but that sounds like a bug.

Looking back now, I think I was conflating two code paths.  The one from GPT.setSlot goes via depend_invalidate_slot, which calls depend_entry_invalidate on anything that depends on it.  But, that's just clearing entries in the page table.  The other is from obhdr_invalidate() -> memhdr_invalidate_cached_state() -> depend_invalidate() on each slot -> depend_entry_invalidate() -> pte_invalidate().

So we _only_ wake up anything sending to a memory handler if the GPT handler slot specifically is cleared.  cap_handlerBeingOverwritten is called in two different places but neither of their callers visit any more GPTs.
 

> Nice call. We seem to be accumulating some interesting architectural evolution questions here.

The reference implementation is something you built with zero help and a looming deadline, and yet you can build real systems on it.  I always assumed there were corners we were going to want to revisit.

I actually had quite a lot of help from Jonathan Adams, who is superb.

Oh, cool.  I knew they did a lot of work on the spacebank and constructor but clearly I hadn't gone back far enough in the git/hg history.

Jonathan S. Shapiro

unread,
Mar 2, 2026, 7:51:08 AM (11 days ago) Mar 2
to cap-...@googlegroups.com
On Sun, Mar 1, 2026 at 11:53 PM William ML Leslie <william.l...@gmail.com> wrote:
Looking back now, I think I was conflating two code paths.  The one from GPT.setSlot goes via depend_invalidate_slot, which calls depend_entry_invalidate on anything that depends on it.  But, that's just clearing entries in the page table.  The other is from obhdr_invalidate() -> memhdr_invalidate_cached_state() -> depend_invalidate() on each slot -> depend_entry_invalidate() -> pte_invalidate().

So we _only_ wake up anything sending to a memory handler if the GPT handler slot specifically is cleared.  cap_handlerBeingOverwritten is called in two different places but neither of their callers visit any more GPTs.

That makes a bit more sense. If we have somebody blocked trying to get to a handler, and a handler is changed, then the handler in charge of their problem may have changed.

The memory walk is, without question, my least favorite part of this kernel. And yet, in spite of the fact that this version is concurrency-safe, it's actually simpler than the node-based walker!


Jonathan