64bit Coyotos might be a SASOS


Jonathan S. Shapiro

unread,
Mar 2, 2026, 9:34:51 PM (10 days ago) Mar 2
to cap-talk
Hopefully this is the end of the sequence of Coyotos messages for today. What's really been going on is that I've been slowly wrapping my head around the implications of a 64-bit address space and just how big that is, and how many of the tricks used in the 32-bit kernel are either (a) not relevant, or (b) not possible.

In particular, anything that relies on background GC or a background cleanup pass over a 64-bit address space just isn't going to work. We can do things like that for the pages that become dirty, but not by using a linear scan through memory. Which means that the OTEntry idea for "depreparing" capabilities doesn't work at this scale.

On the other hand, we have a 2^63 byte kernel virtual address space to play with. Which means that we can assign a permanent kernel virtual address for every allocation of every page-sized "frame" on the store. Please note that I am not yet convinced this is a good idea - just pondering.

And here I had thought that migration to 64-bit was going to be simple.


Jonathan

William ML Leslie

unread,
Mar 3, 2026, 1:59:56 AM (10 days ago) Mar 3
to cap-...@googlegroups.com
On Tue, 3 Mar 2026 at 12:34, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
Hopefully this is the end of the sequence of Coyotos messages for today. What's really been going on is that I've been slowly wrapping my head around the implications of a 64-bit address space and just how big that is, and how many of the tricks used in the 32-bit kernel are either (a) not relevant, or (b) not possible.

In particular, anything that relies on background GC or a background cleanup pass over a 64-bit address space just isn't going to work. We can do things like that for the pages that become dirty, but not by using a linear scan through memory. Which means that the OTEntry idea for "depreparing" capabilities doesn't work at this scale.

There might be some tricks we can use to get there.  An algorithm for building ... precliques? comes to mind.  We have shared weak address spaces, processes that add mutable storage on top of those, and endpoints that connect processes together.  Different processes that map the same page writable can be assigned the same clique via union-find.  Buffers for communication, à la the Kry10 "data diode", need some way to be identified.  The thinking is that most objects will have exactly one referrer, and it might be worthwhile tracking that.  We can maintain an object's preclique, or a relationship between precliques, when we assign into a GPT or process slot.
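The union-find idea can be sketched quickly (a toy only, with all names invented; this is not Coyotos code):

```python
# Toy sketch: union-find over objects, merging the "precliques" of
# processes that map the same page writable.
class Preclique:
    def __init__(self):
        self.parent = self

def find(p):
    # Path compression keeps later lookups cheap.
    while p.parent is not p:
        p.parent = p.parent.parent
        p = p.parent
    return p

def union(a, b):
    ra, rb = find(a), find(b)
    if ra is not rb:
        rb.parent = ra
    return ra

# Two processes mapping the same frame writable end up in one clique.
p1, p2, frame = Preclique(), Preclique(), Preclique()
union(p1, frame)
union(p2, frame)
assert find(p1) is find(p2)
```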

What got me thinking on those terms is that we have similar opportunities with Mappings.  I suspect that most Mappings could be bound for their lifetime to one CPU, which could be free to repurpose the most empty ones.  Other Mappings are explicitly marked as in-use by multiple CPUs.  Similarly, many GPTs are only going to be in one mapping, and other GPTs are going to be mapped all over the place.  Confidently knowing that you do or don't need to issue an IPI for a given address space change sounds useful.  I haven't got any plans around locking order, just a vague idea that I should explore it.
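As a toy model of that bookkeeping (all names invented), the single-CPU versus multi-CPU distinction and the IPI decision might look like:

```python
# Hypothetical bookkeeping: track which CPUs a Mapping is live on, so an
# address-space change only IPIs CPUs that actually need the shootdown.
MULTI_CPU = object()  # sentinel: explicitly marked in-use by many CPUs

class Mapping:
    def __init__(self):
        self.owner = None  # None, a cpu id, or MULTI_CPU

    def note_use(self, cpu):
        if self.owner is None:
            self.owner = cpu          # bound to its first CPU
        elif self.owner not in (cpu, MULTI_CPU):
            self.owner = MULTI_CPU    # second CPU seen: promote

def cpus_to_shootdown(mapping, all_cpus, current_cpu):
    """Which CPUs need an IPI when this mapping changes?"""
    if mapping.owner is None:
        return set()                          # never used: no IPI at all
    if mapping.owner is MULTI_CPU:
        return set(all_cpus) - {current_cpu}  # conservative broadcast
    return {mapping.owner} - {current_cpu}    # single known user

m = Mapping()
m.note_use(0)
assert cpus_to_shootdown(m, {0, 1, 2, 3}, 0) == set()
m.note_use(2)
assert cpus_to_shootdown(m, {0, 1, 2, 3}, 0) == {1, 2, 3}
```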

I think this matches with how most existing software works today, even if it's not ideal.  Most systems have exactly one process running full-pelt across all of their cores, and everything else belongs on a single hardware thread.
 
On the other hand, we have a 2^63 byte kernel virtual address space to play with. Which means that we can assign a permanent kernel virtual address for every allocation of every page-sized "frame" on the store. Please note that I am not yet convinced this is a good idea - just pondering.

Meltdown aside, I love that we might be able to completely remove mapping changes (via the transmap) from the IPC path.  In fact, we could nest both the entire sender and receiver mappings inside one and have indirect string copies apply to contiguous virtual addresses.
 
And here I had thought that migration to 64-bit was going to be simple.

¯\_(ツ)_/¯

--
William ML Leslie
A tool for making incorrect guesses and generating large volumes of plausible-looking nonsense.  Who is this very useful tool for?

William ML Leslie

unread,
Mar 3, 2026, 2:11:15 AM (10 days ago) Mar 3
to cap-...@googlegroups.com
On Tue, 3 Mar 2026 at 16:59, William ML Leslie <william.l...@gmail.com> wrote:
There might be some tricks we can use to get there.  An algorithm for building ... precliques? comes to mind.  We have shared weak address spaces, processes that add mutable storage on top of those, and endpoints that connect processes together.

Endpoints and cappages.  I don't think we know clearly yet just how extensively people will use cappages but I guess that's the challenging part.

Jonathan S. Shapiro

unread,
Mar 3, 2026, 1:04:39 PM (10 days ago) Mar 3
to cap-...@googlegroups.com
On Mon, Mar 2, 2026 at 11:11 PM William ML Leslie <william.l...@gmail.com> wrote:
On Tue, 3 Mar 2026 at 16:59, William ML Leslie <william.l...@gmail.com> wrote:
There might be some tricks we can use to get there.  An algorithm for building ... precliques? comes to mind.  We have shared weak address spaces, processes that add mutable storage on top of those, and endpoints that connect processes together.

Endpoints and cappages.  I don't think we know clearly yet just how extensively people will use cappages but I guess that's the challenging part.

In order to pass capabilities on the stack, we either need cappages or memory tag bits.

If we don't have endpoints, then the number of objects that can be served by a single process probably isn't large enough for a 64-bit system. The most critical problem will be the space bank.


Jonathan

Jonathan S. Shapiro

unread,
Mar 3, 2026, 2:29:20 PM (10 days ago) Mar 3
to cap-...@googlegroups.com
On Mon, Mar 2, 2026 at 10:59 PM William ML Leslie <william.l...@gmail.com> wrote:
What got me thinking on those terms is that we have similar opportunities with Mappings.  I suspect that most Mappings could be bound for their lifetime to one CPU.

That may be worth thinking about. The challenge will be that processes are dispatched to whatever CPU is available. On the other hand, a given process can only be on one CPU at a time, there is usually an address space switch associated with CPU migration (though not always). So it might help.
 
On the other hand, we have a 2^63 byte kernel virtual address space to play with. Which means that we can assign a permanent kernel virtual address for every allocation of every page-sized "frame" on the store. Please note that I am not yet convinced this is a good idea - just pondering.

Meltdown aside, I love that we might be able to completely remove mapping changes (via the transmap) from the IPC path.  In fact, we could nest both the entire sender and receiver mappings inside one and have indirect string copies apply to contiguous virtual addresses.

The transmap is a blessing, not a curse. It's actually the most efficient mapping mechanism we have. The problem with it is that it's too small. We would still need it in a 64-bit implementation for various reasons.


Jonathan 

digit...@gmail.com

unread,
Mar 3, 2026, 4:14:41 PM (10 days ago) Mar 3
to cap-talk
Reordering paragraphs for importance

On Monday, March 2, 2026 at 6:34:51 PM UTC-8 Jonathan S. Shapiro wrote:
[...]
On the other hand, we have a 2^63 byte kernel virtual address space to play with. Which means that we can assign a permanent kernel virtual address for every allocation of every page-sized "frame" on the store. Please note that I am not yet convinced this is a good idea - just pondering.

No, this is unsafe on modern high performance CPUs, absent specific hardware guarantees you know to be true.  I don't expect any CPU vendor to stand behind such guarantees except on in-order embedded style cores.

Why: speculative execution will trigger TLB loading along virtual addresses that were never touched, architecturally.  Anything valid in the page tables can become resident in the TLB.  And, anything in the TLB can produce a physical address that gets sent to DRAM and loaded into a cache.  By the time permission checks, consistency checks, and rollback logic conclude the access was not supposed to happen, the load is already in flight and can't be cancelled.  Assume speculation gadgets can exfiltrate all cache contents, absent reason to believe otherwise.

If you could show that there are no useful speculation gadgets in the privileged code, maybe that would be survivable.  New categories of them are found, still, a decade after spectre/meltdown, so that's not a bet I'd want to take without hardware co-design of the properties needed.

A self-consistent set of safety rules:
1) while executing in security context A, the address mappings have no live entries for any other security context's mappings.  Sane CPUs won't fetch random physical addresses into their caches, and if the data isn't in cache, speculative execution won't leak its contents.
2) ask your CPU vendor just how much cache you need to flush and what the dance is to preclude speculation when switching security contexts.
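Rule (1) can be modeled as a toy: a translation cache that flushes everything belonging to other security contexts on a switch (a pure simulation; the real flush dance is CPU-specific, per rule 2):

```python
# Toy model of rule (1): while running context A, the live translation
# state contains no entries belonging to any other security context.
class TLB:
    def __init__(self):
        self.entries = {}   # virtual addr -> (context, phys addr)
        self.context = None

    def switch_to(self, ctx):
        # The "dance": flush everything not owned by the new context.
        self.entries = {va: e for va, e in self.entries.items()
                        if e[0] == ctx}
        self.context = ctx

    def insert(self, va, pa):
        self.entries[va] = (self.context, pa)

tlb = TLB()
tlb.switch_to("A")
tlb.insert(0x1000, 0x9000)
tlb.switch_to("B")
# Nothing from context A survives the switch, so speculation while
# running B cannot walk A's translations.
assert 0x1000 not in tlb.entries
```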

Permanent global mapping violates 1.  You could fix the correspondence between host virtual address and OID, but leave only the currently valid ones in place.  Such predictable addresses would give a shortcut to would-be attackers who find a read or write gadget, but as long as they are not 

then, less importantly:

In particular, anything that relies on background GC or a background cleanup pass over a 64-bit address space just isn't going to work. We can do things like that for the pages that become dirty, but not by using a linear scan through memory. Which means that the OTEntry idea for "depreparing" capabilities doesn't work at this scale.
I think it will work, but am not sure it will be usable.

Full passes over gigantic out-of-core datasets for GC (or other reasons) work, in wide practice.  ZFS scrubs and storage devices' translation layers do so, on both flash and, heck, even spinning disk drives (see: shingled magnetic recording and the translation layers they use).

Large ZFS storage pools backed by slow throughput media (spinning disks) can take day(s) to complete those passes.  Napkin math suggests a 20TiB drive at 150-300MB/sec needs 40-70 hours for a full read+rewrite pass.  That's long enough to accumulate a _lot_ of newly invalid OIDs from objects live in RAM during such a pass.
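Redoing the napkin math (with binary TiB and counting both the read and the rewrite; the exact range shifts a bit depending on TB vs. TiB and whether the rewrite leg is included):

```python
# Full read+rewrite pass over a large slow drive: hours of wall time.
TIB = 2**40

def pass_hours(drive_tib, mb_per_s):
    """Hours for a full read+rewrite pass at the given throughput."""
    return 2 * drive_tib * TIB / (mb_per_s * 1e6) / 3600

print(round(pass_hours(20, 300)))  # best case, ~41 hours
print(round(pass_hours(20, 150)))  # worst case, ~81 hours
```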

That likely forces extra version number bits, bounding how much of slow storage is used for capability slots, and/or throttling/rate matching OID re-use to force convergence with GC passes.

You could probably mitigate the worst pathologies by polluting the space bank with locality concerns and "this OID is going to be slow the next time anyone uses it, make better choices" hints, but I'm still not sure it's a system I'd want to use vs. having truly large storage outside the persistent realm and more explicitly managing the loading and unloading of those large datasets.

Never used a transparently persistent system, so my intuitions here are uncalibrated.  The changes in order of magnitude since such systems were last used give me pause.  Did KeyKOS swap from tape (cite the 20-year-old Jim Gray disk-is-tape comparison and then extrapolate another 20 years)?

 

Mark S. Miller

unread,
Mar 3, 2026, 4:16:32 PM (10 days ago) Mar 3
to cap-...@googlegroups.com
"SASOS"?



--
  Cheers,
  --MarkM

Jonathan S. Shapiro

unread,
Mar 3, 2026, 4:53:14 PM (10 days ago) Mar 3
to cap-...@googlegroups.com
On Tue, Mar 3, 2026 at 1:16 PM Mark S. Miller <eri...@gmail.com> wrote:
"SASOS"?

On Tue, Mar 3, 2026 at 1:14 PM digit...@gmail.com <digit...@gmail.com> wrote:
Reordering paragraphs for importance

On Monday, March 2, 2026 at 6:34:51 PM UTC-8 Jonathan S. Shapiro wrote:
[...]
On the other hand, we have a 2^63 byte kernel virtual address space to play with. Which means that we can assign a permanent kernel virtual address for every allocation of every page-sized "frame" on the store. Please note that I am not yet convinced this is a good idea - just pondering.

No, this is unsafe on modern high performance CPUs, absent specific hardware guarantees you know to be true.  I don't expect any CPU vendor to stand behind such guarantees except on in-order embedded style cores.

Why: speculative execution will trigger TLB loading along virtual addresses that were never touched, architecturally.  Anything valid in the page tables can become resident in the TLB.  And, anything in the TLB can produce a physical address that gets sent to DRAM and loaded into a cache.  By the time permission checks, consistency checks, and rollback logic conclude the access was not supposed to happen, the load is already in flight and can't be cancelled.  Assume speculation gadgets can exfiltrate all cache contents, absent reason to believe otherwise.

I'm confident that you're right, but I'm a bit confused. When I wrote "assign a permanent kernel virtual address", I should have written "address region". That is: a range of kernel virtual addresses at which we would eventually map page-sized objects when they come in. So if there isn't an object, or it hasn't come in, there are no valid PTE entries for the corresponding page frames. Even when valid they would be kernel-only mappings. The corresponding user-mode addresses are in a completely different part of the address space.

I've been writing my way through that, and I'm no longer convinced that it is as helpful as it seemed.

But you seem to be saying that kernel-only mappings more generally need added cautions around speculation. What do I need to read to understand the issues?

 
Never used a transparently persistent system, so my intuitions here are uncalibrated.  The changes in order of magnitude since such systems were last used gives me pause.  Did KeyKOS swap from tape (cite the 20 year old Jim Gray disk-is-tape comparison and then extrapolate another 20 years)? 

I think GNOSIS at one point had a checkpoint-to-tape mechanism and used this for incremental live backup. I'm sure that could be used for system restore. I'm not aware that they ever read objects off of tape during normal execution.


Jonathan 

Jonathan S. Shapiro

unread,
Mar 3, 2026, 4:58:19 PM (10 days ago) Mar 3
to cap-...@googlegroups.com
On Tue, Mar 3, 2026 at 1:16 PM Mark S. Miller <eri...@gmail.com> wrote:
"SASOS"?

Single address space operating system. The idea is that things in memory get a single address, which may or may not be valid in a particular process, but a thing never appears at any other address.

SASOS isn't really what I was thinking about. What I was actually thinking was that a 63-bit kernel address space is big enough to "flat map" the store, after which OIDs and pointers are different encodings of essentially the same values and prepare/deprepare (or swizzle/unswizzle) is no longer motivated.

It may still be useful, but it doesn't help unless the objects are exactly a page or close enough to a page that you're willing to burn a physical RAM page to hold them. Process state is heading in that direction, but GPTs and Endpoints are pretty small.

Jonathan

Jonathan S. Shapiro

unread,
Mar 3, 2026, 5:19:10 PM (10 days ago) Mar 3
to cap-talk
OK. For those of you who aren't following all 5,000 bouncing balls at the moment, let me unpack the "SASOS" comment. Bear with me, because it's going to want some walking through.

Context from 32-bit implementations:

In KeyKOS, prepared (swizzled) capabilities lived on a doubly linked list anchored at their target object. When you remove the target, you walk the list and unswizzle the capabilities. Late in the EROS process, I realized that this caused a lot of cold cache references on a linked list that could hypothetically be as large as kernel memory.

Because of the cold cache costs, I went looking for a way to do background incremental unswizzle. After the discussions over the last few days, I am no longer sure that was the best solution.

The OTEntry got created because swizzling a capability overwrites the OID: the OID gets overwritten by the pointer to the ObjectHeader for the target object. In KeyKOS, when you walk the ring, you recover the OID from the ObjectHeader as you unswizzle the capability. Well, actually, you read it once and then use it as you unswizzle everything.

But in Coyotos the unswizzle is lazy background activity, so we need a way to keep the OID around after the object is gone. That's what the OTEntry is for - it keeps the OID and contains some flags related to the background cleanup process. One of those is a flag saying "no, the object was destroyed; unswizzle the capability to null rather than back to an object capability." This is an optimization only.
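An illustrative model of the OTEntry role just described, with invented field names:

```python
# Illustrative model (field names invented): an OTEntry preserves the OID
# after swizzling, plus flags for the background cleanup pass.
from dataclasses import dataclass

@dataclass
class OTEntry:
    oid: int
    destroyed: bool = False   # target destroyed: deprepare to null

@dataclass
class Capability:
    target: object            # OTEntry when prepared, raw OID otherwise
    prepared: bool = True

def deprepare(cap):
    """Lazy background unswizzle: recover the OID (or null) from the OTEntry."""
    ot = cap.target
    cap.target = None if ot.destroyed else ot.oid
    cap.prepared = False

cap = Capability(target=OTEntry(oid=42))
deprepare(cap)
assert cap.target == 42 and not cap.prepared

dead = Capability(target=OTEntry(oid=7, destroyed=True))
deprepare(dead)
assert dead.target is None
```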

A note: we want to be able to overwrite capabilities. The GC approach supports this, the KeyKOS key chain supports this. A singly linked list would not, because overwriting a capability would break the chain.

In the last few days we've been questioning whether swizzling was ever necessary, which has me re-thinking the OTEntry approach.

Did We Ever Need To Swizzle?

On 32-bit systems, the problem is that we can't build a "flat" correspondence between object space and kernel virtual space. Part of the issue is the 3G/1G split, which (as I've said) was introduced with the idea that we might want to run a UNIX subsystem some day. The bigger issue is that a one terabyte store is 2^28 pages, and we can't even map their bookkeeping structures in a 2GB region. So we end up with an unpleasant lookup to go from an OID to wherever the target object landed, and no, we don't want to do that lookup every time we access the capability.
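The arithmetic, with an assumed (hypothetical) 32-byte per-page bookkeeping structure:

```python
# Why a flat map can't work on 32-bit: 1 TiB of 4 KiB pages, with even a
# small per-page bookkeeping structure, overflows a 2 GiB kernel region.
pages = 2**40 // 2**12            # 2**28 page frames in a 1 TiB store
header_bytes = 32                 # hypothetical per-page ObjectHeader size
need = pages * header_bytes
print(pages == 2**28)             # True
print(need // 2**30)              # GiB of bookkeeping needed: 8
```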

And then, because we swizzled, we need to unswizzle. And there were basically four ways to do that:
  1. Have something like an OTEntry, or
  2. Keep the actual object around until its capabilities are unwound by background GC, or
  3. Keep the bidirectional links, but only walk a bounded number at a time when we are trying to remove a target object.
  4. Maintain something like a depend table, of the form "this object might hold a swizzled capability to that object." Which is effectively what the OTEntry is doing.
The main problem with the bi-directional links is their performance cost in the IPC path because of the cache misses.

But yes, I think we needed to swizzle on the 32-bit platforms. Though I've just noticed that the OTEntry objects are twice as big as they need to be.


On a 64-bit platform, I was thinking that for [near] page size objects we could go ahead and reserve a KVA region for each disk page or object pot, and then map them there as they come in. This would mean that the "disk page frame" part of the OID and the kernel virtual address are just two encodings for the same thing: frameOf(OID) = KVA + const, which makes unswizzling pretty easy, even if the object itself has been unmapped.
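A sketch of that correspondence, with an invented region base and bit layout:

```python
# Sketch of the flat correspondence: the frame number inside an OID and a
# kernel virtual address are two encodings of the same value.  The region
# base and page size here are assumptions for illustration.
KVA_BASE = 0xFFFF_8000_0000_0000   # hypothetical start of the flat region
PAGE_SHIFT = 12                    # 4 KiB frames

def kva_of_frame(frame):
    return KVA_BASE + (frame << PAGE_SHIFT)

def frame_of_kva(kva):
    return (kva - KVA_BASE) >> PAGE_SHIFT

# Round-trips by pure arithmetic, with no table lookup, even if the
# object itself is currently unmapped:
frame = 0x1234_5678
assert frame_of_kva(kva_of_frame(frame)) == frame
```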

But it doesn't extend well to smaller objects, because it implies mapping a full page that contains several of those objects. That's fine if a bunch of them are actually getting used, but not so great as co-usage within a frame declines. Which means that it's probably not great for the per-page ObjectHeader structures.

Setting aside Eric's post from an hour ago because I don't understand it yet, the thing that really determines whether these linear sparse arrays help or hurt is whether the objects that land in a particular page frame tend to be co-utilized.

I don't think we have any idea, and I'm not inclined to change this part of things around until we do.

Which regrettably means that the in-memory capability structure may have to be larger than the on-disk structure to accommodate two 64-bit pointers. Either that or we eat the on-disk space to keep them the same size. Not the end of the world, but annoying.


Jonathan


William ML Leslie

unread,
Mar 3, 2026, 8:09:36 PM (9 days ago) Mar 3
to cap-...@googlegroups.com
On Wed, 4 Mar 2026 at 07:53, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
On Tue, Mar 3, 2026 at 1:16 PM Mark S. Miller <eri...@gmail.com> wrote:
"SASOS"?

On Tue, Mar 3, 2026 at 1:14 PM digit...@gmail.com <digit...@gmail.com> wrote:
Reordering paragraphs for importance

On Monday, March 2, 2026 at 6:34:51 PM UTC-8 Jonathan S. Shapiro wrote:
[...]
On the other hand, we have a 2^63 byte kernel virtual address space to play with. Which means that we can assign a permanent kernel virtual address for every allocation of every page-sized "frame" on the store. Please note that I am not yet convinced this is a good idea - just pondering.

No, this is unsafe on modern high performance CPUs, absent specific hardware guarantees you know to be true.  I don't expect any CPU vendor to stand behind such guarantees except on in-order embedded style cores.

Why: speculative execution will trigger TLB loading along virtual addresses that were never touched, architecturally.  Anything valid in the page tables can become resident in the TLB.  And, anything in the TLB can produce a physical address that gets sent to DRAM and loaded into a cache.  By the time permission checks, consistency checks, and rollback logic conclude the access was not supposed to happen, the load is already in flight and can't be cancelled.  Assume speculation gadgets can exfiltrate all cache contents, absent reason to believe otherwise.

I'm confident that you're right, but I'm a bit confused. When I wrote "assign a permanent kernel virtual address", I should have written "address region". That is: a range of kernel virtual addresses at which we would eventually map page-sized objects when they come in. So if there isn't an object, or it hasn't come in, there are no valid PTE entries for the corresponding page frames. Even when valid they would be kernel-only mappings. The corresponding user-mode addresses are in a completely different part of the address space.

I've been writing my way through that, and I'm no longer convinced that it is as helpful as it seemed.

But you seem to be saying that kernel-only mappings more generally need added cautions around speculation. What do I need to read to understand the issues?

I think the original Meltdown paper is still a great resource for this, it's available at https://meltdownattack.com/

While the kernel-mode-only bit in the page table structure prevents user mode processes from loading data directly from kernel space, userspace processes can still cause the load to happen and use the loaded value to impact the cache in some way that allows them to infer the original content.  Our safety property doesn't depend on the secrecy of kernel memory, but it's a data leak.  The more we have mapped into the kernel space, the more serious this becomes, e.g. if we have all physical pages mapped into kernel space, any process can read any currently loaded SSH private keys and session keys.

If you're doing your own hardware, well, you can do the permission check when you've completed the translation and before you issue the load.  I admit I've never understood the performance arguments against this - it's a single bit, and by the time you know the physical address, you already have said bit.

William ML Leslie

unread,
Mar 3, 2026, 8:50:48 PM (9 days ago) Mar 3
to cap-...@googlegroups.com
On Wed, 4 Mar 2026 at 11:09, William ML Leslie <william.l...@gmail.com> wrote:
If you're doing your own hardware, well, you can do the permission check when you've completed the translation and before you issue the load.  I admit I've never understood the performance arguments against this - it's a single bit, and by the time you know the physical address, you already have said bit.

Oh.  If the cache is virtually addressed, then you can get the value at a virtual address without a matching entry being present in the TLB.  The entry might turn out to be from an invalid mapping, so the physical tag on the cache line is checked once the TLB is filled.
 

Jonathan S. Shapiro

unread,
Mar 3, 2026, 10:02:42 PM (9 days ago) Mar 3
to cap-...@googlegroups.com
William:

I haven't had a chance to refresh on meltdown, but I think I can answer one of your questions. You asked:

On Tue, Mar 3, 2026 at 5:50 PM William ML Leslie <william.l...@gmail.com> wrote:
On Wed, 4 Mar 2026 at 11:09, William ML Leslie <william.l...@gmail.com> wrote:
If you're doing your own hardware, well, you can do the permission check when you've completed the translation and before you issue the load.  I admit I've never understood the performance arguments against this - it's a single bit, and by the time you know the physical address, you already have said bit.

Oh.  If the cache is virtually addressed, then you can get the value at a virtual address without a matching entry being present in the TLB.  The entry might turn out to be from an invalid mapping, so the physical tag on the cache line is checked once the TLB is filled.

We actually looked at this at HaL back in 1991. It's known as a virtually indexed, physically tagged cache. Ideally you'd add the ASID, but between virtual tag, physical tag, and ASID those tags end up being a large proportion of the per-line state. In all of the designs I know about, you run the TLB check in parallel and try to reconcile the cache line result late. Intel was reluctant to go to virtually indexed caches for a long time, but I'm inferring from your comment and Eric's that they eventually did so?

That idea isn't so bad until you implement an out-of-order, renamed core. What happens next is that somebody decides to kick off the cache load result into the dataflow execution process. Now the TLB check can be really late, because nothing dependent on that load is going to be allowed to go back to memory until all possible "oops" conditions on the load are resolved, and anything that touches registers can be unwound by un-doing the renames. So no harm, no foul, right? But in that kind of design, that unwind might get tripped 60 or more instructions later. And you may execute data-dependent contingent branches in there. And they may load from the cache looking for their destination. And you can statistically measure whether that iCache load happened from a completely unrelated thread of control while this is going on. And then you discover a reason to unwind, and the machine [approximately] serializes while you let all of the discards settle and figure out what the program counter is now.

All of which takes a very noticeable, very measurable amount of time.
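The measurement idea can be caricatured in a few lines (a pure simulation, not real hardware: hits are cheap, misses are expensive, and the timing difference leaks which line the victim's speculative load touched):

```python
# Toy timing side channel: a "cache" where hits are fast and misses slow,
# letting an observer infer which line a victim touched speculatively.
CACHE_LINES = 8
HIT, MISS = 1, 100        # arbitrary cycle costs

def probe_times(cache):
    """Time a probe of every line; hits stand out as fast."""
    return [HIT if line in cache else MISS for line in range(CACHE_LINES)]

cache = set()
secret = 5
cache.add(secret)         # victim's (speculative) load leaves a footprint

times = probe_times(cache)
leaked = min(range(CACHE_LINES), key=lambda i: times[i])
assert leaked == secret   # the fast probe reveals the secret-indexed line
```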

The idea of multiple threads sharing an L1 cache and/or TLB (SMT) can reasonably be viewed as hardware support for side channel attacks.

Like I said, I haven't refreshed on meltdown, and that could be fun in a whole different way.


This sort of thing highlights one of the ways that an FPGA-based core can be helpful even if it is completely unmodified. I'm interested in playing with OoO, which is resource intensive, so I was poking at ChatGPT to estimate what might fit onto an XC7A200T (double the size of the Arty A7 that a lot of CHERIoT work is happening on). The first answer is that you probably aren't going to get a four-issue out-of-order core out of 200K logic cells. If you want to push seriously on OoO + multicore, you're looking at something closer to 1.5M logic cells. But in that footprint you can probably get:
  • Dual core simple-ish three stage pipeline to test kernel concurrency bugs
  • SMT2 (two thread single core), though probably not with vector unit
  • Two-way out-of-order single core.
  • With or without the hypervisor extensions
And of course more options as the FPGA fabric gets larger, which is why I'm drooling a bit about the Zynq part. And you can introduce hardware-level instrumentation if that seems helpful.

The reason this is a little interesting is that all of these approaches implement the same external specification of the processor. But the respective behaviors are very different in ways that stress different things. Nobody is going to confuse a CPU on an FPGA for fast compared to a hard core, but the fact that you can swap out the micro-architecture for the cost of shipping over a different bitstream (what we would think of as an executable) is pretty interesting.

I knew this in abstract, but I hadn't really stopped to think it through.

Another way to look at this is that the computation part of a CPU is kind of "done." You can invent new kinds of functional units, but ALUs and single-wide FPUs haven't changed substantially in 35 years. An instruction here or there, but...

The interesting steps forward have been in the form of new functional units, improved instruction decode and issue logic, and stuff that happens in the vicinity of the LSU/TLB/cache (the so-called "uncore").


Jonathan

William ML Leslie

unread,
Mar 3, 2026, 11:24:37 PM (9 days ago) Mar 3
to cap-...@googlegroups.com
On Wed, 4 Mar 2026 at 13:02, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
William:

I haven't had a chance to refresh on meltdown, but I think I can answer one of your questions. You asked:

On Tue, Mar 3, 2026 at 5:50 PM William ML Leslie <william.l...@gmail.com> wrote:
On Wed, 4 Mar 2026 at 11:09, William ML Leslie <william.l...@gmail.com> wrote:
If you're doing your own hardware, well, you can do the permission check when you've completed the translation and before you issue the load.  I admit I've never understood the performance arguments against this - it's a single bit, and by the time you know the physical address, you already have said bit.

Oh.  If the cache is virtually addressed, then you can get the value at a virtual address without a matching entry being present in the TLB.  The entry might turn out to be from an invalid mapping, so the physical tag on the cache line is checked once the TLB is filled.

We actually looked at this at HaL back in 1991. It's known as a virtually indexed, physically tagged cache. Ideally you'd add the ASID, but between virtual tag, physical tag, and ASID those tags end up being a large proportion of the per-line state. In all of the designs I know about, you run the TLB check in parallel and try to reconcile the cache line result late. Intel was reluctant to go to virtually indexed caches for a long time, but I'm inferring from your comment and Eric's that they eventually did so?

That idea isn't so bad until you implement an out-of-order, renamed core. What happens next is that somebody decides to kick off the cache load result into the dataflow execution process. Now the TLB check can be really late, because nothing dependent on that load is going to be allowed to go back to memory until all possible "oops" conditions on the load are resolved, and anything that touches registers can be unwound by un-doing the renames. So no harm, no foul, right? But in that kind of design, that unwind might get tripped 60 or more instructions later. And you may execute data-dependent contingent branches in there. And they may load from the cache looking for their destination. And you can statistically measure whether that iCache load happened from a completely unrelated thread of control while this is going on. And then you discover a reason to unwind, and the machine [approximately] serializes while you let all of the discards settle and figure out what the program counter is now.

All of which takes a very noticeable amount of time that is very measurable.

This is basically my understanding as of today.  I originally thought meltdown was about mappings.  It seems that the kernel map was just a convenient source of virtual addresses.
 
The idea of multiple threads sharing an L1 cache and/or TLB (SMT) can reasonably be viewed as hardware support for side channel attacks,

Yep.  Well, it turns out it's even true with a single thread - those fun demos using Spectre to dump the contents of the browser process from JavaScript come to mind.

Gernot has spoken about some interesting work the trustworthy systems group has been doing in that space: https://trustworthy.systems/publications/abstracts/Wistoff_SGBH_20.abstract

Another way to look at this is that the computation part of a CPU is kind of "done." You can invent new kinds of functional units, but ALUs and single-wide FPUs haven't changed substantially in 35 years. An instruction here or there, but...

The interesting steps forward have been in the form of new functional units, improved instruction decode and issue logic, and stuff that happens in the vicinity of the LSU/TLB/cache (the so-called "uncore").

Right.  It bifurcates the conversation, though.  Even if we can fix this in hardware, I'm still interested in solving this for amd64.

Jonathan S. Shapiro

Mar 3, 2026, 11:48:52 PM (9 days ago) Mar 3
to cap-...@googlegroups.com
On Tue, Mar 3, 2026 at 8:24 PM William ML Leslie <william.l...@gmail.com> wrote:
 
Right.  It bifurcates the conversation, though.  Even if we can fix this in hardware, I'm still interested in solving this for amd64.

How is your time machine project coming along?

Jonathan

William ML Leslie

Mar 4, 2026, 4:49:16 AM (9 days ago) Mar 4
to cap-...@googlegroups.com
On Wed, 4 Mar 2026 at 08:19, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
On a 64-bit platform, I was thinking that for [near] page size objects we could go ahead and reserve a KVA region for each disk page or object pot, and then map them there as they come in. This would mean that the "disk page frame" part of the OID and the kernel virtual address are just two encodings for the same thing: frameOf(OID) = KVA + const, which makes unswizzling pretty easy, even if the object itself has been unmapped.

But it doesn't extend well to smaller objects, because it implies mapping a full page that contains several of those objects. That's fine if a bunch of them are actually getting used, but not so great as co-usage within a frame declines. Which means that it's probably not great for the per-page ObjectHeader structures.
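A minimal sketch of the two-encodings idea above. All names and constants here (KVA_FRAME_BASE, the page size, and treating the OID directly as a frame number) are hypothetical simplifications, not actual Coyotos values:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: base of a reserved kernel VA region for frame mappings. */
#define KVA_FRAME_BASE 0xFFFF800000000000ULL
#define PAGE_SIZE      4096ULL

typedef uint64_t oid_t;

/* OID -> permanent kernel virtual address of the frame holding it.
 * For simplicity the OID is used directly as a frame number here. */
static inline uint64_t kva_of_frame(oid_t oid) {
    return KVA_FRAME_BASE + oid * PAGE_SIZE;
}

/* Unswizzling is just the inverse arithmetic; it needs no lookup
 * structure and works even if the object is not currently mapped. */
static inline oid_t oid_of_kva(uint64_t kva) {
    return (kva - KVA_FRAME_BASE) / PAGE_SIZE;
}
```

Because the correspondence is pure arithmetic, converting a kernel pointer back to an OID never requires consulting per-object state.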

I don't think any objects are really page-sized.  For example, a page (or cappage) object is an Object Header plus a virtual address.  The process object comes close, especially if you represent SIMD registers inline in the Process structure.
 
I don't think we have any idea, and I'm not inclined to change this part of things around until we do.

+1.  I think we can get by changing the pointer to an index into an array that we place into the largest available RAM region and revisit when that's not enough.
 
Which regrettably means that the in memory capability structure may have to be larger than the on-disk structure to accommodate two 64-bit pointers. Either that or we eat the on-disk space to keep them the same size. Not the end of the world, but annoying.

Yes.
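A sketch of the index-instead-of-pointer idea. The layouts and field names below are hypothetical, and the reuse of the allocation-count slot for the index follows the EROS-style "prepared key" pattern (the count is validated at prepare time), not any confirmed Coyotos layout:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-byte on-disk capability layout. */
struct DiskCap {
    uint8_t  type;
    uint8_t  restrictions;
    uint16_t protPayload;
    uint32_t allocCount;
    uint64_t oid;
};

/* In-memory form: a 32-bit index into a single ObjectHeader array
 * replaces what would otherwise be a 64-bit pointer.  A prepared cap
 * has already validated the allocation count, so that slot can hold
 * the index, and the size stays the same as the on-disk form. */
struct MemCap {
    uint8_t  type;
    uint8_t  restrictions;
    uint16_t protPayload;
    uint32_t otIndex;      /* index into the header array, not a pointer */
    uint64_t oid;
};

_Static_assert(sizeof(struct MemCap) == sizeof(struct DiskCap),
               "in-memory and on-disk caps stay the same size");
```

With 64-bit pointers instead of the index, alignment would force the in-memory cap to grow past the on-disk size, which is exactly the annoyance noted above.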

digit...@gmail.com

Mar 4, 2026, 7:31:34 PM (9 days ago) Mar 4
to cap-talk
On Tuesday, March 3, 2026 at 1:53:14 PM UTC-8 Jonathan S. Shapiro wrote:
On Tue, Mar 3, 2026 at 1:16 PM Mark S. Miller <eri...@gmail.com> wrote:
"SASOS"?

On Tue, Mar 3, 2026 at 1:14 PM digit...@gmail.com <digit...@gmail.com> wrote:
Reordering paragraphs for importance

On Monday, March 2, 2026 at 6:34:51 PM UTC-8 Jonathan S. Shapiro wrote:
[...]
On the other hand, we have a 2^63 byte kernel virtual address space to play with. Which means that we can assign a permanent kernel virtual address for every allocation of every page-sized "frame" on the store. Please note that I am not yet convinced this is a good idea - just pondering.

No, this is unsafe on modern high performance CPUs, absent specific hardware guarantees you know to be true.  I don't expect any CPU vendor to stand behind such guarantees except on in-order embedded style cores.

Why: speculative execution will trigger TLB loading along virtual addresses that were never touched, architecturally.  Anything valid in the page tables can become resident in the TLB.  And, anything in the TLB can produce a physical address that gets sent to DRAM and loaded into a cache.  By the time permission checks, consistency checks, and rollback logic conclude the access was not supposed to happen, the load is already in flight and can't be cancelled.  Assume speculation gadgets can exfiltrate all cache contents, absent reason to believe otherwise.

I'm confident that you're right, but I'm a bit confused. When I wrote "assign a permanent kernel virtual address", I should have written "address region". That is: a range of kernel virtual addresses at which we would eventually map page-sized objects when they come in. So if there isn't an object, or it hasn't come in, there are no valid PTE entries for the corresponding page frames. Even when valid they would be kernel-only mappings. The corresponding user-mode addresses are in a completely different part of the address space.

I've been writing my way through that, and I'm no longer convinced that it is as helpful as it seemed.

But you seem to be saying that kernel-only mappings more generally need added cautions around speculation. What do I need to read to understand the issues?

All mappings need caution around speculation, but you already were going to structure the system so that usermode address spaces of different security contexts were not going to be simultaneously visible to a given CPU, one presumes. Thus I focus on the kernel space.  WRT usermode, small spaces from EROS would be an example of a design that is safe on an in-order machine and which, when run on a heavily speculative OoO cpu, introduces channels that would allow cross-Domain data exfiltration that should be impossible.

SASOS can be fine, in that a given kernel VA will only ever hold a particular object id or else be invalid.  When running a particular Process, no VAs in the current address map (even if they have supervisor-only permission) should map data that Process doesn't have permission to access.  So: don't casually share the substructure of the supervisor mode page tables, even if the virtual address -> object ID mapping is shared.  Exceptions to that rule require reasoning to explain why they are safe.
  

digit...@gmail.com

Mar 4, 2026, 7:58:15 PM (8 days ago) Mar 4
to cap-talk
On Tuesday, March 3, 2026 at 7:02:42 PM UTC-8 Jonathan S. Shapiro wrote:
William:

I haven't had a chance to refresh on Meltdown, but I think I can answer one of your questions. You asked:

On Tue, Mar 3, 2026 at 5:50 PM William ML Leslie <william.l...@gmail.com> wrote:
On Wed, 4 Mar 2026 at 11:09, William ML Leslie <william.l...@gmail.com> wrote:
If you're doing your own hardware, well, you can do the permission check when you've completed the translation and before you issue the load.  I admit I've never understood the performance arguments against this - it's a single bit, and by the time you know the physical address, you already have said bit.

Oh.  If the cache is virtually addressed, then you can get the value at a virtual address without a matching entry being present in the TLB.  The entry might turn out to be from an invalid mapping, so the physical tag on the cache line is checked once the TLB is filled.

We actually looked at this at HaL back in 1991. It's known as a virtually indexed, physically tagged cache. Ideally you'd add the ASID, but between virtual tag, physical tag, and ASID those tags end up being a large proportion of the per-line state. In all of the designs I know about, you run the TLB check in parallel and try to reconcile the cache line result late. Intel was reluctant to go to virtually indexed caches for a long time, but I'm inferring from your comment and Eric's that they eventually did so?

Yes, VIPT is nearly universal for L1 caches on high performance CPUs.  It caps cache size at (page size) * (associativity), since you can't index by anything but the within-a-page address bits, but associativity has been slowly increasing as well over the years.  To keep coherence, there is also a PIPT snooping on all the good (from a systems programmer's PoV) architectures, which includes x86 and ARM (only the dcache snoops).
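The size cap falls out of simple arithmetic. A sketch, using the common x86 parameters (4 KiB pages) as an illustrative assumption: the set index must come entirely from page-offset bits, so sets * line_size <= page_size, and total capacity = sets * line_size * ways is therefore capped at page_size * ways:

```c
#include <assert.h>
#include <stddef.h>

/* Maximum VIPT L1 capacity: the index bits must fit within the page
 * offset, so sets * line_size <= page_size, and capacity
 * (= sets * line_size * ways) is capped at page_size * ways. */
static size_t max_vipt_capacity(size_t page_size, size_t ways) {
    return page_size * ways;
}
```

With 4 KiB pages, an 8-way L1 caps out at 32 KiB, which matches the long-standing x86 L1D size; growing the L1 further means adding ways.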

That idea isn't so bad until you implement an out-of-order, renamed core. What happens next is that somebody decides to kick off the cache load result into the dataflow execution process. Now the TLB check can be really late, because nothing dependent on that load is going to be allowed to go back to memory until all possible "oops" conditions on the load are resolved, and anything that touches registers can be unwound by un-doing the renames. So no harm, no foul, right? But in that kind of design, that unwind might get tripped 60 or more instructions later. And you may execute data-dependent contingent branches in there. And they may load from the cache looking for their destination. And you can statistically measure whether that iCache load happened from a completely unrelated thread of control while this is going on. And then you discover a reason to unwind, and the machine [approximately] serializes while you let all of the discards settle and figure out what the program counter is now.

Yeah.  The workaround - which, to my knowledge has not been commercially implemented - is to keep victim buffers of evicted cache structures such that abandoned speculation can also undo cache evictions (and to invalidate any entries loaded by abandoned speculation).  Or, to accept the slowdown of not speculating.  Revealed preferences unsurprisingly show people want fast systems.
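The rollback idea can be sketched as a toy simulation. Everything here is hypothetical illustration (a tiny direct-mapped cache with one victim slot per set); real hardware would need far more state and per-way tracking:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SETS 4

/* One set of a toy direct-mapped cache, plus a victim slot that lets
 * abandoned speculation undo the eviction it caused. */
struct Set {
    uint64_t tag;     bool valid;      bool spec;  /* filled speculatively */
    uint64_t victim;  bool has_victim;             /* evicted by speculation */
};

static struct Set cache[SETS];

static void spec_load(uint64_t addr) {
    struct Set *s = &cache[addr % SETS];
    if (s->valid && !s->spec) { s->victim = s->tag; s->has_victim = true; }
    s->tag = addr; s->valid = true; s->spec = true;
}

/* Speculation abandoned: restore victims, invalidate speculative fills. */
static void squash(void) {
    for (int i = 0; i < SETS; i++) {
        struct Set *s = &cache[i];
        if (s->spec) {
            if (s->has_victim) s->tag = s->victim;
            else               s->valid = false;
            s->spec = false; s->has_victim = false;
        }
    }
}

/* Speculation confirmed: commit fills, drop victims. */
static void retire(void) {
    for (int i = 0; i < SETS; i++) { cache[i].spec = false; cache[i].has_victim = false; }
}

static bool cached(uint64_t addr) {
    struct Set *s = &cache[addr % SETS];
    return s->valid && s->tag == addr;
}
```

The cost that makes this unattractive commercially is visible even in the toy: every set carries shadow state, and squash becomes a walk over the whole structure on the critical recovery path.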
 

All of which takes a very noticeable amount of time that is very measurable.

The idea of multiple threads sharing an L1 cache and/or TLB (SMT) can reasonably be viewed as hardware support for side channel attacks,

Yup.  There are _many_ other microarchitectural channels. Intel keeps a list of publicly disclosed ones (of varying severity) at https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html , and it got so long they had to break the page up by year of discovery.  Basically, every reason to speculate (and then maybe need to unwind speculation), and every optimization that applies sometimes but not universally, ends up creating some data channel, and the question is around bandwidth and ergonomics for an attacker.  More will be found, which is why my starting position is "everything in the cache is accessible to attackers" and "leave no live TLB entries nor mappings to contexts not currently running".

Like I said, I haven't refreshed on Meltdown, and that could be fun in a whole different way.

It put some wear and tear on defenders who had to mitigate existing silicon.

William ML Leslie

Mar 4, 2026, 9:02:50 PM (8 days ago) Mar 4
to cap-...@googlegroups.com
On Thu, 5 Mar 2026 at 10:31, digit...@gmail.com <digit...@gmail.com> wrote:
When running a particular Process, no VAs in the current address map (even if they have supervisor-only permission) should map data that Process doesn't have permission to access.  So: don't casually share the substructure of the supervisor mode page tables, even if the virtual address -> object ID mapping is shared.  Exceptions to that rule require reasoning to explain why they are safe.

I'm trying to understand if this is really sufficient.

Let's say I only need a minimal amount of supervisor mode table entries mapped by default to handle interrupts and the first part of system calls.  I might map more as needed, but I always prune back to just the essentials.

Do I know that addresses I used before pruning the mapping are no longer in the cache?

Do I know that addresses the previous process was using, which also are not in the map, are not in the cache?

Am I right to think that the effect of pruning is preventing speculation from filling cache lines at those addresses, limiting the attack surface to addresses already cached?

Thank you Eric for reminding me to revisit this (you can tell by the second message in this thread that I had been putting it off).

Jonathan S. Shapiro

Mar 7, 2026, 1:04:57 AM (6 days ago) Mar 7
to cap-...@googlegroups.com
On Wed, Mar 4, 2026 at 1:49 AM William ML Leslie <william.l...@gmail.com> wrote:
I don't think any objects are really page-sized.  For example, a page (or cappage) object is an Object Header plus a virtual address.  The process object comes close, especially if you represent SIMD registers inline in the Process structure.

Pages and cappages have object headers off on the side, so they are effectively page-sized objects, and their associated ObjectHeader vector is a contiguous range of pages.

Right now the SIMD registers mostly are kept inline in the process structure. This initially happened because x86 xmm and fp state overlapped. ymm didn't exist yet on those processors, but the (mostly unconsidered) plan was to continue the pattern. The object ratios are such that adding bytes to processes isn't as bad as adding bytes to most other things, and it seems like we probably don't want to separately page in a GPU state object. Eliminating the complexity of process state handling in KeyKOS is one of the really big wins in the Coyotos implementation.

Jonathan
 
 
I don't think we have any idea, and I'm not inclined to change this part of things around until we do.

+1.  I think we can get by changing the pointer to an index into an array that we place into the largest available RAM region and revisit when that's not enough.
 
Which regrettably means that the in memory capability structure may have to be larger than the on-disk structure to accommodate two 64-bit pointers. Either that or we eat the on-disk space to keep them the same size. Not the end of the world, but annoying.

Yes.

Update: per my other post, we'll use indices and keep the current cap size except in large-DRAM scenarios.


Jonathan

Jonathan S. Shapiro

Mar 7, 2026, 1:41:44 AM (6 days ago) Mar 7
to cap-...@googlegroups.com
On Wed, Mar 4, 2026 at 6:02 PM William ML Leslie <william.l...@gmail.com> wrote:
On Thu, 5 Mar 2026 at 10:31, digit...@gmail.com <digit...@gmail.com> wrote:
When running a particular Process, no VAs in the current address map (even if they have supervisor-only permission) should map data that Process doesn't have permission to access.  So: don't casually share the substructure of the supervisor mode page tables, even if the virtual address -> object ID mapping is shared.  Exceptions to that rule require reasoning to explain why they are safe.

Thoughts:
  • The Coyotos supervisor-mode mappings are effectively constant.
  • The transmap virtual range is constant, the transmap itself is CPU-local and flushed on context switch, but the PTEs change, and we rely on them not getting re-loaded into the TLB until we re-allocate them, even if the valid bit is set, because the kernel follows very explicit rules about how it accesses them following an invalidate. I think it would be pretty easy to zero out the transmap PTEs and keep them zero while in user mode; I'd need to double check. If that works, it's even easier on 64-bit, where a bigger transmap region would let us zero the singleton second level PTE.
  • Offhand, I don't think we need to clear the transmap when returning to the same address space. The only case in which pages in another address space end up in the transmap is string transfer, but if we transmitted a string to a receiver, the data in that string is already known to us. Adjacent data in the same receiver page frames could be exposed.
  • User data pages and (I think) user mapping tables are manipulated through transient mappings, so I think that case reduces to "how safe is the transmap?"
  • Shared sub-spaces share mapping tables whenever possible, but I am provisionally inclined to think that if two address spaces are sharing memory already there isn't a marginal exposure arising from the fact that the associated mapping structures are being shared. Mapping tables that contain a mix of shared and non-shared PTEs are not shared unless they are shared at a higher level, which might arise when two process objects acting as threads are sharing their entire address space.
  • Non-page durable objects and their dependency tracking structures have durable kernel maps, but aside from the per-process registers I'm not sure that their content is all that sensitive.
  • If we declared that every in-memory process occupies a full page, we could probably guard them by handling them with the transmap, but holy cow would I want to think that one through pretty carefully.
This is not exhaustive, it's just what came to mind quickly.
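The "zero the transmap PTEs and keep them zero while in user mode" discipline from the second bullet can be sketched roughly as follows. All names, the entry count, and the PTE encoding here are hypothetical stand-ins, not the actual Coyotos structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TRANSMAP_ENTRIES 64
#define PTE_VALID 0x1ULL

/* Per-CPU transient-mapping PTEs (hypothetical layout). */
static uint64_t transmap[TRANSMAP_ENTRIES];

/* Kernel maps a physical frame through a transmap slot. */
static void transmap_map(unsigned slot, uint64_t pa) {
    transmap[slot] = (pa & ~0xFFFULL) | PTE_VALID;
}

/* On the return path to user mode: leave no valid transmap PTE, so
 * user-context speculation cannot refill those translations.  Real
 * code would also invalidate the corresponding TLB entries. */
static void transmap_clear_for_user(void) {
    memset(transmap, 0, sizeof(transmap));
}

static bool transmap_any_valid(void) {
    for (unsigned i = 0; i < TRANSMAP_ENTRIES; i++)
        if (transmap[i] & PTE_VALID) return true;
    return false;
}
```

On 64-bit, as the bullet notes, a sufficiently aligned transmap region would let this whole clear collapse to zeroing the one second-level PTE that covers it, instead of walking the leaf entries.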


Jonathan