Hopefully this is the end of the sequence of Coyotos messages for today. What's really been going on is that I've been slowly wrapping my head around the implications of a 64-bit address space and just how big that is, and how many of the tricks used in the 32-bit kernel are either (a) not relevant, or (b) not possible.

In particular, anything that relies on background GC or a background cleanup pass over a 64-bit address space just isn't going to work. We can do things like that for the pages that become dirty, but not by using a linear scan through memory. Which means that the OTEntry idea for "depreparing" capabilities doesn't work at this scale.
On the other hand, we have a 2^63 byte kernel virtual address space to play with. Which means that we can assign a permanent kernel virtual address for every allocation of every page-sized "frame" on the store. Please note that I am not yet convinced this is a good idea - just pondering.
And here I had thought that migration to 64-bit was going to be simple.
There might be some tricks we can use to get there. An algorithm for building ... precliques? comes to mind. We have shared weak address spaces, processes that add mutable storage on top of those, and endpoints that connect processes together.
On Tue, 3 Mar 2026 at 16:59, William ML Leslie <william.l...@gmail.com> wrote:

There might be some tricks we can use to get there. An algorithm for building ... precliques? comes to mind. We have shared weak address spaces, processes that add mutable storage on top of those, and endpoints that connect processes together.

Endpoints and cappages. I don't think we know clearly yet just how extensively people will use cappages but I guess that's the challenging part.
What got me thinking on those terms is that we have similar opportunities with Mappings. I suspect that most Mappings could be bound for their lifetime to one CPU.
On the other hand, we have a 2^63 byte kernel virtual address space to play with. Which means that we can assign a permanent kernel virtual address for every allocation of every page-sized "frame" on the store. Please note that I am not yet convinced this is a good idea - just pondering.

Meltdown aside, I love that we might be able to completely remove mapping changes (via the transmap) from the IPC path. In fact, we could nest both the entire sender and receiver mappings inside one and have indirect string copies apply to contiguous virtual addresses.
Jonathan
--
You received this message because you are subscribed to the Google Groups "cap-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cap-talk+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/cap-talk/43f72001-b4e8-472c-aa95-b692c65cbc10n%40googlegroups.com.
"SASOS"?

Reordering paragraphs for importance
On Monday, March 2, 2026 at 6:34:51 PM UTC-8 Jonathan S. Shapiro wrote:

[...]

On the other hand, we have a 2^63 byte kernel virtual address space to play with. Which means that we can assign a permanent kernel virtual address for every allocation of every page-sized "frame" on the store. Please note that I am not yet convinced this is a good idea - just pondering.
No, this is unsafe on modern high performance CPUs, absent specific hardware guarantees you know to be true. I don't expect any CPU vendor to stand behind such guarantees except on in-order embedded style cores.
Why: speculative execution will trigger TLB loading along virtual addresses that were never touched, architecturally. Anything valid in the page tables can become resident in the TLB. And anything in the TLB can produce a physical address that gets sent to DRAM and loaded into a cache. By the time the permission checks, consistency checks, and rollback logic conclude the access was not supposed to happen, the load is already in flight and can't be cancelled. Assume speculation gadgets can exfiltrate all cache contents, absent reason to believe otherwise.
I've never used a transparently persistent system, so my intuitions here are uncalibrated. The changes in order of magnitude since such systems were last used give me pause. Did KeyKOS swap from tape (cite the 20-year-old Jim Gray disk-is-tape comparison and then extrapolate another 20 years)?
On Tue, Mar 3, 2026 at 1:16 PM Mark S. Miller <eri...@gmail.com> wrote:

"SASOS"?

Reordering paragraphs for importance
On Monday, March 2, 2026 at 6:34:51 PM UTC-8 Jonathan S. Shapiro wrote:

[...]

On the other hand, we have a 2^63 byte kernel virtual address space to play with. Which means that we can assign a permanent kernel virtual address for every allocation of every page-sized "frame" on the store. Please note that I am not yet convinced this is a good idea - just pondering.
No, this is unsafe on modern high performance CPUs, absent specific hardware guarantees you know to be true. I don't expect any CPU vendor to stand behind such guarantees except on in-order embedded style cores.
Why: speculative execution will trigger TLB loading along virtual addresses that were never touched, architecturally. Anything valid in the page tables can become resident in the TLB. And anything in the TLB can produce a physical address that gets sent to DRAM and loaded into a cache. By the time the permission checks, consistency checks, and rollback logic conclude the access was not supposed to happen, the load is already in flight and can't be cancelled. Assume speculation gadgets can exfiltrate all cache contents, absent reason to believe otherwise.

I'm confident that you're right, but I'm a bit confused. When I wrote "assign a permanent kernel virtual address", I should have written "address region". That is: a range of kernel virtual addresses at which we would eventually map page-sized objects when they come in. So if there isn't an object, or it hasn't come in, there are no valid PTE entries for the corresponding page frames. Even when valid they would be kernel-only mappings. The corresponding user-mode addresses are in a completely different part of the address space.

I've been writing my way through that, and I'm no longer convinced that it is as helpful as it seemed.

But you seem to be saying that kernel-only mappings more generally need added cautions around speculation. What do I need to read to understand the issues?
If you're doing your own hardware, well, you can do the permission check when you've completed the translation and before you issue the load. I admit I've never understood the performance arguments against this - it's a single bit, and by the time you know the physical address, you already have said bit.
On Wed, 4 Mar 2026 at 11:09, William ML Leslie <william.l...@gmail.com> wrote:

If you're doing your own hardware, well, you can do the permission check when you've completed the translation and before you issue the load. I admit I've never understood the performance arguments against this - it's a single bit, and by the time you know the physical address, you already have said bit.

Oh. If the cache is virtually addressed, then you can get the value at a virtual address without a matching entry being present in the TLB. The entry might turn out to be from an invalid mapping, so the physical tag on the cache line is checked once the TLB is filled.
William:

I haven't had a chance to refresh on meltdown, but I think I can answer one of your questions. You asked:

On Tue, Mar 3, 2026 at 5:50 PM William ML Leslie <william.l...@gmail.com> wrote:

On Wed, 4 Mar 2026 at 11:09, William ML Leslie <william.l...@gmail.com> wrote:

If you're doing your own hardware, well, you can do the permission check when you've completed the translation and before you issue the load. I admit I've never understood the performance arguments against this - it's a single bit, and by the time you know the physical address, you already have said bit.

Oh. If the cache is virtually addressed, then you can get the value at a virtual address without a matching entry being present in the TLB. The entry might turn out to be from an invalid mapping, so the physical tag on the cache line is checked once the TLB is filled.

We actually looked at this at HaL back in 1991. It's known as a virtually indexed, physically tagged cache. Ideally you'd add the ASID, but between virtual tag, physical tag, and ASID those tags end up being a large proportion of the per-line state. In all of the designs I know about, you run the TLB check in parallel and try to reconcile the cache line result late. Intel was reluctant to go to virtually indexed caches for a long time, but I'm inferring from your comment and Eric's that they eventually did so?

That idea isn't so bad until you implement an out-of-order, renamed core. What happens next is that somebody decides to kick off the cache load result into the dataflow execution process. Now the TLB check can be really late, because nothing dependent on that load is going to be allowed to go back to memory until all possible "oops" conditions on the load are resolved, and anything that touches registers can be unwound by un-doing the renames. So no harm, no foul, right? But in that kind of design, that unwind might get tripped 60 or more instructions later. And you may execute data-dependent contingent branches in there.
And they may load from the cache looking for their destination. And you can statistically measure whether that iCache load happened from a completely unrelated thread of control while this is going on. And then you discover a reason to unwind, and the machine [approximately] serializes while you let all of the discards settle and figure out what the program counter is now.

All of which takes a very noticeable amount of time that is very measurable.
The idea of multiple threads sharing an L1 cache and/or TLB (SMT) can reasonably be viewed as hardware support for side channel attacks.
Another way to look at this is that the computation part of a CPU is kind of "done." You can invent new kinds of functional units, but ALUs and single-wide FPUs haven't changed substantially in 35 years. An instruction here or there, but...

The interesting steps forward have been in the form of new functional units, improved instruction decode and issue logic, and stuff that happens in the vicinity of the LSU/TLB/cache (the so-called "uncore").
Right. It bifurcates the conversation, though. Even if we can fix this in hardware, I'm still interested in solving this for amd64.
On a 64-bit platform, I was thinking that for [near] page size objects we could go ahead and reserve a KVA region for each disk page or object pot, and then map them there as they come in. This would mean that the "disk page frame" part of the OID and the kernel virtual address are just two encodings for the same thing: frameOf(OID) = KVA + const, which makes unswizzling pretty easy, even if the object itself has been unmapped.

But it doesn't extend well to smaller objects, because it implies mapping a full page that contains several of those objects. That's fine if a bunch of them are actually getting used, but not so great as co-usage within a frame declines. Which means that it's probably not great for the per-page ObjectHeader structures.
I don't think we have any idea, and I'm not inclined to change this part of things around until we do.
Which regrettably means that the in-memory capability structure may have to be larger than the on-disk structure to accommodate two 64-bit pointers. Either that or we eat the on-disk space to keep them the same size. Not the end of the world, but annoying.
Like I said, I haven't refreshed on meltdown, and that could be fun in a whole different way.
When running a particular Process, no VAs in the current address map (even if they have supervisor-only permission) should map data that Process doesn't have permission to access. So: don't casually share the substructure of the supervisor mode page tables, even if the virtual address -> object ID mapping is shared. Exceptions to that rule require reasoning to explain why they are safe.
I don't think any objects are really page-sized. For example, a page (or cappage) object is an Object Header plus a virtual address. The process object comes close, especially if you represent SIMD registers inline in the Process structure.
I don't think we have any idea, and I'm not inclined to change this part of things around until we do.

+1. I think we can get by changing the pointer to an index into an array that we place into the largest available RAM region and revisit when that's not enough.

Which regrettably means that the in-memory capability structure may have to be larger than the on-disk structure to accommodate two 64-bit pointers. Either that or we eat the on-disk space to keep them the same size. Not the end of the world, but annoying.

Yes.