I feel a much smaller endeavor would be of great value: a “how to design an object capability OS” that lays out the common architectural components in the abstract, then identifies the concrete design tensions and rules weakly one way or the other, or identifies concrete examples and their consequences. The balance of some of those design tensions undoubtedly shifts as the compute-to-memory-to-storage economics change, or simply with the appearance of hypervisors on commodity hardware.
We may be at a similar moment for object capability operating systems, where a critical mass of folks who understand the architecture well enough to go off and recapitulate the mistakes might emerge, and at the end of this epoch, we’ll have a few mainstream open source options.
To me a historical account would be far more interesting & valuable. What changes were made and why, options considered, what was planned (but not implemented), what you think of those decisions in hindsight, etc.
Bakul Shah
... a lot of the seL4 proofs are built on an abstract model written in Haskell, with a refinement relation between that model and the kernel implementation that people actually run on hardware.
I feel like they have shown there is a certain amount of room for this abstract-model approach to work in practice. But then again I'm fond of "literate formal verification", so of course I'd love to see an abstract model that was useful not only in book form, but for proof purposes on a substrate more "ideal" than today's hardware.
At the very least I feel like there are a *lot* of details (both things that need to be explained and properties that can be proven) for which an implementation on real metal isn't strictly necessary... for those, the hardware may merely serve to complicate things. Like, I don't even know where to start proving process confinement on a hardware model....
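To make "a relation between the model and the kernel" slightly more concrete: the obligation has roughly the shape of a forward simulation. Here is a generic sketch in Lean 4 with made-up names (AbsStep, ConcStep, R); it is not seL4's actual Isabelle/Haskell development, just the shape of the statement:

-- An abstract transition relation, a concrete one, and an abstraction
-- relation R between their states. Refinement then says every concrete
-- step is matched by an abstract step that preserves R.
variable {A C : Type}
variable (AbsStep : A → A → Prop) (ConcStep : C → C → Prop) (R : A → C → Prop)

def Refines : Prop :=
  ∀ (a : A) (c c' : C), R a c → ConcStep c c' → ∃ a' : A, AbsStep a a' ∧ R a' c'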
>> But then again i'm fond of "literate formal verification"...
> But I'm thrilled to see you arguing about documenting the verification. When I started this, the prevailing wisdom was that the verification was not, in principle, possible, because "capability systems cannot enforce the confinement property" (and by implication other forms of isolation).
FWIW, I think this is fair, understandable, and totally reasonable: a formal verification project alone is difficult, and a book alone is difficult.
I don't want to misrepresent the amount of progress I actually
achieved, but part of this was due to also writing all the proof
language...
>> Like I don't
>> even know where to start proving process confinement on a hardware
>> model....
On Tue, Aug 26, 2025 at 7:33 AM Jonathan S. Shapiro
<jonathan....@gmail.com> wrote:
>
> On Tue, Aug 26, 2025 at 12:13 AM Matt Rice <rat...@gmail.com> wrote:
>>
>> >> But then again i'm fond of "literate formal verification"...
>
>
> I'm totally making up in my head what that means.
Literate formal verification is essentially just Knuth's literate programming applied to formal verification: take the documentation-generation tooling of modern programming languages and apply it to proof languages.
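As a toy illustration (nothing to do with any real kernel proof): in Lean 4, doc comments attached to modules and theorems are exactly the kind of thing a documentation generator such as doc-gen4 renders alongside the formal statements:

/-! A literate-style module header: the narrative that motivates the
    definitions below lives in the source and is rendered by the
    documentation generator. -/

/-- Zero is a left identity for addition on the naturals. Like the module
    header above, this doc comment travels with the proof and appears in
    the generated documentation. -/
theorem zero_add' (n : Nat) : 0 + n = n :=
  Nat.zero_add n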
I suspect people will be enhancing L4 variants for some time to come. And it's interesting to me that some of the architectural principles selected for L4 are so different from the ones in KeyKOS, EROS, and Coyotos.
Perfect security tech only makes the world safer if it gets adopted.
IMO, both the KeyKOS line and seL4 suffer in adoption from the difficulty of programming only with explicitly allocated memory. Most language-based and protocol-based ocap systems adopt the conventional memory-safe language approach of implicit allocation.
- As a local language-based system, HardenedJS/Endo implicitly allocates and garbage-collects memory. As a protocol, OCapN (the latest CapTP) assumes implicit allocation and provides for distributed acyclic gc. SwingSet is our KeyKOS-inspired OS kernel for running multiple vats on a blockchain, communicating via OCapN semantics both with other vats on the same chain/platform and with vats running on other, mutually suspicious platforms. We do "perfectly" protect integrity. But because of implicit allocation, we can at most deter, rather than prevent, attacks on availability. Much greater memory headroom makes such deterrence better than we old timers may have expected. But still, it is necessarily much worse than actual protection.
- Rust is in the middle of this spectrum for a language. Cap'n Proto (a statically typed CapTP) is in the middle for a protocol (and for most of its respective language bindings). These middles are much harder to program to, but still don't provide the defense of availability possible in the KeyKOS or seL4 lines.
- I am not aware of any safe language that treats memory resources at the language level with the same explicitness as the KeyKOS or seL4 lines. This is largely because the usability problems are so obvious in any language design that language designers give up before they get very far. But this is symptomatic of the usability (and teaching) problems that arise even at the OS level.
Btw, personally, I love what I remember of the KeyKOS design decisions driven by explicit memory. But I rapidly became disillusioned with trying to teach this style of programming. It infects everything.
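To make "it infects everything" concrete, here is a small Go sketch. The Space type is an invented stand-in for something like a KeyKOS space bank (not any real API); the point is that once storage must come from an explicitly granted space, that capability shows up in every constructor, and every call site must handle exhaustion:

package main

import (
	"errors"
	"fmt"
)

// Space is a hypothetical capability to a fixed budget of storage units.
type Space struct{ free int }

func (s *Space) Alloc(n int) error {
	if s.free < n {
		return errors.New("space exhausted") // availability failure is explicit, not a GC pause or an OOM kill
	}
	s.free -= n
	return nil
}

// Even a trivial data structure has to be handed a space and thread it through.
type Queue struct {
	space *Space
	items []string
}

func NewQueue(space *Space) (*Queue, error) {
	if err := space.Alloc(1); err != nil {
		return nil, err
	}
	return &Queue{space: space}, nil
}

func (q *Queue) Push(item string) error {
	if err := q.space.Alloc(1); err != nil {
		return err // every caller of Push must be written to cope with this
	}
	q.items = append(q.items, item)
	return nil
}

func main() {
	s := &Space{free: 3}
	q, err := NewQueue(s)
	if err != nil {
		panic(err)
	}
	for _, it := range []string{"a", "b", "c"} {
		if err := q.Push(it); err != nil {
			fmt.Println("push failed:", err) // the third push finds the space exhausted
		}
	}
}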
Given that, today, seL4 is the natural attractor, I have wondered to what degree we could consolidate our interests in KeyKOS-lineage OSes (KeyKOS, CapROS, EROS, Coyotos) into efforts to improve seL4 as needed, so we're not losing any of the *significant* differential value of the KeyKOS line. So let's start with a simpler question: What if we just dropped the KeyKOS line entirely and contributed all of that effort and interest to the seL4 system? What would we be missing? Of that, what is hard to live without?
It appears to me that *L4 has succeeded as a microvisor that provides isolation, not as an operating system platform.
On Mon, Aug 25, 2025 at 8:33 PM 'Bakul Shah' via friam <fr...@googlegroups.com> wrote:
To me a historical account would be far more interesting & valuable. What changes were made and why, options considered, what was planned (but not implemented), what you think of those decisions in hindsight, etc.
Bakul Shah
Thank you, Bakul.
Here is an example that may help to frame the challenge.
One of the primordial decisions in GNOSIS and KeyKOS was that the universe consisted of nodes (which hold capabilities) and pages (which hold data). Every system construct in KeyKOS is either primitive (you don't need a node or a page to halt the system), or is constructed as an arrangement of nodes and pages. Ultimately, the motivation for this was that the disk layout was partitioned into node spaces and page spaces at installation time and unchanged thereafter. In the 1970s, 100MB was a very big disk drive, so every megabyte wasted was significant and every new object type carried a risk that the balance of object types in the partition would be wrong. Because of this, processes and memory maps in GNOSIS/KeyKOS/EROS were a "pun" on nodes. A significant amount of code and complexity existed to preserve and guard invariants that support that pun.
At least in the US, I can buy a 4TB SSD today for a very small fraction of the cost of that 1970s hard drive. Completely wasting 100MB of disk space to support a simpler set of on-disk and in-memory data structures is a no-brainer.
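Roughly, as a Go-flavored sketch (invented type names; slot and page sizes quoted from memory):

package main

// A capability ("key") designates some object and conveys authority to it.
type Capability struct {
	kind   string // e.g. "page key", "node key" -- illustrative only
	target any
}

// A node holds nothing but a fixed number of capability slots (16 keys, as I recall).
type Node struct {
	slots [16]Capability
}

// A page holds nothing but data (4096 bytes).
type Page struct {
	data [4096]byte
}

// A process is not a distinct on-disk object type: it is a "pun" on an
// arrangement of nodes whose slots the kernel interprets as register state,
// the address-space root, and so on.
type Process = Node

func main() {}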
If the goal is for people to understand the result, maybe it's better to avoid getting tangled up in the complexity of constructions that are no longer relevant 30 years later. I'm not saying that as a "this is it" position, but as food for thought.
And holy cr*p, the notion that we might even talk about continued 30 year relevance in the field of computing is kind of mind blowing!
I also think this depends on what the existing papers describe. I wrote The KeyKOS Nanokernel Architecture paper 33 years ago. I should probably have a look at it to refresh my memory about what it says. :-)
Jonathan
> In short: GC doesn't scale in the economic sense. I'm actually very sad about that, but the numbers are what they are. I spent quite a long time looking for ways to address this during the BitC effort.
How then do you account for the ginormous popularity of JavaScript running in NodeJS as the server platform of choice for new web applications?
This feels to me a lot like the debates 40 years ago about whether we could afford the performance and memory overhead of using C for performance sensitive code.
Despite all the hand-wringing about how impractical it was, the huge productivity benefit pushed developers over the edge, and then in fairly short order Moore’s Law made up for the cost hit.
On Aug 25, 2025, at 11:03 PM, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
If the goal is for people to understand the result, maybe it's better to avoid getting tangled up in the complexity of constructions that are no longer relevant 30 years later. I'm not saying that as a "this is it" position, but as food for thought.
Such complexity might be considered machine dependent. One of the things a port can help with is discovering what is general purpose vs. machine dependent. That is, how much does a new species differ from an older one? For that one has to study them both and discover the common factors.
And holy cr*p, the notion that we might even talk about continued 30 year relevance in the field of computing is kind of mind blowing!
We should be building structures that are resilient enough to last and be useful for far longer.
[This is why I don't have much faith in proofs -- what happens when requirements change?]
Can we build malleable (operating) systems with capabilities?
I guess I have competing (within myself) points of view on this:
1. Garbage collection (in Erlang, at least) scales nicely in my experience and doesn't incur the occasional latency hit across the whole VM, because memory management can be done at the process level, and there is no shared memory across the VM, only message passing.
2. In _traditional_ (not cryptocurrency) financial applications with
which I am familiar, even the web servers are hand-written in C/C++
(sometimes, yes, with inline assembly). They have not transitioned much
towards any other language environment.
Rust is a possible game-changer
there since it can eliminate whole classes of memory-related bugs with
the borrow-checker memory management model and maintain C-like
performance.
I'd also be curious how gc impacts power usage as a metric.
I haven't seen any numbers (only those comparing the power usage
differences between different gc algorithms/implementations), but it
seems like another aspect that may impact the choice of gc for some
specific purposes besides the usual suspects.
On Aug 28, 2025, at 8:54 AM, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
This feels to me a lot like the debates 40 years ago about whether we could afford the performance and memory overhead of using C for performance-sensitive code.
I think those arguments were largely retired in the early 1970s. You have an off-by-ten error. :-)
On Wed, Aug 27, 2025 at 3:47 PM 'Bakul Shah' via friam <fr...@googlegroups.com> wrote:
A lot of this is because most applications hit a ceiling where more memory or more compute resource stops being helpful. In that word processor, the bottleneck is the human. But for the rest, there's a bit of a chicken and egg problem. We mostly haven't seen good languages or system structures to support distributed programming. MarkM's work and its successors seem promising. Go and goroutines may turn out to be interesting as well. async/await is not.
On Fri, 29 Aug 2025 at 02:19, Jonathan S. Shapiro <jonathan....@gmail.com> wrote:
On Wed, Aug 27, 2025 at 3:47 PM 'Bakul Shah' via friam <fr...@googlegroups.com> wrote:
A lot of this is because most applications hit a ceiling where more memory or more compute resource stops being helpful. In that word processor, the bottleneck is the human. But for the rest, there's a bit of a chicken and egg problem. We mostly haven't seen good languages or system structures to support distributed programming. MarkM's work and its successors seem promising. Go and goroutines may turn out to be interesting as well. async/await is not.
At the risk of taking a slightly off-topic thread and sending it roaming: having suffered Go professionally for about five years, its concurrency model is exactly the sort of accidental-deadlock bugfarm that MarkM's thesis predicts.
I had a fairly small (~25ksloc) service that co-ordinated about a hundred users and a handful of services, and the original architect had the foresight to have one goroutine for its high-level state management. This goroutine listened to a couple of channels, including one that was fed commands by other goroutines on the system. This internal channel was mostly user requests and regular timers. It really didn't handle a lot of events, maybe 5-10 per second, and with a channel size of 10 that should have been generous.
One day, after some changes from a senior developer on my team, the system deadlocked, leaving all of those users and the customers they were talking to completely stranded. It turned out they had introduced a code path where the state-management thread tried to post a message to that system channel, and in this case the channel was full, so it just hung there, waiting for room on the channel.
So this became a thing that I would regularly have to audit for: are we making sure that we never post from the main thread? Eventually this became: functions are either "main thread safe" or not, and these are mutually exclusive, since you have to be on the main thread in order to read or write any of the shared state.
The flippant response to this is, "that's bad application design". I'll counter with: concurrency safety had been my bag for over a decade, and I still didn't catch it in code review.
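Boiled down (invented names, not our actual code), the shape of the bug looks roughly like this in Go:

package main

import "fmt"

type command struct{ kind string }

func main() {
	commands := make(chan command, 1) // deliberately tiny buffer; the real one held 10

	// The state-management goroutine: the ONLY reader of `commands`.
	handled := make(chan string)
	go func() {
		for cmd := range commands {
			if cmd.kind == "user" {
				// The offending code path: handling a request posts follow-up
				// work back onto the very channel this goroutine is supposed
				// to drain. If the buffer is full at that moment, the send
				// blocks, nothing drains the channel any more, and everything
				// queued behind it is stranded.
				commands <- command{kind: "follow-up"}
			}
			handled <- cmd.kind
		}
	}()

	// Other goroutines (here just main) feed user requests in.
	commands <- command{kind: "user"} // buffer now full
	// This second send blocks until the first request is received; the runtime
	// then moves its value into the buffer, so the buffer is full again.
	commands <- command{kind: "user"}

	// The state goroutine receives the first request, tries to post its
	// follow-up, finds the buffer full, and blocks. We block reading `handled`.
	// The runtime reports: "all goroutines are asleep - deadlock!"
	fmt.Println(<-handled)
}

A larger buffer would only have deferred the failure to whenever the channel next filled up.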
--
If you find yourself having to deal with Go, the first job should be building a sensible concurrency abstraction on top of channels.
William ML Leslie
With fear of taking a slightly off-topic thread and taking it roaming; having suffered Go professionally for about five years, its concurrency model is exactly the sort of accidental-deadlock bugfarm that MarkM's thesis predicts.
I had a fairly small (~25ksloc) service ... One day, after some changes from a senior developer on my team, the system deadlocked, leaving all of those users and the customers they were talking to completely stranded. It turned out they had introduced a code path where the state-management thread tried to post a message to that system channel, and in this case, the channel was full, so it just hung there, waiting for room on the channel.
First, to both of you, thanks for the kind words!
Somewhere on erights.org (but probably not in my thesis) I gathered a list of "lost progress bugs". This one specifically I call "gridlock". The distinction I make from deadlock is: had you had more memory (e.g., larger buffers), you wouldn't have lost progress at that moment, though you may still lose it later for the same reason.
Off the top of my head, the lost progress bugs are:
- Deadlock
- Livelock
- Gridlock
- Lost signal
(I think there was another but do not remember)
On Thu, Aug 28, 2025 at 5:18 PM William ML Leslie <william.l...@gmail.com> wrote:
With fear of taking a slightly off-topic thread and taking it roaming; having suffered Go professionally for about five years, its concurrency model is exactly the sort of accidental-deadlock bugfarm that MarkM's thesis predicts.
Accidental bugfarm isn't fair. There's nothing accidental about it.
I've had some questions about eventual consistency that I've never had time to explore:
- What well-known performance patterns exist that can't be expressed this way? Are there alternative expressions?
- What ideas exist about computation placement and deadlines? The do-across pattern is only as good as its slowest leg, which leads me to wonder how that is handled.
I had a fairly small (~25ksloc) service ... One day, after some changes from a senior developer on my team, the system deadlocked, leaving all of those users and the customers they were talking to completely stranded. It turned out they had introduced a code path where the state-management thread tried to post a message to that system channel, and in this case, the channel was full, so it just hung there, waiting for room on the channel.
It sounds like queueing opacity has a lot to do with the problem?
Async/await has this on steroids. When it works, it works great. When it goes off a cliff, good luck. And it's tied in to the call/return logic in a way that doesn't seem (at least to me) to leave any room for mitigation.
Kris Kowal is right to focus on a smaller endeavor, perhaps the one he describes, perhaps another. But keep it small.
None of us ever remembers markm's comment strongly enough: "Perfect security tech only makes the world safer if it gets adopted."
Jonathan implicitly acknowledges that L4 is more successful than the KeyKOS family.
My suggestion: Examine the question, "Why is L4 only successful as a hypervisor?", and if you can identify reasons it only works for that niche, ask the question, "how can we enhance it to fit more niches?"
Traditionally, in a field with entrenched winners, you succeed by starting with a tiny niche and expanding with new features into more niches.
Unix derivatives couldn't blow Windows off the desktop. But they could conquer the tiny early smartphone niche...
Jonathan observes (well, I'm going to twist his observation and put some words in his mouth here) that IoT is a fantastic new application space that has a huge number of characteristics that ought to favor KeyKOS-style operating systems.
... and it meets Dale's suggestion to start with "the edge of the network, with less legacy pressure"
Combining this observation with Kris's, and with the proposal to begin with something that already has some success, a more strategic possibility arises: one could engage in a small, powerful project to document a plan for transforming L4 into the best-in-class OS for the IoT world.
I now end with an even bigger meta-point for you all to consider before rejecting it. There is an even bigger reason to keep projects small at this time: We may be in the middle of a whopper of a transition.
I have been working with AIs for over a year on a nearly daily basis, using them as assistants for writing novels. The AIs suck. They save me no time, and the quality of my output is different but not really better.
A big project started now could easily be overrun by future tech.
Some of the problems cited seem to come down to "concurrency is hard" and contemporary programmers don't know how to deal with it properly. I'm not going to claim that it's easy. In fact, it may require a dreaded "paradigm shift" to escape the single-threaded sequential-imperative mindset. I think the most important shift is the elimination of shared mutable state.
One of these possible shifts that has yielded exceptional results but little adoption is Software Transactional Memory. In languages like Haskell, PureScript, or Unison, it's both fast and safe, because the type system prevents transaction code from containing side effects. I've heard it claimed that the possibility of side effects in STM code in Scala is usually not a problem in practice.
But I think that when such paradigm shifts are necessary for a technique to work, or for the technique to perform well (whether in speed or safety), you either enforce it strongly or wide adoption means wide misuse. React seems a good example. Most programmers don't understand that they would create safer and possibly faster applications if they didn't see the requirements around side-effects and immutability as hurdles to bypass.
On 28/08/2025 at 18:48, Jonathan S. Shapiro wrote:
I don't say this as an argument against GC. I do offer it as one of the essential arguments for why GC doesn't scale. Those same programs written in Rust run in heaps that are 4x-10x smaller and do not require memory walks in unpredictable order. Think about the financial implications at the scale of a Google or Amazon data center.
I wonder if this is not an essential limit of GC but an accidental one, caused by most GCs being pretty rudimentary. There has been a lot of work on making GCs collect concurrently and on making them suitable for soft real-time use.